Russian TTS via API: How to Improve Pronunciation Quality and Natural Intonation?

Hi everyone!

I'm using the HeyGen API to generate Russian videos with my avatar via an n8n automation. My voice is cloned via the ElevenLabs integration, but I'm still facing significant issues with output quality.

My payload:

```json
{
  "video_inputs": [
    {
      "character": {
        "type": "avatar",
        "avatar_id": "6e52ac0265774bd6b82689bbb2da0c69",
        "avatar_style": "normal"
      },
      "voice": {
        "type": "text",
        "input_text": "Три задачи в Эксель, которые ИИ делает за вас...",
        "voice_id": "16577ec77db4432284f3324d1855e8d8",
        "speed": 0.9
      }
    }
  ],
  "dimension": {"width": 720, "height": 1280}
}
```

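
For anyone reproducing this outside n8n, the same payload can be assembled programmatically so the text and speed are easy to vary between tests. This is only a minimal Python sketch: `build_payload` is a hypothetical helper name, not part of the HeyGen API, and it only builds the request body shown above without sending it anywhere.

```python
import json


def build_payload(text, avatar_id, voice_id, speed=0.9, width=720, height=1280):
    """Assemble the request body from the post as a plain dict (hypothetical helper)."""
    return {
        "video_inputs": [
            {
                "character": {
                    "type": "avatar",
                    "avatar_id": avatar_id,
                    "avatar_style": "normal",
                },
                "voice": {
                    "type": "text",
                    "input_text": text,
                    "voice_id": voice_id,
                    "speed": speed,
                },
            }
        ],
        "dimension": {"width": width, "height": height},
    }


payload = build_payload(
    "Три задачи в Эксель, которые ИИ делает за вас...",
    avatar_id="6e52ac0265774bd6b82689bbb2da0c69",
    voice_id="16577ec77db4432284f3324d1855e8d8",
)

# ensure_ascii=False keeps the Cyrillic text readable instead of \uXXXX escapes.
print(json.dumps(payload, ensure_ascii=False, indent=2))
```

Keeping the text as a parameter makes it straightforward to generate several variants of the same sentence (different punctuation, different speeds) and compare the results side by side.
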
Problems I'm experiencing:

Robotic/monotone output — Despite using my cloned ElevenLabs voice, the speech sounds flat and unnatural, lacking emotional variation

Punctuation doesn't create proper pauses — Periods, commas, and ellipsis (...) don't produce the expected pause lengths. Everything flows together

Wrong word stress — TTS says "за-мОк" instead of "зА-мок" (completely different meanings in Russian)

Foreign words — "ChatGPT" → "ЧатДжиПиТи" sounds robotic. Tried hyphens — slightly better but still unnatural

No intonation variety — Questions don't sound like questions, exclamations lack energy. Everything sounds the same

What I've already tried:

Different punctuation (. , ... ! ?)

Hyphens for syllable separation

Speed adjustments (0.8 - 1.0)

Writing numbers as words

None of these significantly improved the natural feel of the speech.
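
The text workarounds listed above (hyphenated spellings for foreign words, numbers written as words) can be applied automatically before the text goes into `input_text`. The following is a rough Python sketch of that kind of preprocessing. The function name `normalize` and both lookup tables are illustrative assumptions, the hyphenated spelling of "ChatGPT" is just one possible variant, and nothing here is confirmed to improve the TTS output.

```python
import re

# Hypothetical lookup tables: the entries are examples, not a complete dictionary.
FOREIGN_WORDS = {"ChatGPT": "Чат-Джи-Пи-Ти"}
NUMBER_WORDS = {"1": "один", "2": "два", "3": "три", "4": "четыре", "5": "пять"}


def normalize(text):
    """Replace known foreign words and standalone numbers with spoken forms."""
    for word, spoken in FOREIGN_WORDS.items():
        text = re.sub(rf"\b{re.escape(word)}\b", spoken, text)
    # Numbers missing from the table are left unchanged.
    # Grammatical case and gender agreement are NOT handled here.
    return re.sub(r"\b\d+\b", lambda m: NUMBER_WORDS.get(m.group(), m.group()), text)


print(normalize("3 задачи, которые ChatGPT делает за вас"))
```

A table-driven approach like this at least keeps the substitutions consistent across videos, so any remaining pronunciation problems can be attributed to the voice itself and not to uneven text preparation.
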

My questions:

What options exist to improve pronunciation quality via API? Are there any parameters I'm missing?

How can I make the speech less monotone? Are there any techniques for adding emotional variation?

Why doesn't punctuation create proper pauses? Is this a known limitation with ElevenLabs voices?

Are there better approaches for Russian? Maybe specific voice settings or alternative methods?

Would uploading my own audio recording work better? Support mentioned a lip-sync option. Is this the only way to get natural Russian speech?

Does the API support SSML or any prosody controls, such as pitch, rate, or emphasis markers?

I want the output to sound like a real person speaking naturally, not a robot reading text. Any guidance on achieving this would be greatly appreciated!

Thanks!