Does LiveAvatar LITE mode play agent.speak audio chunks incrementally or buffer until agent.speak_end?

Hi team,

I'm building a real-time AI voice agent using LiveAvatar in LITE mode. My pipeline streams audio from a TTS provider in small chunks (960 bytes / 20ms each at 24kHz PCM16) and sends each chunk as an agent.speak message via the WebSocket, followed by agent.speak_end when all audio is sent.
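For concreteness, here's roughly how I'm framing the chunks before they go out over the WebSocket. The `"type"`/`"audio"` JSON field names and the base64 payload encoding are placeholders from my own code, not necessarily the exact LiveAvatar schema, and the audio is assumed to be mono:

```python
import base64
import json

SAMPLE_RATE = 24_000   # Hz
BYTES_PER_SAMPLE = 2   # PCM16, assumed mono
CHUNK_MS = 20

# 24_000 samples/s * 2 bytes/sample * 0.020 s = 960 bytes per chunk
CHUNK_BYTES = SAMPLE_RATE * BYTES_PER_SAMPLE * CHUNK_MS // 1000

def frame_messages(pcm: bytes):
    """Split raw PCM16 audio into 20 ms agent.speak messages,
    followed by a single agent.speak_end.

    Field names and base64 encoding are placeholders -- check the
    LiveAvatar docs for the actual message schema.
    """
    for i in range(0, len(pcm), CHUNK_BYTES):
        chunk = pcm[i:i + CHUNK_BYTES]
        yield json.dumps({
            "type": "agent.speak",
            "audio": base64.b64encode(chunk).decode("ascii"),
        })
    yield json.dumps({"type": "agent.speak_end"})
```

Each yielded string is sent as one WebSocket text message as soon as the TTS chunk arrives.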

My question: Does LiveAvatar start lip-syncing and playing audio as soon as the first agent.speak chunk arrives? Or does it buffer all chunks internally and only begin playback after receiving agent.speak_end?

In my testing, the avatar appears to wait for agent.speak_end before it starts speaking. I'm measuring ~2.5s between sending the first chunk and sending agent.speak_end, and the avatar consistently begins speaking at the speak_end timestamp rather than when the first chunk arrived.
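For reference, I'm taking that measurement with a simple probe around the send loop, something like this (the avatar's audible start is observed separately, e.g. by watching the session; I haven't found a playback callback, so that part is manual):

```python
import time

class LatencyProbe:
    """Record monotonic timestamps around the send pipeline so the
    span from first chunk to speak_end can be compared against when
    the avatar audibly starts speaking."""

    def __init__(self):
        self.first_chunk_at = None
        self.speak_end_at = None

    def on_chunk_sent(self):
        # Only the first chunk matters for the latency measurement.
        if self.first_chunk_at is None:
            self.first_chunk_at = time.monotonic()

    def on_speak_end_sent(self):
        self.speak_end_at = time.monotonic()

    def stream_duration(self) -> float:
        """Seconds from first agent.speak to agent.speak_end."""
        return self.speak_end_at - self.first_chunk_at
```

With this, stream_duration() comes out around 2.5s, and the avatar's speech start lines up with speak_end, not with the first chunk.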

This matters for my use case because I'm streaming LLM output into TTS concurrently and would like the avatar to start speaking as soon as the first audio arrives, rather than waiting for the full response to be generated.

If LiveAvatar does buffer until speak_end, is there a recommended approach for minimizing perceived latency? For example:

Is there a minimum chunk size that triggers immediate playback?
Is there a configuration option to enable incremental playback?
Would sending fewer, larger chunks help?
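On that last point, I can easily test larger messages by coalescing the 20ms frames before sending, e.g. 9600 bytes ≈ 200ms at 24kHz mono PCM16. This helper is just my own code, not a LiveAvatar API:

```python
def coalesce(frames, target_bytes: int = 9600):
    """Batch small PCM16 frames into larger chunks (default
    9600 bytes = 200 ms at 24 kHz mono) so fewer, bigger
    agent.speak messages are sent. Yields any remainder at the end."""
    buf = bytearray()
    for frame in frames:
        buf.extend(frame)
        while len(buf) >= target_bytes:
            yield bytes(buf[:target_bytes])
            del buf[:target_bytes]
    if buf:
        yield bytes(buf)
```

I'm happy to run an experiment with this if larger chunks are expected to change the playback behavior.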
Any guidance would be appreciated. Thanks!