Streaming Avatar latency: 2.5–3.6s delay between speak() request and AVATAR_START_TALKING
Hi HeyGen Support Team,
I’m integrating HeyGen avatars in a Next.js app using @heygen/streaming-avatar (Streaming Avatar SDK) with the default WebRTC/LiveKit streaming.
I’m seeing consistently high perceived latency even after optimizing my own pipeline. I added detailed timing logs to isolate where the delay is happening, and it looks like the main bottleneck is inside HeyGen between when I request the avatar to speak and when the avatar actually begins talking.
What I’m doing
- Create a session token server-side (/v1/streaming.create_token) and pass it to the client
- Start the avatar session with createStartAvatar()
- On each user turn, I call:
  await avatar.speak({ text: assistantText, task_type: TaskType.REPEAT });
- I listen for:
  - StreamingEvents.STREAM_READY (attach stream to a hidden <video> element)
  - StreamingEvents.AVATAR_START_TALKING / AVATAR_STOP_TALKING
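For context, here is a minimal sketch of how per-turn timings like the ones below can be captured. `TurnTimer`, its method names, and the mark names are hypothetical helpers for illustration and are not part of the HeyGen SDK; the sketch also assumes every value is measured relative to the moment the turn is accepted.

```typescript
// Hypothetical helper (not part of @heygen/streaming-avatar): records named
// timestamps for a single conversational turn and derives the intervals.
type Mark =
  | "TURN_ACCEPTED"
  | "SERVER_DONE"
  | "HEYGEN_SPEAK_REQUEST_SENT"
  | "AVATAR_START_TALKING";

class TurnTimer {
  private marks = new Map<Mark, number>();
  private now: () => number;

  // The clock is injectable so the timer can be exercised with fixed timestamps.
  constructor(now: () => number = () => performance.now()) {
    this.now = now;
  }

  // Record the first occurrence of a mark; later duplicates are ignored.
  mark(name: Mark): void {
    if (!this.marks.has(name)) this.marks.set(name, this.now());
  }

  // Milliseconds between two marks, or undefined if either one is missing.
  between(from: Mark, to: Mark): number | undefined {
    const a = this.marks.get(from);
    const b = this.marks.get(to);
    return a === undefined || b === undefined ? undefined : Math.round(b - a);
  }

  summary() {
    return {
      serverDoneMs: this.between("TURN_ACCEPTED", "SERVER_DONE"),
      msFromSpeakRequest: this.between("HEYGEN_SPEAK_REQUEST_SENT", "AVATAR_START_TALKING"),
      latencyFromTurnAcceptedMs: this.between("TURN_ACCEPTED", "AVATAR_START_TALKING"),
    };
  }
}

// Assumed wiring (sketch only):
//   timer.mark("HEYGEN_SPEAK_REQUEST_SENT") immediately before avatar.speak(...)
//   timer.mark("AVATAR_START_TALKING") inside the AVATAR_START_TALKING handler
```

Marking HEYGEN_SPEAK_REQUEST_SENT before awaiting speak(), rather than after it resolves, keeps the request round trip inside msFromSpeakRequest.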
My measured timings (from real logs)
The user speech end → turn accepted is fast:
- speechEndToAcceptedMs: ~280–300ms
My server response is also fast:
- SERVER_DONE: ~225–950ms (varies by route / prompt)
But the major delay is after I call speak():
- msFromSpeakRequest (time from “speak request sent” → AVATAR_START_TALKING) is consistently ~2.5s to 3.6s
- Total time from turn accepted → avatar starts talking is typically ~3.7s to 3.8s
Example A:
- SERVER_DONE: 957ms
- HEYGEN_SPEAK_REQUEST_SENT: 1120ms after turn accepted
- AVATAR_START_TALKING: msFromSpeakRequest 2574ms
- Total latencyFromTurnAcceptedMs: 3693ms
Example B:
- SERVER_DONE: 225ms
- HEYGEN_SPEAK_REQUEST_SENT: 226ms after turn accepted
- AVATAR_START_TALKING: msFromSpeakRequest 3587ms
- Total latencyFromTurnAcceptedMs: 3813ms
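To make the split explicit, the short sketch below computes what fraction of each turn's total latency falls between the speak() request and AVATAR_START_TALKING, using only the numbers from Examples A and B. The `TurnSample` type and `speakToTalkingShare` function are illustrative names, not anything from the SDK.

```typescript
// Illustrative only: share of end-to-end latency spent waiting for the avatar
// to start talking after the speak() request was sent.
interface TurnSample {
  label: string;
  msFromSpeakRequest: number;        // speak request sent -> AVATAR_START_TALKING
  latencyFromTurnAcceptedMs: number; // turn accepted -> AVATAR_START_TALKING
}

// Values copied from the two logged examples above.
const samples: TurnSample[] = [
  { label: "Example A", msFromSpeakRequest: 2574, latencyFromTurnAcceptedMs: 3693 },
  { label: "Example B", msFromSpeakRequest: 3587, latencyFromTurnAcceptedMs: 3813 },
];

// Percentage of the total turn latency, rounded to one decimal place.
function speakToTalkingShare(s: TurnSample): string {
  return ((s.msFromSpeakRequest / s.latencyFromTurnAcceptedMs) * 100).toFixed(1);
}

for (const s of samples) {
  console.log(`${s.label}: ${speakToTalkingShare(s)}% of total latency is after speak()`);
}
// Example A: 69.7% of total latency is after speak()
// Example B: 94.1% of total latency is after speak()
```

In both turns, most of the wait happens after the speak() request has already left my code.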
What I already tried
- Lowered avatar quality (High → Medium)
- Throttled chroma key processing to reduce client CPU usage
- Removed unnecessary waits in my own code
- Confirmed this isn’t caused by my server/LLM latency
Even after the above changes, the HeyGen “start talking delay” remained ~2.5–3.6s.
Question
Is ~2.5–3.6s between the speak() request and AVATAR_START_TALKING expected behavior for Streaming Avatar sessions?
If not, could you advise on any recommended settings or approaches to reduce time-to-first-audio/video, such as:
- Best quality / configuration for minimum latency
- Any way to reduce buffering / startup latency
- Whether taskMode: ASYNC or other speak/session parameters change time-to-start-speaking
- Region/endpoint considerations (I’m in Canada; testing locally)
If helpful, I can share full log snippets for multiple turns or a short screen recording showing the delay.
Thanks a lot for your help!