Streaming Avatar latency: 2.5–3.6s delay between speak() request and AVATAR_START

Hi HeyGen Support Team,

I’m integrating HeyGen avatars in a Next.js app using @heygen/streaming-avatar (Streaming Avatar SDK) with the default WebRTC/LiveKit streaming.

I’m seeing consistently high perceived latency even after optimizing my own pipeline. I added detailed timing logs to isolate the delay, and the main bottleneck appears to be on the HeyGen side, between the moment I send the speak() request and the moment the avatar actually begins talking.

What I’m doing

- Create a session token server-side (/v1/streaming.create_token) and pass it to the client
- Start the avatar session with createStartAvatar()

On each user turn, I call:

await avatar.speak({
  text: assistantText,
  task_type: TaskType.REPEAT,
});

I listen for:
- StreamingEvents.STREAM_READY (attach stream to a hidden <video> element)
- StreamingEvents.AVATAR_START_TALKING / AVATAR_STOP_TALKING
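
A minimal sketch of how a number like msFromSpeakRequest can be captured from these two pieces (the speak() call and the start-talking event). This is an illustration under assumptions, not the SDK's own API: AvatarLike and measureSpeakLatency are hypothetical names, the avatar is reduced to just on() and speak(), and the event name is passed in by the caller (for example StreamingEvents.AVATAR_START_TALKING).

```typescript
// Hypothetical minimal shape of the avatar object; the real SDK type is richer.
interface AvatarLike {
  on(event: string, handler: () => void): void;
  speak(request: { text: string }): Promise<unknown>;
}

// Sends one speak request and reports the delay (in ms) between sending it
// and the first "start talking" event that follows.
// The listener is not removed afterwards; later events are simply ignored.
function measureSpeakLatency(
  avatar: AvatarLike,
  startEvent: string,
  text: string,
  onMeasured: (msFromSpeakRequest: number) => void,
  now: () => number = () => Date.now(),
): Promise<unknown> {
  const sentAt = now(); // timestamp of "speak request sent"
  let reported = false;

  avatar.on(startEvent, () => {
    if (reported) return; // only the first event counts for this request
    reported = true;
    onMeasured(now() - sentAt);
  });

  // Return the speak promise so the caller can still await or catch it.
  return avatar.speak({ text });
}
```

The clock is injectable so the measurement can be checked with fixed timestamps instead of real time.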

My measured timings (from real logs)

The time from end of user speech to turn accepted is short:

speechEndToAcceptedMs: ~280–300ms

My server response is also fast:

SERVER_DONE: ~225–950ms (varies by route / prompt)

But the major delay is after I call speak():

msFromSpeakRequest (time from “speak request sent” → AVATAR_START_TALKING) is consistently ~2.5s to 3.6s
Total time from turn accepted → avatar starts talking is typically ~3.7s to 3.8s

Example A:

SERVER_DONE: 957ms
HEYGEN_SPEAK_REQUEST_SENT: 1120ms after turn accepted
AVATAR_START_TALKING: msFromSpeakRequest 2574ms
Total latencyFromTurnAcceptedMs: 3693ms

Example B:

SERVER_DONE: 225ms
HEYGEN_SPEAK_REQUEST_SENT: 226ms after turn accepted
AVATAR_START_TALKING: msFromSpeakRequest 3587ms
Total latencyFromTurnAcceptedMs: 3813ms
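
To make the relationship between these numbers explicit, here is a rough sketch of how the per-turn figures relate to each other. It is not my exact logging code: TurnTimer, Mark, and speakSentAfterAcceptedMs are hypothetical names used only for illustration, and the clock is injected so the arithmetic can be replayed with fixed timestamps.

```typescript
// Hypothetical mark names mirroring the log labels used in the examples above.
type Mark = "TURN_ACCEPTED" | "HEYGEN_SPEAK_REQUEST_SENT" | "AVATAR_START_TALKING";

class TurnTimer {
  private marks = new Map<Mark, number>();
  private now: () => number;

  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  // Record the current time under the given mark name.
  mark(name: Mark): void {
    this.marks.set(name, this.now());
  }

  // Elapsed ms between two recorded marks; throws if either is missing.
  private between(from: Mark, to: Mark): number {
    const a = this.marks.get(from);
    const b = this.marks.get(to);
    if (a === undefined || b === undefined) {
      throw new Error(`missing mark: ${a === undefined ? from : to}`);
    }
    return Math.round(b - a);
  }

  summary() {
    return {
      speakSentAfterAcceptedMs: this.between("TURN_ACCEPTED", "HEYGEN_SPEAK_REQUEST_SENT"),
      msFromSpeakRequest: this.between("HEYGEN_SPEAK_REQUEST_SENT", "AVATAR_START_TALKING"),
      latencyFromTurnAcceptedMs: this.between("TURN_ACCEPTED", "AVATAR_START_TALKING"),
    };
  }
}
```

Replaying Example B with this sketch (turn accepted at 0ms, speak request sent at 226ms, avatar starts talking at 3813ms) gives msFromSpeakRequest = 3587 and latencyFromTurnAcceptedMs = 3813, so the total is the sum of my own pre-speak time and the delay after speak().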

What I already tried

- Lowered avatar quality (High → Medium)
- Throttled chroma key processing to reduce client CPU usage
- Removed unnecessary waits in my own code
- Confirmed this isn’t caused by my server/LLM latency

Even after the above changes, the HeyGen “start talking delay” remained ~2.5–3.6s.

Question

Is ~2.5–3.6s between the speak() request and AVATAR_START_TALKING expected behavior for Streaming Avatar sessions?

If not, could you advise on any recommended settings or approaches to reduce time-to-first-audio/video, such as:

- Best quality / configuration for minimum latency
- Any way to reduce buffering / startup latency
- Whether taskMode: ASYNC or other speak/session parameters change time-to-start-speaking
- Region/endpoint considerations (I’m in Canada; testing locally)

If helpful, I can share full log snippets for multiple turns or a short screen recording showing the delay.

Thanks a lot for your help!
