Streaming Avatar latency: 2.5–3.6s delay between speak() request and AVATAR_START_TALKING

Hi HeyGen Support Team,


I’m integrating HeyGen avatars in a Next.js app using @heygen/streaming-avatar (Streaming Avatar SDK) with the default WebRTC/LiveKit streaming.

I’m seeing consistently high perceived latency even after optimizing my own pipeline. I added detailed timing logs to isolate the delay, and the main bottleneck appears to be on the HeyGen side, between the moment I call speak() and the moment the avatar actually begins talking.

What I’m doing

  • Create a session token server-side (/v1/streaming.create_token) and pass it to the client

  • Start the avatar session with createStartAvatar()

  • On each user turn, I call:

    await avatar.speak({
      text: assistantText,
      task_type: TaskType.REPEAT,
    });
    
  • I listen for:

    • StreamingEvents.STREAM_READY (attach stream to a hidden <video> element)
    • StreamingEvents.AVATAR_START_TALKING / AVATAR_STOP_TALKING

My measured timings (from real logs)

The time from end of user speech to turn accepted is short:

  • speechEndToAcceptedMs: ~280–300ms

My server response is also fast:

  • SERVER_DONE: ~225–950ms (varies by route / prompt)

But the major delay is after I call speak():

  • msFromSpeakRequest (time from “speak request sent” → AVATAR_START_TALKING) is consistently ~2.5s to 3.6s
  • Total time from turn accepted → avatar starts talking is typically ~3.7s to 3.8s
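
For context, here is a minimal sketch of how timings like these could be captured. It is not my exact logging code: TurnTimer and its mark names are hypothetical helpers, not part of @heygen/streaming-avatar, and the clock is injectable only so the helper can be checked without real delays.

```typescript
// Hypothetical helper (not an SDK API): records named timestamps for one
// conversational turn and reports the gap between any two of them.
type Mark =
  | "TURN_ACCEPTED"
  | "HEYGEN_SPEAK_REQUEST_SENT"
  | "AVATAR_START_TALKING";

class TurnTimer {
  private marks: Partial<Record<Mark, number>> = {};
  private now: () => number;

  // Defaults to wall-clock milliseconds; a fake clock can be passed in tests.
  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  // Keep only the first occurrence, so a repeated event does not shift results.
  mark(name: Mark): void {
    if (this.marks[name] === undefined) {
      this.marks[name] = this.now();
    }
  }

  // Milliseconds between two marks, or undefined if either one is missing.
  between(from: Mark, to: Mark): number | undefined {
    const a = this.marks[from];
    const b = this.marks[to];
    return a === undefined || b === undefined ? undefined : Math.round(b - a);
  }
}

// Intended wiring (assumed, shown as comments because it needs a live session):
//   timer.mark("TURN_ACCEPTED");
//   timer.mark("HEYGEN_SPEAK_REQUEST_SENT");   // immediately before avatar.speak(...)
//   avatar.on(StreamingEvents.AVATAR_START_TALKING,
//             () => timer.mark("AVATAR_START_TALKING"));
//   timer.between("HEYGEN_SPEAK_REQUEST_SENT", "AVATAR_START_TALKING"); // msFromSpeakRequest
```

Measured this way, msFromSpeakRequest covers everything from just before the speak() call until the event reaches the client, so it includes network round trips as well as HeyGen-side processing.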

Example A:

  • SERVER_DONE: 957ms
  • HEYGEN_SPEAK_REQUEST_SENT: 1120ms after turn accepted
  • AVATAR_START_TALKING: msFromSpeakRequest 2574ms
  • Total latencyFromTurnAcceptedMs: 3693ms

Example B:

  • SERVER_DONE: 225ms
  • HEYGEN_SPEAK_REQUEST_SENT: 226ms after turn accepted
  • AVATAR_START_TALKING: msFromSpeakRequest 3587ms
  • Total latencyFromTurnAcceptedMs: 3813ms
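
To put the two examples in proportion, the short sketch below computes what share of the total turn latency falls after the speak() request. The numbers are copied from Examples A and B above; the TurnSample type and speakSharePercent function are illustrative names only.

```typescript
// Illustrative only: field names mirror the log labels quoted in this post.
interface TurnSample {
  label: string;
  msFromSpeakRequest: number;        // speak request sent -> AVATAR_START_TALKING
  latencyFromTurnAcceptedMs: number; // turn accepted -> AVATAR_START_TALKING
}

const samples: TurnSample[] = [
  { label: "Example A", msFromSpeakRequest: 2574, latencyFromTurnAcceptedMs: 3693 },
  { label: "Example B", msFromSpeakRequest: 3587, latencyFromTurnAcceptedMs: 3813 },
];

// Percentage of the total turn latency spent waiting after the speak() request.
function speakSharePercent(sample: TurnSample): number {
  return Math.round(
    (sample.msFromSpeakRequest / sample.latencyFromTurnAcceptedMs) * 100
  );
}

for (const sample of samples) {
  console.log(`${sample.label}: ${speakSharePercent(sample)}% of latency is after speak()`);
}
// Example A: 70% of latency is after speak()
// Example B: 94% of latency is after speak()
```

In both turns, most of the wait happens after the speak() request has been sent, which is why I am asking about the HeyGen side rather than my own server.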

What I already tried

  • Lowered avatar quality (High → Medium)
  • Throttled chroma key processing to reduce client CPU usage
  • Removed unnecessary waits in my own code
  • Confirmed this isn’t caused by my server/LLM latency

Even after the above changes, the HeyGen “start talking delay” remained ~2.5–3.6s.

Question

Is ~2.5–3.6s between the speak() request and AVATAR_START_TALKING expected behavior for Streaming Avatar sessions?

If not, could you advise on any recommended settings or approaches to reduce time-to-first-audio/video, such as:

  • Best quality / configuration for minimum latency
  • Any way to reduce buffering / startup latency
  • Whether taskMode: ASYNC or other speak/session parameters change time-to-start-speaking
  • Region/endpoint considerations (I’m in Canada; testing locally)

If helpful, I can share full log snippets for multiple turns or a short screen recording showing the delay.

Thanks a lot for your help!