Interactive Avatar Streaming — programmatic control of expressions/gestures

Hi team,
I’m building a research prototype for empathetic, real-time conversations using the Interactive Avatar Streaming SDK (Node/Vite/React-TS). Following support guidance, my current understanding is:

  • Expressions & gestures: not controllable in real time for Interactive Avatars; movements and expressions come from the original recording.
  • Voice settings: emotion (EXCITED | SERIOUS | FRIENDLY | SOOTHING | BROADCASTER) and rate (0.5–1.5) must be set when starting the session and cannot be changed mid-session.
  • What we can control live: sending text to speak, starting/stopping listening, interrupting speech, choosing task types (TALK/REPEAT), and session management.
  • Undocumented area: how to capture the avatar’s generated replies as text during streaming.
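For context, here is how I currently enforce the start-of-session constraints on our side. The emotion values and the 0.5–1.5 rate bounds are from the guidance above; the `buildVoiceSettings` helper and its field names are my own, not from the SDK:

```typescript
// Hypothetical helper — names and shape are mine, not the SDK's.
// Enforces the documented constraints: one of five emotions, rate in [0.5, 1.5].
type VoiceEmotion = "EXCITED" | "SERIOUS" | "FRIENDLY" | "SOOTHING" | "BROADCASTER";

interface VoiceSettings {
  emotion: VoiceEmotion;
  rate: number; // 0.5–1.5, fixed for the lifetime of the session
}

function buildVoiceSettings(emotion: VoiceEmotion, rate: number): VoiceSettings {
  if (rate < 0.5 || rate > 1.5) {
    throw new RangeError(`rate must be within 0.5-1.5, got ${rate}`);
  }
  // Applied once at session start; cannot be changed mid-session.
  return { emotion, rate };
}
```

This works, but it also means any "emotional" variation mid-conversation has to come from somewhere other than voice settings — hence the questions below.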

To help me integrate this cleanly, could you please confirm or correct the points below and, where possible, provide message/event names, JSON schemas, or sample code?

1- Runtime non-verbal control (Streaming):
Are there any supported commands for setting facial expressions, gestures, or posture at runtime for Interactive Avatars (even private/beta)?
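To make the question concrete, this is the kind of message shape I am imagining. It is purely illustrative — the message type, gesture names, and fields are all invented, not taken from any documented API:

```typescript
// Purely hypothetical message shape — NOT a documented API.
// Sketches what runtime non-verbal control could look like on the socket.
interface AvatarGestureCommand {
  type: "gesture";                 // hypothetical message type
  name: "nod" | "smile" | "wave";  // example gesture identifiers
  intensity?: number;              // e.g. 0-1, optional
}

function encodeGestureCommand(cmd: AvatarGestureCommand): string {
  // Would be sent over the existing streaming WebSocket, if such a
  // command were supported.
  return JSON.stringify(cmd);
}
```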

2- SSML / style tokens:
Does the Streaming “speak” path accept SSML (e.g., <break>, <prosody>, <emphasis>, or equivalent style tags)?
If yes, which tags are supported, and how should they be embedded in the text payload?
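If SSML is supported, I would expect to wrap outgoing text roughly like this. The tags shown are standard W3C SSML; whether the Streaming speak path honors them is exactly what I am asking:

```typescript
// Wraps plain text in a standard SSML envelope. The tags used here
// (<speak>, <prosody>) come from the W3C SSML spec, not from the SDK docs —
// whether the speak path accepts them is the open question.
function toSSML(text: string, rate?: string): string {
  // Escape characters that are significant in XML.
  const escaped = text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
  const body = rate ? `<prosody rate="${rate}">${escaped}</prosody>` : escaped;
  return `<speak>${body}</speak>`;
}

// toSSML("Take a deep breath", "slow")
//   → '<speak><prosody rate="slow">Take a deep breath</prosody></speak>'
```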

3- Capturing the avatar’s reply text:
When using TALK with a knowledge base or LLM behind the scenes, is the exact text the avatar speaks available via:

  • a WebSocket event (e.g., avatar_message, tts_text, assistant_reply),
  • a REST callback/webhook, or
  • a “transcript” API?

If available, please share event names, payload schemas, and a minimal code snippet. If not available, what is the recommended workaround to reliably capture the final text used for TTS?
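For reference, here is the kind of client-side accumulator I would write against such an event, whatever it ends up being called. The `avatar_message` event name and the payload fields are placeholders — the real names are what I am asking about:

```typescript
// Hypothetical transcript accumulator. The event name ("avatar_message")
// and payload fields (session_id, text, final) are placeholders, not a
// confirmed API — shown only to illustrate the integration I have in mind.
interface AvatarMessageEvent {
  type: "avatar_message";
  session_id: string;
  text: string;    // partial or complete TTS text chunk
  final: boolean;  // true on the last chunk of an utterance
}

// Joins streamed text chunks into one string per finished utterance.
function accumulateTranscript(events: AvatarMessageEvent[]): string[] {
  const utterances: string[] = [];
  let current = "";
  for (const ev of events) {
    current += ev.text;
    if (ev.final) {
      utterances.push(current.trim());
      current = "";
    }
  }
  return utterances;
}
```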

Thank you.
Mohammed