Why We Chose WebSocket for Our Real-Time Voice Stack

01 — Problem Shape

What we're actually building

Before picking a transport protocol, it helps to be precise about the data flow. This is not browser-to-browser calling. There is no peer on the other side. The product shape is:

Browser microphone -> Your server -> AI services -> Browser audio playback

The key insight: this is a backend-mediated, bidirectional stream. The browser sends audio chunks up. The server sends transcripts, LLM tokens, and TTS audio chunks back. WebSocket maps directly to this shape.

WebRTC is designed for peer media. We don't have a browser peer or call room — we have a backend AI pipeline.


02 — Transport Comparison

Three protocols, one winner

We evaluated WebSocket, SSE, and WebRTC. Here's how they stack up for the specific job of "send mic audio to a server and stream AI output back."

| Transport | Direction | Binary audio upload | Server complexity | Verdict |
|---|---|---|---|---|
| WebSocket | Bidirectional | ✓ Native | Low–medium | Best default |
| SSE | Server → browser only | ✗ Needs separate POST | Low | Text streaming only |
| WebRTC media track | Bidirectional | ✓ As RTP | High | If already on WebRTC infra |
| WebRTC data channel | Bidirectional | | High | Valid but overkill |
| HTTP POST chunks | Request / response | | Low | Too much overhead per-chunk for live voice |

SSE is great for streaming LLM text tokens back to a browser — but it's one-directional. You'd need a second channel for audio upload, which defeats the simplicity. WebRTC adds signaling, ICE negotiation, and STUN/TURN overhead that only pays off when you need actual peer media or are already committed to a WebRTC backend like LiveKit or mediasoup.

Capability matrix

| Capability | WebSocket | SSE | WebRTC Data Channel | WebRTC Media Track |
|---|---|---|---|---|
| Browser sends mic chunks | Strong | Needs separate POST/fetch | Strong | Strong |
| Server sends transcripts/tokens | Strong | Strong | Strong | Needs data channel |
| Server sends TTS audio chunks | Strong | Possible but awkward | Strong | Strong if WebRTC peer |
| Barge-in / interrupt | Strong | Awkward | Strong | Strong |
| Auth with app session | Straightforward | Straightforward | More custom | More custom |
| Load balancing | Straightforward | Straightforward | Harder | Harder |
| Observability/debugging | Straightforward | Straightforward | Harder | Harder |
| STUN/TURN required | No | No | Sometimes | Often (depends on network topology and whether the server is publicly reachable or ICE-lite capable) |

03 — Audio Capture

Use browser media capture APIs regardless

Even though the transport is WebSocket, the browser-side capture story is unchanged. getUserMedia is part of the Media Capture and Streams API, which WebRTC also uses but which is not tied to any transport. Always request the full audio constraint set: echoCancellation, noiseSuppression, and autoGainControl. That gives you browser-native echo cancellation, noise suppression, and auto gain control before a single byte hits your backend.
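
A minimal sketch of that capture call, assuming a browser context. The constraint names are standard MediaTrackConstraints properties; the helper names `buildAudioConstraints` and `openMicrophone` and the mono `channelCount` are illustrative assumptions, not part of the original stack. Browsers treat these constraints as requests rather than guarantees, so the sketch logs what was actually applied.

```js
// Illustrative helper (an assumption, not the article's code): builds the
// constraints object passed to getUserMedia.
function buildAudioConstraints() {
  return {
    audio: {
      echoCancellation: true,
      noiseSuppression: true,
      autoGainControl: true,
      channelCount: 1, // assumption: mono is usually enough for speech
    },
    video: false,
  };
}

// Browser-only: requests the microphone and reports the settings the
// browser actually applied, which may differ from what was requested.
async function openMicrophone() {
  const stream = await navigator.mediaDevices.getUserMedia(buildAudioConstraints());
  const [track] = stream.getAudioTracks();
  console.log('applied audio settings:', track.getSettings());
  return stream;
}
```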

Then use MediaRecorder to emit small Opus-encoded chunks over your WebSocket connection, or an AudioWorklet if you would rather capture raw PCM frames and encode or resample them yourself.

Safari note: audio/webm;codecs=opus is well-supported in Chrome and Firefox. Safari may require audio/mp4 as a fallback — check MediaRecorder.isTypeSupported() before committing to a codec.
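
The codec check can be sketched as a small preference list. The candidate order and the helper names `pickMimeType` and `createRecorder` are assumptions for illustration; the support check is injected as a function so the selection logic can run outside a browser.

```js
// Candidate types in order of preference. The list is an assumption;
// adjust it to whatever your STT service accepts.
const CANDIDATE_MIME_TYPES = ['audio/webm;codecs=opus', 'audio/mp4'];

// Returns the first supported type, or null if none match.
// In a browser, pass (type) => MediaRecorder.isTypeSupported(type).
function pickMimeType(isTypeSupported, candidates = CANDIDATE_MIME_TYPES) {
  for (const type of candidates) {
    if (isTypeSupported(type)) return type;
  }
  return null;
}

// Browser-only: falls back to the browser's default format when no
// candidate is supported.
function createRecorder(stream) {
  const mimeType = pickMimeType((type) => MediaRecorder.isTypeSupported(type));
  return mimeType ? new MediaRecorder(stream, { mimeType }) : new MediaRecorder(stream);
}
```

Whatever type is selected, the server needs to know it, for example as part of the session metadata, so it can configure the decoder or STT request accordingly.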

Chunking recommendations

| Setting | Value |
|---|---|
| Chunk duration | 100–250ms |
| Codec | Prefer audio/webm;codecs=opus; detect fallback with MediaRecorder.isTypeSupported() |
| Silence handling | VAD before billing-heavy STT |
| Upload channel | WebSocket binary messages |
| Backpressure | Check ws.bufferedAmount |
| Turn-taking | VAD + server final transcript events |
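
The backpressure row can be sketched as a guard around `ws.send`. The 256 KB threshold, the 200ms timeslice, and the function names are assumed values for illustration and would need tuning against real chunk sizes and network conditions.

```js
// Assumed threshold; tune it against your chunk size and network.
const MAX_BUFFERED_BYTES = 256 * 1024;
const WS_OPEN = 1; // same value as WebSocket.OPEN

// Sends a chunk only if the socket is open and its send buffer is not
// backed up. Returns true if the chunk was handed to the socket.
function sendChunk(ws, chunk, maxBuffered = MAX_BUFFERED_BYTES) {
  if (ws.readyState !== WS_OPEN) return false;
  if (ws.bufferedAmount > maxBuffered) return false; // caller decides: queue, drop, or end the turn
  ws.send(chunk);
  return true;
}

// Browser-only wiring: MediaRecorder emits a Blob roughly every
// `timesliceMs`, and each non-empty Blob goes out as a binary message.
function startStreaming(recorder, ws, timesliceMs = 200) {
  recorder.ondataavailable = (event) => {
    if (event.data.size > 0) sendChunk(ws, event.data);
  };
  recorder.start(timesliceMs);
}
```

One caveat on the design: MediaRecorder chunks are pieces of a single container stream, so silently skipping one may leave a gap the decoder cannot tolerate. When the buffer is backed up, queueing briefly or ending the turn is likely safer than dropping.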

04 — End-to-End Production Pipeline

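The full flow from speech to response:

Browser microphone (getUserMedia) -> WebSocket -> Streaming STT -> Streaming LLM -> Streaming TTS -> WebSocket -> Browser audio queue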

Sequence across services

[Figure: sequence diagram of one turn across the browser, the server, and the STT, LLM, and TTS services]

05 — Latency

Overlap every stage

The fastest-feeling voice assistants don't wait for stage N to finish before starting stage N+1. They overlap. STT starts processing before the user finishes speaking. The LLM can start from stable partials or final utterance segments — kicking off before the full transcript is confirmed. TTS starts synthesizing before the LLM completes. Audio starts playing before TTS finishes.

                 0ms        500ms       1000ms      1500ms      2000ms
                 |           |           |           |           |
User speech      ████████████████████████░░░░░░░░░░░░░░░░░░░░░░░░
Capture + chunk  ████████████████████████████░░░░░░░░░░░░░░░░░░░░
Streaming STT    ░░██████████████████████████████░░░░░░░░░░░░░░░░
Streaming LLM    ░░░░░░░░░░░░░░░░░░██████████████████░░░░░░░░░░░░
Streaming TTS    ░░░░░░░░░░░░░░░░░░░░░░░░████████████████░░░░░░░░
Audio playback   ░░░░░░░░░░░░░░░░░░░░░░░░░░░░████████████████████

The anti-pattern to avoid:

record full audio -> transcribe full audio -> generate full answer -> synthesize full answer -> play audio

Each step may be individually fast, but the waits add up end to end, and the resulting pause before any audio plays feels broken.
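
To make the contrast concrete, here is a toy sketch of the overlapped shape using async generators. The `fakeStt`, `fakeLlm`, and `fakeTts` stages are stand-ins invented for illustration, not real service clients, and their one-in, one-out mapping is a simplification. The point is only that each stage consumes a stream and emits a stream, so output starts flowing before input has finished.

```js
// Simulated microphone: logs each chunk as it is captured.
async function* microphone(count, log) {
  for (let i = 1; i <= count; i++) {
    log.push(`mic:${i}`);
    yield i;
  }
}

// Stand-in stages (assumptions). Each yields as soon as it has input,
// instead of waiting for the upstream stage to finish.
async function* fakeStt(audioChunks) {
  for await (const chunk of audioChunks) yield `word${chunk}`;
}

async function* fakeLlm(words) {
  for await (const word of words) yield word.toUpperCase();
}

async function* fakeTts(tokens) {
  for await (const token of tokens) yield `audio(${token})`;
}

// Chains the stages and "plays" each audio chunk as it arrives.
async function runOverlapped(count) {
  const log = [];
  const audioStream = fakeTts(fakeLlm(fakeStt(microphone(count, log))));
  for await (const audio of audioStream) log.push(`play:${audio}`);
  return log;
}
```

In the log returned by `runOverlapped(3)`, the first `play:` entry appears before the last `mic:` entry, which is the property the timeline above illustrates. A real pipeline would run the stages concurrently with buffering between them rather than pulling one item at a time.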


06 — Message Protocol

A typed WebSocket protocol

Use a single persistent WebSocket connection with typed message envelopes. This keeps your backend stateful per-session and makes barge-in handling straightforward.
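
As a rough illustration of why barge-in stays simple on a single connection, here is a sketch of a browser-side playback queue that can be flushed on interrupt. The `PlaybackQueue` class and its per-turn id are hypothetical; the protocol described in this article does not define a turn id, so treat it as one possible way to discard audio that arrives after the interrupt.

```js
// Holds TTS chunks waiting to be played. The turn id (an assumption)
// lets chunks from an interrupted turn be rejected if they arrive late.
class PlaybackQueue {
  constructor() {
    this.chunks = [];
    this.activeTurn = 0;
  }

  // Called when a new assistant turn starts; returns its id.
  beginTurn() {
    this.activeTurn += 1;
    return this.activeTurn;
  }

  // Queues a chunk unless it belongs to a turn that is no longer active.
  enqueue(turnId, chunk) {
    if (turnId !== this.activeTurn) return false;
    this.chunks.push(chunk);
    return true;
  }

  // Next chunk to hand to the audio output, or undefined when empty.
  next() {
    return this.chunks.shift();
  }

  // Barge-in: drop pending audio, invalidate the turn, notify the server.
  interrupt(ws) {
    this.chunks.length = 0;
    this.activeTurn += 1;
    ws.send(JSON.stringify({ type: 'interrupt' }));
  }
}
```

On the server side, the matching step would be to cancel the in-flight LLM and TTS work for that session when the `interrupt` message arrives.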

Control and text frames are sent as JSON. Audio is sent as raw binary frames — don't try to embed binary data inside a JSON envelope. A practical split:

  • JSON frames: all control messages (session_start, audio_end, interrupt) and server-to-browser text events
  • Binary frames: audio_chunk (mic data up) and tts_audio_chunk (TTS audio down); use a small binary header or a parallel JSON message to carry metadata if needed

| Direction | Type | Payload |
|---|---|---|
| Browser → Server | session_start | locale, voice, auth/session metadata |
| Browser → Server | audio_chunk | binary Opus chunk or JSON metadata + binary frame |
| Browser → Server | audio_end | user stopped speaking or VAD turn ended |
| Browser → Server | interrupt | user barged in while AI audio was playing |
| Server → Browser | transcript_partial | partial STT text |
| Server → Browser | transcript_final | final turn text |
| Server → Browser | llm_token | response text token |
| Server → Browser | tts_audio_chunk | binary audio output |
| Server → Browser | turn_end | server completed the assistant turn |
| Server → Browser | error | recoverable error information |
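
One way to realize the JSON/binary split is sketched below. The 5-byte header layout (one byte for the frame kind, four bytes for a sequence number), the kind constants, and the function names are all assumptions made for illustration; the article only says a small binary header can carry metadata. The router assumes `ws.binaryType = 'arraybuffer'` on the browser side.

```js
// Hypothetical header: 1 byte frame kind + 4 byte big-endian sequence number.
const HEADER_BYTES = 5;
const KIND_AUDIO_CHUNK = 1;     // mic audio, browser -> server
const KIND_TTS_AUDIO_CHUNK = 2; // TTS audio, server -> browser

// Prepends the header to a Uint8Array payload and returns an ArrayBuffer.
function encodeBinaryFrame(kind, seq, payload) {
  const frame = new Uint8Array(HEADER_BYTES + payload.byteLength);
  const view = new DataView(frame.buffer);
  view.setUint8(0, kind);
  view.setUint32(1, seq); // DataView writes big-endian by default
  frame.set(payload, HEADER_BYTES);
  return frame.buffer;
}

// Splits an ArrayBuffer back into header fields and payload bytes.
function decodeBinaryFrame(buffer) {
  const view = new DataView(buffer);
  return {
    kind: view.getUint8(0),
    seq: view.getUint32(1),
    payload: new Uint8Array(buffer, HEADER_BYTES),
  };
}

// Routes one incoming message: strings are JSON envelopes dispatched by
// their `type` field, anything else is treated as a binary audio frame.
function routeMessage(data, handlers) {
  if (typeof data === 'string') {
    const message = JSON.parse(data);
    const handler = handlers[message.type];
    if (handler) handler(message);
    return message.type;
  }
  const frame = decodeBinaryFrame(data);
  if (handlers.binary) handlers.binary(frame);
  return 'binary';
}
```

In a browser this would be wired up as `ws.onmessage = (event) => routeMessage(event.data, handlers)`, with one handler per message type from the table above.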

07 — When to Pick WebRTC

The WebRTC exceptions

There are legitimate reasons to choose WebRTC:

| Reason | Why it changes the decision |
|---|---|
| You are already using LiveKit, mediasoup, Janus, or Pion | The hard backend WebRTC work is already solved. |
| You need live calls, rooms, or peer media | WebRTC media tracks are the native model. |
| You need one unified session for audio/video/data | WebRTC can carry media and data together. |
| You need RTP-level media behavior | WebRTC gives jitter buffering, packet loss concealment, and media negotiation. |

If none of those are true, WebSocket is the simpler production path. A WebRTC backend, by comparison, adds signaling, ICE negotiation, STUN/TURN, and a media server such as LiveKit or mediasoup in front of the same AI pipeline.

08 — Final Decision

Decision tree

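Already running LiveKit, mediasoup, Janus, or Pion? -> WebRTC
Need live calls, rooms, or peer media? -> WebRTC
Need one unified session for audio/video/data? -> WebRTC
Need RTP-level media behavior? -> WebRTC
None of the above -> WebSocket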

The stack

getUserMedia capture · WebSocket transport · Streaming STT · Streaming LLM · Streaming TTS · Browser audio queue · VAD turn-taking

| Area | Decision | Reason |
|---|---|---|
| Mic capture | Browser media capture APIs (getUserMedia) | Browser-native echo cancellation, noise suppression, auto gain; independent of transport |
| Transport | WebSocket | Bidirectional, binary support, easy auth, simple scaling |
| Server pipeline | Streaming STT → LLM → TTS | Enables overlapped stages and low perceived latency |
| Audio playback | Streaming audio queue | Chunks play before full synthesis completes |
| Turn-taking | VAD + server final transcript events | Clean separation of detecting speech vs. confirming it |

Use WebSocket for real-time production transport, use getUserMedia for browser mic quality, add VAD for turn-taking, and use streaming STT/LLM/TTS services for true real-time voice.