Engineering

Why We Chose WebSocket for Our Real-Time Voice Stack

June 3, 2026 · 9 min read

01 — Problem Shape

What we're actually building

Before picking a transport protocol, it helps to be precise about the data flow. This is not browser-to-browser calling. There is no peer on the other side. The product shape is:

Browser microphone -> Your server -> AI services -> Browser audio playback

The key insight: this is a backend-mediated, bidirectional stream. The browser sends audio chunks up. The server sends transcripts, LLM tokens, and TTS audio chunks back. WebSocket maps directly to this shape.

WebRTC is designed for peer media. We don't have a browser peer or call room — we have a backend AI pipeline.

02 — Transport Comparison

Three protocols, one winner

We evaluated WebSocket, SSE, and WebRTC (as both media tracks and data channels), with chunked HTTP POST as a baseline. Here's how they stack up for the specific job of "send mic audio to a server and stream AI output back."

Transport	Direction	Binary audio upload	Server complexity	Verdict
WebSocket	Bidirectional	✓ Native	Low–medium	Best default
SSE	Server → browser only	✗ Needs separate POST	Low	Text streaming only
WebRTC media track	Bidirectional	✓ As RTP	High	If already on WebRTC infra
WebRTC data channel	Bidirectional	✓	High	Valid but overkill
HTTP POST chunks	Request / response	✓	Low	Too much overhead per-chunk for live voice

SSE is great for streaming LLM text tokens back to a browser — but it's one-directional. You'd need a second channel for audio upload, which defeats the simplicity. WebRTC adds signaling, ICE negotiation, and STUN/TURN overhead that only pays off when you need actual peer media or are already committed to a WebRTC backend like LiveKit or mediasoup.

Capability matrix

Capability	WebSocket	SSE	WebRTC Data Channel	WebRTC Media Track
Browser sends mic chunks	Strong	Needs separate POST/fetch	Strong	Strong
Server sends transcripts/tokens	Strong	Strong	Strong	Needs data channel
Server sends TTS audio chunks	Strong	Possible but awkward	Strong	Strong if WebRTC peer
Barge-in / interrupt	Strong	Awkward	Strong	Strong
Auth with app session	Straightforward	Straightforward	More custom	More custom
Load balancing	Straightforward	Straightforward	Harder	Harder
Observability/debugging	Straightforward	Straightforward	Harder	Harder
STUN/TURN required	No	No	Sometimes	Often (depends on network topology and whether the server is publicly reachable or ICE-lite capable)

03 — Audio Capture

Use browser media capture APIs regardless

Even though the transport is WebSocket, the browser-side capture story is unchanged. getUserMedia is part of the Media Capture and Streams API, which WebRTC also uses but which is not tied to WebRTC as a transport. Request audio with the echoCancellation, noiseSuppression, and autoGainControl constraints enabled. Where the browser and device support them, you get echo cancellation, noise suppression, and automatic gain control for free, before a single byte hits your backend.
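
As a minimal sketch of that capture setup, the snippet below builds the audio constraints and requests the microphone. The three constraint names are standard MediaTrackConstraints properties; the helper names (`buildAudioConstraints`, `openMicrophone`) and the `MediaDevicesLike` interface are ours for illustration. Browsers treat these constraints as requests, so they may not be honored on every device.

```typescript
// Minimal structural types so the sketch compiles without DOM typings.
interface AudioConstraintSet {
  echoCancellation: boolean;
  noiseSuppression: boolean;
  autoGainControl: boolean;
}

interface MediaDevicesLike<S> {
  getUserMedia(constraints: { audio: AudioConstraintSet; video: false }): Promise<S>;
}

// Build the audio constraints with all three processing flags on by default.
// Callers can switch individual flags off through `overrides`.
function buildAudioConstraints(
  overrides: Partial<AudioConstraintSet> = {},
): AudioConstraintSet {
  return {
    echoCancellation: true,
    noiseSuppression: true,
    autoGainControl: true,
    ...overrides,
  };
}

// Request an audio-only stream. In a browser, pass navigator.mediaDevices.
function openMicrophone<S>(devices: MediaDevicesLike<S>): Promise<S> {
  return devices.getUserMedia({ audio: buildAudioConstraints(), video: false });
}
```

In a page this would be called as `openMicrophone(navigator.mediaDevices)` from a user gesture, since browsers prompt for microphone permission.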

Then use MediaRecorder to emit small Opus-encoded chunks over your WebSocket connection. An AudioWorklet is the alternative when you want raw PCM frames, which you then encode yourself or send as-is.

Safari note: audio/webm;codecs=opus is well-supported in Chrome and Firefox. Safari may require audio/mp4 as a fallback — check MediaRecorder.isTypeSupported() before committing to a codec.
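
One way to express that check is a small selection helper. This is a sketch, not a definitive compatibility table: the two MIME strings come from the note above, while `pickRecorderMimeType` and `PREFERRED_MIME_TYPES` are illustrative names. The support check is injected so the function can run outside a browser.

```typescript
// Candidate container/codec strings in preference order (from the note above).
const PREFERRED_MIME_TYPES: readonly string[] = [
  "audio/webm;codecs=opus",
  "audio/mp4",
];

// Return the first supported type, or null when none of the candidates work.
// In a browser, pass (t) => MediaRecorder.isTypeSupported(t) as `isSupported`.
function pickRecorderMimeType(
  isSupported: (mimeType: string) => boolean,
  candidates: readonly string[] = PREFERRED_MIME_TYPES,
): string | null {
  for (const candidate of candidates) {
    if (isSupported(candidate)) {
      return candidate;
    }
  }
  return null;
}
```

The result can be passed as the `mimeType` option of the `MediaRecorder` constructor. A `null` result means the page should fall back to another capture path or report that recording is unavailable.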

Chunking recommendations

Setting	Value
Chunk duration	100–250 ms
Codec	Prefer `audio/webm;codecs=opus`; detect fallback with `MediaRecorder.isTypeSupported()`
Silence handling	VAD before billing-heavy STT
Upload channel	WebSocket binary messages
Backpressure	Check `ws.bufferedAmount`
Turn-taking	VAD + server final transcript events
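
The backpressure row can be sketched as a guard around `send`. `bufferedAmount` is a real WebSocket property (bytes queued but not yet transmitted); the 256 KiB threshold, the drop-on-overflow policy, and the names `SocketLike` and `sendAudioChunk` are assumptions for illustration. Dropping is only one possible policy; a product might instead pause capture or close the session.

```typescript
// The subset of the WebSocket interface this sketch relies on.
interface SocketLike {
  readonly bufferedAmount: number; // bytes queued locally, not yet sent
  send(data: Uint8Array): void;
}

// Hypothetical threshold: tune it against your chunk size and bitrate.
const MAX_BUFFERED_BYTES = 256 * 1024;

// Send a chunk only if the socket's send buffer is not backed up.
// Returns true when the chunk was handed to the socket, false when dropped.
function sendAudioChunk(
  socket: SocketLike,
  chunk: Uint8Array,
  maxBufferedBytes: number = MAX_BUFFERED_BYTES,
): boolean {
  if (socket.bufferedAmount > maxBufferedBytes) {
    return false; // network is behind; stale live audio is not worth queueing
  }
  socket.send(chunk);
  return true;
}
```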

04 — End-to-End Production Pipeline

The full flow from speech to response:

[Diagram: end-to-end flow from speech to response]

Sequence across services

[Diagram: message sequence across services]

05 — Latency

Overlap every stage

The fastest-feeling voice assistants don't wait for stage N to finish before starting stage N+1. They overlap. STT starts processing before the user finishes speaking. The LLM can start from stable partials or final utterance segments — kicking off before the full transcript is confirmed. TTS starts synthesizing before the LLM completes. Audio starts playing before TTS finishes.

                 0ms        500ms       1000ms      1500ms      2000ms
                 |           |           |           |           |
User speech      ████████████████████████░░░░░░░░░░░░░░░░░░░░░░░░
Capture + chunk  ████████████████████████████░░░░░░░░░░░░░░░░░░░░
Streaming STT    ░░██████████████████████████████░░░░░░░░░░░░░░░░
Streaming LLM    ░░░░░░░░░░░░░░░░░░██████████████████░░░░░░░░░░░░
Streaming TTS    ░░░░░░░░░░░░░░░░░░░░░░░░████████████████░░░░░░░░
Audio playback   ░░░░░░░░░░░░░░░░░░░░░░░░░░░░████████████████████

The anti-pattern to avoid:

record full audio -> transcribe full audio -> generate full answer -> synthesize full answer -> play audio

Each step may be individually fast, but the delays add up end to end, and the total wait feels broken.
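
A back-of-the-envelope model makes the difference concrete. It is a deliberately simplified sketch: it measures the delay from the end of user speech to the first audible response, and the stage timings are made-up illustrative numbers, not measurements from our stack.

```typescript
interface StageTiming {
  name: string;
  fullMs: number;        // time to process the whole turn in batch mode
  firstOutputMs: number; // time to first streamed output once input arrives
}

// Sequential: each stage waits for the previous stage to finish completely.
function sequentialLatencyMs(stages: StageTiming[]): number {
  return stages.reduce((sum, stage) => sum + stage.fullMs, 0);
}

// Overlapped: each stage starts as soon as the upstream stage emits anything.
function overlappedLatencyMs(stages: StageTiming[]): number {
  return stages.reduce((sum, stage) => sum + stage.firstOutputMs, 0);
}

// Hypothetical timings for one turn.
const exampleStages: StageTiming[] = [
  { name: "STT", fullMs: 800, firstOutputMs: 150 },
  { name: "LLM", fullMs: 1500, firstOutputMs: 300 },
  { name: "TTS", fullMs: 1200, firstOutputMs: 200 },
];

const sequentialMs = sequentialLatencyMs(exampleStages); // 3500
const overlappedMs = overlappedLatencyMs(exampleStages); // 650
```

Under these assumed numbers the user hears the first audio after about 0.65 s instead of 3.5 s, even though no individual stage got faster.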

06 — Message Protocol

A typed WebSocket protocol

Use a single persistent WebSocket connection with typed message envelopes. This keeps your backend stateful per-session and makes barge-in handling straightforward.

Control and text frames are sent as JSON. Audio is sent as raw binary frames — don't try to embed binary data inside a JSON envelope. A practical split:

JSON frames — all control messages (session_start, audio_end, interrupt) and server-to-browser text events
Binary frames — audio_chunk (mic data up) and tts_audio_chunk (TTS audio down); use a small binary header or a parallel JSON message to carry metadata if needed
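
For the "small binary header" option, one possible layout is a 5-byte prefix: one byte for the frame type and a 4-byte big-endian sequence number, followed by the audio payload. The layout, the type codes, and the function names below are assumptions for illustration, not a standard.

```typescript
// Hypothetical frame type codes.
const FRAME_AUDIO_CHUNK = 1;     // mic audio, browser to server
const FRAME_TTS_AUDIO_CHUNK = 2; // TTS audio, server to browser
const HEADER_BYTES = 5;          // 1 byte type + 4 bytes sequence number

interface AudioFrame {
  type: number;
  seq: number;
  payload: Uint8Array;
}

// Prefix the payload with the header and return a single binary frame.
function encodeAudioFrame(type: number, seq: number, payload: Uint8Array): Uint8Array {
  const frame = new Uint8Array(HEADER_BYTES + payload.length);
  const view = new DataView(frame.buffer);
  view.setUint8(0, type);
  view.setUint32(1, seq, false); // big-endian; wraps at 2^32
  frame.set(payload, HEADER_BYTES);
  return frame;
}

// Split a received binary frame back into header fields and payload.
function decodeAudioFrame(frame: Uint8Array): AudioFrame {
  if (frame.length < HEADER_BYTES) {
    throw new Error("frame shorter than header");
  }
  // Respect byteOffset so views into a larger buffer decode correctly.
  const view = new DataView(frame.buffer, frame.byteOffset, frame.byteLength);
  return {
    type: view.getUint8(0),
    seq: view.getUint32(1, false),
    payload: frame.subarray(HEADER_BYTES),
  };
}
```

The sequence number lets the receiver detect gaps or reordering after a reconnect. On the browser side, setting `ws.binaryType = "arraybuffer"` makes incoming binary frames arrive as `ArrayBuffer`, which can be wrapped in a `Uint8Array` before decoding.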

Direction	Type	Payload
Browser → Server	`session_start`	locale, voice, auth/session metadata
Browser → Server	`audio_chunk`	binary Opus chunk or JSON metadata + binary frame
Browser → Server	`audio_end`	user stopped speaking or VAD turn ended
Browser → Server	`interrupt`	user barged in while AI audio was playing
Server → Browser	`transcript_partial`	partial STT text
Server → Browser	`transcript_final`	final turn text
Server → Browser	`llm_token`	response text token
Server → Browser	`tts_audio_chunk`	binary audio output
Server → Browser	`turn_end`	server completed the assistant turn
Server → Browser	`error`	recoverable error information
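
The JSON side of the table maps naturally onto discriminated unions. The sketch below uses the message types from the table; the payload field names (`text`, `token`, `message`, `sessionToken`) are assumptions, since the table only describes payloads loosely. The binary types `audio_chunk` and `tts_audio_chunk` are left out because they travel as binary frames.

```typescript
// Browser -> server control messages (field names are illustrative).
type ClientMessage =
  | { type: "session_start"; locale: string; voice: string; sessionToken: string }
  | { type: "audio_end" }
  | { type: "interrupt" };

// Server -> browser text events.
type ServerMessage =
  | { type: "transcript_partial"; text: string }
  | { type: "transcript_final"; text: string }
  | { type: "llm_token"; token: string }
  | { type: "turn_end" }
  | { type: "error"; message: string };

function serializeClientMessage(message: ClientMessage): string {
  return JSON.stringify(message);
}

// Validate an incoming text frame; return null for anything unrecognized.
function parseServerMessage(raw: string): ServerMessage | null {
  let value: unknown;
  try {
    value = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof value !== "object" || value === null) {
    return null;
  }
  const fields = value as Record<string, unknown>;
  const text = fields["text"];
  const token = fields["token"];
  const message = fields["message"];
  switch (fields["type"]) {
    case "transcript_partial":
      return typeof text === "string" ? { type: "transcript_partial", text } : null;
    case "transcript_final":
      return typeof text === "string" ? { type: "transcript_final", text } : null;
    case "llm_token":
      return typeof token === "string" ? { type: "llm_token", token } : null;
    case "turn_end":
      return { type: "turn_end" };
    case "error":
      return typeof message === "string" ? { type: "error", message } : null;
    default:
      return null;
  }
}
```

Returning `null` for unknown types keeps older clients tolerant of new server events instead of throwing inside the socket's message handler.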

07 — When to Pick WebRTC

The WebRTC exceptions

There are legitimate reasons to choose WebRTC:

Reason	Why it changes the decision
You are already using LiveKit, mediasoup, Janus, or Pion	The hard backend WebRTC work is already solved.
You need live calls, rooms, or peer media	WebRTC media tracks are the native model.
You need one unified session for audio/video/data	WebRTC can carry media and data together.
You need RTP-level media behavior	WebRTC gives jitter buffering, packet loss concealment, and media negotiation.

If none of those are true, WebSocket is the simpler production path. For reference, here's what the WebRTC backend stack looks like:

[Diagram: WebRTC backend stack]

08 — Final Decision

Decision tree

[Diagram: transport decision tree]
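
The decision can also be written down as a small function. This sketch is derived from the four reasons listed in section 07 rather than from the original diagram, and the field and function names are illustrative.

```typescript
interface TransportRequirements {
  alreadyOnWebRtcInfra: boolean;       // e.g. LiveKit, mediasoup, Janus, or Pion
  needsCallsRoomsOrPeerMedia: boolean; // live calls, rooms, or peer media
  needsUnifiedAudioVideoData: boolean; // one session for audio, video, and data
  needsRtpLevelMedia: boolean;         // jitter buffering, loss concealment, negotiation
}

type Transport = "webrtc" | "websocket";

// Any single WebRTC reason is enough; otherwise WebSocket is the default.
function chooseTransport(requirements: TransportRequirements): Transport {
  const needsWebRtc =
    requirements.alreadyOnWebRtcInfra ||
    requirements.needsCallsRoomsOrPeerMedia ||
    requirements.needsUnifiedAudioVideoData ||
    requirements.needsRtpLevelMedia;
  return needsWebRtc ? "webrtc" : "websocket";
}
```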

The stack

getUserMedia capture · WebSocket transport · Streaming STT · Streaming LLM · Streaming TTS · Browser audio queue · VAD turn-taking

Area	Decision	Reason
Mic capture	Browser media capture APIs (`getUserMedia`)	Browser-native echo cancellation, noise suppression, auto gain — independent of transport
Transport	WebSocket	Bidirectional, binary support, easy auth, simple scaling
Server pipeline	Streaming STT → LLM → TTS	Enables overlapped stages and low perceived latency
Audio playback	Streaming audio queue	Chunks play before full synthesis completes
Turn-taking	VAD + server final transcript events	Clean separation of detecting speech vs. confirming it
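
The streaming audio queue and barge-in handling fit together in a few lines. The sketch below is one possible shape, not our production player: the turn-id scheme is an assumption we introduce so that TTS chunks still in flight from an interrupted turn are discarded when they arrive late, and the actual decoding and playback (for example through the Web Audio API) is left out.

```typescript
class PlaybackQueue {
  private pending: Uint8Array[] = [];
  private activeTurn = 0;

  // Start accepting audio for a new assistant turn and return its id.
  beginTurn(): number {
    this.pending = [];
    this.activeTurn += 1;
    return this.activeTurn;
  }

  // Queue a TTS chunk; chunks tagged with a stale turn id are rejected.
  enqueue(turnId: number, chunk: Uint8Array): boolean {
    if (turnId !== this.activeTurn) {
      return false;
    }
    this.pending.push(chunk);
    return true;
  }

  // Take the next chunk to play, or undefined when the queue is empty.
  next(): Uint8Array | undefined {
    return this.pending.shift();
  }

  // Barge-in: drop queued audio and invalidate the current turn id.
  interrupt(): void {
    this.pending = [];
    this.activeTurn += 1;
  }

  get size(): number {
    return this.pending.length;
  }
}
```

On barge-in the client would call `interrupt()`, stop whatever chunk is currently playing, and send the `interrupt` message so the server can cancel the LLM and TTS streams.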

Use WebSocket for real-time production transport, use getUserMedia for browser mic quality, add VAD for turn-taking, and use streaming STT/LLM/TTS services for true real-time voice.