
- Calls: the control plane. Originate, fetch, transfer, hang up. HTTP-shaped, idempotent, returns a Call handle keyed by a single sid.
- AudioBridge: the data plane. Bidirectional encoded audio over MoQT tracks at voice/<sid>/{uplink,downlink}.
- Agents: bind a server-side AI agent to a sid so the engine wires the bridge end-to-end and drives the conversation for you.
One sid is the only identifier you ever address. The PSTN leg, the SIP leg, the
browser softphone leg, and the AI-agent leg all hang off it.
When to use it
Outbound + inbound calling
Originate to E.164 over a SIP trunk; accept inbound calls terminated by the
SIP gateway. One control surface for both directions.
AI voice agents
Attach a speech-to-speech agent to a live call. The engine bridges audio
both ways; you write the agent, not the plumbing.
Programmable audio
Tap caller audio (uplink) into your own ASR and push synthesized audio
(downlink) back, frame by frame, with full codec control.
Browser softphone
Place and answer calls from the browser with encoded Opus over MoQT — no
WebRTC transport, no SFU, no gateway round-trips for media.
Wire model
A call’s media is two MoQT audio tracks under a per-call namespace, voice/<sid>/uplink and voice/<sid>/downlink. Each track is tagged with a capability of the form voice/<codec> (for example
voice/opus). The relay routes on that capability: a recording sidecar, an ASR
consumer, or an AI-agent attach can subscribe to a call’s audio without the
publisher knowing they exist. You route on intent, not on a hardcoded peer.
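The track naming and the caller's-point-of-view direction rule in this section can be sketched as two small helpers. This is a minimal illustration, not SDK code: `Role`, `TrackPlan`, `trackName`, and `planFor` are hypothetical names, and the sid `CA123` is a made-up value.

```typescript
// Hypothetical helper types; these names are illustrative, not part of the SDK.
type Direction = "uplink" | "downlink";
type Role = "server" | "caller";

interface TrackPlan {
  publish: string;   // track this side writes to
  subscribe: string; // track this side reads from
}

// Build the per-call track name: voice/<sid>/<direction>.
function trackName(sid: string, direction: Direction): string {
  return `voice/${sid}/${direction}`;
}

// Directions are named from the caller's point of view:
// the caller publishes uplink (mic) and subscribes downlink (playback);
// the server does the mirror image.
function planFor(role: Role, sid: string): TrackPlan {
  const uplink = trackName(sid, "uplink");
  const downlink = trackName(sid, "downlink");
  return role === "caller"
    ? { publish: uplink, subscribe: downlink }
    : { publish: downlink, subscribe: uplink };
}

// Example with a made-up sid.
console.log(planFor("server", "CA123"));
console.log(planFor("caller", "CA123"));
```

Keeping the rule in one function means the two sides of a call cannot accidentally publish to the same track.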
“Uplink” and “downlink” are defined from the caller’s point of view. When you
attach() as the server you subscribe uplink and publish downlink; when you
attachCaller() as the browser you do the mirror image: publish uplink (mic) and
subscribe downlink (playback).
Lanes and QoS
Audio rides MoQT’s reliable, ordered delivery by default. A 20 ms Opus frame is small, and the relay fans each object out per group with bounded queueing. Late frames are dropped at playout, not retransmitted forever, so a transient loss costs one packet of jitter, not a growing backlog. The bridge holds one publication (your direction) and one subscription (the peer direction); releasing the handle tears down both tracks and ends the bridge.
Codecs
The bridge transcodes between what the call leg negotiated and what your code asked for: PSTN µ-law in, Opus to your code, Opus from your code, µ-law back out, with no media tooling in your process.

| Codec | When to use it |
|---|---|
| opus | Default. Best quality-per-bit; native in the browser path. |
| pcm16 | When an AI model wants raw 16-bit PCM (many realtime models do). |
| g711_ulaw | PSTN-direct, no transcoding. µ-law (North America / Japan). |
| g711_alaw | PSTN-direct, no transcoding. A-law (most of the rest of the world). |
sampleRate (default 48000), channels (default 1) and frameMs (default 20)
round out the bridge’s audio shape.
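To make the audio shape concrete, the sketch below works out how many samples and bytes one frame holds for the fixed-rate codecs. The option names and defaults come from the text above; the `AudioShape` type and the helper functions are hypothetical, and the 8000 Hz G.711 example assumes the usual narrowband telephony rate rather than anything the bridge is documented to require.

```typescript
// Hypothetical shape type built from the documented options and defaults.
type Codec = "opus" | "pcm16" | "g711_ulaw" | "g711_alaw";

interface AudioShape {
  codec: Codec;
  sampleRate: number; // Hz
  channels: number;
  frameMs: number;    // frame duration in milliseconds
}

const DEFAULT_SHAPE: AudioShape = {
  codec: "opus",
  sampleRate: 48000,
  channels: 1,
  frameMs: 20,
};

// Samples per channel in one frame.
function samplesPerFrame(shape: AudioShape): number {
  return (shape.sampleRate * shape.frameMs) / 1000;
}

// Payload bytes per frame for fixed-rate codecs.
// Opus is variable bitrate, so its frame size is not derivable from the shape.
function frameBytes(shape: AudioShape): number | undefined {
  const samples = samplesPerFrame(shape) * shape.channels;
  switch (shape.codec) {
    case "pcm16":
      return samples * 2; // 16 bits = 2 bytes per sample
    case "g711_ulaw":
    case "g711_alaw":
      return samples;     // G.711 uses 8 bits = 1 byte per sample
    case "opus":
      return undefined;
  }
}

console.log(samplesPerFrame(DEFAULT_SHAPE));                     // 960
console.log(frameBytes({ ...DEFAULT_SHAPE, codec: "pcm16" }));   // 1920
console.log(
  frameBytes({ codec: "g711_ulaw", sampleRate: 8000, channels: 1, frameMs: 20 }),
); // 160
```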
Call lifecycle
The control plane reports a CallStatus that moves monotonically toward a
terminal state.
originate() returns as soon as the call is dialing; re-fetch with
voice.calls.get({ sid }) to read the latest status. transfer() re-points the
live audio at a new PSTN number (a SIP REFER under the hood) or re-attaches a
different agent, keeping the same sid. hangup() ends the call and drops both
tracks.
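Because originate() returns while the call is still dialing, callers typically re-fetch until the status is terminal. The sketch below shows one way to do that with a bounded polling loop. It is an assumption-laden illustration: `CallLike`, `GetCall`, and `waitForTerminal` are hypothetical names, the fetch function is injected rather than tied to the real client, and the status strings in the demo ("dialing", "ringing", "completed") are placeholders, since the actual CallStatus values are not listed here.

```typescript
// Hypothetical minimal view of a call; the real Call type may carry more fields.
interface CallLike {
  sid: string;
  status: string;
}

// Any function with the same shape as voice.calls.get({ sid }).
type GetCall = (args: { sid: string }) => Promise<CallLike>;

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Re-fetch the call until isTerminal(status) is true, or give up after maxPolls.
async function waitForTerminal(
  getCall: GetCall,
  sid: string,
  isTerminal: (status: string) => boolean,
  opts: { intervalMs?: number; maxPolls?: number } = {},
): Promise<CallLike> {
  const intervalMs = opts.intervalMs ?? 500;
  const maxPolls = opts.maxPolls ?? 120;
  for (let i = 0; i < maxPolls; i++) {
    const call = await getCall({ sid });
    if (isTerminal(call.status)) return call;
    await sleep(intervalMs);
  }
  throw new Error(`call ${sid} did not reach a terminal status after ${maxPolls} polls`);
}

// Demo against a fake client that walks through placeholder statuses.
const demoStatuses = ["dialing", "ringing", "completed"];
let demoIndex = 0;
const fakeGet: GetCall = async ({ sid }) => ({
  sid,
  status: demoStatuses[Math.min(demoIndex++, demoStatuses.length - 1)],
});

waitForTerminal(fakeGet, "CA123", (s) => s === "completed", { intervalMs: 0 })
  .then((call) => console.log(call.status));
```

With a real client, the first argument would be a thin wrapper such as `(args) => voice.calls.get(args)`, and the predicate would test for whichever statuses the control plane treats as terminal.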
Inbound calls and the SIP gateway
Inbound PSTN calls are terminated by a SIP gateway acting as a back-to-back user agent (B2BUA). It negotiates SIP signalling with the carrier, answers, and publishes the call’s audio onto the same voice/<sid>/{uplink,downlink} tracks,
so an inbound call is indistinguishable from an outbound one at the SDK surface.
You attach an audio bridge or an agent to its sid exactly the same way.
For calls with no AI bridge, the gateway terminates RTP locally and runs
voice-activity detection on-box, so the same sid can take either an RTP fast
path or a MoQT bridge without your code changing.
The browser softphone path
The browser places audio on the same MoQT tracks as everything else; there is no WebRTC transport on this path and no SFU. The flow:
Capture + encode
captureMicrophone() runs the browser’s echo-cancellation, gain control and
noise suppression, then diverts the encoded Opus frames (via insertable
streams / encoded transform) onto the uplink track. Raw PCM never crosses the
wire.
Publish over QUIC/MoQT
Each encoded frame is written to voice/<sid>/uplink as a MoQT object. The
relay fans it to whatever the engine bridged the call to: the SIP leg, an
agent, or a recorder.
Architecture
The media plane is built for telephony-grade latency at fleet scale:
- One QUIC connection, many tracks. Audio multiplexes over the same MoQT substrate as every other modality; one auth token authorizes the whole connection.
- Kernel-bypass RTP fast path. The SIP/RTP leg uses an AF_XDP zero-copy fast path — packets move between the NIC and userspace through an XSK ring over UMEM, skipping the kernel networking stack for steady-state media.
- Shard-per-core engine. A thread-per-core reactor binds the media port with SO_REUSEPORT; an eBPF classifier steers each call’s packets to the shard that owns it. Cross-shard hand-off uses lock-free SPSC/SPMC rings (the mcache / dcache rings), so a call never takes a lock on the hot path.
- io_uring drives async I/O; NUMA pinning and busy-poll keep the media shards off the scheduler’s critical path.
- TLS 1.3 with FIPS-validated crypto secures every QUIC connection; call handshakes use ECDSA for low per-core signing cost.
Related
SDK methods
The full typed surface:
Voice, Calls, Call, AudioBridge, Agents,
and the browser helpers.
Cookbook
Short copy-paste recipes for originate, transfer, attach, codecs, and the
softphone.
Turn detection
VAD and barge-in for AI voice agents.
Realtime tracks
The MoQT publish/subscribe primitive the bridge is built on.

