Voice data flow
The voice modality is the successor to ClutchCall’s original RPC client. It splits a call into two primitives:
  • Calls — the control plane. Originate, fetch, transfer, hang up. HTTP-shaped, idempotent, returns a Call handle keyed by a single sid.
  • AudioBridge — the data plane. Bidirectional encoded audio over MoQT tracks at voice/<sid>/{uplink,downlink}.
A third piece, Agents, binds a server-side AI agent to a sid so the engine wires the bridge end-to-end and drives the conversation for you. The sid is the only identifier you ever address: the PSTN leg, the SIP leg, the browser softphone leg, and the AI-agent leg all hang off it.

When to use it

Outbound + inbound calling

Originate to E.164 over a SIP trunk; accept inbound calls terminated by the SIP gateway. One control surface for both directions.

AI voice agents

Attach a speech-to-speech agent to a live call. The engine bridges audio both ways; you write the agent, not the plumbing.

Programmable audio

Tap caller audio (uplink) into your own ASR and push synthesized audio (downlink) back, frame by frame, with full codec control.

Browser softphone

Place and answer calls from the browser with encoded Opus over MoQT — no WebRTC transport, no SFU, no gateway round-trips for media.

Wire model

A call’s media is two MoQT audio tracks under a per-call namespace:
voice/<sid>/uplink     caller / device → cloud   (what the caller is saying)
voice/<sid>/downlink   cloud → caller / device   (what is played back)
Each track carries a capability tag of the form voice/<codec> (for example voice/opus). The relay routes on that capability: a recording sidecar, an ASR consumer, or an AI-agent attach can subscribe to a call’s audio without the publisher knowing they exist. You route on intent, not on a hardcoded peer.
“Uplink” and “downlink” are defined from the caller’s point of view. When you attach() as the server you subscribe uplink and publish downlink; when you attachCaller() as the browser you do the mirror image — publish uplink (mic) and subscribe downlink (playback).
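The naming and direction rules above can be sketched as a small helper. This is an illustration only: `trackName`, `capabilityTag`, `planFor`, and the `Role` type are hypothetical names invented here, not part of the SDK, and the sid value in the usage note is made up.

```typescript
// Hypothetical helper types; not part of the documented SDK surface.
type Direction = "uplink" | "downlink";
type Role = "server" | "caller";

interface TrackPlan {
  publish: string;   // track this side writes to
  subscribe: string; // track this side reads from
}

// Build the per-call track name: voice/<sid>/{uplink,downlink}.
function trackName(sid: string, direction: Direction): string {
  return `voice/${sid}/${direction}`;
}

// Build the capability tag carried by a track: voice/<codec>.
function capabilityTag(codec: string): string {
  return `voice/${codec}`;
}

// Directions are defined from the caller's point of view:
// the server subscribes uplink and publishes downlink,
// the caller (browser) publishes uplink and subscribes downlink.
function planFor(role: Role, sid: string): TrackPlan {
  const publishDir: Direction = role === "server" ? "downlink" : "uplink";
  const subscribeDir: Direction = role === "server" ? "uplink" : "downlink";
  return {
    publish: trackName(sid, publishDir),
    subscribe: trackName(sid, subscribeDir),
  };
}
```

For example, `planFor("caller", "CA123")` yields a publish track of `voice/CA123/uplink` and a subscribe track of `voice/CA123/downlink`, the mirror image of the server plan.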

Lanes and QoS

Audio rides MoQT’s reliable, ordered delivery by default: a 20 ms Opus frame is small, and the relay fans each object out per group with bounded queueing. Frames that arrive too late are dropped at playout rather than retransmitted indefinitely, so a transient loss costs a single frame of audio, not a growing backlog. The bridge holds one publication (your direction) and one subscription (the peer direction); releasing the handle tears down both tracks and ends the bridge.
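To make the "dropped at playout" behaviour concrete, here is a minimal sketch of a playout queue that discards frames whose slot has already passed. It assumes frames carry a monotonically increasing sequence number and that a playout clock calls `pop()` once per frame interval. `PlayoutQueue` and `AudioFrame` are hypothetical names for illustration; the actual engine's buffering strategy is not described in this page and may differ.

```typescript
// Hypothetical frame shape: a sequence number plus an encoded payload.
interface AudioFrame {
  seq: number;
  payload: Uint8Array;
}

class PlayoutQueue {
  private pending = new Map<number, AudioFrame>();
  private nextSeq = 0; // sequence number the playout clock will play next
  dropped = 0;         // count of frames that arrived after their slot

  // Accept a frame unless its playout slot has already passed.
  push(frame: AudioFrame): boolean {
    if (frame.seq < this.nextSeq) {
      this.dropped++;
      return false;
    }
    this.pending.set(frame.seq, frame);
    return true;
  }

  // Called once per frame interval. Returns the frame for this slot,
  // or undefined if it has not arrived (the caller plays silence or
  // concealment). The slot always advances, so the backlog cannot grow.
  pop(): AudioFrame | undefined {
    const frame = this.pending.get(this.nextSeq);
    this.pending.delete(this.nextSeq);
    this.nextSeq++;
    return frame;
  }
}
```

Because the slot advances whether or not the frame arrived, a late frame costs exactly one missed slot and is then discarded on arrival.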

Codecs

The bridge transcodes between what the call leg negotiated and what your code asked for. PSTN µ-law in, Opus to your code, Opus from your code, µ-law back out — no media tooling in your process.
Codec       When to use it
opus        Default. Best quality-per-bit; native in the browser path.
pcm16       When an AI model wants raw 16-bit PCM (many realtime models do).
g711_ulaw   PSTN-direct, no transcoding. µ-law (North America / Japan).
g711_alaw   PSTN-direct, no transcoding. A-law (most of the rest of the world).
sampleRate (default 48000), channels (default 1) and frameMs (default 20) round out the bridge’s audio shape.
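The audio shape determines how large each frame is. The arithmetic below is a sketch: the `AudioShape` interface and helper names are assumptions made for illustration, with field names borrowed from the parameters above. It relies only on the standard sample widths (16-bit PCM is 2 bytes per sample, G.711 is 1 byte per sample); Opus frames are variable-size, so no fixed byte count is returned for them. The 8000 Hz rate used for G.711 in the usage note is the conventional telephony rate, not a documented SDK default.

```typescript
type Codec = "opus" | "pcm16" | "g711_ulaw" | "g711_alaw";

// Hypothetical shape object mirroring the documented parameters.
interface AudioShape {
  codec: Codec;
  sampleRate: number; // Hz
  channels: number;
  frameMs: number;    // frame duration in milliseconds
}

// Defaults as stated in the text: Opus, 48000 Hz, mono, 20 ms frames.
const DEFAULT_SHAPE: AudioShape = {
  codec: "opus",
  sampleRate: 48000,
  channels: 1,
  frameMs: 20,
};

// Samples per channel in one frame.
function samplesPerFrame(shape: AudioShape): number {
  return (shape.sampleRate * shape.frameMs) / 1000;
}

// Payload bytes per frame for fixed-rate codecs; undefined for Opus,
// whose encoded frame size depends on bitrate and content.
function frameBytes(shape: AudioShape): number | undefined {
  const samples = samplesPerFrame(shape) * shape.channels;
  switch (shape.codec) {
    case "pcm16":
      return samples * 2; // 16 bits per sample
    case "g711_ulaw":
    case "g711_alaw":
      return samples;     // 8 bits per sample
    case "opus":
      return undefined;
  }
}
```

With the defaults, one frame holds 960 samples per channel. A `pcm16` shape at 48000 Hz mono and 20 ms is 1920 bytes per frame, and a `g711_ulaw` shape at 8000 Hz mono and 20 ms is 160 bytes per frame.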

Call lifecycle

The control plane reports a CallStatus that moves monotonically toward a terminal state:
dialing → ringing → in_progress → completed | failed | no_answer
originate() returns as soon as the call is dialing; re-fetch with voice.calls.get({ sid }) to read the latest status. transfer() re-points the live audio at a new PSTN number (a SIP REFER under the hood) or re-attaches a different agent, keeping the same sid. hangup() ends the call and drops both tracks.
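The lifecycle can be modelled as a small state check plus a polling loop. This is a sketch under stated assumptions: the status names come from the diagram above, but `canAdvance`, `isTerminal`, and `waitForTerminal` are hypothetical helpers, and the rule that a call may skip intermediate states (for example `dialing` straight to `failed`) is an assumption, not documented behaviour. `fetchStatus` stands in for a caller-supplied wrapper around `voice.calls.get({ sid })`, whose return shape is not specified here.

```typescript
type CallStatus =
  | "dialing"
  | "ringing"
  | "in_progress"
  | "completed"
  | "failed"
  | "no_answer";

const TERMINAL_RANK = 3;

// Position of each status along the monotonic progression.
const RANK: Record<CallStatus, number> = {
  dialing: 0,
  ringing: 1,
  in_progress: 2,
  completed: TERMINAL_RANK,
  failed: TERMINAL_RANK,
  no_answer: TERMINAL_RANK,
};

function isTerminal(status: CallStatus): boolean {
  return RANK[status] === TERMINAL_RANK;
}

// A transition is plausible only if it moves strictly forward
// and does not leave a terminal state (assumed rule).
function canAdvance(from: CallStatus, to: CallStatus): boolean {
  return !isTerminal(from) && RANK[to] > RANK[from];
}

// Poll a caller-supplied status getter until the call is terminal.
async function waitForTerminal(
  fetchStatus: () => Promise<CallStatus>,
  intervalMs = 500,
  maxPolls = 120,
): Promise<CallStatus> {
  for (let i = 0; i < maxPolls; i++) {
    const status = await fetchStatus();
    if (isTerminal(status)) return status;
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error("call did not reach a terminal status in time");
}
```

Bounding the loop with `maxPolls` keeps a stuck call from polling forever; the interval and limit shown are arbitrary placeholders.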

Inbound calls and the SIP gateway

Inbound PSTN calls are terminated by a SIP gateway acting as a back-to-back user agent (B2BUA). It negotiates SIP signalling with the carrier, answers, and publishes the call’s audio onto the same voice/<sid>/{uplink,downlink} tracks — so an inbound call is indistinguishable from an outbound one at the SDK surface. You attach an audio bridge or an agent to its sid exactly the same way. For calls with no AI bridge, the gateway terminates RTP locally and runs voice-activity detection on-box, so the same sid can take either an RTP fast path or a MoQT bridge without your code changing.

The browser softphone path

The browser places audio on the same MoQT tracks as everything else — there is no WebRTC transport on this path and no SFU. The flow:
1. Capture + encode

captureMicrophone() runs the browser’s echo-cancellation, gain control and noise suppression, then diverts the encoded Opus frames (via insertable streams / encoded transform) onto the uplink track. Raw PCM never crosses the wire.
2. Publish over QUIC/MoQT

Each encoded frame is written to voice/<sid>/uplink as a MoQT object. The relay fans it to whatever the engine bridged the call to — the SIP leg, an agent, or a recorder.
3. Subscribe + play back

Downlink Opus arrives on voice/<sid>/downlink; OpusPlayer decodes with WebCodecs and renders through an AudioWorklet ring buffer with silence padding on underrun.
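The playback step relies on a ring buffer that pads with silence when it runs dry. A minimal sketch of that idea follows. `PcmRingBuffer` is a hypothetical class written for illustration; it is not the `OpusPlayer` implementation, and it omits the cross-thread sharing (for example via `SharedArrayBuffer`) that a real AudioWorklet pipeline would need.

```typescript
// Fixed-capacity ring buffer of decoded PCM samples.
class PcmRingBuffer {
  private readonly buf: Float32Array;
  private readPos = 0;
  private writePos = 0;
  private filled = 0; // number of samples currently buffered

  constructor(capacity: number) {
    this.buf = new Float32Array(capacity);
  }

  // Append samples; anything beyond the free space is discarded.
  // Returns the number of samples actually stored.
  write(samples: Float32Array): number {
    const n = Math.min(samples.length, this.buf.length - this.filled);
    for (let i = 0; i < n; i++) {
      this.buf[this.writePos] = samples[i];
      this.writePos = (this.writePos + 1) % this.buf.length;
    }
    this.filled += n;
    return n;
  }

  // Fill `out` with buffered samples. On underrun the remainder of
  // `out` is zeroed (silence). Returns the number of real samples read.
  read(out: Float32Array): number {
    const n = Math.min(out.length, this.filled);
    for (let i = 0; i < n; i++) {
      out[i] = this.buf[this.readPos];
      this.readPos = (this.readPos + 1) % this.buf.length;
    }
    out.fill(0, n); // silence padding for the part we could not supply
    this.filled -= n;
    return n;
  }
}
```

The audio callback always receives a full output block, so a late downlink frame produces a brief gap of silence instead of stalling or replaying stale samples.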

Architecture

The media plane is built for telephony-grade latency at fleet scale:
  • One QUIC connection, many tracks. Audio multiplexes over the same MoQT substrate as every other modality; one auth token authorizes the whole connection.
  • Kernel-bypass RTP fast path. The SIP/RTP leg uses an AF_XDP zero-copy fast path — packets move between the NIC and userspace through an XSK ring over UMEM, skipping the kernel networking stack for steady-state media.
  • Shard-per-core engine. A thread-per-core reactor binds the media port with SO_REUSEPORT; an eBPF classifier steers each call’s packets to the shard that owns it. Cross-shard hand-off uses lock-free SPSC/SPMC rings (the mcache / dcache rings), so a call never takes a lock on the hot path.
  • io_uring drives async I/O; NUMA pinning and busy-poll keep the media shards off the scheduler’s critical path.
  • TLS 1.3 with FIPS-validated crypto secures every QUIC connection; call handshakes use ECDSA for low per-core signing cost.

SDK methods

The full typed surface: Voice, Calls, Call, AudioBridge, Agents, and the browser helpers.

Cookbook

Short copy-paste recipes for originate, transfer, attach, codecs, and the softphone.

Turn detection

VAD and barge-in for AI voice agents.

Realtime tracks

The MoQT publish/subscribe primitive the bridge is built on.