# Voice — Details

> Calls plus a bidirectional audio bridge over MoQT. PSTN, SIP, and AI-agent attach from one typed client.

<img src="/images/diagrams/voice.png" alt="voice data flow" />

The **voice** modality is the successor to ClutchCall's original RPC client,
reshaped as a modality. It splits a call into two clean primitives:

- **`Calls`** — the **control plane**. Originate, fetch, transfer, hang up. HTTP-shaped,
  idempotent, returns a `Call` handle keyed by a single `sid`.
- **`AudioBridge`** — the **data plane**. Bidirectional encoded audio over MoQT tracks
  at `voice/<sid>/{uplink,downlink}`.

A third primitive, **`Agents`**, binds a server-side AI agent to a `sid` so that the
engine wires the bridge end-to-end and drives the conversation for you.

One `sid` is the only identifier you ever address. The PSTN leg, the SIP leg, the
browser softphone leg, and the AI-agent leg all hang off it.

## When to use it

  - **Outbound + inbound calling** — Originate to E.164 over a SIP trunk; accept inbound calls terminated by the
    SIP gateway. One control surface for both directions.
  - **AI voice agents** — Attach a speech-to-speech agent to a live call. The engine bridges audio
    both ways; you write the agent, not the plumbing.
  - **Programmable audio** — Tap caller audio (uplink) into your own ASR and push synthesized audio
    (downlink) back, frame by frame, with full codec control.
  - **Browser softphone** — Place and answer calls from the browser with encoded Opus over MoQT — no
    WebRTC transport, no SFU, no gateway round-trips for media.

## Wire model

A call's media is two MoQT audio tracks under a per-call namespace:

```
voice/<sid>/uplink     caller / device → cloud   (what the caller is saying)
voice/<sid>/downlink   cloud → caller / device   (what is played back)
```

Each track carries a **capability** tag of the form `voice/<codec>` (for example
`voice/opus`). The relay routes on that capability: a recording sidecar, an ASR
consumer, or an AI-agent attach can subscribe to a call's audio without the
publisher knowing they exist. You route on **intent**, not on a hardcoded peer.
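
As a rough illustration of the tag format, the sketch below parses a `voice/<codec>` capability string. The `parseCapability` helper and the `Codec` type are hypothetical names used only for this example, and it assumes the codec part of the tag uses the same identifiers as the codec table on this page.

```typescript
// Hypothetical helper, not part of the SDK: parses a "voice/<codec>" tag.
// Assumption: the codec segment uses the same names as the Codecs table.
type Codec = "opus" | "pcm16" | "g711_ulaw" | "g711_alaw";

const CODECS: readonly Codec[] = ["opus", "pcm16", "g711_ulaw", "g711_alaw"];

function parseCapability(tag: string): Codec | null {
  const [modality, codec, ...rest] = tag.split("/");
  // Reject anything that is not exactly "voice/<codec>".
  if (modality !== "voice" || codec === undefined || rest.length > 0) {
    return null;
  }
  // Return the codec only if it is one of the known names.
  return CODECS.find((c) => c === codec) ?? null;
}
```

A consumer such as a recorder could use a check like this to decide whether it can decode a track before subscribing to it.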

> **NOTE:**
> "Uplink" and "downlink" are defined from the **caller's** point of view.
> When you `attach()` as the server you **subscribe** uplink and **publish**
> downlink; when you `attachCaller()` as the browser you do the mirror image —
> **publish** uplink (mic) and **subscribe** downlink (playback).
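
The direction rules in the note can be written down as a small lookup. This is a sketch only: `trackName`, `tracksFor`, and the `Role` type are hypothetical names introduced for illustration, and only the `voice/<sid>/{uplink,downlink}` naming comes from this page.

```typescript
// Hypothetical helpers, not part of the SDK.
type Direction = "uplink" | "downlink";
type Role = "server" | "caller";

interface BridgeTracks {
  publish: string;   // track this side writes to
  subscribe: string; // track this side reads from
}

function trackName(sid: string, direction: Direction): string {
  return `voice/${sid}/${direction}`;
}

function tracksFor(sid: string, role: Role): BridgeTracks {
  const uplink = trackName(sid, "uplink");
  const downlink = trackName(sid, "downlink");
  // Directions are named from the caller's point of view, so the
  // server reads uplink and writes downlink; the caller mirrors that.
  return role === "server"
    ? { publish: downlink, subscribe: uplink }
    : { publish: uplink, subscribe: downlink };
}
```

Here `"server"` stands for the `attach()` side and `"caller"` for the `attachCaller()` side.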

## Lanes and QoS

Audio rides MoQT's **reliable, ordered** delivery by default — a 20 ms Opus
frame is small, and the relay fans each object out per group with bounded
queueing. Late frames are dropped at playout rather than retransmitted indefinitely, so a
transient loss costs about one frame of audio, not a growing backlog. The bridge
holds one publication (your direction) and one subscription (the peer
direction); releasing the handle tears down both tracks and ends the bridge.
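
To make "dropped at playout" concrete, here is a minimal sketch of a deadline check. It is not the bridge's actual playout logic: the `ArrivedFrame` shape, the function names, and the 60 ms jitter budget are all assumptions chosen for illustration.

```typescript
// Illustrative only: a frame is playable if it arrives before its playout
// deadline. The jitter budget value and all names here are assumptions.
interface ArrivedFrame {
  seq: number;         // frame index within the call, starting at 0
  arrivedAtMs: number; // arrival time on the receiver's clock
}

function playoutDeadlineMs(
  seq: number,
  startMs: number,
  frameMs: number,
  jitterBudgetMs: number,
): number {
  // Frame `seq` is due one frame duration after frame `seq - 1`,
  // shifted by a fixed buffering delay.
  return startMs + jitterBudgetMs + seq * frameMs;
}

function partitionByDeadline(
  frames: ArrivedFrame[],
  startMs: number,
  frameMs = 20,
  jitterBudgetMs = 60,
): { played: number[]; dropped: number[] } {
  const played: number[] = [];
  const dropped: number[] = [];
  for (const f of frames) {
    const deadline = playoutDeadlineMs(f.seq, startMs, frameMs, jitterBudgetMs);
    if (f.arrivedAtMs <= deadline) {
      played.push(f.seq);
    } else {
      dropped.push(f.seq); // too late: skip it instead of stalling playback
    }
  }
  return { played, dropped };
}
```

Skipping a late frame keeps later frames on schedule, which is why a brief loss does not accumulate into a backlog.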

## Codecs

The bridge transcodes between what the call leg negotiated and what your code
asked for. For example, µ-law from a PSTN leg reaches your code as Opus, and the
Opus you send goes back out as µ-law, with no media tooling in your process.

| Codec        | When to use it                                                       |
| ------------ | -------------------------------------------------------------------- |
| `opus`       | Default. Best quality-per-bit; native in the browser path.           |
| `pcm16`      | When an AI model wants raw 16-bit PCM (many realtime models do).     |
| `g711_ulaw`  | PSTN-direct, no transcoding. µ-law (North America / Japan).          |
| `g711_alaw`  | PSTN-direct, no transcoding. A-law (most of the rest of the world).  |

`sampleRate` (default 48000), `channels` (default 1) and `frameMs` (default 20)
round out the bridge's audio shape.
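
These three values determine how many samples each frame holds. The sketch below works through the arithmetic; `AudioShape`, `samplesPerFrame`, and `frameBytes` are hypothetical names for this example, and only the option names and defaults come from this page.

```typescript
// Hypothetical types and helpers for illustration; not the SDK's own.
type Codec = "opus" | "pcm16" | "g711_ulaw" | "g711_alaw";

interface AudioShape {
  codec: Codec;
  sampleRate?: number; // Hz, default 48000
  channels?: number;   // default 1
  frameMs?: number;    // default 20
}

// Total samples in one frame, across all channels.
function samplesPerFrame(shape: AudioShape): number {
  const { sampleRate = 48000, channels = 1, frameMs = 20 } = shape;
  return (sampleRate * frameMs * channels) / 1000;
}

// Frame size in bytes for fixed-rate codecs; null for Opus, whose
// encoded frame size varies with bitrate and content.
function frameBytes(shape: AudioShape): number | null {
  const samples = samplesPerFrame(shape);
  switch (shape.codec) {
    case "pcm16":
      return samples * 2; // 16 bits per sample
    case "g711_ulaw":
    case "g711_alaw":
      return samples; // 8 bits per sample
    case "opus":
      return null;
  }
}
```

With the defaults, a `pcm16` frame is 48000 × 20 / 1000 = 960 samples, or 1920 bytes. A G.711 frame at 8000 Hz, the rate G.711 is defined for, is 160 samples and 160 bytes.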

## Call lifecycle

The control plane reports a `CallStatus` that moves monotonically toward a
terminal state:

```
dialing → ringing → in_progress → completed | failed | no_answer
```

`originate()` returns as soon as the call is `dialing`; re-fetch with
`voice.calls.get({ sid })` to read the latest status. `transfer()` re-points the
live audio at a new PSTN number (a SIP `REFER` under the hood) or re-attaches a
different agent, keeping the same `sid`. `hangup()` ends the call and drops both
tracks.
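
One way to reason about the monotonic lifecycle is to give each status a stage number and only accept moves to a later stage. This is a sketch under assumptions: the `STAGE` table and helper names are hypothetical, and it assumes any non-terminal status may jump straight to any terminal one (for example `dialing` to `failed`), which this page does not spell out.

```typescript
// Status names come from the lifecycle above; everything else is illustrative.
type CallStatus =
  | "dialing"
  | "ringing"
  | "in_progress"
  | "completed"
  | "failed"
  | "no_answer";

// Assumed ordering: the three terminal states share the final stage.
const STAGE: Record<CallStatus, number> = {
  dialing: 0,
  ringing: 1,
  in_progress: 2,
  completed: 3,
  failed: 3,
  no_answer: 3,
};

function isTerminal(status: CallStatus): boolean {
  return STAGE[status] === 3;
}

// A transition is forward if it leaves a non-terminal state for a later stage.
function isForwardTransition(from: CallStatus, to: CallStatus): boolean {
  return !isTerminal(from) && STAGE[to] > STAGE[from];
}
```

A polling loop around `voice.calls.get({ sid })` could stop as soon as `isTerminal` returns true for the status it reads.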

## Inbound calls and the SIP gateway

Inbound PSTN calls are terminated by a **SIP gateway** acting as a back-to-back
user agent (B2BUA). It negotiates SIP signalling with the carrier, answers, and
publishes the call's audio onto the same `voice/<sid>/{uplink,downlink}` tracks —
so an inbound call is indistinguishable from an outbound one at the SDK surface.
You attach an audio bridge or an agent to its `sid` exactly the same way.

For calls with no AI bridge, the gateway terminates RTP locally and runs
voice-activity detection on-box, so the same `sid` can take either an RTP fast
path or a MoQT bridge without your code changing.

## The browser softphone path

The browser places audio on the **same MoQT tracks** as everything else — there
is **no WebRTC transport** on this path and no SFU. The flow:

  1. **Capture + encode**

     `captureMicrophone()` runs the browser's echo-cancellation, gain control and
     noise suppression, then diverts the **encoded** Opus frames (via insertable
     streams / encoded transform) onto the uplink track. Raw PCM never crosses the
     wire.

  2. **Publish over QUIC/MoQT**

     Each encoded frame is written to `voice/<sid>/uplink` as a MoQT object. The
     relay fans it to whatever the engine bridged the call to: the SIP leg, an
     agent, or a recorder.

  3. **Subscribe + play back**

     Downlink Opus arrives on `voice/<sid>/downlink`; `OpusPlayer` decodes with
     WebCodecs and renders through an AudioWorklet ring buffer with silence
     padding on underrun.
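
The playback step relies on a ring buffer that pads with silence when decoded audio runs out. The class below is a minimal sketch of that idea, not the `OpusPlayer` implementation: `PlaybackRing` and its methods are hypothetical, and it assumes mono `Float32Array` samples and a policy of discarding input that does not fit.

```typescript
// Illustrative ring buffer; names and overflow policy are assumptions.
class PlaybackRing {
  private readonly buf: Float32Array;
  private readPos = 0; // index of the oldest queued sample
  private queued = 0;  // number of samples waiting to be played

  constructor(capacity: number) {
    this.buf = new Float32Array(capacity);
  }

  // Queue decoded samples. Returns how many were accepted; samples that
  // do not fit are discarded.
  write(samples: Float32Array): number {
    const n = Math.min(samples.length, this.buf.length - this.queued);
    const start = (this.readPos + this.queued) % this.buf.length;
    const first = Math.min(n, this.buf.length - start);
    this.buf.set(samples.subarray(0, first), start); // up to the end
    this.buf.set(samples.subarray(first, n), 0);     // wrapped remainder
    this.queued += n;
    return n;
  }

  // Fill `out` for the audio callback. On underrun the tail of `out` is
  // zeroed (silence). Returns the number of padded samples.
  read(out: Float32Array): number {
    const n = Math.min(out.length, this.queued);
    const first = Math.min(n, this.buf.length - this.readPos);
    out.set(this.buf.subarray(this.readPos, this.readPos + first), 0);
    out.set(this.buf.subarray(0, n - first), first);
    out.fill(0, n);
    this.readPos = (this.readPos + n) % this.buf.length;
    this.queued -= n;
    return out.length - n;
  }
}
```

In an AudioWorklet, `read` would be called once per render quantum, so a late downlink frame produces a short gap of silence rather than a stalled callback.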

## Architecture

The media plane is built for telephony-grade latency at fleet scale:

- **One QUIC connection, many tracks.** Audio multiplexes over the same
  [MoQT](https://datatracker.ietf.org/wg/moq/about/) substrate as every other
  modality; one auth token authorizes the whole connection.
- **Kernel-bypass RTP fast path.** The SIP/RTP leg uses an **AF_XDP** zero-copy
  fast path: packets move between the NIC and userspace through **XSK** (XDP socket)
  rings backed by a shared **UMEM** region, skipping the kernel networking stack
  for steady-state media.
- **Shard-per-core engine.** A **thread-per-core** reactor binds the media port
  with **SO_REUSEPORT**; an **eBPF** classifier steers each call's packets to the
  shard that owns it. Cross-shard hand-off uses **lock-free SPSC/SPMC rings**
  (the **mcache / dcache** rings), so a call never takes a lock on the hot path.
- **io_uring** drives async I/O; **NUMA pinning** and **busy-poll** keep the
  media shards off the scheduler's critical path.
- **TLS 1.3** with FIPS-validated crypto secures every QUIC connection; call
  handshakes use ECDSA for low per-core signing cost.

## Related

  - **[SDK methods](/modalities/voice/sdk-methods)** — The full typed surface: `Voice`, `Calls`, `Call`, `AudioBridge`, `Agents`,
    and the browser helpers.
  - **[Cookbook](/modalities/voice/cookbook)** — Short copy-paste recipes for originate, transfer, attach, codecs, and the
    softphone.
  - **[Turn detection](/concepts/turn-detection)** — VAD and barge-in for AI voice agents.
  - **[Realtime tracks](/concepts/realtime-tracks)** — The MoQT publish/subscribe primitive the bridge is built on.
