# Inference — Details

> Duplex speech-to-speech inference over one QUIC connection. Turn detection, barge-in, and the client-commit backchannel gate.

<img src="/images/diagrams/inference.png" alt="inference data flow" />

The **inference** modality is how ClutchCall serves a real-time AI voice
agent: caller audio goes **up**, the agent's voice comes **down**, and a single
**speech-to-speech model** sits in the loop. It is not a separate product
surface — it **rides the [Voice](/modalities/voice/details) path**. You
originate (or answer) a call, attach an agent, and ClutchCall bridges the
call's audio legs to the model over the same QUIC connection everything else in
the SDK uses.

What makes this modality its own page is the *duplex real-time* problem it
solves: who has the floor, when the model is allowed to answer, and how a
backchannel ("mhm", "right", "ok") is kept from derailing the turn. That logic
lives in ClutchCall, not in your code.

## Why one model, not a pipeline

The classic voice-agent stack is a **cascade**: streaming ASR → an LLM → a TTS
voice, three services chained together. ClutchCall supports that path too (see
[Turn Detection](/concepts/turn-detection)). The inference modality is the
**single-model alternative**: one **speech-to-speech model** that takes audio in
and emits audio out directly, with no transcript hop in the middle.

| | Cascade (ASR → LLM → TTS) | Speech-to-speech (this modality) |
| --- | --- | --- |
| Stages on the wire | 3 sequential request/responses | 1 duplex stream |
| Turn-taking owner | gateway VAD or the realtime LLM | **the client** (commit-driven) |
| First-audio latency | ASR finalize + LLM TTFT + TTS TTFA | model emits audio as it hears you |
| Prosody / tone | lost at the transcript boundary | preserved end-to-end |

Over one QUIC connection the audio frames stream both ways on cheap multiplexed
streams, so there is no head-of-line blocking between "what the caller just
said" and "what the agent is starting to say." That overlap is the whole point.

## The wire model

The model leg reuses the Voice modality's track convention. A call carries two
audio tracks, and the agent attach binds the model to both:

```
voice/<sid>/uplink     caller audio  → model   (PCM16 / Opus frames)
voice/<sid>/downlink   model audio   → caller  (PCM16 / Opus frames)
```

Frames flow continuously. On the uplink, ClutchCall streams caller audio to
the model **at 1x real-time as it arrives** — the model perceives the utterance
*during* the listening window, not in a burst after it ends. On the downlink,
the model's audio deltas stream straight back out to the caller as they are
produced.

> **NOTE:**
> There is no "send the whole utterance, then wait" round-trip. Continuous
>   prefill plus the commit gate (below) is what gives a near-zero gap between
>   end-of-speech and the first audio-out frame.

## Turn-taking: the client-commit gate

A server-side VAD model decides on its own when your turn ended and starts
talking. A **speech-to-speech model has no server VAD** — so the decision of
*when the user's turn ended* is ClutchCall's to make, locally. This is a
feature, not a gap: it is the protocol-clean place to gate backchannels.

ClutchCall runs a local **turn detector** over the inbound caller audio and
drives the model with explicit signals:

  1. **Append (continuous)**
Caller audio frames are appended to the model's input buffer at 1x
    real-time. The model is already "listening" while the caller talks.
  2. **Prime (open turn)**
A non-final commit keeps a turn primed without asking for a reply yet — the
    listening window stays open.
  3. **Commit (end of turn)**
When the turn detector decides a **real** turn ended, a final commit flushes
    the tail and the model generates its reply. Only a real turn produces a
    commit.

### Backchannel suppression

Short continuers — "ok", "mhm", "yeah", "right" — are exactly what you do *not*
want to treat as a turn. The turn detector filters any speech burst shorter than
`response_min_speech_ms` (default **600 ms**): it never produces a final commit,
so the model never wakes up to answer it, and the agent keeps its floor.

### Barge-in: hold-and-confirm

When the caller speaks *over* the agent, naive logic cancels the agent on the
very first speech frame — so a muttered "mhm" mid-sentence kills the whole reply.
ClutchCall instead **arms** a pending barge-in on speech onset and only fires
it once speech has been **sustained for `barge_confirm_ms`** (default **300
ms**). A backchannel ends before the window elapses → no cancel, the agent keeps
talking. A real interruption runs past the window → the agent is cancelled,
~`barge_confirm_ms` later than a naive cut (the only cost).

> **WARNING:**
> On a transport with no echo cancellation (raw SIP/PSTN), the agent's own audio
>   bleeds back through the mic and can self-trigger barge-in. Raise the during-TTS
>   mic gate above the quiet gate for those legs. On a browser leg with client-side
>   AEC, a lower during-TTS gate keeps barge-in snappy. Both are just turn-detector
>   params.

## Codecs

The model wants raw audio; the caller leg may be anything. ClutchCall
transcodes at the bridge, so you pick what the model expects and the call leg is
handled for you.

| Codec | When |
| --- | --- |
| `pcm16` | Default for the model leg. Raw 16-bit PCM, what most speech-to-speech models ingest. ClutchCall resamples between the caller rate (commonly 8 kHz on PSTN) and the model's rate (commonly 16 kHz in, 24 kHz out). |
| `opus` | Browser-native and bandwidth-friendly. Use on the browser caller leg; the bridge decodes to PCM16 for the model. |
| `g711_ulaw` / `g711_alaw` | PSTN-direct caller legs. Transcoded to PCM16 for the model with no ffmpeg in your path. |

## The metric that matters: turn latency

The number a caller actually *feels* is **turn latency**:

> **turn latency** = end-of-user-speech (last inbound audio frame) → start-of-agent-speech (first outbound audio-out frame)

Report **p50 / p95 / p99**, not just the mean — the tail is the product story.
Two properties of this modality drive it down:

- **Continuous prefill.** Because the model hears the utterance as it streams,
  the final commit only has to flush the tail — first audio out can land within
  tens of milliseconds of end-of-speech.
- **QUIC under load and loss.** Over a WAN-emulated link (50 ms RTT, ~1% loss),
  steady-state QUIC first-token latency is roughly **half** of TCP+TLS, with a
  much cleaner p99 tail — no head-of-line blocking when a packet drops mid-turn.

## Architecture

The serving path is the same substrate as every other modality, tuned for the
inference workload:

- **One QUIC connection, multiplexed.** Uplink and downlink audio plus control
  signals (commit / cancel) ride independent MoQT streams — a lost packet on one
  doesn't stall the others.
- **A QUIC front at the edge.** A tuned QUIC ingress terminates the caller's
  connection and forwards to the model-serving fabric, keeping the model's
  routing brain (cache-aware / prefix-aware worker selection) untouched behind
  it. 0-RTT resumption keeps reconnect churn cheap for edge clients.
- **Shard-per-core data plane.** The audio fast path runs an AF_XDP
  kernel-bypass NIC path with lock-free mcache / dcache rings on a
  thread-per-core reactor and io_uring, so per-turn transport overhead stays in
  the low single-digit milliseconds even at hundreds of concurrent turns.

> **TIP:**
> The single biggest contributor to a responsive agent is **stage overlap** —
>   the model working on what it has already heard while the caller is still
>   talking. The connection model and the commit gate are what make that overlap
>   safe.

## When to use it

  - **Use inference when** — You want a single speech-to-speech model in the loop, lowest possible
    turn latency, prosody preserved end-to-end, and ClutchCall to own
    turn-taking and barge-in.
  - **Use the cascade when** — You need a specific ASR vendor, a text LLM with tools/RAG, or a particular
    TTS voice — the [Turn Detection](/concepts/turn-detection) path chains
    ASR → LLM → TTS with the same barge-in policy.

## Related

- [Voice — Details](/modalities/voice/details) — the call control + audio bridge this modality rides on
- [Turn Detection & Barge-In](/concepts/turn-detection) — VAD modes, the cascade path, and the per-vendor support matrix
- [Inference — SDK Methods](/modalities/inference/sdk-methods) — attaching an agent and the turn-detection knobs