Inference — Details

The inference modality is how ClutchCall serves a real-time AI voice agent: caller audio goes up, the agent’s voice comes down, and a single speech-to-speech model sits in the loop. It is not a separate product surface — it rides the Voice path. You originate (or answer) a call, attach an agent, and ClutchCall bridges the call’s audio legs to the model over the same QUIC connection everything else in the SDK uses. What makes this modality its own page is the duplex real-time problem it solves: who has the floor, when the model is allowed to answer, and how a backchannel (“mhm”, “right”, “ok”) is kept from derailing the turn. That logic lives in ClutchCall, not in your code.

Why one model, not a pipeline

The classic voice-agent stack is a cascade: streaming ASR → an LLM → a TTS voice, three services chained together. ClutchCall supports that path too (see Turn Detection). The inference modality is the single-model alternative: one speech-to-speech model that takes audio in and emits audio out directly, with no transcript hop in the middle.

	Cascade (ASR → LLM → TTS)	Speech-to-speech (this modality)
Stages on the wire	3 sequential request/responses	1 duplex stream
Turn-taking owner	gateway VAD or the realtime LLM	the client (commit-driven)
First-audio latency	ASR finalize + LLM TTFT + TTS TTFA	model emits audio as it hears you
Prosody / tone	lost at the transcript boundary	preserved end-to-end

Over one QUIC connection the audio frames stream both ways on cheap multiplexed streams, so there is no head-of-line blocking between “what the caller just said” and “what the agent is starting to say.” That overlap is the whole point.

The wire model

The model leg reuses the Voice modality’s track convention. A call carries two audio tracks, and the agent attach binds the model to both:

voice/<sid>/uplink     caller audio  → model   (PCM16 / Opus frames)
voice/<sid>/downlink   model audio   → caller  (PCM16 / Opus frames)

Frames flow continuously. On the uplink, ClutchCall streams caller audio to the model at 1x real-time as it arrives — the model perceives the utterance during the listening window, not in a burst after it ends. On the downlink, the model’s audio deltas stream straight back out to the caller as they are produced.

There is no “send the whole utterance, then wait” round-trip. Continuous prefill plus the commit gate (below) is what gives a near-zero gap between end-of-speech and the first audio-out frame.

Turn-taking: the client-commit gate

A server-side VAD model decides on its own when your turn ended and starts talking. A speech-to-speech model has no server VAD — so the decision of when the user’s turn ended is ClutchCall’s to make, locally. This is a feature, not a gap: it is the protocol-clean place to gate backchannels. ClutchCall runs a local turn detector over the inbound caller audio and drives the model with explicit signals:

Append (continuous)

Caller audio frames are appended to the model’s input buffer at 1x real-time. The model is already “listening” while the caller talks.

Prime (open turn)

A non-final commit keeps a turn primed without asking for a reply yet — the listening window stays open.

Commit (end of turn)

When the turn detector decides a real turn ended, a final commit flushes the tail and the model generates its reply. Only a real turn produces a commit.

Backchannel suppression

Short continuers — “ok”, “mhm”, “yeah”, “right” — are exactly what you do not want to treat as a turn. The turn detector filters any speech burst shorter than response_min_speech_ms (default 600 ms): it never produces a final commit, so the model never wakes up to answer it, and the agent keeps its floor.

Barge-in: hold-and-confirm

When the caller speaks over the agent, naive logic cancels the agent on the very first speech frame — so a muttered “mhm” mid-sentence kills the whole reply. ClutchCall instead arms a pending barge-in on speech onset and only fires it once speech has been sustained for barge_confirm_ms (default 300 ms). A backchannel ends before the window elapses → no cancel, the agent keeps talking. A real interruption runs past the window → the agent is cancelled, ~barge_confirm_ms later than a naive cut (the only cost).

On a transport with no echo cancellation (raw SIP/PSTN), the agent’s own audio bleeds back through the mic and can self-trigger barge-in. Raise the during-TTS mic gate above the quiet gate for those legs. On a browser leg with client-side AEC, a lower during-TTS gate keeps barge-in snappy. Both are just turn-detector params.

Codecs

The model wants raw audio; the caller leg may be anything. ClutchCall transcodes at the bridge, so you pick what the model expects and the call leg is handled for you.

Codec	When
`pcm16`	Default for the model leg. Raw 16-bit PCM, what most speech-to-speech models ingest. ClutchCall resamples between the caller rate (commonly 8 kHz on PSTN) and the model’s rate (commonly 16 kHz in, 24 kHz out).
`opus`	Browser-native and bandwidth-friendly. Use on the browser caller leg; the bridge decodes to PCM16 for the model.
`g711_ulaw` / `g711_alaw`	PSTN-direct caller legs. Transcoded to PCM16 for the model with no ffmpeg in your path.

The metric that matters: turn latency

The number a caller actually feels is turn latency:

turn latency = end-of-user-speech (last inbound audio frame) → start-of-agent-speech (first outbound audio-out frame)

Report p50 / p95 / p99, not just the mean — the tail is the product story. Two properties of this modality drive it down:

Continuous prefill. Because the model hears the utterance as it streams, the final commit only has to flush the tail — first audio out can land within tens of milliseconds of end-of-speech.
QUIC under load and loss. Over a WAN-emulated link (50 ms RTT, ~1% loss), steady-state QUIC first-token latency is roughly half of TCP+TLS, with a much cleaner p99 tail — no head-of-line blocking when a packet drops mid-turn.

Architecture

The serving path is the same substrate as every other modality, tuned for the inference workload:

One QUIC connection, multiplexed. Uplink and downlink audio plus control signals (commit / cancel) ride independent MoQT streams — a lost packet on one doesn’t stall the others.
A QUIC front at the edge. A tuned QUIC ingress terminates the caller’s connection and forwards to the model-serving fabric, keeping the model’s routing brain (cache-aware / prefix-aware worker selection) untouched behind it. 0-RTT resumption keeps reconnect churn cheap for edge clients.
Shard-per-core data plane. The audio fast path runs an AF_XDP kernel-bypass NIC path with lock-free mcache / dcache rings on a thread-per-core reactor and io_uring, so per-turn transport overhead stays in the low single-digit milliseconds even at hundreds of concurrent turns.

The single biggest contributor to a responsive agent is stage overlap — the model working on what it has already heard while the caller is still talking. The connection model and the commit gate are what make that overlap safe.

When to use it

Use inference when

You want a single speech-to-speech model in the loop, lowest possible turn latency, prosody preserved end-to-end, and ClutchCall to own turn-taking and barge-in.

Use the cascade when

You need a specific ASR vendor, a text LLM with tools/RAG, or a particular TTS voice — the Turn Detection path chains ASR → LLM → TTS with the same barge-in policy.

Voice — Details — the call control + audio bridge this modality rides on
Turn Detection & Barge-In — VAD modes, the cascade path, and the per-vendor support matrix
Inference — SDK Methods — attaching an agent and the turn-detection knobs

​Why one model, not a pipeline

​The wire model

​Turn-taking: the client-commit gate

​Backchannel suppression

​Barge-in: hold-and-confirm

​Codecs

​The metric that matters: turn latency

​Architecture

​When to use it

Use inference when

Use the cascade when

​Related

Why one model, not a pipeline

The wire model

Turn-taking: the client-commit gate

Backchannel suppression

Barge-in: hold-and-confirm

Codecs

The metric that matters: turn latency

Architecture

When to use it

Related