Longer, end-to-end examples that combine agent attach, turn-detection tuning,
the audio bridge, and call status into realistic mini-apps. All of it rides the
Voice surface.
Recipe 1 — Outbound speech-to-speech agent on a PSTN call
Dial a number, bridge a speech-to-speech model the moment it answers, and let
ClutchCall own turn-taking for the whole call. This is the canonical inference
app: no ASR/LLM/TTS pipeline to wire — one model, one connection.
Create the client
One tenant token authorizes both call control and the agent leg.
import { Voice } from "@clutchcall/sdk/voice";
const voice = new Voice({ baseUrl: BASE_URL, apiKey: KEY, orgId: ORG });
Originate with the agent inline
Passing agent to originate bridges the model when the callee answers, with no
second round-trip.
const call = await voice.calls.originate({
  to: "+15551234567",
  from: "+15558675309",
  trunkId: "trunk_main",
  agent: {
    mode: "speech_to_speech",
    modelCodec: "pcm16", // model ingests/emits raw PCM16
    inputRateHz: 16000,
    outputRateHz: 24000,
    turnDetection: {
      responseMinSpeechMs: 600, // drop backchannels
      silenceThresholdMs: 500, // end-of-turn after 500ms silence
      bargeConfirmMs: 300, // hold-and-confirm barge-in
    },
  },
});
Follow the call to completion
The model drives the conversation; you just watch status.
call.onStatus((s) => {
  console.log("status:", s);
  if (s === "completed" || s === "failed" || s === "no_answer") {
    console.log("call ended:", s);
  }
});
That’s the whole app. The PSTN leg is µ-law; ClutchCall transcodes to PCM16
for the model and back, resampling 8 kHz ↔ 16/24 kHz at the bridge. The caller
hears the agent’s first words within tens of milliseconds of finishing their
own sentence, because the model heard the utterance as it streamed.
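To make the transcoding step concrete, here is a minimal sketch of what a µ-law to PCM16 conversion followed by an 8 kHz to 16 kHz upsample can look like. It is not ClutchCall's bridge code: the function names (muLawToPcm16, decodeMuLaw, upsample2x) are hypothetical, the decode is the standard G.711 µ-law expansion, and the resampler is a naive linear interpolation where a production bridge would more likely use a filtered resampler.

```typescript
// Expand one G.711 mu-law byte to a signed 16-bit linear PCM sample.
function muLawToPcm16(byte: number): number {
  const u = ~byte & 0xff; // mu-law bytes are transmitted bit-inverted
  const sign = u & 0x80;
  const exponent = (u >> 4) & 0x07;
  const mantissa = u & 0x0f;
  const magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84;
  return sign ? -magnitude : magnitude;
}

// Decode a whole mu-law frame (one byte per sample) to PCM16.
function decodeMuLaw(frame: Uint8Array): Int16Array {
  const out = new Int16Array(frame.length);
  for (let i = 0; i < frame.length; i++) {
    out[i] = muLawToPcm16(frame[i]);
  }
  return out;
}

// Naive 2x upsample (e.g. 8 kHz -> 16 kHz): keep each sample and insert the
// midpoint between it and its successor. The last sample is repeated.
function upsample2x(pcm: Int16Array): Int16Array {
  const out = new Int16Array(pcm.length * 2);
  for (let i = 0; i < pcm.length; i++) {
    const cur = pcm[i];
    const next = i + 1 < pcm.length ? pcm[i + 1] : cur;
    out[2 * i] = cur;
    out[2 * i + 1] = Math.round((cur + next) / 2);
  }
  return out;
}
```

A 20 ms PSTN frame is 160 µ-law bytes; decodeMuLaw turns it into 160 PCM16 samples and upsample2x into 320 samples at 16 kHz. With inline agent attach you never write this yourself, since the bridge does it for you.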
Recipe 2 — A barge-in profile tuned per transport
Barge-in behavior that feels great on a browser leg (snappy, cut-on-first-frame)
self-triggers on a raw PSTN leg (the agent reads its own echo as the caller
speaking). Pick the turn-detection profile from the leg type at attach time.
import { Voice } from "@clutchcall/sdk/voice";
const voice = new Voice({ baseUrl: BASE_URL, apiKey: KEY, orgId: ORG });
// Two profiles: one for echo-cancelled legs, one for raw audio.
const profiles = {
  browser: { // client-side AEC: cut instantly, low guard
    bargeConfirmMs: 0,
    ttsGuardMs: 150,
    responseMinSpeechMs: 500,
  },
  pstn: { // no AEC: confirm before cutting, high guard
    bargeConfirmMs: 350,
    ttsGuardMs: 400,
    responseMinSpeechMs: 700,
  },
} as const;
async function attachAgent(callSid: string, leg: "browser" | "pstn") {
  await voice.agents.attach(callSid, {
    mode: "speech_to_speech",
    modelCodec: leg === "browser" ? "opus" : "pcm16",
    turnDetection: profiles[leg],
  });
}
The two settings that flip with the transport are bargeConfirmMs (how long
speech must be sustained before the agent is cancelled) and ttsGuardMs (how
long the mic gate stays raised after the agent emits audio). On a no-AEC leg,
raise both so the agent’s own voice can’t trigger barge-in; on an AEC leg, drop
both for a snappy interrupt.
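The hold-and-confirm idea behind bargeConfirmMs can be illustrated with a small state machine. This is a sketch of the semantics as described here (speech must be sustained for the window before the agent is cancelled), not ClutchCall's detector. createBargeConfirm and its per-frame voiced flag are hypothetical, and ttsGuardMs is deliberately not modeled.

```typescript
// Returns a per-frame check that reports true once voiced caller audio has
// been continuous for at least confirmMs. Any unvoiced frame resets the window.
function createBargeConfirm(confirmMs: number) {
  let speechStartMs: number | null = null;
  return function onCallerFrame(nowMs: number, voiced: boolean): boolean {
    if (!voiced) {
      speechStartMs = null; // a gap means the speech was not sustained
      return false;
    }
    if (speechStartMs === null) speechStartMs = nowMs;
    return nowMs - speechStartMs >= confirmMs;
  };
}

// Mirrors the two profiles: PSTN waits 350 ms, browser cuts on the first frame.
const pstnGate = createBargeConfirm(350);
const browserGate = createBargeConfirm(0);
```

With a 350 ms window, a short echo burst that drops out after 100 ms resets the timer and never cancels the agent, while a window of 0 fires on the first voiced frame.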
Wire it from your call setup:
const browserCall = await voice.calls.originate({ to: "+1...", agent: undefined });
await attachAgent(browserCall.sid, "browser");
const pstnCall = await voice.calls.originate({ to: "+1...", trunkId: "trunk_main" });
await attachAgent(pstnCall.sid, "pstn");
Recipe 3 — Measure turn latency on a live call
Turn latency — end-of-user-speech to first agent audio-out frame — is the number
that defines “responsiveness.” Tap the audio bridge alongside the agent, mark the
last inbound frame, and time to the first outbound frame.
Attach the agent and a passive bridge
The agent drives the conversation; the bridge is a read-only tap on both legs.
const call = await voice.calls.originate({ to: "+15551234567", trunkId: "trunk_main" });
await voice.agents.attach(call.sid, {
  mode: "speech_to_speech",
  turnDetection: { silenceThresholdMs: 500, responseMinSpeechMs: 600 },
});
const bridge = await voice.audioBridge.attach(call.sid, {
  codec: "pcm16",
  onUplink: (frame, tsUs) => onUplink(frame, tsUs),
});
Detect end-of-speech locally
Track when voiced caller frames stop arriving for longer than the silence
threshold; that approximates the turn detector’s commit point.
// rms() and VOICE_FLOOR are supplied by your app: a frame-energy measure and a
// tuned threshold on the same scale.
let lastVoiceUs = 0n;
let turnOpen = false;
const SILENCE_US = 500_000n; // mirror silenceThresholdMs
function onUplink(frame: Uint8Array, tsUs?: bigint) {
  // Fall back to wall-clock microseconds when the bridge gives no timestamp.
  const ts = tsUs ?? BigInt(Date.now()) * 1000n;
  if (rms(frame) > VOICE_FLOOR) { lastVoiceUs = ts; turnOpen = true; }
}
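The snippets in this recipe call rms and VOICE_FLOOR without defining them. One possible definition, assuming the bridge delivers little-endian PCM16 in each Uint8Array frame (which matches codec: "pcm16" but is not confirmed here), is sketched below. The threshold of 500 is an illustrative starting point on the 16-bit scale, not a recommended value.

```typescript
// Hypothetical energy floor on the 16-bit sample scale; tune per deployment.
const VOICE_FLOOR = 500;

// Root-mean-square amplitude of a frame of little-endian signed 16-bit PCM.
function rms(frame: Uint8Array): number {
  // Respect byteOffset so subarray views of a larger buffer are read correctly.
  const view = new DataView(frame.buffer, frame.byteOffset, frame.byteLength);
  const sampleCount = Math.floor(frame.byteLength / 2);
  if (sampleCount === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < sampleCount; i++) {
    const sample = view.getInt16(i * 2, true); // true = little-endian
    sumSquares += sample * sample;
  }
  return Math.sqrt(sumSquares / sampleCount);
}
```

A plain energy threshold is a crude voice detector: line noise or music on hold can cross it. For a latency benchmark that is usually acceptable, since the goal is a consistent mark for the last loud frame rather than a perfect speech classifier.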
Time to the first agent frame
The first downlink frame after a closed turn marks start-of-agent-speech.
const samples: number[] = [];
bridge.onDownlink((frame, tsUs) => {
  const now = tsUs ?? BigInt(Date.now()) * 1000n;
  if (turnOpen && now - lastVoiceUs > SILENCE_US && rms(frame) > VOICE_FLOOR) {
    const turnLatencyMs = Number(now - (lastVoiceUs + SILENCE_US)) / 1000;
    samples.push(turnLatencyMs);
    turnOpen = false;
  }
});
Report the tail, not the mean
p95 / p99 is the product story: a good mean with a bad tail still feels
broken.
function percentile(xs: number[], p: number) {
  const s = [...xs].sort((a, b) => a - b);
  return s[Math.floor((s.length - 1) * p)];
}
call.onStatus((st) => {
  if (st === "completed") {
    console.log("turn latency p50 ms:", percentile(samples, 0.50));
    console.log("turn latency p95 ms:", percentile(samples, 0.95));
    console.log("turn latency p99 ms:", percentile(samples, 0.99));
  }
});
Over a WAN-emulated link (50 ms RTT, ~1% loss), the QUIC substrate keeps
steady-state first-audio latency roughly half that of TCP+TLS with a far
cleaner p99 — because a dropped packet on one stream doesn’t stall the audio on
another. If your p99 balloons under load, look at the caller leg’s loss first,
not the model.
Recipe 4 — Hand off from the model to a human
Run the speech-to-speech agent first, then transfer to a live agent when the
caller asks. The model leg detaches; the call continues on the new leg.
import { Voice } from "@clutchcall/sdk/voice";
const voice = new Voice({ baseUrl: BASE_URL, apiKey: KEY, orgId: ORG });
const call = await voice.calls.originate({
  to: "+15551234567", trunkId: "trunk_main", agent: "support-s2s",
});
// Your own logic decides when a human is needed (a tool call, a keyword, a
// supervisor request). When it does, transfer the live audio to a human leg.
async function escalate() {
  await call.transfer({ to: "+15558675309", trunkId: "trunk_main" });
  // The model is no longer driving; the new leg owns the audio.
}
transfer performs a SIP REFER (or HTTP fallback) to hand off the live
audio. The original sid stays in history; the new leg gets its own sid. The
speech-to-speech agent stops driving the conversation the moment the transfer
completes.