Inference — Recipes

Longer, end-to-end examples that combine agent attach, turn-detection tuning, the audio bridge, and call status into realistic mini-apps. All of it rides the Voice surface.

Recipe 1 — Outbound speech-to-speech agent on a PSTN call

Dial a number, bridge a speech-to-speech model the moment it answers, and let ClutchCall own turn-taking for the whole call. This is the canonical inference app: no ASR/LLM/TTS pipeline to wire — one model, one connection.

Create the client

One tenant token authorizes both call control and the agent leg.

import { Voice } from "@clutchcall/sdk/voice";

const voice = new Voice({ baseUrl: BASE_URL, apiKey: KEY, orgId: ORG });

Originate with the agent inline

Passing agent to originate bridges the model when the callee answers — no second round-trip.

const call = await voice.calls.originate({
  to:      "+15551234567",
  from:    "+15558675309",
  trunkId: "trunk_main",
  agent: {
    mode:          "speech_to_speech",
    modelCodec:    "pcm16",       // model ingests/emits raw PCM16
    inputRateHz:   16000,
    outputRateHz:  24000,
    turnDetection: {
      responseMinSpeechMs: 600,   // drop backchannels
      silenceThresholdMs:  500,   // end-of-turn after 500ms silence
      bargeConfirmMs:      300,   // hold-and-confirm barge-in
    },
  },
});

Follow the call to completion

The model drives the conversation; you just watch status.

call.onStatus((s) => {
  console.log("status:", s);
  if (s === "completed" || s === "failed" || s === "no_answer") {
    console.log("call ended:", s);
  }
});
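
If several parts of an app need to react to the end of a call, the terminal check above can be wrapped in a small helper. This is a minimal sketch that assumes only the onStatus callback and the three terminal statuses shown above; StatusSource and onCallEnd are illustrative names, not part of the SDK.

```typescript
// Hypothetical helper, not part of the SDK: invoke `done` exactly once when
// the call reaches one of the terminal statuses used in the snippet above.
interface StatusSource {
  onStatus(cb: (status: string) => void): void;
}

const TERMINAL_STATUSES = new Set(["completed", "failed", "no_answer"]);

function onCallEnd(call: StatusSource, done: (finalStatus: string) => void): void {
  let fired = false;
  call.onStatus((status) => {
    if (fired || !TERMINAL_STATUSES.has(status)) return;
    fired = true; // ignore any later status events
    done(status);
  });
}
```

With the call from the previous step, onCallEnd(call, (s) => console.log("call ended:", s)) logs once per call, however many status events arrive.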

That’s the whole app. The PSTN leg is µ-law at 8 kHz; ClutchCall transcodes to PCM16 for the model and back, resampling at the bridge (up to 16 kHz on the way in, down from 24 kHz on the way out, matching inputRateHz and outputRateHz above). Because the model hears the utterance as it streams, the agent’s first words follow within tens of milliseconds of the turn committing, which with this profile happens after 500 ms of silence.
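
You never write this transcoding yourself, but it helps to see what the bridge is doing on the inbound path. The sketch below decodes standard G.711 µ-law bytes to PCM16 and doubles the rate from 8 kHz to 16 kHz by linear interpolation. It is an illustration only, not ClutchCall’s implementation: the function names are made up, and a production resampler would also low-pass filter.

```typescript
// Decode one G.711 mu-law byte to a 16-bit linear PCM sample.
function muLawToPcm16(byte: number): number {
  const u = ~byte & 0xff;                 // mu-law bytes are stored inverted
  const exponent = (u >> 4) & 0x07;       // 3-bit segment number
  const mantissa = u & 0x0f;              // 4-bit step within the segment
  const magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84;
  return (u & 0x80) !== 0 ? -magnitude : magnitude;
}

// Decode an 8 kHz mu-law frame and upsample it to 16 kHz by inserting the
// midpoint between each pair of neighbouring samples.
function decodeAndUpsample2x(ulaw: Uint8Array): Int16Array {
  const out = new Int16Array(ulaw.length * 2);
  for (let i = 0; i < ulaw.length; i++) {
    const cur = muLawToPcm16(ulaw[i]);
    const next = i + 1 < ulaw.length ? muLawToPcm16(ulaw[i + 1]) : cur;
    out[2 * i] = cur;
    out[2 * i + 1] = (cur + next) >> 1;   // linear interpolation
  }
  return out;
}
```

A 20 ms PSTN frame is 160 µ-law bytes; decodeAndUpsample2x turns it into 320 PCM16 samples, which is 20 ms at 16 kHz.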

Recipe 2 — A barge-in profile tuned per transport

Barge-in behavior that feels great on a browser leg (snappy, cut-on-first-frame) self-triggers on a raw PSTN leg (the agent reads its own echo as the caller speaking). Pick the turn-detection profile from the leg type at attach time.

import { Voice } from "@clutchcall/sdk/voice";

const voice = new Voice({ baseUrl: BASE_URL, apiKey: KEY, orgId: ORG });

// Two profiles: one for echo-cancelled legs, one for raw audio.
const profiles = {
  browser: {            // client-side AEC: cut instantly, low guard
    bargeConfirmMs: 0,
    ttsGuardMs:     150,
    responseMinSpeechMs: 500,
  },
  pstn: {               // no AEC: confirm before cutting, high guard
    bargeConfirmMs: 350,
    ttsGuardMs:     400,
    responseMinSpeechMs: 700,
  },
} as const;

async function attachAgent(callSid: string, leg: "browser" | "pstn") {
  await voice.agents.attach(callSid, {
    mode:          "speech_to_speech",
    modelCodec:    leg === "browser" ? "opus" : "pcm16",
    turnDetection: profiles[leg],
  });
}

The two settings that flip with the transport are bargeConfirmMs (how long speech must be sustained before the agent is cancelled) and ttsGuardMs (how long the mic gate stays raised after the agent emits audio). On a no-AEC leg, raise both so the agent’s own voice can’t trigger barge-in; on an AEC leg, drop both for a snappy interrupt.
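
To see why bargeConfirmMs matters on a no-AEC leg, here is a toy model of hold-and-confirm: caller speech only counts as a barge-in once it has been sustained for the confirm window. This is an illustrative sketch, not the ClutchCall detector. BargeConfirm, firstCutMs, and the 20 ms frame size are assumptions, and ttsGuardMs is not modeled.

```typescript
// Illustrative model of hold-and-confirm barge-in (not the real detector).
class BargeConfirm {
  private readonly confirmMs: number;
  private speechStartMs: number | null = null;

  constructor(confirmMs: number) {
    this.confirmMs = confirmMs;
  }

  // Feed one caller frame. Returns true once speech has been sustained for
  // at least confirmMs; any non-speech frame resets the timer.
  push(tsMs: number, isSpeech: boolean): boolean {
    if (!isSpeech) {
      this.speechStartMs = null;
      return false;
    }
    if (this.speechStartMs === null) this.speechStartMs = tsMs;
    return tsMs - this.speechStartMs >= this.confirmMs;
  }
}

// Time at which barge-in fires for a speech burst lasting burstMs, sampled
// in 20 ms frames (assumed frame size); null if it never fires.
function firstCutMs(confirmMs: number, burstMs: number): number | null {
  const gate = new BargeConfirm(confirmMs);
  for (let t = 0; t < burstMs; t += 20) {
    if (gate.push(t, true)) return t;
  }
  return null;
}
```

Under this model a 200 ms burst of echo never cuts the agent with the pstn profile (firstCutMs(350, 200) is null), while the browser profile cuts on the first frame (firstCutMs(0, 200) is 0).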

Wire it from your call setup:

const browserCall = await voice.calls.originate({ to: "+1...", agent: undefined });
await attachAgent(browserCall.sid, "browser");

const pstnCall = await voice.calls.originate({ to: "+1...", trunkId: "trunk_main" });
await attachAgent(pstnCall.sid, "pstn");

Recipe 3 — Measure turn latency on a live call

Turn latency — end-of-user-speech to first agent audio-out frame — is the number that defines “responsiveness.” Tap the audio bridge alongside the agent, mark the last inbound frame, and time to the first outbound frame.

Attach the agent and a passive bridge

The agent drives the conversation; the bridge is a read-only tap on both legs.

const call = await voice.calls.originate({ to: "+15551234567", trunkId: "trunk_main" });

await voice.agents.attach(call.sid, {
  mode: "speech_to_speech",
  turnDetection: { silenceThresholdMs: 500, responseMinSpeechMs: 600 },
});

const bridge = await voice.audioBridge.attach(call.sid, {
  codec:    "pcm16",
  onUplink: (frame, tsUs) => onUplink(frame, tsUs),
});

Detect end-of-speech locally

Track the timestamp of the last caller frame whose energy is above a voice floor. Once that timestamp is older than the silence threshold, the turn has approximately reached the turn detector’s commit point.

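let lastVoiceUs = 0n;          // timestamp (µs) of the last voiced caller frame
let turnOpen    = false;       // true once the caller has spoken in this turn
const SILENCE_US = 500_000n;   // mirror silenceThresholdMs

// rms() and VOICE_FLOOR are your own energy helper and threshold.
function onUplink(frame: Uint8Array, tsUs?: bigint) {
  // Fall back to wall-clock time if the bridge supplies no timestamp.
  const ts = tsUs ?? BigInt(Date.now()) * 1000n;
  if (rms(frame) > VOICE_FLOOR) { lastVoiceUs = ts; turnOpen = true; }
}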
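
The handler above calls rms and VOICE_FLOOR, which this page does not define. One possible definition, assuming the bridge delivers little-endian PCM16 frames (as requested by codec: "pcm16"); the threshold value is a placeholder to tune against your own audio, not a recommended setting.

```typescript
// Assumed threshold on the 16-bit sample scale; tune for your line noise.
const VOICE_FLOOR = 500;

// Root-mean-square energy of a little-endian PCM16 frame.
function rms(frame: Uint8Array): number {
  const view = new DataView(frame.buffer, frame.byteOffset, frame.byteLength);
  const sampleCount = Math.floor(frame.byteLength / 2);
  if (sampleCount === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < sampleCount; i++) {
    const sample = view.getInt16(i * 2, true); // true = little-endian
    sumSquares += sample * sample;
  }
  return Math.sqrt(sumSquares / sampleCount);
}
```

An energy floor is the simplest possible voice detector. It will count loud background noise as speech, so treat the resulting latency numbers as approximate.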

Time to the first agent frame

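The first voiced downlink frame after a closed turn marks start-of-agent-speech. The handler subtracts the silence window, so each sample runs from the estimated commit point to first agent audio; drop SILENCE_US from the subtraction to measure from the caller’s last voiced frame instead.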

const samples: number[] = [];
bridge.onDownlink((frame, tsUs) => {
  const now = tsUs ?? BigInt(Date.now()) * 1000n;
  if (turnOpen && now - lastVoiceUs > SILENCE_US && rms(frame) > VOICE_FLOOR) {
    const turnLatencyMs = Number(now - (lastVoiceUs + SILENCE_US)) / 1000;
    samples.push(turnLatencyMs);
    turnOpen = false;
  }
});
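
The arithmetic in that handler is easy to check by hand. The pure function below reproduces it with made-up timestamps; turnLatencyMs is an illustrative name and is not part of the SDK.

```typescript
// Latency from the estimated commit point (last voiced caller frame plus the
// silence window) to the first voiced agent frame, in milliseconds.
function turnLatencyMs(lastVoiceUs: bigint, firstAgentUs: bigint, silenceUs: bigint): number {
  return Number(firstAgentUs - (lastVoiceUs + silenceUs)) / 1000;
}

// Caller stops at t = 1.000 s, agent audio starts at t = 1.720 s,
// silence window is 500 ms: commit point is 1.500 s, latency is 220 ms.
const exampleLatency = turnLatencyMs(1_000_000n, 1_720_000n, 500_000n);
console.log(exampleLatency); // 220
```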

Report the tail, not the mean

p95 / p99 is the product story — a good mean with a bad tail still feels broken.

function percentile(xs: number[], p: number) {
  if (xs.length === 0) return NaN;               // no turns were measured
  const s = [...xs].sort((a, b) => a - b);
  return s[Math.floor((s.length - 1) * p)];      // lower nearest-rank
}
call.onStatus((st) => {
  if (st === "completed") {
    console.log("turn latency p50 ms:", percentile(samples, 0.50));
    console.log("turn latency p95 ms:", percentile(samples, 0.95));
    console.log("turn latency p99 ms:", percentile(samples, 0.99));
  }
});
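
A quick worked example of why the tail matters, using made-up numbers: 18 fast turns and 2 slow ones. The percentile helper is repeated from above so the snippet runs on its own.

```typescript
// Same lower nearest-rank percentile as above.
function percentile(xs: number[], p: number): number {
  const s = [...xs].sort((a, b) => a - b);
  return s[Math.floor((s.length - 1) * p)];
}

// Fabricated sample set for illustration: 18 turns at 200 ms, 2 at 1500 ms.
const demo: number[] = [...Array(18).fill(200), ...Array(2).fill(1500)];

const demoMean = demo.reduce((a, b) => a + b, 0) / demo.length;
const demoP50  = percentile(demo, 0.50);
const demoP95  = percentile(demo, 0.95);

console.log({ demoMean, demoP50, demoP95 }); // { demoMean: 330, demoP50: 200, demoP95: 1500 }
```

The mean of 330 ms looks healthy, but one turn in ten took a second and a half, and only the p95 shows it.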

Over a WAN-emulated link (50 ms RTT, ~1% loss), the QUIC substrate keeps steady-state first-audio latency roughly half that of TCP+TLS with a far cleaner p99 — because a dropped packet on one stream doesn’t stall the audio on another. If your p99 balloons under load, look at the caller leg’s loss first, not the model.

Recipe 4 — Hand off from the model to a human

Run the speech-to-speech agent first, then transfer to a live agent when the caller asks. The model leg detaches; the call continues on the new leg.

import { Voice } from "@clutchcall/sdk/voice";

const voice = new Voice({ baseUrl: BASE_URL, apiKey: KEY, orgId: ORG });

const call = await voice.calls.originate({
  to: "+15551234567", trunkId: "trunk_main", agent: "support-s2s",
});

// Your own logic decides when a human is needed (a tool call, a keyword, a
// supervisor request). When it does, transfer the live audio to a human leg.
async function escalate() {
  await call.transfer({ to: "+15558675309", trunkId: "trunk_main" });
  // The model is no longer driving; the new leg owns the audio.
}

transfer performs a SIP REFER (or HTTP fallback) to hand off the live audio. The original sid stays in history; the new leg gets its own sid. The speech-to-speech agent stops driving the conversation the moment the transfer completes.

Related

Inference — Details — the commit gate, codecs, and turn-latency metric
Inference — Cookbook — short single-task snippets
Voice — Recipes — call control patterns this builds on
