# Inference — Recipes

> End-to-end worked examples: an outbound speech-to-speech agent, a tuned barge-in profile, a turn-latency probe, and a handoff from the model to a human.

Longer, end-to-end examples that combine agent attach, turn-detection tuning,
the audio bridge, and call status into realistic mini-apps. All of it rides the
[Voice](/modalities/voice/sdk-methods) surface.

## Recipe 1 — Outbound speech-to-speech agent on a PSTN call

Dial a number, bridge a speech-to-speech model the moment it answers, and let
ClutchCall own turn-taking for the whole call. This is the canonical inference
app: no ASR/LLM/TTS pipeline to wire — one model, one connection.

  1. **Create the client**
One tenant token authorizes the call control and the agent leg.

```ts
import { Voice } from "@clutchcall/sdk/voice";

const voice = new Voice({ baseUrl: BASE_URL, apiKey: KEY, orgId: ORG });
```
  2. **Originate with the agent inline**
Passing `agent` to `originate` bridges the model when the callee answers — no
    second round-trip.

```ts
const call = await voice.calls.originate({
  to:      "+15551234567",
  from:    "+15558675309",
  trunkId: "trunk_main",
  agent: {
    mode:          "speech_to_speech",
    modelCodec:    "pcm16",       // model ingests/emits raw PCM16
    inputRateHz:   16000,
    outputRateHz:  24000,
    turnDetection: {
      responseMinSpeechMs: 600,   // drop backchannels
      silenceThresholdMs:  500,   // end-of-turn after 500ms silence
      bargeConfirmMs:      300,   // hold-and-confirm barge-in
    },
  },
});
```
  3. **Follow the call to completion**
The model drives the conversation; you just watch status.

```ts
call.onStatus((s) => {
  console.log("status:", s);
  if (s === "completed" || s === "failed" || s === "no_answer") {
    console.log("call ended:", s);
  }
});
```

That's the whole app. The PSTN leg is µ-law; ClutchCall transcodes to PCM16
for the model and back, resampling 8 kHz ↔ 16/24 kHz at the bridge. Because the
model hears the utterance as it streams, the caller hears the agent's first
words within tens of milliseconds of the turn committing, which with the
settings above happens after 500 ms of silence.
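
To make the transcoding step concrete, here is a minimal sketch of decoding G.711 µ-law to PCM16 and doubling the sample rate from 8 kHz to 16 kHz. It is illustrative only: `muLawToPcm16`, `upsample2xLinear`, and `decodePstnFrame` are hypothetical helper names, linear interpolation is a deliberately simple stand-in for a proper resampling filter, and none of this is ClutchCall's bridge code, which handles the conversion for you.

```typescript
// Decode one G.711 µ-law byte to a signed 16-bit PCM sample (standard expansion).
function muLawToPcm16(byte: number): number {
  const u = ~byte & 0xff;              // µ-law bytes are stored bit-inverted
  const sign = u & 0x80;
  const exponent = (u >> 4) & 0x07;
  const mantissa = u & 0x0f;
  const magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84;
  return sign ? -magnitude : magnitude;
}

// Naive 8 kHz -> 16 kHz upsampling: keep each sample, then insert the midpoint
// between it and the next one (the last sample is simply repeated).
function upsample2xLinear(input: Int16Array): Int16Array {
  const out = new Int16Array(input.length * 2);
  for (let i = 0; i < input.length; i++) {
    const next = i + 1 < input.length ? input[i + 1] : input[i];
    out[2 * i] = input[i];
    out[2 * i + 1] = Math.round((input[i] + next) / 2);
  }
  return out;
}

// Hypothetical helper: turn a µ-law frame from the PSTN leg into 16 kHz PCM16.
function decodePstnFrame(ulaw: Uint8Array): Int16Array {
  const pcm8k = Int16Array.from(ulaw, (b) => muLawToPcm16(b));
  return upsample2xLinear(pcm8k);
}
```

A 20 ms µ-law frame (160 bytes at 8 kHz) becomes 320 PCM16 samples at 16 kHz under this sketch.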

## Recipe 2 — A barge-in profile tuned per transport

Barge-in behavior that feels great on a browser leg (snappy, cut-on-first-frame)
self-triggers on a raw PSTN leg (the agent reads its own echo as the caller
speaking). Pick the turn-detection profile from the leg type at attach time.

```ts
import { Voice } from "@clutchcall/sdk/voice";

const voice = new Voice({ baseUrl: BASE_URL, apiKey: KEY, orgId: ORG });

// Two profiles: one for echo-cancelled legs, one for raw audio.
const profiles = {
  browser: {            // client-side AEC: cut instantly, low guard
    bargeConfirmMs: 0,
    ttsGuardMs:     150,
    responseMinSpeechMs: 500,
  },
  pstn: {               // no AEC: confirm before cutting, high guard
    bargeConfirmMs: 350,
    ttsGuardMs:     400,
    responseMinSpeechMs: 700,
  },
} as const;

async function attachAgent(callSid: string, leg: "browser" | "pstn") {
  await voice.agents.attach(callSid, {
    mode:          "speech_to_speech",
    modelCodec:    leg === "browser" ? "opus" : "pcm16",
    turnDetection: profiles[leg],
  });
}
```

> **NOTE:**
> The two settings that flip with the transport are `bargeConfirmMs` (how long
>   speech must be sustained before the agent is cancelled) and `ttsGuardMs` (how
>   long the mic gate stays raised after the agent emits audio). On a no-AEC leg,
>   raise both so the agent's own voice can't trigger barge-in; on an AEC leg, drop
>   both for a snappy interrupt.
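
The interaction between the two settings can be sketched as a small gate that runs once per microphone frame. This is a mental model, not ClutchCall's turn detector: `createBargeGate` and its parameters are hypothetical, and it assumes `ttsGuardMs` is measured from the moment the agent starts emitting audio, which the text above does not pin down.

```typescript
interface BargeGateConfig {
  bargeConfirmMs: number; // sustained speech required before the agent is cancelled
  ttsGuardMs: number;     // window after agent audio starts in which the mic is ignored
}

// Returns a per-frame decision function; `true` means "cancel the agent now".
function createBargeGate(cfg: BargeGateConfig) {
  let speechStartMs: number | null = null;

  return function onMicFrame(
    nowMs: number,
    isSpeech: boolean,
    agentAudioStartMs: number, // assumption: when the agent began emitting audio
  ): boolean {
    const guarded = nowMs - agentAudioStartMs < cfg.ttsGuardMs;
    if (guarded || !isSpeech) {
      speechStartMs = null;    // gate raised or silence: restart the confirm timer
      return false;
    }
    const start = speechStartMs ?? nowMs;
    speechStartMs = start;
    return nowMs - start >= cfg.bargeConfirmMs;
  };
}
```

With the `browser` profile (`bargeConfirmMs: 0`, `ttsGuardMs: 150`) the gate fires on the first speech frame past the guard window. With the `pstn` profile it needs 350 ms of uninterrupted speech after a 400 ms guard, which is what keeps the agent's own echo from cutting it off.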

Wire it from your call setup:

```ts
const browserCall = await voice.calls.originate({ to: "+1...", agent: undefined });
await attachAgent(browserCall.sid, "browser");

const pstnCall = await voice.calls.originate({ to: "+1...", trunkId: "trunk_main" });
await attachAgent(pstnCall.sid, "pstn");
```

## Recipe 3 — Measure turn latency on a live call

Turn latency — end-of-user-speech to first agent audio-out frame — is the number
that defines "responsiveness." Tap the audio bridge alongside the agent, mark the
last inbound frame, and time to the first outbound frame.

  1. **Attach the agent and a passive bridge**
The agent drives the conversation; the bridge is a read-only tap on both legs.

```ts
const call = await voice.calls.originate({ to: "+15551234567", trunkId: "trunk_main" });

await voice.agents.attach(call.sid, {
  mode: "speech_to_speech",
  turnDetection: { silenceThresholdMs: 500, responseMinSpeechMs: 600 },
});

const bridge = await voice.audioBridge.attach(call.sid, {
  codec:    "pcm16",
  onUplink: (frame, tsUs) => onUplink(frame, tsUs),
});
```
  2. **Detect end-of-speech locally**
Record the timestamp of the last caller frame whose energy exceeds a voice
    floor. Once that timestamp is older than the silence threshold, treat the turn
    as closed; that approximates the turn detector's commit point.

```ts
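let lastVoiceUs = 0n;          // timestamp of the last caller frame that carried speech
let turnOpen    = false;       // true once the caller has spoken in the current turn
const SILENCE_US = 500_000n;   // mirror silenceThresholdMs

// `rms` (frame energy) and `VOICE_FLOOR` (speech threshold) are yours to supply.
function onUplink(frame: Uint8Array, tsUs?: bigint) {
  // Fall back to wall-clock time if the bridge supplies no timestamp. Use one clock
  // source for both directions; mixing the two skews the measurement.
  const ts = tsUs ?? BigInt(Date.now()) * 1000n;
  if (rms(frame) > VOICE_FLOOR) { lastVoiceUs = ts; turnOpen = true; }
}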
```
  3. **Time to the first agent frame**
The first downlink frame after a closed turn marks start-of-agent-speech.

```ts
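const samples: number[] = [];
bridge.onDownlink((frame, tsUs) => {
  const now = tsUs ?? BigInt(Date.now()) * 1000n;
  // First voiced agent frame after the caller has been silent past the threshold.
  if (turnOpen && now - lastVoiceUs > SILENCE_US && rms(frame) > VOICE_FLOOR) {
    // Measured from the estimated commit point (last voiced frame + silence window).
    // For latency from the caller's last voiced frame, use `now - lastVoiceUs` instead.
    const turnLatencyMs = Number(now - (lastVoiceUs + SILENCE_US)) / 1000;
    samples.push(turnLatencyMs);
    turnOpen = false;   // one sample per turn
  }
});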
```
  4. **Report the tail, not the mean**
p95 / p99 is the product story — a good mean with a bad tail still feels
    broken.

```ts
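// Nearest-rank percentile (index rounded down); NaN when no turns were sampled.
function percentile(xs: number[], p: number) {
  if (xs.length === 0) return NaN;
  const s = [...xs].sort((a, b) => a - b);
  return s[Math.floor((s.length - 1) * p)];
}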
call.onStatus((st) => {
  if (st === "completed") {
    console.log("turn latency p50 ms:", percentile(samples, 0.50));
    console.log("turn latency p95 ms:", percentile(samples, 0.95));
    console.log("turn latency p99 ms:", percentile(samples, 0.99));
  }
});
```
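
A small worked example shows why the tail matters more than the mean. The data here is made up for illustration (90 fast turns and 10 stalled ones), and `pct` is a local copy of the nearest-rank helper so the block runs on its own.

```typescript
// Nearest-rank percentile with the same indexing as the recipe's helper.
function pct(xs: number[], p: number): number {
  if (xs.length === 0) return NaN;
  const sorted = [...xs].sort((a, b) => a - b);
  return sorted[Math.floor((sorted.length - 1) * p)];
}

// Hypothetical data: 90 turns at 300 ms and 10 stalled turns at 2000 ms.
const latenciesMs: number[] = [
  ...new Array<number>(90).fill(300),
  ...new Array<number>(10).fill(2000),
];

const meanMs = latenciesMs.reduce((a, b) => a + b, 0) / latenciesMs.length;

console.log("mean ms:", meanMs);                  // 470
console.log("p50 ms:", pct(latenciesMs, 0.5));    // 300
console.log("p95 ms:", pct(latenciesMs, 0.95));   // 2000
```

A 470 ms mean looks tolerable, yet one caller in ten waits two seconds, and only the p95 makes that visible.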

> **TIP:**
> Over a WAN-emulated link (50 ms RTT, ~1% loss), the QUIC substrate keeps
>   steady-state first-audio latency roughly half that of TCP+TLS with a far
>   cleaner p99 — because a dropped packet on one stream doesn't stall the audio on
>   another. If your p99 balloons under load, look at the caller leg's loss first,
>   not the model.
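
Steps 2 and 3 of this recipe rely on an `rms` helper and a `VOICE_FLOOR` threshold that the recipe leaves to the reader. One possible sketch follows, assuming the bridge delivers mono PCM16 in little-endian byte order (consistent with `codec: "pcm16"`, but an assumption all the same). The value `500` is an arbitrary placeholder to calibrate against your own line noise, not a recommended setting.

```typescript
// Placeholder speech threshold on the 16-bit sample scale; tune per deployment.
const VOICE_FLOOR = 500;

// Root-mean-square energy of a PCM16 little-endian frame.
function rms(frame: Uint8Array): number {
  const sampleCount = frame.byteLength >> 1;   // two bytes per sample; a trailing odd byte is ignored
  if (sampleCount === 0) return 0;
  // Respect byteOffset so subarray views over a larger buffer read correctly.
  const view = new DataView(frame.buffer, frame.byteOffset, frame.byteLength);
  let sumSquares = 0;
  for (let i = 0; i < sampleCount; i++) {
    const s = view.getInt16(i * 2, true);      // true = little-endian
    sumSquares += s * s;
  }
  return Math.sqrt(sumSquares / sampleCount);
}
```

A plain energy threshold is crude: it will count loud background noise as speech. It is enough for a latency probe, where the goal is a consistent marker rather than a production-grade voice activity detector.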

## Recipe 4 — Hand off from the model to a human

Run the speech-to-speech agent first, then transfer to a live agent when the
caller asks. The model leg detaches; the call continues on the new leg.

```ts
import { Voice } from "@clutchcall/sdk/voice";

const voice = new Voice({ baseUrl: BASE_URL, apiKey: KEY, orgId: ORG });

const call = await voice.calls.originate({
  to: "+15551234567", trunkId: "trunk_main", agent: "support-s2s",
});

// Your own logic decides when a human is needed (a tool call, a keyword, a
// supervisor request). When it does, transfer the live audio to a human leg.
async function escalate() {
  await call.transfer({ to: "+15558675309", trunkId: "trunk_main" });
  // The model is no longer driving; the new leg owns the audio.
}
```

> **NOTE:**
> `transfer` performs a SIP `REFER` (or HTTP fallback) to hand off the live
>   audio. The original sid stays in history; the new leg gets its own sid. The
>   speech-to-speech agent stops driving the conversation the moment the transfer
>   completes.
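
Because several triggers can fire for the same call (a keyword and a supervisor request, say), it can help to make escalation idempotent. The sketch below is one way to do that. `once` and `wantsHuman` are hypothetical helpers rather than SDK functions, and the keyword list is a placeholder; it also assumes you have some text signal to inspect, such as a tool-call argument.

```typescript
// Wrap an async action so repeated or concurrent triggers share one in-flight call.
function once<T>(action: () => Promise<T>): () => Promise<T> {
  let inFlight: Promise<T> | null = null;
  return () => {
    const p = inFlight ?? action();
    inFlight = p;
    return p;
  };
}

// Placeholder keyword trigger; replace with your own escalation policy.
const HUMAN_REQUEST = /\b(human|agent|representative|operator)\b/i;

function wantsHuman(utterance: string): boolean {
  return HUMAN_REQUEST.test(utterance);
}

// Usage with the recipe's `call` (commented out because `call` is not defined here):
// const escalateOnce = once(() => call.transfer({ to: "+15558675309", trunkId: "trunk_main" }));
// if (wantsHuman(text)) await escalateOnce();
```

Every caller of the wrapped function receives the same promise, so a second trigger waits on the first transfer instead of issuing another one.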

## Related

- [Inference — Details](/modalities/inference/details) — the commit gate, codecs, and turn-latency metric
- [Inference — Cookbook](/modalities/inference/cookbook) — short single-task snippets
- [Voice — Recipes](/modalities/voice/recipes) — call control patterns this builds on