# Turn Detection & Barge-In

> When the AI agent stops talking, when the human starts — VAD, end-of-utterance, and interrupt signals across ClutchCall.

Turn detection is the voice-agent term for "who has the floor." A working
agent stops talking the moment the caller starts (barge-in), and starts
its response the moment the caller's utterance ends (end-of-utterance).
Get this wrong and the call feels broken: the agent talks over the user,
or sits silent for awkward seconds while waiting for "real" silence.

ClutchCall handles turn detection in two layers — a local VAD on the
gateway and a configurable detection mode per agent — and signals
interrupts end-to-end via a control channel between the gateway, the
agent runtime, and (where supported) vendor adapters.

## The two detection modes

```
┌─────────────────────────────────────────────────────────────────┐
│  TurnDetection {                                                │
│    type: "silero" | "server_vad",                               │
│    silence_threshold_ms?, min_speech_ms?, prefix_padding_ms?    │
│  }                                                              │
└─────────────────────────────────────────────────────────────────┘
```

| `type`        | Who runs VAD                              | When to use                                                           |
|---------------|-------------------------------------------|-----------------------------------------------------------------------|
| `silero`      | Gateway, locally (on-device VAD)          | Cascaded ASR → LLM → TTS. Default.                                    |
| `server_vad`  | The AI provider (e.g. OpenAI Realtime)    | Realtime models that already emit `speech_started`/`speech_stopped`.  |

**`silero` is the default** because most pipelines are cascaded
(Deepgram → Anthropic → ElevenLabs, for example) and need a local trigger.
The runtime auto-promotes to `server_vad` only when the agent's entry
node is `REALTIME` AND the operator didn't supply an explicit type.
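
The promotion rule is small enough to sketch in a few lines. The `TurnDetection` dataclass and the `resolve_turn_detection` helper below are hypothetical names used for illustration, not the runtime's actual API; only the rule itself (promote when the entry node is `REALTIME` and no explicit `type` was supplied) comes from the text above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TurnDetection:
    type: str = "silero"  # "silero" | "server_vad"
    silence_threshold_ms: Optional[int] = None
    min_speech_ms: Optional[int] = None
    prefix_padding_ms: Optional[int] = None


def resolve_turn_detection(entry_node: str, configured: Optional[dict]) -> TurnDetection:
    """Hypothetical helper: pick the effective detection mode for an agent."""
    settings = dict(configured or {})
    if "type" not in settings:
        # Auto-promote only for a REALTIME entry node with no explicit type.
        settings["type"] = "server_vad" if entry_node == "REALTIME" else "silero"
    return TurnDetection(**settings)
```

An explicit `type` always wins in this sketch, so `resolve_turn_detection("REALTIME", {"type": "silero"})` keeps gateway-side detection.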

## End-to-end interrupt flow

When local VAD trips, this chain fires:

```
caller starts speaking
       │
       ▼
VAD source: on-device VAD on the gateway (SIP/RTP path)
            OR vendor server-VAD via the agent runtime (direct-media path)
       │
       │  control frame: [0xFFFFFFFF sentinel][type=BARGE_IN][session_id]
       ▼
agent_runtime_bridge → runtime
       │
       ▼
DagExecutor::on_cancel(CancelEvent{source: user_vad})
       │
       ├──▶ ASR / LLM / TTS nodes: forward cancel
       ├──▶ PushAudio node: clear outbound buffer
       └──▶ RealtimeNode: emit `conversation.item.truncate` to OpenAI,
                          close session
```

The same chain fires with `source: server_vad` when the AI provider
reports that the user has started speaking (e.g. OpenAI Realtime's
`input_audio_buffer.speech_started`). Both sources land in the same
handler, so node code is detection-mode-agnostic.
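
To make the flow concrete, here is a rough Python sketch of a bridge decoding a barge-in control frame and an executor fanning the cancel out to a node. The names `DagExecutor`, `CancelEvent`, and `PushAudioNode` are borrowed from the diagram, but their Python shapes are hypothetical. The frame layout is also an assumption: this page only gives the field order (sentinel, type, session ID), so the 1-byte type, the `BARGE_IN = 0x01` code, the 16-byte session ID, and network byte order are all guesses for illustration.

```python
import struct
import uuid
from dataclasses import dataclass
from typing import List, Optional

SENTINEL = 0xFFFFFFFF
BARGE_IN = 0x01  # assumed type code; the real value is not given on this page

# Assumed layout: u32 sentinel, u8 type, 16-byte session id, network byte order.
_FRAME = struct.Struct("!IB16s")


@dataclass(frozen=True)
class CancelEvent:
    source: str  # "user_vad" or "server_vad"


def encode_barge_in(session_id: uuid.UUID) -> bytes:
    return _FRAME.pack(SENTINEL, BARGE_IN, session_id.bytes)


def decode_barge_in(data: bytes) -> Optional[uuid.UUID]:
    """Return the session id for a BARGE_IN control frame, else None."""
    if len(data) != _FRAME.size:
        return None
    sentinel, frame_type, raw_id = _FRAME.unpack(data)
    if sentinel != SENTINEL or frame_type != BARGE_IN:
        return None
    return uuid.UUID(bytes=raw_id)


class PushAudioNode:
    def __init__(self) -> None:
        self.outbound = bytearray()  # queued PCM awaiting playback

    def on_cancel(self, event: CancelEvent) -> None:
        # No branch on event.source: both detection modes look identical here.
        self.outbound.clear()


class DagExecutor:
    def __init__(self, nodes: List[PushAudioNode]) -> None:
        self.nodes = nodes
        self.cancels: List[CancelEvent] = []

    def on_cancel(self, event: CancelEvent) -> None:
        self.cancels.append(event)
        for node in self.nodes:
            node.on_cancel(event)
```

Because the node handler never inspects `event.source`, a cancel from `user_vad` and one from `server_vad` both clear the outbound buffer the same way.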

## Configuring turn detection per agent

Drop a `TurnDetection` block into your agent config:

```yaml
# agent-config.yml
entry_node: ASR
turn_detection:
  type: silero
  silence_threshold_ms: 500    # min silence before EOU fires
  min_speech_ms: 300           # filter out clicks / noise
  prefix_padding_ms: 200       # include this much pre-speech audio
nodes:
  ASR: { provider: deepgram, ... }
  LLM: { provider: anthropic, ... }
  TTS: { provider: elevenlabs, ... }
```

For a realtime entry node, omit the block entirely and the runtime
defaults to `server_vad`:

```yaml
entry_node: REALTIME
nodes:
  REALTIME: { provider: openai-realtime, model: gpt-4o-realtime-preview }
# turn_detection auto-promotes to {type: server_vad} based on entry_node
```

Override is always honored — set `type: silero` explicitly even with a
realtime entry if you want gateway-side detection.

## Per-vendor support matrix

Honest scorecard. Where a row shows ❌, the interrupt does not reach that
provider or vendor, and the agent may keep talking over the user; treat
this as a known limitation, not a configuration error.

### Agent runtime LLM/ASR/TTS providers

| Provider                                  | Cancel mid-flight? | Barge trigger                                        |
|-------------------------------------------|:------------------:|------------------------------------------------------|
| OpenAI Realtime                           | ✅                 | Local VAD or `speech_started` → `truncate`           |
| OpenAI HTTP / Anthropic / Gemini / Ollama | ❌ (timeout-only)  | Gateway silences playback; LLM completes server-side |
| Deepgram ASR streaming                    | ✅ (forward cancel)| Cancel propagates; partial transcript discarded      |
| Deepgram TTS / ElevenLabs TTS             | ✅ (clear buffer)  | Outbound PCM buffer cleared at PushAudio node        |

### WebRTC / streaming vendor adapters

| Vendor      | Path                  | Interrupt signal                      | Status                                       |
|-------------|-----------------------|---------------------------------------|----------------------------------------------|
| **SIP/RTP** | Native PSTN trunk     | on-device VAD → bridge control        | ✅ End-to-end                                 |
| Twilio      | Media Streams         | `clear` event                         | ⚠️ Adapter doesn't yet emit it on barge       |
| Vapi        | WebSocket transport   | `stop` message                        | ⚠️ Adapter doesn't yet emit it on barge       |
| LiveKit     | Room participant      | None (no clean media-plane interrupt) | ❌ Client-side action required                |
| Daily       | Room participant      | None                                  | ❌ Client-side action required                |
| Chime       | Meeting attendee      | None                                  | ❌ Client-side action required                |
| Browser     | WebTransport          | Caller mic → server VAD               | ✅ Same as SIP path                           |

For `❌` vendors, the AI's TTS keeps streaming; ClutchCall silences the
playback locally but the AI is unaware. For latency-sensitive UX, prefer
SIP/RTP, Browser, or — once the adapter work lands — Twilio/Vapi.

## Tuning parameters

| Field                   | Default | What it does                                          | When to change                                              |
|-------------------------|---------|-------------------------------------------------------|-------------------------------------------------------------|
| `silence_threshold_ms`  | 500     | Min trailing silence before EOU fires                 | Lower (300) for snappier replies; higher (800) for thinkers |
| `min_speech_ms`         | 300     | Discard candidate utterances shorter than this        | Raise to 500 if line noise causes false triggers            |
| `prefix_padding_ms`     | 200     | Audio kept before the speech-start marker             | Raise if first syllables are getting clipped on the ASR     |
| `silence_floor_dbfs`    | -45     | Signal level below which is treated as silence        | Quieter trunks need lower (-50); noisy ones higher (-40)    |
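
To see how these fields interact, the sketch below runs a simplified end-of-utterance check over per-frame signal levels. It is not the gateway's detector: a plain energy threshold stands in for the VAD model, `find_end_of_utterance` and the 20 ms frame size are assumptions for illustration, and `prefix_padding_ms` is left out because it affects what audio is forwarded, not when EOU fires.

```python
from typing import Iterable, Optional


def find_end_of_utterance(
    levels_dbfs: Iterable[float],
    frame_ms: int = 20,
    silence_threshold_ms: int = 500,
    min_speech_ms: int = 300,
    silence_floor_dbfs: float = -45.0,
) -> Optional[int]:
    """Return the time in ms at which EOU fires, or None if it never does."""
    speech_ms = 0   # accumulated speech in the current candidate utterance
    silence_ms = 0  # trailing silence since the last speech frame
    for i, level in enumerate(levels_dbfs):
        if level >= silence_floor_dbfs:
            speech_ms += frame_ms
            silence_ms = 0
            continue
        silence_ms += frame_ms
        if silence_ms >= silence_threshold_ms:
            if speech_ms >= min_speech_ms:
                return (i + 1) * frame_ms
            speech_ms = 0  # too short: treat as a click and discard
    return None
```

With the defaults, 400 ms of speech followed by silence fires EOU at 900 ms (400 ms of speech plus the 500 ms threshold). Dropping `silence_threshold_ms` to 300 moves that to 700 ms, which is the "snappier replies" trade-off from the table.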

## Common gotchas

- **Realtime entry node + explicit `type: silero`**: works, but you're
  doing VAD twice. The OpenAI server-side VAD will still fire and may
  produce `speech_started` events the runtime ignores. Pick one.
- **`silence_threshold_ms` too low (< 300)**: end-of-utterance fires on
  inter-word pauses; agent interrupts the user mid-sentence.
- **Cascaded pipeline with `server_vad`**: only the realtime model can
  produce server-VAD events. A Deepgram → Anthropic → ElevenLabs pipeline
  with `server_vad` will never emit a turn boundary. Misconfiguration —
  use `silero` instead.
- **HTTP LLM "barge" feels delayed**: Anthropic/Gemini/OpenAI HTTP run
  to completion server-side. ClutchCall silences the audio playback
  but the server still bills you for the full response. For tight
  barge-in budgets, use a Realtime model.
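
The first three gotchas can be caught before deploy with a small config check. This is a hypothetical helper, not a ClutchCall tool: `lint_turn_detection` and its warning strings are made up for illustration, and it assumes the agent config has already been parsed into an entry-node name and a `turn_detection` dict.

```python
from typing import List, Optional


def lint_turn_detection(entry_node: str, turn_detection: Optional[dict]) -> List[str]:
    """Hypothetical pre-deploy check; returns human-readable warnings."""
    td = turn_detection or {}
    td_type = td.get("type")
    warnings: List[str] = []
    if entry_node == "REALTIME" and td_type == "silero":
        warnings.append("realtime entry with explicit silero: VAD runs twice")
    if entry_node != "REALTIME" and td_type == "server_vad":
        warnings.append("cascaded pipeline with server_vad: no turn boundary will fire")
    threshold = td.get("silence_threshold_ms")
    if threshold is not None and threshold < 300:
        warnings.append("silence_threshold_ms < 300: EOU may fire on inter-word pauses")
    return warnings
```

An empty list means none of these three patterns were found; it says nothing about vendor-side limitations from the support matrix.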

## Where to dig further

- `get_page rpc/audio-frames` — wire format for the PCM/PCMU stream that
  feeds the gateway VAD.
- `get_page rpc/method-ids` — `Barge` method ID for operator-initiated
  interrupts.
- `get_page admin/agent-dags` — full `TurnDetection` schema in the agent
  config.
