Inference — SDK Methods

The inference modality has no standalone typed client — it rides the Voice surface. You attach a speech-to-speech agent to a call with voice.agents.attach, configure turn detection and the model codec, and (optionally) tap the audio bridge. This page is the inference-relevant slice of the Voice SDK plus the turn-detection knobs.

A dedicated typed Inference client is in Preview; see the bottom of this page. For shipping code today, use the Voice surface documented here.

Import

The agent path lives on the voice subpath:

TypeScript

import { Voice } from "@clutchcall/sdk/voice";

const voice = new Voice({ baseUrl: BASE_URL, apiKey: KEY, orgId: ORG });

Python

from clutchcall.voice import Voice

voice = Voice(base_url=BASE_URL, api_key=KEY, org_id=ORG)

Go, Rust, Java, and .NET bindings expose the same shapes (snake_case in Python, PascalCase methods in Go/.NET).

`voice.agents.attach`

Bind a speech-to-speech agent to a live (or just-originated) call. The runtime bridges the call’s uplink / downlink audio tracks to the model and owns turn-taking for the duration of the call.

TypeScript

await voice.agents.attach(callSid, agent);

Python

voice.agents.attach(call_sid, agent)

callSid (string, required)

The call sid to bind the agent to, from originate() or an inbound answer.

agent (string | AgentSpec, required)

Either an agent id (string) registered in the control plane, or an inline agent spec selecting the speech-to-speech model and its turn-detection config.

Returns Promise<void> — resolves once the agent is bridged. Audio begins flowing immediately; the first agent utterance is gated by turn detection.

Attaching at originate time

originate accepts an agent directly, so the common case is one call:

TypeScript

const call = await voice.calls.originate({
  to:      "+15551234567",
  from:    "+15558675309",
  trunkId: "trunk_main",
  agent:   "support-s2s",     // attach the speech-to-speech agent inline
});

Python

call = voice.calls.originate(
    to="+15551234567",
    from_="+15558675309",
    trunk_id="trunk_main",
    agent="support-s2s",
)

The agent spec

An inline AgentSpec selects the model leg and the turn-detection policy. The shape mirrors the agent config the control plane stores.

type AgentSpec = {
  // The model the runtime serves on the call's audio legs. Describe the model
  // as a speech-to-speech model in your registry; do not name a vendor here.
  mode: "speech_to_speech";
  modelCodec?: AudioCodec;        // what the model ingests/emits (default "pcm16")
  inputRateHz?: number;           // model input rate (default 16000)
  outputRateHz?: number;          // model output rate (default 24000)
  turnDetection?: TurnDetection;  // turn-taking + barge-in policy (below)
};

type AudioCodec = "opus" | "pcm16" | "g711_ulaw" | "g711_alaw";
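To see what an omitted field resolves to, the sketch below applies the defaults stated in the comments above. The helper `resolveAgentSpec` and the local type names are hypothetical and not part of the SDK; `turnDetection` is left out to keep the example short.

```typescript
// Local copies of the page's types so the sketch stands alone.
type CodecName = "opus" | "pcm16" | "g711_ulaw" | "g711_alaw";

type InlineAgentSpec = {
  mode: "speech_to_speech";
  modelCodec?: CodecName;
  inputRateHz?: number;
  outputRateHz?: number;
};

type ResolvedAgentSpec = {
  mode: "speech_to_speech";
  modelCodec: CodecName;
  inputRateHz: number;
  outputRateHz: number;
};

// Hypothetical helper: fill in the documented defaults for omitted fields.
function resolveAgentSpec(spec: InlineAgentSpec): ResolvedAgentSpec {
  return {
    mode: spec.mode,
    modelCodec: spec.modelCodec === undefined ? "pcm16" : spec.modelCodec,
    inputRateHz: spec.inputRateHz === undefined ? 16000 : spec.inputRateHz,
    outputRateHz: spec.outputRateHz === undefined ? 24000 : spec.outputRateHz,
  };
}

// Only `mode` given: every other field falls back to its default.
const minimalSpec = resolveAgentSpec({ mode: "speech_to_speech" });

// Explicit values are kept as written.
const ulawSpec = resolveAgentSpec({
  mode: "speech_to_speech",
  modelCodec: "g711_ulaw",
  inputRateHz: 8000,
  outputRateHz: 8000,
});
```

Spelling the defaults out this way can make a config review easier, though whether the runtime resolves them on the client or the server is not stated on this page.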

ClutchCall resamples between the caller leg rate (commonly 8 kHz on PSTN) and the model’s inputRateHz / outputRateHz. You set the model’s rates; the caller leg is handled at the bridge.
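To make the rate relationship concrete, the arithmetic below works out how many mono PCM16 samples and bytes one frame carries at each rate. The 20 ms frame duration and the helper names are assumptions for illustration, not SDK values; the rates are the defaults above plus the 8 kHz PSTN case.

```typescript
// Hypothetical helper: samples in one mono frame at a given sample rate.
function samplesPerFrame(rateHz: number, frameMs: number): number {
  return (rateHz * frameMs) / 1000;
}

// PCM16 stores each sample in 2 bytes (mono assumed).
function pcm16BytesPerFrame(rateHz: number, frameMs: number): number {
  return samplesPerFrame(rateHz, frameMs) * 2;
}

const FRAME_MS = 20; // assumed frame duration, not an SDK constant

const callerSamples = samplesPerFrame(8000, FRAME_MS);    // 160 samples on the PSTN leg
const modelInSamples = samplesPerFrame(16000, FRAME_MS);  // 320 samples into the model
const modelOutSamples = samplesPerFrame(24000, FRAME_MS); // 480 samples out of the model

const modelInBytes = pcm16BytesPerFrame(16000, FRAME_MS); // 640 bytes per frame

// Ratios the bridge has to cover between the caller leg and the model.
const uplinkRatio = 16000 / 8000;   // 2x upsample toward the model
const downlinkRatio = 24000 / 8000; // 3x downsample toward the caller
```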

Turn detection

The TurnDetection block is the heart of this modality — it owns the client-commit gate, backchannel suppression, and hold-and-confirm barge-in. Drop it into the agent spec, or set it server-side in the agent config.

type TurnDetection = {
  responseMinSpeechMs?: number;   // bursts shorter than this = backchannel, dropped
  commitMinIntervalMs?: number;   // cooldown between turn commits
  cancelMinIntervalMs?: number;   // cooldown between barge-ins
  bargeConfirmMs?: number;        // sustain speech this long before cancelling the agent
  ttsGuardMs?: number;            // raise the mic gate this long after agent audio
  silenceThresholdMs?: number;    // trailing silence before end-of-turn fires
  minSpeechMs?: number;           // discard candidate utterances shorter than this
  prefixPaddingMs?: number;       // audio kept before the speech-start marker
};

Field	Default	What it does	When to change
`responseMinSpeechMs`	600	Speech shorter than this is a backchannel — never commits a turn	Lower (400) to let very short utterances answer; raise to ignore more continuers
`commitMinIntervalMs`	1500	Minimum gap between two turn commits	Raise if the agent over-answers rapid speech
`cancelMinIntervalMs`	400	Minimum gap between barge-ins	Raise on noisy trunks that false-trigger
`bargeConfirmMs`	300	Speech must be sustained this long over the agent before it is cancelled	`0` = cancel on first frame (browser/AEC legs); raise on noisy legs
`ttsGuardMs`	200	Mic gate is raised for this long after each agent chunk	Raise on no-AEC SIP/PSTN legs where TTS bleeds into the mic
`silenceThresholdMs`	500	Trailing silence before end-of-turn fires	Lower (300) for snappier replies; raise (800) for callers who pause to think
`minSpeechMs`	300	Discard candidate utterances shorter than this	Raise to 500 if line noise causes false triggers
`prefixPaddingMs`	200	Audio kept before the speech-start marker	Raise if first syllables get clipped
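The sketch below is one plausible reading of how the thresholds in the table interact, not the runtime's actual implementation. The function names, the verdict labels, and the order of the checks are assumptions; the default values are copied from the table.

```typescript
type GateConfig = {
  responseMinSpeechMs: number;
  commitMinIntervalMs: number;
  cancelMinIntervalMs: number;
  bargeConfirmMs: number;
  minSpeechMs: number;
};

// Defaults copied from the table above.
const GATE_DEFAULTS: GateConfig = {
  responseMinSpeechMs: 600,
  commitMinIntervalMs: 1500,
  cancelMinIntervalMs: 400,
  bargeConfirmMs: 300,
  minSpeechMs: 300,
};

type BurstVerdict = "discarded" | "backchannel" | "cooldown" | "commit";

// Hypothetical: decide what happens to a finished burst of caller speech.
function classifyBurst(
  speechMs: number,
  msSinceLastCommit: number,
  cfg: GateConfig = GATE_DEFAULTS
): BurstVerdict {
  if (speechMs < cfg.minSpeechMs) return "discarded";            // too short to be an utterance
  if (speechMs < cfg.responseMinSpeechMs) return "backchannel";  // continuer, never commits
  if (msSinceLastCommit < cfg.commitMinIntervalMs) return "cooldown"; // too soon after last commit
  return "commit";
}

// Hypothetical: decide whether caller speech over the agent cancels the reply.
function shouldCancelAgent(
  sustainedMs: number,
  msSinceLastCancel: number,
  cfg: GateConfig = GATE_DEFAULTS
): boolean {
  return sustainedMs >= cfg.bargeConfirmMs && msSinceLastCancel >= cfg.cancelMinIntervalMs;
}

const shortNoise = classifyBurst(150, 5000);  // "discarded"
const continuer = classifyBurst(450, 5000);   // "backchannel"
const tooSoon = classifyBurst(900, 800);      // "cooldown"
const realTurn = classifyBurst(900, 5000);    // "commit"

// With bargeConfirmMs set to 0, the first frame of speech cancels the agent.
const browserLeg: GateConfig = { ...GATE_DEFAULTS, bargeConfirmMs: 0 };
const instantBarge = shouldCancelAgent(0, 1000, browserLeg); // true
```

Under this reading, a 450 ms continuer with the defaults never reaches the model as a turn, which matches the "backchannel" row of the table.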

Setting silenceThresholdMs below ~300 fires end-of-turn on inter-word pauses, so the agent interrupts the caller mid-sentence. Keep it at or above 300.
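A config loader can enforce that floor before the value reaches the agent spec. The sketch below is a minimal example; `checkSilenceThreshold` is a hypothetical helper, and clamping (rather than rejecting) is a choice made here for illustration, not documented SDK behavior.

```typescript
const SILENCE_FLOOR_MS = 300; // floor recommended in the note above

type SilenceCheck = { valueMs: number; warning: string | null };

// Hypothetical helper: clamp silenceThresholdMs to the floor and report why.
function checkSilenceThreshold(requestedMs: number): SilenceCheck {
  if (requestedMs < SILENCE_FLOOR_MS) {
    return {
      valueMs: SILENCE_FLOOR_MS,
      warning:
        "silenceThresholdMs " + requestedMs + " is below " + SILENCE_FLOOR_MS +
        "; clamped to avoid end-of-turn on inter-word pauses",
    };
  }
  return { valueMs: requestedMs, warning: null };
}

const tooLow = checkSilenceThreshold(200);  // clamped to 300, with a warning
const atFloor = checkSilenceThreshold(300); // accepted as-is
const relaxed = checkSilenceThreshold(800); // accepted as-is
```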

Tapping the audio bridge (optional)

You usually don’t touch raw frames — the runtime bridges the model leg for you. When you do need to observe or inject audio (recording sidecar, custom DSP), attach the Voice AudioBridge to the same call.

TypeScript

const bridge = await voice.audioBridge.attach(call.sid, {
  codec: "pcm16",
  onUplink: (frame, tsUs) => recorder.feed(frame),   // caller audio (also fed to the model)
});

Python

bridge = voice.audio_bridge.attach(
    call.sid,
    codec="pcm16",
    on_uplink=lambda frame, ts_us: recorder.feed(frame),
)

AudioBridge methods relevant here:

Method	Signature	Notes
`publishDownlink`	`(frame: Uint8Array, timestampUs?: bigint)`	Inject audio toward the caller (e.g. a pre-roll prompt before the model speaks).
`publishUplink`	`(frame: Uint8Array, timestampUs?: bigint)`	Inject audio toward the model (browser caller path).
`close`	`()`	Detach the tap. The agent stays attached.

See Voice — SDK Methods for the full AudioBridge surface.
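Injecting a pre-roll prompt through publishDownlink means splitting a PCM16 buffer into frames first. The sketch below shows one way to do that, assuming mono audio and a 20 ms frame; the helper `chunkPcm16`, the frame duration, and the idea of deriving `timestampUs` from a running offset are assumptions, not SDK behavior.

```typescript
type TimedFrame = { frame: Uint8Array; offsetUs: number };

// Hypothetical helper: split mono PCM16 audio into fixed-duration frames.
// Each frame carries its offset from the start of the buffer in microseconds.
function chunkPcm16(pcm: Uint8Array, rateHz: number, frameMs: number): TimedFrame[] {
  const bytesPerFrame = ((rateHz * frameMs) / 1000) * 2; // 2 bytes per sample
  if (bytesPerFrame <= 0 || Math.floor(bytesPerFrame) !== bytesPerFrame) {
    throw new Error("rate and frame duration must give a whole number of bytes");
  }
  const frames: TimedFrame[] = [];
  let index = 0;
  for (let start = 0; start < pcm.length; start += bytesPerFrame) {
    const end = Math.min(start + bytesPerFrame, pcm.length);
    // subarray returns a view, so no audio is copied here.
    frames.push({ frame: pcm.subarray(start, end), offsetUs: index * frameMs * 1000 });
    index++;
  }
  return frames;
}

// 800 bytes at 8 kHz is 50 ms of audio: two full 20 ms frames and one 10 ms tail.
const preRoll = chunkPcm16(new Uint8Array(800), 8000, 20);
```

Each entry could then be passed to `bridge.publishDownlink(f.frame, BigInt(f.offsetUs))`, assuming `timestampUs` is a media timestamp in microseconds; the page does not say whether the final short frame needs padding.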

Events

The agent leg surfaces turn-boundary and barge events through the call’s status stream. Subscribe with the Voice call handle:

TypeScript

call.onStatus((s) => {
  // dialing → ringing → in_progress → completed | failed | no_answer
  if (s === "in_progress") console.log("agent bridged, listening");
});

Python

call.on_status(lambda s: print("status:", s))

Event	Fires when
turn committed	The turn detector decided a real turn ended; the model is generating its reply.
barge-in confirmed	Sustained speech over the agent passed `bargeConfirmMs`; the in-flight reply is cancelled.
backchannel dropped	A short continuer was suppressed; the agent kept the floor (diagnostic).
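A status handler often needs to know two things: whether the agent was ever bridged, and how the call ended. The sketch below derives both from the status sequence shown in the handler comment above; the type and function names are hypothetical, and treating `completed`, `failed`, and `no_answer` as terminal is inferred from that comment.

```typescript
type CallStatusName =
  | "dialing" | "ringing" | "in_progress"
  | "completed" | "failed" | "no_answer";

// Statuses after which no further transitions are expected (inferred).
function isTerminalStatus(s: CallStatusName): boolean {
  return s === "completed" || s === "failed" || s === "no_answer";
}

type CallSummary = { bridged: boolean; outcome: CallStatusName | null };

// Hypothetical reducer over the statuses a handler has seen so far.
function summarizeStatuses(seen: CallStatusName[]): CallSummary {
  const summary: CallSummary = { bridged: false, outcome: null };
  for (const s of seen) {
    if (s === "in_progress") summary.bridged = true; // agent bridged, listening
    if (isTerminalStatus(s) && summary.outcome === null) summary.outcome = s;
  }
  return summary;
}

const answered = summarizeStatuses(["dialing", "ringing", "in_progress", "completed"]);
const unanswered = summarizeStatuses(["dialing", "ringing", "no_answer"]);
const stillDialing = summarizeStatuses(["dialing"]);
```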

Preview: a dedicated typed Inference client

Preview. A standalone Inference client (a typed handle that wraps agent attach, the commit gate, and audio bridging behind one object) is in design. The shape below is forward-looking and may change — for shipping code, attach via voice.agents.attach as shown above.

// PREVIEW — not yet stable
import { Inference } from "@clutchcall/sdk/inference";

const session = await Inference.attach(call.sid, {
  model: "speech-to-speech",
  codec: "pcm16",
  turnDetection: { responseMinSpeechMs: 600, bargeConfirmMs: 300 },
  onTurn:    (t) => console.log("turn latency ms:", t.latencyMs),
  onBargeIn: ()  => console.log("caller interrupted"),
});
await session.close();

Related

Inference — Details: wire model, the commit gate, turn-latency metric
Inference — Cookbook: short task snippets
Voice — SDK Methods: the full call + audio-bridge surface
