Conversational Voice AI: From Robotic IVRs to Natural Conversations
How Cadences connects Twilio, ElevenLabs and Durable Objects to create voice agents that hold natural phone conversations in 28+ languages — with real-time business context.
"Press 1 for sales, 2 for support, 3 for billing…" We've all suffered through an IVR. Endless menus, voice recognition that fails 40% of the time, and a sense of frustration that starts before you even talk to anyone. Cadences' conversational voice AI eliminates all of that: an artificial intelligence agent answers the call, understands what you say, queries your history in real time and responds with a natural voice in whatever language you prefer.
The key stat
Companies adopting conversational voice AI report a 60% reduction in wait times and a 35% increase in customer satisfaction. In Cadences, the voice agent can handle calls 24/7 with minimal setup: configure the prompt and connect a phone number.
Traditional IVR vs. Voice AI
The fundamental difference isn't just technological — it's experiential. An IVR is a fixed decision tree; a voice AI agent is an open conversation with real-time data access.
| Aspect | Traditional IVR | Voice AI in Cadences |
|---|---|---|
| Interaction | "Press 1, 2 or 3" | Free conversation |
| Understanding | Fixed keywords | Full contextual NLU |
| Voice | Robotic / pre-recorded | Cloned neural AI, 28+ languages |
| Context | None | CRM + history + pipeline |
| Interruptions | Not supported | Natural detection and response |
| Languages | 1-2 pre-recorded | 28+ automatic via multilingual model |
| Setup | Weeks of recording | One prompt + phone number |
The Triangle: Twilio ↔ Durable Object ↔ ElevenLabs
Voice AI in Cadences works as a real-time bridge between three systems. There are no polling intermediaries or processing queues — everything happens over bidirectional WebSocket with sub-500ms latency.
Twilio
Global phone network. Receives the call, converts audio to digital stream via Media Streams.
Durable Object
WebSocket bridge on Cloudflare. Maintains session state, translates formats between Twilio and ElevenLabs.
ElevenLabs
Conversational AI. Speech-to-text, LLM reasoning, and text-to-speech with cloned neural voice.
Incoming call → Twilio
A customer calls the configured number. Twilio generates a webhook with the call data (From, To, CallSid) and executes the TwiML with <Connect><Stream> to initiate a Media Stream via WebSocket.
Stream → Durable Object
Twilio's WebSocket connects to the ElevenLabsMediaStream Durable Object. Upon receiving the start event, the DO extracts custom parameters (agentId, language, context, first message) and establishes a connection with ElevenLabs via Signed URL.
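As a minimal sketch of the parameter extraction described above: Twilio delivers the TwiML `<Parameter>` values inside the `start` event under `start.customParameters`. The function name `extractStartParams`, the returned field names and the `'en'` fallback are assumptions made for illustration, not the actual ElevenLabsMediaStream code.

```javascript
// Sketch: pull the custom <Parameter> values out of Twilio's "start" event.
// The returned shape and the default language are illustrative assumptions.
function extractStartParams(rawMessage) {
  const msg = JSON.parse(rawMessage);
  if (msg.event !== 'start') return null; // only the start event carries parameters

  const start = msg.start || {};
  const custom = start.customParameters || {};
  return {
    streamSid: start.streamSid,
    callSid: start.callSid,
    agentId: custom.agentId,
    language: custom.language || 'en', // assumed fallback
    callerName: custom.callerName || '',
    contextPrompt: custom.contextPrompt || '',
  };
}
```

The `streamSid` is worth keeping in session state, because every message later sent back to Twilio has to reference it.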
Real-time bidirectional audio
Each audio chunk from the customer (media event) is repackaged as user_audio_chunk and sent to ElevenLabs. The agent's audio response arrives as base64 and is forwarded to Twilio as a media payload. The customer hears the agent's voice with no perceptible delay.
Smart interruptions
If the user talks while the agent is responding, ElevenLabs sends an interruption event. The DO immediately sends a clear to Twilio to flush the audio buffer, and the agent reformulates its response. This creates a natural conversation without "talking over each other".
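The two translation steps above (audio forwarding and interruption handling) can be sketched as a pair of pure functions. This assumes the ElevenLabs agent is configured for the same μ-law 8 kHz audio that Twilio Media Streams uses, so payloads pass through without transcoding. The function names are hypothetical, and other event types (such as ping) are ignored here.

```javascript
// Twilio -> ElevenLabs: returns the JSON string to forward, or null.
function twilioToElevenLabs(rawMessage) {
  const msg = JSON.parse(rawMessage);
  if (msg.event === 'media' && msg.media && msg.media.payload) {
    // Twilio's payload is already base64, so it is forwarded unchanged.
    return JSON.stringify({ user_audio_chunk: msg.media.payload });
  }
  return null; // start/stop/mark events are handled elsewhere
}

// ElevenLabs -> Twilio: returns the list of JSON strings to send to Twilio.
function elevenLabsToTwilio(rawMessage, streamSid) {
  const msg = JSON.parse(rawMessage);
  switch (msg.type) {
    case 'audio': {
      const audio = msg.audio_event && msg.audio_event.audio_base_64;
      if (!audio) return [];
      return [JSON.stringify({ event: 'media', streamSid, media: { payload: audio } })];
    }
    case 'interruption':
      // The caller started talking: flush whatever Twilio still has buffered.
      return [JSON.stringify({ event: 'clear', streamSid })];
    default:
      return []; // not handled in this sketch
  }
}
```

Keeping the translation free of socket code makes it easy to unit-test; the Durable Object would only need to call these functions from its WebSocket message handlers.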
Call ended → Cleanup
When the caller hangs up, Twilio sends stop. The DO closes both WebSockets, releases resources and the Durable Object goes back to hibernation. Cost: only for the seconds of active conversation.
TwiML: Connecting the Call to the Agent
When Twilio receives a call, it executes a webhook that generates dynamic TwiML. This XML tells Twilio to open a WebSocket Media Stream to the Durable Object instead of playing an IVR, injecting business context as parameters:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Connect>
    <Stream url="wss://cadences.app/api/elevenlabs/media-stream">
      <Parameter name="agentId" value="agent_abc123" />
      <Parameter name="language" value="en" />
      <Parameter name="callerName" value="John Smith" />
      <Parameter name="contextPrompt"
                 value="Dental clinic. Patient with appointment on 02/15. Treatment: orthodontics." />
    </Stream>
  </Connect>
</Response>
```
Notice the contextPrompt parameter: this is where you inject the business information. The ElevenLabs agent receives this context before the first word and can use it throughout the entire conversation. "Good morning John, I see you have an orthodontics appointment on February 15th. How can I help you?"
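Because `contextPrompt` is free text taken from business data, a webhook that generates this TwiML has to escape it before placing it in an XML attribute. The following is a minimal sketch of such a generator; `buildStreamTwiml` and `escapeXmlAttr` are hypothetical helper names, not part of Cadences or Twilio.

```javascript
// Escape a value so it is safe inside a double-quoted XML attribute.
function escapeXmlAttr(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Build the <Connect><Stream> TwiML with one <Parameter> per entry in params.
function buildStreamTwiml(streamUrl, params) {
  const parameterTags = Object.entries(params).map(
    ([name, value]) =>
      `      <Parameter name="${escapeXmlAttr(name)}" value="${escapeXmlAttr(value)}" />`
  );
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<Response>',
    '  <Connect>',
    `    <Stream url="${escapeXmlAttr(streamUrl)}">`,
    ...parameterTags,
    '    </Stream>',
    '  </Connect>',
    '</Response>',
  ].join('\n');
}
```

A webhook handler would call `buildStreamTwiml('wss://cadences.app/api/elevenlabs/media-stream', { agentId, language, callerName, contextPrompt })` and return the result with a `text/xml` content type.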
Dynamic Variables and Business Context
What turns a generic voice agent into a truly useful one is context. The Durable Object sends an initialization message to ElevenLabs with the conversation config and the caller's dynamic variables:
```javascript
// Message sent when connecting to ElevenLabs
const initMessage = {
  type: 'conversation_initiation_client_data',
  conversation_config_override: {
    agent: {
      language: 'en',
      prompt: {
        prompt: contextPrompt // Business context
      },
      first_message: 'Hello John, welcome to the Dental Clinic...'
    }
  },
  dynamic_variables: {
    caller_number: '+14155551234',
    caller_name: 'John Smith',
    call_sid: 'CA1234567890abcdef',
    workflow_id: 'wf_dental_appointment'
  }
};
```

Dynamic variables allow the same voice agent to behave differently depending on who calls. If the CRM says John has an overdue invoice, the agent knows it before he even asks.
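To illustrate the overdue-invoice case, here is a sketch of how a CRM record could be folded into the initialization message. The CRM record shape (`name`, `phone`, `overdueInvoices`), the function `buildInitMessage` and the `has_overdue_invoice` variable are all hypothetical; the article does not describe the actual Cadences CRM schema.

```javascript
// Hypothetical CRM record: { name, phone, overdueInvoices: [{ id, amount }] }
function buildInitMessage(crmRecord, callSid, basePrompt) {
  const overdue = crmRecord.overdueInvoices || [];
  // Append a short note to the prompt only when there is something to say.
  const note = overdue.length > 0
    ? ` The caller has ${overdue.length} overdue invoice(s).`
    : '';

  return {
    type: 'conversation_initiation_client_data',
    conversation_config_override: {
      agent: {
        language: 'en',
        prompt: { prompt: basePrompt + note },
        first_message: `Hello ${crmRecord.name}, how can I help you today?`,
      },
    },
    dynamic_variables: {
      caller_number: crmRecord.phone,
      caller_name: crmRecord.name,
      call_sid: callSid,
      has_overdue_invoice: overdue.length > 0, // illustrative variable name
    },
  };
}
```

The Durable Object would send `JSON.stringify(buildInitMessage(...))` as the first message once the ElevenLabs WebSocket opens.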
Neural Voices and Customization
Cadences includes a catalog of configurable voices with fine control over stability, expressiveness and similarity. The eleven_multilingual_v2 model allows the same voice to speak Spanish, English, French or Japanese without changing profiles.
Cristina
Default voice. Professional female voice in native Spanish. Ideal for customer support and medical queries. Clear and empathetic.
Mario Osborne
Spanish. Professional and clear male voice. Perfect for corporate communications and tutorials.
Mario
Latin American Spanish. Warm and friendly voice in Latin American Spanish. Automatically adapts to regional context.
Multilingual voices
28+ languages. English (US/UK/AU), Portuguese, French, German, Italian, Japanese, Korean, Chinese and more. All with the same neural model.
| Parameter | Range | Default | Effect |
|---|---|---|---|
| Stability | 0 – 1 | 0.50 | Higher = more consistent voice, lower = more expressive |
| Similarity Boost | 0 – 1 | 0.75 | Higher = more faithful to the original cloned voice |
| Style | 0 – 1 | 0.00 | Expressiveness level (advanced models only) |
| Speaker Boost | on/off | on | Improves voice clarity (recommended on) |
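A small sketch of how the ranges and defaults in this table could be enforced before a voice configuration is saved. The field names follow ElevenLabs' `voice_settings` object (`stability`, `similarity_boost`, `style`, `use_speaker_boost`); the clamping behavior and the function `normalizeVoiceSettings` are assumptions, since the article does not say how Cadences handles out-of-range values.

```javascript
// Defaults taken from the table above.
const DEFAULT_VOICE_SETTINGS = {
  stability: 0.5,
  similarity_boost: 0.75,
  style: 0.0,
  use_speaker_boost: true,
};

// Clamp a numeric setting into [0, 1]; fall back when it is missing or invalid.
function clamp01(value, fallback) {
  if (value === undefined || value === null) return fallback;
  const n = Number(value);
  if (!Number.isFinite(n)) return fallback;
  return Math.min(1, Math.max(0, n));
}

function normalizeVoiceSettings(input = {}) {
  const d = DEFAULT_VOICE_SETTINGS;
  return {
    stability: clamp01(input.stability, d.stability),
    similarity_boost: clamp01(input.similarity_boost, d.similarity_boost),
    style: clamp01(input.style, d.style),
    use_speaker_boost:
      typeof input.use_speaker_boost === 'boolean' ? input.use_speaker_boost : d.use_speaker_boost,
  };
}
```

Clamping silently is only one option; rejecting out-of-range values with an error may be preferable when the settings come from a user-facing form.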
Outbound Calls from Your Application
Besides incoming calls, Cadences enables outbound calls from any storefront via API. A clinic can automate appointment reminders, an e-commerce store can confirm orders, and a sales team can follow up with leads, all with personalized voice AI.
```
POST /api/storefront/voice-call
{
  "phoneNumber": "+14155551234",
  "cadences_prompt": "Remind John about his orthodontics appointment tomorrow at 10:00 AM. Ask if he confirms or wants to reschedule.",
  "callType": "reminder",
  "metadata": {
    "patientId": "pat_001",
    "appointmentId": "apt_2025_0215",
    "organizationId": "org_clinic23"
  }
}

// Response
{
  "success": true,
  "callId": "call_1705312800",
  "conversationId": "conv_abc123",
  "status": "initiated"
}
```

The proxy handles authentication with ElevenLabs and Twilio, so your storefront never exposes API keys. The call is logged in D1 with the conversation ID for later follow-up.
Proxy security
ElevenLabs credentials (xi-api-key) and the Twilio phone number ID live exclusively in the Worker's environment variables. The storefront only sends the prompt and business metadata — no access to API keys.
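From the storefront side, the outbound call is an ordinary JSON POST. The sketch below builds the request in a form that can be passed to `fetch`. The base URL `https://cadences.app` is assumed from the stream URL shown earlier, `buildVoiceCallRequest` is a hypothetical helper, and any storefront authentication is omitted because the article does not describe it.

```javascript
// Build a fetch-ready request for the outbound call endpoint.
function buildVoiceCallRequest(baseUrl, call) {
  return {
    url: new URL('/api/storefront/voice-call', baseUrl).toString(),
    init: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        phoneNumber: call.phoneNumber,
        cadences_prompt: call.prompt,
        callType: call.callType,
        metadata: call.metadata || {},
      }),
    },
  };
}

// Usage (inside an async function):
//   const { url, init } = buildVoiceCallRequest('https://cadences.app', {
//     phoneNumber: '+14155551234',
//     prompt: 'Remind John about his appointment tomorrow at 10:00 AM.',
//     callType: 'reminder',
//     metadata: { patientId: 'pat_001' },
//   });
//   const result = await (await fetch(url, init)).json();
```

Note that the request carries only the prompt and business metadata, matching the point above that no ElevenLabs or Twilio credentials leave the Worker.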
28+ Languages with a Single Model
Thanks to the eleven_multilingual_v2 model, the same voice can speak in any of these languages without needing to configure separate voices. Simply change the language parameter in the agent configuration.
Español (ES/MX/AR)
English (US/UK/AU)
Português (BR/PT)
Français
Deutsch
Italiano
日本語
한국어
Where Voice AI Shines
Clinics & healthcare
Appointment reminders, automatic confirmations, pre-operative instructions. The agent knows patient data and can reschedule on the spot.
E-commerce
Order confirmation, shipping status, return management. "Your order #4521 was shipped yesterday, you'll receive it tomorrow between 10 AM and 2 PM."
Sales & follow-up
Automated follow-up after demos, phone-based lead qualification, and meeting scheduling. With integrated CRM data.
24/7 technical support
Common issue resolution, smart escalation to humans when the agent detects it can't resolve, and post-call satisfaction surveys.
Desktop Application for Audio Management
Beyond telephony, Cadences includes Audio Hub: an Electron desktop application for generating, previewing and managing AI audio content. Batch Text-to-Speech, voice cloning, and export in multiple formats — all connected to the same ElevenLabs infrastructure.
Generate hundreds of audio clips from a spreadsheet or text list.
Listen to each clip, adjust parameters and regenerate before using in production.
7-day cache, 50 MB max. Avoids regenerating clips that already exist.
Production Controls
Voice AI has strict controls to prevent abuse and manage costs. Each parameter is configurable per organization.
| Control | Default Value | Purpose |
|---|---|---|
| Max text | 5,000 characters | Prevent excessive TTS prompts |
| Max recording | 180 sec (3 min) | Limit user recording duration |
| Rate limit | 10 req/min | Protect against API abuse |
| Audio format | MP3 44100 128kbps | Quality/size balance |
| Cache | 7 days, 50 MB | Reuse generated audio |
| Phone | E.164 format | Strict number validation |
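Three of these controls (text length, E.164 validation and the rate limit) are simple enough to sketch in a few lines. This is an illustrative in-memory version with hypothetical function names; a real Worker would need shared state (for example in a Durable Object) for the rate limiter, and the article does not say how Cadences implements it.

```javascript
const LIMITS = { maxTextChars: 5000, maxRequestsPerMinute: 10 };

// E.164: a leading "+", a non-zero first digit, at most 15 digits in total.
function isE164(phone) {
  return /^\+[1-9]\d{1,14}$/.test(phone);
}

// Return a list of human-readable validation errors (empty when valid).
function validateRequest(text, phone) {
  const errors = [];
  if (text.length > LIMITS.maxTextChars) errors.push('text exceeds 5,000 characters');
  if (!isE164(phone)) errors.push('phone number is not in E.164 format');
  return errors;
}

// Sliding-window limiter: allows at most maxRequests calls per windowMs.
function createRateLimiter(maxRequests, windowMs) {
  const timestamps = [];
  return function allow(now = Date.now()) {
    // Drop timestamps that have fallen out of the window.
    while (timestamps.length > 0 && now - timestamps[0] >= windowMs) timestamps.shift();
    if (timestamps.length >= maxRequests) return false;
    timestamps.push(now);
    return true;
  };
}
```

For example, `createRateLimiter(LIMITS.maxRequestsPerMinute, 60000)` returns a function that yields `true` for the first ten calls within a minute and `false` for the eleventh.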
Cost Model: You Only Pay for What You Use
The voice AI infrastructure in Cadences runs on a pure serverless model. No servers running 24/7, no monthly infrastructure minimums. The bill is composed of three elements:
Twilio
Per-minute call cost based on destination. Media Streams included. Typically $0.013 – $0.022/min for domestic calls.
ElevenLabs
Cost per character generated. The Scale plan includes Conversational AI. Advanced neural voices enable natural conversations with no setup cost.
Cloudflare Durable Objects
Marginal cost per active WebSocket duration. The DO hibernates between calls — no cost when there are no active conversations.
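As a rough worked example of the telephony part of the bill, the per-minute range quoted above can be multiplied out for a given call volume. The volume used here (1,000 calls averaging 3 minutes) is an arbitrary illustration. The sketch covers only the Twilio component, since the article gives no rates for ElevenLabs or Durable Objects, and it ignores any per-call rounding.

```javascript
// Estimate the Twilio portion of the bill for a batch of calls.
function estimateTwilioCost(calls, avgMinutesPerCall, ratePerMinute) {
  return calls * avgMinutesPerCall * ratePerMinute;
}

const low = estimateTwilioCost(1000, 3, 0.013);
const high = estimateTwilioCost(1000, 3, 0.022);
console.log(`Twilio estimate: $${low.toFixed(2)} to $${high.toFixed(2)}`);
// prints "Twilio estimate: $39.00 to $66.00"
```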
Voice as a First-Class Interface
Conversational voice AI in Cadences isn't a secondary feature — it's a first-class interface on the same level as the web dashboard or the REST API. A voice agent has access to the same CRM, the same pipeline, the same data as any other entry point into the system.
The combination of Twilio (global phone network), a Durable Object as a smart bridge, and ElevenLabs (neural voice in 28+ languages) creates a voice stack that is simultaneously powerful and simple to configure. You don't need telephony engineers, you don't need to record IVR prompts, you don't need to hire a call center.
Technical summary
- ✦ Twilio Media Streams + Durable Objects + ElevenLabs Conversational AI
- ✦ Bidirectional WebSocket with sub-500ms latency
- ✦ 28+ languages with multilingual model, same voice
- ✦ Business context injected in real time from CRM
- ✦ Natural interruptions with auto buffer clearing
- ✦ Outbound call API for storefronts without exposing credentials
- ✦ Audio Hub desktop for batch TTS management
- ✦ Serverless model: pay only for active conversations
Cadences Engineering
Technical documentation from the engineering team