AI & Machine Learning · 14 min read

Conversational Voice AI: From Robotic IVRs to Natural Conversations

How Cadences connects Twilio, ElevenLabs and Durable Objects to create voice agents that hold natural phone conversations in 28+ languages — with real-time business context.

"Press 1 for sales, 2 for support, 3 for billing…" We've all suffered through an IVR. Endless menus, voice recognition that fails 40% of the time, and a sense of frustration that starts before you even talk to anyone. Cadences' conversational voice AI eliminates all of that: an artificial intelligence agent answers the call, understands what you say, queries your history in real time and responds with a natural voice in whatever language you prefer.

📞

The key stat

Companies adopting conversational voice AI report a 60% reduction in wait times and a 35% increase in customer satisfaction. In Cadences, the voice agent can handle calls 24/7 with minimal setup: configure the prompt and connect a phone number.

The Problem

Traditional IVR vs. Voice AI

The fundamental difference isn't just technological — it's experiential. An IVR is a fixed decision tree; a voice AI agent is an open conversation with real-time data access.

Aspect | Traditional IVR | Voice AI in Cadences
Interaction | "Press 1, 2 or 3" | Free conversation
Understanding | Fixed keywords | Full contextual NLU
Voice | Robotic / pre-recorded | Cloned neural AI, 28+ languages
Context | None | CRM + history + pipeline
Interruptions | Not supported | Natural detection and response
Languages | 1-2 pre-recorded | 28+ automatic via multilingual model
Setup | Weeks of recording | One prompt + phone number
Architecture

The Triangle: Twilio ↔ Durable Object ↔ ElevenLabs

Voice AI in Cadences works as a real-time bridge between three systems. There is no polling and there are no processing queues: audio flows over bidirectional WebSockets with sub-500 ms latency.

📱

Twilio

Global phone network. Receives the call and converts the audio to a digital stream via Media Streams.

🧠

Durable Object

WebSocket bridge on Cloudflare. Maintains session state, translates formats between Twilio and ElevenLabs.

🎙️

ElevenLabs

Conversational AI. Speech-to-text, LLM reasoning, and text-to-speech with cloned neural voice.

1. Incoming call → Twilio

A customer calls the configured number. Twilio sends a webhook request with the call data (From, To, CallSid) and executes the TwiML it gets back, whose <Connect><Stream> opens a Media Stream over WebSocket.

2. Stream → Durable Object

Twilio's WebSocket connects to the ElevenLabsMediaStream Durable Object. Upon receiving the start event, the DO extracts the custom parameters (agentId, language, context, first message) and connects to ElevenLabs via a signed URL.

3. Real-time bidirectional audio

Each audio chunk from the customer (media event) is repackaged as user_audio_chunk and sent to ElevenLabs. The agent's audio response arrives as base64 and is forwarded to Twilio as a media payload. The customer hears the agent's voice with no perceptible delay.

4. Smart interruptions

If the user talks while the agent is responding, ElevenLabs sends an interruption event. The DO immediately sends a clear event to Twilio to flush the audio buffer, and the agent reformulates its response. This creates a natural conversation without "talking over each other".

5. Call ended → Cleanup

When the caller hangs up, Twilio sends a stop event. The DO closes both WebSockets, releases its resources and goes back to hibernation. You pay only for the seconds of active conversation.
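The message translation in steps 2 to 5 can be sketched as a small, transport-agnostic bridge. This is not the actual ElevenLabsMediaStream code: createBridge and its three callbacks are hypothetical names, and only the JSON shapes follow the public Twilio Media Streams and ElevenLabs conversation protocols. It also assumes the agent is configured for 8 kHz μ-law audio in both directions, so base64 payloads can be re-wrapped without transcoding.

```javascript
// Hypothetical bridge: translates messages between Twilio and ElevenLabs.
// The callbacks stand in for the two real WebSocket connections.
function createBridge({ sendToTwilio, sendToElevenLabs, closeSockets }) {
  const state = { streamSid: null, params: {}, open: false };

  return {
    state,

    // Messages arriving from Twilio's Media Stream WebSocket.
    onTwilioMessage(raw) {
      const msg = JSON.parse(raw);
      switch (msg.event) {
        case 'start':
          // Step 2: keep the stream ID and the <Parameter> values from the TwiML.
          state.streamSid = msg.start.streamSid;
          state.params = msg.start.customParameters || {};
          state.open = true;
          break;
        case 'media':
          // Step 3: caller audio is already base64, so it is only re-wrapped.
          if (state.open) sendToElevenLabs({ user_audio_chunk: msg.media.payload });
          break;
        case 'stop':
          // Step 5: the caller hung up, so both connections are released.
          state.open = false;
          closeSockets();
          break;
      }
    },

    // Messages arriving from the ElevenLabs conversation WebSocket.
    onElevenLabsMessage(raw) {
      if (!state.open) return;
      const msg = JSON.parse(raw);
      switch (msg.type) {
        case 'audio':
          // Step 3: agent audio goes back to the caller as a media message.
          if (msg.audio_event?.audio_base_64) {
            sendToTwilio({
              event: 'media',
              streamSid: state.streamSid,
              media: { payload: msg.audio_event.audio_base_64 },
            });
          }
          break;
        case 'interruption':
          // Step 4: flush whatever Twilio still has buffered for playback.
          sendToTwilio({ event: 'clear', streamSid: state.streamSid });
          break;
        case 'ping':
          // Keep-alive: answer with the same event ID.
          sendToElevenLabs({ type: 'pong', event_id: msg.ping_event.event_id });
          break;
      }
    },
  };
}
```

Keeping the translation free of socket code makes it easy to unit test; in a Durable Object the callbacks would wrap ws.send(JSON.stringify(...)) on each connection.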

Configuration

TwiML: Connecting the Call to the Agent

When Twilio receives a call, it calls a webhook that returns dynamically generated TwiML. This XML tells Twilio to open a WebSocket Media Stream to the Durable Object instead of playing an IVR, injecting business context as parameters:

Dynamically generated TwiML
<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Connect>
    <Stream url="wss://cadences.app/api/elevenlabs/media-stream">
      <Parameter name="agentId"    value="agent_abc123" />
      <Parameter name="language"   value="en" />
      <Parameter name="callerName" value="John Smith" />
      <Parameter name="contextPrompt"
                value="Dental clinic. Patient with appointment on 02/15. Treatment: orthodontics." />
    </Stream>
  </Connect>
</Response>

Notice the contextPrompt parameter: this is where you inject the business information. The Durable Object passes it on to the ElevenLabs agent before the first word is spoken, so the agent can use it throughout the entire conversation. For example, the agent can open with: "Good morning John, I see you have an orthodontics appointment on February 15th. How can I help you?"
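Because the parameter values come from CRM data, the webhook should escape them before placing them in XML attributes. The following is a minimal sketch of how such TwiML could be generated; buildStreamTwiml and escapeXml are hypothetical helpers, not part of Cadences or the Twilio SDK.

```javascript
// Escape the five characters that are unsafe inside XML attribute values.
function escapeXml(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;');
}

// Build a <Connect><Stream> response with one <Parameter> per entry in params.
function buildStreamTwiml(streamUrl, params) {
  const parameterTags = Object.entries(params).map(
    ([name, value]) =>
      `      <Parameter name="${escapeXml(name)}" value="${escapeXml(value)}" />`
  );
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<Response>',
    '  <Connect>',
    `    <Stream url="${escapeXml(streamUrl)}">`,
    ...parameterTags,
    '    </Stream>',
    '  </Connect>',
    '</Response>',
  ].join('\n');
}

// Example: a context prompt that contains characters needing escaping.
const twiml = buildStreamTwiml('wss://cadences.app/api/elevenlabs/media-stream', {
  agentId: 'agent_abc123',
  language: 'en',
  contextPrompt: 'Patient of Dr. O\'Neil & partners. Treatment: orthodontics.',
});
```

Without the escaping step, an ampersand or quote in a customer name or note would produce invalid XML and Twilio would reject the response.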

Dynamic Context

Dynamic Variables and Business Context

What turns a generic voice agent into a truly useful one is context. The Durable Object sends an initialization message to ElevenLabs with the conversation config and the caller's dynamic variables:

Conversation initialization (DO → ElevenLabs)
// Message sent when connecting to ElevenLabs
const initMessage = {
  type: 'conversation_initiation_client_data',
  conversation_config_override: {
    agent: {
      language: 'en',
      prompt: {
        prompt: contextPrompt  // Business context
      },
      first_message: 'Hello John, welcome to the Dental Clinic...'
    }
  },
  dynamic_variables: {
    caller_number: '+14155551234',
    caller_name:   'John Smith',
    call_sid:      'CA1234567890abcdef',
    workflow_id:   'wf_dental_appointment'
  }
};

Dynamic variables allow the same voice agent to behave differently depending on who calls. If the CRM says John has an overdue invoice, the agent knows it before he even asks.
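As a rough sketch, the initialization message could be assembled from the custom parameters of Twilio's start event rather than hard-coded. The helper name buildInitMessage, the firstMessage and workflowId parameter names, and the fallback greeting are assumptions for illustration; the message shape mirrors the example above.

```javascript
// Hypothetical helper: builds the init message from per-call data.
// `params` are the <Parameter> values received in Twilio's start event;
// `call` carries the caller number and CallSid known to the bridge.
function buildInitMessage(params, call) {
  const callerName = params.callerName || '';
  const greeting =
    params.firstMessage ||
    (callerName ? `Hello ${callerName}, how can I help you?` : 'Hello, how can I help you?');

  return {
    type: 'conversation_initiation_client_data',
    conversation_config_override: {
      agent: {
        language: params.language || 'en',
        prompt: { prompt: params.contextPrompt || '' }, // business context
        first_message: greeting,
      },
    },
    dynamic_variables: {
      caller_number: call.from,
      caller_name: callerName,
      call_sid: call.callSid,
      workflow_id: params.workflowId || '',
    },
  };
}

// Example with the values used earlier in this article.
const message = buildInitMessage(
  {
    language: 'en',
    callerName: 'John Smith',
    contextPrompt: 'Dental clinic. Patient with appointment on 02/15. Treatment: orthodontics.',
    workflowId: 'wf_dental_appointment',
  },
  { from: '+14155551234', callSid: 'CA1234567890abcdef' }
);
```

Note that ElevenLabs only honors prompt, language and first-message overrides when they are enabled in the agent's settings, so the agent configuration has to allow them.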

Voice Catalog

Neural Voices and Customization

Cadences includes a catalog of configurable voices with fine control over stability, expressiveness and similarity. The eleven_multilingual_v2 model allows the same voice to speak Spanish, English, French or Japanese without changing profiles.

👩

Cristina

Default voice

Professional female voice in native Spanish. Ideal for customer support and medical queries. Clear and empathetic.

👨

Mario Osborne

Spanish

Professional and clear male voice. Perfect for corporate communications and tutorials.

🌎

Mario

Latin American Spanish

Warm and friendly voice in Latin American Spanish. Automatically adapts to regional context.

🌍

Multilingual voices

28+ languages

English (US/UK/AU), Portuguese, French, German, Italian, Japanese, Korean, Chinese and more. All with the same neural model.

Parameter | Range | Default | Effect
Stability | 0 – 1 | 0.50 | Higher = more consistent voice, lower = more expressive
Similarity Boost | 0 – 1 | 0.75 | Higher = more faithful to the original cloned voice
Style | 0 – 1 | 0.00 | Expressiveness level (advanced models only)
Speaker Boost | on/off | on | Improves voice clarity (recommended on)
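One way to apply these ranges and defaults is to normalize user-supplied settings before sending them on. The sketch below is an assumption about how that could look, not Cadences' implementation; the field names follow the ElevenLabs voice_settings object, and normalizeVoiceSettings is a hypothetical helper.

```javascript
// Defaults taken from the parameter table above.
const VOICE_DEFAULTS = Object.freeze({
  stability: 0.5,
  similarity_boost: 0.75,
  style: 0,
  use_speaker_boost: true,
});

// Clamp a numeric setting into [0, 1]; fall back when it is missing or invalid.
function clampUnit(value, fallback) {
  if (typeof value !== 'number' || !Number.isFinite(value)) return fallback;
  return Math.min(1, Math.max(0, value));
}

// Return a complete, in-range settings object.
function normalizeVoiceSettings(input = {}) {
  return {
    stability: clampUnit(input.stability, VOICE_DEFAULTS.stability),
    similarity_boost: clampUnit(input.similarity_boost, VOICE_DEFAULTS.similarity_boost),
    style: clampUnit(input.style, VOICE_DEFAULTS.style),
    use_speaker_boost:
      typeof input.use_speaker_boost === 'boolean'
        ? input.use_speaker_boost
        : VOICE_DEFAULTS.use_speaker_boost,
  };
}

// Example: an out-of-range stability is clamped, missing fields use defaults.
const settings = normalizeVoiceSettings({ stability: 1.4, style: -2 });
```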
Storefront API

Outbound Calls from Your Application

Besides incoming calls, Cadences enables outbound calls from any storefront via API. A clinic can automate appointment reminders, an online store can confirm orders, and a sales team can follow up with leads, all with personalized voice AI.

POST /api/storefront/voice-call

{
  "phoneNumber": "+14155551234",
  "cadences_prompt": "Remind John about his orthodontics appointment tomorrow at 10:00 AM. Ask if he confirms or wants to reschedule.",
  "callType": "reminder",
  "metadata": {
    "patientId": "pat_001",
    "appointmentId": "apt_2025_0215",
    "organizationId": "org_clinic23"
  }
}

// Response
{
  "success": true,
  "callId": "call_1705312800",
  "conversationId": "conv_abc123",
  "status": "initiated"
}

Cadences acts as a proxy: it handles authentication with ElevenLabs and Twilio, so your storefront never exposes API keys. Each call is logged in D1 with its conversation ID for later follow-up.
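From the storefront side, the call is an ordinary JSON POST. The sketch below only builds the request so it can be inspected before sending; the base URL https://cadences.app is an assumption borrowed from the TwiML example, buildVoiceCallRequest is a hypothetical helper, and any storefront authentication the endpoint requires is not shown because this article does not describe it.

```javascript
// Hypothetical helper: builds the URL and fetch options for the voice-call endpoint.
function buildVoiceCallRequest(baseUrl, { phoneNumber, prompt, callType, metadata = {} }) {
  return {
    url: new URL('/api/storefront/voice-call', baseUrl).toString(),
    init: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        phoneNumber,
        cadences_prompt: prompt,
        callType,
        metadata,
      }),
    },
  };
}

// Example: the appointment reminder shown above.
const request = buildVoiceCallRequest('https://cadences.app', {
  phoneNumber: '+14155551234',
  prompt: 'Remind John about his orthodontics appointment tomorrow at 10:00 AM.',
  callType: 'reminder',
  metadata: { patientId: 'pat_001', appointmentId: 'apt_2025_0215' },
});

// To send it: const response = await fetch(request.url, request.init);
```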

🔐

Proxy security

ElevenLabs credentials (xi-api-key) and the Twilio phone number ID live exclusively in the Worker's environment variables. The storefront only sends the prompt and business metadata — no access to API keys.
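On the Worker side, the same idea can be sketched as a function that combines the storefront payload with secrets taken from the environment. The environment variable names are hypothetical, and the endpoint path and body fields reflect our reading of the ElevenLabs Twilio outbound-call API, so check them against the current API reference before relying on them.

```javascript
// Hypothetical proxy step: merges the storefront payload with server-side secrets.
// `env` stands for the Worker's environment bindings; the names are assumptions.
function buildOutboundCallRequest(env, payload) {
  return {
    url: 'https://api.elevenlabs.io/v1/convai/twilio/outbound-call',
    init: {
      method: 'POST',
      headers: {
        'xi-api-key': env.ELEVENLABS_API_KEY, // never sent back to the storefront
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({
        agent_id: env.ELEVENLABS_AGENT_ID,
        agent_phone_number_id: env.ELEVENLABS_PHONE_NUMBER_ID,
        to_number: payload.phoneNumber,
        conversation_initiation_client_data: {
          conversation_config_override: {
            agent: { prompt: { prompt: payload.cadences_prompt } },
          },
          dynamic_variables: { call_type: payload.callType || 'unspecified' },
        },
      }),
    },
  };
}

// Example with placeholder secrets.
const upstream = buildOutboundCallRequest(
  {
    ELEVENLABS_API_KEY: 'placeholder-key',
    ELEVENLABS_AGENT_ID: 'agent_abc123',
    ELEVENLABS_PHONE_NUMBER_ID: 'placeholder-phone-id',
  },
  { phoneNumber: '+14155551234', cadences_prompt: 'Remind John about his appointment.', callType: 'reminder' }
);
```

The key only ever appears in a request header built on the server, which is what keeps it out of storefront code and browser traffic.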

Multilingual

28+ Languages with a Single Model

Thanks to the eleven_multilingual_v2 model, the same voice can speak in any of these languages without needing to configure separate voices. Simply change the language parameter in the agent configuration.

🇪🇸

Español (ES/MX/AR)

🇺🇸

English (US/UK/AU)

🇧🇷

Português (BR/PT)

🇫🇷

Français

🇩🇪

Deutsch

🇮🇹

Italiano

🇯🇵

日本語

🇰🇷

한국어

Use Cases

Where Voice AI Shines

🏥

Clinics & healthcare

Appointment reminders, automatic confirmations, pre-operative instructions. The agent knows patient data and can reschedule on the spot.

🛒

E-commerce

Order confirmation, shipping status, return management. "Your order #4521 was shipped yesterday, you'll receive it tomorrow between 10 AM and 2 PM."

💼

Sales & follow-up

Automated follow-up after demos, phone-based lead qualification, and meeting scheduling, all backed by integrated CRM data.

🎧

24/7 technical support

Common issue resolution, smart escalation to a human when the agent detects it can't resolve the issue, and post-call satisfaction surveys.

Audio Hub

Desktop Application for Audio Management

Beyond telephony, Cadences includes Audio Hub: an Electron desktop application for generating, previewing and managing AI audio content. Batch Text-to-Speech, voice cloning, and export in multiple formats — all connected to the same ElevenLabs infrastructure.

Batch TTS

Generate hundreds of audio clips from a spreadsheet or text list.

Preview before publishing

Listen to each clip, adjust parameters and regenerate before using in production.

Smart cache

7-day cache, 50 MB max. Avoids regenerating clips that already exist.
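The article does not describe how the Audio Hub cache is implemented. As an illustration only, a cache with a 7-day lifetime and a 50 MB cap could look like the sketch below, which evicts the oldest clips first; AudioCache and its options are hypothetical names.

```javascript
const SEVEN_DAYS_MS = 7 * 24 * 60 * 60 * 1000;
const FIFTY_MB = 50 * 1024 * 1024;

// In-memory sketch: entries expire after maxAgeMs, oldest entries are evicted
// when adding a clip would exceed maxBytes.
class AudioCache {
  constructor({ maxAgeMs = SEVEN_DAYS_MS, maxBytes = FIFTY_MB, now = () => Date.now() } = {}) {
    this.maxAgeMs = maxAgeMs;
    this.maxBytes = maxBytes;
    this.now = now;
    this.entries = new Map(); // Map keeps insertion order, so the first key is the oldest
    this.totalBytes = 0;
  }

  // Same text + voice + settings should map to the same clip.
  static key(text, voiceId, settings = {}) {
    return JSON.stringify([voiceId, settings, text]);
  }

  get(key) {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.now() - entry.createdAt > this.maxAgeMs) {
      this.delete(key); // expired
      return undefined;
    }
    return entry.audio;
  }

  // `audio` is expected to be a Uint8Array. Returns false if it can never fit.
  set(key, audio) {
    if (audio.byteLength > this.maxBytes) return false;
    this.delete(key);
    while (this.totalBytes + audio.byteLength > this.maxBytes) {
      this.delete(this.entries.keys().next().value); // evict the oldest clip
    }
    this.entries.set(key, { audio, createdAt: this.now() });
    this.totalBytes += audio.byteLength;
    return true;
  }

  delete(key) {
    const entry = this.entries.get(key);
    if (!entry) return;
    this.totalBytes -= entry.audio.byteLength;
    this.entries.delete(key);
  }
}
```

A lookup before each generation request, keyed by AudioCache.key(text, voiceId, settings), is what avoids paying twice for an identical clip.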

Limits & Safety

Production Controls

Voice AI has strict controls to prevent abuse and manage costs. Each parameter is configurable per organization.

Control | Default Value | Purpose
Max text | 5,000 characters | Prevent excessive TTS prompts
Max recording | 180 sec (3 min) | Limit user recording duration
Rate limit | 10 req/min | Protect against API abuse
Audio format | MP3, 44.1 kHz, 128 kbps | Quality/size balance
Cache | 7 days, 50 MB | Reuse generated audio
Phone | E.164 format | Strict number validation
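To make these defaults concrete, here is a hedged sketch of request validation and a per-key rate limiter. The article does not say how Cadences enforces the limits, so validateRequest, createRateLimiter and the sliding-window approach are illustrative assumptions; the numeric values come from the table above.

```javascript
const LIMITS = { maxTextChars: 5000, maxRecordingSec: 180, requestsPerMinute: 10 };

// E.164: a plus sign, a non-zero first digit, at most 15 digits in total.
const E164 = /^\+[1-9]\d{1,14}$/;

// Returns a list of human-readable problems; an empty list means the request is valid.
function validateRequest({ text, phoneNumber, recordingSec } = {}) {
  const errors = [];
  if (text !== undefined && text.length > LIMITS.maxTextChars) {
    errors.push(`text exceeds ${LIMITS.maxTextChars} characters`);
  }
  if (recordingSec !== undefined && recordingSec > LIMITS.maxRecordingSec) {
    errors.push(`recording exceeds ${LIMITS.maxRecordingSec} seconds`);
  }
  if (phoneNumber !== undefined && !E164.test(phoneNumber)) {
    errors.push('phone number is not in E.164 format');
  }
  return errors;
}

// Sliding-window limiter: allows `limit` calls per `windowMs` for each key.
function createRateLimiter(limit = LIMITS.requestsPerMinute, windowMs = 60000, now = Date.now) {
  const hits = new Map(); // key -> timestamps of recent accepted calls
  return function allow(key) {
    const t = now();
    const recent = (hits.get(key) || []).filter((ts) => t - ts < windowMs);
    const accepted = recent.length < limit;
    if (accepted) recent.push(t);
    hits.set(key, recent);
    return accepted;
  };
}
```

In practice the limiter key would be the organization or API client, since the article states that each control is configurable per organization.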
Cost

Cost Model: You Only Pay for What You Use

The voice AI infrastructure in Cadences runs on a pure serverless model. No servers running 24/7, no monthly infrastructure minimums. The bill is composed of three elements:

📱

Twilio

Per-minute call cost based on destination. Media Streams included. Typically $0.013 – $0.022/min for domestic calls.

🎙️

ElevenLabs

Cost per character generated. The Scale plan includes Conversational AI. Advanced neural voices enable natural conversations with no setup cost.

☁️

Cloudflare Durable Objects

Marginal cost based on how long each WebSocket stays active. The DO hibernates between calls, so there is no cost when there are no active conversations.
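As a back-of-the-envelope example, the Twilio portion of a call can be estimated from the per-minute range quoted above. The sketch assumes calls are billed in whole minutes, rounded up, and it leaves out ElevenLabs and Cloudflare because this article gives no rates for them; estimateTwilioCost is a hypothetical helper.

```javascript
// USD per minute for domestic calls, as quoted in this article.
const TWILIO_RATE_RANGE = { low: 0.013, high: 0.022 };

// Estimate the Twilio leg only, assuming per-minute billing rounded up.
function estimateTwilioCost(callSeconds, rates = TWILIO_RATE_RANGE) {
  const billedMinutes = Math.ceil(callSeconds / 60);
  return {
    billedMinutes,
    low: billedMinutes * rates.low,
    high: billedMinutes * rates.high,
  };
}

// Example: a call lasting 2 minutes 30 seconds.
const estimate = estimateTwilioCost(150);
const summary = `${estimate.billedMinutes} min: $${estimate.low.toFixed(3)} to $${estimate.high.toFixed(3)}`;
```

Under these assumptions a 150-second call is billed as 3 minutes, which puts the Twilio leg between roughly $0.039 and $0.066.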

Conclusion

Voice as a First-Class Interface

Conversational voice AI in Cadences isn't a secondary feature — it's a first-class interface on the same level as the web dashboard or the REST API. A voice agent has access to the same CRM, the same pipeline, the same data as any other entry point into the system.

The combination of Twilio (global phone network), a Durable Object as a smart bridge, and ElevenLabs (neural voice in 28+ languages) creates a voice stack that is simultaneously powerful and simple to configure. You don't need telephony engineers, you don't need to record IVR prompts, you don't need to hire a call center.

🎙️

Technical summary

  • Twilio Media Streams + Durable Objects + ElevenLabs Conversational AI
  • Bidirectional WebSocket with sub-500ms latency
  • 28+ languages with multilingual model, same voice
  • Business context injected in real time from CRM
  • Natural interruptions with auto buffer clearing
  • Outbound call API for storefronts without exposing credentials
  • Audio Hub desktop for batch TTS management
  • Serverless model: pay only for active conversations

Cadences Engineering

Technical documentation from the engineering team