Tutorial: How to easily build a voice agent with AssemblyAI
This tutorial walks you through building a complete AI voice agent that handles real-time conversations through natural speech. You’ll create a system that listens to users, understands their requests, generates intelligent responses, and speaks back with human-like voice synthesis—all within the sub-second timing required for natural conversation flow.
You’ll integrate three core technologies: AssemblyAI’s Universal-3 Pro Streaming model for real-time speech-to-text transcription, OpenAI’s GPT-4 for conversation logic and response generation, and ElevenLabs for natural voice synthesis. The implementation covers audio capture, WebSocket streaming, conversation management, and the orchestration layer that ties everything together into a production-ready voice application.
What is an AI voice agent?
An AI voice agent is software that talks with people using natural speech, just like a phone conversation. This means you can speak to it normally, and it understands what you want and responds back with a human-like voice.
The key difference from old phone systems is real-time streaming. Traditional phone menus make you press buttons and wait. Voice agents process your speech as you talk, understanding your words before you finish speaking.
What makes voice agents work:
- Streaming speech-to-text: Converts your voice to text instantly
- Language model: Understands what you mean and decides how to respond
- Text-to-speech: Creates natural-sounding voice responses
- Orchestration: Manages the conversation flow and connects to other systems
Think of it like having a smart assistant that never gets tired, works 24/7, and can handle multiple conversations at once.
How AI voice agents work
Voice agents process conversations through a real-time pipeline. When you speak, your voice travels through multiple AI models before you hear a response—all within a second.
The process starts when you speak into a microphone or phone. The speech-to-text model immediately begins converting your words to text, sending partial results as you talk. These early transcripts flow to the language model, which starts thinking about responses before you finish your sentence.
The total response time must stay under one second for natural conversation. When responses take longer, people start talking over the system or repeating themselves.
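As a back-of-the-envelope check, you can sum per-stage latencies against that one-second target. The numbers below are illustrative assumptions, not measurements of any particular stack:

```python
# Rough per-stage latency budget for one conversational turn.
# Every value here is an illustrative assumption, not a benchmark.
LATENCY_BUDGET_MS = {
    "audio capture + network": 100,
    "streaming speech-to-text (final transcript)": 300,
    "LLM first token": 350,
    "text-to-speech first audio": 200,
}

def total_latency_ms(budget):
    """Sum the per-stage latencies for a full round trip."""
    return sum(budget.values())

print(total_latency_ms(LATENCY_BUDGET_MS))  # 950 - just under the 1-second target
```

If any single stage blows its slice of the budget, the whole turn overruns, which is why streaming (overlapping the stages) matters more than speeding up any one component.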
WebSocket connections make streaming possible. Unlike normal web requests that wait for complete data, WebSockets keep connections open to stream audio continuously.
You’ll see two types of transcripts:
- Interim transcripts: Quick guesses that arrive every few hundred milliseconds
- Final transcripts: Complete, accurate text when you pause speaking
Smart voice agents use both types—interim for speed and final for accuracy.
Core components of a voice agent
Building a voice agent requires four essential components working together. Each piece affects the overall conversation quality.
Streaming speech-to-text
Speech-to-text converts your spoken words into text that computers can understand. This component determines whether your voice agent hears what you’re saying correctly.
Real-world speech creates challenges. People have different accents, speak at varying speeds, and deal with background noise. You might spell out email addresses, mention product codes, or use industry terms. Your voice agent needs speech-to-text that handles all these situations while staying fast enough for real conversation.
Accuracy requirements for different use cases:
- Below 90% accuracy: Users get frustrated with frequent mistakes
- 90-93% accuracy: Occasional errors but generally usable
- 93%+ accuracy: Natural conversations with rare corrections needed
AssemblyAI’s Universal-3 Pro Streaming model achieves approximately 94% accuracy across different audio conditions. It processes audio through WebSocket connections, returning properly formatted transcripts for names, addresses, and numbers.
Here’s how to set up streaming speech-to-text with Python:
import assemblyai as aai

aai.settings.api_key = "your-assemblyai-api-key"

def on_data(transcript: aai.RealtimeTranscript):
    if not transcript.text:
        return
    if isinstance(transcript, aai.RealtimeFinalTranscript):
        print(f"Final: {transcript.text}")
    else:
        print(f"Interim: {transcript.text}")

transcriber = aai.RealtimeTranscriber(
    sample_rate=16_000,
    on_data=on_data,
    on_error=lambda e: print(f"Error: {e}")
)
transcriber.connect()
LLM and orchestration layer
The language model serves as your voice agent’s brain, understanding what people want and deciding how to respond. But you need more than just raw AI responses—you need orchestration that manages conversations, calls other systems, and remembers context.
Modern language models like GPT-4 excel at understanding natural speech and creating human-like responses. The orchestration layer adds business logic, conversation management, and system connections. This combination lets voice agents check inventory, book appointments, or update customer records.
Key orchestration responsibilities:
- Intent recognition: Figuring out what the user wants to accomplish
- Context tracking: Remembering conversation history and user information
- Function calling: Executing tasks like database queries or API calls
- Response creation: Building appropriate replies based on context
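To illustrate the function-calling piece, here is a minimal sketch of a tool schema plus a dispatcher. The `check_inventory` handler and its stock data are hypothetical; the schema follows the shape OpenAI's chat completions API expects in its `tools` parameter:

```python
import json

# Hypothetical business function the orchestration layer can call.
def check_inventory(product_id: str) -> dict:
    # Stand-in for a real database or API lookup.
    stock = {"SKU-123": 7}
    return {"product_id": product_id, "in_stock": stock.get(product_id, 0)}

# Tool schema in OpenAI's function-calling format; in a real agent this
# list is passed as `tools=` to chat.completions.create, and the model's
# tool_calls are routed through dispatch() below.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "check_inventory",
        "description": "Look up current stock for a product",
        "parameters": {
            "type": "object",
            "properties": {"product_id": {"type": "string"}},
            "required": ["product_id"],
        },
    },
}]

HANDLERS = {"check_inventory": check_inventory}

def dispatch(tool_name: str, arguments_json: str) -> str:
    """Route a model-requested tool call to the matching handler."""
    result = HANDLERS[tool_name](**json.loads(arguments_json))
    return json.dumps(result)  # sent back to the model as a tool message

print(dispatch("check_inventory", '{"product_id": "SKU-123"}'))
```

The orchestration layer's job is exactly this routing: translate the model's structured request into a real system call, then feed the result back so the model can phrase a spoken answer.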
Text-to-speech synthesis
Text-to-speech creates the voice your users hear. Modern TTS sounds completely natural, with proper pauses, emphasis, and emotional tone. The challenge isn’t just quality—it’s achieving that quality fast enough for real-time conversation.
You have several options for TTS:
- ElevenLabs: Exceptionally natural voices with emotional range
- Google Cloud: Extensive language support and cloud integration
- OpenAI: High-quality voices with good API integration
The key is starting speech generation before the language model finishes writing the complete response. Streaming TTS begins playing audio from the first words while generating the rest.
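One common way to do this is to buffer the LLM's streamed tokens and flush them to the TTS engine a sentence at a time, so playback of the first sentence starts while later ones are still generating. A minimal sketch (the token list here is hard-coded for illustration):

```python
import re

def sentence_chunks(token_stream):
    """Group streamed LLM tokens into sentences so TTS can start
    speaking the first sentence while later ones are still generating."""
    buffer = ""
    for token in token_stream:
        buffer += token
        # Flush whenever the buffer contains a completed sentence.
        while (match := re.search(r"[.!?]\s", buffer)):
            yield buffer[:match.end()].strip()
            buffer = buffer[match.end():]
    if buffer.strip():
        yield buffer.strip()  # flush whatever remains at end of stream

tokens = ["Your order ", "shipped today. ", "It should ", "arrive Friday."]
print(list(sentence_chunks(tokens)))
# ['Your order shipped today.', 'It should arrive Friday.']
```

In a real agent the token stream would come from a streaming chat completion, and each yielded sentence would be handed to the TTS provider immediately.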
Integration and business logic
Voice agents need to connect with your existing systems to provide real value. Whether checking order status, scheduling appointments, or updating customer records, these integrations transform voice agents from simple chatbots into useful business assets.
Security becomes critical when voice agents access business systems:
- API key protection: Store credentials securely, never in code
- Encrypted connections: All data transmission must be secure
- User authentication: Verify identity naturally within conversation flow
Common integration examples:
- Customer relationship management (CRM) systems
- Calendar and scheduling platforms
- Payment processing systems
- Inventory and order management
- Email and SMS messaging
Performance requirements for voice agents
Performance determines whether people want to use your voice agent. Users expect conversations that flow naturally, without awkward pauses or constant corrections.
The most visible performance metric is latency. When someone stops speaking, they expect a response within normal conversation timing—under one second total. This includes every step from hearing speech to playing the response.
Good performance also means accuracy beyond simple word recognition. Voice agents must correctly capture proper nouns like company names and customer names. They need accurate transcription of numbers like order IDs, phone numbers, and account numbers. Getting these details wrong frustrates users and creates problems downstream.
AssemblyAI’s Universal-3 Pro model specifically optimizes for voice agent requirements, with better recognition of names and numbers compared to general speech recognition models.
Common use cases for voice agents
Voice agents work best for specific types of interactions where their constant availability and consistent performance provide clear value.
Customer support automation handles common questions about business hours, return policies, order status, and basic troubleshooting. Voice agents provide immediate answers without wait times, escalating complex issues to human agents with full context.
Appointment scheduling works well for medical offices, salons, and service businesses. The agent checks real-time availability, confirms details, and sends confirmation messages—tasks that used to require dedicated staff time.
Lead qualification helps sales teams respond immediately to inquiries. The agent gathers contact information, understands needs, and routes qualified leads to the right salesperson. Instant response captures leads that might otherwise choose competitors.
After-hours service extends business availability beyond normal hours. Voice agents answer questions, take messages, and handle urgent requests, ensuring customers get help when they need it.
Building a voice agent: Implementation steps
Let’s build a working voice agent that combines streaming speech recognition, intelligent responses, and natural voice output. You’ll use AssemblyAI for speech-to-text, OpenAI for conversation logic, and ElevenLabs for voice synthesis.
Step 1: Set up your development environment
Create a new Python project with the required dependencies. You need Python 3.8 or later for modern library support.
# Create project directory
mkdir voice-agent
cd voice-agent
# Create virtual environment
python -m venv venv
# Activate virtual environment
# On Mac/Linux:
source venv/bin/activate
# On Windows:
venv\Scripts\activate
# Install required packages
pip install assemblyai openai elevenlabs websockets pyaudio python-dotenv
Create a .env file to store your API keys securely:
ASSEMBLYAI_API_KEY=your_assemblyai_key_here
OPENAI_API_KEY=your_openai_key_here
ELEVENLABS_API_KEY=your_elevenlabs_key_here
Set up your main Python file with the necessary imports:
import os
import threading
from queue import Queue
from dotenv import load_dotenv
import assemblyai as aai
from openai import OpenAI
from elevenlabs.client import ElevenLabs
from elevenlabs import stream
import pyaudio
load_dotenv()
aai.settings.api_key = os.getenv('ASSEMBLYAI_API_KEY')
openai_client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))
elevenlabs_client = ElevenLabs(api_key=os.getenv('ELEVENLABS_API_KEY'))
Step 2: Set up real-time audio streaming
Audio capture forms the foundation of your voice agent. For development, you’ll use PyAudio to capture microphone input. In production, you’d integrate with phone systems or web-based audio.
class AudioCapture:
    def __init__(self, sample_rate=16000):
        self.sample_rate = sample_rate
        self.chunk_size = 8000  # 0.5 seconds of audio at 16 kHz
        self.audio_queue = Queue()
        self.recording = False
        self.audio = pyaudio.PyAudio()
        self.stream = None

    def start_recording(self):
        self.recording = True
        self.stream = self.audio.open(
            format=pyaudio.paInt16,
            channels=1,
            rate=self.sample_rate,
            input=True,
            frames_per_buffer=self.chunk_size
        )
        thread = threading.Thread(target=self._capture_audio, daemon=True)
        thread.start()
        print("Microphone active - start speaking...")

    def _capture_audio(self):
        while self.recording:
            try:
                audio_data = self.stream.read(self.chunk_size, exception_on_overflow=False)
                self.audio_queue.put(audio_data)
            except Exception as e:
                print(f"Audio error: {e}")

    def get_audio_data(self):
        return self.audio_queue.get() if not self.audio_queue.empty() else None

    def stop_recording(self):
        self.recording = False
        if self.stream:
            self.stream.stop_stream()
            self.stream.close()
        self.audio.terminate()
Step 3: Integrate streaming speech-to-text
Connect AssemblyAI’s streaming API to transcribe audio in real time. The system processes both interim and final transcripts for the best user experience.
class VoiceAgent:
    def __init__(self):
        self.conversation_history = []
        self.is_processing = False
        self.audio_capture = AudioCapture()

    def handle_transcript(self, transcript: aai.RealtimeTranscript):
        if not transcript.text:
            return
        if isinstance(transcript, aai.RealtimeFinalTranscript):
            print(f"You: {transcript.text}")
            self.conversation_history.append({"role": "user", "content": transcript.text})
            if not self.is_processing:
                self.is_processing = True
                threading.Thread(target=self.generate_and_speak_response, daemon=True).start()
        else:
            print(f"Hearing: {transcript.text}")

    def start_conversation(self):
        print("Voice Agent starting...")
        print("Press Ctrl+C to stop")
        self.transcriber = aai.RealtimeTranscriber(
            sample_rate=16000,
            on_data=self.handle_transcript,
            on_error=lambda e: print(f"Speech error: {e}")
        )
        try:
            self.transcriber.connect()
            print("Connected to speech recognition")
            self.audio_capture.start_recording()
            while True:
                audio_chunk = self.audio_capture.get_audio_data()
                if audio_chunk:
                    self.transcriber.stream(audio_chunk)
        except KeyboardInterrupt:
            print("Shutting down...")
            self.cleanup()

    def cleanup(self):
        self.audio_capture.stop_recording()
        if hasattr(self, 'transcriber'):
            self.transcriber.close()
Step 4: Connect your LLM for responses
Add intelligent response generation using OpenAI’s GPT models. The system maintains conversation context and generates appropriate responses.
def generate_and_speak_response(self):
    try:
        messages = [{
            "role": "system",
            "content": """You are a helpful voice assistant.
            Keep responses conversational and concise - aim for 1-2 sentences.
            Speak naturally as if you're having a phone conversation."""
        }] + self.conversation_history
        print("Thinking...")
        response = openai_client.chat.completions.create(
            model="gpt-4",
            messages=messages,
            temperature=0.7,
            max_tokens=150
        )
        ai_response = response.choices[0].message.content
        print(f"Agent: {ai_response}")
        self.conversation_history.append({"role": "assistant", "content": ai_response})
        self.speak_text(ai_response)
    except Exception as e:
        print(f"Response error: {e}")
        self.speak_text("Sorry, I had trouble processing that. Could you try again?")
    finally:
        self.is_processing = False
Step 5: Add text-to-speech for voice responses
Complete your voice agent by adding natural speech synthesis. This creates the final piece of the conversation loop.
def speak_text(self, text):
    try:
        print("Speaking response...")
        audio = elevenlabs_client.generate(
            text=text,
            voice="Rachel",
            model="eleven_monolingual_v1"
        )
        stream(audio)
    except Exception as e:
        print(f"Speech synthesis error: {e}")

def main():
    agent = VoiceAgent()
    agent.start_conversation()

if __name__ == "__main__":
    main()
Here’s the complete, working voice agent code:
import os
import threading
from queue import Queue

from dotenv import load_dotenv
import assemblyai as aai
from openai import OpenAI
from elevenlabs.client import ElevenLabs
from elevenlabs import stream
import pyaudio

load_dotenv()
aai.settings.api_key = os.getenv('ASSEMBLYAI_API_KEY')
openai_client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))
elevenlabs_client = ElevenLabs(api_key=os.getenv('ELEVENLABS_API_KEY'))

class AudioCapture:
    def __init__(self, sample_rate=16000):
        self.sample_rate = sample_rate
        self.chunk_size = 8000
        self.audio_queue = Queue()
        self.recording = False
        self.audio = pyaudio.PyAudio()
        self.stream = None

    def start_recording(self):
        self.recording = True
        self.stream = self.audio.open(
            format=pyaudio.paInt16, channels=1, rate=self.sample_rate,
            input=True, frames_per_buffer=self.chunk_size
        )
        thread = threading.Thread(target=self._capture_audio, daemon=True)
        thread.start()

    def _capture_audio(self):
        while self.recording:
            try:
                self.audio_queue.put(self.stream.read(self.chunk_size, exception_on_overflow=False))
            except Exception as e:
                print(f"Audio error: {e}")

    def get_audio_data(self):
        return self.audio_queue.get() if not self.audio_queue.empty() else None

    def stop_recording(self):
        self.recording = False
        if self.stream:
            self.stream.stop_stream()
            self.stream.close()
        self.audio.terminate()

class VoiceAgent:
    def __init__(self):
        self.conversation_history = []
        self.is_processing = False
        self.audio_capture = AudioCapture()

    def handle_transcript(self, transcript: aai.RealtimeTranscript):
        if not transcript.text:
            return
        if isinstance(transcript, aai.RealtimeFinalTranscript):
            print(f"You: {transcript.text}")
            self.conversation_history.append({"role": "user", "content": transcript.text})
            if not self.is_processing:
                self.is_processing = True
                threading.Thread(target=self.generate_and_speak_response, daemon=True).start()
        else:
            print(f"Hearing: {transcript.text}")

    def generate_and_speak_response(self):
        try:
            messages = [{
                "role": "system",
                "content": "You are a helpful voice assistant. Keep responses conversational and concise."
            }] + self.conversation_history
            response = openai_client.chat.completions.create(
                model="gpt-4", messages=messages, temperature=0.7, max_tokens=150
            )
            ai_response = response.choices[0].message.content
            print(f"Agent: {ai_response}")
            self.conversation_history.append({"role": "assistant", "content": ai_response})
            audio = elevenlabs_client.generate(
                text=ai_response,
                voice="Rachel",
                model="eleven_monolingual_v1"
            )
            stream(audio)
        except Exception as e:
            print(f"Error: {e}")
        finally:
            self.is_processing = False

    def start_conversation(self):
        self.transcriber = aai.RealtimeTranscriber(
            sample_rate=16000,
            on_data=self.handle_transcript,
            on_error=lambda e: print(f"Speech error: {e}")
        )
        try:
            self.transcriber.connect()
            self.audio_capture.start_recording()
            print("Voice Agent ready - start speaking!")
            while True:
                audio_chunk = self.audio_capture.get_audio_data()
                if audio_chunk:
                    self.transcriber.stream(audio_chunk)
        except KeyboardInterrupt:
            print("Goodbye!")
            self.audio_capture.stop_recording()
            self.transcriber.close()

if __name__ == "__main__":
    VoiceAgent().start_conversation()
Run your voice agent with:
python voice_agent.py
Building production voice agents
Moving from a working prototype to production requires addressing reliability, scale, and quality that development doesn’t expose. Production voice agents handle thousands of conversations while maintaining consistent performance.
Speech recognition accuracy becomes critical at scale. When your agent misunderstands one word in every twenty, users notice and get frustrated. You need speech recognition that works reliably across different accents, background noise, and speaking styles.
Production requirements you need to address:
- Telephony integration: Connect through phone systems like Twilio or SIP trunking
- Concurrent handling: Process multiple conversations simultaneously
- Error recovery: Handle network failures and API outages gracefully
- Performance monitoring: Track response times and conversation quality
- Security compliance: Protect customer data and maintain audit trails
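For the error-recovery point above, one standard pattern is reconnecting with exponential backoff and jitter so transient network failures don't kill a conversation. A minimal sketch, where `connect` stands in for any callable wrapping your real connection logic (for example, a function that calls `transcriber.connect()`):

```python
import random
import time

def reconnect_with_backoff(connect, max_attempts=5, base_delay=0.5, cap=30.0):
    """Retry a flaky connection with exponential backoff and jitter.

    `connect` is any callable that raises ConnectionError on failure;
    the parameter names and defaults here are illustrative.
    """
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError as e:
            if attempt == max_attempts - 1:
                raise  # out of retries - surface the error to the caller
            # Double the delay each attempt (capped), with jitter to avoid
            # thundering-herd reconnects across many concurrent agents.
            delay = min(cap, base_delay * 2 ** attempt)
            time.sleep(delay * random.uniform(0.5, 1.0))
            print(f"Retry {attempt + 1} after error: {e}")
```

The same wrapper applies to the LLM and TTS calls; the important design choice is capping the delay and jittering it, so a brief provider outage degrades gracefully instead of synchronizing retries.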
AssemblyAI’s Universal-3 Pro Streaming model delivers production-grade accuracy specifically designed for voice agent applications. The model handles the challenging aspects of real conversations—background noise, accents, interruptions, and specialized terminology.
You’ll also need to consider infrastructure scaling:
- WebSocket connection pooling for handling multiple simultaneous conversations
- Load balancing across multiple server instances
- Database integration for storing conversation history and user preferences
- Monitoring and alerting for system health and performance metrics
Final words
Building an AI voice agent involves connecting streaming speech-to-text, language models, and text-to-speech into a real-time conversation system. The key is choosing components that deliver the accuracy and speed necessary for natural conversation—particularly the speech recognition foundation that determines whether your agent understands users correctly.
AssemblyAI’s Voice AI platform provides the streaming transcription accuracy needed for production voice agents, with specialized models that handle the complexities of real-world conversation. The Universal-3 Pro Streaming model delivers the reliability required when your voice agent represents your business to customers.