Tutorial: How to easily build a voice agent with AssemblyAI
This tutorial walks you through building a complete AI voice agent that handles real-time conversations through natural speech. You’ll create a system that listens to users, understands their requests, generates intelligent responses, and speaks back with human-like voice synthesis—all within the sub-second timing required for natural conversation flow.
You’ll integrate three core technologies: AssemblyAI’s Universal-3 Pro Streaming model for real-time speech-to-text transcription, OpenAI’s GPT-4 for conversation logic and response generation, and ElevenLabs for natural voice synthesis. The implementation covers audio capture, WebSocket streaming, conversation management, and the orchestration layer that ties everything together into a production-ready voice application.
What is an AI voice agent?
An AI voice agent is software that talks with people using natural speech, just like a phone conversation. This means you can speak to it normally, and it understands what you want and responds back with a human-like voice.
The key difference from old phone systems is real-time streaming. Traditional phone menus make you press buttons and wait. Voice agents process your speech as you talk, understanding your words before you finish speaking.
What makes voice agents work:
- Streaming speech-to-text: Converts your voice to text instantly
- Language model: Understands what you mean and decides how to respond
- Text-to-speech: Creates natural-sounding voice responses
- Orchestration: Manages the conversation flow and connects to other systems
Think of it like having a smart assistant that never gets tired, works 24/7, and can handle multiple conversations at once.
How AI voice agents work
Voice agents process conversations through a real-time pipeline. When you speak, your voice travels through multiple AI models before you hear a response—all within a second.
The process starts when you speak into a microphone or phone. The speech-to-text model immediately begins converting your words to text, sending partial results as you talk. These early transcripts flow to the language model, which starts thinking about responses before you finish your sentence.
The total response time must stay under one second for natural conversation. When responses take longer, people start talking over the system or repeating themselves.
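As a back-of-the-envelope check, you can sum per-stage latencies against that one-second target. The numbers below are illustrative assumptions, not measurements of any particular stack:

```python
# Rough per-stage latency budget for one conversational turn.
# Every value here is an illustrative assumption, not a benchmark.
LATENCY_BUDGET_MS = {
    "audio capture + network": 100,
    "streaming speech-to-text (final transcript)": 300,
    "LLM first token": 350,
    "text-to-speech first audio": 200,
}

def total_latency_ms(budget):
    """Sum the per-stage latencies for a full round trip."""
    return sum(budget.values())

print(total_latency_ms(LATENCY_BUDGET_MS))  # 950 - just under the 1-second target
```

If any single stage blows its slice of the budget, the whole turn overruns, which is why streaming (overlapping the stages) matters more than speeding up any one component.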
WebSocket connections make streaming possible. Unlike normal web requests that wait for complete data, WebSockets keep connections open to stream audio continuously.
You’ll see two types of transcripts:
- Interim transcripts: Quick guesses that arrive every few hundred milliseconds
- Final transcripts: Complete, accurate text when you pause speaking
Smart voice agents use both types—interim for speed and final for accuracy.
Core components of a voice agent
Building a voice agent requires four essential components working together. Each piece affects the overall conversation quality.
Streaming speech-to-text
Speech-to-text converts your spoken words into text that computers can understand. This component determines whether your voice agent hears what you’re saying correctly.
Real-world speech creates challenges. People have different accents, speak at varying speeds, and deal with background noise. You might spell out email addresses, mention product codes, or use industry terms. Your voice agent needs speech-to-text that handles all these situations while staying fast enough for real conversation.
Accuracy requirements for different use cases:
- Below 90% accuracy: Users get frustrated with frequent mistakes
- 90-93% accuracy: Occasional errors but generally usable
- 93%+ accuracy: Natural conversations with rare corrections needed
AssemblyAI’s Universal-3 Pro Streaming model achieves approximately 94% accuracy across different audio conditions. It processes audio through WebSocket connections, returning properly formatted transcripts for names, addresses, and numbers.
Here’s how to set up streaming speech-to-text with Python:
import assemblyai as aai

aai.settings.api_key = "your-assemblyai-api-key"

def on_data(transcript: aai.RealtimeTranscript):
    if not transcript.text:
        return
    if isinstance(transcript, aai.RealtimeFinalTranscript):
        print(f"Final: {transcript.text}")
    else:
        print(f"Interim: {transcript.text}")

transcriber = aai.RealtimeTranscriber(
    sample_rate=16_000,
    on_data=on_data,
    on_error=lambda e: print(f"Error: {e}")
)
transcriber.connect()
LLM and orchestration layer
The language model serves as your voice agent’s brain, understanding what people want and deciding how to respond. But you need more than just raw AI responses—you need orchestration that manages conversations, calls other systems, and remembers context.
Modern language models like GPT-4 excel at understanding natural speech and creating human-like responses. The orchestration layer adds business logic, conversation management, and system connections. This combination lets voice agents check inventory, book appointments, or update customer records.
Key orchestration responsibilities:
- Intent recognition: Figuring out what the user wants to accomplish
- Context tracking: Remembering conversation history and user information
- Function calling: Executing tasks like database queries or API calls
- Response creation: Building appropriate replies based on context
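To illustrate the function-calling piece, here is a minimal sketch of a tool schema plus a dispatcher. The `check_inventory` handler and its stock data are hypothetical; the schema follows the shape OpenAI's chat completions API expects in its `tools` parameter:

```python
import json

# Hypothetical business function the orchestration layer can call.
def check_inventory(product_id: str) -> dict:
    # Stand-in for a real database or API lookup.
    stock = {"SKU-123": 7}
    return {"product_id": product_id, "in_stock": stock.get(product_id, 0)}

# Tool schema in OpenAI's function-calling format; in a real agent this
# list is passed as `tools=` to chat.completions.create, and the model's
# tool_calls are routed through dispatch() below.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "check_inventory",
        "description": "Look up current stock for a product",
        "parameters": {
            "type": "object",
            "properties": {"product_id": {"type": "string"}},
            "required": ["product_id"],
        },
    },
}]

HANDLERS = {"check_inventory": check_inventory}

def dispatch(tool_name: str, arguments_json: str) -> str:
    """Route a model-requested tool call to the matching handler."""
    result = HANDLERS[tool_name](**json.loads(arguments_json))
    return json.dumps(result)  # sent back to the model as a tool message

print(dispatch("check_inventory", '{"product_id": "SKU-123"}'))
```

The orchestration layer's job is exactly this routing: translate the model's structured request into a real system call, then feed the result back so the model can phrase a spoken answer.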
Text-to-speech synthesis
Text-to-speech creates the voice your users hear. Modern TTS sounds completely natural, with proper pauses, emphasis, and emotional tone. The challenge isn’t just quality—it’s achieving that quality fast enough for real-time conversation.
You have several options for TTS:
- ElevenLabs: Exceptionally natural voices with emotional range
- Google Cloud: Extensive language support and cloud integration
- OpenAI: High-quality voices with good API integration
The key is starting speech generation before the language model finishes writing the complete response. Streaming TTS begins playing audio from the first words while generating the rest.
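One common way to do this is to buffer the LLM's streamed tokens and flush them to the TTS engine a sentence at a time, so playback of the first sentence starts while later ones are still generating. A minimal sketch (the token list here is hard-coded for illustration):

```python
import re

def sentence_chunks(token_stream):
    """Group streamed LLM tokens into sentences so TTS can start
    speaking the first sentence while later ones are still generating."""
    buffer = ""
    for token in token_stream:
        buffer += token
        # Flush whenever the buffer contains a completed sentence.
        while (match := re.search(r"[.!?]\s", buffer)):
            yield buffer[:match.end()].strip()
            buffer = buffer[match.end():]
    if buffer.strip():
        yield buffer.strip()  # flush whatever remains at end of stream

tokens = ["Your order ", "shipped today. ", "It should ", "arrive Friday."]
print(list(sentence_chunks(tokens)))
# ['Your order shipped today.', 'It should arrive Friday.']
```

In a real agent the token stream would come from a streaming chat completion, and each yielded sentence would be handed to the TTS provider immediately.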
Integration and business logic
Voice agents need to connect with your existing systems to provide real value. Whether checking order status, scheduling appointments, or updating customer records, these integrations transform voice agents from simple chatbots into useful business assets.
Security becomes critical when voice agents access business systems:
- API key protection: Store credentials securely, never in code
- Encrypted connections: All data transmission must be secure
- User authentication: Verify identity naturally within conversation flow
Common integration examples:
- Customer relationship management (CRM) systems
- Calendar and scheduling platforms
- Payment processing systems
- Inventory and order management
- Email and SMS messaging
Performance requirements for voice agents
Performance determines whether people want to use your voice agent. Users expect conversations that flow naturally, without awkward pauses or constant corrections.
The most visible performance metric is latency. When someone stops speaking, they expect a response within normal conversation timing—under one second total. This includes every step from hearing speech to playing the response.
Good performance also means accuracy beyond simple word recognition. Voice agents must correctly capture proper nouns like company names and customer names. They need accurate transcription of numbers like order IDs, phone numbers, and account numbers. Getting these details wrong frustrates users and creates problems downstream.
AssemblyAI’s Universal-3 Pro model specifically optimizes for voice agent requirements, with better recognition of names and numbers compared to general speech recognition models.
Common use cases for voice agents
Voice agents work best for specific types of interactions where their constant availability and consistent performance provide clear value.
Customer support automation handles common questions about business hours, return policies, order status, and basic troubleshooting. Voice agents provide immediate answers without wait times, escalating complex issues to human agents with full context.
Appointment scheduling works well for medical offices, salons, and service businesses. The agent checks real-time availability, confirms details, and sends confirmation messages—tasks that used to require dedicated staff time.
Lead qualification helps sales teams respond immediately to inquiries. The agent gathers contact information, understands needs, and routes qualified leads to the right salesperson. Instant response captures leads that might otherwise choose competitors.
After-hours service extends business availability beyond normal hours. Voice agents answer questions, take messages, and handle urgent requests, ensuring customers get help when they need it.
Building a voice agent: Implementation steps
Let’s build a working voice agent that combines streaming speech recognition, intelligent responses, and natural voice output. You’ll use AssemblyAI for speech-to-text, OpenAI for conversation logic, and ElevenLabs for voice synthesis.
Step 1: Set up your development environment
Create a new Python project with the required dependencies. You need Python 3.8 or later for modern library support.
# Create project directory
mkdir voice-agent
cd voice-agent
# Create virtual environment
python -m venv venv
# Activate virtual environment
# On Mac/Linux:
source venv/bin/activate
# On Windows:
venv\Scripts\activate
# Install required packages
pip install assemblyai openai elevenlabs websockets pyaudio python-dotenv
Create a .env file to store your API keys securely:
ASSEMBLYAI_API_KEY=your_assemblyai_key_here
OPENAI_API_KEY=your_openai_key_here
ELEVENLABS_API_KEY=your_elevenlabs_key_here
Set up your main Python file with the necessary imports:
import os
import threading
from queue import Queue
from dotenv import load_dotenv
import assemblyai as aai
from openai import OpenAI
from elevenlabs.client import ElevenLabs
from elevenlabs import stream
import pyaudio
load_dotenv()
aai.settings.api_key = os.getenv('ASSEMBLYAI_API_KEY')
openai_client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))
elevenlabs_client = ElevenLabs(api_key=os.getenv('ELEVENLABS_API_KEY'))
Step 2: Set up real-time audio streaming
Audio capture forms the foundation of your voice agent. For development, you’ll use PyAudio to capture microphone input. In production, you’d integrate with phone systems or web-based audio.
class AudioCapture:
    def __init__(self, sample_rate=16000):
        self.sample_rate = sample_rate
        self.chunk_size = 8000  # 0.5 seconds of audio at 16 kHz
        self.audio_queue = Queue()
        self.recording = False
        self.audio = pyaudio.PyAudio()
        self.stream = None

    def start_recording(self):
        self.recording = True
        self.stream = self.audio.open(
            format=pyaudio.paInt16,
            channels=1,
            rate=self.sample_rate,
            input=True,
            frames_per_buffer=self.chunk_size
        )
        thread = threading.Thread(target=self._capture_audio, daemon=True)
        thread.start()
        print("Microphone active - start speaking...")

    def _capture_audio(self):
        while self.recording:
            try:
                audio_data = self.stream.read(self.chunk_size, exception_on_overflow=False)
                self.audio_queue.put(audio_data)
            except Exception as e:
                print(f"Audio error: {e}")

    def get_audio_data(self):
        return self.audio_queue.get() if not self.audio_queue.empty() else None

    def stop_recording(self):
        self.recording = False
        if self.stream:
            self.stream.stop_stream()
            self.stream.close()
        self.audio.terminate()
Step 3: Integrate streaming speech-to-text
Connect AssemblyAI’s streaming API to transcribe audio in real time. The system processes both interim and final transcripts for the best user experience.
class VoiceAgent:
    def __init__(self):
        self.conversation_history = []
        self.is_processing = False
        self.audio_capture = AudioCapture()

    def handle_transcript(self, transcript: aai.RealtimeTranscript):
        if not transcript.text:
            return
        if isinstance(transcript, aai.RealtimeFinalTranscript):
            print(f"You: {transcript.text}")
            self.conversation_history.append({"role": "user", "content": transcript.text})
            if not self.is_processing:
                self.is_processing = True
                threading.Thread(target=self.generate_and_speak_response, daemon=True).start()
        else:
            print(f"Hearing: {transcript.text}")

    def start_conversation(self):
        print("Voice Agent starting...")
        print("Press Ctrl+C to stop")
        self.transcriber = aai.RealtimeTranscriber(
            sample_rate=16000,
            on_data=self.handle_transcript,
            on_error=lambda e: print(f"Speech error: {e}")
        )
        try:
            self.transcriber.connect()
            print("Connected to speech recognition")
            self.audio_capture.start_recording()
            while True:
                audio_chunk = self.audio_capture.get_audio_data()
                if audio_chunk:
                    self.transcriber.stream(audio_chunk)
        except KeyboardInterrupt:
            print("Shutting down...")
            self.cleanup()

    def cleanup(self):
        self.audio_capture.stop_recording()
        if hasattr(self, 'transcriber'):
            self.transcriber.close()
Step 4: Connect your LLM for responses
Add intelligent response generation using OpenAI’s GPT models. The system maintains conversation context and generates appropriate responses.
def generate_and_speak_response(self):
    try:
        messages = [{
            "role": "system",
            "content": """You are a helpful voice assistant.
            Keep responses conversational and concise - aim for 1-2 sentences.
            Speak naturally as if you're having a phone conversation."""
        }] + self.conversation_history
        print("Thinking...")
        response = openai_client.chat.completions.create(
            model="gpt-4",
            messages=messages,
            temperature=0.7,
            max_tokens=150
        )
        ai_response = response.choices[0].message.content
        print(f"Agent: {ai_response}")
        self.conversation_history.append({"role": "assistant", "content": ai_response})
        self.speak_text(ai_response)
    except Exception as e:
        print(f"Response error: {e}")
        self.speak_text("Sorry, I had trouble processing that. Could you try again?")
    finally:
        self.is_processing = False
Step 5: Add text-to-speech for voice responses
Complete your voice agent by adding natural speech synthesis. This creates the final piece of the conversation loop.
def speak_text(self, text):
    try:
        print("Speaking response...")
        audio = elevenlabs_client.generate(
            text=text,
            voice="Rachel",
            model="eleven_monolingual_v1"
        )
        stream(audio)
    except Exception as e:
        print(f"Speech synthesis error: {e}")

def main():
    agent = VoiceAgent()
    agent.start_conversation()

if __name__ == "__main__":
    main()
Here’s the complete, working voice agent code:
import os
import threading
from queue import Queue

from dotenv import load_dotenv
import assemblyai as aai
from openai import OpenAI
from elevenlabs.client import ElevenLabs
from elevenlabs import stream
import pyaudio

load_dotenv()
aai.settings.api_key = os.getenv('ASSEMBLYAI_API_KEY')
openai_client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))
elevenlabs_client = ElevenLabs(api_key=os.getenv('ELEVENLABS_API_KEY'))

class AudioCapture:
    def __init__(self, sample_rate=16000):
        self.sample_rate = sample_rate
        self.chunk_size = 8000
        self.audio_queue = Queue()
        self.recording = False
        self.audio = pyaudio.PyAudio()
        self.stream = None

    def start_recording(self):
        self.recording = True
        self.stream = self.audio.open(
            format=pyaudio.paInt16, channels=1, rate=self.sample_rate,
            input=True, frames_per_buffer=self.chunk_size
        )
        thread = threading.Thread(target=self._capture_audio, daemon=True)
        thread.start()

    def _capture_audio(self):
        while self.recording:
            try:
                self.audio_queue.put(self.stream.read(self.chunk_size, exception_on_overflow=False))
            except Exception as e:
                print(f"Audio error: {e}")

    def get_audio_data(self):
        return self.audio_queue.get() if not self.audio_queue.empty() else None

    def stop_recording(self):
        self.recording = False
        if self.stream:
            self.stream.stop_stream()
            self.stream.close()
        self.audio.terminate()

class VoiceAgent:
    def __init__(self):
        self.conversation_history = []
        self.is_processing = False
        self.audio_capture = AudioCapture()

    def handle_transcript(self, transcript: aai.RealtimeTranscript):
        if not transcript.text:
            return
        if isinstance(transcript, aai.RealtimeFinalTranscript):
            print(f"You: {transcript.text}")
            self.conversation_history.append({"role": "user", "content": transcript.text})
            if not self.is_processing:
                self.is_processing = True
                threading.Thread(target=self.generate_and_speak_response, daemon=True).start()
        else:
            print(f"Hearing: {transcript.text}")

    def generate_and_speak_response(self):
        try:
            messages = [{
                "role": "system",
                "content": "You are a helpful voice assistant. Keep responses conversational and concise."
            }] + self.conversation_history
            response = openai_client.chat.completions.create(
                model="gpt-4", messages=messages, temperature=0.7, max_tokens=150
            )
            ai_response = response.choices[0].message.content
            print(f"Agent: {ai_response}")
            self.conversation_history.append({"role": "assistant", "content": ai_response})
            audio = elevenlabs_client.generate(
                text=ai_response,
                voice="Rachel",
                model="eleven_monolingual_v1"
            )
            stream(audio)
        except Exception as e:
            print(f"Error: {e}")
        finally:
            self.is_processing = False

    def start_conversation(self):
        self.transcriber = aai.RealtimeTranscriber(
            sample_rate=16000,
            on_data=self.handle_transcript,
            on_error=lambda e: print(f"Speech error: {e}")
        )
        try:
            self.transcriber.connect()
            self.audio_capture.start_recording()
            print("Voice Agent ready - start speaking!")
            while True:
                audio_chunk = self.audio_capture.get_audio_data()
                if audio_chunk:
                    self.transcriber.stream(audio_chunk)
        except KeyboardInterrupt:
            print("Goodbye!")
            self.audio_capture.stop_recording()
            self.transcriber.close()

if __name__ == "__main__":
    VoiceAgent().start_conversation()
Run your voice agent with:
python voice_agent.py
Building production voice agents
Moving from a working prototype to production requires addressing reliability, scale, and quality that development doesn’t expose. Production voice agents handle thousands of conversations while maintaining consistent performance.
Speech recognition accuracy becomes critical at scale. When your agent misunderstands one word in every twenty, users notice and get frustrated. You need speech recognition that works reliably across different accents, background noise, and speaking styles.
Production requirements you need to address:
- Telephony integration: Connect through phone systems like Twilio or SIP trunking
- Concurrent handling: Process multiple conversations simultaneously
- Error recovery: Handle network failures and API outages gracefully
- Performance monitoring: Track response times and conversation quality
- Security compliance: Protect customer data and maintain audit trails
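For the error-recovery point above, one standard pattern is reconnecting with exponential backoff and jitter so transient network failures don't kill a conversation. A minimal sketch, where `connect` stands in for any callable wrapping your real connection logic (for example, a function that calls `transcriber.connect()`):

```python
import random
import time

def reconnect_with_backoff(connect, max_attempts=5, base_delay=0.5, cap=30.0):
    """Retry a flaky connection with exponential backoff and jitter.

    `connect` is any callable that raises ConnectionError on failure;
    the parameter names and defaults here are illustrative.
    """
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError as e:
            if attempt == max_attempts - 1:
                raise  # out of retries - surface the error to the caller
            # Double the delay each attempt (capped), with jitter to avoid
            # thundering-herd reconnects across many concurrent agents.
            delay = min(cap, base_delay * 2 ** attempt)
            time.sleep(delay * random.uniform(0.5, 1.0))
            print(f"Retry {attempt + 1} after error: {e}")
```

The same wrapper applies to the LLM and TTS calls; the important design choice is capping the delay and jittering it, so a brief provider outage degrades gracefully instead of synchronizing retries.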
AssemblyAI’s Universal-3 Pro Streaming model delivers production-grade accuracy specifically designed for voice agent applications. The model handles the challenging aspects of real conversations—background noise, accents, interruptions, and specialized terminology.
You’ll also need to consider infrastructure scaling:
- WebSocket connection pooling for handling multiple simultaneous conversations
- Load balancing across multiple server instances
- Database integration for storing conversation history and user preferences
- Monitoring and alerting for system health and performance metrics
Final words
Building an AI voice agent involves connecting streaming speech-to-text, language models, and text-to-speech into a real-time conversation system. The key is choosing components that deliver the accuracy and speed necessary for natural conversation—particularly the speech recognition foundation that determines whether your agent understands users correctly.
AssemblyAI’s Voice AI platform provides the streaming transcription accuracy needed for production voice agents, with specialized models that handle the complexities of real-world conversation. The Universal-3 Pro Streaming model delivers the reliability required when your voice agent represents your business to customers.