INTRODUCING UNIVERSAL-3 Pro Streaming

The most accurate real-time transcription model for voice agents

Universal-3 Pro Streaming gives your voice agents the accuracy, speed, and real-time control to handle real conversations at scale — rare word recognition, turn detection, context memory, and more.

Try Universal-3 Pro Streaming

See the difference in real time

Speak naturally. Universal-3 Pro Streaming captures what other models miss — try credit card numbers, email addresses, passwords, or company names.

Clinical evaluation history:
"prompt": "Produce a transcript for a clinical history evaluation. It's important to capture medication and dosage accurately. Every disfluency is meaningful data. Include: fillers (um, uh, er, erm, ah, hmm, mhm, like, you know, I mean), repetitions (I I I, the the), restarts (I was- I went), stutters (th-that, b-but, no-not), and informal speech (gonna, wanna, gotta)"
Without prompting

"I just want to move you along a bit further. Do you take any prescribed medicines? I know you've got diabetes and high blood pressure. I do. I take Ramipril. Okay. And I take Metformin, and there's another one that begins with G for the diabetes.  Glicoside."

With context aware prompting

"I just wanna move you along a bit further. Do you take any prescribed medicines? I know you've got diabetes and high blood pressure. I, I do. I take, um, I take Ramipril. Okay, mhm. And I take Metformin, and there's another one that begins with G for the diabetes. So glycosi — glycosi— glycoside."

Non-speech audio event:
"prompt": "Produce a transcript suitable for conversational analysis. Every disfluency is meaningful data. Include: Tag sounds: [beep]"
Without audio tagging

"Your call has been forwarded to an automatic voice message system. At the tone, please record your message. When you have finished recording, you may hang up or press 1 for more options."

With audio tagging

"Your call has been forwarded to an automatic voice message system. At the tone, please record your message. When you have finished recording, you may hang up or press 1 for more options. [beep]"

Speech with disfluencies:
"prompt": "Produce a transcript suitable for conversational analysis. Every disfluency is meaningful data. Include: fillers (um, uh, er, ah, hmm, mhm, like, you know, I mean), repetitions (I I, the the), restarts (I was- I went), stutters (th-that, b-but, no-not), and informal speech (gonna, wanna, gotta)"
Without disfluency prompting

Do you and Quentin still socialize when you come to Los Angeles, or is it like he's so used to having you here? No, no, no, we're friends. What do you do with him?

With disfluency prompting

Do you and Quentin still socialize, uh, when you come to Los Angeles, or is it like he's so used to having you here? No, no, no, we, we, we're friends. What do you do with him?
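
To see what disfluency prompting adds, the two transcripts above can be compared with a simple counter. This is a rough sketch under stated assumptions: the filler set is a subset of the list in the prompt (ambiguous words such as "like" are left out), and a repetition is counted whenever the same word appears twice in a row. The helper name `disfluency_counts` is hypothetical.

```python
import re

# Subset of the fillers listed in the prompt; multi-word and ambiguous
# fillers ("like", "you know", "I mean") are skipped to keep this simple.
FILLERS = {"um", "uh", "er", "ah", "hmm", "mhm"}


def disfluency_counts(transcript: str) -> dict[str, int]:
    """Count filler words and immediate word repetitions in a transcript."""
    words = re.findall(r"[a-z']+", transcript.lower())
    fillers = sum(1 for w in words if w in FILLERS)
    # Same word twice in a row, e.g. "we, we". Emphatic repeats such as
    # "no, no" are counted too, since they look identical at the word level.
    repetitions = sum(1 for a, b in zip(words, words[1:]) if a == b)
    return {"fillers": fillers, "repetitions": repetitions}


without_prompt = (
    "Do you and Quentin still socialize when you come to Los Angeles, or is it "
    "like he's so used to having you here? No, no, no, we're friends. "
    "What do you do with him?"
)
with_prompt = (
    "Do you and Quentin still socialize, uh, when you come to Los Angeles, or is "
    "it like he's so used to having you here? No, no, no, we, we, we're friends. "
    "What do you do with him?"
)

print(disfluency_counts(without_prompt))
print(disfluency_counts(with_prompt))
```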

Proper noun spelling:
"keyterms_prompt": ["Kelly Byrne-Donoghue"]
Without keyterms prompting

"Hi, this is Kelly Byrne Donahue"

With keyterms prompting

"Hi, this is Kelly Byrne-Donahue"

Capturing speaker roles:
"prompt": "Produce a transcript with every disfluency data. Additionally, label speakers with their respective roles. 1. Place [Speaker:role] at the start of each speaker turn. Example format: [Speaker:NURSE] Hello there. How can I help you today? [Speaker:PATIENT] I'm feeling unwell. I have a headache."
With traditional speaker labels

Speaker A: 5Mg. And do you take it regularly?

Speaker B: Oh yeah, yeah.

Speaker A: Good.

Speaker B: Every evening.

Speaker A: And no side effects with it?

With speaker labels prompting

Speaker [Nurse]: 5Mg. And do you take it regularly?

Speaker [Patient]: Oh yeah, yeah.

Speaker [Nurse]: Good.

Speaker [Patient]: Every evening.

Speaker [Nurse]: And no side effects with it?
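
If the model emits role tags in the `[Speaker:ROLE]` format requested by the prompt, a voice agent can split the text into turns. Below is a minimal sketch, assuming that exact inline tag format; the `Turn` type and `parse_turns` function are illustrative names, not part of any SDK.

```python
import re
from typing import NamedTuple


class Turn(NamedTuple):
    role: str
    text: str


# Tag format taken from the prompt's "Example format".
ROLE_TAG = re.compile(r"\[Speaker:([^\]]+)\]")


def parse_turns(transcript: str) -> list[Turn]:
    """Split a role-tagged transcript into (role, text) turns."""
    # With one capturing group, re.split returns:
    # [text before first tag, role1, text1, role2, text2, ...]
    parts = ROLE_TAG.split(transcript)
    return [
        Turn(role.strip().upper(), text.strip())
        for role, text in zip(parts[1::2], parts[2::2])
    ]


example = (
    "[Speaker:NURSE] Hello there. How can I help you today? "
    "[Speaker:PATIENT] I'm feeling unwell. I have a headache."
)
turns = parse_turns(example)
for turn in turns:
    print(f"{turn.role}: {turn.text}")
```

Text that appears before the first tag is ignored in this sketch; a production parser would need to decide how to attribute it.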

Spanish and English audio:
"language_detection": True
"prompt": "Preserve natural code-switching between English and Spanish. Retain spoken language as-is (correct \"I was hablando con mi manager\")."
Without codeswitching

Would definitely think I spoke Spanish if you heard me speak Spanish. But I still make mistakes. Soy wines. Paltro Soy. La fundadora de goop. Thank you. Thank you for doing that.

With codeswitching

You would definitely think I spoke Spanish if you heard me speak Spanish, but I still make mistakes. Soy Gwyneth Paltrow, soy la fundadora de Goop. Thank you. Thank you for doing that.
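
The two settings shown for this example can be assembled into a single configuration object before a session starts. This is only a sketch: the keys `language_detection` and `prompt` come from the page, but the surrounding dictionary, the variable names, and the idea of serializing it to JSON are assumptions, since the page does not show how the settings are sent.

```python
import json

# Hypothetical container for session settings. Only the two keys and their
# values are taken from the page; how they reach the API is not shown there.
session_config = {
    "language_detection": True,
    "prompt": (
        "Preserve natural code-switching between English and Spanish. "
        'Retain spoken language as-is (correct "I was hablando con mi manager").'
    ),
}

# Python's True becomes JSON true, and the inner quotes are escaped.
payload = json.dumps(session_config, ensure_ascii=False)
print(payload)
```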

Built with the capabilities that make or break voice agent deployments

Audio-contextual turn detection, seamless interruption handling, and high reliability on short utterances. Universal-3 Pro Streaming handles what other models can't.

Features

Providers compared: AssemblyAI Universal-3 Pro Streaming, Deepgram Nova-3, OpenAI GPT-4o Transcribe, Microsoft Azure, ElevenLabs Scribe V2

Average missed entity rate (lower is better)

Identify a wide range of entities that are spoken in your audio files, such as person and company names, email addresses, dates, and locations.

AssemblyAI Universal-3 Pro Streaming: 16.7%
Deepgram Nova-3: 25.2%
OpenAI GPT-4o Transcribe: 23.3%
Microsoft Azure: 25.1%
ElevenLabs Scribe V2: 22.1%

Speaker diarization performance

Industry Leading
Unreliable
Unreliable
Unreliable
Unlimited concurrency, no rate limits

Dynamic keyterms prompting
(turn-by-turn)

Static only

Real-time prompting

Usage-based pricing, no contracts

Commitments, overages, & rate limits
Commitments, overages, & rate limits
Commitments, overages, & rate limits
Commitments, overages, & rate limits
LiveKit / Pipecat / Twilio
native support

Partial

Real-time accuracy where voice agents actually operate

Universal-3 Pro Streaming improves over Universal-Streaming, delivering accuracy in conditions voice agents actually face: telephony, accented speech, high-turn-taking conversations, and noisy call center environments.

Missed Entity Rate: Universal-3 Pro Streaming vs. Universal-Streaming

Lower is better  ·  % of entities not correctly transcribed

Category: Universal-3 Pro Streaming / Universal-Streaming (difference in percentage points)

Temporal: 8.30% / 9.91% (+1.61)
Occupation: 8.7% / 10.1% (+3.2)
Location: 9.22% / 12.99% (+3.77)
Medical: 12.0% / 14.7% (+2.7)
Organization: 12.6% / 15.8% (+3.2)
Names: 13.1% / 14.6% (+1.5)
Phone: 19.6% / 23.2% (+3.6)
Email: 34.3% / 56.4% (+22.1)
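
The metric above is described as the percentage of entities not correctly transcribed. A rough way to compute such a rate is sketched below, assuming an entity counts as correct only if its reference spelling appears in the transcript (case-insensitive substring match). The page does not state the actual scoring rules, so the matching logic and the function name `missed_entity_rate` are assumptions for illustration.

```python
def missed_entity_rate(reference_entities: list[str], transcript: str) -> float:
    """Percentage of reference entities that do not appear in the transcript."""
    if not reference_entities:
        raise ValueError("at least one reference entity is required")
    haystack = transcript.lower()
    # Assumed rule: an entity is missed if its exact spelling is absent.
    missed = [e for e in reference_entities if e.lower() not in haystack]
    return 100 * len(missed) / len(reference_entities)


entities = ["Ramipril", "Metformin", "Kelly Byrne-Donoghue"]
transcript = "I take Ramipril and Metformin. Hi, this is Kelly Byrne Donahue"
rate = missed_entity_rate(entities, transcript)
print(f"{rate:.1f}% missed")

# The "difference" figures in the chart are gaps in percentage points
# between the two models, for example for the Temporal category:
temporal_gap = round(9.91 - 8.30, 2)
print(temporal_gap)
```

Here one of three entities is missed because the surname is misspelled, which illustrates why proper nouns weigh heavily on this metric.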

Entity Recognition on actual customer data

Names, dates, policy numbers, credit card numbers — the entities that drive outcomes are the ones most models get wrong. Universal-3 Pro Streaming delivers the lowest missed entity rates on real-world audio.

Missed Entity Rate by Category: All Providers

Lower is better  ·  Universal-3-Pro Streaming highlighted

Email
AssemblyAI Universal-3-Pro: 34.3%
AssemblyAI Universal-2: 56.4%
Amazon Transcribe: 71.3%
Deepgram Nova-3: 62.7%
ElevenLabs Scribe-2: 62.1%
Microsoft Azure: 63.7%
OpenAI GPT-4o Transcribe: 72.1%

Medical
AssemblyAI Universal-3-Pro: 12.0%
AssemblyAI Universal-2: 14.7%
Amazon Transcribe: 15.9%
Deepgram Nova-3: 15.1%
ElevenLabs Scribe-2: 15.28%
Microsoft Azure: 18.4%
OpenAI GPT-4o Transcribe: 13.0%

Phone
AssemblyAI Universal-3-Pro: 19.6%
AssemblyAI Universal-2: 23.2%
Amazon Transcribe: 22.4%
Deepgram Nova-3: 30.0%
ElevenLabs Scribe-2: 21.5%
Microsoft Azure: 24.2%
OpenAI GPT-4o Transcribe: 20.1%

Names
AssemblyAI Universal-3-Pro: 13.1%
AssemblyAI Universal-2: 14.6%
Amazon Transcribe: 16.7%
Deepgram Nova-3: 16.5%
ElevenLabs Scribe-2: 15.3%
Microsoft Azure: 17.5%
OpenAI GPT-4o Transcribe: 19.4%

Word Error Rate (%)

Lower is better  ·  English, all domains

AssemblyAI Universal-3-Pro: 8.14%
AssemblyAI Universal-2: 9.02%
ElevenLabs Scribe-2: 9.11%
Microsoft Azure: 9.11%
OpenAI GPT-4o Transcribe: 9.90%
Deepgram Nova-3: 11.06%
Amazon Transcribe: 15.20%
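
Word error rate is conventionally the number of word substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. The sketch below implements that standard definition with a word-level edit distance. The text normalization used for the published figures is not described on the page, so the lowercase-and-split step here is an assumption.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER in percent: word-level edit distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return 100 * prev[-1] / len(ref)


wer = word_error_rate(
    "hi this is kelly byrne-donoghue",
    "hi this is kelly byrne donahue",
)
print(f"{wer:.1f}%")
```

In this example a single misspelled surname costs two edits out of five reference words, which shows how WER and entity accuracy can diverge on short utterances.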

See the performance on your own files

Reach out to our Applied AI team to run latency and accuracy benchmarks on your own data.

Built for production voice agents

Every feature engineered for the demands of real voice agent infrastructure.

Industry-leading entity accuracy

Best-in-class recognition of credit card numbers, emails, URLs, passwords, and account numbers — the structured data voice agents act on.

Unlimited concurrency, no rate limits

Scale from a single call to millions without hitting limits or renegotiating contracts. Truly pay-as-you-go — no commitments required.

Real-time speaker diarization

Identify and separate speakers mid-conversation. Enable as a per-session toggle — no extra configuration needed.

Dynamic key term prompting

Boost up to 1,000 domain-specific terms, updated turn-by-turn mid-conversation. Unlike static alternatives, ours adapt in real time.
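
An application that updates key terms turn by turn needs to keep its list clean and within the 1,000-term limit quoted above. The following is a minimal client-side sketch: the `keyterms_prompt` key comes from the example earlier on the page, while `merge_keyterms`, `MAX_KEYTERMS`, and the shape of the update message are hypothetical, and how the update is actually sent mid-session is not covered here.

```python
MAX_KEYTERMS = 1000  # limit quoted on the page


def merge_keyterms(current: list[str], new_terms: list[str]) -> list[str]:
    """Add new terms, dropping blanks and case-insensitive duplicates."""
    seen = {term.casefold() for term in current}
    merged = list(current)
    for term in new_terms:
        cleaned = term.strip()
        if cleaned and cleaned.casefold() not in seen:
            seen.add(cleaned.casefold())
            merged.append(cleaned)
    if len(merged) > MAX_KEYTERMS:
        raise ValueError(f"too many key terms: {len(merged)} > {MAX_KEYTERMS}")
    return merged


# Start with a caller's name, then add medication names once the
# conversation turns to prescriptions.
terms = merge_keyterms(["Kelly Byrne-Donoghue"], ["Ramipril", "ramipril ", "Metformin"])
update = {"keyterms_prompt": terms}  # hypothetical update message shape
print(update)
```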

One-line integrations

Native support for LiveKit, Pipecat, Twilio, and Daily. Go from sign-up to a production voice agent in under 15 minutes.

Real-time Prompting
Beta

Guide transcription behavior with natural language in streaming mode. Start with our prompt templates — experiment and share what works.

Sub-200ms end-to-end latency


Open community models

We've built the best voice AI inference infrastructure in the world — and we're opening it to community models, starting with Whisper Streaming.

Global language coverage

Full prompting with keyterms, diarization, and audio tagging in English, Spanish, German, French, Portuguese, and Italian.

More on Universal-3 Pro Streaming

What's next

We’ll be releasing new updates and improvements to Universal-3 Pro Streaming over the coming weeks.

Read the blog

Playground

Access our production-ready Voice AI models for speech recognition, speaker detection, audio summarization, and more—all in our no-code playground.

Try our Playground

Start Building

Explore our comprehensive prompt engineering guide with use case templates, best practices, and an AI-powered prompt generator to optimize accuracy for your application.

Read the docs

Unlock the value of voice data

Build what’s next on the platform powering thousands of the industry’s leading Voice AI apps.