INTRODUCING UNIVERSAL-3 Pro Streaming

The most accurate real-time transcription model for voice agents

Universal-3 Pro Streaming gives your voice agents the accuracy, speed, and real-time control to handle real conversations at scale — rare word recognition, turn detection, context memory, and more.

Try Universal-3 Pro Streaming

See the difference in real-time

Speak naturally. Universal-3 Pro Streaming captures what other models miss — try credit card numbers, email addresses, passwords, or company names.

Clinical evaluation history:

"prompt": "Produce a transcript for a clinical history evaluation. It's important to capture medication and dosage accurately. Every disfluency is meaningful data. Include: fillers (um, uh, er, erm, ah, hmm, mhm, like, you know, I mean), repetitions (I I I, the the), restarts (I was- I went), stutters (th-that, b-but, no-not), and informal speech (gonna, wanna, gotta)"

Without prompting

"I just want to move you along a bit further. Do you take any prescribed medicines? I know you've got diabetes and high blood pressure. I do. I take Ramipril. Okay. And I take Metformin, and there's another one that begins with G for the diabetes. Glicoside."

With context aware prompting

"I just wanna move you along a bit further. Do you take any prescribed medicines? I know you've got diabetes and high blood pressure. I, I do. I take, um, I take Ramipril. Okay, mhm. And I take Metformin, and there's another one that begins with G for the diabetes. So glycosi — glycosi— glycoside."

Non-speech audio event:

"prompt": "Produce a transcript suitable for conversational analysis. Every disfluency is meaningful data. Include: Tag sounds: [beep]"

Without audio tagging

"Your call has been forwarded to an automatic voice message system. At the tone, please record your message. When you have finished recording, you may hang up or press 1 for more options."

With audio tagging

"Your call has been forwarded to an automatic voice message system. At the tone, please record your message. When you have finished recording, you may hang up or press 1 for more options. [beep]"

Speech with disfluencies:

"prompt": "Produce a transcript suitable for conversational analysis. Every disfluency is meaningful data. Include: fillers (um, uh, er, ah, hmm, mhm, like, you know, I mean), repetitions (I I, the the), restarts (I was- I went), stutters (th-that, b-but, no-not), and informal speech (gonna, wanna, gotta)"

Without disfluency prompting

Do you and Quentin still socialize when you come to Los Angeles, or is it like he's so used to having you here? No, no, no, we're friends. What do you do with him?

With disfluency prompting

Do you and Quentin still socialize, uh, when you come to Los Angeles, or is it like he's so used to having you here? No, no, no, we, we, we're friends. What do you do with him?
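
To see exactly what a disfluency prompt adds, the two transcripts can be compared token by token. The following is a minimal sketch using Python's standard `difflib`; the helper name `added_tokens` is hypothetical and the comparison is a plain whitespace-token diff, not anything the model itself performs.

```python
import difflib

def added_tokens(plain: str, detailed: str) -> list[str]:
    """Tokens that appear in `detailed` where `plain` has nothing or something different."""
    a, b = plain.split(), detailed.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    extra = []
    for op, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # "insert" = tokens only in `detailed`; "replace" = tokens that differ.
        if op in ("insert", "replace"):
            extra.extend(b[j1:j2])
    return extra

plain = "No, no, no, we're friends."
detailed = "No, no, no, we, we, we're friends."
print(added_tokens(plain, detailed))
```

Run on the excerpt above, this surfaces the repeated "we," tokens that only the prompted transcript keeps.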

Proper noun spelling:

"keyterms_prompt": ["Kelly Byrne-Donoghue"]

Without keyterms prompting

"Hi, this is Kelly Byrne Donahue"

With keyterms prompting

"Hi, this is Kelly Byrne-Donahue"

Capturing speaker roles:

"prompt": "Produce a transcript with every disfluency data. Additionally, label speakers with their respective roles. 1. Place [Speaker:role] at the start of each speaker turn. Example format: [Speaker:NURSE] Hello there. How can I help you today? [Speaker:PATIENT] I'm feeling unwell. I have a headache."

With traditional speaker labels

Speaker A: 5Mg. And do you take it regularly?

Speaker B: Oh yeah, yeah.

Speaker A: Good.

Speaker B: Every evening.

Speaker A: And no side effects with it?

With speaker labels prompting

Speaker [Nurse]: 5Mg. And do you take it regularly?

Speaker [Patient]: Oh yeah, yeah.

Speaker [Nurse]: Good.

Speaker [Patient]: Every evening.

Speaker [Nurse]: And no side effects with it?
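
Downstream code usually needs role-tagged text split into turns. The sketch below assumes the transcript arrives in the `[Speaker:ROLE]` format given as the example inside the prompt above; the `Turn` type and `parse_role_turns` helper are hypothetical names, and real output may need a more tolerant pattern.

```python
import re
from typing import NamedTuple

class Turn(NamedTuple):
    role: str
    text: str

# Tag format taken from the example inside the prompt: [Speaker:ROLE]
ROLE_TAG = re.compile(r"\[Speaker:([^\]]+)\]")

def parse_role_turns(transcript: str) -> list[Turn]:
    """Split a role-tagged transcript into (role, text) turns."""
    # re.split with one capture group yields
    # [preamble, role1, text1, role2, text2, ...]
    parts = ROLE_TAG.split(transcript)
    turns = []
    for role, text in zip(parts[1::2], parts[2::2]):
        turns.append(Turn(role.strip(), text.strip()))
    return turns

sample = (
    "[Speaker:NURSE] Hello there. How can I help you today? "
    "[Speaker:PATIENT] I'm feeling unwell. I have a headache."
)
for turn in parse_role_turns(sample):
    print(f"{turn.role}: {turn.text}")
```

Text that appears before the first tag is ignored, so a transcript without any tags yields an empty list.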

Spanish and English audio:

"language_detection": True
"prompt": "Preserve natural code-switching between English and Spanish. Retain spoken language as-is (correct 'I was hablando con mi manager')."

Without codeswitching

Would definitely think I spoke Spanish if you heard me speak Spanish. But I still make mistakes. Soy wines. Paltro Soy. La fundadora de goop. Thank you. Thank you for doing that.

With codeswitching

You would definitely think I spoke Spanish if you heard me speak Spanish, but I still make mistakes. Soy Gwyneth Paltrow, soy la fundadora de Goop. Thank you. Thank you for doing that.
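
The examples above use three fields: `prompt`, `keyterms_prompt`, and `language_detection`. The sketch below only gathers them into one dictionary. How the fields are delivered to the service (query string, JSON message, or SDK call) and whether they can all be combined in one session are assumptions not confirmed here, and `build_session_params` is a hypothetical helper.

```python
import json

def build_session_params(prompt=None, keyterms=None, language_detection=False):
    """Collect the prompting fields shown in the examples into one dict.

    Only fields that are actually set are included, so defaults stay
    with the service rather than being overridden by empty values.
    """
    params = {}
    if prompt:
        params["prompt"] = prompt
    if keyterms:
        params["keyterms_prompt"] = list(keyterms)
    if language_detection:
        params["language_detection"] = True
    return params

params = build_session_params(
    prompt=(
        "Produce a transcript suitable for conversational analysis. "
        "Every disfluency is meaningful data. Include: Tag sounds: [beep]"
    ),
    keyterms=["Kelly Byrne-Donoghue"],
)
print(json.dumps(params, indent=2))
```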

Built with the capabilities that make or break voice agent deployments

Audio-contextual turn detection, seamless interruption handling, and high reliability on short utterances. Universal-3 Pro Streaming handles what other models can't.

Features	AssemblyAI Universal-3 Pro Streaming	Deepgram Nova-3	OpenAI GPT-4o Transcribe	Microsoft Azure	ElevenLabs Scribe V2
Average missed entity rate (lower is better) Identify a wide range of entities that are spoken in your audio files, such as person and company names, email addresses, dates, and locations.	16.7%	25.2%	23.3%	25.1%	22.1%
Speaker diarization performance	Industry Leading	Unreliable	Unreliable	Unreliable
Unlimited concurrency, no rate limits	[icon]
Dynamic keyterms prompting (turn-by-turn)	[icon]	Static only	[icon]	[icon]	[icon]
Real-time prompting	[icon]		[icon]	[icon]	[icon]
Usage-based pricing, no contracts	[icon]	Commitments, overages, & rate limits	Commitments, overages, & rate limits	Commitments, overages, & rate limits	Commitments, overages, & rate limits
LiveKit / Pipecat / Twilio native support	[icon]	[icon]	Partial	[icon]	[icon]

Real-time accuracy where voice agents actually operate

Universal-3 Pro Streaming improves over Universal-Streaming, delivering accuracy in conditions voice agents actually face: telephony, accented speech, high-turn-taking conversations, and noisy call center environments.

Missed Entity Rate: Universal-3 Pro Streaming vs. Universal-Streaming

Lower is better · % of entities not correctly transcribed

Category	Universal-3 Pro Streaming	Universal-Streaming	Difference (points)
Temporal	8.30%	9.91%	+1.61
Occupation	8.7%	10.1%	+3.2
Location	9.22%	12.99%	+3.77
Medical	12.0%	14.7%	+2.7
Organization	12.6%	15.8%	+3.2
Names	13.1%	14.6%	+1.5
Phone	19.6%	23.2%	+3.6
(unlabeled)	34.3%	56.4%	+22.1
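
The chart above describes missed entity rate as the percentage of entities not correctly transcribed. A minimal sketch of that idea follows. The matching rule used here (case-insensitive substring match) is an assumption, since the benchmark's actual matching rule is not stated, and the sample entities and transcript are made up for illustration.

```python
def missed_entity_rate(reference_entities: list[str], hypothesis: str) -> float:
    """Percentage of reference entities that do not appear in the hypothesis.

    Matching is a case-insensitive substring check (an assumption).
    """
    if not reference_entities:
        return 0.0
    hyp = hypothesis.lower()
    missed = sum(1 for entity in reference_entities if entity.lower() not in hyp)
    return 100.0 * missed / len(reference_entities)

# Illustrative data only, not benchmark data.
entities = ["Granola", "Los Angeles", "Metformin"]
hypothesis = "I flew to Los Angeles after picking up my metformin at Granite Pharmacy."
print(f"{missed_entity_rate(entities, hypothesis):.1f}%")
```

A production evaluation would normally normalize formatting first (for example, spoken versus written phone numbers) so that correct transcriptions are not counted as misses.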

Entity Recognition on actual customer data

Names, dates, policy numbers, credit card numbers — the entities that drive outcomes are the ones most models get wrong. Universal-3 Pro Streaming delivers the lowest missed entity rates on real-world audio.

Missed Entity Rate by Category: All Providers

Lower is better · Universal-3-Pro Streaming highlighted

Provider	Group 1	Group 2	Group 3	Group 4
AssemblyAI Universal-3-Pro	34.3%	12.0%	19.6%	13.1%
AssemblyAI Universal-2	56.4%	14.7%	23.2%	14.6%
Amazon Transcribe	71.3%	15.9%	22.4%	16.7%
Deepgram Nova-3	62.7%	15.1%	30.0%	16.5%
ElevenLabs Scribe-2	62.1%	15.28%	21.5%	15.3%
Microsoft Azure	63.7%	18.4%	24.2%	17.5%
OpenAI GPT-4o Transcribe	72.1%	13.0%	20.1%	19.4%

Word Error Rate (%)

Lower is better · English, all domains

Provider	Word error rate
AssemblyAI Universal-3-Pro	8.14%
AssemblyAI Universal-2	9.02%
ElevenLabs Scribe-2	9.11%
Microsoft Azure	9.11%
OpenAI GPT-4o Transcribe	9.90%
Deepgram Nova-3	11.06%
Amazon Transcribe	15.20%
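
For readers who want to reproduce this kind of number on their own audio, word error rate is conventionally the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The sketch below implements that standard definition with simple lowercasing and whitespace tokenization. The normalization used for the figures above is not stated, so results will not be directly comparable.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER as a percentage: (substitutions + deletions + insertions) / reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words processed
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return 100.0 * prev[-1] / len(ref)

wer = word_error_rate(
    "hi this is kelly byrne donoghue",
    "hi this is kelly byrne donahue",
)
print(f"{wer:.1f}%")
```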

See the performance on your own files

Reach out to our Applied AI team to run latency and accuracy benchmarks on your own data.

Contact Applied AI

Built for production voice agents

Every feature engineered for the demands of real voice agent infrastructure.

Industry-leading entity accuracy

Best-in-class recognition of credit card numbers, emails, URLs, passwords, and account numbers — the structured data voice agents act on.

Unlimited concurrency, no rate limits

Scale from a single call to millions without hitting limits or renegotiating contracts. Truly pay-as-you-go — no commitments required.

Real-time speaker diarization

Identify and separate speakers mid-conversation. Enable as a per-session toggle — no extra configuration needed.

Dynamic key term prompting

Boost up to 1,000 domain-specific terms, updated turn-by-turn mid-conversation. Unlike static alternatives, ours adapt in real time.

One-line integrations

Native support for LiveKit, Pipecat, Twilio, and Daily. Go from sign-up to a production voice agent in under 15 minutes.

Real-time Prompting

Beta

Guide transcription behavior with natural language in streaming mode. Start with our prompt templates — experiment and share what works.

Sub-200ms end-to-end latency

Open community models

We've built the best voice AI inference infrastructure in the world — and we're opening it to community models, starting with Whisper Streaming.

Global language coverage

Full prompting with keyterms, diarization, and audio tagging in English, Spanish, German, French, Portuguese, and Italian.
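
The dynamic key term prompting feature above mentions up to 1,000 terms, updated turn by turn. One way an application might manage that list on its side is sketched below. The `KeytermSet` class, its oldest-first eviction policy, and the shape of the serialized update are all hypothetical; only the 1,000-term limit and the `keyterms_prompt` field name come from this page.

```python
import json

MAX_KEYTERMS = 1000  # limit quoted in the feature description above

class KeytermSet:
    """Keeps an ordered, de-duplicated keyterm list under a fixed cap."""

    def __init__(self, max_terms: int = MAX_KEYTERMS):
        self.max_terms = max_terms
        self._terms: dict[str, None] = {}  # dict preserves insertion order

    def add(self, *terms: str) -> None:
        for term in terms:
            term = term.strip()
            if not term:
                continue
            # Re-adding a term moves it to the most recent position.
            self._terms.pop(term, None)
            self._terms[term] = None
        # Drop the oldest terms once the cap is exceeded.
        while len(self._terms) > self.max_terms:
            oldest = next(iter(self._terms))
            del self._terms[oldest]

    def as_update_message(self) -> str:
        """Serialize the current list; the message shape is hypothetical."""
        return json.dumps({"keyterms_prompt": list(self._terms)})

terms = KeytermSet(max_terms=3)
terms.add("Ramipril", "Metformin")
terms.add("Kelly Byrne-Donoghue", "Granola")  # exceeds the cap of 3, evicts "Ramipril"
print(terms.as_update_message())
```

Evicting the oldest term first is only one possible policy; an agent that knows which terms matter for the current call stage could prioritize differently.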

Unlock the value of voice data

Build what’s next on the platform powering thousands of the industry’s leading Voice AI apps.

Try our API for free

Contact sales
