April 8, 2026

Transcript output guide: SRT, VTT & TXT export formats + live caption sync

Kelsey Foster

Growth

Speech-to-Text

When you export speech-to-text transcripts, you need to choose the right output format for your specific use case. The three most common formats, SRT, WebVTT (VTT), and TXT, each serve different purposes and offer distinct capabilities. SRT provides the broadest compatibility across video platforms and players, WebVTT adds styling and positioning features for web video, and TXT produces searchable transcripts without timing information.

Understanding these format differences directly impacts your video accessibility, SEO performance, and viewer experience. The wrong format choice can result in broken captions, poor readability, or missing features that your platform supports. This guide explains each format's structure, capabilities, and limitations, then shows you when to use SRT for maximum compatibility, WebVTT for web-based projects, or TXT for content repurposing and accessibility compliance.

What is SRT?

SRT is a simple text file that stores subtitles with timestamps. This means you get a plain text document that tells video players exactly when to show each line of text on screen. SRT files work with almost every video platform and player because they use the most basic format possible—just text, numbers, and timing codes.

The format gets its name from SubRip, the subtitle-extraction software that introduced this structure. Today, SRT has become a de facto standard because it works nearly everywhere, from YouTube to professional video editing software.

SRT file structure

Every SRT file follows the same four-part pattern that repeats for each subtitle:

1
00:00:05,000 --> 00:00:08,000
This is the first subtitle line.

2
00:00:08,500 --> 00:00:12,000
And this is the second subtitle
that spans two lines.

Here's what each part does:

Sequential number: Each subtitle gets a number (1, 2, 3...) to show the order
Timestamps: Start and end times in hours:minutes:seconds,milliseconds format
Text content: The actual words that appear on screen, usually 1-2 lines
Blank line: An empty line that separates each subtitle block

You must save SRT files with UTF-8 encoding and the .srt extension. UTF-8 handles international characters like accents or symbols, making your subtitles work across different languages.
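
The four-part pattern above can be generated programmatically. The following is a minimal sketch, assuming cues arrive as (start in milliseconds, end in milliseconds, text) tuples; the function names and the tuple layout are illustrative choices, not part of any particular library.

```python
def format_srt_timestamp(ms: int) -> str:
    """Convert milliseconds to the SRT HH:MM:SS,mmm form (comma before millis)."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1_000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def build_srt(cues: list) -> str:
    """Build SRT text from (start_ms, end_ms, text) tuples."""
    blocks = []
    for index, (start_ms, end_ms, text) in enumerate(cues, start=1):
        blocks.append(
            f"{index}\n"  # sequential number
            f"{format_srt_timestamp(start_ms)} --> {format_srt_timestamp(end_ms)}\n"
            f"{text}\n"  # one or two lines of caption text
        )
    # Joining with a newline leaves one blank line between blocks
    return "\n".join(blocks)


def write_srt(path: str, cues: list) -> None:
    """Save the cues as UTF-8 so accented and non-Latin characters survive."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(build_srt(cues))


cues = [
    (5000, 8000, "This is the first subtitle line."),
    (8500, 12000, "And this is the second subtitle\nthat spans two lines."),
]
srt_text = build_srt(cues)
```

Calling write_srt("example.srt", cues) would write the same two blocks shown in the sample above.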

SRT capabilities and limitations

SRT handles the basics perfectly but can't do advanced formatting. Here's what you can and can't do:

What SRT supports:

Plain text display
Basic HTML tags like <b>, <i>, and <u>
Speaker labels using brackets like [John]
Line breaks for two-line subtitles

What SRT doesn't support:

Text colors or custom fonts
Positioning subtitles anywhere except bottom-center
Animation effects or transitions
Metadata like language information

Think of SRT as the reliable workhorse—it does exactly what you need without complications.

What is WebVTT?

WebVTT is a subtitle format designed specifically for web browsers and HTML5 video. This means it includes all of SRT's basic features plus advanced styling and positioning options that work natively in web browsers. WebVTT stands for Web Video Text Tracks and is specified by the W3C (the organization that sets web standards).

The key advantage is native browser support—you don't need plugins or special software to display WebVTT captions in web video players.

WebVTT structure and syntax

WebVTT files look similar to SRT but have some important differences:

WEBVTT

00:00:05.000 --> 00:00:08.000
This is a WebVTT subtitle.

00:00:08.500 --> 00:00:12.000 position:10% align:left
<v Speaker1>This subtitle appears on the left side.

00:00:12.500 --> 00:00:16.000
<b>Bold text</b> and <i>italic text</i> are supported.

The main structural differences:

Required header: Every WebVTT file must start with "WEBVTT" on the first line
Period for milliseconds: Uses 05.000 instead of 05,000 like SRT
Inline positioning: You can add position and alignment settings directly in the timestamp line
Voice tags: Use <v Speaker> to identify who's talking
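
Because the two formats differ mainly in the header and the millisecond separator, a basic SRT file can be converted to WebVTT mechanically. This is a rough sketch under simple assumptions: the input is well-formed SRT, sequence numbers are dropped (WebVTT cue identifiers are optional), and the function name srt_to_vtt is made up for illustration.

```python
import re

# Matches an SRT timing line such as "00:00:05,000 --> 00:00:08,000"
SRT_TIMING = re.compile(
    r"^(\d{2,}:\d{2}:\d{2}),(\d{3}) --> (\d{2,}:\d{2}:\d{2}),(\d{3})$"
)


def srt_to_vtt(srt_text: str) -> str:
    """Convert well-formed SRT text to WebVTT text."""
    out = ["WEBVTT", ""]  # required header, then a blank line
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        # Drop the SRT sequence number at the top of each block
        if lines and lines[0].strip().isdigit():
            lines = lines[1:]
        if not lines:
            continue
        # Swap the comma for a period in both timestamps
        lines[0] = SRT_TIMING.sub(r"\1.\2 --> \3.\4", lines[0].strip())
        out.extend(lines)
        out.append("")
    return "\n".join(out)


sample_srt = (
    "1\n00:00:05,000 --> 00:00:08,000\nThis is the first subtitle line.\n\n"
    "2\n00:00:08,500 --> 00:00:12,000\nAnd this is the second subtitle\n"
    "that spans two lines.\n"
)
vtt_text = srt_to_vtt(sample_srt)
```

Positioning settings and voice tags have no SRT equivalent, so they would need to be added after conversion.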

WebVTT styling and advanced features

WebVTT's real power comes from its styling capabilities that make captions look professional and accessible.

Text styling options:

Colors and other CSS styling through class spans such as <c.highlight>, styled with the ::cue pseudo-element
Bold, italic, and underline with <b>, <i>, and <u> tags
Ruby text for languages that need pronunciation guides

Positioning controls:

position:50% moves text horizontally across the screen
align:center controls text alignment (start, center, end, left, right)
line:90% sets vertical position from the top
size:80% adjusts the width of the caption area
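
The positioning settings above are appended to the timing line, separated by spaces. A small sketch of how a script might assemble such a line follows; cue_settings and format_cue are hypothetical helper names, and percentages are passed as plain numbers.

```python
VALID_ALIGN = {"start", "center", "end", "left", "right"}


def cue_settings(position=None, align=None, line=None, size=None) -> str:
    """Render WebVTT cue settings as a space-separated string."""
    parts = []
    if position is not None:
        parts.append(f"position:{position}%")  # horizontal placement
    if align is not None:
        if align not in VALID_ALIGN:
            raise ValueError(f"unsupported align value: {align}")
        parts.append(f"align:{align}")
    if line is not None:
        parts.append(f"line:{line}%")  # vertical placement from the top
    if size is not None:
        parts.append(f"size:{size}%")  # width of the caption area
    return " ".join(parts)


def format_cue(start: str, end: str, text: str, **settings) -> str:
    """Build one cue; timestamps are already in HH:MM:SS.mmm form."""
    timing = f"{start} --> {end}"
    rendered = cue_settings(**settings)
    if rendered:
        timing = f"{timing} {rendered}"
    return f"{timing}\n{text}\n"


cue = format_cue(
    "00:00:08.500", "00:00:12.000",
    "This subtitle appears on the left side.",
    position=10, align="left",
)
```

How a given player honors these settings varies, so it is worth checking the result in the target player.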

Speaker identification:

<v John>I think we should review the proposal.
<v Sarah>That's a great idea. When can we meet?

These features make WebVTT perfect when you want captions that enhance rather than distract from your video content.

The <v Speaker> tag is a valid WebVTT feature, but AssemblyAI does not automatically include it in VTT exports. Developers must enable speaker labeling, use the transcript results, and programmatically construct a custom VTT file if they want speaker-tagged captions.
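
One way to construct such a file is sketched below. It assumes each utterance is a dictionary with speaker, start, end (both in milliseconds), and text keys; that shape is an assumption for illustration, so check it against the transcript response you actually receive. The helper names are hypothetical.

```python
def format_vtt_timestamp(ms: int) -> str:
    """Convert milliseconds to the WebVTT HH:MM:SS.mmm form."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1_000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{millis:03d}"


def escape_cue_text(text: str) -> str:
    """Escape characters that would otherwise be read as WebVTT markup."""
    return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")


def utterances_to_vtt(utterances: list) -> str:
    """Build a WebVTT file with one voice-tagged cue per utterance."""
    lines = ["WEBVTT", ""]
    for utt in utterances:
        start = format_vtt_timestamp(utt["start"])
        end = format_vtt_timestamp(utt["end"])
        lines.append(f"{start} --> {end}")
        lines.append(f"<v Speaker {utt['speaker']}>{escape_cue_text(utt['text'])}")
        lines.append("")
    return "\n".join(lines)


# Assumed utterance shape; times are in milliseconds
utterances = [
    {"speaker": "A", "start": 5000, "end": 8000,
     "text": "I think we should review the proposal."},
    {"speaker": "B", "start": 8500, "end": 12000,
     "text": "That's a great idea. When can we meet?"},
]
vtt_text = utterances_to_vtt(utterances)
```

Long utterances would still need to be split into shorter cues before publishing; this sketch keeps one cue per utterance for clarity.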

TXT format for untimed transcripts

TXT format gives you plain text without any timing information. This means you get just the spoken words in a simple text file—no timestamps, no formatting, just the content. While this isn't technically a subtitle format, TXT exports serve important purposes that timed formats can't handle.

SEO optimization: Search engines can read and index text content from your videos
Content repurposing: Turn video content into blog posts or social media content
Accessibility compliance: Provide text alternatives for screen readers
Translation workflows: Translators work faster with plain text than timed formats
Documentation: Create searchable records of meetings or interviews

The main advantage is simplicity—anyone can open, read, edit, and copy TXT files without special software.
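
If you only have a timed caption file, a plain transcript can be recovered by dropping everything that is not caption text. This is a rough sketch, not a full parser: it ignores WebVTT NOTE and STYLE blocks, and it would also drop a caption line that consists only of digits. The function name captions_to_text is illustrative.

```python
import re

# A timing line in either SRT (comma) or WebVTT (period) style
TIMING_LINE = re.compile(r"^(?:\d{2,}:)?\d{2}:\d{2}[.,]\d{3}\s+-->\s+")
# Inline markup such as <b>, </i>, or <v Speaker1>
TAG = re.compile(r"<[^>]+>")


def captions_to_text(caption_text: str) -> str:
    """Strip numbering, timing, and tags from SRT or WebVTT text."""
    kept = []
    for line in caption_text.splitlines():
        stripped = line.strip()
        if (
            not stripped
            or stripped.startswith("WEBVTT")
            or stripped.isdigit()
            or TIMING_LINE.match(stripped)
        ):
            continue
        kept.append(TAG.sub("", stripped))
    return " ".join(kept)


sample_srt = (
    "1\n00:00:05,000 --> 00:00:08,000\nThis is the first subtitle line.\n\n"
    "2\n00:00:08,500 --> 00:00:12,000\nAnd this is the second subtitle\n"
    "that spans two lines.\n"
)
plain_text = captions_to_text(sample_srt)
```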

SRT vs VTT vs TXT comparison

Each format excels in different scenarios. Your choice depends on where you're publishing and what features you need.

Feature	SRT	WebVTT	TXT
Platform support	Universal	Web-focused	Universal
Styling options	Basic HTML only	Full CSS styling	None
Text positioning	Bottom-center only	Fully customizable	N/A
File complexity	Simple	Moderate	Simplest
Best for	Cross-platform video	HTML5 web video	Transcripts

When to choose SRT: You need subtitles that work everywhere without complications. SRT is your safe choice for maximum compatibility.

When to choose WebVTT: You're publishing to web platforms and want professional styling, positioning, or live streaming capabilities.

When to choose TXT: You need searchable content, accessibility documentation, or material for content repurposing.

Platform-specific requirements

Different platforms have specific preferences that affect how your subtitles display:

YouTube: Prefers WebVTT but accepts SRT. Offers auto-timing features with VTT files.

Facebook: Only accepts SRT and strips all formatting tags, so keep it simple.

Vimeo: Supports both SRT and WebVTT with full styling capabilities.

LinkedIn: Accepts SRT but limits lines to 80 characters maximum.

Zoom: Uses WebVTT for webinar recordings and replay features.

Understanding these requirements prevents formatting issues and ensures your captions display correctly.

Live caption sync for streaming

Live streaming needs real-time caption generation that appears within 1-3 seconds of spoken words. Unlike pre-recorded video where you can perfect timing afterward, live captions must process and display instantly while maintaining accuracy.

WebVTT works best for live streaming because it supports progressive delivery—captions can be sent in individual chunks as they're created rather than waiting for complete files.

Key requirements for live captions:

Sub-second speech processing to maintain natural flow
Buffer management that balances speed with accuracy
Error recovery for network interruptions or delays
Format compatibility with streaming protocols

Modern streaming platforms use specialized APIs that convert speech to text in real-time. AssemblyAI's real-time streaming API returns JSON messages, and developers must transform those responses—such as `Turn` events—into VTT or SRT caption chunks for live caption workflows.

This enables features like live accessibility for webinars, real-time captions for virtual events, and instant subtitles for social media live streams.
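
The transformation from streaming messages to caption chunks might look like the sketch below. The message shape is an assumption made for illustration: a JSON object with type, end_of_turn, transcript, and a words list carrying start and end times in milliseconds. Verify the field names against the streaming API reference before relying on them.

```python
import json
from typing import Optional


def format_vtt_timestamp(ms: int) -> str:
    """Convert milliseconds to the WebVTT HH:MM:SS.mmm form."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1_000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{millis:03d}"


def turn_to_vtt_cue(message: str) -> Optional[str]:
    """Return one WebVTT cue for a finished turn, or None otherwise."""
    event = json.loads(message)
    # Skip other message types and partial turns whose text may still change
    if event.get("type") != "Turn" or not event.get("end_of_turn"):
        return None
    words = event.get("words") or []
    if not words:
        return None
    start = format_vtt_timestamp(words[0]["start"])
    end = format_vtt_timestamp(words[-1]["end"])
    return f"{start} --> {end}\n{event['transcript']}\n"


# Hypothetical message, shaped per the assumptions above
final_message = json.dumps({
    "type": "Turn",
    "end_of_turn": True,
    "transcript": "Welcome to the webinar.",
    "words": [
        {"text": "Welcome", "start": 1200, "end": 1600},
        {"text": "webinar.", "start": 2300, "end": 2900},
    ],
})
cue = turn_to_vtt_cue(final_message)
```

Waiting for the end of a turn trades a little latency for stable text; emitting partial turns would appear sooner but require the player to replace earlier cues.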

Subtitle export best practices

Following proven practices ensures your subtitles are readable and professional regardless of format.

Technical requirements:

Use UTF-8 encoding for international character support
Save files with correct extensions (.srt, .vtt, .txt)
Test files with your target video players before publishing

Timing and readability:

Keep each subtitle on screen for 1-5 seconds for comfortable reading
Leave 0.5-1 second gaps between subtitles to prevent fatigue
Limit lines to 32-40 characters for optimal reading speed
Use a maximum of 2 lines per subtitle to avoid covering video content
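
These timing and readability guidelines are easy to check automatically before publishing. The sketch below assumes cues are (start in milliseconds, end in milliseconds, text) tuples and uses the upper line-length bound of 40 characters; the thresholds and the check_cues name are illustrative and can be adjusted.

```python
MIN_DURATION_MS = 1000   # shortest comfortable display time
MAX_DURATION_MS = 5000   # longest comfortable display time
MIN_GAP_MS = 500         # pause between consecutive subtitles
MAX_CHARS_PER_LINE = 40
MAX_LINES = 2


def check_cues(cues: list) -> list:
    """Return human-readable warnings for cues that break the guidelines."""
    warnings = []
    previous_end = None
    for number, (start_ms, end_ms, text) in enumerate(cues, start=1):
        duration = end_ms - start_ms
        if not MIN_DURATION_MS <= duration <= MAX_DURATION_MS:
            warnings.append(f"cue {number}: duration {duration} ms is out of range")
        lines = text.splitlines()
        if len(lines) > MAX_LINES:
            warnings.append(f"cue {number}: {len(lines)} lines exceeds {MAX_LINES}")
        for line in lines:
            if len(line) > MAX_CHARS_PER_LINE:
                warnings.append(f"cue {number}: line has {len(line)} characters")
        if previous_end is not None and start_ms - previous_end < MIN_GAP_MS:
            warnings.append(f"cue {number}: gap of {start_ms - previous_end} ms is too short")
        previous_end = end_ms
    return warnings


cues = [
    (5000, 8000, "This is the first subtitle line."),
    (8200, 15000, "x" * 45),  # too long on screen, too wide, too close
]
problems = check_cues(cues)
```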

Content formatting:

Break lines at natural speech pauses or grammatical boundaries
Maintain consistent speaker identification throughout files
Include speaker labels for off-screen voices or multiple speakers

File management:

Use descriptive naming like video-title_english_v1.srt
Keep original and edited versions in separate folders
Maintain version control for multiple edits or translations

These practices ensure your subtitles enhance rather than distract from the viewing experience.

Final words

Subtitle formats serve different needs in the modern video ecosystem. SRT provides universal compatibility when you need captions that work everywhere. WebVTT offers advanced features for web-based video with styling and positioning capabilities. TXT transcripts handle non-synchronized needs from SEO optimization to accessibility compliance.

Modern Voice AI platforms simplify the entire workflow from audio input to formatted subtitle output. AssemblyAI provides direct export endpoints for basic SRT and VTT subtitle files, but these exports do not include speaker labels. To create speaker-labeled subtitle files, developers must use the transcript JSON response—specifically the `utterances` array—and generate the subtitle file manually. The subtitle export endpoint supports limited formatting controls, such as `chars_per_caption`, which can indirectly affect subtitle segmentation. It does not offer direct controls for subtitle duration or gap timing.
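
As an illustration of the export endpoint, the sketch below builds a request for an SRT or VTT file with the chars_per_caption parameter. The base URL, path layout, and authorization header shown here are assumptions rather than details taken from this guide, so confirm them against the current API reference; the function names are hypothetical.

```python
from typing import Optional
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_BASE = "https://api.assemblyai.com/v2"  # assumed base URL


def build_subtitle_request(
    transcript_id: str,
    subtitle_format: str,
    api_key: str,
    chars_per_caption: Optional[int] = None,
) -> Request:
    """Build (but do not send) a request for an SRT or VTT export."""
    if subtitle_format not in ("srt", "vtt"):
        raise ValueError("subtitle_format must be 'srt' or 'vtt'")
    # Assumed path layout: /transcript/<id>/<format>
    url = f"{API_BASE}/transcript/{transcript_id}/{subtitle_format}"
    if chars_per_caption is not None:
        url += "?" + urlencode({"chars_per_caption": chars_per_caption})
    return Request(url, headers={"authorization": api_key})


def download_subtitles(request: Request) -> str:
    """Send the request and return the subtitle file as text."""
    with urlopen(request) as response:
        return response.read().decode("utf-8")


# Placeholder values; nothing is sent until download_subtitles is called
request = build_subtitle_request("abc123", "srt", "YOUR_API_KEY", chars_per_caption=32)
```

Since chars_per_caption only caps caption length, any duration or gap rules still have to be enforced in your own post-processing.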

Frequently asked questions

Should I use SRT or WebVTT for YouTube videos?

Use WebVTT for YouTube because it supports YouTube's advanced features like auto-timing and styling options. While YouTube accepts SRT files, WebVTT gives you better integration with YouTube's caption editor and allows for enhanced formatting that improves viewer experience.

Can I edit subtitle files in a regular text editor?

Yes, both SRT and WebVTT files are plain text formats you can edit in Notepad, TextEdit, or any text editor. Just make sure to save the file with the correct encoding (UTF-8) and file extension (.srt or .vtt) to ensure compatibility with video players.

Do subtitle files work across different video platforms?

SRT files work on virtually all video platforms and players, making them the safest choice for cross-platform use. WebVTT files work on most modern web platforms but may not be compatible with older software or non-web video players. TXT isn't a subtitle format, but TXT files can be imported into most caption editing software.

What happens if my subtitle timing is off?

Incorrect timing makes captions appear too early or too late relative to speech, creating a confusing viewing experience. You can fix timing issues by editing the timestamp values in the subtitle file or using video editing software with subtitle timing adjustment features. Most platforms also offer built-in caption editors for making timing corrections.
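
When every caption is off by the same amount, editing the timestamp values can be scripted. The following is a minimal sketch that shifts all timestamps in an SRT or WebVTT file by a fixed offset; it assumes timestamps include the hours field, clamps negative results to zero, and uses a made-up function name.

```python
import re

# HH:MM:SS,mmm (SRT) or HH:MM:SS.mmm (WebVTT)
TIMESTAMP = re.compile(r"(\d{2,}):(\d{2}):(\d{2})([.,])(\d{3})")


def shift_timestamps(caption_text: str, offset_ms: int) -> str:
    """Shift every cue by offset_ms (negative values make captions earlier)."""

    def shift(match):
        hours, minutes, seconds, separator, millis = match.groups()
        total = (
            (int(hours) * 60 + int(minutes)) * 60 + int(seconds)
        ) * 1000 + int(millis) + offset_ms
        total = max(total, 0)  # never produce a negative timestamp
        h, rem = divmod(total, 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1_000)
        return f"{h:02d}:{m:02d}:{s:02d}{separator}{ms:03d}"

    # Only touch timing lines so caption text containing times is left alone
    return "\n".join(
        TIMESTAMP.sub(shift, line) if "-->" in line else line
        for line in caption_text.split("\n")
    )


late_captions = "1\n00:00:05,000 --> 00:00:08,000\nThis is the first subtitle line.\n"
fixed = shift_timestamps(late_captions, 1500)
```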