Open Benchmark · Fully Auditable

How Accurately Do Commercial
Streaming Text-To-Speech Models
Pronounce Non-Standard Text?

Async Flash v1.0

Pronunciation accuracy across 2,200+ non-standard words

Why This Benchmark Matters

TTS models must convert written text like “03/15/2024” or “$4.99” into spoken words — “March fifteenth” or “four dollars and ninety-nine cents.” This process, called text normalization, is one of the most common failure modes in production TTS. In batch or REST-based pipelines, an LLM preprocessing layer can rewrite text before synthesis, largely sidestepping the problem. But in real-time streaming scenarios, where audio must begin playing within milliseconds of input, there is no room for a normalization pass — the TTS model must handle it natively. This benchmark targets exactly that streaming use case.
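For concreteness, here is a small sketch of the kind of raw/spoken pairs the test set contains. The category names and the specific renderings below are illustrative examples, not the benchmark's actual rulebook:

```python
# Illustrative raw/spoken pairs; renderings shown are one acceptable
# reading each, not an exhaustive or authoritative rubric.
NORMALIZATION_EXAMPLES = [
    # (category, raw text, one acceptable spoken rendering)
    ("date",     "03/15/2024",   "March fifteenth, twenty twenty-four"),
    ("currency", "$4.99",        "four dollars and ninety-nine cents"),
    ("phone",    "212.555.7890", "two one two, five five five, seven eight nine zero"),
    ("url",      "example.com",  "example dot com"),
]

def categories(examples):
    """Return the distinct non-standard-word categories covered."""
    return sorted({cat for cat, _, _ in examples})
```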

📝

Real-World Test Set

1,000+ sentences containing 2,200+ non-standard words across 31 categories — dates, currencies, phone numbers, URLs, and more.

🎯

WebSocket Streaming Endpoints

Every audio sample is generated through each provider's WebSocket streaming API — the same interface used in production conversational applications. No REST fallback, no preprocessing.

⚖️

Automated, Reproducible Judging

Gemini 3.1 Pro serves as an LLM judge — listening to every audio sample and scoring each unit against a category-specific evaluation rulebook.

🔍

Fully Auditable

Every audio sample, transcription, and per-unit judgment is available for inspection. All raw data is downloadable for independent analysis.

Overall Comparison

Side-by-side accuracy across all four models. Sentence-level accuracy requires every normalization unit within a sentence to be correct — a strict, end-to-end measure. Unit-level accuracy scores each non-standard word independently, providing a more granular view of model performance.

Sentence-Level Accuracy

Unit-Level Accuracy

Methodology

Transparent, reproducible evaluation powered by an LLM-as-a-judge pipeline.

Audio Generation

All audio is synthesized through each provider's WebSocket streaming endpoint — the same low-latency interface used in production voice agents and conversational applications. Text is sent directly to the model with no preprocessing, normalization rewriting, or LLM-based cleanup. This ensures the benchmark measures each model's native text normalization capability under real streaming conditions.
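A minimal sketch of what such a streaming client looks like, using the third-party `websockets` library. The endpoint URL and the JSON message schema here are hypothetical placeholders — each provider defines its own protocol — but the key property matches the benchmark setup: raw, unnormalized text goes straight over the socket.

```python
import asyncio
import json

def make_synthesis_request(text: str, voice: str = "default") -> str:
    """Build a JSON synthesis request.

    The message schema here is hypothetical; real providers each define
    their own. The point is that `text` is sent verbatim, with no
    normalization pass."""
    return json.dumps({"type": "synthesize", "text": text, "voice": voice})

async def stream_tts(url: str, text: str) -> bytes:
    """Send raw text over a WebSocket and collect streamed audio chunks.

    `url` is a placeholder for a provider's streaming endpoint."""
    import websockets  # third-party: pip install websockets
    audio = bytearray()
    async with websockets.connect(url) as ws:
        await ws.send(make_synthesis_request(text))
        async for message in ws:
            if isinstance(message, bytes):
                audio.extend(message)  # binary frames carry audio
            elif json.loads(message).get("type") == "done":
                break
    return bytes(audio)
```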

Automated Judging

Gemini 3.1 Pro serves as an automated judge: it listens to each audio sample, produces a verbatim transcription, and evaluates every normalization unit against a detailed category-specific rulebook defining acceptable and unacceptable spoken renderings. To validate reliability, automated scores were compared against expert linguist review on a stratified sample of ~300 sentence–model pairs across all 4 providers, confirming 97.4% agreement between the LLM judge and human evaluation.
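A sketch of how a judging prompt might be assembled from a category-specific rulebook. The rulebook structure and prompt wording below are illustrative assumptions, not the benchmark's actual rubric, and the audio attachment and model call are omitted:

```python
def build_judge_prompt(category: str, raw_unit: str, rulebook: dict) -> str:
    """Assemble an LLM-judge prompt for one normalization unit.

    `rulebook` maps category names to rule text; both the mapping and the
    prompt wording are hypothetical. The audio sample itself would be
    attached to the model call separately."""
    rules = rulebook.get(category, "No special rules; judge for naturalness.")
    return (
        f"You are grading a TTS rendering of the non-standard token "
        f"'{raw_unit}' (category: {category}).\n"
        f"Rules: {rules}\n"
        "Transcribe the audio verbatim, then answer PASS or FAIL "
        "for this unit."
    )
```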

Scoring

Accuracy = correctly normalized units / total units. We report two complementary metrics: sentence-level accuracy, which marks a sentence correct only when every unit passes — a strict end-to-end measure — and unit-level accuracy, which scores each non-standard word independently for a granular view of per-category performance.
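The two metrics can be sketched directly from their definitions. Each sentence is represented as a list of per-unit pass/fail booleans (the representation is an assumption for illustration):

```python
def unit_level_accuracy(sentences):
    """sentences: list of lists of booleans, one bool per normalization unit.
    Every unit counts independently."""
    units = [u for s in sentences for u in s]
    return sum(units) / len(units)

def sentence_level_accuracy(sentences):
    """A sentence counts as correct only if every unit in it passed --
    the strict, end-to-end measure."""
    return sum(all(s) for s in sentences) / len(sentences)

results = [[True, True], [True, False], [True]]
# 4 of 5 units pass; 2 of 3 sentences have all units passing
```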

Expert Linguist Validation

How closely does the automated judge match human judgment? We compared Gemini’s scores against expert linguist review on a stratified sample spanning every normalization category and all 4 providers.

97.4%
Human–LLM Agreement

Across ~300 sentence–model pairs, expert linguist review overturned Gemini’s score on just ~2.6% of cases — only 3 unique sentences out of the entire sample.

Where Expert Linguist and LLM Disagree

In all 3 cases, both readings are acceptable — the LLM judge applied the rulebook strictly, while the human reviewer accepted the reading that sounds more natural in everyday speech.

date_time_mixed

“Release notes: 03/04/2025 (read as US), EU vacation d/m is ambiguous.”

LLM: strict Human: lenient

TTS said “d m” instead of “d slash m.” Gemini flagged the missing separator. Reviewer accepted — context makes the format obvious, and people naturally drop separators when speaking.

phone

“He gave me his number as 212.555.7890, which I saved to my phone…”

LLM: strict Human: lenient

TTS said “two hundred twelve” for area code 212. Gemini flagged digit grouping as unnatural for phone numbers. Reviewer accepted — grouping area codes is common in everyday speech.

scientific

“I scored 10e3 points in the game, while my opponent only managed 1.2e-3…”

LLM: strict Human: lenient

TTS said “ten e three” for 10e3. Gemini flagged it as reading notation literally. Reviewer accepted — “ten e three” is standard shorthand in technical speech.

Accuracy by Category

Unit-level accuracy broken down by non-standard word type. Categories with fewer than 30 evaluation units are excluded to ensure statistical reliability. Click any column header to sort.
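The per-category breakdown and the 30-unit exclusion threshold can be sketched as a simple aggregation. The `(category, passed)` input format is an assumption for illustration:

```python
from collections import defaultdict

MIN_UNITS = 30  # categories below this threshold are excluded

def category_accuracy(unit_results, min_units=MIN_UNITS):
    """unit_results: list of (category, passed) pairs for one model.
    Returns {category: accuracy}, dropping categories with too few units
    to be statistically reliable."""
    tally = defaultdict(lambda: [0, 0])  # category -> [passed, total]
    for category, passed in unit_results:
        tally[category][0] += int(passed)
        tally[category][1] += 1
    return {c: p / t for c, (p, t) in tally.items() if t >= min_units}
```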

Audio Explorer

Browse the full evaluation dataset sentence by sentence. Expand any row to listen to each model's synthesized audio, read its transcription, and inspect per-unit pass/fail judgments — making every score fully auditable.

Open Data

Every data point behind this benchmark is publicly available. Download the raw evaluation data for independent analysis, verification, or further research.