Open Benchmark · Fully Auditable
Pronunciation accuracy across 2,200+ non-standard words
TTS models must convert written text like “03/15/2024” or “$4.99” into spoken words — “March fifteenth” or “four dollars and ninety-nine cents.” This process, called text normalization, is one of the most common failure modes in production TTS. In batch or REST-based pipelines, an LLM preprocessing layer can rewrite text before synthesis, largely sidestepping the problem. But in real-time streaming scenarios, where audio must begin playing within milliseconds of input, there is no room for a normalization pass — the TTS model must handle it natively. This benchmark specifically targets that streaming use case.
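To make the task concrete, here is a minimal rule-based sketch covering just the two patterns above (US dates and small dollar amounts). It is purely illustrative — real normalizers handle dozens of categories, locales, and edge cases, and the models in this benchmark must do it inside the synthesis stream:

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
IRREGULAR_ORDINALS = {"one": "first", "two": "second", "three": "third",
                      "five": "fifth", "eight": "eighth", "nine": "ninth",
                      "twelve": "twelfth"}

def words(n: int) -> str:
    """Spell out an integer from 0 to 99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")

def ordinal(n: int) -> str:
    """15 -> 'fifteenth', 21 -> 'twenty-first', 30 -> 'thirtieth'."""
    w = words(n)
    head, _, last = w.rpartition("-")
    if last in IRREGULAR_ORDINALS:
        last = IRREGULAR_ORDINALS[last]
    elif last.endswith("y"):
        last = last[:-1] + "ieth"
    else:
        last += "th"
    return (head + "-" if head else "") + last

def normalize(text: str) -> str:
    # US-style MM/DD/YYYY date -> "March fifteenth" (year dropped for brevity)
    text = re.sub(r"\b(\d{1,2})/(\d{1,2})/\d{4}\b",
                  lambda m: f"{MONTHS[int(m.group(1)) - 1]} "
                            f"{ordinal(int(m.group(2)))}",
                  text)
    # $D.CC currency (dollar amounts under 100 only, for the sketch)
    text = re.sub(r"\$(\d{1,2})\.(\d{2})",
                  lambda m: f"{words(int(m.group(1)))} dollars and "
                            f"{words(int(m.group(2)))} cents",
                  text)
    return text
```

Even this toy version shows why the problem is hard: correctness depends on category, locale, and context, not on any single rewrite rule.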
1,000+ sentences containing 2,200+ non-standard words across 31 categories — dates, currencies, phone numbers, URLs, and more.
Every audio sample is generated through each provider's WebSocket streaming API — the same interface used in production conversational applications. No REST fallback, no preprocessing.
Gemini 3.1 Pro serves as an LLM judge — listening to every audio sample and scoring each unit against a category-specific evaluation rulebook.
Every audio sample, transcription, and per-unit judgment is available for inspection. All raw data is downloadable for independent analysis.
Side-by-side accuracy across all four models. Sentence-level accuracy requires every normalization unit within a sentence to be correct — a strict, end-to-end measure. Unit-level accuracy scores each non-standard word independently, providing a more granular view of model performance.
Transparent, reproducible evaluation powered by an LLM-as-a-judge pipeline.
All audio is synthesized through each provider's WebSocket streaming endpoint — the same low-latency interface used in production voice agents and conversational applications. Text is sent directly to the model with no preprocessing, normalization rewriting, or LLM-based cleanup. This ensures the benchmark measures each model's native text normalization capability under real streaming conditions.
Gemini 3.1 Pro serves as an automated judge: it listens to each audio sample, produces a verbatim transcription, and evaluates every normalization unit against a detailed category-specific rulebook defining acceptable and unacceptable spoken renderings. To validate reliability, automated scores were compared against expert linguist review on a stratified sample of ~300 sentence–model pairs across all four providers, confirming ~97% agreement between the LLM judge and human evaluation.
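The judging itself is done by the LLM, but the per-unit scoring structure can be sketched as follows. The rulebook entries and names below are hypothetical stand-ins — the real rulebooks are far more detailed, and the pass/fail decision comes from the judge model rather than string matching:

```python
from dataclasses import dataclass

# Hypothetical rulebook: for each (category, written form) pair, a set of
# acceptable spoken renderings. Illustrative entries only.
RULEBOOK = {
    ("date", "03/15/2024"): {"march fifteenth twenty twenty-four",
                             "march fifteenth two thousand twenty-four"},
    ("currency", "$4.99"): {"four dollars and ninety-nine cents",
                            "four ninety-nine"},
}

@dataclass
class UnitJudgment:
    category: str
    written: str       # the non-standard word as written
    rendering: str     # the acceptable rendering found, if any
    passed: bool

def judge_units(transcription: str,
                units: list[tuple[str, str]]) -> list[UnitJudgment]:
    """Score each normalization unit: does the verbatim transcription
    contain an acceptable rendering from the category rulebook?"""
    text = transcription.lower()
    results = []
    for category, written in units:
        acceptable = RULEBOOK.get((category, written), set())
        match = next((r for r in acceptable if r in text), "")
        results.append(UnitJudgment(category, written, match, bool(match)))
    return results
```

A sentence's per-unit judgments are then aggregated into the two accuracy metrics reported on this page.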
We report two complementary metrics. Unit-level accuracy = correctly normalized units / total units: each non-standard word is scored independently, giving a granular view of per-category performance. Sentence-level accuracy = fully correct sentences / total sentences: a sentence counts only when every unit within it passes — a strict, end-to-end measure.
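The two metrics can be computed directly from the per-unit judgments. A tiny sketch with made-up data:

```python
# Illustrative per-unit judgments: sentence id -> one pass/fail per
# normalization unit in that sentence (made-up data for the sketch).
judgments = {
    "s1": [True, True],         # every unit correct -> sentence passes
    "s2": [True, False, True],  # one unit missed -> sentence fails
    "s3": [False],
}

# Unit-level: each non-standard word scored independently.
units = [u for passes in judgments.values() for u in passes]
unit_accuracy = sum(units) / len(units)            # 4 / 6

# Sentence-level: a sentence counts only if ALL its units pass.
sentence_accuracy = (sum(all(p) for p in judgments.values())
                     / len(judgments))             # 1 / 3
```

Note that sentence-level accuracy is always at most unit-level accuracy, which is why it is the stricter of the two.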
How closely does the automated judge match human judgment? We compared Gemini’s scores against expert linguist review on a stratified sample spanning every normalization category and all four providers.
Across ~300 sentence–model pairs, expert linguist review overturned Gemini’s score on just ~2.5% of cases — only 3 unique sentences out of the entire sample.
In all 3 cases, both readings are correct — the reviewer simply picked the one that sounds more natural.
“Release notes: 03/04/2025 (read as US), EU vacation d/m is ambiguous.”
TTS said “d m” instead of “d slash m.” Gemini flagged the missing separator. Reviewer accepted — context makes the format obvious, and people naturally drop separators when speaking.
“He gave me his number as 212.555.7890, which I saved to my phone…”
TTS said “two hundred twelve” for area code 212. Gemini flagged digit grouping as unnatural for phone numbers. Reviewer accepted — grouping area codes is common in everyday speech.
“I scored 10e3 points in the game, while my opponent only managed 1.2e-3…”
TTS said “ten e three” for 10e3. Gemini flagged it as reading notation literally. Reviewer accepted — “ten e three” is standard shorthand in technical speech.
Unit-level accuracy broken down by non-standard word type. Categories with fewer than 30 evaluation units are excluded to ensure statistical reliability. Click any column header to sort.
Browse the full evaluation dataset sentence by sentence. Expand any row to listen to each model's synthesized audio, read its transcription, and inspect per-unit pass/fail judgments — making every score fully auditable.
Every data point behind this benchmark is publicly available. Download the raw evaluation data for independent analysis, verification, or further research.