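Japanese (translated): “Attention, everyone. The hearing in case no. CV-9X77B, scheduled for Courtroom 2A of the Tokyo District Court, has been postponed to next Thursday at 10 a.m.”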
Cartesia Sonic-3 on VoiceArena
A 119-Elo spread on a single model. Six languages, nearly 13,000 battles, and a defect pattern that points at three concrete fixes.
How Sonic-3 ranks across six languages
Same model, six markets. The gap between top and bottom is bigger than the gap between most adjacent competitors.
| Language | Rank | Elo | 95% CI | Win rate | Battles |
|---|---|---|---|---|---|
| 🇮🇳 Hindi | 4 / 8 | 1012.1 | ±9 | 52.1% | 2,475 |
| 🇧🇷 Brazilian Portuguese | 3 / 7 | 1011.7 | ±11 | 51.9% | 2,201 |
| 🇻🇳 Vietnamese | 3 / 7 | 991.1 | ±12 | 48.7% | 1,487 |
| 🇸🇦 Arabic MSA | 6 / 7 | 931.6 | ±8 | 39.5% | 2,687 |
| 🇯🇵 Japanese | 4 / 6 | 925.2 | ±12 | 38.7% | 2,225 |
| 🇺🇸 US English | 7 / 7 | 892.5 | ±10 | 32.9% | 1,784 |
Hindi and Brazilian Portuguese sit comfortably mid-pack (4th of 8 and 3rd of 7); US English finishes dead last among seven models. The 119-Elo spread between the best and worst language is variance within one model across markets, not a gap between two models.
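As a quick arithmetic check on the headline numbers, the sketch below recomputes the spread and the total battle count from the table and converts the spread into an expected head-to-head score. It assumes the conventional Elo logistic curve (base 10, 400-point scale); the brief does not state which scale VoiceArena uses, so `expected_score` is an illustrative helper, not the dashboard's code.

```python
# Sonic-3's per-language Elo and battle counts, copied from the table above.
ELO = {
    "Hindi": 1012.1,
    "Brazilian Portuguese": 1011.7,
    "Vietnamese": 991.1,
    "Arabic MSA": 931.6,
    "Japanese": 925.2,
    "US English": 892.5,
}
BATTLES = {
    "Hindi": 2475,
    "Brazilian Portuguese": 2201,
    "Vietnamese": 1487,
    "Arabic MSA": 2687,
    "Japanese": 2225,
    "US English": 1784,
}


def expected_score(elo_a, elo_b):
    """Expected win probability of A over B on the assumed base-10, 400-point Elo curve."""
    return 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / 400))


best, worst = max(ELO.values()), min(ELO.values())
spread = best - worst                  # best language minus worst language
total_battles = sum(BATTLES.values())  # battles summed over all six languages
p_best_vs_worst = expected_score(best, worst)

print(f"spread = {spread:.1f} Elo, battles = {total_battles}, "
      f"expected score = {p_best_vs_worst:.3f}")
```

This gives a spread of 119.6 Elo over 12,859 battles; under the assumed curve, a gap that size corresponds to an expected score of about 0.67 for the higher-rated side.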
Where in each language does Sonic-3 actually rank?
Four content buckets (customer support, media, general conversation, and content & education), broken down per language. Each cell reads rank · Elo · win rate.
| Language | Customer support | Media | General conversation | Content & education |
|---|---|---|---|---|
| 🇺🇸 US English | 7/7 · Elo 909 · 35.4% | 7/7 · Elo 873 · 30.1% | 7/7 · Elo 903 · 34.5% | 7/7 · Elo 873 · 29.7% |
| 🇮🇳 Hindi | 4/8 · Elo 1001 · 50.1% | 3/8 · Elo 1012 · 52.1% | 4/8 · Elo 1023 · 53.9% | 4/8 · Elo 1021 · 53.6% |
| 🇯🇵 Japanese | 4/6 · Elo 923 · 38.5% | 4/6 · Elo 906 · 36.5% | 4/6 · Elo 940 · 40.4% | 5/6 · Elo 931 · 40.2% |
| 🇧🇷 Brazilian Portuguese | 3/7 · Elo 1012 · 52.1% | 3/7 · Elo 1018 · 52.7% | 4/7 · Elo 981 · 47.0% | 3/7 · Elo 1037 · 55.7% |
| 🇻🇳 Vietnamese | 3/7 · Elo 989 · 48.4% | 4/7 · Elo 982 · 46.6% | 4/7 · Elo 1004 · 51.1% | 4/7 · Elo 994 · 49.0% |
| 🇸🇦 Arabic MSA | 4/7 · Elo 960 · 44.0% | 7/7 · Elo 911 · 36.8% | 6/7 · Elo 938 · 40.3% | 7/7 · Elo 894 · 33.7% |
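The category gap inside a language is small: Sonic-3 is either good in a language or it isn't. In five of six languages its rank moves by at most one place across the four buckets, and within-language Elo ranges (22 to 66 points) are small next to the 119-Elo cross-language spread. Arabic is the one exception: customer support ranks #4 (Elo 960), while media and content & education both fall to #7 (Elo 911 and 894).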
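The rank-range claim can be checked directly from the category table. This short script recomputes, for each language, how far Sonic-3's rank and Elo move across the four buckets; it uses only the published cell values, and `value_range` is an illustrative helper.

```python
# (rank, Elo) per bucket, copied from the category table, in the order:
# customer support, media, general conversation, content & education.
CATEGORY = {
    "US English": [(7, 909), (7, 873), (7, 903), (7, 873)],
    "Hindi": [(4, 1001), (3, 1012), (4, 1023), (4, 1021)],
    "Japanese": [(4, 923), (4, 906), (4, 940), (5, 931)],
    "Brazilian Portuguese": [(3, 1012), (3, 1018), (4, 981), (3, 1037)],
    "Vietnamese": [(3, 989), (4, 982), (4, 1004), (4, 994)],
    "Arabic MSA": [(4, 960), (7, 911), (6, 938), (7, 894)],
}


def value_range(values):
    """Max minus min of an iterable of numbers."""
    values = list(values)
    return max(values) - min(values)


for lang, cells in CATEGORY.items():
    rank_range = value_range(rank for rank, _ in cells)
    elo_range = value_range(elo for _, elo in cells)
    print(f"{lang:<22} rank range: {rank_range}   Elo range: {elo_range}")
```

Every language except Arabic MSA shows a rank range of at most 1, and Elo ranges run from 22 (Hindi, Vietnamese) to 66 (Arabic MSA).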
When Sonic-3 loses, what do raters complain about?
Share of defect tags placed on Sonic-3 by raters who preferred the opposing model.
Overall, across all languages
| Defect | Count | Share |
|---|---|---|
| Irregular pacing | 2,790 | 24.3% |
| Mild mispronunciation | 2,754 | 24.0% |
| Unnatural / robotic voice | 2,041 | 17.8% |
| Severe mispronunciation | 1,624 | 14.2% |
| Noise / distortion | 836 | 7.3% |
| Hallucination / extra content | 775 | 6.8% |
| Missing word | 657 | 5.7% |
Irregular pacing, mispronunciation (mild + severe), and an unnatural / robotic voice together account for ~80% of complaints. The mix differs sharply by language: US English complaints are overwhelmingly about pacing and robotic timbre, while Arabic complaints are dominated by mispronunciation.
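The ~80% figure can be reproduced from the counts table. The sketch below normalizes the tag counts and sums the four tags behind the three complaint families (pacing, mild + severe mispronunciation, robotic voice); `CORE_TAGS` is simply that grouping written out.

```python
# Tag counts placed on Sonic-3 when it lost, copied from the table above.
DEFECTS = {
    "Irregular pacing": 2790,
    "Mild mispronunciation": 2754,
    "Unnatural / robotic voice": 2041,
    "Severe mispronunciation": 1624,
    "Noise / distortion": 836,
    "Hallucination / extra content": 775,
    "Missing word": 657,
}

total = sum(DEFECTS.values())
shares = {tag: count / total for tag, count in DEFECTS.items()}

# Pacing, mispronunciation (mild + severe) and robotic voice: four tags, three complaint families.
CORE_TAGS = ["Irregular pacing", "Mild mispronunciation",
             "Severe mispronunciation", "Unnatural / robotic voice"]
core_share = sum(shares[tag] for tag in CORE_TAGS)

for tag, share in shares.items():
    print(f"{tag:<30} {share:6.1%}")
print(f"core complaints: {core_share:.1%} of {total} tags")
```

The four tags sum to 9,209 of 11,477 tags, about 80.2%.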
Sonic-3 vs every other model, per language
Win rate with Wilson 95% CIs. Green = Sonic-3 wins, red = loses, grey = within tie band.
The same pair of models can flip outcome by language. Sonic-3 beats Gemini 3.1 Flash in Hindi by more than 30 percentage points of win rate, then loses to it in US English by a similar margin. The competitive bar Sonic-3 has to clear is language-specific.
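For reference, the Wilson score interval used for these head-to-head win rates can be computed as below. The brief does not define its tie band, so `classify_matchup` applies a hypothetical rule (a matchup counts as a tie whenever the 95% interval contains 50%); treat it as an illustration, not the chart's actual colouring logic.

```python
import math


def wilson_interval(wins, n, z=1.96):
    """Wilson score interval for a win rate; z = 1.96 gives roughly 95% coverage."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = wins / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return centre - half, centre + half


def classify_matchup(wins, n):
    """Hypothetical tie rule: a tie whenever the Wilson interval contains 50%."""
    lo, hi = wilson_interval(wins, n)
    if lo > 0.5:
        return "win"
    if hi < 0.5:
        return "lose"
    return "tie"


print(wilson_interval(50, 100))
```

For 50 wins in 100 battles the interval is about 40.4% to 59.6%. Unlike the plain normal approximation, the Wilson interval stays inside [0, 1] and behaves sensibly for lopsided matchups with few battles.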
Where Sonic-3 wins
Five strong examples per content category, with side-by-side audio against the opponent it beat most decisively.
Brazilian Portuguese (translated): “Um, ah, let me check that for you… okay, it looks like the courier is turning right now near the Pão de Açúcar on Rua Oscar Freire, and, um, should be at your door in about 12 minutes.”
Japanese (translated): “Hmm, um, let me just check that for you… yes, the driver has just turned the corner at the Sun Road shopping arcade in Kichijoji, and, uh, your package should, well, reach you in about 7 minutes.”
Hindi (translated): “Please note: the hearing in case no. CV-9X77B, scheduled in courtroom CR-2A of the Bombay High Court, has been adjourned on the judge's direction and will now take place on 14 August. The revised hearing time is 11:30 a.m., and all proceedings will follow the updated court calendar. All parties, advocates, and litigants concerned are requested to note this change and make the necessary arrangements. For further clarification, please contact the court clerk's office. We regret the inconvenience and thank you for your cooperation.”
Arabic (translated): “Um, I apologize, but, uh, we have no record of payment of the January electricity bill on account number CG-554-3320.”
The wins skew toward short, conversational, single-script sentences — the natural territory of Sonic-3's prosody.
Where Sonic-3 loses
Same shape as the wins, with the defect tags raters flagged on Sonic-3 and their verbatim comments.
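Japanese (translated): “Yes, um, your service booking number HVAC-TKY-4351 has been confirmed.”
- It read another sentence aloud, again and again.
- Echo. It should pause after each hyphen.
- There is noise. The voice and intonation sound mechanical. The breath-pause timing when reading the alphanumerics is wrong.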
Arabic (translated): “Subscribers in grid zone 7-East will experience a power outage from 11 a.m. to 2 p.m. on 25 March 2026 due to emergency maintenance on the substation. Residents are advised to postpone using energy-intensive appliances and to save any digital work before the outage begins.”
- المشتركون ("subscribers"): wrong accent; صباحا ("a.m."): mispronounced; مارس ("March"): wrong accent.
- Errors.
Arabic (translated): “Hello, uh, welcome to our support desk. I'm, um, transferring your call to our specialist Ahmed Al-Mansouri, who handles refinancing requests, so that he can serve you better.”
- Inaccurate pronunciation of the word طلبات ("requests"), and of both أه ("uh") and أم ("um").
- The tanween (nunation) in أهلاً ("welcome") is pronounced incorrectly, and the expressions sound non-human.
- Lower quality.
Japanese (translated): “Business owners holding business license number BIZ-TKY-2026-W4-882: please complete your permit renewal at counter 5-B of the ward office by Friday.”
- The letters and numbers are read incorrectly.
- There is background noise. The pacing is unnatural when it reads the number.
- The number BIZ-TKY-2026-W4-882 was mispronounced. The rest is generally OK.
- BIZ-TKY-2026-W4-882 is spoken with irregular pacing and garbled. At 0:15, "te" is dropped.
- It rushes through the hyphen-separated parts in one go.
Vietnamese (translated): “Um, I'm really sorry, ma'am, but, uh, I can't find any record of an electricity bill payment for account number CG-554-3320 in January.”
- Audio B has noise at the end.
- CG-554-3320 should be read one letter or digit at a time. The pacing is poor and needs improvement. There is a strange voice or sound at the end.
- Noise at the end of the audio (seconds 13 to 19). The digit string is read in a choppy, broken-up way.
- Loud noise at the end of the audio, from second 12 to the end. The pauses between the characters of "CG-554-3320" are also unnatural; the voice is monotone, the intonation flat, and the delivery mechanical.
Recommendations
Five concrete moves the data points to, in priority order (preview, not final).
- 01
Fix pacing first
Irregular pacing is the single largest defect tag (24% overall, 45% in US English). Tightening micro-pause behavior — especially after hyphens, alphanumerics, and clause boundaries — would lift the floor across every language.
- 02
Treat US English as a regression target
US English is dead last among seven models and sets the low end of the 119-Elo spread. Allocate dedicated voice-data and prosody work to close the gap with Gemini and ElevenLabs.
- 03
Invest in Arabic pronunciation
Arabic complaints are 57% mispronunciation (mild + severe). That points to a lexicon problem more than a prosody one: prioritize Arabic G2P, MSA edge cases, and code-mixed Latin spans inside Arabic prompts.
- 04
Audit alphanumeric handling per locale
Japanese win rate drops ~5 percentage points when a prompt contains inline alphanumerics; Arabic win rate rises ~9 points. The handling is clearly locale-conditional and worth a directed eval.
- 05
Stress-test long inputs
Win rate slips mildly but consistently as prompts get longer. Add long-sentence battles (>200 characters) to the routine eval so regressions there don't hide behind aggregate scores.
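Recommendations 04 and 05 both call for slicing win rate by a property of the prompt. A minimal sketch, assuming battles are available as `(locale, prompt_text, sonic_won)` tuples: that schema, the helper names, and the regex standing in for "inline alphanumerics" are all hypothetical, since the brief does not say how either slice was defined. Only the 200-character edge comes from the text above.

```python
import re
from bisect import bisect_left
from collections import defaultdict

# Hypothetical definition of "inline alphanumerics": hyphenated uppercase/digit codes such as
# CG-554-3320. Lookarounds replace \b because CJK characters count as word characters in
# Python's Unicode regexes, so "番号BIZ-..." has no \b before the "B".
CODE_RE = re.compile(r"(?<![A-Za-z0-9])[A-Z0-9]+(?:-[A-Z0-9]+)+(?![A-Za-z0-9])")


def has_code(text):
    """Slice key: does the prompt contain a hyphenated alphanumeric code?"""
    return "with codes" if CODE_RE.search(text) else "no codes"


def length_bucket(text, edges=(100, 200)):
    """Slice key: prompt length bucket. len() counts Unicode code points,
    so e.g. Devanagari vowel signs count as separate characters."""
    i = bisect_left(edges, len(text))
    if i == 0:
        return f"<={edges[0]}"
    if i == len(edges):
        return f">{edges[-1]}"
    return f"{edges[i - 1] + 1}-{edges[i]}"


def win_rate_by_slice(battles, key):
    """Return {(locale, slice): (win_rate, n)} for any slice key function."""
    wins = defaultdict(int)
    counts = defaultdict(int)
    for locale, text, won in battles:
        k = (locale, key(text))
        wins[k] += int(won)
        counts[k] += 1
    return {k: (wins[k] / counts[k], counts[k]) for k in counts}
```

Passing `has_code` gives the per-locale alphanumeric comparison behind recommendation 04, and `length_bucket` gives the long-input slice behind recommendation 05; the same battles can be fed through both.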
How this brief was built
Re-derived from raw pairwise votes at report-build time, then cross-checked against the live VoiceArena leaderboard.
All figures come from VoiceArena v2 data captured in May 2026. Elo scores were recomputed from raw pairwise battles using the same algorithm the live dashboard uses, with a Bayesian bootstrap for the 95% confidence intervals. Recomputed Sonic-3 Elos matched the leaderboard exactly (max |Δ| = 0.00 across all six languages).
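The brief says only that it reuses the live dashboard's rating algorithm, which it does not name. As an illustration under that gap, the sketch below swaps in a Bradley-Terry fit (Hunter's MM updates), a common batch alternative to sequential Elo, and wraps it in a Bayesian bootstrap: each replicate draws Dirichlet(1, ..., 1) weights over the battles, refits, and the 2.5th and 97.5th percentiles of the refitted ratings form the interval. The mean-1000 anchoring, the handling of ties (ignored), and the function names are assumptions, so its numbers need not match the leaderboard's.

```python
import math
import random


def bt_elo(battles, models, weights=None, iters=300):
    """Fit Bradley-Terry strengths with MM updates; return an Elo-like scale with mean 1000.

    battles: list of (winner, loser) model names. Ties are ignored in this sketch, and every
    model is assumed to have at least one win.
    """
    if weights is None:
        weights = [1.0] * len(battles)
    idx = {m: i for i, m in enumerate(models)}
    k = len(models)
    wins = [0.0] * k
    pair = [[0.0] * k for _ in range(k)]  # weighted battle count for each model pair
    for (winner, loser), wt in zip(battles, weights):
        i, j = idx[winner], idx[loser]
        wins[i] += wt
        pair[i][j] += wt
        pair[j][i] += wt
    p = [1.0] * k
    for _ in range(iters):
        # MM update: p_i <- W_i / sum_j n_ij / (p_i + p_j), using the previous iterate's p
        p = [wins[i] / sum(pair[i][j] / (p[i] + p[j]) for j in range(k) if j != i)
             for i in range(k)]
        g = math.exp(sum(math.log(x) for x in p) / k)  # rescale so the geometric mean is 1
        p = [x / g for x in p]
    return {m: 1000.0 + 400.0 * math.log10(p[idx[m]]) for m in models}


def bayesian_bootstrap_ci(battles, models, n_boot=200, seed=0):
    """Approximate 95% intervals: refit under Dirichlet(1, ..., 1) battle weights, take percentiles."""
    rng = random.Random(seed)
    n = len(battles)
    draws = {m: [] for m in models}
    for _ in range(n_boot):
        # Normalised Exp(1) draws are Dirichlet(1, ..., 1); rescale so the weights sum to n
        e = [rng.expovariate(1.0) for _ in range(n)]
        total = sum(e)
        weights = [x * n / total for x in e]
        for m, rating in bt_elo(battles, models, weights).items():
            draws[m].append(rating)
    ci = {}
    for m, xs in draws.items():
        xs.sort()
        ci[m] = (xs[int(0.025 * (n_boot - 1))], xs[int(0.975 * (n_boot - 1))])
    return ci
```

Reweighting battles rather than resampling them with replacement keeps every battle in each replicate, which avoids replicates in which a sparsely played model happens to draw no wins.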
Defect tags are only counted when the rater preferred the opposing model and chose to fill out the optional tag form. Counts reflect how often each tag was placed specifically on Sonic-3.
Audio examples are the original generated waveforms served by VoiceArena's storage bucket; no file was re-encoded or normalized for this report.