When a single radio call can flip from English to Mandarin to Polish in the same minute, translation errors turn into operational risk. Our voice‑platform client initially leaned on Microsoft Azure’s speech‑to‑text, but licensing costs ballooned and “long‑tail” languages lagged behind. We decided to build (and ultimately open‑source) an in‑house, GPU‑accelerated translation service.
Cool—now someone had to test it across 69 languages, dialect quirks included.
| Layer | Stack |
| --- | --- |
| Translation engine | OpenNMT + Forte custom adapters |
| Test harness | PyTest + Ragas framework |
| LLM evaluators | Started with GPT‑4o, migrated to Claude 3.5 Sonnet for a 70 % cost drop |
| Metrics | Faithfulness, semantic similarity, BLEU, grammar |
Why Ragas? It lets me call an external LLM to grade each sample, then rolls results into numeric metrics—no mystical “looks‑right‑to‑me.”
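Here is roughly what one graded sample looks like. This is a minimal sketch assuming the 0.1‑era Ragas schema (question/answer/contexts/ground_truth; column names shift between versions), and the phrases and scores are illustrative. By default Ragas grades with an OpenAI judge unless you hand it another LLM (for example a LangChain‑wrapped Claude client), which is presumably what made the GPT‑4o → Claude swap cheap.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_similarity

# One translation sample shaped as a RAG-style record: the source phrase is
# the "question", the machine translation the "answer", and a human
# reference the "ground_truth". All values here are illustrative.
samples = Dataset.from_dict({
    "question": ["Reinforcements en route to sector four."],
    "answer": ["Verstärkung ist unterwegs zu Sektor vier."],
    "contexts": [["Fire-department radio traffic, English -> German"]],
    "ground_truth": ["Verstärkung auf dem Weg zu Sektor vier."],
})

# evaluate() sends each sample to the configured LLM judge (faithfulness)
# and an embedding model (semantic similarity), returning numeric scores.
report = evaluate(samples, metrics=[faithfulness, answer_similarity])
print(report)  # e.g. {'faithfulness': 1.00, 'answer_similarity': 0.93}
```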
I tried running with semantic similarity alone; it happily passed a sample where “reinforcements en route” became “strength on the road.” Combining all four metrics caught it.
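The fix is an all‑of‑four gate rather than any single score. Here is a sketch of that check, using sacrebleu for the BLEU leg; the threshold values and the `passes_all_gates` name are my own illustration, and the other three scores are assumed to arrive from the LLM grader.

```python
import sacrebleu

def passes_all_gates(scores: dict[str, float], hypothesis: str, reference: str) -> bool:
    """Every metric must clear its bar; thresholds here are illustrative."""
    bleu = sacrebleu.sentence_bleu(hypothesis, [reference]).score  # 0-100 scale
    return all([
        scores["faithfulness"] >= 0.80,         # LLM-judged
        scores["semantic_similarity"] >= 0.85,  # embedding-based
        scores["grammar"] >= 0.90,              # LLM-judged
        bleu >= 25.0,
    ])

# "strength on the road" sits close to the source in embedding space,
# but faithfulness and BLEU both reject it.
assert not passes_all_gates(
    {"faithfulness": 0.40, "semantic_similarity": 0.88, "grammar": 0.95},
    hypothesis="strength on the road",
    reference="reinforcements en route",
)
```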
Claude generated 20 domain‑specific phrases per language—fire‑department lingo, stadium‑security chatter, mining‑site commands.
Twenty phrases across 69 languages works out to 1,380 test sentences, roughly 10× cheaper than hiring translators and fully amortised after one release cycle.
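Generation itself is one API call per language/domain pair. A sketch with the anthropic SDK; the prompt wording, the helper name, and the exact model string are assumptions, not the production code.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def generate_phrases(language: str, domain: str, n: int = 20) -> list[str]:
    """Ask Claude for n domain-specific radio phrases in the target language."""
    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",  # assumed model string
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": (
                f"Write {n} realistic {domain} radio phrases in {language}, "
                "one per line, no numbering or commentary."
            ),
        }],
    )
    return message.content[0].text.strip().splitlines()

phrases = generate_phrases("Polish", "fire-department dispatch")
```

Two harness configurations run over that phrase bank: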
| Run | Test cases | Cost | Wall‑clock |
| --- | --- | --- | --- |
| Uni‑directional | 1,380 | $2.28 | ~2 h (CPU) |
| Bi‑directional | 2,660 | $4.60 | ~4 h (CPU) |
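The harness itself is plain pytest parametrisation over (source, target) pairs. A sketch under stated assumptions: `translate()` and `grade()` are dummy stand‑ins for the real service client and the Ragas/BLEU grader above, and the 0.8 threshold is illustrative.

```python
import pytest

# Dummy stand-ins: wire in the real service client and the grader here.
def translate(text: str, source: str, target: str) -> str:
    return text

def grade(source_text: str, translation: str) -> dict[str, float]:
    return {"faithfulness": 1.0, "semantic_similarity": 1.0, "grammar": 1.0}

TARGETS = ["zh", "pl", "de"]  # 69 language codes in the real suite
PHRASES = {t: [f"phrase {i}" for i in range(20)]  # 20 Claude-generated per language
           for t in TARGETS}

# Uni-directional: en -> target (69 languages x 20 phrases = 1,380 cases).
# The bi-directional run adds target -> en pairs on top.
PAIRS = [("en", t) for t in TARGETS] + [(t, "en") for t in TARGETS]

@pytest.mark.parametrize("source,target", PAIRS)
def test_translation_quality(source, target):
    bank = PHRASES[target if source == "en" else source]
    for phrase in bank:
        translated = translate(phrase, source=source, target=target)
        scores = grade(phrase, translated)
        failing = {k: v for k, v in scores.items() if v < 0.8}  # illustrative bar
        assert not failing, f"{source}->{target} failed: {failing}"
```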
Bugs found so far
Shipping software that talks (and listens) in 69 languages—without losing the plot.