Insights

From Proof‑of‑Concept to Practice: Using AI to Test Multilingual Translation

Written by Aleksandr Pushkarev | Aug 13, 2025

Why this matters

When a single radio call can flip from English to Mandarin to Polish in the same minute, translation errors turn into operational risk. Our voice‑platform client initially leaned on Microsoft Azure’s speech‑to‑text, but licensing costs ballooned and “long‑tail” languages lagged behind. We decided to build (and ultimately open‑source) an in‑house, GPU‑accelerated translation service.

Cool—now someone had to test it across 69 languages, dialect quirks included.

 

1. Framing the problem

  • Scope: Text‑to‑text endpoint first (audio paths are next).
  • Languages: 69 (everything we currently ship, from Arabic to Zulu).
  • Constraints: < $5 per regression run, < 2 h wall time, CI‑friendly.

2. Toolchain at a glance

Layer              | Stack
-------------------|------------------------------------------------------------------------
Translation engine | OpenNMT + Forte custom adapters
Test harness       | PyTest + Ragas framework
LLM evaluators     | Started with GPT‑4o, migrated to Claude 3.5 Sonnet for a 70 % cost drop
Metrics            | Faithfulness, Semantic similarity, BLEU, Grammar


Why Ragas? It lets me call an external LLM to grade each sample, then rolls results into numeric metrics—no mystical “looks‑right‑to‑me.”
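
Here's roughly what one graded sample looks like. This is a minimal sketch assuming the ragas 0.1‑style evaluate() API, a Hugging Face Dataset, and a judge LLM configured via the environment; the sample data is purely illustrative.

```python
# Sketch only: grade one translation sample with Ragas, using an external LLM
# as the judge. Column names follow the ragas 0.1-style schema; the phrases
# below are illustrative, not taken from the real corpus.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_similarity

dataset = Dataset.from_dict({
    "question":     ["Reinforcements en route to gate 4."],      # source text
    "answer":       ["Verstärkung unterwegs zu Tor 4."],         # /translate output under test
    "contexts":     [["Reinforcements en route to gate 4."]],    # faithfulness grades against this
    "ground_truth": ["Verstärkung ist auf dem Weg zu Tor 4."],   # reference translation
})

report = evaluate(dataset, metrics=[faithfulness, answer_similarity])
print(report)  # numeric scores per metric, e.g. faithfulness ≈ 0.9
```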

3. Metrics that actually surface bugs

  1. Faithfulness – Facts stay intact.
  2. Semantic similarity – Idioms land in the right cultural zip code.
  3. BLEU – Classic token‑level drift detector.
  4. Grammar – Tense, gender, and word order behave.

I tried running with semantic similarity alone; it happily passed a sample where “reinforcements en route” became “strength on the road.” Combining all four metrics caught it.
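
In practice that means a gate that only passes when every metric clears its floor. Here's a minimal sketch of that gate; the thresholds and the hard‑coded scores are illustrative, not our tuned production values.

```python
# Sketch of the "all four metrics must clear their floor" gate.
THRESHOLDS = {
    "faithfulness": 0.80,
    "semantic_similarity": 0.75,
    "bleu": 0.30,
    "grammar": 0.85,
}

def failed_metrics(scores: dict[str, float]) -> list[str]:
    """Names of every metric that fell below its threshold."""
    return [name for name, floor in THRESHOLDS.items() if scores[name] < floor]

def test_gate_catches_paraphrase_drift():
    # "reinforcements en route" -> "strength on the road": similarity alone
    # looked tolerable, but faithfulness and BLEU flag the drift.
    scores = {"faithfulness": 0.42, "semantic_similarity": 0.78,
              "bleu": 0.11, "grammar": 0.90}
    assert failed_metrics(scores) == ["faithfulness", "bleu"]
```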

 

4. Generating reliable test data (without going broke)

Claude generated 20 domain‑specific phrases per language—fire‑department lingo, stadium‑security chatter, mining‑site commands.

  • Tokens: ≈ 9 M
  • One‑time cost: $6.30

That’s 10× cheaper than hiring translators and fully amortised after one release cycle.
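
The generation script itself is a one‑off. Here's a hedged sketch using the official anthropic SDK; the model ID, prompt wording, language list, and file layout are illustrative.

```python
# One-off corpus generation sketch with the anthropic SDK. Prompt, model ID
# and output layout are illustrative; generated phrases still get spot-checked
# by human translators (see lesson 4 below).
import json
import os
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LANGUAGES = ["Arabic", "Khmer", "Marathi", "Polish", "Zulu"]  # 69 in the real run
DOMAINS = ["fire-department radio traffic", "stadium security", "mining-site commands"]

os.makedirs("corpus", exist_ok=True)
for language in LANGUAGES:
    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=2000,
        messages=[{
            "role": "user",
            "content": (
                f"Write 20 short operational phrases in English covering "
                f"{', '.join(DOMAINS)}, each paired with a natural {language} translation. "
                'Return only a JSON list of objects with "en" and "target" keys.'
            ),
        }],
    )
    pairs = json.loads(message.content[0].text)  # assumes the model returns clean JSON
    with open(f"corpus/{language.lower()}.json", "w", encoding="utf-8") as fh:
        json.dump(pairs, fh, ensure_ascii=False, indent=2)
```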

 

5. Running the suite


Run             | Test cases | Cost  | Wall‑clock
----------------|------------|-------|------------
Uni‑directional | 1,380      | $2.28 | ~2 h (CPU)
Bi‑directional  | 2,660      | $4.60 | ~4 h (CPU)
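
Under the hood this is plain PyTest parametrisation: each (language, phrase) pair becomes one uni‑directional case, and the bi‑directional run adds the reverse direction. Here's a sketch with stand‑in helpers; translate(), load_corpus(), and the language list are illustrative, not the real client code.

```python
# Regression-driver sketch: fan out (language, phrase) pairs into test cases.
import pytest

LANGUAGES = ["ar", "km", "mr", "pl", "zu"]  # 69 codes in the real run
BIDIRECTIONAL = True

def load_corpus(lang: str) -> list[dict]:
    """Stand-in for loading the 20 generated phrases for one language."""
    return [{"en": "Reinforcements en route to gate 4.", "target": "…"}] * 20

def translate(text: str, source: str, target: str) -> str:
    """Stand-in for the in-house /translate endpoint client."""
    return text  # placeholder so the sketch runs end to end

def cases():
    for lang in LANGUAGES:
        for pair in load_corpus(lang):
            yield ("en", lang, pair["en"], pair["target"])
            if BIDIRECTIONAL:
                yield (lang, "en", pair["target"], pair["en"])

@pytest.mark.parametrize("source,target,text,reference", list(cases()))
def test_translation(source, target, text, reference):
    candidate = translate(text, source=source, target=target)
    # The real assertion runs the four-metric gate (faithfulness, semantic
    # similarity, BLEU, grammar); here we only check the call round-trips.
    assert candidate.strip()
```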

Bugs found so far

  • Missing tokenizer rules → tokens rendered as “?” for Marathi & Khmer
  • Intermittent 503s under high concurrency (container thread pool patched)
  • Mistranslated proper nouns that slipped past the faithfulness check but failed BLEU

6. Lessons worth stealing

  1. Cheap models first – Claude 3.5 Sonnet nails grad‑school accuracy for pennies.
  2. Cache tokenizers locally – Storing NLTK assets beside the codebase slashes startup latency.
  3. Trace cost per test – Pipe token usage to stdout; surprises disappear (see the sketch after this list).
  4. “AI tests AI” ≠ hands‑off – Spot‑check generated corpora with human translators every release.
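
For lesson 3, a session‑scoped fixture that every judge call reports into is enough. Here's a sketch with illustrative per‑token prices:

```python
# Per-run cost tracing sketch (lesson 3): every LLM call adds its token usage
# to one ledger, and the total cost is printed when the session ends.
# The per-million-token prices are illustrative.
import pytest

PRICE_PER_MTOK = {"input": 3.00, "output": 15.00}  # USD, illustrative

class TokenLedger:
    def __init__(self):
        self.input_tokens = 0
        self.output_tokens = 0

    def add(self, input_tokens: int, output_tokens: int) -> None:
        self.input_tokens += input_tokens
        self.output_tokens += output_tokens

    def cost(self) -> float:
        return (self.input_tokens * PRICE_PER_MTOK["input"]
                + self.output_tokens * PRICE_PER_MTOK["output"]) / 1_000_000

@pytest.fixture(scope="session")
def ledger():
    ledger = TokenLedger()
    yield ledger
    print(f"\nLLM usage: {ledger.input_tokens} in / {ledger.output_tokens} out ≈ ${ledger.cost():.2f}")
```

For lesson 2, prepending a checked‑in directory to nltk.data.path (after downloading the needed assets into it once) is usually all the caching takes.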

7. Roadmap

  • GPU builds → target 10× throughput
  • Human‑made reference translations (e.g. UN Parallel Corpus, ParaCrawl) → avoid closing the loop where “AI tests AI”
  • One‑to‑Many translations → fan‑out testing in a single request
  • Speech pathways → integrate transcribe & speak endpoints once the audio model stabilises
  • Open repo → sanitized reference architecture + sample corpora (ETA Q3‑2025)

Shipping software that talks (and listens) in 69 languages—without losing the plot.