Insights

From Proof‑of‑Concept to Practice: Using AI to Test Multilingual Translation

Written by Aleksandr Pushkarev | Aug 13, 2025

Why this matters

When a single radio call can flip from English to Mandarin to Polish in the same minute, translation errors turn into operational risk. Our voice‑platform client initially leaned on Microsoft Azure’s speech‑to‑text, but licensing costs ballooned and “long‑tail” languages lagged behind. We decided to build (and ultimately open‑source) an in‑house, GPU‑accelerated translation service.

Cool—now someone had to test it across 69 languages, dialect quirks included.

 

1. Framing the problem

  • Scope: Text‑to‑text endpoint first (audio paths are next).
  • Languages: 69 (everything we currently ship, from Arabic to Zulu).
  • Constraints: < $5 per regression run, < 2 h wall time, CI‑friendly.

2. Toolchain at a glance

Layer              | Stack
-------------------|------------------------------------------------------------------------
Translation engine | OpenNMT + Forte custom adapters
Test harness       | PyTest + Ragas framework
LLM evaluators     | Started with GPT‑4o, migrated to Claude 3.5 Sonnet for a 70 % cost drop
Metrics            | Faithfulness, Semantic similarity, BLEU, Grammar


Why Ragas? It lets me call an external LLM to grade each sample, then rolls results into numeric metrics—no mystical “looks‑right‑to‑me.”
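
Here's roughly what one graded sample looks like. This is a minimal sketch assuming the ragas 0.1‑style evaluate() API, a Hugging Face Dataset, and a judge LLM configured via the environment; the sample data is purely illustrative.

```python
# Sketch only: grade one translation sample with Ragas, using an external LLM
# as the judge. Column names follow the ragas 0.1-style schema; the phrases
# below are illustrative, not taken from the real corpus.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_similarity

dataset = Dataset.from_dict({
    "question":     ["Reinforcements en route to gate 4."],      # source text
    "answer":       ["Verstärkung unterwegs zu Tor 4."],         # /translate output under test
    "contexts":     [["Reinforcements en route to gate 4."]],    # faithfulness grades against this
    "ground_truth": ["Verstärkung ist auf dem Weg zu Tor 4."],   # reference translation
})

report = evaluate(dataset, metrics=[faithfulness, answer_similarity])
print(report)  # numeric scores per metric, e.g. faithfulness ≈ 0.9
```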

3. Metrics that actually surface bugs

  1. Faithfulness – Facts stay intact.
  2. Semantic similarity – Idioms land in the right cultural zip code.
  3. BLEU – Classic token‑level drift detector.
  4. Grammar – Tense, gender, and word order behave.

I tried running with semantic similarity alone; it happily passed a sample where “reinforcements en route” became “strength on the road.” Combining all four metrics caught it.
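
In practice that means a gate that only passes when every metric clears its floor. Here's a minimal sketch of that gate; the thresholds and the hard‑coded scores are illustrative, not our tuned production values.

```python
# Sketch of the "all four metrics must clear their floor" gate.
THRESHOLDS = {
    "faithfulness": 0.80,
    "semantic_similarity": 0.75,
    "bleu": 0.30,
    "grammar": 0.85,
}

def failed_metrics(scores: dict[str, float]) -> list[str]:
    """Names of every metric that fell below its threshold."""
    return [name for name, floor in THRESHOLDS.items() if scores[name] < floor]

def test_gate_catches_paraphrase_drift():
    # "reinforcements en route" -> "strength on the road": similarity alone
    # looked tolerable, but faithfulness and BLEU flag the drift.
    scores = {"faithfulness": 0.42, "semantic_similarity": 0.78,
              "bleu": 0.11, "grammar": 0.90}
    assert failed_metrics(scores) == ["faithfulness", "bleu"]
```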

 

4. Generating reliable test data (without going broke)

Claude generated 20 domain‑specific phrases per language—fire‑department lingo, stadium‑security chatter, mining‑site commands.

  • Tokens: ≈ 9 M
  • One‑time cost: $6.30

That’s 10× cheaper than hiring translators and fully amortised after one release cycle.
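
The generation script itself is a one‑off. Here's a hedged sketch using the official anthropic SDK; the model ID, prompt wording, language list, and file layout are illustrative.

```python
# One-off corpus generation sketch with the anthropic SDK. Prompt, model ID
# and output layout are illustrative; generated phrases still get spot-checked
# by human translators (see lesson 4 below).
import json
import os
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LANGUAGES = ["Arabic", "Khmer", "Marathi", "Polish", "Zulu"]  # 69 in the real run
DOMAINS = ["fire-department radio traffic", "stadium security", "mining-site commands"]

os.makedirs("corpus", exist_ok=True)
for language in LANGUAGES:
    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=2000,
        messages=[{
            "role": "user",
            "content": (
                f"Write 20 short operational phrases in English covering "
                f"{', '.join(DOMAINS)}, each paired with a natural {language} translation. "
                'Return only a JSON list of objects with "en" and "target" keys.'
            ),
        }],
    )
    pairs = json.loads(message.content[0].text)  # assumes the model returns clean JSON
    with open(f"corpus/{language.lower()}.json", "w", encoding="utf-8") as fh:
        json.dump(pairs, fh, ensure_ascii=False, indent=2)
```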

 

5. Running the suite


Run             | Test cases | Cost  | Wall‑clock
----------------|------------|-------|------------
Uni‑directional | 1,380      | $2.28 | ~2 h (CPU)
Bi‑directional  | 2,660      | $4.60 | ~4 h (CPU)
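
Under the hood this is plain PyTest parametrisation: each (language, phrase) pair becomes one uni‑directional case, and the bi‑directional run adds the reverse direction. Here's a sketch with stand‑in helpers; translate(), load_corpus(), and the language list are illustrative, not the real client code.

```python
# Regression-driver sketch: fan out (language, phrase) pairs into test cases.
import pytest

LANGUAGES = ["ar", "km", "mr", "pl", "zu"]  # 69 codes in the real run
BIDIRECTIONAL = True

def load_corpus(lang: str) -> list[dict]:
    """Stand-in for loading the 20 generated phrases for one language."""
    return [{"en": "Reinforcements en route to gate 4.", "target": "…"}] * 20

def translate(text: str, source: str, target: str) -> str:
    """Stand-in for the in-house /translate endpoint client."""
    return text  # placeholder so the sketch runs end to end

def cases():
    for lang in LANGUAGES:
        for pair in load_corpus(lang):
            yield ("en", lang, pair["en"], pair["target"])
            if BIDIRECTIONAL:
                yield (lang, "en", pair["target"], pair["en"])

@pytest.mark.parametrize("source,target,text,reference", list(cases()))
def test_translation(source, target, text, reference):
    candidate = translate(text, source=source, target=target)
    # The real assertion runs the four-metric gate (faithfulness, semantic
    # similarity, BLEU, grammar); here we only check the call round-trips.
    assert candidate.strip()
```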

Bugs found so far

  • Missing tokenizer rules → tokens rendered as “?” for Marathi & Khmer
  • Intermittent 503s under high concurrency (container thread pool patched)
  • Mistranslated proper nouns that slipped past the faithfulness check but failed BLEU

6. Lessons worth stealing

  1. Cheap models first – Claude 3.5 Sonnet nails grad‑school accuracy for pennies.
  2. Cache tokenizers locally – Storing NLTK assets beside the codebase slashes startup latency.
  3. Trace cost per test – Pipe token usage to stdout; surprises disappear (see the sketch after this list).
  4. “AI tests AI” ≠ hands‑off – Spot‑check generated corpora with human translators every release.
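
For lesson 3, a session‑scoped fixture that every judge call reports into is enough. Here's a sketch with illustrative per‑token prices:

```python
# Per-run cost tracing sketch (lesson 3): every LLM call adds its token usage
# to one ledger, and the total cost is printed when the session ends.
# The per-million-token prices are illustrative.
import pytest

PRICE_PER_MTOK = {"input": 3.00, "output": 15.00}  # USD, illustrative

class TokenLedger:
    def __init__(self):
        self.input_tokens = 0
        self.output_tokens = 0

    def add(self, input_tokens: int, output_tokens: int) -> None:
        self.input_tokens += input_tokens
        self.output_tokens += output_tokens

    def cost(self) -> float:
        return (self.input_tokens * PRICE_PER_MTOK["input"]
                + self.output_tokens * PRICE_PER_MTOK["output"]) / 1_000_000

@pytest.fixture(scope="session")
def ledger():
    ledger = TokenLedger()
    yield ledger
    print(f"\nLLM usage: {ledger.input_tokens} in / {ledger.output_tokens} out ≈ ${ledger.cost():.2f}")
```

For lesson 2, prepending a checked‑in directory to nltk.data.path (after downloading the needed assets into it once) is usually all the caching takes.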

7. Roadmap

  • GPU builds → target 10× throughput
  • Human‑made reference translations (e.g. UN Parallel Corpus, ParaCrawl) → avoid closing the loop where “AI tests AI”
  • One‑to‑Many translations → fan‑out testing in a single request
  • Speech pathways → integrate transcribe & speak endpoints once the audio model stabilises
  • Open repo → sanitized reference architecture + sample corpora (ETA Q3‑2025)

Shipping software that talks (and listens) in 69 languages—without losing the plot.