◇ how it works

five steps. no surprises.

this page is the user-friendly walkthrough. if you want the engine internals, benchmarks, and full security posture — that lives at /under-the-hood.

━━ the 5-step flow

  1. sign in

    email magic link. no passwords stored, ever. lost access? request a new link.

  2. upload

    drag-drop a .tmx (up to 2 GB). the file goes straight to encrypted object storage via a one-time signed URL, so its bytes never touch our API node.
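
    for the curious, a minimal sketch of how a one-time signed URL can work in general. this is an illustration, not our implementation: the secret, the host `storage.example.com`, and both function names are hypothetical, and real object stores ship their own signing schemes.

```python
import hashlib
import hmac
from urllib.parse import urlencode

SECRET = b"demo-secret"  # hypothetical signing key; a real key lives in a secret store


def _signature(object_key: str, expires_at: int, secret: bytes) -> str:
    """HMAC-SHA256 over the method, the object key, and the expiry time."""
    message = f"PUT\n{object_key}\n{expires_at}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def sign_upload_url(object_key: str, expires_at: int, secret: bytes = SECRET) -> str:
    """Build an upload URL carrying its own expiry and signature."""
    query = urlencode({"expires": expires_at, "sig": _signature(object_key, expires_at, secret)})
    return f"https://storage.example.com/{object_key}?{query}"  # hypothetical host


def verify_upload(object_key: str, expires_at: int, sig: str, now: int,
                  used: set, secret: bytes = SECRET) -> bool:
    """Accept a request only if it is correctly signed, unexpired, and unused."""
    if not hmac.compare_digest(_signature(object_key, expires_at, secret), sig):
        return False  # wrong key, wrong object, or tampered expiry
    if now > expires_at:
        return False  # link has expired
    if sig in used:
        return False  # one-time: a signature cannot be replayed
    used.add(sig)
    return True
```

    the storage layer can check the signature on its own, which is why the upload never has to pass through an API server.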

  3. preview · free

    we scan the file and show you exactly what will be removed (duplicates, junk) plus the projected output size. the preview doesn't count against your quota; nothing is consumed until you proceed.

  4. pick a preset · then clean

    three named presets cover ~95% of jobs — Lenient / Balanced / Strict — one click, eleven options applied. fine-tune via the Custom rules panel if you need to. the engine streams through your file in two passes, never loading it into RAM.
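
    a rough sketch of what "two passes, never loading it into RAM" can look like, using Python's standard `xml.etree.ElementTree.iterparse`. the function names, the tiny sample file, and the "last variant wins" rule are assumptions for illustration; the real engine's internals live at /under-the-hood.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for a TMX file; a real header carries more attributes.
SAMPLE = b"""<tmx version="1.4"><header/><body>
<tu><tuv xml:lang="en"><seg>Save  file</seg></tuv><tuv xml:lang="de"><seg>Datei speichern</seg></tuv></tu>
<tu><tuv xml:lang="en"><seg>Save file</seg></tuv><tuv xml:lang="de"><seg>Datei sichern</seg></tuv></tu>
<tu><tuv xml:lang="en"><seg>Open</seg></tuv><tuv xml:lang="de"><seg>Oeffnen</seg></tuv></tu>
</body></tmx>"""


def iter_tus(stream):
    """Yield the segment texts of one <tu> at a time, then free that element."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "tu":
            yield ["".join(seg.itertext()) for seg in elem.iter("seg")]
            elem.clear()  # drop the children and text of the TU we just handled


def normalize(text):
    """Collapse whitespace runs so whitespace-only differences compare equal."""
    return " ".join(text.split())


def two_pass_dedup(open_stream):
    """Yield (kept, segments) per TU; assumes the first <seg> is the source."""
    winner = {}  # normalized source -> index of the TU that should survive
    with open_stream() as stream:  # pass 1: decide winners (last variant wins here)
        for index, segs in enumerate(iter_tus(stream)):
            winner[normalize(segs[0])] = index
    with open_stream() as stream:  # pass 2: emit a verdict for each TU
        for index, segs in enumerate(iter_tus(stream)):
            yield winner[normalize(segs[0])] == index, segs


for kept, segs in two_pass_dedup(lambda: io.BytesIO(SAMPLE)):
    print("keep  " if kept else "remove", segs)
```

    in this sketch, memory grows with the number of distinct sources (the `winner` map), not with the size of the file, because each translation unit is released right after it is read.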

  5. download

    one bundle.zip (cleaned.tmx + removed.tmx + removed.csv, max compression). individual files also available. original upload is purged the moment processing finishes; downloads stay live for 24 hours.
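
    if you script your own packaging downstream, the bundle layout is easy to reproduce. a small sketch with Python's standard `zipfile` module, assuming "max compression" means DEFLATE at level 9; `build_bundle` is a hypothetical helper name.

```python
import io
import zipfile


def build_bundle(files: dict[str, bytes]) -> bytes:
    """Pack the given files into one zip archive at the highest DEFLATE level."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED,
                         compresslevel=9) as bundle:
        for name, data in files.items():
            bundle.writestr(name, data)
    return buffer.getvalue()


bundle_bytes = build_bundle({
    "cleaned.tmx": b"<tmx/>",
    "removed.tmx": b"<tmx/>",
    "removed.csv": b"source,target\n",
})
```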

━━ pick a preset · or fine-tune

three named cards above the rules panel — one click applies eleven options atomically. the right shape for ~95% of jobs. power users open the “custom rules” disclosure for the granular toggle panel underneath.

Lenient
keep more · QA & review

preserves untranslated entries, MT-flagged TUs, and short UI labels verbatim. Good for human review or pre-delivery inspection where you want the final say.

Balanced
default · LSP delivery

drops mechanical leakage (untranslated, empty target, MT-flagged) but doesn't run the strict QA validators. Sane defaults for delivering a TM to a client.

Strict
cut deep · MT training

everything Balanced does plus all QA validators on (placeholder mismatch, mojibake, length anomalies) and case-insensitive dedup for tighter merging. Right shape for MT-training datasets.

━━ what gets removed · or flagged

the engine ships eight cleanup classifiers, grouped into three categories. each is individually toggleable in the custom-rules panel; the presets above bundle sane defaults so you don’t have to pick.

duplicates
  • exact matches by normalized source
  • whitespace-only differences
  • winner picked by Popular / Latest — or by your priority-author list
  • manual per-group override always available in the preview
structural junk
  • empty source / empty target
  • single character · pure numbers
  • URLs · emails · phone numbers
  • pure tags · repeated symbols (--- / === / ***)
  • suspect markup (closing-tag corruption from CAT round-trips)
translation-quality red flags
  • untranslated (target = source · case + whitespace ignored)
  • MT-flagged TUs (creationid / changeid contains MT engine signature)
  • length outliers — target much longer OR shorter than source · NFC-normalised so CJK / Indic / Thai targets are treated fairly
  • placeholder mismatch — {0} / %s / <ph/> count differs between source and target
  • mojibake (encoding corruption, e.g. 'café' → 'cafÃ©')
  • brand / DNT terms missing in target — you list iPhone, HIPAA, etc. and we flag any TU where the term appears in source but not target
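
a few of the red-flag checks above are simple enough to sketch. what follows is an illustration under stated assumptions, not the engine's code: the function names are hypothetical, the placeholder pattern only covers the three forms named above, and the mojibake test is one common heuristic (text that re-decodes cleanly from cp1252 bytes as UTF-8 was probably mis-decoded).

```python
import re

# Only the placeholder shapes named in the list: {0}, %s (and %d), <ph/>.
PLACEHOLDER_RE = re.compile(r"\{\d+\}|%[sd]|<ph\b[^>]*/>")


def is_untranslated(source: str, target: str) -> bool:
    """True when target equals source once case and whitespace are ignored."""
    def norm(text: str) -> str:
        return " ".join(text.split()).casefold()
    return norm(source) == norm(target)


def placeholder_mismatch(source: str, target: str) -> bool:
    """True when the number of placeholders differs between source and target."""
    return len(PLACEHOLDER_RE.findall(source)) != len(PLACEHOLDER_RE.findall(target))


def looks_like_mojibake(text: str) -> bool:
    """Heuristic: UTF-8 bytes that were wrongly decoded as cp1252 round-trip back."""
    try:
        repaired = text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return False  # not representable in cp1252, or not valid UTF-8 underneath
    return repaired != text


print(is_untranslated("OK ", "ok"))
print(placeholder_mismatch("Hello {0}, you have %s items", "Hallo {0}"))
print(looks_like_mojibake("caf\u00c3\u00a9"))  # the corrupted form of 'café'
```

plain ASCII and correctly encoded accented text both pass the mojibake check, since neither changes when re-decoded.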

━━ priority authors

when duplicates conflict, you can rank the trusted translators whose variants should win. an ordered list (5–20 names typically) overrides the Popular / Latest strategy for any group where one of those authors has a variant; everywhere else falls back to your default. catches the senior-translator-vs-freelancer case that no major CAT tool solves at TM-cleanup time.
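
the selection rule described above can be sketched in a few lines. the `Variant` fields, the `pick_winner` name, and the timestamp format are assumptions for illustration; the sketch also assumes the author comes from the TU's creationid / changeid.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Variant:
    target: str
    author: str   # assumed to come from creationid / changeid
    changed: str  # assumed sortable timestamp such as "20240102T000000Z"


def pick_winner(group, priority_authors, fallback="popular"):
    """Pick one variant from a duplicate group.

    A variant by a listed author wins (earlier in the list beats later);
    otherwise fall back to the Latest or Popular strategy.
    """
    rank = {name: position for position, name in enumerate(priority_authors)}
    trusted = [variant for variant in group if variant.author in rank]
    if trusted:
        return min(trusted, key=lambda variant: rank[variant.author])
    if fallback == "latest":
        return max(group, key=lambda variant: variant.changed)
    # Popular: the target text that occurs most often in the group.
    top_target, _count = Counter(v.target for v in group).most_common(1)[0]
    return next(v for v in group if v.target == top_target)


group = [
    Variant("Datei speichern", "freelancer1", "20240101T000000Z"),
    Variant("Datei speichern", "freelancer2", "20240102T000000Z"),
    Variant("Datei sichern", "senior.anna", "20230101T000000Z"),
]
print(pick_winner(group, ["senior.anna"]).target)
print(pick_winner(group, []).target)
```

with `senior.anna` on the list her variant wins even though it is older and less common; with an empty list the group falls back to the most popular target.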

━━ what we keep, untouched

━━ privacy in plain english

full security posture (XXE handling, signed URLs, headers, retention policy) → /under-the-hood