◇ documentation

how TM Cleaner works

streaming dedup + junk removal for translation memories. your files never outlive their 1-hour TTL.

━━ the 5-step flow

  1. sign in

    email magic link. no passwords stored, ever. lost access? request a new link.

  2. upload

    drag-drop a .tmx (up to 2 GB). uploaded directly to encrypted object storage via a one-time signed URL — bytes never touch our API node.

  3. preview · free

    we scan and show you exactly what will be removed (duplicates, junk) plus the projected output size. your quota isn't touched until you proceed.

  4. clean

    review and confirm. the engine streams through your file in two passes and never loads the whole file into RAM, even for multi-GB files.

  5. download

    one bundle.zip (cleaned.tmx + removed.tmx + removed.csv, max compression). individual files also available. original upload is purged the moment processing finishes; downloads stay live for 1 hour, matching the TTL.
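
a minimal sketch of the bundling in step 5, assuming python's standard zipfile module with deflate at its highest level. the function name build_bundle and the sample bytes are illustrative, not the service's actual code.

```python
import io
import zipfile


def build_bundle(files: dict[str, bytes]) -> bytes:
    """Pack a name -> bytes mapping into one zip, deflated at level 9."""
    buf = io.BytesIO()
    with zipfile.ZipFile(
        buf, "w", compression=zipfile.ZIP_DEFLATED, compresslevel=9
    ) as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    return buf.getvalue()


# Placeholder contents; a real run would stream the three output files.
bundle = build_bundle({
    "cleaned.tmx": b"<tmx/>",
    "removed.tmx": b"<tmx/>",
    "removed.csv": b"source,target,reason\n",
})
```

level 9 trades CPU time for the smallest archive, which suits a one-off download.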

━━ what gets removed

duplicates
  • exact matches by normalized source
  • whitespace-only differences
  • best version wins (more content + inline tags + newer date)
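
the duplicate rules above can be sketched roughly as follows. this is an illustration under assumptions: the Segment fields, the whitespace-collapsing normalize, and the tie-break order (content length, then inline tags, then date) are guesses at how "best version wins" might be ranked, not the engine's documented behavior.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    source: str
    target: str
    tag_count: int   # number of inline tags in the segment
    changed: str     # TMX-style timestamp, e.g. "20240131T120000Z"


def normalize(text: str) -> str:
    """Collapse runs of whitespace so whitespace-only variants compare equal."""
    return " ".join(text.split())


def score(seg: Segment) -> tuple:
    # Assumed ranking: more content, then more inline tags, then newer date.
    # TMX timestamps sort correctly as plain strings.
    return (len(seg.target.strip()), seg.tag_count, seg.changed)


def dedupe(segments):
    """Return (kept, removed); the best-scoring segment per source is kept."""
    best = {}
    removed = []
    for seg in segments:
        key = normalize(seg.source)
        current = best.get(key)
        if current is None:
            best[key] = seg
        elif score(seg) > score(current):
            removed.append(current)   # displaced by a better version
            best[key] = seg
        else:
            removed.append(seg)
    return list(best.values()), removed
```
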
junk segments
  • empty source
  • single character
  • pure numbers (12345 / 1.2.3 / 1,234)
  • URLs (https://… / www.…)
  • emails / mailto links
  • phone numbers
  • pure tags (<br/>)
  • repeated symbols (--- / === / ***)
  • >80% punctuation ratio
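
one way the junk rules above might be expressed, as a hedged sketch. the regular expressions are assumptions written to match the examples in the list; only empty_source and junk_short are reason codes named in this document, the other labels are hypothetical.

```python
import re
import string

# Hypothetical patterns; each must match the whole (stripped) source.
_RULES = [
    ("pure_number", re.compile(r"[\d.,\s]*\d[\d.,\s]*")),        # 12345 / 1.2.3
    ("url", re.compile(r"(?:https?://|www\.)\S+", re.I)),
    ("email", re.compile(r"(?:mailto:)?[^@\s]+@[^@\s]+\.[^@\s]+", re.I)),
    ("phone", re.compile(r"\+?(?=(?:\D*\d){7})[\d\s().-]+")),   # needs 7+ digits
    ("pure_tags", re.compile(r"(?:<[^>]+>\s*)+")),               # <br/>
    ("repeated_symbol", re.compile(r"([^\w\s])\1{2,}")),         # --- / ===
]


def junk_reason(source: str):
    """Return a reason label if the source looks like junk, else None."""
    text = source.strip()
    if not text:
        return "empty_source"
    if len(text) == 1:
        return "junk_short"
    for label, pattern in _RULES:
        if pattern.fullmatch(text):
            return label
    # ASCII punctuation only; a real engine would likely use Unicode classes.
    punct = sum(ch in string.punctuation for ch in text)
    if punct / len(text) > 0.8:
        return "punctuation_ratio"
    return None
```

rules are checked in order, so a string like 1234567 is reported as a pure number rather than a phone number.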

━━ what we keep, untouched

━━ under the hood

streaming engine. the cleaner uses a two-pass iterparse design. memory is O(unique segments), not O(file size) — so a 2 GB TMX cleans on a small VPS without swapping.
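
a rough sketch of what a two-pass iterparse design can look like, using python's xml.etree.ElementTree. assumptions, all hypothetical: the first <seg> in each <tu> is the source, "best" is approximated by total target length, and the TMX header and footer are left out of the output. pass 1 keeps only one small record per unique source; pass 2 re-reads the file and writes the winners.

```python
import io
import xml.etree.ElementTree as ET


def iter_tus(stream):
    """Yield (index, <tu>) one at a time, detaching finished units from <body>."""
    body = None
    index = 0
    for event, elem in ET.iterparse(stream, events=("start", "end")):
        if event == "start":
            if elem.tag == "body":
                body = elem
        elif elem.tag == "tu":
            yield index, elem
            index += 1
            if body is not None:
                body.clear()  # drop parsed <tu> children so they can be freed


def seg_texts(tu):
    """Text of every <seg> in the unit, in document order."""
    return ["".join(seg.itertext()) for seg in tu.iter("seg")]


def clean(open_stream, out):
    """open_stream() must return a fresh binary stream on each call."""
    # Pass 1: remember only (score, index) per normalized source.
    winners = {}
    with open_stream() as stream:
        for i, tu in iter_tus(stream):
            texts = seg_texts(tu)
            if not texts or not texts[0].strip():
                continue  # empty source: never a winner
            key = " ".join(texts[0].split())
            score = sum(len(t) for t in texts[1:])
            if key not in winners or score > winners[key][0]:
                winners[key] = (score, i)
    keep = {i for _, i in winners.values()}

    # Pass 2: stream again and write only the winning units.
    kept = 0
    with open_stream() as stream:
        for i, tu in iter_tus(stream):
            if i in keep:
                tu.tail = "\n"
                out.write(ET.tostring(tu, encoding="unicode"))
                kept += 1
    return kept
```

for a file on disk, `clean(lambda: open(path, "rb"), out_file)` would run both passes without holding the document in memory.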

measured benchmark. on a synthetic 1,000,000-segment TMX (255 MB, 30% duplicates): 108 seconds wall-clock, 373 MB peak RAM, 35% size reduction. runtime scales linearly with input size.
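
to put those numbers in perspective, here is the arithmetic as a small sketch. the projection assumes the linear scaling holds up to the upload limit and treats 2 GB as 2048 MB; it is an estimate derived from the benchmark above, not a separate measurement.

```python
# Figures taken from the benchmark above.
segments = 1_000_000
size_mb = 255
seconds = 108

segments_per_second = segments / seconds   # about 9,259 segments/s
mb_per_second = size_mb / seconds          # about 2.36 MB/s


def projected_seconds(file_mb: float) -> float:
    """Estimated wall-clock time, assuming throughput stays constant."""
    return file_mb / mb_per_second


two_gb = projected_seconds(2048)           # about 867 s, roughly 14.5 minutes
```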

no audit lost. every removed or displaced segment is logged in the CSV with its reason (empty_source, junk_short, duplicate_*, replaced_*) and preserved as a real TMX you can re-import.
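
a hedged sketch of how such an audit log could be written and summarized with python's csv module. the column names (source, target, reason) and both helper functions are hypothetical; the sample rows use only the two reason codes spelled out in full above.

```python
import csv
import io
from collections import Counter


def write_audit(rows, out):
    """Write (source, target, reason) rows as CSV with a header line."""
    writer = csv.writer(out)
    writer.writerow(["source", "target", "reason"])
    writer.writerows(rows)


def count_by_family(csv_text: str) -> Counter:
    """Count logged segments per reason family (text before the first '_')."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row["reason"].split("_", 1)[0] for row in reader)


buf = io.StringIO()
write_audit(
    [
        ("", "Hallo", "empty_source"),
        ("a", "a", "junk_short"),
        ("-", "-", "junk_short"),
    ],
    buf,
)
counts = count_by_family(buf.getvalue())
```

grouping by prefix means the duplicate_* and replaced_* variants would each roll up into one family without the summary needing to know every suffix.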

━━ privacy & security