how TM Cleaner works
streaming dedup + junk removal for translation memories. your files never outlive their 1-hour TTL.
━━ the 5-step flow
- 01 sign in
email magic link. no passwords stored, ever. lost access? request a new link.
- 02 upload
drag-drop a .tmx (up to 2 GB). uploaded directly to encrypted object storage via a one-time signed URL — bytes never touch our API node.
- 03 preview · free
we scan and show you exactly what will be removed: duplicates, junk, projected output size. doesn't consume your quota until you proceed.
- 04 clean
review and confirm. the engine streams through your file in two passes — never loading it into RAM, even for multi-GB files.
- 05 download
one bundle.zip (cleaned.tmx + removed.tmx + removed.csv, max compression). individual files also available. original upload is purged the moment processing finishes; downloads stay live for 24 hours.
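the bundle in step 05 can be sketched with the standard library. this is a minimal illustration, not the service's actual code: the function name `build_bundle` and the directory layout are assumptions, and "max compression" is read here as deflate level 9.

```python
import zipfile
from pathlib import Path

# File names come from the page; everything else here is hypothetical.
BUNDLE_MEMBERS = ["cleaned.tmx", "removed.tmx", "removed.csv"]

def build_bundle(out_dir: Path, bundle_path: Path) -> Path:
    """Pack the three outputs into one archive at the highest deflate level."""
    with zipfile.ZipFile(
        bundle_path, "w", compression=zipfile.ZIP_DEFLATED, compresslevel=9
    ) as zf:
        for name in BUNDLE_MEMBERS:
            # arcname keeps the archive flat: no directory prefix inside the zip
            zf.write(out_dir / name, arcname=name)
    return bundle_path
```

TMX is verbose XML, so deflate usually shrinks it a lot; level 9 trades extra CPU time for the smallest download.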
━━ what gets removed
- → exact matches by normalized source
- → whitespace-only differences
- → among duplicates, the best version wins (more content + inline tags + newer date)
- → empty source
- → single character
- → pure numbers (12345 / 1.2.3 / 1,234)
- → URLs (https://… / www.…)
- → emails / mailto links
- → phone numbers
- → pure tags (<br/>)
- → repeated symbols (--- / === / ***)
- → punctuation ratio above 80%
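the junk rules above could be expressed as a small classifier. this is a rough sketch under assumptions: the regexes are illustrative guesses at each rule, and only `empty_source` and `junk_short` are reason names taken from this page; the other reason strings are hypothetical.

```python
import re
from typing import Optional

# Illustrative patterns only; the real engine's rules may differ.
TAG_ONLY_RE = re.compile(r"^(?:<[^>]+>\s*)+$")
NUMBER_RE = re.compile(r"^\d+(?:[.,]\d+)*$")
URL_RE = re.compile(r"^(?:https?://|www\.)\S+$", re.IGNORECASE)
EMAIL_RE = re.compile(r"^(?:mailto:)?[^@\s]+@[^@\s]+\.[^@\s]+$", re.IGNORECASE)
PHONE_RE = re.compile(r"^\+?[\d\s().-]{7,}$")
REPEATED_RE = re.compile(r"^([^\w\s])\1{2,}$")

def junk_reason(source: str, max_punct_ratio: float = 0.8) -> Optional[str]:
    """Return a reason string if the source segment is junk, else None."""
    text = source.strip()
    if not text:
        return "empty_source"
    if TAG_ONLY_RE.match(text):
        return "junk_tags"         # hypothetical reason name
    if len(text) == 1:
        return "junk_short"
    if NUMBER_RE.match(text):
        return "junk_number"       # hypothetical reason name
    if URL_RE.match(text):
        return "junk_url"          # hypothetical reason name
    if EMAIL_RE.match(text):
        return "junk_email"        # hypothetical reason name
    if PHONE_RE.match(text) and sum(c.isdigit() for c in text) >= 7:
        return "junk_phone"        # hypothetical reason name
    if REPEATED_RE.match(text):
        return "junk_symbols"      # hypothetical reason name
    punct = sum(1 for c in text if not c.isalnum() and not c.isspace())
    if punct / len(text) > max_punct_ratio:
        return "junk_punctuation"  # hypothetical reason name
    return None
```

short UI phrases such as "OK" or "Cancel" fall through every rule and return `None`, so they are kept.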
━━ what we keep, untouched
- ✓ original document order
- ✓ all inline tags (ph, bpt, ept, mrk, g, it)
- ✓ TMX header + namespace declarations
- ✓ multi-language TMs (any source / target pairing)
- ✓ short UI phrases (1–2 words like “OK”, “Cancel”), configurable
━━ under the hood
streaming engine. the cleaner uses a two-pass iterparse design. memory is O(unique segments), not O(file size) — so a 2 GB TMX cleans on a small VPS without swapping.
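a two-pass streaming dedup of this kind might look like the sketch below. it is a simplified illustration, not the production engine. assumptions: un-namespaced TMX tags, the first `<tuv>` in each `<tu>` is the source, "normalized" means collapsed whitespace, and "best version" compares content length, then inline-tag count, then `changedate`. header copying is left out for brevity, and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

def iter_tus(path):
    """Stream <tu> elements one at a time instead of building the whole tree."""
    body = None
    index = 0
    for event, elem in ET.iterparse(path, events=("start", "end")):
        if event == "start" and elem.tag == "body":
            body = elem
        elif event == "end" and elem.tag == "tu":
            yield index, elem
            index += 1
            if body is not None:
                body.clear()  # detach processed <tu> nodes so memory stays flat

def seg_text(tuv):
    seg = tuv.find("seg")
    return "" if seg is None else "".join(seg.itertext())

def normalize(text):
    # Assumption: normalization = trim + collapse runs of whitespace.
    return " ".join(text.split())

def score(tu):
    tuvs = tu.findall("tuv")
    content = sum(len(normalize(seg_text(t))) for t in tuvs)
    inline_tags = sum(len(seg) for t in tuvs for seg in t.findall("seg"))
    # TMX dates (YYYYMMDDThhmmssZ) sort correctly as plain strings.
    return (content, inline_tags, tu.get("changedate", ""))

def pick_winners(path):
    """Pass 1: remember only the best (score, index) per normalized source."""
    best = {}
    for index, tu in iter_tus(path):
        tuvs = tu.findall("tuv")
        if not tuvs:
            continue
        key = normalize(seg_text(tuvs[0]))
        candidate = (score(tu), index)
        if key not in best or candidate[0] > best[key][0]:
            best[key] = candidate
    return {index for _, index in best.values()}

def clean(path, out_path):
    """Pass 2: stream again and write only the winning units, in original order."""
    keep = pick_winners(path)
    with open(out_path, "w", encoding="utf-8") as out:
        out.write('<?xml version="1.0" encoding="UTF-8"?>\n<tmx version="1.4"><body>\n')
        for index, tu in iter_tus(path):
            if index in keep:
                out.write(ET.tostring(tu, encoding="unicode"))
        out.write("</body></tmx>\n")
```

the only structure that grows is the `best` dictionary, one entry per distinct normalized source, which is where the O(unique segments) bound comes from. on a tie the earlier unit wins, which keeps the original document order stable.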
measured benchmark. on a synthetic 1,000,000-segment TMX (255 MB, 30% duplicates): 108 seconds wall-clock, 373 MB peak RAM, 35% size reduction. processing time scales linearly with input size.
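for a rough sense of scale, the benchmark figures translate into throughput as follows. the extrapolation is a back-of-envelope estimate that assumes the linear scaling holds and that a larger file resembles the synthetic one; it is not a measured result.

```python
# Figures from the benchmark on this page.
SEGMENTS = 1_000_000
SIZE_MB = 255
SECONDS = 108

seg_per_s = SEGMENTS / SECONDS   # roughly 9,259 segments per second
mb_per_s = SIZE_MB / SECONDS     # roughly 2.36 MB per second

def estimated_seconds(input_mb: float) -> float:
    """Linear extrapolation from the single measured data point."""
    return input_mb / mb_per_s

# A 2 GB upload (taken as 2048 MB) would land near 867 seconds, about 14.5 minutes.
two_gb_estimate = estimated_seconds(2048)
```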
full audit trail. every removed or displaced segment is logged in the CSV with its reason (empty_source, junk_short, duplicate_*, replaced_*) and preserved in a real TMX you can re-import.
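an audit log along these lines could be written with the `csv` module. the column layout and the `Removal` record are assumptions made for illustration; only the reason names shown come from this page.

```python
import csv
from dataclasses import dataclass
from typing import Iterable, TextIO

@dataclass
class Removal:
    index: int    # position of the unit in the original file (hypothetical column)
    reason: str   # e.g. "empty_source" or "junk_short"
    source: str
    target: str

def write_audit(rows: Iterable[Removal], out: TextIO) -> None:
    """Write one CSV row per removed segment, with a header line first."""
    writer = csv.writer(out)
    writer.writerow(["index", "reason", "source", "target"])
    for row in rows:
        writer.writerow([row.index, row.reason, row.source, row.target])
```

when writing to a real file, open it with `newline=""` so the `csv` module controls line endings.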
━━ privacy & security
- ● strict-purge of original uploads · the source file is hard-deleted the moment the clean finishes; outputs auto-expire 24h later
- ● signed URLs for both upload and download · no public object access
- ● XXE-safe XML parsing · uploads go through defusedxml before the engine touches them
- ● no password storage · magic-link auth means there's nothing to leak
- ● strict transport security · HSTS, CSP, X-Frame-Options=DENY on every response
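to show the idea behind signed URLs, here is a generic HMAC-based sketch. it is not the scheme of any particular object store (those ship their own presigning), and the parameter names `expires` and `sig` are assumptions. it enforces expiry only; making a link truly one-time would also need server-side state.

```python
import hashlib
import hmac
import time
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlparse

def _signature(base_url: str, expires: int, key: bytes) -> str:
    payload = f"{base_url}\n{expires}".encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()

def sign_url(base_url: str, key: bytes, ttl_seconds: int,
             now: Optional[float] = None) -> str:
    """Append an expiry timestamp and an HMAC over (url, expiry)."""
    current = time.time() if now is None else now
    expires = int(current + ttl_seconds)
    query = urlencode({"expires": expires, "sig": _signature(base_url, expires, key)})
    return f"{base_url}?{query}"

def verify_url(url: str, key: bytes, now: Optional[float] = None) -> bool:
    """Accept the URL only if the signature matches and it has not expired."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    try:
        expires = int(params["expires"][0])
        sig = params["sig"][0]
    except (KeyError, IndexError, ValueError):
        return False
    base_url = f"{parsed.scheme}://{parsed.netloc}{parsed.path}"
    expected = _signature(base_url, expires, key)
    current = time.time() if now is None else now
    # compare_digest avoids leaking how many leading characters matched
    return hmac.compare_digest(sig.encode(), expected.encode()) and current <= expires
```

because the expiry is part of the signed payload, a client cannot extend a link by editing the `expires` parameter.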