how we handle your data.
a short, plain-language reference for security teams and language-tech leads doing procurement diligence. user-facing walkthrough is at /how-it-works; full legal text is at /privacy and /terms.
━━ how the cleaner runs
streams your file rather than loading it into memory all at once. comfortably handles multi-gigabyte translation memories on modest hardware.
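a minimal sketch of what streaming a translation memory can look like, assuming a TMX-style layout (`<tu>` units containing `<seg>` text) and Python's standard-library `iterparse`. the function name `iter_translation_units` is hypothetical and this is not the production parser; it only illustrates why memory use stays roughly flat as the file grows.

```python
import io
import xml.etree.ElementTree as ET


def iter_translation_units(stream):
    """Yield (source, target) text pairs one <tu> at a time.

    Assumes the first two <seg> elements in each <tu> are source and target.
    """
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "tu":
            segs = [seg.text or "" for seg in elem.iter("seg")]
            if len(segs) >= 2:
                yield segs[0], segs[1]
            # Drop the finished unit's children so they can be garbage-collected.
            # The emptied <tu> element itself stays attached to its parent.
            elem.clear()


# Illustrative input only; a real file would be opened in binary mode instead.
sample = io.BytesIO(
    b"<tmx><body>"
    b"<tu><tuv><seg>Hello</seg></tuv><tuv><seg>Bonjour</seg></tuv></tu>"
    b"<tu><tuv><seg>Bye</seg></tuv><tuv><seg>Au revoir</seg></tuv></tu>"
    b"</body></tmx>"
)
for source, target in iter_translation_units(sample):
    print(source, "->", target)
```

because each unit is handed to the caller and then cleared, only one unit's text needs to be held at a time.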
XML parsing is XXE-safe at every entry point. external entity resolution is structurally disabled — there is no fallback path that re-enables it.
deterministic output — a clean run on the same input with the same options produces a byte-identical result. removed entries are preserved alongside kept ones in a separate audit file so you can verify nothing important was cut.
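one way to check the byte-identical claim yourself is to hash the outputs of two runs and compare the digests. the sketch below is illustrative: `sha256_of` is a hypothetical helper, and the two in-memory streams stand in for output files from separate runs.

```python
import hashlib
import io


def sha256_of(stream, chunk_size=1 << 20):
    """Hash a binary stream in chunks so large outputs never sit fully in memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


# Stand-ins for the cleaned output of two runs with the same input and options.
run_a = io.BytesIO(b"<tmx>...</tmx>")
run_b = io.BytesIO(b"<tmx>...</tmx>")
print(sha256_of(run_a) == sha256_of(run_b))  # True when the bytes are identical
```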
script-aware comparisons — length-outlier and short-phrase rules use NFC code-point counts so combining-mark scripts (Devanagari, Thai, Khmer) and no-space scripts (CJK, Tibetan) are treated fairly. a legitimate Hindi or Japanese translation is not penalised for looking longer or shorter than its English source.
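to make the NFC point concrete, here is a small sketch using Python's `unicodedata`. the helper names and the `low` / `high` thresholds are hypothetical placeholders, not the engine's actual values, and any per-script tuning is deliberately left out.

```python
import unicodedata


def nfc_length(text: str) -> int:
    """Count code points after NFC normalisation."""
    return len(unicodedata.normalize("NFC", text))


def is_length_outlier(source: str, target: str, low: float, high: float) -> bool:
    """Flag a pair whose target/source length ratio falls outside [low, high].

    low and high are caller-supplied; the values used below are illustrative.
    """
    src_len = nfc_length(source)
    if src_len == 0:
        return False  # empty sources are a different classifier's job
    ratio = nfc_length(target) / src_len
    return ratio < low or ratio > high


# The same word typed with a combining accent or a precomposed letter
# measures the same after normalisation.
decomposed = "cafe\u0301"   # 'e' followed by COMBINING ACUTE ACCENT
precomposed = "caf\u00e9"   # single precomposed letter
print(len(decomposed), len(precomposed))                # 5 4
print(nfc_length(decomposed), nfc_length(precomposed))  # 4 4
```

counting after normalisation means the result does not depend on how the text happened to be typed or exported.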
━━ what the cleaner catches
nine classifiers, individually toggleable in the rules panel and bundled by the three named presets:
- ●duplicates — by normalised source; winner picked by Popular / Latest / Longest, by your priority-author ranking, or by manual per-group override
- ●structural junk — URLs, emails, phone numbers, pure numeric / punctuation / tags-only segments, suspect markup from CAT round-trips
- ●untranslated entries — target equals source after case + whitespace + NFC normalisation; respects the "keep short UI labels" toggle for legitimately preserved phrases
- ●empty target with non-empty source — common leakage from CAT exports of in-progress projects
- ●MT-flagged TUs — creationid / changeid contains an MT engine signature. short patterns get word-boundary semantics so SMTP / GMT / format-tool changeids don't false-positive
- ●length outliers in both directions — target much longer or much shorter than source. NFC-normalised so Indic / Thai / CJK targets don't false-positive
- ●placeholder / inline-tag mismatch — {0} / %s / <ph/> counts differ between source and target. recognises Java, .NET, printf, Mustache, Liquid, Shell, XLIFF inline tags
- ●mojibake — encoding corruption (UTF-8 bytes decoded as cp1252, so café turns into cafÃ©) detected via ftfy round-trip
- ●brand / DNT terms translated — operator supplies a list (iPhone, HIPAA, Acme Corp …) and any TU where a term appears in source but not target gets flagged. word-boundary match, NFC-normalised
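three of the checks above can be sketched in a few lines each. this is an illustration under stated assumptions, not the engine's code: the function names, the `MT_SIGNATURES` list, the `short_len` cut-off and the placeholder regex are all hypothetical, and the regex covers only a subset of the listed formats ({0}, printf, Mustache and `<ph/>`).

```python
import re
import unicodedata
from collections import Counter


def normalise(text: str) -> str:
    """NFC-normalise, collapse whitespace runs, and case-fold."""
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.split()).casefold()


def is_untranslated(source: str, target: str) -> bool:
    """Target equals source once case, whitespace and NFC form are ignored."""
    return bool(source.strip()) and normalise(source) == normalise(target)


# Hypothetical signature list; the real list is not published on this page.
MT_SIGNATURES = ("mt", "deepl", "google")


def is_mt_flagged(author_id: str, short_len: int = 3) -> bool:
    """Match MT signatures in a creationid / changeid value."""
    ident = author_id.casefold()
    for sig in MT_SIGNATURES:
        if len(sig) <= short_len:
            # Short patterns must not sit inside a longer run of letters or
            # digits, so "SMTP" and "GMT" do not match "mt".
            pattern = rf"(?<![a-z0-9]){re.escape(sig)}(?![a-z0-9])"
            if re.search(pattern, ident):
                return True
        elif sig in ident:
            return True
    return False


# Subset of placeholder syntaxes: {0}, %s / %d / %1$s, {{name}}, <ph .../>
PLACEHOLDER_RE = re.compile(
    r"\{\d+\}|%(?:\d+\$)?[sdif]|\{\{\s*\w+\s*\}\}|<ph\b[^>]*/>"
)


def has_placeholder_mismatch(source: str, target: str) -> bool:
    """Compare per-token placeholder counts between source and target."""
    src = Counter(PLACEHOLDER_RE.findall(source))
    tgt = Counter(PLACEHOLDER_RE.findall(target))
    return src != tgt
```

for example, `is_mt_flagged("SMTP-relay")` is false while `is_mt_flagged("MT!")` is true, and `has_placeholder_mismatch("Hello {0}, you have %d items", "Bonjour {0}")` is true because the `%d` is missing from the target.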
━━ engine hardening
we run repeated independent code reviews — each round audits the engine end-to-end, fixes findings, then re-audits with a fresh reviewer who has no memory of prior passes. the most recent campaign closed 9 HIGH + 23 MED issues across five rounds before reaching convergence (zero new findings on the final pass).
convergence is a quality bar: if a fresh reviewer with no context finds nothing new, the surface is stable. we re-run the campaign whenever significant engine logic ships.
━━ security posture
- ●strict-purge of uploads — your source file is hard-deleted the moment cleaning finishes successfully. cleaned outputs auto-expire 24 hours later, and you can wipe them earlier from the dashboard.
- ●signed URLs only — no public-read access. uploads use one-time presigned PUTs; downloads use short-TTL presigned GETs.
- ●XXE-safe XML parsing throughout. external entity expansion + billion-laughs attacks are blocked at the parser level with no fallback path.
- ●no password storage — authentication is via single-use email magic link. nothing to leak, nothing to rotate.
- ●strict transport security — HSTS (preload), Content-Security-Policy, frame-ancestors deny, X-Content-Type-Options, strict-origin referrer policy. all responses, no exceptions.
- ●anti-abuse — per-IP signup throttle plus email-alias normalisation prevent a single inbox from farming free quotas across many addresses.
- ●audit trail by design — every billable action emits a structured log line so a full reconstruction from logs is always possible.
- ●regular external security audits — multiple independent audit passes per quarter; findings tracked and remediated. summary available on request via /contact.
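the email-alias normalisation mentioned under anti-abuse can be pictured roughly as follows. this is a sketch only: `normalise_email` and `DOT_INSENSITIVE_DOMAINS` are hypothetical names, treating everything after "+" as an alias tag is an assumption that does not hold for every mail provider, and the actual rules are not described on this page.

```python
# Domains where dots in the local part are ignored by the provider (Gmail).
DOT_INSENSITIVE_DOMAINS = {"gmail.com", "googlemail.com"}


def normalise_email(address: str) -> str:
    """Map alias variants of an address onto one canonical key."""
    local, _, domain = address.strip().rpartition("@")
    if not local or not domain:
        raise ValueError(f"not an email address: {address!r}")
    domain = domain.lower()
    # Assumption: "+tag" suffixes are aliases of the same inbox.
    local = local.lower().split("+", 1)[0]
    if domain in DOT_INSENSITIVE_DOMAINS:
        local = local.replace(".", "")
    return f"{local}@{domain}"


print(normalise_email("Jane.Doe+trial2@Gmail.com"))  # janedoe@gmail.com
print(normalise_email("bob+x@example.org"))          # bob@example.org
```

quota checks would then key on the normalised form, while mail is still delivered to the address exactly as the user entered it.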
━━ where your data lives
EU jurisdiction. compute and object storage are in EU datacenters; transactional email is sent through an EU-region provider. no data is replicated outside the EU.
no third-party analytics or trackers. no Google Analytics, no Hotjar, no advertising pixels. the only cookie this site sets is a signed session id (HttpOnly, Secure, SameSite=Lax).
the full subprocessor list (compute provider, object storage, email, payments) and the per-field retention policy are enumerated in our privacy policy.
━━ asking us more
doing a security or procurement review and need details that aren't on this page (DPA, audit report, subprocessor change notifications, custom retention windows)? reach out via /contact. we typically respond within one business day.