the engine, in detail.
for engineers, security teams, and language-tech leads doing procurement diligence. the user-facing walkthrough lives at /how-it-works.
━━ streaming engine
The cleaner uses a two-pass lxml.etree.iterparse design. Pass 1 scans the entire file, clearing each element after its <tu> closes; only the dedup-key bookkeeping (one dict entry per UNIQUE normalized source) survives. Pass 2 streams the file again, with an xmlfile incremental writer emitting kept TUs to cleaned.tmx and removed TUs to removed.tmx in document order.
Memory is O(unique sources), not O(file size). A 2 GB TMX with 30% duplicates cleans on a small VPS in well under 1 GB of RSS.
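For orientation, a minimal sketch of the two-pass shape. normalize() is a placeholder, and the real worker adds TMX header handling, the audit log, and the defusedxml hardening described below:

```python
from lxml import etree

def normalize(tu):
    # Placeholder: the production normalizer is richer than this.
    seg = tu.find(".//tuv/seg")
    return (seg.text or "").strip().lower() if seg is not None else ""

def free(tu):
    # Release the finished subtree and any already-processed siblings,
    # keeping the parse at a small, constant working set.
    tu.clear()
    while tu.getprevious() is not None:
        del tu.getparent()[0]

def clean(src, kept_path, removed_path):
    # Pass 1: remember the index of the first occurrence of each
    # normalized source (one dict entry per unique source).
    first_seen, i = {}, 0
    for _, tu in etree.iterparse(src, tag="tu"):
        first_seen.setdefault(normalize(tu), i)
        i += 1
        free(tu)

    # Pass 2: stream again, routing each TU to the kept or removed
    # writer in document order.
    with etree.xmlfile(kept_path, encoding="utf-8") as kept, \
         etree.xmlfile(removed_path, encoding="utf-8") as removed, \
         kept.element("body"), removed.element("body"):
        i = 0
        for _, tu in etree.iterparse(src, tag="tu"):
            (kept if first_seen[normalize(tu)] == i else removed).write(tu)
            i += 1
            free(tu)
```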
XML parsing goes through defusedxml (XXE / billion-laughs safe) at every entry point, including the preview pass. There is NO non-defused fallback path; if defusedxml fails to import, the worker refuses to start.
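The guard can be as blunt as a fail-closed import at worker startup; a sketch:

```python
import sys

try:
    # XXE / billion-laughs safe drop-in for the stdlib parser.
    from defusedxml import ElementTree as SafeET  # noqa: F401
except ImportError:
    sys.exit("defusedxml is not importable · refusing to start the worker")
```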
━━ benchmarks
| scenario | size | segments | wall time | peak RSS | reduction |
|---|---|---|---|---|---|
| synthetic · 30% dup | 255 MB | 1,000,000 | 108 s | 373 MB | 35% |
| real LSP · 5 yr accumulated | 1.4 GB | 3,200,000 | ~9 min | ~600 MB | 52% |
benchmarks run on a Hetzner CX33 (4 vCPU, 8 GB RAM, NVMe). runtime is linear in input size · the bottleneck is XML serialisation, not the dedup map.
━━ removal-reason codes
every row in the removed.csv audit log carries one of these codes, so you can grep / filter / sort by removal reason after the fact.
| code | meaning |
|---|---|
| empty_source | no source text at all (sometimes a stray <tu> from a broken export) |
| junk_short | single character or whitespace-only · can never produce useful leverage |
| junk_pure_num | the source is just digits / version strings (12345 / 1.2.3) |
| junk_url | URL-only or email-only segment · translation isn't meaningful |
| junk_phone | phone-number-only segment |
| junk_pure_tag | segment is just an inline tag with no surrounding text |
| junk_repeated_sym | lines of dashes / equals signs / asterisks · formatting artefacts |
| junk_punct_heavy | >80% punctuation by character count |
| duplicate_* | this TU lost a dedup race · the kept variant's normalized_source is in the audit |
| junk_filtered_user | removed by your changeid (user) filter |
| junk_filtered_date | removed by your before/after-date filter |
| replaced_* | your manual variant pick · this loser was replaced by the variant you chose |
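A quick way to slice the log, assuming (check your own header row) columns named reason and normalized_source:

```python
import csv
from collections import Counter

with open("removed.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

# Removal counts per reason code.
print(Counter(row["reason"] for row in rows))

# Just the dedup losers, for spot-checking against the kept variants.
dupes = [r for r in rows if r["reason"].startswith("duplicate_")]
```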
━━ security posture
- ●strict-purge of original uploads — the source file is hard-deleted (S3 DELETE) the moment the clean finishes successfully · outputs auto-expire 24h later via R2 lifecycle policy
- ●signed URLs only · no public-read object access · uploads use one-time presigned PUTs, downloads use short-TTL presigned GETs
- ●XXE-safe XML parsing · defusedxml at every entry · no fallback path · external entity resolution is structurally impossible
- ●no password storage · magic-link auth (15-min single-use Redis tokens via GETDEL) · see the sketch after this list
- ●strict transport security · HSTS (1 year, preload), CSP (default-src 'self'), X-Frame-Options=DENY, X-Content-Type-Options=nosniff, Referrer-Policy=strict-origin-when-cross-origin
- ●per-IP signup throttle + plus-suffix email normalisation (user+tag@example.com collapses to user@example.com) · prevents one inbox from farming free quotas across many addresses
- ●audit-trail by design · every billable action emits a structured log line (job_id, user_id, bytes, reservation state) so a rebuild from logs is always possible
- ●7+ external security audit passes · DeepSeek-V3.2, DeepSeek-V4-Pro, Qwen3.5-397B; findings tracked + remediated · changelog available on request
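The magic-link bullet reduces to one atomic Redis call. A sketch of the token flow, assuming redis-py >= 4.2 (which exposes GETDEL); names are illustrative:

```python
import secrets
import redis

r = redis.Redis()

def issue_token(user_id: str) -> str:
    # 15-minute TTL; the token itself is never stored anywhere else.
    token = secrets.token_urlsafe(32)
    r.set(f"magic:{token}", user_id, ex=15 * 60)
    return token

def redeem_token(token: str) -> str | None:
    # GETDEL is atomic: the first redeem wins, any replay sees None.
    user_id = r.getdel(f"magic:{token}")
    return user_id.decode() if user_id else None
```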
━━ infrastructure
FastAPI + ARQ workers on a single Hetzner CX33 in Nuremberg (EU). Postgres 17 for user / history / billing state, Redis 7 for job runtime + reservations, Cloudflare R2 (EU jurisdiction) for object storage, Mailgun EU for transactional email, Stripe for billing.
No third-party trackers · no Google Analytics · no Hotjar · no Sentry by default (operator opt-in via env). The page you're reading right now sets one cookie: a signed session id, SameSite=Lax, HttpOnly, Secure.
questions? /contact