◇ under the hood

the engine, in detail.

for engineers, security teams, and language-tech leads doing procurement diligence. user-facing walkthrough lives at /how-it-works.

━━ streaming engine

The cleaner uses a two-pass lxml.etree.iterparse design. Pass 1 scans the entire file, clearing each element after its <tu> closes — only the dedup-key bookkeeping (one dict entry per UNIQUE normalized source) survives. Pass 2 streams the file again, with an xmlfile incremental writer emitting kept TUs to cleaned.tmx and removed TUs to removed.tmx in document order.
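The two-pass pattern can be sketched with the stdlib xml.etree.ElementTree analogue (the production worker uses lxml.etree.iterparse and the xmlfile incremental writer; `normalize` and the `<seg>` lookup below are illustrative assumptions, not the cleaner's actual rules):

```python
import xml.etree.ElementTree as ET

def normalize(text):
    # illustrative normalization: collapse whitespace, case-fold
    return " ".join((text or "").split()).casefold()

def source_key(tu):
    # assumes the first <seg> under a <tu> is the source-language segment
    seg = tu.find(".//seg")
    return normalize(seg.text if seg is not None else "")

def pass1_first_seen(path):
    """Pass 1: full scan, clearing each <tu> after its end event.
    Only the dedup bookkeeping survives: one dict entry per unique
    normalized source, mapping to the index of the TU that wins."""
    first, idx = {}, 0
    for _, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == "tu":
            first.setdefault(source_key(elem), idx)
            idx += 1
            elem.clear()  # drop the subtree so memory stays bounded
                          # (the lxml version also deletes cleared siblings)
    return first

def pass2_split(path, first):
    """Pass 2: stream again, routing each TU to kept or removed in
    document order (the real worker streams to cleaned.tmx and
    removed.tmx instead of collecting lists)."""
    kept, removed, idx = [], [], 0
    for _, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == "tu":
            out = kept if first[source_key(elem)] == idx else removed
            out.append(ET.tostring(elem, encoding="unicode"))
            idx += 1
            elem.clear()
    return kept, removed
```

Note the dedup race: setdefault means the first TU with a given normalized source wins, and every later TU with the same key is routed to removed.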

Memory is O(unique sources), not O(file size). A 2 GB TMX with 30% duplicates cleans on a small VPS in well under 1 GB of RSS.

XML parsing goes through defusedxml (XXE / billion-laughs safe) at every entry point, including the preview pass. There is NO non-defused fallback path; if defusedxml fails to import, the worker refuses to start.
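The refuse-to-start guard is simple to express. A minimal sketch (function name and error message are illustrative, not the worker's actual code):

```python
import importlib

def load_safe_xml(module="defusedxml.ElementTree"):
    """Return the defused parser module, or refuse to start.
    Deliberately no fallback to plain xml.etree / lxml parsing."""
    try:
        return importlib.import_module(module)
    except ImportError as exc:
        raise SystemExit("defusedxml not importable: refusing to start worker") from exc
```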

━━ benchmarks

scenario                      size     segments    wall    peak rss  reduction
synthetic · 30% dup           255 MB   1,000,000   108 s   373 MB    35%
real LSP · 5 yr accumulated   1.4 GB   3,200,000   ~9 min  ~600 MB   52%

benchmarks run on a Hetzner CX33 (4 vCPU, 8 GB RAM, NVMe). runtime scales linearly with input size · the bottleneck is XML serialisation, not the dedup map.

━━ removal-reason codes

every row in the removed.csv audit log carries one of these codes, so you can grep / filter / sort by removal reason after the fact.

empty_source          no source text at all (sometimes a stray <tu> from a broken export)
junk_short            single character or whitespace-only — can never produce useful leverage
junk_pure_num         the source is just digits / version strings (12345 / 1.2.3)
junk_url              URL-only or email-only segment — translation isn't meaningful
junk_phone            phone-number-only segment
junk_pure_tag         segment is just an inline tag with no surrounding text
junk_repeated_sym     lines of dashes / equals signs / asterisks — formatting artefacts
junk_punct_heavy      >80% punctuation by character count
duplicate_*           this TU lost a dedup race · the kept variant's normalized_source is in the audit
junk_filtered_user    removed by your changeid (user) filter
junk_filtered_date    removed by your before/after-date filter
replaced_*            your manual variant pick — this loser was replaced by the variant you chose
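a few of the junk heuristics above are easy to sketch. the thresholds and regexes here are illustrative guesses at the shape of the checks, not the cleaner's exact rules:

```python
import re
import string

# illustrative: digits plus common numeric punctuation (12345, 1.2.3, 1,000)
PURE_NUM = re.compile(r"[\d\s.,:/()+-]*\d[\d\s.,:/()+-]*")

def reason_code(source):
    """Return a removal-reason code for a source segment, or None to keep it."""
    text = source.strip()
    if not text:
        return "empty_source"
    if len(text) <= 1:
        return "junk_short"
    if PURE_NUM.fullmatch(text):
        return "junk_pure_num"
    if re.fullmatch(r"[-=*_~\s]{3,}", text):   # lines of dashes / equals / asterisks
        return "junk_repeated_sym"
    punct = sum(ch in string.punctuation for ch in text)
    if punct / len(text) > 0.8:                # >80% punctuation by character count
        return "junk_punct_heavy"
    return None
```

ordering matters: cheap, specific checks run before broad ratio checks, so a line of asterisks is tagged junk_repeated_sym rather than falling through to junk_punct_heavy.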

━━ security posture

━━ infrastructure

FastAPI + ARQ workers on a single Hetzner CX33 in Nuremberg (EU). Postgres 17 for user / history / billing state, Redis 7 for job runtime + reservations, Cloudflare R2 (EU jurisdiction) for object storage, Mailgun EU for transactional email, Stripe for billing.

No third-party trackers · no Google Analytics · no Hotjar · no Sentry by default (operator opt-in via env). The page you're reading right now sets one cookie: a signed session id, SameSite=Lax, HttpOnly, Secure.
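a minimal sketch of what a signed session id looks like — the signing scheme below is an assumption for illustration, not the published implementation; the cookie flags themselves map directly onto FastAPI/Starlette's response.set_cookie parameters:

```python
import hashlib
import hmac
import secrets

# assumed: a server-side secret; in practice this comes from config, not per-process
SECRET = secrets.token_bytes(32)

def sign_session(session_id):
    """Append an HMAC-SHA256 tag so the cookie value is tamper-evident."""
    mac = hmac.new(SECRET, session_id.encode(), hashlib.sha256).hexdigest()
    return f"{session_id}.{mac}"

def verify_session(value):
    """Return the session id if the tag checks out, else None."""
    session_id, _, mac = value.rpartition(".")
    expected = hmac.new(SECRET, session_id.encode(), hashlib.sha256).hexdigest()
    return session_id if hmac.compare_digest(mac, expected) else None

# the response side would then set the flags named above, e.g.:
# response.set_cookie("session", sign_session(sid),
#                     samesite="lax", httponly=True, secure=True)
```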