the engine, in detail.
for engineers, security teams, and language-tech leads doing procurement diligence. the user-facing walkthrough lives at /how-it-works.
━━ streaming engine
The cleaner uses a two-pass lxml.etree.iterparse design. Pass 1 scans the entire file, clearing each element after its <tu> closes; only the dedup-key bookkeeping (one dict entry per UNIQUE normalized source) survives. Pass 2 streams the file again, with an xmlfile incremental writer emitting kept TUs to cleaned.tmx and removed TUs to removed.tmx in document order.
Memory is O(unique sources), not O(file size). A 2 GB TMX with 30% duplicates cleans on a small VPS in well under 1 GB of RSS.
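For orientation, a minimal sketch of the two-pass shape. normalize() is a placeholder, and the real worker adds TMX header handling, the audit log, and the defusedxml hardening described below:

```python
from lxml import etree

def normalize(tu):
    # Placeholder: the production normalizer is richer than this.
    seg = tu.find(".//tuv/seg")
    return (seg.text or "").strip().lower() if seg is not None else ""

def free(tu):
    # Release the finished subtree and any already-processed siblings,
    # keeping the parse at a small, constant working set.
    tu.clear()
    while tu.getprevious() is not None:
        del tu.getparent()[0]

def clean(src, kept_path, removed_path):
    # Pass 1: remember the index of the first occurrence of each
    # normalized source (one dict entry per unique source).
    first_seen, i = {}, 0
    for _, tu in etree.iterparse(src, tag="tu"):
        first_seen.setdefault(normalize(tu), i)
        i += 1
        free(tu)

    # Pass 2: stream again, routing each TU to the kept or removed
    # writer in document order.
    with etree.xmlfile(kept_path, encoding="utf-8") as kept, \
         etree.xmlfile(removed_path, encoding="utf-8") as removed, \
         kept.element("body"), removed.element("body"):
        i = 0
        for _, tu in etree.iterparse(src, tag="tu"):
            (kept if first_seen[normalize(tu)] == i else removed).write(tu)
            i += 1
            free(tu)
```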
XML parsing goes through defusedxml (XXE / billion-laughs safe) at every entry point, including the preview pass. There is NO non-defused fallback path; if defusedxml fails to import, the worker refuses to start.
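The guard can be as blunt as a fail-closed import at worker startup; a sketch:

```python
import sys

try:
    # XXE / billion-laughs safe drop-in for the stdlib parser.
    from defusedxml import ElementTree as SafeET  # noqa: F401
except ImportError:
    sys.exit("defusedxml is not importable · refusing to start the worker")
```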
━━ benchmarks
| scenario | size | segments | wall time | peak RSS | reduction |
|---|---|---|---|---|---|
| synthetic · 30% dup | 255 MB | 1,000,000 | 108 s | 373 MB | 35% |
| real LSP · 5 yr accumulated | 1.4 GB | 3,200,000 | ~9 min | ~600 MB | 52% |
benchmarks run on a Hetzner CX33 (4 vCPU, 8 GB RAM, NVMe). runtime is linear in input size · the bottleneck is XML serialisation, not the dedup map.
━━ removal-reason codes
every row in the removed.csv audit log carries one of these codes, so you can grep / filter / sort by removal reason after the fact.
| code | meaning |
|---|---|
| empty_source | no source text at all (sometimes a stray <tu> from a broken export) |
| junk_short | single character or whitespace-only · can never produce useful leverage |
| junk_pure_num | the source is just digits / version strings (12345 / 1.2.3) |
| junk_url | URL-only or email-only segment · translation isn't meaningful |
| junk_phone | phone-number-only segment |
| junk_pure_tag | segment is just an inline tag with no surrounding text |
| junk_repeated_sym | lines of dashes / equals signs / asterisks · formatting artefacts |
| junk_punct_heavy | >80% punctuation by character count |
| duplicate_* | this TU lost a dedup race · the kept variant's normalized_source is in the audit |
| junk_filtered_user | removed by your changeid (user) filter |
| junk_filtered_date | removed by your before/after-date filter |
| replaced_* | your manual variant pick · this loser was replaced by the variant you chose |
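A quick way to slice the log, assuming (check your own header row) columns named reason and normalized_source:

```python
import csv
from collections import Counter

with open("removed.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

# Removal counts per reason code.
print(Counter(row["reason"] for row in rows))

# Just the dedup losers, for spot-checking against the kept variants.
dupes = [r for r in rows if r["reason"].startswith("duplicate_")]
```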
━━ security posture
- ●strict-purge of original uploads — the source file is hard-deleted (S3 DELETE) the moment the clean finishes successfully · outputs auto-expire 24h later via R2 lifecycle policy
- ●signed URLs only · no public-read object access · uploads use one-time presigned PUTs, downloads use short-TTL presigned GETs
- ●XXE-safe XML parsing · defusedxml at every entry · no fallback path · external entity resolution is structurally impossible
- ●no password storage · magic-link auth (15-min single-use Redis tokens via GETDEL) · see the sketch after this list
- ●strict transport security · HSTS (1 year, preload), CSP (default-src 'self'), X-Frame-Options=DENY, X-Content-Type-Options=nosniff, Referrer-Policy=strict-origin-when-cross-origin
- ●per-IP signup throttle + plus-suffix email normalisation (user+tag@example.com collapses to user@example.com) · prevents one inbox from farming free quotas across many addresses
- ●audit-trail by design · every billable action emits a structured log line (job_id, user_id, bytes, reservation state) so a rebuild from logs is always possible
- ●7+ external security audit passes · DeepSeek-V3.2, DeepSeek-V4-Pro, Qwen3.5-397B; findings tracked + remediated · changelog available on request
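The magic-link bullet reduces to one atomic Redis call. A sketch of the token flow, assuming redis-py >= 4.2 (which exposes GETDEL); names are illustrative:

```python
import secrets
import redis

r = redis.Redis()

def issue_token(user_id: str) -> str:
    # 15-minute TTL; the token itself is never stored anywhere else.
    token = secrets.token_urlsafe(32)
    r.set(f"magic:{token}", user_id, ex=15 * 60)
    return token

def redeem_token(token: str) -> str | None:
    # GETDEL is atomic: the first redeem wins, any replay sees None.
    user_id = r.getdel(f"magic:{token}")
    return user_id.decode() if user_id else None
```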
━━ infrastructure
FastAPI + ARQ workers on a single Hetzner CX33 in Nuremberg (EU). Postgres 17 for user / history / billing state, Redis 7 for job runtime + reservations, Cloudflare R2 (EU jurisdiction) for object storage, Mailgun EU for transactional email, Stripe for billing.
No third-party trackers · no Google Analytics · no Hotjar · no Sentry by default (operator opt-in via env). The page you're reading right now sets one cookie: a signed session id, SameSite=Lax, HttpOnly, Secure.
questions? /contact