BLAST XML parser — incremental walk via iterparse

Motivation

Previously, parse_blast_xml called ET.fromstring(content) and walked the full DOM to extract per-HSP rows. At the 20 MiB XML cap that route handlers enforce, the resident DOM grew to ~100-200 MiB while the function ran, and multiple concurrent analytics calls added up to GB-scale spikes in worker RSS.

User-facing change

None. Same row schema, same field coercions, same namespace handling (verified by test_parse_blast_xml_namespaced).

API / IaC diff

  • api/services/blast_results_parser.py
  • parse_blast_xml rewritten as a defusedxml.ElementTree.iterparse state machine that walks start/end events, captures Iteration-level query metadata, and fans out one row per <Hsp>. It calls elem.clear() on each <Hit> and <Iteration> subtree as it closes, so the parser's resident DOM is bounded by one Hit subtree.
  • New private helper _build_hit_row(...) keeps the per-HSP row construction unchanged but factored out so the walker stays readable.
  • No new dependency; defusedxml.ElementTree.iterparse is part of the existing defusedxml pin.
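
For reviewers unfamiliar with the pattern, here is a minimal sketch of the start/end walk described above. It is not the code in this PR: iter_hsp_rows, _local, _child_text and _to_float are hypothetical names, the row carries only four illustrative fields, and it uses the stdlib xml.etree.ElementTree.iterparse as a stand-in so the snippet runs without extra packages (the PR itself uses defusedxml.ElementTree.iterparse).

```python
import io
import xml.etree.ElementTree as ET  # stand-in; the PR uses defusedxml.ElementTree


def _local(tag):
    """Drop a '{namespace}' prefix so namespaced and plain XML match."""
    return tag.rsplit("}", 1)[-1]


def _child_text(elem, name):
    """Return the text of the first direct child with this local name."""
    for child in elem:
        if _local(child.tag) == name:
            return child.text
    return None


def _to_float(text):
    """Coerce to float, passing through a missing value as None."""
    return float(text) if text is not None else None


def iter_hsp_rows(content):
    """Yield one dict per <Hsp> (hypothetical, simplified row schema)."""
    query_id = None
    hit_id = None
    events = ET.iterparse(io.BytesIO(content), events=("start", "end"))
    for event, elem in events:
        name = _local(elem.tag)
        if event == "start":
            # Reset per-scope state when a new Iteration or Hit opens.
            if name == "Iteration":
                query_id = None
            elif name == "Hit":
                hit_id = None
            continue
        # "end" events: the element and its children are fully parsed.
        if name == "Iteration_query-ID":
            query_id = elem.text
        elif name == "Hit_id":
            hit_id = elem.text
        elif name == "Hsp":
            yield {
                "query_id": query_id,
                "hit_id": hit_id,
                "bit_score": _to_float(_child_text(elem, "Hsp_bit-score")),
                "evalue": _to_float(_child_text(elem, "Hsp_evalue")),
            }
        elif name in ("Hit", "Iteration"):
            # Release the closed subtree so memory stays bounded.
            elem.clear()
```

The sketch relies on the identifier elements (Iteration_query-ID, Hit_id) closing before the <Hsp> elements they describe, so each row can be emitted as soon as its <Hsp> closes. Emitting rows before elem.clear() runs on the enclosing <Hit> matters, because clear() discards the children the row is built from.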

Validation

  • uv run pytest -q api/tests/test_blast_results_parser.py api/tests/test_blast_results_routes.py — 47 passed (XML, namespaced XML, route export and aggregate paths all green).
  • uv run ruff check api/services/blast_results_parser.py — clean.