BLAST XML parser: incremental walk via iterparse
Motivation
Previously, parse_blast_xml called ET.fromstring(content) and walked the full DOM
to extract per-HSP rows. For a payload at the 20 MiB XML cap that route handlers
enforce, the resident DOM grew to roughly 100-200 MiB while the function ran;
several concurrent analytics calls added up to GB-scale spikes in worker RSS.
User-facing change
None. Same row schema, same field coercions, same namespace handling
(verified by test_parse_blast_xml_namespaced).
API / IaC diff
- api/services/blast_results_parser.py: parse_blast_xml rewritten as a defusedxml.iterparse state machine that walks start/end events, captures Iteration-level query metadata, fans out one row per <Hsp>, and elem.clear()-s each <Hit> and <Iteration> subtree as it closes, so the parser's resident DOM is bounded by one Hit subtree.
- New private helper _build_hit_row(...) keeps the per-HSP row construction unchanged but factored out so the walker stays readable.
- No new dependency; defusedxml.ElementTree.iterparse is part of the existing defusedxml pin.
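
The walk described above can be illustrated with a minimal sketch. It is not the code in blast_results_parser.py: the function name iter_hsp_rows, the helpers _local and _children, and the output keys (query_id, hit_id, evalue, bit_score) are hypothetical, and the real row schema has more fields. The sketch assumes the standard BLAST XML element names (Iteration_query-ID, Hit_id, Hsp_evalue, Hsp_bit-score) and falls back to the stdlib iterparse only so that it runs where defusedxml is not installed.

```python
import io

try:
    # The change described here uses defusedxml's iterparse, which is called
    # the same way as the stdlib one.
    from defusedxml.ElementTree import iterparse
except ImportError:
    # Fallback so the sketch runs anywhere; do not use on untrusted input.
    from xml.etree.ElementTree import iterparse


def _local(tag):
    """Strip a '{namespace}' prefix from a tag, if present."""
    return tag.rsplit("}", 1)[-1]


def _children(elem):
    """Map direct children by local tag name to their stripped text."""
    return {_local(c.tag): (c.text or "").strip() for c in elem}


def iter_hsp_rows(content: bytes):
    """Yield one dict per <Hsp>, keeping at most one <Hit> subtree resident."""
    query = {}   # Iteration-level metadata for the current query
    hit = None   # the <Hit> element currently being filled in
    for event, elem in iterparse(io.BytesIO(content), events=("start", "end")):
        tag = _local(elem.tag)
        if event == "start":
            if tag == "Iteration":
                query = {}
            elif tag == "Hit":
                hit = elem
            continue
        # From here on we are handling "end" events only.
        if tag in ("Iteration_query-ID", "Iteration_query-def"):
            query[tag] = (elem.text or "").strip()
        elif tag == "Hsp" and hit is not None:
            # Hit_id and friends closed before <Hit_hsps> opened, so they
            # are already complete on the partially built <Hit>.
            hit_fields = _children(hit)
            hsp = _children(elem)
            yield {
                "query_id": query.get("Iteration_query-ID"),
                "hit_id": hit_fields.get("Hit_id"),
                "evalue": float(hsp["Hsp_evalue"]),
                "bit_score": float(hsp["Hsp_bit-score"]),
            }
        elif tag == "Hit":
            elem.clear()  # drop the finished Hit subtree
            hit = None
        elif tag == "Iteration":
            elem.clear()  # drop the emptied Hit shells and query metadata
```

Because rows are yielded as each <Hsp> closes and each <Hit> is cleared when it closes, the largest structure held at any point is a single Hit with its HSPs, rather than the whole document.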
Validation
- uv run pytest -q api/tests/test_blast_results_parser.py api/tests/test_blast_results_routes.py: 47 passed (XML, namespaced XML, route export and aggregate paths all green).
- uv run ruff check api/services/blast_results_parser.py: clean.