Split-parent XML merge — streaming rewrite¶
Motivation¶
_build_parent_split_xml_result_bytes materialized the entire merged BLAST
XML tree in worker memory before gzip-compressing it:
gzip.decompress(...)of each child's full gzipped XMLET.fromstring(...)of the decompressed payload (full DOM)copy.deepcopy(child_root)for the base +copy.deepcopy(iteration)per child Iteration element accumulated under one growingbase_iterationsET.ElementTree(base_root).write(BytesIO)thengzip.compress(...)on the final bytes
Worst case (100-shard × 200 MB per child) = >20 GB resident in one worker
just to merge a single split parent. Defeats the careful chunk-based
stream_blob_bytes upload path that already existed for the tabular case.
The companion helper _read_split_child_merged_result_bytes defeated the
streaming download with a b"".join(stream_blob_bytes(...)).
User-facing change¶
None. Same merged BLAST XML is produced (same BlastOutput_program,
BlastOutput_version, BlastOutput_iterations content, same renumbered
Iteration_iter-num). Worker RSS stays bounded by one Iteration element +
gzip window regardless of child count or per-child file size.
API / IaC diff¶
- New
_iter_parent_split_xml_chunks(...)generator inapi/tasks/blast/split_pipeline.py: - Streams each child gzip blob through
_GeneratorByteReader(stream_blob_bytes→gzip.GzipFile(fileobj=…)→ET.iterparse). - Saves only the first child's
BlastOutput_*metadata tags, emits a rebuilt header withBlastOutput_dboverridden once. - For each
<Iteration>end event: renumber, serialize, feed tozlib.compressobj(wbits=16+MAX_WBITS), yield the compressed bytes,.remove()the element from its parent and.clear()it so the iterparse tree never accumulates. - Caller (
_write_split_parent_result_artifacts) passes the generator directly toupload_blob_bytes— no[buffered_bytes]wrap. - Legacy
_build_parent_split_xml_result_bytes(...)retained as a thin shim that materializes the streaming generator. Kept solely for the existing test fixtures andapi.tasks.blast.__init__re-export contract so external monkeypatch sites do not break. _GeneratorByteReaderadded — minimal read-only binary file-like over abytesiterator. Used by the streaming merge to chainstream_blob_bytes → GzipFile → iterparsewithout materializing the decompressed payload.__all__export andapi/tasks/blast/__init__.pyre-export updated to surface_iter_parent_split_xml_chunks.
Validation¶
uv run pytest -q api/tests/test_blast_tasks.py— 120 passed (XML merge testtest_write_split_parent_result_artifacts_merges_child_xmlexercises the new path; output schema unchanged).uv run ruff check api/tasks/blast/split_pipeline.py api/tasks/blast/__init__.py— clean.