BLAST Run Progress and SSD Staging
Motivation
Run details could appear idle during early submit phases, then mark several legacy steps complete at once. The Submit Job step also hid the long node-local SSD staging work performed by ElasticBLAST, making the run look stuck.
User-Facing Change
- Run details now uses Container Apps / AKS-oriented phases instead of legacy VM/storage/upload phases.
- Earlier running steps are marked completed as soon as the orchestrator advances to the next step.
- Node-local DB staging is represented as a dedicated `staging_db` progress step when submit requires warmed local SSD shards.
- The warmup and submit-time shard init scripts skip already warmed node-local DB files via `.download-complete`, reducing repeat submit-time staging after warmup.
- Warm cache validation now matches the prepared `core_nt` layout: `taxdb.btd`/`taxdb.bti`, non-partial `.nsq` files, source-version markers, and stale `.azDownload-*` cleanup.
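The step-completion behavior described in this list can be illustrated with a minimal Python sketch. It is not the backend's actual merge code: the function names, the status strings, and every entry in `CANONICAL_ORDER` other than `staging_db` and `submitting` are hypothetical and chosen only for illustration.

```python
from typing import Dict

# Hypothetical canonical order. Only "staging_db" and "submitting" are step
# keys named in this document; the other entries are placeholders.
CANONICAL_ORDER = ["preparing", "staging_db", "submitting", "running_blast", "exporting"]


def advance_to(steps: Dict[str, str], active: str) -> Dict[str, str]:
    """Return a copy of `steps` with `active` running and every earlier
    step that was still running marked completed."""
    if active not in CANONICAL_ORDER:
        raise ValueError(f"unknown progress step: {active}")
    merged = dict(steps)
    for key in CANONICAL_ORDER[: CANONICAL_ORDER.index(active)]:
        # Only close steps that were left running; keep failed/pending as-is.
        if merged.get(key) == "running":
            merged[key] = "completed"
    merged[active] = "running"
    return merged


def close_running_steps(steps: Dict[str, str]) -> Dict[str, str]:
    """Close every running step once the overall job is observed as completed."""
    return {
        key: ("completed" if status == "running" else status)
        for key, status in steps.items()
    }
```

Under these assumptions, `advance_to({"staging_db": "running"}, "submitting")` closes `staging_db` the moment `submitting` starts, and `close_running_steps` models a refresh path that finds the job already finished.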
API / IaC Diff Summary
- Backend progress payload merging now follows a canonical BLAST progress order.
- The `/api/blast/jobs/{job_id}` K8s refresh path now uses the same canonical progress order when it observes that a running job has completed, so previous running steps such as `submitting` are closed instead of lingering after the overall job reaches `completed`.
- BLAST submit no longer attributes the whole `elastic-blast submit` stream to `staging_db`. Warm-cache preparation can mark `staging_db` first, but live submit output advances to `submitting` so the UI does not look stuck while ElasticBLAST waits for Kubernetes work.
- Warmup ConfigMap script generation now makes `init-db-shard-aks.sh` idempotent for direct ElasticBLAST init-SSD calls. The script changes into `/blast/blastdb` internally, accepts existing zero-byte completion markers, writes non-empty markers for new downloads, removes stale azcopy partial files, and validates the source generation when `ELB_DB_SOURCE_VERSION` is present.
- Warmup now uses a dedicated `elb-warmup-scripts` ConfigMap instead of overwriting ElasticBLAST's `elb-scripts` ConfigMap. This prevents warmed DB jobs from deleting submit-time scripts such as `query-download-ssd-aks.sh`, `results-export-aks.sh`, and `elb-finalizer-aks.sh`.
- The ElasticBLAST runtime patch now overwrites the vendored `init-db-shard-aks.sh` submit template with the same hardened skip contract. This prevents `elastic-blast submit` from replacing the warmed ConfigMap with an older script that requires `taxonomy4blast.sqlite3` or a non-empty completion marker before it can skip. When ElasticBLAST does not pass `ELB_DB_SOURCE_VERSION`, the submit-time script resolves `{db}-metadata.json` from the prepared DB container and uses its `source_version` to reject stale cache markers.
- Frontend Run details phases and messages were aligned with backend progress keys.
- No IaC resource shape changes.
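The hardened skip contract for the shard init script can be sketched roughly as follows. The real script is a shell script; this Python rendering only models the decision logic. The function names are hypothetical. The idea that a non-empty `.download-complete` marker stores the source version is also an assumption, since the document only says that zero-byte markers are accepted and that new downloads write non-empty ones.

```python
from pathlib import Path
from typing import Optional

MARKER = ".download-complete"
PARTIAL_PREFIX = ".azDownload-"  # prefix of stale azcopy partial files


def clean_partials(db_dir: Path) -> int:
    """Remove stale azcopy partial files and return how many were deleted."""
    removed = 0
    for partial in db_dir.glob(PARTIAL_PREFIX + "*"):
        if partial.is_file():
            partial.unlink()
            removed += 1
    return removed


def can_skip_download(db_dir: Path, expected_version: Optional[str] = None) -> bool:
    """Decide whether a warmed shard directory can be reused as-is."""
    marker = db_dir / MARKER
    if not marker.is_file():
        return False
    # The prepared layout needs both taxdb files and at least one real .nsq file.
    if not (db_dir / "taxdb.btd").is_file() or not (db_dir / "taxdb.bti").is_file():
        return False
    nsq_files = [p for p in db_dir.glob("*.nsq") if not p.name.startswith(PARTIAL_PREFIX)]
    if not nsq_files:
        return False
    # Assumption: a non-empty marker records the source version. A zero-byte
    # marker carries no version, so it is accepted without comparison.
    recorded = marker.read_text().strip()
    if expected_version and recorded and recorded != expected_version:
        return False
    return True


def mark_complete(db_dir: Path, version: Optional[str] = None) -> None:
    """Write a non-empty completion marker after a fresh download."""
    (db_dir / MARKER).write_text((version or "complete") + "\n")
```

A caller would run `clean_partials` first, then download only when `can_skip_download` returns `False`, and finish with `mark_complete`. This mirrors the cleanup-then-skip order in the canary logs recorded under Validation Evidence.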
Validation Evidence
- Targeted pytest: `uv run pytest -q api/tests/test_blast_tasks.py -k 'merge_progress_payload_keeps_submit_context_and_live_output or merge_progress_payload_keeps_completed_submit_output or merge_progress_payload_completes_previous_running_steps or merge_progress_payload_completes_steps_when_phase_advances or merge_progress_payload_tracks_staging_db_before_submit'`
- Warmup script pytest: `uv run pytest -q api/tests/test_warmup_jobs.py -k warmup_scripts_configmap_contains_job_scripts`
- Warmup regression file: `uv run pytest -q api/tests/test_warmup_jobs.py` -> 22 passed.
- Warmup lint: `uv run ruff check api/services/warmup_jobs.py api/tests/test_warmup_jobs.py` -> passed.
- Submit-template regression: `uv run pytest -q api/tests/test_terminal_patch_elastic_blast.py api/tests/test_warmup_jobs.py` -> 24 passed.
- Submit-template lint: `uv run ruff check terminal/patch_elastic_blast.py api/tests/test_terminal_patch_elastic_blast.py api/tests/test_warmup_jobs.py` -> passed.
- ConfigMap split regression: `uv run pytest -q api/tests/test_terminal_patch_elastic_blast.py api/tests/test_warmup_jobs.py` -> 25 passed.
- Focused lint after ConfigMap split: `uv run ruff check terminal/patch_elastic_blast.py api/services/warmup_jobs.py api/tasks/blast/__init__.py api/tests/test_terminal_patch_elastic_blast.py api/tests/test_warmup_jobs.py` -> passed.
- Live submit `26662f8c-ee23-4aa0-9fc5-00f7586609f9` proved the old UI interpretation was misleading: `init-ssd-0..9` all reached `Complete` at `2026-05-20T06:15:48Z`/`06:15:49Z`, while the UI still showed `staging_db` until the `elastic-blast submit` command returned. The same run then failed because the warmup ConfigMap had overwritten `elb-scripts` with only `blast-vmtouch-aks.sh` and `init-db-shard-aks.sh`, so batch pods could not start `/scripts/query-download-ssd-aks.sh` and the finalizer could not start `/scripts/elb-finalizer-aks.sh`.
- Live ConfigMap repair: `kubectl create configmap elb-scripts -n default --from-file=/tmp/elb-patched-runtime.KAJo3X/src/elastic_blast/templates/scripts --dry-run=client -o yaml | kubectl apply -f -`; verified keys now include `blast-run-aks.sh`, `query-download-ssd-aks.sh`, `results-export-aks.sh`, `elb-finalizer-aks.sh`, and the hardened `init-db-shard-aks.sh` with `Resolving DB source version` and without the old `-s .download-complete`/`taxonomy4blast.sqlite3` precheck.
- Generated submit-template syntax: `bash -n` on a temporary patched `init-db-shard-aks.sh` passed; marker grep confirmed `Resolving DB source version`, `CLEANUP partial downloads`, `-f .download-complete`, `DOWNLOAD_SKIP existing shard=${ELB_SHARD_IDX}`, `taxdb.btd`, and `taxdb.bti`.
- Local terminal-exec runtime was restarted against a fresh temporary patched ElasticBLAST tree (`/tmp/elb-patched-runtime.KAJo3X` during validation). Process environment confirmed `PYTHONPATH=/tmp/elb-patched-runtime.KAJo3X/src`, and the active template contained `Resolving DB source version`, `CLEANUP partial downloads`, and `DOWNLOAD_SKIP existing shard=${ELB_SHARD_IDX}`.
- Frontend build: `cd web && npm run build`
- Live canary before hardening: `elb-cache-skip-canary-00` on `aks-blastpool-41800479-vmss00001o` still printed `Downloading manifest`, proving direct init-SSD calls were not skipping warmed shards.
- Live canary after hardening and ConfigMap refresh: `elb-cache-skip-canary-00` completed within 60 seconds and logged `CLEANUP partial downloads` followed by `DOWNLOAD_SKIP existing shard=00`; no manifest or DB copy started.
- Live full submit after the canary still recopied shards because `elastic-blast submit` overwrote the `elb-scripts` ConfigMap from its vendored template. The affected `init-ssd-*` logs printed `Downloading with pattern`, and the live ConfigMap contained the old `-s .download-complete`/`taxonomy4blast.sqlite3` checks. The final fix therefore moved the hardening into `terminal/patch_elastic_blast.py`, which is the submit-time template source.
- Live full submit `f172ae44-472a-41e6-8d02-408472d895c0` completed after the ConfigMap split and submit-template hardening: `staging_db` completed at `2026-05-20T06:39:18Z`, new init suffix `e2fc8081` advanced to 10/10 completed batch jobs, `elb-finalizer-e2fc8081` completed, and finalizer logs uploaded `merged_results.out.gz` plus `merge-report.json` under the run result prefix.
- Result API smoke for the same job with `storage_account=elbstg01` returned `manifest.status=available`, `file_count=73`, and `parseable_count=35`; alignments smoke parsed the merged result with `total_hits=100`, `files_parsed=1`, and `blob_name=f172ae44-472a-41e6-8d02-408472d895c0/job-776c62c7b5af4654813da9c3e2fc8081/merged_results.out.gz`.
- Progress refresh regression: `uv run pytest -q api/tests/test_local_to_blast_job.py api/tests/test_blast_tasks.py::test_merge_progress_payload_completes_previous_running_steps` -> 12 passed; `uv run ruff check api/services/blast_job_state.py api/tests/test_local_to_blast_job.py` -> passed.
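The marker grep used on the generated `init-db-shard-aks.sh` template could be expressed as a small Python check along these lines. This is a sketch, not the validation command actually run. `check_template` is a hypothetical helper, and only the `-s .download-complete` test is treated as a forbidden substring, on the assumption that the `taxonomy4blast.sqlite3` file name may still appear legitimately elsewhere in the script.

```python
from typing import List, Tuple

# Strings the hardened template is expected to contain (from the marker grep).
REQUIRED_MARKERS = [
    "Resolving DB source version",
    "CLEANUP partial downloads",
    "-f .download-complete",
    "DOWNLOAD_SKIP existing shard=${ELB_SHARD_IDX}",
    "taxdb.btd",
    "taxdb.bti",
]

# The old precheck required a non-empty marker via `-s`.
FORBIDDEN_MARKERS = ["-s .download-complete"]


def check_template(text: str) -> Tuple[List[str], List[str]]:
    """Return (missing required markers, stale markers still present)."""
    missing = [marker for marker in REQUIRED_MARKERS if marker not in text]
    stale = [marker for marker in FORBIDDEN_MARKERS if marker in text]
    return missing, stale
```

A template passes when both returned lists are empty. A non-empty `stale` list would suggest the vendored pre-hardening script is still in use.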