BLAST task hardening
Motivation
BLAST submission is the highest-risk workflow in the control plane: it bridges browser requests, Celery retries, Table-backed job state, the terminal sidecar, ElasticBLAST's Azure adapter, AKS jobs, and private Storage. The previous task implementation had several reliability and observability gaps:
- state updates used the wrong `JobStateRepository.update(...)` calling convention, so progress could be silently lost;
- config was written with `bash -c`, but the terminal exec server only allows audited binaries such as `elastic-blast`, `azcopy`, `kubectl`, `az`, and `elb`;
- generated config did not bind the requested AKS cluster name, query URL, database URL, or results root correctly;
- transient terminal/capacity failures were converted into terminal `failed` results instead of Celery retries;
- status checks used the ElasticBLAST CLI config path even though the CLI does not re-apply the idempotency key for status/delete.
User-facing change
BLAST jobs now surface clearer, more recoverable lifecycle state:
- submit pipes the INI config through `elastic-blast submit --cfg - --json`;
- `job_id` is passed as both idempotency key and correlation id, so Celery retries can safely resume the same ElasticBLAST submission;
- transient terminal-sidecar and ElasticBLAST capacity failures schedule a retry and write a `retry_scheduled` history event;
- successful/running phases clear stale `error_code` values;
- status checks use the direct Kubernetes API helper scoped by `BLAST_ELB_JOB_ID`, avoiding cross-job status bleed on shared clusters;
- cancel deletes Kubernetes Jobs with `app in {blast, submit}` and `elb-job-id = job_id`, avoiding the ElasticBLAST CLI delete path's in-process `elb_job_id` reconstruction gap;
- relative query/database paths reject `..` traversal before a config is sent to the terminal sidecar.
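The traversal check in the last item can be illustrated with a minimal sketch. This is not the code in `api/tasks/blast.py`; the function name `resolve_storage_url` and its parameters are hypothetical, and the only behaviour taken from the text above is that relative paths containing `..` are rejected before a config is built and that roots use Azure `https://<account>.blob.core.windows.net/...` URLs.

```python
from pathlib import PurePosixPath


def resolve_storage_url(account: str, container: str, relative_path: str) -> str:
    """Join a relative blob path onto a container root, rejecting '..' segments.

    Hypothetical helper for illustration; names and error messages are assumptions.
    """
    # Normalize backslashes so Windows-style input cannot hide a traversal segment.
    path = PurePosixPath(relative_path.replace("\\", "/"))
    if not path.parts:
        raise ValueError("path must not be empty")
    if path.is_absolute():
        raise ValueError(f"path must be relative: {relative_path!r}")
    # PurePosixPath does not collapse '..', so the segment is still visible here.
    if ".." in path.parts:
        raise ValueError(f"path traversal rejected: {relative_path!r}")
    return f"https://{account}.blob.core.windows.net/{container}/{path.as_posix()}"


if __name__ == "__main__":
    print(resolve_storage_url("acct", "queries", "run1/q.fa"))
```

Rejecting the path outright, rather than resolving `..` and continuing, keeps the check easy to audit: nothing that looks like a traversal ever reaches the terminal sidecar.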
API / task diff summary
- `api/tasks/blast.py`
  - Replaced shell temp-file config writes with `stdin` to `elastic-blast`.
  - Added URL normalization for `queries`, `blast-db`, and `results` Storage roots using Azure `https://<account>.blob.core.windows.net/...` URLs.
  - Added structured JSON tail parsing for ElasticBLAST Azure adapter output.
  - Added retry classification for `transient`, `capacity`, and `conflict` categories plus known retryable ElasticBLAST exit codes.
  - Fixed state/history writes to call `JobStateRepository.update(job_id, ...)` and `append_history(job_id, event, payload)` correctly.
  - Switched `check_status` to `k8s_check_blast_status(..., job_id=job_id)`.
  - Switched `cancel` to `k8s_cancel_blast_job(..., job_id=job_id)`.
- `api/services/k8s_monitoring.py` via the `api.services.monitoring` facade
  - Added `k8s_cancel_blast_job`, which uses the direct Kubernetes API to delete only Jobs labelled with the current `elb-job-id`.
- `api/tests/test_blast_tasks.py`
  - Added focused regression coverage for config generation, stdin argv shape, structured JSON parsing, retry classification, traversal rejection, and state repository / K8s cancellation contracts.
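As a rough illustration of the JSON tail parsing and retry classification listed above, the sketch below scans command output from the end for the last JSON object and decides whether a failure should be retried. It makes several assumptions that the summary does not confirm: the payload field name `category`, the helper names, the choice to treat all three categories as retryable, and the exit code in `RETRYABLE_EXIT_CODES`, which is a placeholder rather than a real ElasticBLAST value.

```python
import json
from typing import Any, Optional

# Categories named in the summary; treating all three as retryable is an assumption.
RETRYABLE_CATEGORIES = frozenset({"transient", "capacity", "conflict"})
# Placeholder only: the real retryable ElasticBLAST exit codes are not listed here.
RETRYABLE_EXIT_CODES = frozenset({75})


def parse_json_tail(output: str) -> Optional[dict[str, Any]]:
    """Return the last line of output that parses as a JSON object, if any."""
    for line in reversed(output.splitlines()):
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip ordinary log lines
        try:
            payload = json.loads(line)
        except json.JSONDecodeError:
            continue  # looked like JSON but was not; keep scanning upward
        if isinstance(payload, dict):
            return payload
    return None


def should_retry(exit_code: int, output: str) -> bool:
    """Classify a failed run as retryable from its JSON tail or its exit code."""
    payload = parse_json_tail(output) or {}
    category = payload.get("category")  # hypothetical field name
    if isinstance(category, str) and category in RETRYABLE_CATEGORIES:
        return True
    return exit_code in RETRYABLE_EXIT_CODES


if __name__ == "__main__":
    sample = 'starting submit\n{"status": "error", "category": "capacity"}\n'
    print(should_retry(1, sample))
```

In a Celery task, a `True` result would typically lead to `self.retry(...)` instead of recording a terminal `failed` state, which matches the behaviour described in the user-facing changes.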
Validation evidence
$ cd /home/moonchoi/dev/elb-dashboard && uv run ruff check api/tasks/blast.py api/tests/test_blast_tasks.py
All checks passed!
$ cd /home/moonchoi/dev/elb-dashboard && uv run pytest -q api/tests/test_blast_tasks.py
......... [100%]
9 passed in 0.58s
$ cd /home/moonchoi/dev/elb-dashboard && uv run python -m py_compile api/services/monitoring.py api/services/k8s_monitoring.py
exit=0
$ cd /home/moonchoi/dev/elb-dashboard && uv run pytest -q api/tests
........................................................................ [ 72%]
............................ [100%]
100 passed in 9.46s