2026-05-14 — Job submit & warmup hardening
Motivation
Two production failures observed during AKS cluster bring-up + first job submit:
- Warmup DaemonSet hit `Init:CrashLoopBackOff` with `azcopy 403 AuthorizationFailure`. Root cause: the AKS kubelet UAMI was granted Storage Blob Data Contributor by `assign_aks_roles_activity` immediately after cluster creation, but the AAD-side propagation can take 60–180 s. The DaemonSet pod tries azcopy before the role is effective, fails, then sits in CrashLoopBackOff until the k8s back-off (5 min cap) finally lets it retry. The orchestrator's poll then declares `failed` while the role would have been ready 30 s later. The UI showed `Failed: []` because the failure path used `check.get('failed_jobs', [])` but the activity returns the count under `failed`.
- `elastic-blast submit` from inside the pre-booted `elb-openapi` pod failed at `kubectl apply ... -o json` with `field is immutable` on leftover BLAST jobs from an earlier incomplete submit. elastic-blast's own reuse-mode cleanup only runs when its `_db_already_loaded()` check passes, so a half-finished submit leaves stale `app=blast`/`submit`/`setup`/`finalizer` Jobs that the next submit cannot replace. A separate path triggered the same code with `outfmt = "7 std staxids ssciname"`; the embedded double-quotes broke the generated `batch_*.yaml` (`line 78: did not find expected key`).
User-visible symptom: AKS card showed "Warmup failed []" with no diagnostic; job submit appeared to succeed in the orchestrator status but never produced a BLAST Job in the cluster.
Changes
api/activities/blast.py
- New helper `_build_submit_args(config_b64, job_id)` builds the bash one-liner for both `_submit_via_k8s_exec` and `_start_submit_via_k8s_job` paths so future fixes apply once.
- Submit args now:
  - Retry `az login --service-principal` 5 × 5 s to ride out the Workload Identity federated-token race on first scheduling.
  - Hard-code `PYTHONPATH=/opt/venv/lib/python3.11/site-packages` so the elastic-blast CLI (system python) can import `azure.mgmt.*` (venv).
  - `set -o pipefail`; abort with exit 2 if all 5 az login attempts fail.
- New helper `_cleanup_stale_blast_jobs(session, server)` deletes leftover Jobs labelled `app=blast|submit|setup|finalizer` from the default ns before every submit. Idempotent, best-effort.
- `_submit_via_k8s_exec` and `_start_submit_via_k8s_job` now copy `ELB_*` env vars (in addition to `AZURE_*` / `AZCOPY_*`) from the running `elb-openapi` pod into the submit Job. Without this, the elastic-blast CLI falls back to discovery code that fails inside an isolated submit pod.
- `activity_k8s_warmup_db` init container retries azcopy 6 × 30 s before declaring failure. RBAC propagation is now absorbed without surfacing as pod-level CrashLoopBackOff.
- `activity_k8s_check_warmup_db` only declares `status=failed` once a pod has accumulated ≥ 5 init container restarts (≈ 10–15 min). When it does fail, it captures the last 60 lines of init container logs (sanitised) and surfaces them under `logs`, `failed_pod`, `init_failed`, `restart_max`.
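For orientation, here is a minimal sketch of how a shared submit-args builder along these lines could look. It is not the code in `api/activities/blast.py`: the function name `build_submit_args_sketch`, the exact `az login` flags and environment variables, the temp config path, and the `elastic-blast submit --cfg` invocation are all assumptions made for illustration. Only the 5 × 5 s retry, the `PYTHONPATH` value, `set -o pipefail`, and the exit code 2 come from the notes above.

```python
import shlex

# Values taken from the changelog entry.
VENV_SITE_PACKAGES = "/opt/venv/lib/python3.11/site-packages"
AZ_LOGIN_ATTEMPTS = 5
AZ_LOGIN_SLEEP_SECONDS = 5

# Assumption: a Workload Identity style login. The flags and env vars
# below are illustrative, not confirmed by the changelog.
DEFAULT_LOGIN_CMD = (
    'az login --service-principal -u "$AZURE_CLIENT_ID" -t "$AZURE_TENANT_ID" '
    '--federated-token "$(cat "$AZURE_FEDERATED_TOKEN_FILE")" >/dev/null'
)


def build_submit_args_sketch(
    config_b64: str,
    job_id: str,
    login_cmd: str = DEFAULT_LOGIN_CMD,
    attempts: int = AZ_LOGIN_ATTEMPTS,
    sleep_seconds: int = AZ_LOGIN_SLEEP_SECONDS,
) -> list[str]:
    """Return an argv list that runs the whole submit as one bash one-liner."""
    # Hypothetical location for the decoded config file.
    cfg_path = shlex.quote(f"/tmp/elb-{job_id}.ini")

    # Try the login up to `attempts` times, sleeping after each failure.
    retry_loop = (
        f"for i in $(seq 1 {attempts}); do "
        f"if {login_cmd}; then ok=1; break; fi; "
        f"sleep {sleep_seconds}; done"
    )

    statements = [
        "set -o pipefail",
        f"export PYTHONPATH={VENV_SITE_PACKAGES}",
        "ok=0",
        retry_loop,
        # Abort with exit 2 if every login attempt failed.
        '[ "$ok" = 1 ] || exit 2',
        f"echo {shlex.quote(config_b64)} | base64 -d > {cfg_path}",
        # Assumption: the CLI is invoked with --cfg pointing at the decoded file.
        f"elastic-blast submit --cfg {cfg_path}",
    ]
    return ["bash", "-c", "; ".join(statements)]
```

Building the script in one place means both the exec path and the Job path can pass the same argv to their container spec, which is the point of the shared helper.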
api/orchestrators/warmup_db.py
- New `RBAC_PROPAGATION_SECONDS = 60` timer between `assign_aks_roles_activity` and the warmup DaemonSet apply.
- Failure branch now reads `check.get('logs')` / `init_failed` / `restart_max` / `failed_pod` and renders a real error message instead of `Failed: []`.
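A rough sketch of what the new failure rendering could look like, assuming `check` is a plain dict and that `init_failed` and `restart_max` are integer counts (the changelog does not state their types). The function name `render_warmup_failure` and the message wording are hypothetical.

```python
from typing import Any, Mapping


def render_warmup_failure(check: Mapping[str, Any]) -> str:
    """Build a human-readable message from the warmup check result."""
    failed_pod = check.get("failed_pod") or "unknown pod"
    init_failed = check.get("init_failed", 0)
    restart_max = check.get("restart_max", 0)
    logs = (check.get("logs") or "").strip()

    header = (
        f"Warmup failed on {failed_pod}: "
        f"init_failed={init_failed}, restart_max={restart_max}"
    )
    if not logs:
        # Still better than an empty list: say explicitly that logs are missing.
        return header + " (no init container logs captured)"

    # Keep at most the last 60 log lines, matching what the activity captures.
    tail = "\n".join(logs.splitlines()[-60:])
    return f"{header}\n{tail}"
```

Reading each key with a default keeps the orchestrator from raising if the activity omits a field, which was the shape of the original `failed_jobs` vs `failed` mismatch.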
api/services/blast_config.py
- `outfmt` value is now rejected at the boundary (`ValueError`) when it contains shell/YAML-breaking characters (`` "';&|$(){}` ``). Failure surfaces in `generate_blast_config_activity` instead of 60 s later in the cluster.
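The boundary check can be sketched as below. The character set is the one listed above; the function name `validate_outfmt`, the constant name, and the error wording are illustrative rather than the actual code in `api/services/blast_config.py`.

```python
# Characters that break the generated shell command or batch YAML.
FORBIDDEN_OUTFMT_CHARS = frozenset("\"';&|$(){}`")


def validate_outfmt(value: str) -> str:
    """Return `value` unchanged, or raise ValueError if it has unsafe characters."""
    bad = sorted(set(value) & FORBIDDEN_OUTFMT_CHARS)
    if bad:
        raise ValueError(
            f"outfmt contains unsupported characters: {''.join(bad)!r}"
        )
    return value
```

With this check, `7 std staxids ssciname` passes, while the same value wrapped in literal double-quotes is rejected before any config is generated.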
Validation
- `ruff check` + `py_compile` clean for changed files (lint count went 68 → 62, no new warnings).
- Manual reproduction: `kubectl exec -n default deploy/elb-openapi -- bash /tmp/test-submit.sh` on `elb-cluster-01` (16S_ribosomal_RNA, blastn) reached `[1/5] Writing configuration ...` → `Splitting queries` → `Upload workfiles` → `kubectl apply` step. Stale `field is immutable` no longer fires after `_cleanup_stale_blast_jobs` runs.
- Warmup DaemonSet manually verified: after force-deleting failing pods, azcopy succeeded on retry. New retry loop should make manual delete unnecessary going forward.
- Deployed via `scripts/dev/deploy-api.sh` → `func-elb-prod-ga5754pr7jw3u` health probe returned 200.
Out of scope (follow-ups)
- End-to-end blastn smoke test driven from the SPA (requires a finished cluster + small query set + 5 min runtime).
- `init-pv` job hangs on `configmap "elb-scripts" not found` if the ConfigMap is manually deleted; elastic-blast re-creates it via `_cleanup_stale_jobs` only on warm reuse. Worth a separate boundary check.
- `_submit_via_k8s_exec` retry around `kubectl apply` on transient 500/conflict from kube-apiserver.