2026-05-15: DB Warmup is a no-op (verification-only); legacy DaemonSet path missing
Motivation
Following the wiring fix in
2026-05-15-warmup-endpoint-fix.md,
clicking Start warmup in the AKS cluster detail modal now returns a
real Celery AsyncResult and the SPA observes Completed. Despite that,
the user reports "DB warmup has a problem" (translated from Korean), because
the actual behaviour on the AKS cluster is unchanged: no database is loaded
onto node SSDs.
Symptom
- SPA → POST /api/warmup/start with {"db": "blast-db/core_nt", "db_display_name": "core_nt", ...}.
- Backend returns 200 in ~150 ms with {instance_id, db: "core_nt", status: "queued"}.
- The Celery worker picks up api.tasks.storage.warmup_database, issues a ContainerClient.list_blobs(name_starts_with="core_nt") style probe against https://elbstg01.blob.core.windows.net/blast-db, confirms the DB has files, and returns Completed/succeeded in <2 s.
- SPA renders "Warmup (core_nt): Completed" with a green check.
- GET /api/monitor/aks/warmup-status keeps reporting:
{ "warm": true, "workspace_ready": 3, "workspace_desired": 3,
"databases": [], "vmtouch_ready": 0, "namespaces": [] }
warm: true is from the always-on create-workspace DaemonSet —
it just means /workspace is mounted on every node. databases: []
is the truth: no DB is actually cached on any node, so the next
ElasticBLAST submit will pay the full cold-start price.
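Because warm only reflects the create-workspace DaemonSet, a caller that wants to know whether any database is really on the nodes has to look at databases instead. A minimal sketch, assuming the payload shape shown above; the helper name is hypothetical, and since the element shape of databases is not shown here, the sketch only checks whether the list is non-empty.

```python
from typing import Any, Mapping


def any_db_cached_on_nodes(status: Mapping[str, Any]) -> bool:
    """Return True only if the warmup-status payload lists at least one DB.

    The "warm" flag is deliberately ignored: per the note above it only says
    that /workspace is mounted, not that a database has been loaded.
    """
    databases = status.get("databases") or []
    return len(databases) > 0


# Payload copied from the symptom above.
observed = {
    "warm": True, "workspace_ready": 3, "workspace_desired": 3,
    "databases": [], "vmtouch_ready": 0, "namespaces": [],
}
print(any_db_cached_on_nodes(observed))  # False
```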
Root cause
api/tasks/storage.py::warmup_database is now a pure verification check.
The HEAD~1 version was already a thin wrapper over
terminal_exec → elastic-blast get-blastdb; the unstaged diff that the
parallel session is currently holding intentionally downgrades it further, to:
databases = list_databases(get_credential(), storage_account)
match = next((db for db in databases if db.get("name") == database_name), None)
if match and int(match.get("file_count") or 0) > 0:
    return {"status": "completed",
            "output": "Database is prepared in workload storage."}
return {"status": "failed",
        "error": f"database {database_name!r} is not prepared in workload storage"}
That check answers the question "is this DB staged in
https://<account>.blob.core.windows.net/blast-db/?" — it does not
warm node SSDs. The label "warmup" is misleading.
The real warmup path that the dashboard's UI was originally designed against still lives in the retired Functions tree:
- Orchestrator: legacy/functionapp/orchestrators/warmup_db.py:1-200. 6-phase flow: enabling_storage → configuring → roles → warming_up → polling → disable_storage. Polls up to WARMUP_POLL_MAX_ATTEMPTS=480 × 15 s = 120 min (sized for core_nt ≈ 283 GB).
- Activity: legacy/functionapp/activities/blast.py:1170-1310. activity_k8s_warmup_db creates a Kubernetes DaemonSet warmup-{safe_db} (label app=db-warmup, db={safe}) with an initContainer that runs:
export AZCOPY_AUTO_LOGIN_TYPE=MSI
for i in 1 2 3 4 5 6; do
    azcopy cp "$DB_URL/*" "$TMP_DIR/" --recursive --log-level=WARNING && break
    sleep 30  # tolerate kubelet RBAC propagation
done
find "$TMP_DIR" -name "${db_name}*" -exec mv {} "$DB_DIR/" \;
blastdbcmd -db "$db_name" -info -json > "$db_name.njs"
Volume hostPath /workspace (DirectoryOrCreate) → /workspace,
requests cpu=1 memory=1Gi, limits memory=4Gi,
pause container is registry.k8s.io/pause:3.9.
activity_k8s_check_warmup_db polls pods via
labelSelector=app=db-warmup,db={safe} and surfaces init-container
errors + logs.
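The DaemonSet described above can be sketched as a plain manifest builder. This is an illustration assembled from the details listed in this note, not the legacy code itself. The function name, the container names ("warmup", "pause"), the volume name ("workspace"), and the init image and script arguments are assumptions, and the resource requests are assumed to belong to the init container.

```python
from typing import Any, Dict


def build_warmup_daemonset(safe_db: str, init_image: str, init_script: str) -> Dict[str, Any]:
    """Build an apps/v1 DaemonSet manifest shaped like the legacy warmup.

    safe_db must already be a valid Kubernetes name fragment; init_image and
    init_script are placeholders, since this note does not name the image.
    """
    labels = {"app": "db-warmup", "db": safe_db}
    init_container = {
        "name": "warmup",  # hypothetical container name
        "image": init_image,
        "command": ["/bin/sh", "-c", init_script],
        "resources": {
            "requests": {"cpu": "1", "memory": "1Gi"},
            "limits": {"memory": "4Gi"},
        },
        "volumeMounts": [{"name": "workspace", "mountPath": "/workspace"}],
    }
    pod_spec = {
        "initContainers": [init_container],
        # The pause container keeps the pod alive once the init step has finished.
        "containers": [{"name": "pause", "image": "registry.k8s.io/pause:3.9"}],
        "volumes": [{
            "name": "workspace",
            "hostPath": {"path": "/workspace", "type": "DirectoryOrCreate"},
        }],
    }
    return {
        "apiVersion": "apps/v1",
        "kind": "DaemonSet",
        "metadata": {"name": f"warmup-{safe_db}", "labels": labels},
        "spec": {
            "selector": {"matchLabels": labels},
            "template": {"metadata": {"labels": labels}, "spec": pod_spec},
        },
    }
```

The same labels dictionary feeds the selector, the pod template, and the labelSelector=app=db-warmup,db={safe} query used for polling, so the three cannot drift apart.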
The SPA's WarmupSection (see web/src/components/WarmupSection.tsx)
still renders phase strings (enabling_storage, warming_up,
Loading DB to nodes... (X/Y)) that imply this DaemonSet is being
created — but the new Celery task never creates it.
Why this PR does NOT include the fix
api/tasks/storage.py, api/routes/stubs.py, api/main.py, and
api/celery_app.py all currently show as modified (M) in another active
editor session's working copy:
$ git status --short api/
M api/celery_app.py
M api/main.py
M api/routes/stubs.py
M api/tasks/acr.py
M api/tasks/azure.py
M api/tasks/blast.py
M api/tasks/storage.py
?? api/tasks/openapi.py
That session's unstaged diff already touches the warmup region of both
api/tasks/storage.py (the verification logic shown above) and
api/routes/stubs.py (the wiring fix in
2026-05-15-warmup-endpoint-fix.md).
Porting the legacy DaemonSet flow now would force a merge in three
files that are mid-edit, and the verification-only behaviour appears
to be intentional in their direction (the old terminal_exec → elastic-blast
get-blastdb call was deliberately removed). Coordination is required
before code lands.
Recommended fix path (for the next session that owns these files)
- Restore the real warmup path. Either:
  - Port the legacy DaemonSet flow into a new api/services/k8s_warmup.py (uses the existing _get_k8s_session(...) helper from api/services/k8s_monitoring.py for the cluster API; never call ManagedClusters.begin_run_command per AGENTS.md tripwire #9), then have warmup_database call into it after the storage verification step succeeds; or
  - Re-introduce the terminal_exec → elastic-blast get-blastdb fallback the previous version had, but route it through the terminal sidecar (api/services/terminal_exec.py) so it does not depend on the local-dev shell having the BLAST+ toolchain.
- Drop the _update_state(job_id, "downloading", ...) lie: either actually download something, or relabel the phase to "verifying" so the SPA does not promise work that is not happening.
- Make WarmupSection.tsx honest about the verification-only path while the real warmup is being implemented:
  - Rename the button to "Verify DB staged" (or similar) when the backend is in verification-only mode.
  - Hide the Loading DB to nodes... (X/Y) substring until custom_status.steps.warming_up actually carries ready/total.
- Ensure _update_state is no-op-safe when AZURE_TABLE_ENDPOINT is unset. It is already best-effort, but each call logs a WARNING; demote that to DEBUG in local-dev so the log isn't spammed for every warmup attempt. The parallel session's diff already wraps the call in try/except, so this is just log noise.
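The first two recommendations can be sketched together as a control flow: verify, then warm, then poll within the legacy budget, reporting "verifying" rather than "downloading". This is a sketch under stated assumptions, not the intended implementation. run_warmup and its four callables are hypothetical stand-ins for the storage check, the DaemonSet creation, the pod poll, and _update_state; only the phase names and the 480 × 15 s budget come from this note.

```python
import time
from typing import Any, Callable, Dict, Tuple

WARMUP_POLL_MAX_ATTEMPTS = 480  # value quoted from the legacy orchestrator
WARMUP_POLL_INTERVAL_S = 15     # 480 attempts x 15 s = 7200 s = 120 min


def run_warmup(
    verify_staged: Callable[[], bool],
    create_daemonset: Callable[[], None],
    count_ready: Callable[[], Tuple[int, int]],
    report_phase: Callable[[str, Dict[str, Any]], None],
    sleep: Callable[[float], None] = time.sleep,
    max_attempts: int = WARMUP_POLL_MAX_ATTEMPTS,
    interval_s: float = WARMUP_POLL_INTERVAL_S,
) -> Dict[str, str]:
    """Verify the DB is staged, start the warmup, then poll until all pods are ready."""
    # Honest phase label: nothing is downloaded during verification.
    report_phase("verifying", {})
    if not verify_staged():
        return {"status": "failed", "error": "database is not prepared in workload storage"}

    report_phase("warming_up", {})
    create_daemonset()

    for _ in range(max_attempts):
        ready, total = count_ready()
        # ready/total is what the SPA needs before showing "(X/Y)".
        report_phase("polling", {"ready": ready, "total": total})
        if total > 0 and ready >= total:
            return {"status": "completed"}
        sleep(interval_s)
    return {"status": "failed", "error": "warmup did not finish within the poll budget"}
```

Passing the collaborators in as callables keeps the sketch independent of the files that are currently mid-edit, and lets the loop be exercised with fakes instead of a live cluster.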
Side observation: single-worker uvicorn wedged
While reproducing the warmup behaviour on the local dev stack
(uvicorn api.main:app --reload, single worker), the api became
unresponsive: /api/health and /api/monitor/aks/warmup-status
both timed out at 60 s, with ss -tn 'sport = :8080' reporting
102 open connections piling up. The reloader process (663406)
was healthy; the spawned worker (901579) had 6 threads, 4 stuck on
futex_wait. SIGTERM did not unwedge it; SIGKILL followed by an
automatic respawn restored the api, after which everything recovered
cleanly (including the warmup status query).
The most likely cause, not yet confirmed, is the SPA's per-card 30 s
TanStack Query polls stacking up behind the synchronous K8s API calls in
k8s_warmup_status and the Storage SDK calls in warmup_database, all of
which occupy threadpool slots while a slow request is pending. Worth a
follow-up:
- Wrap requests.get(...) calls in k8s_monitoring.py with stricter timeouts (already 10 s per call, but five sequential calls × 10 s plus threadpool contention can stall an event loop) and consider a shorter shared timeout budget.
- Bump the local-dev uvicorn to --workers 2 or pin --limit-concurrency so a stuck request does not block all others. Note that uvicorn ignores --workers while --reload is enabled, so the first option means dropping --reload.
That observation is logged here for the next maintainer; this PR does not change uvicorn invocation.
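The "shorter shared timeout budget" idea can be sketched as a single deadline divided across the sequential calls, so five calls can never add up to 5 × 10 s. The names call_with_shared_budget and BudgetExceeded are hypothetical; each callable is assumed to accept a timeout in seconds and pass it on, for example to the timeout argument of requests.get.

```python
import time
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


class BudgetExceeded(TimeoutError):
    """Raised when the shared deadline is used up before every call has run."""


def call_with_shared_budget(
    calls: Sequence[Callable[[float], T]],
    total_budget_s: float,
    per_call_cap_s: float = 10.0,
    clock: Callable[[], float] = time.monotonic,
) -> List[T]:
    """Run calls in order, giving each at most the time left in the shared budget."""
    deadline = clock() + total_budget_s
    results: List[T] = []
    for index, call in enumerate(calls):
        remaining = deadline - clock()
        if remaining <= 0:
            raise BudgetExceeded(f"budget exhausted before call {index}")
        # Never exceed the per-call cap, and never outlive the shared deadline.
        results.append(call(min(per_call_cap_s, remaining)))
    return results
```

A caller might wrap each probe as lambda t: session.get(url, timeout=t), where session and url are whatever the monitoring code already uses. The budget only bounds the timeouts handed out; it does not interrupt a call that ignores its timeout.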
API/IaC diff summary
No code changes. Documentation only.
Validation evidence
# 1) Warmup task: returns succeeded immediately even though no DB is loaded on nodes
$ curl -s -X POST -H 'Authorization: Bearer __dev_bypass__' \
-H 'Content-Type: application/json' \
-d '{"subscription_id":"...","resource_group":"rg-elb-01",
"storage_account":"elbstg01",
"db":"blast-db/core_nt","db_display_name":"core_nt",
"aks_cluster_name":"elb-cluster"}' \
"http://127.0.0.1:8080/api/warmup/start"
{"id":"...","instance_id":"a8aa..","task_id":"a8aa..",
"db":"core_nt","statusQueryGetUri":"/api/tasks/a8aa..","status":"queued"}
$ curl -s -H 'Authorization: Bearer __dev_bypass__' \
"http://127.0.0.1:8080/api/warmup/a8aa../status"
{"instance_id":"a8aa..","runtime_status":"Completed",
"custom_status":{"phase":"completed","db":"core_nt"},
"output":{"status":"succeeded","db":"core_nt"}}
# 2) Cluster reports the DB is NOT actually warm on nodes
$ curl -s -H 'Authorization: Bearer __dev_bypass__' \
"http://127.0.0.1:8080/api/monitor/aks/warmup-status?...&cluster_name=elb-cluster"
{"warm":true,"workspace_ready":3,"workspace_desired":3,
"databases":[], "vmtouch_ready":0, "namespaces":[]}
databases: [] is the smoking gun: no db-warmup DaemonSet exists
on the cluster, so the verification-only "succeeded" claim is
operationally meaningless.
Cross-repo consistency
None. The sibling elastic-blast-azure
repo already handles real warmup via the same DaemonSet pattern
during elastic-blast submit; the dashboard's separate "warmup
button" is purely an optimization to pre-cache the DB before
submitting. Restoring it requires only control-plane code, not a
sibling change.