Auto-warmup: delete stale Jobs pinned to removed VMSS nodes
Motivation
After fix 2026-05-18-autowarmup-reenqueue-and-openapi-update-gate, the beat
reconciler correctly re-enqueues a warmup_database task for any DB that is
not in the Ready state. But the task itself still failed for core_nt
on the production ca-elb-control Container App — every cycle reported
status="failed", nodes_failed=10, and core_nt never went back to warm.
Investigation showed the root cause is Job.spec.template.spec.nodeName
immutability combined with AKS stop/start rotating VMSS instance names:
- ElasticBLAST node-local warmup builds Jobs named warm-<db>-<shard> and pins each one to a specific VMSS node via spec.template.spec.nodeName (see api/services/warmup_jobs.py:database_status_from_warmup_jobs and the Job builder around line 521).
- When AKS is stopped and started, the underlying VMSS instances are replaced. The previously-succeeded warm-core-nt-{00..09} Jobs still exist with status.succeeded=1, failed=0, but their nodeName points at instances that are no longer in the cluster.
- api/services/k8s_monitoring.py::_mark_stale_warmup_nodes correctly classifies the database as Stale in this state (it sets nodes_failed = total_jobs), which is why the dashboard shows the DB as not-warm.
- api/services/k8s_monitoring.py::_ensure_job_manifests then refuses to recreate the Jobs because the names already exist: it short-circuits with existing.append(name); continue.
- The result is a permanent failure loop: reconcile fires → warmup_database runs → ensure finds existing Jobs and does nothing → status remains Stale → reconcile fires again.
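The loop described above can be reproduced with a small, self-contained model. This is only a sketch: ensure_jobs and classify are hypothetical stand-ins for the real ensure and stale-detection helpers, the node names are invented, and the stale rule is simplified to "any Job pinned to a missing node".

```python
from typing import Dict, List, Set


def ensure_jobs(existing: Dict[str, str], wanted: Dict[str, str]) -> List[str]:
    """Create only Jobs whose name is absent; existing names are skipped.

    Both dicts map Job name -> pinned nodeName. Returns the names created.
    """
    created = []
    for name, node in wanted.items():
        if name in existing:
            # Same short-circuit as described above: the name already exists,
            # so the Job (and its immutable nodeName) is left untouched.
            continue
        existing[name] = node
        created.append(name)
    return created


def classify(existing: Dict[str, str], live_nodes: Set[str]) -> str:
    """Return "Stale" if any Job is pinned to a node that is gone (simplified)."""
    stale = [name for name, node in existing.items() if node not in live_nodes]
    return "Stale" if stale else "Ready"


# Jobs created before the AKS stop/start, pinned to old (hypothetical) node names.
jobs = {f"warm-core-nt-{i:02d}": f"vmss-old-{i}" for i in range(10)}
# Node names after the restart, and the Jobs the planner would like to have.
live = {f"vmss-new-{i}" for i in range(10)}
wanted = {f"warm-core-nt-{i:02d}": f"vmss-new-{i}" for i in range(10)}

history = []
for _cycle in range(3):
    created = ensure_jobs(jobs, wanted)
    history.append((len(created), classify(jobs, live)))

print(history)  # [(0, 'Stale'), (0, 'Stale'), (0, 'Stale')]
```

Every cycle creates zero Jobs and the status never leaves Stale, which is the behaviour observed for core_nt.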
Memory file /memories/repo/aks-warmup-storage.md already noted this
hazard: "After AKS stop/start, completed elb-db-warmup Jobs may point
at removed node names; treat them as stale even if status.succeeded=1."
User-facing change
- After this fix, when the dashboard's auto-warmup reconciler triggers a warmup for a database whose Jobs are pinned to nodes no longer in the cluster, the worker deletes the stale Jobs (with propagationPolicy=Background so the pods clean up too) and recreates fresh Jobs on the current ready nodes.
- core_nt (and any other DB previously stuck after an AKS stop/start) returns to Ready on its next warmup cycle without manual intervention.
- No UI change beyond what was already visible: the warmup card status flips from Stale to Warming → Ready as expected.
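For reference, a Job delete with background propagation corresponds to a Kubernetes DELETE on the batch/v1 Job resource with a DeleteOptions body. The sketch below only builds that request as plain data; job_delete_request is a hypothetical helper, and how the worker actually reaches the cluster API is not shown in this note.

```python
import json


def job_delete_request(name: str, namespace: str = "default") -> dict:
    """Describe the REST call that deletes one Job and lets its pods be
    garbage-collected in the background (hypothetical helper)."""
    return {
        "method": "DELETE",
        "path": f"/apis/batch/v1/namespaces/{namespace}/jobs/{name}",
        "body": {
            "kind": "DeleteOptions",
            "apiVersion": "v1",
            # Background: the Job object is removed immediately and the
            # garbage collector deletes its pods afterwards.
            "propagationPolicy": "Background",
        },
    }


request = job_delete_request("warm-core-nt-00")
print(request["method"], request["path"])
print(json.dumps(request["body"]))
```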
API / IaC diff
- api/services/k8s_monitoring.py: new helper k8s_release_stale_warmup_jobs(credential, subscription_id, resource_group, cluster_name, db_name, current_node_names, namespace='default'). Lists Jobs labelled app=db-warmup,db=<sanitized>, compares each Job's spec.template.spec.nodeName against the current ready-node set, and deletes those whose nodeName is no longer in the cluster. Returns {status, database, namespace, deleted: [...], kept: [...], errors: [...]}. Mirrors the existing k8s_release_warmup_cache pattern but filters per-Job rather than wiping the whole label.
- api/tasks/storage.py::warmup_database: calls the new helper between k8s_ensure_warmup_scripts_configmap and k8s_ensure_job_manifests, passing the full set of currently-Ready warmup nodes (not the per-round plan.nodes, so that Jobs still pinned to live but not-selected nodes are preserved). The stale_jobs summary is added to both _record_task_progress and the persisted state so the audit log shows which Jobs were dropped.
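The per-Job filtering can be sketched as a pure function over Job manifests. This is not the actual helper: release_stale_jobs, make_job, and the injected delete_job callable are hypothetical, the "ok" status value is an assumption ("partial" on delete error follows the targeted tests), and unpinned Jobs are assumed to be reported under kept.

```python
from typing import Callable, Iterable, Optional


def release_stale_jobs(
    jobs: Iterable[dict],
    current_node_names: Iterable[str],
    delete_job: Callable[[str], object],
    database: str,
    namespace: str = "default",
) -> dict:
    """Delete Jobs whose pinned nodeName is not in the current node set."""
    live = set(current_node_names)
    deleted, kept, errors = [], [], []
    for job in jobs:
        name = job["metadata"]["name"]
        pod_spec = job.get("spec", {}).get("template", {}).get("spec", {})
        node = pod_spec.get("nodeName")
        if not node or node in live:
            # Unpinned Jobs and Jobs on live nodes are left alone.
            kept.append(name)
            continue
        try:
            delete_job(name)
            deleted.append(name)
        except Exception as exc:  # keep going; report the failure instead
            errors.append({"job": name, "error": str(exc)})
    return {
        "status": "partial" if errors else "ok",  # "ok" is an assumed value
        "database": database,
        "namespace": namespace,
        "deleted": deleted,
        "kept": kept,
        "errors": errors,
    }


def make_job(name: str, node: Optional[str] = None) -> dict:
    """Build a minimal Job manifest for the example (hypothetical)."""
    pod_spec = {"nodeName": node} if node else {}
    return {"metadata": {"name": name}, "spec": {"template": {"spec": pod_spec}}}


jobs = [
    make_job("warm-core-nt-00", "vmss-old-0"),  # node gone: deleted
    make_job("warm-core-nt-01", "vmss-new-1"),  # node live: kept
    make_job("warm-core-nt-02"),                # unpinned: kept
]
calls = []
result = release_stale_jobs(jobs, {"vmss-new-1"}, calls.append, database="core_nt")
print(result["deleted"], result["kept"], result["status"])
```

Passing the full set of live nodes, rather than only the nodes selected for the current round, is what keeps warm-core-nt-01 in this example even if its node was not picked for this round.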
No IaC, infra, or frontend changes.
Validation evidence
- Targeted tests: uv run pytest -q api/tests/test_k8s_release_stale_warmup_jobs.py → 4 passed (deletes only Jobs on dead nodes; keeps live; skips unpinned; reports partial on delete error).
- Existing warmup tests: uv run pytest -q api/tests/test_auto_warmup.py api/tests/test_blast_tasks.py api/tests/test_warmup_jobs.py → 92 passed.
- Full suite: uv run pytest -q api/tests → 649 passed.
- Lint: uv run ruff check api → All checks passed.