Warm Cache Management UX¶
Motivation¶
DB warmup is a node-cache management workflow, not just a submit checkbox. Users need to see whether a downloaded/sharded database is already warm on AKS nodes, whether the current node memory headroom can support warmup, and whether a cache can be released when memory pressure or workload priorities change.
User-Facing Change¶
The AKS detail warmup panel now renders a per-database warm cache table instead of a single database selector. Each row shows storage state, shard state, warm-cache state, per-node warmup fit, and direct actions for Warm, Rewarm, or Release when applicable. The panel also shows current warm-cache capacity from node memory metrics.
The compact AKS card database strip now shows only actual node warm-cache state. Downloaded-only and sharded-only databases are hidden there; when no DB is warm, the card prompts the user to enable Auto warm in the BLAST Databases modal or open Details to warm a DB immediately. If a DB is warming, the strip shows the node progress instead of a sharded/downloaded chip.
Warmup progress now carries timing metadata from the Kubernetes warmup Jobs. While shards are active, the detail panel shows percent complete, elapsed time, and an estimated remaining time once at least one shard has finished. The compact AKS database strip also includes a short ETA in the warming chip and the full elapsed/remaining detail in the tooltip.
Warmup progress is now phase-aware instead of only Job-count-aware. The backend tails warmup pod logs and reports whether each shard is waiting for an image pull, copying shard files to node-local disk with azcopy, validating BLAST database files, touching files into RAM, completed, or failed. The compact database strip and detail panel surface the active phase, phase counts, current message, elapsed time, and ETA, so a long 0%/90% warmup no longer looks opaque.
The warmup bootstrap path now ensures the elb-scripts ConfigMap before creating warmup Jobs, grants the AKS kubelet identity both AcrPull and Storage Blob Data Reader when permissions allow it, and cleans stale .azDownload-* partial directories between retries. This makes warmup Jobs recover cleanly from image-pull, Storage authorization, and interrupted azcopy-copy states.
The BLAST Databases modal now includes an Auto warm preference per database. core_nt is selected by default. Checked databases are stored server-side and warmed by Celery when the AKS workload cluster becomes ready/running, provided the DB is downloaded and not already warming or warm.
Auto warm no longer depends on an open dashboard tab. Dashboard Start passes the current Auto warm preference into the AKS start Celery task, which stores the preference and queues reconciliation after AKS reports started. Celery beat also runs the same reconciliation periodically, so a cluster started from Azure Portal can still trigger warmup after the backend observes it as workload-ready.
When an AKS cluster is stopped, starting, stopping, provisioning, or otherwise not workload-ready, the compact cluster card now shows a concise lifecycle readiness panel instead of the submit/runtime bento. The panel reports the current lifecycle state, next automatic refresh behavior, and static cluster summary without showing misleading Down, submit counts, live activity, or node-metrics rows.
API / IaC Diff Summary¶
- Added
POST /api/warmup/releaseto release Kubernetes warmup resources for a selected database. - Added a direct Kubernetes helper that deletes node-local warmup Jobs and legacy warmup DaemonSets by DB label.
- Extended the frontend monitoring client with
releaseWarmupand optional warmup shard/status fields. - Added server-side Auto warm preferences (
PUT/GET /api/warmup/auto-preference) with Azure Table Storage persistence and a local file fallback for dev. - Added
api.tasks.storage.reconcile_auto_warmup, a Celery task that polls AKS/Kubernetes/Storage state and queues idempotentwarmup_databasetasks for configured DBs. - Added a Celery beat schedule that runs Auto warm reconciliation on the
storagequeue every 60 seconds. - Extended the AKS Start action so the dashboard passes Auto warm context to the
start_aksCelery task; after the cluster starts, the task stores the preference and queues reconciliation. - Replaced the frontend-only cluster-transition warmup trigger with preference sync to the backend.
- Added a non-workload-ready readiness branch in the compact cluster card and suppresses deep operational details until AKS is workload-ready.
- Fixed
/api/warmup/startto publish through the configured Celery app with the explicitstoragequeue, avoiding FastAPI producer context drift. - Fixed
/api/warmup/{instance_id}/statusso a Celery task that completes with a failed payload is reported as failed instead of succeeded. - Added
progress_pct,started_at,elapsed_seconds, andestimated_remaining_secondsto Kubernetes warmup Job aggregation for live ETA display. - Added
active_phase,active_phase_label,active_message,active_last_log,phase_counts, andpod_statusesto Kubernetes warmup status by combining Job state, pod state, and recent warmup pod logs. - Added the idempotent
elb-scriptsConfigMap builder/ensure path for the AKS warmup shell scripts. - Added warmup-time AKS kubelet RBAC ensures for AcrPull and Storage Blob Data Reader using SDK model payloads and deterministic role assignment ids.
- Hardened node-local warmup scripts to remove
.azDownload-*partial directories with recursive cleanup and to report actionable azcopy/download/vmtouch phases. - No IaC changes.
Validation Evidence¶
uv run pytest -q api/tests/test_auto_warmup.py api/tests/test_warmup_route.py api/tests/test_warmup_jobs.pypassed: 20 tests, including concurrent local Auto warm preference saves.uv run ruff check api/services/auto_warmup.py api/tasks/storage.py api/tasks/azure.py api/routes/stubs.py api/tests/test_auto_warmup.py api/tests/test_warmup_route.py --select F821,F401,RUF012passed.npm run buildinweb/passed.npm run buildinweb/passed after the compact cluster readiness-panel update.- Local smoke after restarting API/worker/beat:
GET /api/healthreturned 200,PUT /api/warmup/auto-preferencereturned 200 withstatus=saved, repo-root.logs/local/state/auto_warmup.jsonwas created, andhttp://127.0.0.1:8090/returned 200. - Runtime reconciler smoke with the saved local preference returned
status=completedandcluster_name=elb-clusterwithstatus=not_readywhile AKS was stopped, so it did not enqueue warmup prematurely. - Session browser reload at
http://127.0.0.1:8090/renderedElasticBLAST DashboardandAzure Kubernetes Service Clusterwith no error boundary. - Session browser check while AKS reported
Startingshowed the compact cluster card using the readiness panel, with noSubmit pipeline · 15mbento and noDownhealth pill in the cluster section. - Browser check at
http://127.0.0.1:8090/showed the DB Warmup panel with warm cache capacity (10 nodes · Standard_E16s_v5, memory headroom) and per-DB Warm actions for downloaded databases. - Browser check showed the compact AKS database strip no longer renders sharded/downloaded-only DBs and instead shows
No warmed databases yet...when no node warm cache exists. - Browser check showed
core_ntchecked by default in the BLAST Databases modal Auto warm control, while non-downloaded DBs have Auto warm disabled. - Live route check:
POST /api/warmup/startwith an intentionally unknown debug DB published to Redisstorage, the fresh worker consumed it, andGET /api/warmup/{task}/statusreturnedoutput.status = failedwithunknown database. uv run pytest -q api/tests/test_azure_tasks.py api/tests/test_warmup_jobs.py api/tests/test_warmup_route.pypassed: 25 tests for AcrPull/Storage role payloads, warmup phase inference, stale pod filtering, and warmup route status handling.- Live diagnosis:
core_ntwarmup initially stopped before RAM warming because pods first lackedelb-scripts, then lacked ACR pull permission, then Storage denied shard manifests/data. The live cluster was remediated by creating/updatingelb-scripts, assigning AcrPull and Storage Blob Data Reader to the AKS kubelet identity, and allowing the AKS managed subnet through the Storage account firewall with a Microsoft.Storage service endpoint. - Live validation:
GET /api/monitor/aks/warmup-statusreportedcore_ntasLoading,active_phase=copying_files,nodes_ready=9,nodes_active=1,progress_pct=90,active_message=Downloading shard files with azcopy, and an estimated remaining time while the final shard copied. - Browser validation at
http://127.0.0.1:8090/: the compact AKS database strip renderedcore_nt · copying · 9/10 · ~55s leftwith tooltip detail showingCopying files to node disk,Downloading shard files with azcopy,done 9, andcopying 1. - Node-side validation: inside the active shard-06 warmup pod,
azcopywas running, the node-local shard directory was growing, and/proc/meminfoshowed file-cache activity (Cachedabout 46 GiB,Dirtyabout 18 GiB), explaining whykubectl top nodescan still show low process memory even while warmup is populating Linux page cache.