Auto warmup Ready node guard¶
Motivation¶
AKS can report powerState=Running and the expected ARM node count before every Kubernetes workload node is Ready. The auto-warmup reconciler used ARM readiness only, so it could enqueue a 10-node database warmup while only 8 workload nodes were Ready. warmup_database then logged the mismatch but still created an 8-way warmup plan.
User-facing change¶
Auto warmup now waits until all expected workload nodes are Ready before enqueueing database warmup. If AKS was started from Azure Portal and only part of the workload pool is Ready, the reconcile result records waiting_for_warmup_nodes with expected_node_count, ready_node_count, and the reason waiting for all warmup nodes.
Manual warmup remains intentionally relaxed by default: a user-triggered warmup can still run against the currently Ready nodes. The strict guard is enabled for auto-warmup tasks via require_all_warmup_nodes=true.
API / IaC diff summary¶
- Added a Kubernetes Ready workload-node gate behind
api.tasks.storage.reconcile_auto_warmup; the policy now lives inapi.services.auto_warmup_reconcile. - Auto warmup expected node count is
AutoWarmupPreference.num_nodeswhen configured, otherwise the ARMnode_count. - Added a strict
require_all_warmup_nodesoption towarmup_database; auto-warmup passes it, manual warmup does not. - When strict mode sees fewer Ready nodes than requested,
warmup_databasereturnsstatus=deferred, records phasewaiting_for_warmup_nodes, and does not callbuild_warmup_job_plan. - No IaC changes.
Validation evidence¶
PYTHONPATH=$PWD uv run pytest -q api/tests/test_auto_warmup.py-> 9 passed.PYTHONPATH=$PWD uv run pytest -q api/tests/test_warmup_route.py api/tests/test_warmup_jobs.py-> 29 passed.uv run ruff check api/tasks/storage.py api/tests/test_auto_warmup.py-> all checks passed.uv run ruff check api-> all checks passed.PYTHONPATH=$PWD uv run pytest -q api/tests-> 710 passed.