AKS Provisioning — Step / Pool / ARM Sub-Progress¶
Motivation¶
The "Provisioning..." banner during AKS cluster creation showed almost nothing: just a phase label (when present) and an elapsed-time counter. Two latent bugs made the experience worse:
provision_akswrote phase strings to theJobStateRepositoryviahelpers.update_state(...)but never called Celery'stask.update_state(state="PROGRESS", meta={...}). The/api/tasks/{id}endpoint surfaces onlyresult.info, so the FE banner effectively saw an emptyprogresspayload — the phase line stayed at the default "Provisioning..." until the cluster appeared in the list.- The long
arm_create_or_updatephase (5–10 minutes) calledpoller.result()synchronously with no sub-progress publish, so the user stared at a flat banner for most of the run with no indication of per-pool progress, ARM state, or cluster visibility.
User-facing change¶
The cluster provisioning banner now renders:
- Step indicator —
Step 3/5 · Creating AKS cluster (5–10 min) · 4m 12s - Sub-message —
AKS state: Creating · ARM 1m 40s(refreshes every 20 s during the ARM phase) - Progress bar — interpolates within the long ARM step using
arm_elapsed_secondsso the bar moves smoothly during the wait. - Per-pool chips —
systempool · Succeeded · 1n/blastpool · Creating · 10n, color-coded (success = green, in-progress = accent, failed = danger).
The banner degrades gracefully when fields are missing — the first poll before the task starts publishing still shows the basic phase + elapsed UX.
API / IaC diff¶
api/tasks/azure/helpers.py:- New
record_task_progress(task, phase, **meta)andpublish_progress( task, job_id, phase, *, step, total_steps, status, message, **extra)helpers. Both are best-effort (catch all exceptions).publish_progresswrites to both the state repo and Celeryresult.infoin one call. api/tasks/azure/provision.py:- New
_PROVISION_STEPSordered list of(phase, label)driving theStep N/Mindicator. - New
_publish(self, job_id, phase, ...)wrapper that fills in step / total_steps / human label automatically. - New
_poll_arm_create(task, poller, ...)that pollspoller.done()every 20 s and publishes a snapshot ofManagedCluster.provisioning_state+ per-AgentPoolprovisioning_stateso the banner can render live pool chips. Olderpollerfakes without.done()skip the loop entirely so the test suite keeps working. - All five phases (
creating_cluster,ensuring_resource_group,arm_create_or_update,ensuring_rbac,completed) now publish via_publish. The duplicateupdate_state(...)write per phase is gone. web/src/components/cards/ClusterCard/useClusterProvisioning.ts:- New
taskProgressstate (fullprogresspayload) returned from the hook. - The task poller now writes
setTaskProgress(progress)on every tick. web/src/components/cards/ClusterCard/ProvisioningBanner.tsx:- New
taskProgressprop andProvisionProgressinterface. - Renders step indicator, sub-message, progress bar, per-pool chips.
web/src/components/cards/ClusterCard/ClusterCard.tsx:- Threads
prov.taskProgressinto the banner.
No infra changes. No new dependencies.
Validation¶
uv run pytest -q api/tests/test_azure_provision_aks.py— 6 passed (includes the newtest_provision_aks_publishes_step_progress_with_pool_stateswhich asserts step / total_steps on every publish and a full pool snapshot during the ARM poll loop).uv run pytest -q api/tests/test_azure_tasks.py api/tests/test_azure_provision_aks.py— 11 passed.- Full backend regression: 1392 passed, 1 unrelated pre-existing failure
(
test_response_contracts.py::test_preflight_returns_admission_decisionis broken by uncommitted edits inapi/routes/blast/preflight.pythat predate this change). uv run ruff check api/tasks/azure/provision.py api/tasks/azure/helpers.py api/tests/test_azure_provision_aks.py— All checks passed.cd web && npm run build— built in 6.16 s, no TypeScript errors.