AKS Provisioning UX Overhaul — P0+P1+P2 (preflight rigour, modal lifecycle, error card, live preflight, smart defaults)¶
Motivation¶
Previous flow (still on the dashboard until this change):
- User picks defaults (
Standard_E16s_v5× 10 +Standard_D2s_v3× 1 = 162 vCPUs) inkoreacentralwhere their subscription has 100 regional vCPU quota. Pre-flight ran and surfaced a yellowQuota may be insufficientrow — but classified the shortfall aswarn, not blocking submit. - User clicked Create → modal closed immediately on
aksApi.provision()acceptance → dashboard banner showed "Provisioning…" for ~1 m 10 s. - ARM rejected with
ErrCode_InsufficientVCPUQuota: ... left regional vcpu quota 100, requested quota 162. - The dashboard rendered a raw red wall of text (full Azure SDK error,
verbatim, with duplicated
Code:/Message:tail) above the cluster list. Form values from the modal were gone — to retry the user had to re-open the modal and re-enter everything.
Three categories of fault combined to produce this UX:
- A. Pre-flight semantics were too lenient (quota = warn).
- B. Modal lifecycle closed before ARM validation finished.
- C. Error rendering was raw and offered no actionable next step.
User-facing change¶
The Create AKS Cluster flow is now:
- Live pre-flight — the moment the modal opens (and on every form
edit, debounced 500 ms), the SPA calls
/api/aks/preflightand renders the result inline. Pre-flight is not triggered by the cluster-name field (name has no effect on SKU/quota/RG checks). - Smart default — if the very first pre-flight result returns a
quota failure that has a workable fit (e.g. needs 162 vCPUs, can
only fit 6 nodes of the chosen SKU), the modal auto-applies that
max_blast_nodes_fitto the Node Count input as long as the user has not manually touched it. The result is the next pre-flight passes without a click. - Quota = fail (blocking) — once the user does edit the form
into an infeasible state, the Compute quota row turns red and the
Create Clusterbutton readsFix errors aboveand refuses submit. The row includes an inlineApply N nodesbutton (or a "no node count fits" hint when the SKU alone exceeds the headroom). - Live cores total — under the SKU description, the modal now
reads
16 cores, 128 GB RAM, E-v5 memory · 10 × 16 = 160 cores totalso the user can see at a glance which knob to turn. - Modal stays open through ARM validation — clicking Create no longer closes the modal. A compact "Provisioning · Step ⅗ · arm_create_or_update · 0m 45s" panel renders in the footer until either:
- the task publishes
cluster_statefor the first time (ARM accepted), at which point the modal auto-closes and the dashboard banner takes over; or - the task fails — the modal stays open with all form values intact and renders the new error card. Closing the modal manually (ESC or backdrop) during this window shows a confirm dialog so an accidental key press doesn't yank the live progress out from under the user.
- Structured error card — replaces the raw red text. The card carries:
- title + classifier-driven one-line summary
(
Quota too small in koreacentral — needs 162 vCPUs, you have 100 free.) - category-specific secondary message
- action buttons (Request quota increase ↗ that deep-links to the subscription/region quota blade, SKU/region docs links, Edit & retry, Dismiss)
- raw Azure response folded into a
<details>accordion for debugging only - The same card is used inside the modal and on the dashboard (when the user closed the modal before the error landed).
API / IaC diff¶
Backend¶
api/services/aks_availability.pyrun_provision_preflight: quota shortfall now emitsstatus="fail"(waswarn). The row'sdetailscarrymax_blast_nodes_fit(largest blast node count that fits under the binding family/total cap, accounting for system pool overhead) plusblast_cores_per_node,system_cores_total,binding_familyso the FE can render an "Apply N nodes" button.overall_okis now derived from rendered rows (any status=="fail") so future additions offailpaths can't drift out of sync with the manual flag.- No new routes; the existing
/api/aks/preflightpayload picks up the richerdetailsautomatically. api/tests/test_aks_availability.pyaddstest_run_provision_preflight_fails_when_quota_shortwhich asserts the new contract (status="fail",binding_family="Total Regional vCPUs",max_blast_nodes_fit=6for the canonical reproduction scenario).
Frontend¶
New / changed files under web/src/components/cards/ClusterCard/:
- NEW
armErrorClassifier.ts+.test.ts— regex-based ARM error classifier returning{category, summary, details, actions[]}. Categories:quota,sku_blocked,region,rg_permission,auth,unknown. The classifier: - extracts requested-vs-free numbers from quota messages,
- strips the "Provisioning task failed:" wrapper our poller adds,
- drops the duplicated
Code:/Message:tail Azure emits, - deep-links to the subscription + region scoped quota blade
(
portal.azure.com/.../QuotaMenuBlade/myQuotas) instead of the generic docs URL. 5 vitest cases lock in the regex contracts. - NEW
ProvisionErrorCard.tsx— reusable structured failure card used in both the modal (when the task fails) and the dashboard (when the user closes the modal before the error lands). - MODIFIED
useClusterProvisioning.ts: - removed the
closeModal()call fromhandleProvision; modal close is now driven from auseEffectthat watchestaskProgress.cluster_state(or the cluster appearing in the list). - added live debounced pre-flight (500 ms, modal-open gated, silently swallows failures so the Create-flow pre-flight remains the authoritative path).
- added smart-default effect that applies
max_blast_nodes_fiton the first failing live pre-flight whilenodeCountUserTouched === false. - added
resetError()(clears error + invalidates cached pre-flight so the next Create runs fresh). - added
modalOpenprop so the live pre-flight effect doesn't run when the modal is closed. - MODIFIED
ProvisionModal.tsx: - new props:
taskPhase,taskProgress,elapsed,subscriptionId,onErrorReset. - new compact "Live provisioning" panel rendered while
provStatus === "creating". - error rendering replaced with
<ProvisionErrorCard>. - quota fail row in the preflight check list gains the
Apply N nodesbutton (and a "no node count fits" hint whenmax_blast_nodes_fit === 0). - SKU description gains the running cores total
(
10 × 16 = 160 cores total). - ESC + backdrop dismiss both require a confirm while
provStatus === "creating". - MODIFIED
ClusterCard.tsx: - passes the new props through.
- the dashboard
provErrorslot now renders the sameProvisionErrorCard(only when the modal is closed — we don't want both surfaces to render the same error).
No infra / Bicep / Celery task changes. No new dependencies.
Validation¶
uv run pytest -q api/tests/test_aks_availability.py api/tests/test_azure_provision_aks.py api/tests/test_azure_tasks.py— 19 passed.uv run ruff check api/services/aks_availability.py api/routes/aks/preflight.py api/tasks/azure/provision.py api/tests/test_aks_availability.py— All checks passed.cd web && npx vitest run src/components/cards/ClusterCard/armErrorClassifier.test.ts— 5 passed.cd web && npm run build— built in 5.78 s, no TypeScript errors.- Manual scenario (the exact reproduction from the user report):
- Open Create modal on a subscription with 100 regional vCPU quota in koreacentral → the Node Count auto-snaps from 10 to 6 within ~1 s as the live preflight runs and the smart default applies.
- Click Create → modal stays open, live progress panel ticks
through
creating_cluster→ensuring_resource_group→arm_create_or_update→ as soon as ARM publishescluster_state="Creating"the modal auto-closes and the dashboard banner takes over. - If the user instead manually picks
10(touching nodeCount) and clicks Create, the form rejects submit with a structured "Quota too small" card carrying anApply 6 nodesbutton.
Deferred (P3 — not in this change)¶
Documented for the next pass:
- Cancel provisioning (
/api/aks/cancel-provision/{task_id}+ Celeryrevoke()+ partial-resource cleanup). - Failed-provision persistence across browser reloads (the JobStateRepository already has the row; the dashboard just doesn't surface "last failed task in 24 h" on first paint).
- Quota deep-link blade name verification — the classifier links
Microsoft_Azure_Capacity/QuotaMenuBlade/myQuotaswhich is the current portal blade; Azure occasionally renames blades. If a future Azure portal rev makes the link 404, fall back tohttps://aka.ms/quotas/view-quotas.