2026-05-22 — Workspace degraded-state diagnostics¶
Motivation¶
The dashboard's monitor cards (AKS, Storage, ACR) silently rendered
"Ready" / "OK" status chips even when the backend's _graceful helper had
swallowed an ARM 401/403/404 and returned an empty {"degraded": true,
"degraded_reason": "http_401"} payload. Users wasted time investigating an
apparently-healthy dashboard whose underlying calls were actually failing.
The trigger that surfaced the bug: a developer with two Azure CLI profiles
(az-jungha / az-moonchoi) had their az login session on tenant A but the
SPA's saved workspace config pointed at a subscription in tenant B. ARM
returned InvalidAuthenticationTokenTenant; the backend logged the error but
the SPA never showed anything actionable.
User-facing change¶
- Per-card status chips now reflect the actual ARM result. When the
backend returns a degraded payload,
MonitorCardshows reason-specific labels (Wrong tenant,No access,Not found,Network blocked, …) with the full description in the tooltip, replacing the misleadingOK/Ready. WorkspaceDiagnosticsBannerrenders above the dashboard grid when the failure is workspace-wide: anyauth_wrong_tenanton any card, or two or more cards reporting auth /not_found. A single forbidden leaf resource still defers to its own card.- The banner has a Reset workspace button that clears the saved
SetupWizardconfig fromlocalStorageand re-runs the wizard, plus a per-reason dismiss button.
Backend change¶
api/routes/monitor/common.py extracts the reason classifier into
_classify_exception. Stable degraded_reason codes are now part of the
SPA contract:
| Reason | When |
|---|---|
auth_wrong_tenant |
ARM 401 with InvalidAuthenticationTokenTenant / AADSTS50020 markers |
unauthorized |
Other ARM 401 |
forbidden |
ARM 403 |
not_found |
ARM 404 / ResourceNotFoundError |
azure_error |
Other AzureError |
http_<status> |
Other HTTP errors |
<ClassName> |
Unknown exception (forward-compat) |
No new endpoint, no new permissions, no Bicep change.
SPA change¶
- New shared helper
web/src/utils/monitorDegraded.tstranslating the taxonomy into UI labels + aggregating across cards. - New helper
web/src/components/cards/cardStatusOverride.tsmapping a degraded payload into aMonitorCard.statusOverrideprop. MonitorCardaccepts the newstatusOverrideprop; when present it wins over the regularstatuschip but the colour tones reuse the existing glass-theme chip palette (gt-g/o/r/m).AcrCard/StorageCard/ClusterCardopt in to the override.- New
WorkspaceDiagnosticsBannercomponent sits inDashboard.tsxbetween the header and the grid. It re-uses the existingmonitoringApi.aks/storage/acrqueries (TanStack Query dedupes byqueryKey, so no extra network calls) and consumes the aggregated diagnostics.
Validation¶
$ uv run pytest -q api/tests/test_monitor_graceful.py api/tests/test_route_contracts.py api/tests/test_smoke.py
96 passed in 5.39s
$ uv run ruff check api/routes/monitor/common.py api/tests/test_monitor_graceful.py
All checks passed!
$ cd web && npx vitest run src/utils/monitorDegraded.test.ts
Test Files 1 passed (1)
Tests 13 passed (13)
$ cd web && npm run build
✓ built in 6.76s
Manual smoke (about to be re-verified live):
- With
azlogged into tenant A and the SetupWizard pointing at a subscription in tenant B, the banner shows "Wrong Azure tenant for the selected subscription" with a Reset workspace button. Cards show "Wrong tenant" chip instead of "OK". - Reset workspace clears
localStorageand re-runs the wizard.
Follow-ups (separate PR)¶
SetupWizardpre-validation: probe RG/ACR/Storage names against ARM before saving them, to prevent stale settings from being written in the first place./api/meexpose the list of subscriptions visible to the current credential so the wizard can show a dropdown instead of a free-form text field.