Skip to content

Self-upgrade — 40-point critique hardening (2026-05-22)

Motivation

Widened the critique surface from 20 to 40 items covering auth, state machine, ARM/ACR, git/build, storage, SPA UX, ops/docs, and tests. Picked up four High/Medium items that the 20-point pass had missed plus the obvious doc + storage-retention gaps.

Findings + fixes

High → Hardened

#6 — enqueue_failed exception object on phase_detail could leak broker credentials (redis URL with embedded password). api/tasks/upgrade.py::start_upgrade_inline now records only the exception class name and runs it through api.services.sanitise.sanitise. Full traceback still goes to the api log via LOGGER.exception.

Medium → Hardened

#3 — UpgradeAdmin role match was case-sensitive. api/services/upgrade/auth.py::is_upgrade_admin casefolds both sides so upgradeadmin, UPGRADEADMIN, … all resolve. The oid allowlist stays case-sensitive because AAD canonicalises GUIDs.

#5 + #26 + #27 — SPA spammed the browser console with 403s when a non-admin opened /upgrade. web/src/pages/UpgradePage.tsx::refreshAll now probes the escape-hatch endpoint first; if it 403s the page sets adminBlocked=true and short-circuits the rollback-preflight request. The build-log viewer now also skips polling when jobId is empty (web/src/components/BuildLogViewer.tsx).

#10 — rolling_out could be locked for up to 15 minutes when the producing worker crashed between the state commit and begin_update. api/tasks/upgrade.py::reconcile_rolling_out_inline adds a 2-minute fast-fail (PATCH_NEVER_LANDED_GRACE_SECONDS): if the row has been rolling_out longer than that AND the live ACA template still carries the old api image ref, the row moves straight to failed_rollout. The old 15-minute stuck guard remains as the upper bound.

#24 — upgrade-logs / upgrade-history containers have no retention policy. docs/user-guide/upgrades.md now ships a copy-pasteable az storage account management-policy create example that targets both prefixes with a 180-day expiry.

#32 — docs/troubleshooting.md had no self-upgrade section. Added five concrete scenarios: header badge never appears, Start 403, Start 409, rolling_out past budget, "ACR no longer carries the snapshotted tags", and empty build log.

#36 — test_version.py reload could pollute downstream tests with a stray APP_VERSION env or a stale api.__version__. Added an autouse fixture that pops APP_VERSION and reloads api after every test in the file.

Low — Logged, no code change

The remaining 32 items are bounded by Azure-side guarantees, by the existing throttles, or are explicit operational tradeoffs (e.g. the /tmp/elb-upgrade accumulation that the next sidecar restart wipes). Test gaps (UpgradePage / BuildLogViewer unit tests, integration test) are noted for a follow-up — the public route + service unit tests already cover the contract.

Tests

No new test files; the existing suite continues to exercise the changed code paths.

Validation

  • uv run ruff check api/services/upgrade api/routes/upgrade.py api/tasks/upgrade.py api/tests/test_upgrade_*.py api/tests/test_version.py — clean.
  • uv run pytest -q api/tests — 1201 passed.
  • cd web && npm run build && npm run lint — clean.