PR3 — self-upgrade apply + rollback + escape hatch (2026-05-22)¶
Motivation¶
PR1 added discovery, PR2 added build. PR3 wires the actual ARM PATCH that swaps the deployed Container App template to the freshly built images, the rollout watcher / reconciler that finalises the state on the post-restart revision, the rollback API, and the escape-hatch command set the operator runs when the new revision fails to boot.
After PR3, the in-app self-upgrade flow is end-to-end functional from the API surface. UI polish (badge, modal, retention countdown, progress streaming) follows in PR4.
User-facing change¶
Two new mutating endpoints (both UpgradeAdmin-gated):
POST /api/upgrade/rollback— replays the snapshot taken before the upgrade (rollback_target_jsonon the row) by issuing a fresh ARM PATCH targeting the previous image refs. Allowed fromrolling_out,succeeded, andfailed_rollout. Returns the updated state row.GET /api/upgrade/escape-hatch— returns a copy-pasteableaz containerapp updatecommand set (one per container) that the operator can run from anyaz login-ed shell to restore the snapshot. Used when the new revision is unreachable and even the rollback API is dead.
State machine grows the post-PATCH portion:
... -> building -> patching -> rolling_out --> succeeded
\-> failed_rollout
\-> rolling_back -> rolled_back
\-> rollback_failed
reconcile_rolling_out (beat, 60 s) finalises rolling_out to
succeeded when the api's own __version__ matches target_version
(i.e. we're the new revision), or to failed_rollout when the latest
revision is in a terminal failure state or has been stuck longer than
the 15-minute rollout budget.
Backend changes¶
pyproject.toml— addsazure-mgmt-appcontainers==3.1.0.api/services/upgrade/aca_template.py(new) — readsMicrosoft.App/containerAppstemplate, extracts per-role image refs (api/frontend/terminal; theapirole is shared by api/worker/ beat containers), and exposesswap_images()(PATCH for upgrade) andapply_images()(PATCH for rollback).compute_target_images()is the public helper used by tasks to turn a semver into the{api,frontend,terminal}image-ref tuple.api/services/upgrade/rollout_watcher.py(new) — polling helper that returns when a named revision reportsrunning=Running+provisioning=Provisioned, or raisesRevisionUnhealthy/RevisionTimeout. Reserved for PR4 streaming progress; PR3 only uses the cheaperrevision_status()snapshot from the reconciler.api/services/upgrade/escape_hatch.py(new) — pure string builder. Emits oneaz containerapp update --subscription … --name … --resource-group … --container-name … --image …line per container (api,worker,beat,frontend,terminal). Each command carries--subscriptionexplicitly so the operator's default profile is never mutated; noaz account setis issued. No secrets are baked in.api/tasks/upgrade.py— extendsexecute_upgrade_inlinewith the post-build pipeline:building → patching(CAS): record snapshot of current images and target images; log the escape-hatch command set;patching → rolling_out(CAS): commit BEFORE the ARM PATCH so a producing-revision death is recoverable;aca.swap_images(target_version, revision_suffix=…)— the PATCH; on failure jumps tofailed_rollout. Addsstart_rollback_inline,reconcile_rolling_out_inline, and the Celery task wrappers. The reconciler has a 15-minute stuck guard so a perpetually-rolling_out row never blocks future starts.api/routes/upgrade.py— registers/rollbackand/escape-hatchendpoints; both gated byrequire_upgrade_admin.api/celery_app.py— addsupgrade-reconcile-rolling-outbeat entry (60 s).
Test changes¶
api/tests/test_upgrade_aca_template.py(new) — fake ARM resources + client, asserts image extraction, swap_images mutating api/worker/ beat together while leaving sidecars likeredisuntouched, and apply_images on the rollback path.api/tests/test_upgrade_rollout_watcher.py(new) — drives the polling loop with injectednow/sleep/client; covers healthy, transient unhealthy, terminal failure, and timeout.api/tests/test_upgrade_escape_hatch.py(new) — command shape, subscription/rg/app coverage, secret hygiene, env-missing placeholder fallback, and the no-az-account-setinvariant.api/tests/test_upgrade_task.py— extended to the full state machine (queued → ... → rolling_out → succeeded via reconciler) and rollback round-trip; new tests forfailed_rollouton PATCH refusal and rollback-without-snapshot 409.api/tests/test_upgrade_routes.py—/rollbackadmin gate, rollback no-snapshot 409, rollback happy path,/escape-hatchhappy path + no-snapshot 404.
Validation¶
uv run ruff check api/services/upgrade api/routes/upgrade.py api/tasks/upgrade.py api/tests/test_upgrade_*.py— clean.uv run pytest -q api/tests— 1165 passed (no regression vs prior 1143).- End-to-end against a real ACA still requires manual operator
validation (the SDK round-trip cannot be unit-tested cheaply); the
unit tests run the full task with a fake
acasurface.
IaC / infra¶
No Bicep changes. The user-assigned MI already has Contributor on
the workspace RG, which subsumes Microsoft.App/containerApps/write.
Operator setup (additional from PR2)¶
No new env variables. The existing PR2 set (UPGRADE_GIT_REMOTE,
PLATFORM_ACR_NAME, UPGRADE_ADMIN_OIDS) plus the existing
AZURE_SUBSCRIPTION_ID, AZURE_RESOURCE_GROUP, CONTAINER_APP_NAME
(set by azd up) are all PR3 needs.
Known limitations (tracked for PR4)¶
- The rollback path does not yet verify the snapshotted image tags are still resolvable in ACR before issuing the PATCH. If ACR retention expired the rollback PATCH succeeds but ACA fails to pull and the new (rollback) revision crashloops. PR4 ships the retention-countdown UX and adds an ACR data-plane check ahead of rollback.
- The
started_by_oidis overwritten on rollback so the row only carries the most recent actor. A separaterollback_by_oidfield will follow with the PR4 history page.
Out of scope (PR4)¶
- SPA UX (badge, modal, progress streaming, rollback diff, retention countdown).
- History page tailing the upgrade audit blob.
- Major-version (
A) extra confirmation. - ACR retention pre-flight for
/rollback.