PR2: self-upgrade build pipeline (2026-05-22)
Motivation
PR1 added the read-only surface. PR2 lands the build half of the
self-upgrade flow: an operator clicks "Upgrade", and the deployed app
clones the requested git tag into the terminal sidecar and runs
az acr build for each of elb-api, elb-frontend, elb-terminal.
The ARM PATCH that swaps the Container App template to the new images
remains deferred to PR3. PR2 stops in state=succeeded after the last
image is pushed, with no traffic impact, so it is safe to deploy even
though the visible UI (PR4) isn't there yet: the routes sit idle until
a client POSTs to /api/upgrade/start.
User-facing change
Backend-only. Once the operator sets both UPGRADE_GIT_REMOTE (from
PR1) and PLATFORM_ACR_NAME, and lists their oid in
UPGRADE_ADMIN_OIDS, two new mutating endpoints become available:
- POST /api/upgrade/start: body {target_version, target_sha?, confirm_downtime: true}. Returns 202 Accepted with the queued state row. Requires the UpgradeAdmin role gate. Refuses to start a second upgrade while one is in flight (409 Conflict). Refuses without confirm_downtime=true (422).
- GET /api/upgrade/jobs/{job_id}/build-log/{component}: streams the per-component Append Blob captured during the build. component must be api, frontend, or terminal. Admin-gated.
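The admin gate on these endpoints can be pictured with a minimal sketch. It assumes the decoded token claims arrive as a plain dict with `roles` and `oid` keys, and that oids are compared case-insensitively; the helper names `parse_admin_oids` and `is_upgrade_admin` are hypothetical and are not the actual require_upgrade_admin dependency.

```python
def parse_admin_oids(raw: str) -> set[str]:
    """Split a comma-separated oid list, dropping blanks.

    Lower-casing is an assumption of this sketch, not confirmed behaviour.
    """
    return {part.strip().lower() for part in raw.split(",") if part.strip()}


def is_upgrade_admin(claims: dict, admin_oids_env: str) -> bool:
    """Return True if the claims carry the role or an allowlisted oid."""
    # Path 1: the role claim.
    roles = claims.get("roles") or []
    if "UpgradeAdmin" in roles:
        return True
    # Path 2: the bootstrap env allowlist (value of UPGRADE_ADMIN_OIDS).
    oid = str(claims.get("oid", "")).strip().lower()
    return bool(oid) and oid in parse_admin_oids(admin_oids_env)
```

In a real FastAPI dependency the second argument would presumably come from the environment, and a False result would map to the 403 response.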
State row gains transitions: idle → queued → fetching → building →
succeeded for the happy path, and → failed_pre for any pre-PATCH
failure (no customer impact, because no PATCH has been issued yet).
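The transitions above can be read as a chain of compare-and-set steps, where each step names the state it expects to find. The following is a rough in-memory sketch under that assumption; `InMemoryStateRow` and its `cas` method are hypothetical stand-ins, and the real state row and its storage backend may look quite different.

```python
import threading


class StateTransitionRefused(Exception):
    """Raised when the row is not in the state the caller expected."""


class InMemoryStateRow:
    """Hypothetical in-memory stand-in for the persisted state row."""

    def __init__(self, state: str = "idle") -> None:
        self._state = state
        self._lock = threading.Lock()

    @property
    def state(self) -> str:
        return self._state

    def cas(self, expected: str, new: str) -> None:
        """Move to `new` only if the row is currently in `expected`."""
        with self._lock:
            if self._state != expected:
                raise StateTransitionRefused(
                    f"expected {expected!r}, found {self._state!r}"
                )
            self._state = new


# Happy-path edges as listed in the state-row description.
HAPPY_PATH = [
    ("idle", "queued"),
    ("queued", "fetching"),
    ("fetching", "building"),
    ("building", "succeeded"),
]
```

Walking `HAPPY_PATH` with `row.cas(src, dst)` ends in succeeded, while a second `cas("idle", "queued")` on a row that is already queued raises `StateTransitionRefused`, which is the kind of refusal a 409 response would surface.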
Backend changes
- terminal/exec_server.py: ALLOWED_BIN gains git. The terminal sidecar now permits git clone … and git -C … config … invocations from the api/worker callers via the existing exec-token-gated loopback channel.
- api/services/terminal_exec.py: docstring updated to mirror the new allowlist.
- api/services/upgrade/state.py: adds cas_state() and StateTransitionRefused so transitions enforce a precondition. The idle → queued gate prevents two operators from racing into a parallel upgrade.
- api/services/upgrade/auth.py (new): require_upgrade_admin FastAPI dependency. Admin signal is either an MSAL roles claim (UpgradeAdmin) or the caller oid appearing in UPGRADE_ADMIN_OIDS (comma-separated). The env path is the bootstrap so an operator with no App Registration changes can still use the feature.
- api/services/upgrade/git_workspace.py (new): drives git clone --depth 1 --single-branch --branch v<ver> through the terminal sidecar, into the absolute path /tmp/elb-upgrade/<job_id> (outside the exec server's owned temp dir so the clone survives the request). After cloning it scrubs remote.origin.url via git config to strip any embedded credentials, for forward compatibility with the PR3 PAT flow.
- api/services/upgrade/build_logs.py (new): Append Blob writer for per-component build logs (upgrade-logs/<job_id>/build-<c>.log). Swappable in-memory backend for tests; refuses to construct outside tests without the explicit opt-in env.
- api/services/upgrade/image_builder.py (new): build() runs az acr build --registry $PLATFORM_ACR_NAME --image elb-<c>:vA.B.0 --file <dockerfile> <source_dir> through terminal_exec.stream, forwarding every output line into the build log blob. Builds run sequentially by design; parallelisation lands in PR4.
- api/tasks/upgrade.py: adds start_upgrade_inline (CAS gate + enqueue) and execute_upgrade / execute_upgrade_inline (the worker pipeline). On any pre-PATCH failure the row is moved to failed_pre via CAS so a concurrent writer in a later state isn't overwritten.
- api/routes/upgrade.py: registers POST /start and GET /jobs/{job_id}/build-log/{component}. The start handler runs start_upgrade_inline, which auto-rolls back to idle if the broker enqueue itself fails. Build-log responses pass blob bytes through as text/plain directly (no SAS).
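The clone, credential-scrub, and build commands described in the list above could be assembled roughly as follows. This is only a sketch: the function names, the validation regexes, and the `v{version}` image tag format are assumptions made for illustration, and the real modules route these argv lists through the terminal sidecar rather than returning them.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Assumed input formats; the real validators may be stricter or looser.
_VERSION_RE = re.compile(r"^\d+\.\d+\.\d+$")
_JOB_ID_RE = re.compile(r"^[A-Za-z0-9_-]{1,64}$")
COMPONENTS = ("api", "frontend", "terminal")


def clone_argv(remote: str, version: str, job_id: str) -> list[str]:
    """Build the shallow, single-branch clone command for one release tag."""
    if not _VERSION_RE.match(version):
        raise ValueError(f"bad version: {version!r}")
    if not _JOB_ID_RE.match(job_id):
        raise ValueError(f"bad job_id: {job_id!r}")
    dest = f"/tmp/elb-upgrade/{job_id}"
    return ["git", "clone", "--depth", "1", "--single-branch",
            "--branch", f"v{version}", remote, dest]


def scrub_credentials(url: str) -> str:
    """Drop any user:password@ part from an http(s) remote URL."""
    parts = urlsplit(url)
    if parts.username is None and parts.password is None:
        return url
    host = parts.hostname or ""
    if parts.port:
        host = f"{host}:{parts.port}"
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))


def scrub_argv(dest: str, remote: str) -> list[str]:
    """Build the git config call that rewrites remote.origin.url."""
    return ["git", "-C", dest, "config", "remote.origin.url",
            scrub_credentials(remote)]


def acr_build_argv(registry: str, component: str, version: str,
                   dockerfile: str, source_dir: str) -> list[str]:
    """Build the az acr build command for one component image."""
    if component not in COMPONENTS:
        raise ValueError(f"unknown component: {component!r}")
    return ["az", "acr", "build", "--registry", registry,
            "--image", f"elb-{component}:v{version}",
            "--file", dockerfile, source_dir]
```

Passing argv lists instead of shell strings keeps the tag and job id from being interpreted by a shell, which is one reason to validate them up front.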
Test changes
- api/tests/test_upgrade_git_workspace.py: argv shape, exit-code handling, version/job_id validation, cleanup safety guard, and the credential-scrub round-trip (x-access-token:supersecret@… → masked back into the cloned repo's remote.origin.url).
- api/tests/test_upgrade_build_logs.py: name validation, append semantics, write_lines iterator, and the buffer-retention path on backend failure.
- api/tests/test_upgrade_image_builder.py: happy path argv + log capture, non-zero exit propagation, version/env guards, sequential iteration over the three components.
- api/tests/test_upgrade_task.py: full state-machine walk via the enqueue injection seam (no Celery worker required), double-start refusal (409 path), failed_pre on remote-unset, clone failure, and build failure.
- api/tests/test_upgrade_routes.py: /start admin gate, confirm_downtime enforcement, queued + enqueued result, second-call conflict, /build-log happy path, 404, 400 on invalid component, 403 on missing admin.
Validation
- uv run ruff check api/services/upgrade api/routes/upgrade.py api/tasks/upgrade.py api/tests/test_upgrade_*.py terminal/exec_server.py: clean.
- uv run pytest -q api/tests/test_upgrade_*.py: 57 passed.
- uv run pytest -q api/tests: 1143 passed (no regression vs prior 1114).
- End-to-end smoke deferred until PR3 wires the ARM PATCH; PR2's pipeline is exercised purely via unit tests because a real az acr build invocation takes minutes and depends on the platform ACR being reachable from the test environment.
IaC / infra
No Bicep changes. RBAC already covers everything PR2 needs:
- acrPush (existing): az acr build requires push on the registry.
- Contributor on the workspace RG (existing): covers the Storage Blob append for build logs.
Operator setup
To exercise the build pipeline once PR2 is deployed:
- Set UPGRADE_GIT_REMOTE to the git remote that hosts the release tags (e.g. https://github.com/<org>/elb-dashboard.git).
- Set PLATFORM_ACR_NAME to the platform ACR name (without the .azurecr.io suffix).
- Set UPGRADE_ADMIN_OIDS to the comma-separated oid(s) permitted to start/rollback upgrades. A future PR replaces this with an App Registration role claim; the env stays as bootstrap.
- From the SPA (or with curl), POST /api/upgrade/start with {target_version, confirm_downtime: true}. PR4 wires the modal.
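For operators scripting the last step instead of using curl, the request could be built along these lines. This is a sketch under assumptions: the base URL is a placeholder, bearer-token authentication is assumed rather than confirmed, and `build_start_request` is a hypothetical helper that only constructs the request without sending it.

```python
import json
import urllib.request
from typing import Optional


def build_start_request(base_url: str, bearer_token: str, target_version: str,
                        target_sha: Optional[str] = None) -> urllib.request.Request:
    """Construct (but do not send) the POST /api/upgrade/start request."""
    body = {"target_version": target_version, "confirm_downtime": True}
    if target_sha is not None:
        body["target_sha"] = target_sha  # optional field in the request body
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/upgrade/start",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumption: the API accepts a bearer token for the caller.
            "Authorization": f"Bearer {bearer_token}",
        },
    )
```

Passing the returned object to `urllib.request.urlopen` would send it; per the endpoint description, a 202 response carries the queued state row, and 409 or 422 signal an in-flight upgrade or a missing downtime confirmation.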
Out of scope (PR3 / PR4)
- PR3: aca_template snapshot, applier (ARM PATCH), rollout_watcher, rollback, escape_hatch. Drives the actual revision swap and the succeeded → rolling_out → succeeded | failed_rollout transitions.
- PR4: SPA UX (badge, modal, progress streaming, rollback diff, retention countdown). ACR retention guidance in docs/.