CLI Rolling Update (`git pull` + build + deploy)¶

This page is the workstation-driven path for rolling out new code to a deployed dashboard. It is paired with the script scripts/dev/cli-upgrade.sh.

Quick rolling update (TL;DR)

Pick the path that matches who you are. Both deploy all six sidecars (api / worker / beat / frontend / terminal / redis) by rebuilding the three custom images (elb-api, elb-frontend, elb-terminal) and swapping the Container App template via postprovision.sh.

Operator (not editing code) | Code contributor (deploying local edits)

Operator path: you are deploying a tagged release from origin/main (or a release branch) without local edits. This is the safest path: the SPA header vA.B.<build> · <short-sha> will match exactly what is in git, so future "what shipped?" questions are trivial to answer.

# 1. Land the release on your workstation.
git fetch --tags origin
git checkout main && git pull --ff-only

# 2. Preview the plan (no build, no PATCH).
scripts/dev/cli-upgrade.sh full --dry-run

# 3. Deploy.
scripts/dev/cli-upgrade.sh full --yes

No --allow-dirty: the script refuses to proceed if the tree is dirty, which is exactly the guardrail you want.
--pull is intentionally not passed in step 3 — you already pulled in step 1 and saw what landed.

Code contributor path: you are iterating on api/, web/, terminal/, or infra/ and want to ship the working tree. Commit first so the SPA header SHA matches the deployed code (az acr build packages whatever is on disk regardless of git state; see § "Working tree, git, and the SPA header").

# 1. Commit (or stash, then unstash after deploy).
git add -A && git commit -m "feat(scope): summary"

# 2. Preview the plan.
scripts/dev/cli-upgrade.sh full --dry-run

# 3. Deploy.
scripts/dev/cli-upgrade.sh full --yes

If you absolutely must deploy uncommitted edits (e.g. a quick production hotfix that you will commit immediately afterwards), add --allow-dirty to acknowledge the SHA mismatch:

scripts/dev/cli-upgrade.sh full --allow-dirty --yes

Then commit the same diff right after the deploy succeeds, and record the commit SHA in the per-feature change note under docs/features_change/.

Only edited api/ code (no infra/, terminal/, or sidecar layout change)? The faster api scope rebuilds one image and patches api+worker+beat in ~60 s:

scripts/dev/cli-upgrade.sh api --yes

For either path: snapshot + /api/health/ready poll + auto-rollback still run. Raise the budget with --health-timeout 300 when the terminal sidecar was rebuilt or the app was scaled to zero.

If you see RBAC errors after the upgrade (e.g. cluster create fails)

The dashboard's user-assigned managed identity is granted only Reader at subscription scope by design. The very first SPA "Create Cluster" click on a fresh subscription fails with AuthorizationFailed on Microsoft.Resources/subscriptions/resourcegroups/write because the future cluster RG (e.g. rg-elb-cluster) does not exist yet and the MI has no write permission at subscription scope. The fix is a single bootstrap command that creates the RG and grants the MI Contributor + User Access Administrator at that RG scope only (no escalation to subscription-scope Contributor):

cd ~/dev/elb-dashboard
bash scripts/dev/grant-runtime-rbac.sh \
  --cluster-rg rg-elb-cluster \
  --region koreacentral \
  --yes

The script is idempotent — re-running on an already-correct setup exits in under two seconds with skipped=N. After it succeeds, wait 1–5 minutes for Azure RBAC propagation, then click Edit & retry in the SPA error card. Detail in Identity Architecture § 6.1 and the grant-runtime-rbac.sh --help output.

For a comprehensive RBAC audit (all expected {scope, role} pairs for the dashboard MI), run the doctor:

bash scripts/dev/check-mi-rbac.sh             # read-only audit
bash scripts/dev/check-mi-rbac.sh --auto-fix  # opt-in: also grant missing

cli-upgrade.sh full invokes the doctor automatically after a successful health check; pass --auto-fix-rbac to let it grant the missing roles in-line under your current az login identity.
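As a rough illustration of what such an audit checks, the sketch below compares the role names assigned at the cluster RG scope against the two roles this page says the MI needs there. It is not the logic of check-mi-rbac.sh: the function name missing_roles and the variables MI_PRINCIPAL_ID and CLUSTER_RG_ID are hypothetical and exist only for this example.

```shell
# Hypothetical helper, not part of check-mi-rbac.sh.
# Reads assigned role names (one per line) on stdin and prints the
# expected RG-scope roles that are absent.
missing_roles() {
  local actual role
  actual="$(cat)"
  for role in "Contributor" "User Access Administrator"; do
    printf '%s\n' "$actual" | grep -Fxq -- "$role" || printf '%s\n' "$role"
  done
  return 0
}

# Example (assumes you have set MI_PRINCIPAL_ID and CLUSTER_RG_ID yourself):
#   az role assignment list --assignee "$MI_PRINCIPAL_ID" --scope "$CLUSTER_RG_ID" \
#     --query "[].roleDefinitionName" -o tsv | missing_roles
```

Empty output means both roles are present at that scope. Remember that a fresh grant can take a few minutes to propagate before it shows up.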

Prefer the in-browser upgrade when possible

The browser-driven In-app Upgrade does the same thing without a workstation: it polls the configured git remote for a new release tag, runs az acr build for the three sidecar images, PATCHes the Container App template, and auto-rolls back on failure. Use the CLI path only when that flow is not available.

When to use which path¶

Situation	Use
`UPGRADE_GIT_REMOTE` is configured and the SPA is reachable	In-app Upgrade — no shell needed.
In-app upgrade is disabled (`UPGRADE_GIT_REMOTE` unset) or no `UpgradeAdmin` is available	`cli-upgrade.sh <scope>` from a workstation that has `az login`.
Sidecar layout / probes / scale rules changed (anything outside container images)	`cli-upgrade.sh full` — runs the full `postprovision.sh` template swap.
The SPA is down — the browser cannot drive a rollback	`cli-upgrade.sh rollback` against the snapshot file.
You only edited code in `api/` and want a 60-second cycle	`quick-deploy.sh api` directly (no snapshot envelope).
You need to refresh all three custom images (api+worker+beat / frontend / terminal) but did not touch sidecar layout / Bicep / secrets	`quick-deploy.sh all` — three parallel `az acr build` jobs + per-container PATCH (sequential, to avoid a Container App revision read-modify-write race), no template swap, no snapshot envelope. Faster than `cli-upgrade.sh full` (skips the Bicep redeploy) but also skips the snapshot + `/api/health` auto-rollback safety net.
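The scope choice in the table can be approximated from the list of changed paths. The sketch below is a hypothetical helper, not something the repository ships. It assumes that web/ maps to the frontend scope and that any infra/ change, or a change spanning more than one image, calls for full. Bicep changes may additionally need azd provision, as noted under "What this script does not do".

```shell
# Hypothetical helper: read changed paths on stdin (one per line) and
# print a suggested cli-upgrade.sh scope. The directory-to-scope mapping
# is an assumption made for this sketch.
suggest_scope() {
  local scopes path scope
  scopes=""
  while IFS= read -r path; do
    case "$path" in
      infra/*)    echo full; return 0 ;;   # layout / Bicep change: full path
      api/*)      scope=api ;;
      web/*)      scope=frontend ;;
      terminal/*) scope=terminal ;;
      *)          continue ;;              # docs, scripts, etc. need no image
    esac
    # Record each scope once.
    case " $scopes " in
      *" $scope "*) ;;
      *) scopes="$scopes $scope" ;;
    esac
  done
  set -- $scopes
  case $# in
    0) echo none ;;
    1) echo "$1" ;;
    *) echo full ;;                        # several images changed together
  esac
}

# Example:
#   git diff --name-only origin/main...HEAD | suggest_scope
```

Treat the result as a starting point only: a probe or scale-rule change is not visible from path names alone.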

What the script does (envelope around `quick-deploy.sh` / `postprovision.sh`)¶

%%{init: {"theme": "base"}}%%
flowchart TD
    preflight["Preflight: az login · azd env · clean tree"] --> pull{"--pull?"}
    pull -- yes --> gp["git pull --ff-only (refuses non-fast-forward)"]
    pull -- no --> snap
    gp --> snap["Snapshot current revision + per-sidecar image refs → /tmp/elb-upgrade-snapshot-<app>.json"]
    snap --> confirm{"--yes?"}
    confirm -- no --> ask[/"Proceed? [y/N]"/]
    confirm -- yes --> deploy
    ask -- y --> deploy["Dispatch: api/frontend/terminal → quick-deploy.sh, full → postprovision.sh"]
    deploy --> health{"Poll /api/health/ready ≤ --health-timeout"}
    health -- 200 --> done(["✓ healthy, print rollback hint"])
    health -- timeout --> ar{"--auto-rollback?"}
    ar -- yes --> restore["Re-PATCH per-sidecar image from snapshot"]
    ar -- no --> manual["Print manual rollback command, exit 1"]
    restore --> health2{"Poll /api/health/ready"}
    health2 -- 200 --> rolled(["⚠ rolled back, exit 1"])
    health2 -- timeout --> dead(["✗ rollback failed, exit 1"])

Working tree, git, and the SPA header¶

az acr build packages the current working tree (filtered by .dockerignore) as the build context. It does not care whether files are staged, committed, or pushed — whatever is on disk at build time goes into the image. --allow-dirty only suppresses the dirty-tree guardrail; it does not change what gets packaged.

The SPA header vA.B.<build> · <short-sha> is resolved on the build host by scripts/dev/quick-deploy.sh and scripts/dev/postprovision.sh and passed to az acr build as --build-arg. The short-sha comes from git rev-parse --short HEAD, i.e. the last commit. Consequence:

Pre-deploy git state	Code shipped	SPA header SHA	Traceability
Clean (committed)	HEAD	matches HEAD	✅ trivial — `git show <sha>` reproduces it
Dirty (`--allow-dirty`)	working tree	matches previous HEAD	⚠ header lies — diff exists only on your laptop
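To see which row of the table applies before you deploy, the two git facts can be combined into one line. This is a minimal sketch with a hypothetical function name. It only reports the state; it does not reproduce how quick-deploy.sh or postprovision.sh build the header string.

```shell
# Hypothetical helper: given the short SHA and the output of
# `git status --porcelain`, say whether the SPA header will be truthful.
describe_build_state() {
  local sha porcelain count
  sha="$1"
  porcelain="$2"
  if [ -z "$porcelain" ]; then
    printf '%s clean: header SHA matches the shipped code\n' "$sha"
  else
    # One porcelain line per changed or untracked path.
    count="$(printf '%s\n' "$porcelain" | grep -c .)"
    printf '%s dirty: %s uncommitted path(s) ship without a matching commit\n' "$sha" "$count"
  fi
}

# Example:
#   describe_build_state "$(git rev-parse --short HEAD)" "$(git status --porcelain)"
```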

Verification when you want to confirm a specific file made it into the deployed image:

az containerapp exec \
  --name "$CONTAINER_APP_NAME" --resource-group "$AZURE_RESOURCE_GROUP" \
  --container api --command "sha256sum /app/api/main.py"
sha256sum api/main.py    # local comparison

Same hash → shipped as intended. Different → check .dockerignore or whether a later build stage overwrote the file.
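If you run this check often, the comparison can be scripted. The helper below is a hypothetical sketch. It assumes each captured output contains one 64-character hex digest somewhere in it, which also tolerates extra banner lines from az containerapp exec if any appear.

```shell
# Hypothetical helpers for comparing two sha256sum outputs.
# extract_sha256 prints the first 64-hex-digit token found in its argument.
extract_sha256() {
  printf '%s\n' "$1" | grep -Eo '[0-9a-f]{64}' | head -n 1
}

# Succeeds only when both inputs contain a digest and the digests are equal.
same_sha256() {
  local remote_hash local_hash
  remote_hash="$(extract_sha256 "$1")"
  local_hash="$(extract_sha256 "$2")"
  [ -n "$remote_hash" ] && [ "$remote_hash" = "$local_hash" ]
}

# Example:
#   remote="$(az containerapp exec --name "$CONTAINER_APP_NAME" \
#     --resource-group "$AZURE_RESOURCE_GROUP" --container api \
#     --command "sha256sum /app/api/main.py")"
#   same_sha256 "$remote" "$(sha256sum api/main.py)" && echo "shipped as intended"
```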

Preflight checklist¶

The script enforces these automatically and refuses to proceed if any fails:

Check	What it guards against
`az account show` succeeds	Stale or missing `az login`
`az account show` subscription is treated as the source of truth; if it differs from `AZURE_SUBSCRIPTION_ID` in azd env, the script auto-syncs azd env (`azd env set AZURE_SUBSCRIPTION_ID <current>` plus `AZURE_TENANT_ID` when different) and exports the values in-process before continuing	Silently pushing the new image to the wrong ACR / Container App when your `az login` and `azd env` point at different subscriptions. The script never switches `az account set` for you, so `az account show` keeps showing the subscription you actually selected.
`AZURE_RESOURCE_GROUP`, `ACR_NAME`, `ACR_LOGIN_SERVER`, `CONTAINER_APP_NAME`, `CONTAINER_APP_FQDN` are set (auto-loaded from `azd env get-values`)	Pointing at the wrong app
`git status --porcelain` is empty	Building with uncommitted edits silently shipping debug code (override with `--allow-dirty`)
`--pull` only on the branch you started on	Accidental `pull` of a feature branch into `main`
`git pull --ff-only`	Non-fast-forward pulls leaving a merge commit you did not intend
Snapshot of current revision + image refs taken before any PATCH	Losing the previous tags to roll back to
Workload Storage parity: refuses when `publicNetworkAccess=Disabled` AND no approved Private Endpoint exists on the account	Deploying into a state where the Container App has no network path to Storage (worker would fail every minute on `403 AuthorizationFailure`). Override with `--skip-parity-check`.
Runtime RBAC grant: ensures the deployed dashboard MI has `Contributor` + `User Access Administrator` on the AKS cluster RG	The on-cluster OpenAPI deploy task creates `id-elb-openapi` + a federated credential + three role assignments inside the AKS cluster's RG (typically `rg-elb-cluster`), which `infra/modules/controlPlaneRoles.bicep` does not grant. Without this, the SPA's "Deploy elb-openapi" button fails with `workload identity setup failed; OpenAPI pod would have no AZURE_CLIENT_ID.` The preflight calls `scripts/dev/grant-runtime-rbac.sh` which is idempotent (already-correct setups exit in <2 s) and is best-effort — a failure here does not block the cli-upgrade itself, only logs a recovery hint for a tenant/sub admin. Override with `--skip-rbac-grant`.
Exclusive lock on the snapshot file (`flock` on `/tmp/elb-upgrade-snapshot-<app>.json.lock`)	Two operators racing concurrent deploys against the same Container App and corrupting the rollback snapshot — the second run is rejected with a clear error
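The last row relies on an advisory lock. A minimal sketch of that pattern with flock(1) from util-linux follows. The wrapper name, the exit status 75, and the example lock path are assumptions for illustration, and the real script may structure its locking differently.

```shell
# Hypothetical wrapper, not the script's own code: run a command while
# holding an exclusive, non-blocking flock(1) lock on a lock file.
# Exit status 75 for "lock is busy" is an arbitrary choice for this sketch.
with_snapshot_lock() {
  local lockfile
  lockfile="$1"
  shift
  (
    # Keep the lock file open on fd 9 for the lifetime of this subshell.
    exec 9>"$lockfile" || exit 1
    # -n: fail immediately instead of waiting for the current holder.
    if ! flock -n 9; then
      echo "another run holds $lockfile" >&2
      exit 75
    fi
    "$@"
  )
}

# Example (hypothetical lock path and command):
#   with_snapshot_lock /tmp/example.lock sleep 30
```

Running the command in a subshell keeps fd 9 scoped to that one invocation, so the lock is dropped as soon as the wrapped command returns.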

Deploy history¶

Every run (except --dry-run, which records nothing) appends one JSON line per terminal outcome to $ELB_UPGRADE_HISTORY (default ~/.elb-upgrade-history.jsonl):

{"ts":"2026-05-23T03:22:48Z","scope":"full","app":"ca-elb-dashboard","tag":"20260523122407-58cc179","head_sha":"58cc179","result":"success","elapsed_seconds":127,"message":""}
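An entry with the fields shown above could be appended roughly as follows. This is a sketch, not the script's recording function: the name record_history and its argument order are hypothetical, and it assumes the values contain no characters that need JSON escaping.

```shell
# Hypothetical sketch of a best-effort history append.
# Args: scope app tag head_sha result elapsed_seconds
# Assumes no argument contains quotes or backslashes (no JSON escaping here).
record_history() {
  local file ts line
  file="${ELB_UPGRADE_HISTORY:-$HOME/.elb-upgrade-history.jsonl}"
  ts="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
  line="$(printf '{"ts":"%s","scope":"%s","app":"%s","tag":"%s","head_sha":"%s","result":"%s","elapsed_seconds":%d,"message":""}' \
    "$ts" "$1" "$2" "$3" "$4" "$5" "$6")"
  # One short line, one append. A failure to write is swallowed so that
  # history recording can never block a deploy.
  { printf '%s\n' "$line" >> "$file"; } 2>/dev/null || true
}

# Example:
#   record_history full ca-elb-dashboard 20260523122407-58cc179 58cc179 success 127
```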

Possible result values:

Result	Meaning
`success`	Upgrade completed; `/api/health/ready` returned 200 within the timeout
`dry_run`	Skipped — dry-run never writes a history entry (the recording function early-returns)
`parity_rejected`	Storage parity preflight blocked the deploy
`build_in_progress`	The script exited while the build step (`quick-deploy.sh` or `postprovision.sh`) was still running; this is the last state recorded before a build failure
`upgrade_failed_rolled_back`	New tag failed `/api/health/ready`; auto-rollback to the snapshot succeeded
`rollback_failed`	Auto-rollback PATCH applied but `/api/health/ready` still fails — manual intervention needed
`rollback_success`	Explicit `cli-upgrade.sh rollback` scope completed and healthy
`aborted_by_user`	Interactive `Proceed?` prompt was declined
`aborted`	Catch-all for Ctrl+C, SIGTERM, internal errors, or any path that exited before setting an explicit result

Useful queries:

# Most recent 5 runs
tail -5 ~/.elb-upgrade-history.jsonl | jq .

# Outcome counts in the last 30 days
# (GNU date shown; on macOS/BSD use: date -u -v-30d +%Y-%m-%d)
jq -r --arg since "$(date -u -d '30 days ago' +%Y-%m-%d)" \
  'select(.ts > $since) | .result' \
  ~/.elb-upgrade-history.jsonl | sort | uniq -c | sort -rn

# Average elapsed_seconds for successful 'full' deploys
# (prints nothing when no run matches, instead of dividing by zero)
jq -r 'select(.result=="success" and .scope=="full") | .elapsed_seconds' \
  ~/.elb-upgrade-history.jsonl | awk '{s+=$1; n++} END {if (n) print s/n}'

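The file is best-effort: a missing $HOME or read-only filesystem never blocks a deploy. Each entry is a single line well under 4 KiB written with one O_APPEND append, so concurrent writers do not interleave on a local filesystem in practice and no additional locking is used. (Strictly, the PIPE_BUF atomicity guarantee covers pipes and FIFOs rather than regular files.)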

Recommended workflow¶

Routine code-only update (api sidecar)¶

# 1. Pull, build, deploy api+worker+beat, then auto-rollback if /api/health/ready fails.
scripts/dev/cli-upgrade.sh api --pull

# 2. Optional variant of step 1: run the same deploy, then tail the new revision's logs.
scripts/dev/cli-upgrade.sh api --pull --logs

Frontend SPA bundle change¶

# Vite build args (VITE_AZURE_CLIENT_ID etc.) are picked up by quick-deploy.sh
# from azd env values automatically — no manual env juggling.
scripts/dev/cli-upgrade.sh frontend --pull

Sidecar layout / Bicep / terminal base image changed¶

# Runs the full 3-image rebuild + template swap (5-10 min).
scripts/dev/cli-upgrade.sh full --pull

Refresh all custom images without touching sidecar layout (fast path)¶

Use this when api / frontend / terminal code changed together but infra/*.bicep, sidecar env / secrets / probes / scale rules, and the terminal base image did not. It is the fastest "deploy everything" shape because it skips the Bicep redeploy and the snapshot envelope.

# Required env (once per shell): load and export the values azd already knows about,
# so that quick-deploy.sh (a child process) can see them.
azd env get-values > /tmp/azd-env.sh && set -a && source /tmp/azd-env.sh && set +a

# Build api, frontend, and terminal in parallel, then PATCH each container sequentially.
# worker and beat reuse the api image and are PATCHed in the same step.
scripts/dev/quick-deploy.sh all
scripts/dev/quick-deploy.sh all --logs   # same, then tail the api revision logs

Tradeoffs vs cli-upgrade.sh full:

✅ Builds api / frontend / terminal in parallel (per-image logs at .logs/quick-deploy/<tag>/build-<image>.log) and opens the ACR firewall once for all three. Wall time bounded by the slowest image, not the sum.
✅ Skips the ~2-3 min Bicep template swap that postprovision.sh runs.
⚠ PATCHes stay sequential (api → worker → beat → frontend → terminal) because az containerapp update --container-name has no ETag protection — parallel PATCHes would race and silently revert some sidecars on the final revision.
⚠ No snapshot file is written, so there is no auto-rollback when /api/health/ready fails. Verify manually with curl -fsS "https://$CONTAINER_APP_FQDN/api/health/ready" and, if needed, roll back using the manual rollback steps below.
⚠ No dirty-tree / fast-forward / Storage parity / RBAC preflight. Run those checks (git status, scripts/dev/check-mi-rbac.sh) yourself, or fall back to cli-upgrade.sh full when in doubt.
❌ Do not use this when sidecar layout, secrets, probes, scale rules, or the terminal base image changed — those require cli-upgrade.sh full (Bicep template swap).
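The "parallel builds, sequential PATCHes" shape described in the list above can be sketched in plain shell. The two placeholder functions stand in for the real az acr build and az containerapp update calls inside quick-deploy.sh. Everything here is illustrative rather than a copy of that script.

```shell
# Placeholders: the real commands live in quick-deploy.sh. These stand-ins
# only echo what would happen so the pattern can be run as-is.
build_image()     { echo "build $1"; }
patch_container() { echo "patch $1"; }

# Start one background job per image, then wait for every job and
# report failure if any single build failed.
build_all_parallel() {
  local pids image pid failed
  pids=""
  failed=0
  for image in "$@"; do
    build_image "$image" &
    pids="$pids $!"
  done
  for pid in $pids; do
    wait "$pid" || failed=1
  done
  return "$failed"
}

# PATCH one container at a time, in the order given on this page,
# stopping at the first failure.
patch_all_sequential() {
  local container
  for container in api worker beat frontend terminal; do
    patch_container "$container" || return 1
  done
}

# Example:
#   build_all_parallel elb-api elb-frontend elb-terminal && patch_all_sequential
```

Waiting on each PID separately means a single failed build is noticed even when the other two succeed, so nothing is PATCHed after a partial build.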

Roll back from a workstation¶

# Read the snapshot taken on the most recent upgrade run on this workstation
# and re-PATCH every sidecar back to those image refs.
scripts/dev/cli-upgrade.sh rollback --yes

The snapshot file is per-app (/tmp/elb-upgrade-snapshot-<app>.json by default; override with ELB_UPGRADE_SNAPSHOT). If you move workstations between the upgrade and the rollback, copy the snapshot file across — or fall back to the manual rollback below.

Manual rollback (when the script is unavailable)¶

The script's safety net is a single az containerapp update --container-name <name> --image <previous-image> per sidecar. Reproduce it by hand:

# 1. Find the previous revision (the one BEFORE the broken one).
#    --all includes inactive revisions, which is where single-revision mode leaves it.
az containerapp revision list \
  --name "$CONTAINER_APP_NAME" --resource-group "$AZURE_RESOURCE_GROUP" --all \
  --query "sort_by([], &properties.createdTime)[-2:].{name:name, active:properties.active, created:properties.createdTime}" \
  -o table

# 2. Pull its per-sidecar image refs.
az containerapp revision show \
  --name "$CONTAINER_APP_NAME" --resource-group "$AZURE_RESOURCE_GROUP" \
  --revision "<previous-revision-name>" \
  --query "properties.template.containers[].{name:name, image:image}" \
  -o table

# 3. PATCH each container back to the captured image.
az containerapp update \
  --name "$CONTAINER_APP_NAME" --resource-group "$AZURE_RESOURCE_GROUP" \
  --container-name api --image "$ACR_LOGIN_SERVER/elb-api:<previous-tag>"
# (repeat for worker, beat, frontend, terminal as needed)

# 4. Wait for the deep readiness probe (/api/health/ready) to return 200.
curl -fsS "https://$CONTAINER_APP_FQDN/api/health/ready"
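Step 3 has to be repeated for each sidecar. One way to avoid typing mistakes is to generate the commands from the name/image pairs of step 2 and review them before running anything. The helper below is a hypothetical sketch that only prints commands. It assumes whitespace-separated "name image" lines on stdin, such as the output of step 2 with --query "properties.template.containers[].[name, image]" -o tsv.

```shell
# Hypothetical helper: turn "name image" lines into the per-sidecar
# az containerapp update commands from step 3. It prints, never executes.
print_rollback_commands() {
  local name image
  while read -r name image; do
    # Skip blank or incomplete lines.
    if [ -z "$name" ] || [ -z "$image" ]; then
      continue
    fi
    printf 'az containerapp update --name "%s" --resource-group "%s" --container-name "%s" --image "%s"\n' \
      "$CONTAINER_APP_NAME" "$AZURE_RESOURCE_GROUP" "$name" "$image"
  done
  return 0
}

# Example:
#   az containerapp revision show \
#     --name "$CONTAINER_APP_NAME" --resource-group "$AZURE_RESOURCE_GROUP" \
#     --revision "<previous-revision-name>" \
#     --query "properties.template.containers[].[name, image]" -o tsv \
#     | print_rollback_commands
```

Run the printed commands one at a time rather than in parallel, for the same revision-race reason that quick-deploy.sh keeps its PATCHes sequential.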

Health-check budget¶

The script polls https://<fqdn>/api/health/ready every 5 seconds for --health-timeout seconds (default 180). Tune it with --health-timeout 300 when:

The terminal sidecar was rebuilt (cold container, large layer).
The Container App was scaled to zero before the upgrade (revision warmup).
A managed-identity refresh is in progress (typically <30 s).

/api/health/ready is the deep readiness probe — it checks the Redis broker, the Managed Identity credential, the terminal sidecar's loopback exec server, and a cheap list_tables(top=1) call against the workload Storage Table data plane. A 200 means the api sidecar is up AND every critical downstream is actually reachable. On any 503 the script dumps the response body to stderr so you can see which component is down before the auto-rollback kicks in.

The cheap /api/health (liveness) endpoint stays in place for Container Apps platform probes. Never use it as a deploy verification gate: it does not call Azure at all.
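When you deploy with a path that has no built-in gate (for example quick-deploy.sh), a comparable poll against the readiness endpoint can be run by hand. The loop below is a minimal sketch under the timings stated in this section (every 5 seconds, 180-second budget). The names poll_until and probe_ready are hypothetical, and the 10-second per-request cap is an assumption of this sketch.

```shell
# Hypothetical sketch of a bounded poll. Runs "$@" until it succeeds or
# until <timeout> seconds have passed, sleeping <interval> between tries.
# Usage: poll_until <timeout> <interval> <command...>
poll_until() {
  local timeout interval deadline
  timeout="$1"
  interval="$2"
  shift 2
  deadline=$(( $(date +%s) + timeout ))
  while :; do
    if "$@"; then
      return 0
    fi
    if [ "$(date +%s)" -ge "$deadline" ]; then
      return 1
    fi
    sleep "$interval"
  done
}

# Probe the deep readiness endpoint. -f makes curl fail on 4xx/5xx.
probe_ready() {
  curl -fsS --max-time 10 -o /dev/null "https://$CONTAINER_APP_FQDN/api/health/ready"
}

# Example:
#   poll_until 180 5 probe_ready && echo "healthy" || echo "not ready in time"
```

Keeping the probe as a separate function makes it easy to swap in a variant that prints the 503 body, which is where the failing component is named.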

Common failure modes¶

Symptom	Most likely cause	Fix
`ACR no longer carries the snapshotted tags` (rollback)	ACR retention policy purged the previous tag.	Bump retention before next upgrade: `az acr config retention update --registry "$ACR_NAME" --status enabled --days 180 --type UntaggedManifests`. Re-build the older release locally to restore the missing tag.
`Auto-rollback` says PATCH succeeded but `/api/health/ready` still returns 5xx	The previous tag also depends on a sidecar image that was purged, OR Storage / Key Vault private endpoint is down.	Inspect `az containerapp logs show --container api --type system --tail 100` and `az containerapp logs show --container api --tail 100`.
`git pull --ff-only failed`	A teammate force-pushed or the working branch is diverged.	Rebase locally and resolve manually; do not pass `--allow-dirty` to bypass.
`403` on `az containerapp update`	Caller's `az login` identity lacks `Contributor` on the Container App.	Use the deploying account, or have the deployer add a `Container Apps Contributor` role assignment.
New revision crash-loops with `ImagePullBackOff`	Build succeeded but ACR pull permission for the Container App's MI is broken.	Run `scripts/dev/postprovision.sh` once to re-grant `AcrPull`.
Health check passes but the SPA fails to load	`VITE_API_BASE_URL` leaked from `web/.env.local` into the frontend build.	The script unsets it before building; if you bypassed the script, re-run `cli-upgrade.sh frontend --pull` to rebuild the frontend image without the leaked value.
Preflight rejects with `Storage '...' is unreachable from the Container App`	Workload Storage is `publicNetworkAccess=Disabled` (most often left over from a local-debug `storage-public-access.sh off` / `local-run.sh storage-off` / `auth-off`) AND the deployment never created Private Endpoints (`LOCKDOWN_PRIVATE_NETWORKING=false`).	Quick: `scripts/dev/storage-public-access.sh on --account <acct> --rg <rg>`. Proper: `azd env set LOCKDOWN_PRIVATE_NETWORKING true && azd provision`. Last-resort override: `--skip-parity-check` (workload will still fail Storage calls).
`/api/health/ready` returns 503 with `azure_storage: down` in the body	The api sidecar can reach Azure AD but not the Storage data plane. Same cause as the preflight rejection above, OR transient Azure outage, OR the workload Managed Identity is missing `Storage Table Data Contributor` on the workload storage account.	Confirm MI role: `az role assignment list --assignee <mi-principalId> --scope <storage-id>`. If correct, run the Storage recovery from the row above. The `azure_storage.error_class` field in the same body (e.g. `HttpResponseError`, `ServiceRequestError`, `ClientAuthenticationError`) tells you the SDK exception category at a glance.
Preflight rejects with `another cli-upgrade run holds /tmp/elb-upgrade-snapshot-<app>.json.lock`	Another `cli-upgrade.sh` is already running against the same Container App on this workstation, or a previous run was killed before releasing the `flock(2)` advisory lock.	If a peer is genuinely deploying, wait for them to finish. If the lock is stale (no `cli-upgrade.sh` process exists), remove the lockfile: `rm /tmp/elb-upgrade-snapshot-<app>.json.lock`.

What this script does not do¶

No azd provision. Infra under infra/*.bicep is not re-applied. Use azd up (or azd provision && cli-upgrade.sh full) for Bicep changes.
No multi-revision blue/green. The bundled Container App is minReplicas: 1, maxReplicas: 1, revisionsMode: single. Rollback is a fast re-PATCH, not a revision activate.
No cross-tenant deploy. The script honours the current az login context — there is no tenant-switching flag.
No automatic git push. It only pulls. Whatever you build is the tip of the branch on the workstation at that moment.

Deployment Reference — the prerequisites, Bicep modules, and the full azd up flow.
In-app Upgrades — the browser-driven equivalent.
Runtime Plan — RBAC + identity matrix the az containerapp update PATCH depends on.
Container Apps Architecture — sidecar layout and the quick-deploy.sh constraints.
