cli-upgrade hardening round 2: concurrent-deploy lock, deploy history, richer probe diagnostics
Motivation
Round 1 of the cli-upgrade deep-health work (PR series ending
72098a7) closed the silent-Storage-failure footgun. A follow-up
critique then surfaced three deferred items: a missing probe
diagnostic field, a snapshot race condition between concurrent
operators, and missing deploy-side observability. This change
addresses all three in one PR rather than fragmenting them across
follow-ups.
User-facing change
Three independent additions, all opt-in / additive:
- Richer Storage probe diagnostics. When `/api/health/ready` reports `azure_storage: down`, the response body now also includes `error_class` (e.g. `HttpResponseError`, `ServiceRequestError`, `ClientAuthenticationError`, `TimeoutError`). Operators can map the class to a 1st-line action without parsing the free-form `error` string. The `error` field itself is unchanged for back-compat with any existing parsers.
- Concurrent-deploy lockout. `cli-upgrade.sh` now takes an exclusive `flock(2)` on `/tmp/elb-upgrade-snapshot-<app>.json.lock` before reading or writing the snapshot. Two operators racing `cli-upgrade.sh full` against the same Container App can no longer corrupt the rollback snapshot (the second run is rejected with a clear error pointing at the lockfile). The lock is released on normal exit, `die`, Ctrl+C, or SIGTERM via `set -E` trap behavior.
- Deploy history. Every terminal outcome of `cli-upgrade.sh` appends one JSON line to `$ELB_UPGRADE_HISTORY` (default `~/.elb-upgrade-history.jsonl`) with `ts`, `scope`, `app`, `tag`, `head_sha`, `result`, `elapsed_seconds`, `message`. Implemented via a single EXIT trap so every exit path (success, parity rejection, build failure, rollback, Ctrl+C, internal error) gets recorded exactly once, without scattering log calls across the script.
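For readers less familiar with `flock -n`, the lockout behaviour can be sketched in Python with `fcntl.flock`. This is an analogue of the pattern under the assumption of a Unix host, not the script's bash code; the function name `acquire_lock` is made up for illustration.

```python
import fcntl
import os


def acquire_lock(lock_path: str) -> int:
    """Take an exclusive, non-blocking flock on lock_path.

    Returns the open file descriptor. The lock lasts until that
    descriptor is closed or the process exits. Raises BlockingIOError
    if another open file description already holds the lock.
    """
    fd = os.open(lock_path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        # LOCK_NB makes the call fail immediately instead of waiting,
        # which is the same choice `flock -n` makes on the command line.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        raise
    return fd
```

A second caller that hits `BlockingIOError` would report the lockfile path and exit, which roughly mirrors the rejection described for the second concurrent run.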
API / IaC diff
- `api/routes/health.py`
  - `_probe_storage_table()` augments the `down` payload with `error_class: type(exc).__name__`. The existing `error: str(exc)[:200]` field is unchanged.
- `api/tests/test_smoke.py`
  - Narrowed the `_reset_storage_probe_cache_between_tests` autouse fixture so it only runs for `test_readiness_storage*` tests, not for the entire file. Saves a couple hundred microseconds per unrelated test and clarifies the fixture's actual scope.
  - New `test_readiness_storage_down_payload_includes_error_class` pins the additive field contract.
- `scripts/dev/cli-upgrade.sh`
  - `take_snapshot` and `restore_from_snapshot` are now wrapped in an advisory file lock acquired at script start via `exec 9>...; flock -n 9`. Second concurrent run dies with a recoverable error.
  - New `record_history()` + `set_result()` + EXIT trap that writes one JSONL line to `$ELB_UPGRADE_HISTORY`. Outcomes covered: `success`, `parity_rejected`, `build_in_progress`, `upgrade_failed_rolled_back`, `rollback_failed`, `rollback_success`, `aborted_by_user`, `aborted`. Dry-run and `--help` are intentionally excluded.
- `docs/operate/cli-upgrade.md`
  - Two new "Common failure modes" rows: concurrent-deploy lock error and the new `error_class` diagnostic hint.
  - New "Deploy history" section with format, result values, and three jq one-liners (recent runs, outcome counts, average elapsed).
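The "record exactly once on every exit path" idea behind `set_result()` plus the EXIT trap can be illustrated with a Python context manager. This is only an analogue of the bash pattern: the `Outcome` class, the `recorded_run` helper, and the in-memory `history` list are hypothetical names, and the default of `aborted` for unclassified exits is an assumption based on the outcome list above.

```python
import json
import time
from contextlib import contextmanager


class Outcome:
    """Holds the run's result; stays 'aborted' unless set explicitly."""

    def __init__(self):
        self.result = "aborted"
        self.message = ""

    def set_result(self, result, message=""):
        self.result = result
        self.message = message


@contextmanager
def recorded_run(history, scope):
    """Append exactly one JSON line to `history`, however the body exits."""
    outcome = Outcome()
    started = time.monotonic()
    try:
        yield outcome
    except KeyboardInterrupt:
        # Rough equivalent of the Ctrl+C path.
        outcome.set_result("aborted_by_user", "interrupted")
        raise
    finally:
        # Runs once for success, failure, and interrupt alike.
        history.append(json.dumps({
            "scope": scope,
            "result": outcome.result,
            "elapsed_seconds": round(time.monotonic() - started, 3),
            "message": outcome.message,
        }))
```

Keeping the write in a single `finally` block plays the role the EXIT trap plays in the script: individual code paths only set the result, and none of them writes the history line directly.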
No Bicep, Container App template, or response-shape changes for any
existing client. The new error_class is additive.
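A minimal sketch of the additive payload shape, using the two expressions quoted above (`str(exc)[:200]` and `type(exc).__name__`). The helper name `storage_down_payload` and the `status` key are assumptions for illustration; the real `_probe_storage_table()` may structure its response differently.

```python
def storage_down_payload(exc: BaseException) -> dict:
    """Build a 'down' body for a failed Storage probe."""
    return {
        "status": "down",                   # assumed key name
        "error": str(exc)[:200],            # existing free-form field, unchanged
        "error_class": type(exc).__name__,  # new additive field
    }
```

A parser that only reads `error` sees the same value as before, which is the back-compat property the new test is described as pinning.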
Validation
- `uv run pytest -q api/tests/test_smoke.py -k readiness_storage` → 6 passed (existing 5 + new `error_class` case).
- `uv run ruff check api/routes/health.py api/tests/test_smoke.py` → All checks passed.
- `bash -n scripts/dev/cli-upgrade.sh` → syntax OK.
- End-to-end against deployed `ca-elb-dashboard`:
  - Two concurrent `cli-upgrade.sh full --allow-dirty --dry-run` invocations → first succeeds, second exits 1 with `another cli-upgrade run holds /tmp/elb-upgrade-snapshot-...lock`.
  - History file populated with one entry per run; `--help` and dry-run produce no entries as designed.
  - Bad scope (`cli-upgrade.sh xxxxxxx`) records one `aborted` entry with `exit=1, scope=unknown`.
Operator note
The `cli-upgrade.sh` lockfile lives at
`/tmp/elb-upgrade-snapshot-<CONTAINER_APP_NAME>.json.lock`. If a
previous run was force-killed (`kill -9`), the lockfile may be left
behind as stale. Remove it manually once no `cli-upgrade.sh` process
is alive, for example with
`rm /tmp/elb-upgrade-snapshot-<CONTAINER_APP_NAME>.json.lock`.
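Before deleting the file, it can help to confirm that nothing still holds the lock. A `flock(2)` lock belongs to the open file description and goes away once every descriptor referring to it is closed, so a leftover file is not by itself proof of a held lock. The probe below is a sketch for a Unix host; `lock_is_held` is a hypothetical helper, not part of `cli-upgrade.sh`.

```python
import fcntl
import os


def lock_is_held(lock_path: str) -> bool:
    """Return True if some process currently holds a flock on lock_path."""
    if not os.path.exists(lock_path):
        return False
    # flock works regardless of the mode the file was opened in.
    fd = os.open(lock_path, os.O_RDONLY)
    try:
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return True  # someone else holds it; do not delete
        # We got the lock, so nobody held it; release it again.
        fcntl.flock(fd, fcntl.LOCK_UN)
        return False
    finally:
        os.close(fd)
```

If the probe returns `True` even though the script itself is gone, one possible cause is a child process that inherited the locked descriptor and is still running.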
The history file is per-host, so deploys from a CI runner and from a
developer workstation will not appear in the same JSONL file. For
cross-host aggregation, point `ELB_UPGRADE_HISTORY` at a shared
location (Azure Files mount, SMB share, etc.), at your own risk: the
atomicity guarantee relies on local-filesystem O_APPEND semantics.
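To make the O_APPEND point concrete, here is a sketch of appending and reading history lines in Python. It assumes a local filesystem; the function names are hypothetical, and the counting helper is only a rough stand-in for the kind of outcome-count query the docs describe with jq.

```python
import json
import os
from collections import Counter


def append_history(path: str, entry: dict) -> None:
    """Append one JSON line using a single write on an O_APPEND descriptor."""
    line = (json.dumps(entry, separators=(",", ":")) + "\n").encode("utf-8")
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        # With O_APPEND the kernel positions each write at end-of-file,
        # so short lines from concurrent local writers do not overwrite
        # one another. Network filesystems may not honour this.
        os.write(fd, line)
    finally:
        os.close(fd)


def read_history(path: str) -> list:
    """Parse every non-empty line of the JSONL file."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(raw) for raw in fh if raw.strip()]


def outcome_counts(rows: list) -> dict:
    """Count entries per `result` value."""
    return dict(Counter(row["result"] for row in rows))
```

Building the whole line first and issuing one `os.write` keeps each record in a single system call, which is what the append guarantee depends on.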