cli-upgrade.sh — deep readiness probe + Storage isolation parity preflight¶
Motivation¶
On 2026-05-23 the test Container App spent ~37 minutes silently failing
every auto-warmup-reconcile Celery beat tick because workload Storage
was set to publicNetworkAccess=Disabled while no Private Endpoint
existed for it (Bicep lockdownPrivateNetworking=false skipped the PE
branch). The api / worker / beat sidecars booted up healthy, /api/health
returned 200, but every Storage Table call returned the misleading
403 AuthorizationFailure — which is Azure Storage's response when the
public endpoint is hit while public access is disabled, indistinguishable
in the error code from RBAC denial.
The CLI deploy envelope did not catch this. cli-upgrade.sh only polled
the cheap /api/health liveness endpoint and called it a success, so
the silent failure mode could persist through a full full --allow-dirty
rolling update without any signal.
User-facing change¶
Three improvements that compose into one safety envelope:
-
Deeper post-deploy gate.
cli-upgrade.shnow polls/api/health/ready(readiness), which checks Redis + Azure credential + terminal sidecar + Storage Table data plane. On a 503 it dumps up to 5 KB of the JSON response body to stderr so the operator seesazure_storage: { status: down, error: ... }immediately, before auto-rollback fires. -
Pre-deploy parity check. Before the snapshot stage,
cli-upgrade.shqueries the workload Storage account and refuses the deploy whenpublicNetworkAccess=Disabledand zero Private Endpoints reference the account. The error message offers two recovery paths (quickstorage-public-access.sh onvs properazd env set LOCKDOWN_PRIVATE_NETWORKING true && azd provision) plus an explicit--skip-parity-checkoverride. -
Readiness component coverage.
/api/health/readygained a 4th component,azure_storage, that performs the cheapest possible Table data-plane call (list_tables(results_per_page=1)) with 3 s connect / 3 s read timeouts.AZURE_TABLE_ENDPOINTunset → reportsskipped; reachable →ok; raises →down+ overall 503.
API / IaC diff¶
api/routes/health.py::readiness- Add
azure_storagecomponent check after the existing 3 - 3 s timeouts via
connection_timeout/read_timeoutso a slow Storage cannot tarpit unauthenticated readiness callers - Liveness
/api/healthuntouched (Container Apps platform probes still use it) api/tests/test_smoke.py- 3 new cases:
skipped,ok,downpaths via monkeypatchedTableServiceClient(no real Azure calls in unit tests) scripts/dev/cli-upgrade.shpoll_health()URL →/api/health/ready- Capture last response body (5 KB cap) and
cat >&2on non-200 - Help / plan-summary / rollback-failure strings updated to match
- New
preflight_storage_parity()between az-login preflight and snapshot; new--skip-parity-checkflag +--helpdocumentation docs/operate/cli-upgrade.md- Preflight checklist row for the Storage parity check
- Health-check budget section rewritten for
/health/ready - Mermaid flowchart updated to show
/api/health/ready - Two new rows under Common failure modes covering preflight
rejection and
azure_storage: down503 cases
No Bicep or Container App template changes; this is pure CLI envelope plus a readiness-endpoint hardening.
Validation¶
uv run pytest -q api/tests/test_smoke.py api/tests/test_route_contracts.py→ 84 passed (3 new readiness storage cases included).uv run ruff check api/routes/health.py api/tests/test_smoke.py→ All checks passed.bash -n scripts/dev/cli-upgrade.sh→ syntax OK.- End-to-end reproduction against deployed
ca-elb-dashboard: cli-upgrade.sh full --allow-dirty --dry-runwhile Storage wasEnabled→ parity check passes, plan printsHealth: .../api/health/ready.storage-public-access.sh off ...→ simulates the broken state.cli-upgrade.sh full --allow-dirty --dry-runnow exits 1 with the parity error message and recovery options.cli-upgrade.sh full --allow-dirty --dry-run --skip-parity-checkskips with aWARN:and proceeds to plan/dry-run.storage-public-access.sh on ...restores connectivity.- Confirmed worker
reconcile_auto_warmupsucceeds again after recovery (Task ... succeeded in 0.087s: {status: completed}).
Operator note¶
On ca-elb-dashboard (test) workload Storage stays
publicNetworkAccess=Enabled, defaultAction=Allow while Bicep is
deployed with lockdownPrivateNetworking=false. Do not run
storage-public-access.sh off against this app until the Bicep is
re-deployed with LOCKDOWN_PRIVATE_NETWORKING=true so the
blob/dfs/table Private Endpoints actually exist.