F2: ACR retention pre-flight for rollback (2026-05-22)
Motivation
PR3's rollback path issued the ARM PATCH unconditionally. If ACR retention had purged any of the snapshotted tags between the upgrade and the rollback attempt, ACA would accept the PATCH, attempt to pull the missing image, and crashloop the new "rollback" revision — delivering downtime instead of recovery. F2 adds a data-plane check that catches this before the CAS so the operator gets a clean refusal and can take the escape-hatch route.
Change
- New dependency: `azure-containerregistry==1.2.0` for ACR data-plane manifest lookups (uses the same Managed Identity via `azure-identity`).
- `api/services/upgrade/acr_inventory.py` (new): `lookup_images()` batches per-endpoint manifest probes and returns `ImageInfo(exists, created_on, error)` per ref. Never raises; distinguishes "tag not found" from "registry offline". Test seam: `set_client_factory_for_tests` injects a fake `ContainerRegistryClient`.
- `api/tasks/upgrade.py::start_rollback_inline` runs `lookup_images` for the three rollback target refs and raises `RollbackStartRefused("ACR no longer carries the snapshotted tags: …")` before the rollback CAS when any tag is missing. ACR-side errors (registry offline) are logged and the rollback proceeds; we'd rather attempt the PATCH than block on a transient SDK glitch.
- `api/routes/upgrade.py`: new admin-gated endpoint `GET /api/upgrade/rollback-preflight` returns per-image existence + creation timestamp so the SPA can warn proactively.
- `web/src/api/upgrade.ts`: adds `rollbackPreflight()` + types.
- `web/src/pages/UpgradePage.tsx`: renders the preflight result inside the Rollback card. When `available=false`, a red banner lists the missing tags and the Roll-back button is disabled (escape-hatch card still works). When `available=true`, a muted line confirms ACR pre-flight passed and shows the snapshot creation date.
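The refuse-versus-proceed rule in `start_rollback_inline` can be sketched as below. This is a minimal illustration, not the shipped code: `check_rollback_images` and the injected `lookup` callable are hypothetical names, and `ImageInfo` is re-declared here only to mirror the `ImageInfo(exists, created_on, error)` shape listed above.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Optional


@dataclass(frozen=True)
class ImageInfo:
    # Mirrors the ImageInfo(exists, created_on, error) shape from the change list.
    exists: bool
    created_on: Optional[datetime] = None
    error: Optional[str] = None


class RollbackStartRefused(Exception):
    """Raised before the rollback CAS when a snapshotted tag is gone."""


def check_rollback_images(
    refs: list[str],
    lookup: Callable[[list[str]], dict[str, ImageInfo]],
) -> list[str]:
    """Refuse on confirmed-missing tags; return the refs whose lookup errored.

    Hypothetical helper: `lookup` stands in for lookup_images().
    """
    results = lookup(refs)
    # A tag counts as missing only when the registry answered "not found".
    missing = [r for r in refs if not results[r].exists and results[r].error is None]
    if missing:
        raise RollbackStartRefused(
            "ACR no longer carries the snapshotted tags: " + ", ".join(missing)
        )
    # Registry-side errors are inconclusive, so they do not block the rollback;
    # the caller is expected to log them and continue to the PATCH.
    return [r for r in refs if results[r].error is not None]
```

Keeping "not found" and "lookup failed" as separate outcomes is what lets a transient registry error fall through to the PATCH while a confirmed purge stops the rollback early.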
Tests
- `api/tests/test_upgrade_acr_inventory.py` (new): parse, batch lookup, missing-tag flag, malformed-ref tolerance, `image_exists` shortcut.
- `api/tests/test_upgrade_task.py`: the fixture seeds a default "always-exists" ACR stub; new `test_rollback_refuses_when_acr_tag_retention_purged` confirms the refusal path leaves state untouched.
- `api/tests/test_upgrade_routes.py`: the routes fixture gets the same ACR stub; new tests cover `/rollback-preflight` for the available, missing, no-snapshot, and auth cases.
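A client-factory test seam of the kind these tests rely on might look like the sketch below. Everything here is an assumption made for illustration: the module-level `_client_factory`, the `image_exists(endpoint, repository, tag)` signature, the `FakeNotFound` exception (a stand-in for the SDK's not-found error), and the choice of `get_manifest_properties` as the probed call.

```python
from datetime import datetime, timezone
from types import SimpleNamespace
from typing import Callable, Optional


class FakeNotFound(Exception):
    """Stand-in for the SDK's not-found error (hypothetical name)."""


class FakeRegistryClient:
    """Fake client that knows a fixed set of (repository, tag) pairs."""

    def __init__(self, present: dict[tuple[str, str], datetime]) -> None:
        self._present = present

    def get_manifest_properties(self, repository: str, tag_or_digest: str):
        try:
            created = self._present[(repository, tag_or_digest)]
        except KeyError:
            raise FakeNotFound(f"{repository}:{tag_or_digest}") from None
        return SimpleNamespace(created_on=created)


# Assumed seam: tests swap this factory; production would build the real client.
_client_factory: Optional[Callable[[str], object]] = None


def set_client_factory_for_tests(factory: Optional[Callable[[str], object]]) -> None:
    global _client_factory
    _client_factory = factory


def image_exists(endpoint: str, repository: str, tag: str) -> bool:
    """Hypothetical shortcut: True when the tag resolves, False when not found."""
    if _client_factory is None:
        raise RuntimeError("no client factory configured")
    client = _client_factory(endpoint)
    try:
        client.get_manifest_properties(repository, tag)
    except FakeNotFound:
        return False
    return True


# Usage: seed an "always-known" stub for one tag, then probe a purged one.
set_client_factory_for_tests(
    lambda endpoint: FakeRegistryClient(
        {("api", "v1"): datetime(2026, 5, 1, tzinfo=timezone.utc)}
    )
)
```

With the factory injected, a test can cover both the present-tag and purged-tag paths without touching a real registry.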
Validation
- `uv run ruff check api/services/upgrade api/routes/upgrade.py api/tasks/upgrade.py api/tests/test_upgrade_*.py`: clean.
- `uv run pytest -q api/tests`: 1186 passed (vs prior 1176).
- SPA built (after clearing `web/.tsbuild` due to a pre-existing unrelated incremental cache issue affecting `cards/storage/*.tsx`).
Known limitations
- `rollback_available_until` is still empty in the state row; the preflight endpoint surfaces `created_on` per image, which is enough for the SPA to render a date. A retention-policy → "expires on" conversion is a follow-up that needs `azure-mgmt-containerregistry` policy reads.
- The preflight is a separate round-trip rather than being embedded in `/upgrade/status`. This keeps the status endpoint cheap; the SPA only hits preflight once per page render.
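To show how per-image `created_on` values are enough for the SPA to render a date, here is a rough sketch of a preflight payload builder. Only the `available` flag and the `ImageInfo` fields come from the notes above. The function name, the `missing`, `images`, and `oldest_created_on` keys, and the rule that lookup errors do not flip `available` to false are all assumptions.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class ImageInfo:
    # Mirrors the ImageInfo(exists, created_on, error) shape from the change list.
    exists: bool
    created_on: Optional[datetime] = None
    error: Optional[str] = None


def build_preflight_payload(results: dict[str, ImageInfo]) -> dict:
    """Hypothetical response builder for the rollback-preflight endpoint."""
    # Assumption: only confirmed-missing tags make the rollback unavailable.
    missing = sorted(
        ref for ref, info in results.items()
        if not info.exists and info.error is None
    )
    dates = [info.created_on for info in results.values() if info.created_on]
    return {
        "available": not missing,
        "missing": missing,
        "images": {
            ref: {
                "exists": info.exists,
                "created_on": info.created_on.isoformat() if info.created_on else None,
                "error": info.error,
            }
            for ref, info in results.items()
        },
        # Assumed convenience field: the earliest creation time across images.
        "oldest_created_on": min(dates).isoformat() if dates else None,
    }
```

A payload along these lines would let the page show a date without `rollback_available_until` being populated, though it says nothing about when retention will actually purge a tag.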