Security audit 2026-05-22 — items #12, #13, #14, #18, #20¶
Motivation¶
The final small-radius bundle from the 2026-05-22 sweep. Each item is a contained fix that closes a "trust caller / upstream more than necessary" gap or removes a centralisation gap that could lead to inconsistency:
- #12 (HIGH) —
/api/aks/openapi/proxyauto-injects the adminX-ELB-API-Tokenand forwards over plain HTTP to whatever IPk8s_get_service_ipreturns. If the operator wireselb-openapias a public LoadBalancer, the admin token travels over the open internet between the api sidecar and the public LB. - #13 (MEDIUM) —
ensure_local_storage_accessflipsdefaultAction=Allow(intentional, see service docstring) but relies onallowSharedKeyAccess=falseto keep AAD-only data-plane auth. The invariant was assumed but never checked at flip time. - #14 (MEDIUM) —
POST /api/arm/resource-group/tagschecked only theelb-prefix on tag names. Tag name length (Azure cap: 512), value length (cap: 256), tag count (cap: 50), and the set of characters Azure rejects in tag names (<>%&\?/+ control chars) were unchecked, leaving the api to leak raw SDK exceptions back to the SPA. - #18 (MEDIUM) — The frontend catch-all proxy accepted any method / path. Path-traversal segments, control characters (CR/LF / NUL), and exotic methods (TRACE / CONNECT) were forwarded to the nginx sidecar without sanitisation.
- #20 (MEDIUM) —
https://{account}.blob.core.windows.netwas hardcoded in 7 call sites (services + routes). A sovereign-cloud deployment (US Gov, China) would have to chase every occurrence.
User-facing change¶
- #12 —
GET|POST /api/aks/openapi/proxy?...returns502 {"code": "openapi_unsafe_transport", "message": "..."}when the resolved upstream IP is not in an RFC1918 / loopback / link-local range. The admin token is never emitted in the response body.
Follow-up (same day): the gate accepts an explicit opt-in via
OPENAPI_ALLOW_PUBLIC_LB=trueon the api sidecar. The deployed Container App ships with the opt-in on because_build_manifestscreateselb-openapias a public LoadBalancer by default — see 2026-05-22-openapi-public-lb-opt-in.md for the rationale and operator runbook. Local dev keeps the opt-in unset so the refusal path stays test-covered. - #13 —POST /api/storage/local-debug/open(and the implicit auto-open path) returns{"action": "failed", "error": "shared_key_access_enabled", ...}when the storage account hasallowSharedKeyAccess=true. The operator must set the property tofalse(via Bicep oraz) before opening the network window. - #14 — Tag posts now return400with a clear message when: - tag count > 50 - tag name > 512 chars or contains<>%&\?// control chars - tag value > 256 chars or contains control chars - #18 — Catch-all frontend route returns400 path contains parent-traversal segmentor400 path contains control charactersfor the matching probes, and405 method not allowed for frontend assetsfor anything outside{GET, HEAD, POST, PUT, PATCH, DELETE, OPTIONS}. - #20 — Newapi/services/storage_endpoint.pyhelper module (blob_account_url,table_account_url,dfs_account_url,blob_host_for_account,azure_storage_suffix). All 7 production call sites now route through it. Sovereign-cloud deployment is a one-env-var change:AZURE_STORAGE_SUFFIX=core.usgovcloudapi.net.
API / IaC diff summary¶
| Layer | File | Change |
|---|---|---|
| Routes | api/routes/aks/openapi.py | New _is_private_ipv4 helper + gate before token injection in aks_openapi_proxy. |
| Routes | api/routes/arm.py | New _validate_tag_name, _validate_tag_value helpers + count cap; wired into set_rg_tags. |
| Routes | api/routes/frontend_proxy.py | New _FRONTEND_ALLOWED_METHODS set + traversal/control-char/method guards before forwarding. |
| Services | api/services/storage_public_access.py | ensure_local_storage_access now reads allow_shared_key_access and refuses to open the window when True. |
| Services (new) | api/services/storage_endpoint.py | Centralised Storage endpoint helper module. |
| Services (migrated) | api/services/storage_data.py, api/services/job_artifacts.py, api/services/storage_url_validation.py, api/services/db_sharding.py, api/services/blast_oracles.py, api/services/blast/task_config.py, api/services/db_order_oracle.py, api/routes/storage/prepare_db.py | All hardcoded f"https://{account}.blob.core.windows.net" strings replaced with blob_account_url(account) / blob_host_for_account(account). |
| Tests | api/tests/test_security_audit_12_13_14_18_20.py | New 14-test file: ARM tag limits × 4, frontend path/method guards × 3, Storage shared-key refusal × 2, endpoint helper × 5. |
| Tests | api/tests/test_openapi_proxy_route.py | Switched stub Service IP from 20.30.40.50 (public) to 10.0.0.50 (RFC1918) so existing happy-path tests reflect the real deployment; added 3 new tests for public-IP refusal + IPv6 conservative-default refusal × 2. |
No IaC changes. No new dependencies. No deploy required.
Validation evidence¶
uv run ruff checkon every touched file → passed.uv run pytest -q api/tests/test_security_audit_12_13_14_18_20.py— 14 passed.uv run pytest -q api/tests/test_openapi_proxy_route.py— 22 passed.uv run pytest -q api/tests— 961 passed (was 943 → +18 from #12–#20 + IPv6 hardening).
Hardening pass (same day)¶
A self-critique surfaced one additional weakness; fixed in the same change:
- HIGH — IPv6 was silently accepted as "non-private" → refused. The
first draft of
_is_private_ipv4only handled IPv4 viaIPv4Address(value)and fell through toFalsefor IPv6, which the proxy interprets as "non-private → refuse". That happens to be the conservative default (refuse on IPv6 too), but a future operator enabling dual-stack AKS would be confused by the 502 with no diagnostic. The docstring now states the limitation explicitly and the test file pins both IPv6 cases (private ULAfd00::/8and globally-routable Cloudflare DNS) to the same 502 so a future fix is forced to update the URL builder and this check together.
Non-goals (explicitly deferred)¶
- #15 (token cache TTL) — Reducing the 5-min cap is a security / UX trade-off (faster revocation propagation vs. JWKS call rate). Wants user input.
- #16 (
azallowlist) —azis itself a script interpreter; the fix is either (a) removeazfrom the exec_server allowlist or (b) document thatazrequires the same trust level as the user. Needs design discussion. - #17 (HTTP rate limit) — Adds a new dependency (
slowapior similar) and requires user input on the rate ceiling. - #19 (execToken rotation tracking) — IaC + Container App secret rotation pipeline.
- #1 / #2 — Already covered by docs/copilot/security-audit-followup.md design doc; both require larger sprints (App Role / per-ticket tmux).
Audit progress¶
14/20 audit findings now closed (#3, #4, #5, #6, #7, #8, #9, #10, #11,