Security audit 2026-05-22 — items #3, #6, #7¶
Motivation¶
Read-only security sweep of the api sidecar surfaced 20 findings ranked by severity. The three lowest-risk-to-patch and highest-immediate-impact items were addressed in this change:
- #3 (CRITICAL) — Three Celery diagnostic endpoints under
/api/health/celery*were unauthenticated. They exposed broker URL, worker stats, and arbitrary task results, and accepted anonymous enqueue. Any unauthenticated caller could (a) inventory tenant infrastructure detail and (b) burn worker capacity by spammingenqueue-noop. - #6 (HIGH) —
/api/operations/{id}and the legacy alias/api/tasks/{id}validated the bearer token but did not check ownership of the underlying job. A logged-in tenant member could poll any other user's Celery task and read its result payload (BLAST job metadata, ARM responses, error tracebacks). - #7 (HIGH) —
/api/audit/logreturned the rawpayload_jsonblob stored in the jobhistory table. The blob may legitimately contain SAS query strings, bearer tokens, and subscription ids that audit consumers must not see in raw form.
User-facing change¶
- Calling
GET /api/health/celery,POST /api/health/celery/enqueue-noop, orGET /api/health/celery/result/{task_id}without a valid MSAL bearer now returns401 invalid token(or401 missing bearer token). GET /api/operations/{id}andGET /api/tasks/{id}return403 {"detail": "not owner"}when the JobState row for the task records a differentowner_oid. Tasks without a JobState row (system / diagnostic tasks such asdiag_noop) remain accessible — they carry no per-user payload.GET /api/audit/logredacts SAS query strings, bearer tokens, account / access keys, connection strings, and Azure GUIDs from thepayloadfield before returning. The repo layer still stores raw blobs for forensic use; redaction is a presentation-boundary concern.
API / IaC diff summary¶
| Layer | File | Change |
|---|---|---|
| Routes | api/routes/health.py | Added Depends(require_caller) to the three /health/celery* endpoints. |
| Routes | api/routes/operations.py | New _enforce_task_ownership helper; lookup JobState by task_id, return 403 if owner_oid differs from caller. |
| Routes | api/routes/tasks.py | Same ownership gate as operations.py (legacy alias parity). |
| Routes | api/routes/audit.py | Wrap payload_json with api.services.sanitise.sanitise before returning. |
| Services | api/services/state_repo.py | New JobStateRepository.find_by_task_id(task_id) returns the summary row matching a Celery task id (uses _JOBSTATE_SUMMARY_SELECT, no payload). |
| Tests | api/tests/test_smoke.py | Added the three /health/celery* paths to the anonymous-rejection parametrised test plus four new regression tests for ownership (403 / 200 paths) and audit sanitisation. |
No IaC changes. No new dependencies. No deploy required — backend routes only.
Validation evidence¶
uv run ruff check api— passed.uv run pytest -q api/tests— 883 passed in 24.24s (was 871 → +12 new parametrised + regression cases).- Targeted run:
uv run pytest -q api/tests/test_smoke.py— 73 passed.
Hardening pass (same day)¶
A self-critique surfaced three additional weaknesses; fixed in the same change:
- CRITICAL — ownership check failed open on lookup error. The first draft swallowed every state-repo exception and let the request through with a log line. That is exactly the "try/except blind" pattern the user memory warns against. Fixed:
_enforce_task_ownershipnow fails closed with HTTP 503{"code": "ownership_check_unavailable", "retryable": true}when the state lookup raises.AUTH_DEV_BYPASS=trueis the single exception (dev loop without a real state backend would otherwise hard-fail on every poll); the synthetic identity is already trust-flagged.JobStateRepositoryimport promoted to module scope so the testmonkeypatchtarget is unambiguous and the helper has no hidden hot import cost.- HIGH — audit fallback leaked raw exception strings. State-repo /
Storage SDK exceptions routinely embed account URLs, request-id GUIDs,
and occasionally SAS query strings. The fallback now passes through
api.services.sanitise.sanitisebefore truncation. - Coverage gap. The original
pragma: no cover - defensivemasked the degraded branch. New regression tests: test_operations_fails_closed_when_state_repo_raises_in_production— asserts 503 +ownership_check_unavailablewhen dev bypass is off.test_operations_fail_open_in_dev_bypass— asserts 200 + Celery projection when dev bypass is on.test_audit_log_error_branch_is_sanitised— asserts SAS / subscription GUID redaction in the fallbackerrorstring.
Non-goals (deferred to follow-up notes)¶
-
1 / #4 (role-based authz on top of bearer validation) and #2 (per-ticket¶
tmux session) need a design pass before any code change. Drafts live in docs/copilot/security-audit-followup.md. -
5, #8, #9, #11, #12 remain open as tracked items in the audit memory note.¶