2026-05-15 — Fix shared_task.delay() resolving to phantom Celery app (AKS provision silently dropped)¶
Motivation¶
Clicking Provision in the SPA returned 200 OK from POST /api/aks/provision,
but the AKS cluster never appeared. The Celery worker was healthy and registered
all tasks (including api.tasks.azure.provision_aks), yet nothing ever
moved from queue → worker.
Root cause¶
The api sidecar imports api.tasks.azure lazily inside the route handler:
@aks_router.post("/provision")
def aks_provision(...):
from api.tasks.azure import provision_aks # ← lazy, FIRST import
result = _safe_delay(provision_aks, ...) # → uses task.app (current_app)
api.tasks.__init__ did not import api.celery_app first. So at the moment
the @shared_task decorators ran, no Celery app had been instantiated yet, and
current_app returned celery's internal phantom default app (main='default',
broker='amqp://', task_routes={}, task_default_queue='celery').
shared_task bound the tasks to that app forever.
When provision_aks.delay() ran later:
- It used the phantom app's router → no route matched → fell through to its
task_default_queue='celery'.
- The worker container subscribes only to default,azure,blast,storage.
- Messages stacked up in the celery Redis list and were never consumed.
A live diagnostic in production confirmed the split:
task_app_id : 132600762876176 ← phantom default app
celery_app_id : 132600762186832 ← our real app
current_app_id : 132600762876176 ← phantom default app
task_app_main : "default"
task_app_routes : {}
task_app_default_queue : "celery"
resolved_route : {"queue": "celery"}
User-facing change¶
- AKS provision now actually runs (verified end-to-end on
ca-elb-control--0000036,diag_nooptask: enqueue → SUCCESS in ~4 s). - Same fix unblocks every other Celery-backed flow: ACR build, BLAST submit/cancel/status,
storage warmup, scheduled
check_database_updates. - No SPA changes needed.
Code change summary¶
api/celery_app.py— explicitset_as_current=True+set_default()+set_current()after construction (belt-and-braces against any other module instantiating a Celery app first).api/tasks/__init__.py— loadapi.celery_appbeforeacr/azure/blast/storage(load-bearing import order, with comment).api/main.py— eagerfrom api import celery_app as _celery_appso the api sidecar's uvicorn workers each have our Celery instance registered ascurrent_appbefore any request handler runs.api/tasks/azure.py— addeddiag_noopCelery task for permanent enqueue↔consume round-trip diagnostics.api/routes/health.py— added unauthenticated diagnostic endpoints (GET /api/health/celery,POST /api/health/celery/enqueue-noop,GET /api/health/celery/result/{task_id}) so future drift surfaces immediately instead of "POST 200 + cluster never appears".
Validation evidence¶
uv run pytest -q api/tests→ 67 passed (was 56; +diag tests).- Local repro with
redis:7-alpineon127.0.0.1:16379: - Production
ca-elb-control--0000036after deploy:POST /api/health/celery/enqueue-noop?message=after-fix → 200 task_app_id == celery_app_id == current_app_id resolved_route.queue = "azure" GET /api/health/celery/result/<task_id> (after 4 s) status: SUCCESS result.message: "after-fix" GET /api/health/celery queues: {default:0, azure:0, blast:0, storage:0, celery:0} redis_keys_db0: [] ← no orphaned messages
Out of scope¶
- The legacy messages already sitting in the
celeryRedis list from before the fix are dropped on the next revision swap (in-memory Redis sidecar, no AOF). Any user who clicked Provision before this deploy must click again. - The diagnostic endpoints under
/api/health/celery/*are unauthenticated by design (read-only, no secrets) — same posture as/api/healthand/api/health/ready.