BLAST Job Table Sync: Resilience, Multi-User, Performance
Motivation
Three classes of problems made the BLAST jobs list look inconsistent or stale:
- AKS-only jobs got lost on cluster recreation: jobs submitted via the `elastic-blast` CLI inside AKS lived only in the external OpenAPI plane's ConfigMaps and never reached the dashboard's Table Storage.
- Celery failures left zombie rows: if the broker was down when a user submitted, the Table row stayed `queued` forever. If a worker died mid-flight, the Table row stayed `running` until manually fixed.
- Delete button looked broken: clicking Delete soft-deleted the row in Table, but the next external-OpenAPI poll resurrected the row because the merge step did not check the tombstone.
There were also performance and ownership concerns:
- Several UI components polled the same `/api/blast/jobs` endpoint concurrently; each call fired an upstream OpenAPI HTTP request from scratch.
- Sync looked up N rows in N round-trips against Azure Table Storage.
- In a multi-user deployment, the first caller to discover an external job claimed ownership and hid the row from every other caller with the same ARM scope.
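
The duplicated upstream calls described above are the kind of cost a short-lived in-memory cache removes. The following is a minimal sketch of that idea, not the dashboard's actual code: the `TTLCache` class, its method names, and the injectable clock are hypothetical, and only the 15 s lifetime is taken from the Backend notes in this document.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLCache:
    """Tiny time-based cache keyed by keyword arguments (illustrative only)."""

    def __init__(
        self,
        ttl_seconds: float = 15.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def get_or_call(self, fetch: Callable[..., Any], **kwargs: Any) -> Any:
        # Keyword values must be hashable for this simple key scheme.
        key = tuple(sorted(kwargs.items()))
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]  # fresh enough: skip the upstream call
        value = fetch(**kwargs)
        self._entries[key] = (now, value)
        return value

    def reset(self) -> None:
        """Drop all entries, e.g. between test cases."""
        self._entries.clear()
```

With a cache like this, several polls arriving within the lifetime and carrying the same arguments share one upstream result. The sketch takes no lock, so two threads that miss at the same instant would both call `fetch`; a production version would need to decide whether that is acceptable.
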
User-Facing Change
- AKS-originated jobs persist on the dashboard. The first time the dashboard sees a job in the external OpenAPI plane, it copies the row into Table Storage with `owner_oid=""` (cluster-shared). Subsequent polls update the row's status/phase in place when the external plane has moved on.
- Delete actually deletes. Clicking the trash icon flips the row to a `deleted` tombstone that the list endpoint hides and that the next external sync respects (so the row stays gone forever, not just until the next poll).
- Broker outage no longer leaves zombie rows. If the Celery broker is unreachable at submit time, the row created moments earlier is immediately flipped to `failed / broker_unavailable`, and the API returns 503 so the dashboard surfaces a real error instead of a perpetual "queued" entry.
- Worker-died rows get reconciled. A new beat task scans all `queued / pending / running / reducing` rows every 60 s and brings them back to truth: Celery `FAILURE`/`REVOKED` → `failed`, `SUCCESS` → `completed`; otherwise it asks the external plane; failing that (silence past the stale threshold), it marks the row `failed / phase=worker_lost / error_code=worker_lost`.
- Multi-user environments work. External rows are stored with `owner_oid=""` and `list_for_owner` now matches `(owner_oid eq <caller> or owner_oid eq '')` so every user with ARM scope on the cluster sees the same cluster-shared jobs. The dashboard's own submit path still writes the caller's OID, so per-user privacy of submitted jobs is unchanged.
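
The reconciliation order in the "Worker-died rows" item can be read as a small decision ladder. The sketch below is one possible rendering under stated assumptions, not the real `reconcile_stale_jobs` task: the `decide` function, the `Decision` type, and the 900 s `stale_after` default are hypothetical (the document does not give the threshold), and it assumes the external plane's answer is a status string the row can adopt directly.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Decision:
    status: Optional[str]  # None means "leave the row untouched"
    phase: Optional[str] = None
    error_code: Optional[str] = None


def decide(
    celery_state: str,
    external_status: Optional[str],
    seconds_since_update: float,
    stale_after: float = 900.0,  # hypothetical threshold, not from the source
) -> Decision:
    """Pick the new state for one active row, in the order the text lists."""
    # 1. A terminal Celery state wins outright.
    if celery_state in ("FAILURE", "REVOKED"):
        return Decision("failed")
    if celery_state == "SUCCESS":
        return Decision("completed")
    # 2. Otherwise trust the external plane if it knows the job.
    if external_status is not None:
        return Decision(external_status)
    # 3. Nobody knows the job: give up only after prolonged silence.
    if seconds_since_update > stale_after:
        return Decision("failed", "worker_lost", "worker_lost")
    return Decision(None)
```

For example, `decide("PENDING", None, 10.0)` leaves a recently updated row alone, while the same call with a very large `seconds_since_update` yields the `worker_lost` outcome. Celery reports `PENDING` for task IDs it has no record of, which is why silence alone cannot be treated as failure without the age check.
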
API / IaC Diff Summary
Backend
- `api/services/state_repo.py`
  - `create()` handles `ResourceExistsError` by returning the existing row instead of raising, so concurrent sync calls are safe.
  - New `get_many(job_ids)` performs a single OData query across N PartitionKeys instead of N round-trips.
  - New `list_active(job_type='blast', limit)` returns rows in `queued / pending / running / reducing` for the reconciliation beat.
  - `list_for_owner()` now filters out tombstones and includes `owner_oid=""` rows.
- `api/routes/_blast_shared.py`
  - `_sync_external_jobs_to_table()` now returns `(created, updated, tombstoned_ids)`:
    - Existing row with status drift → `update`.
    - Existing row with no drift → no-op (no `jobhistory` row per poll).
    - Existing tombstoned row → recorded in `tombstoned_ids` so the caller drops it from the response.
  - New 15 s in-memory cache for `external_blast.list_jobs(**kwargs)` (`_external_list_jobs_cached`) collapses several near-simultaneous polls into one upstream HTTP request.
  - New `_reset_external_jobs_cache()` test hook.
- `api/routes/blast.py`
  - `/api/blast/jobs` now collects external candidates, runs the sync once, and uses the returned `tombstoned_ids` to drop tombstoned rows from the in-memory list (the missing tombstone check was the root cause of the "delete does nothing" bug).
  - `POST /api/blast/submit` catches the 503 from `_safe_delay` and flips the just-created row to `failed / phase=broker_unavailable / error_code=broker_unavailable` before re-raising.
- `api/tasks/blast.py`
  - New `reconcile_stale_jobs` task: scans active rows, consults Celery `AsyncResult`, falls back to the external plane, and marks long-silent rows `worker_lost`.
- `api/celery_app.py`
  - Beat schedule wires `reconcile_stale_jobs` to run every 60 s on the `blast` queue.
- `api/conftest.py`
  - Autouse fixture clears the external-jobs cache between tests so mocks cannot leak across cases.
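
The create / update / no-op / tombstone rules listed for `_sync_external_jobs_to_table()` can be illustrated with a plain dictionary standing in for Table Storage. This is a hedged sketch only: the function name `sync_external_jobs`, the `job_id` and `status` field names, and the choice to return `created` and `updated` as counts are assumptions, since the document names the tuple but not the types of its members.

```python
from typing import Dict, Iterable, List, Tuple

TOMBSTONE = "deleted"


def sync_external_jobs(
    external_jobs: Iterable[dict],
    table: Dict[str, dict],  # in-memory stand-in for the Table rows
) -> Tuple[int, int, List[str]]:
    """Merge external jobs into `table`; return (created, updated, tombstoned_ids)."""
    created = 0
    updated = 0
    tombstoned_ids: List[str] = []
    for job in external_jobs:
        job_id = job["job_id"]
        row = table.get(job_id)
        if row is None:
            # First sighting: store as a cluster-shared row.
            table[job_id] = {"status": job["status"], "owner_oid": ""}
            created += 1
        elif row["status"] == TOMBSTONE:
            # Never resurrect a deleted row; report it so the caller hides it.
            tombstoned_ids.append(job_id)
        elif row["status"] != job["status"]:
            row["status"] = job["status"]  # status drift: update in place
            updated += 1
        # Otherwise the row is unchanged: no write at all.
    return created, updated, tombstoned_ids
```

A caller in the spirit of the `/api/blast/jobs` change would then filter its in-memory list with something like `[j for j in external_jobs if j["job_id"] not in set(tombstoned_ids)]`, so a deleted job stays hidden even while the external plane keeps reporting it.
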
Frontend
- `web/src/pages/BlastJobs/useBlastJobsState.ts`
  - Delete mutation now invalidates both `["blast-jobs", …]` and `["blast-jobs-for-pulse", …]` and drops the per-job detail cache.

No IaC changes.
Validation Evidence
- `uv run pytest -q api/tests/` → 699 passed (was 676 before this change).
- `uv run ruff check api/` → all checks passed.
- `cd web && npm run build` → succeeded (existing chunk-size warning only).
- End-to-end against live Azure Table Storage `elbstg01`:
  - Direct `DELETE /api/blast/jobs/<id>` → Table row flips to `status=deleted`.
  - Subsequent `GET /api/blast/jobs?...` → tombstoned row is hidden.
  - Repeated polls (with external OpenAPI returning the same row) → row stays hidden, no resurrection.
  - Browser sequence (Trash → Permanently delete) → dashboard count goes from 6 to 5 to 4 jobs as deletes accumulate; no stale row reappears across reloads.
- Regression tests added:
  - `test_create_returns_existing_on_resource_exists`
  - `test_get_many_batches_into_single_query`
  - `test_list_active_filters_to_in_flight_states`
  - `test_list_for_owner_includes_cluster_shared_rows`
  - `test_sync_external_jobs_creates_missing_rows`
  - `test_sync_external_jobs_updates_drifted_status`
  - `test_sync_external_jobs_skips_unchanged_status`
  - `test_external_jobs_cache_serves_repeat_requests`
  - `test_sync_skips_tombstoned_deleted_rows`
  - `test_submit_marks_row_broker_unavailable_when_celery_down`
  - `test_reconcile_celery_failure_marks_row_failed`
  - `test_reconcile_celery_success_marks_row_completed`
  - `test_reconcile_skips_recently_updated_unknown_task`
  - `test_reconcile_marks_old_quiet_row_worker_lost`