2026-05-22 — BLAST submit pipeline hardening (live logs, ANSI, error banner, missing re-export)¶
Motivation¶
Live monitoring of a dashboard-driven elastic-blast submit (job
04d04496-…-3b0d73) surfaced five defects that all manifested in the same
run:
- The Submit Job step stalled with a single "Starting elastic-blast submit helper job…" line for ~33 s, then dumped five lines at once right before the task crashed.
- The job ended in
failedafter 1 m 45 s witherror_code = "module 'api.tasks.blast' has no attribute '_tail_text'", even thoughelastic-blast submititself had streamed its output successfully. - The dashboard "Job Failed at Submit Job" banner showed
←[33m[parallel-prep] running 4 azcopy checks concurrently←[0m— a benign helper log line wrapped in raw ANSI colour codes — instead of the real AttributeError above. - After the parent row was reconciled to
failed, the Execution Steps timeline kept spinning on "Submit Job · ⅘ · Uploading workfiles" for 12+ minutes. The crashed worker had leftoutput.steps.submitting.status = "running"and the reconcile beat task wrotestatus="failed"throughrepo.update(...)directly, bypassing the payload merge that would have demoted the orphan step. - Worker chatter included Azure SDK INFO HTTP request/response dumps and
~1 Hz
AzureCliCredential.get_token_info succeededlines. That is noise only and not addressed in this change; tracked separately.
User-facing change¶
- Submit Job step now streams elastic-blast output line-by-line in near-real-time (subject to the existing 15 s state-write debounce). Long silent gaps caused by stdout block-buffering in the Python child are gone.
- The failure banner shows the authoritative orchestrator error
(
job.error/output.error) first, so a Celery task crash is no longer hidden behind the last benign helper log line. - ANSI colour codes (
\x1b[33m,\x1b[0m, cursor / erase sequences, …) are stripped before any state, log artefact, or UI surface sees them. - A submit that previously crashed in the post-stream tail snapshot now completes through to the K8s status refresh / artefact gate. The AttributeError class is also caught by a static cross-reference test so the same kind of re-export gap fails fast in CI instead of in production.
- The Execution Steps timeline stops spinning the moment the parent row is
marked
failed: orphanrunningstep entries are demoted tofailed(backend payload merge) and, defensively, the frontend resolves the failure phase before checkingstep.status === "running".
API / IaC diff summary¶
No HTTP route, Bicep, or Celery task name changed.
api/tasks/blast/__init__.py (Fix A)¶
Re-export _tail_text from api.tasks.blast.progress so
submit_task.py's post-stream _blast._tail_text(...) call resolves.
terminal/exec_server.py (Fix B)¶
_child_env() now sets env.setdefault("PYTHONUNBUFFERED", "1"). With
Popen(bufsize=0) alone, the child Python (elastic-blast) still
block-buffered stdout in 8 KB chunks when stdout was a pipe; the new env
flag forces line-by-line flush so the NDJSON /exec/stream response sees
each line as it is printed.
api/services/sanitise.py (Fix C)¶
Added _ANSI_CSI_RE = re.compile(r"\x1b\[[0-9;?]*[ -/]*[@-~]") and apply
it before the existing secret-masking passes. Removes colour, cursor, and
erase CSI sequences from any string headed to JSON state, append-blob log
artefacts, or the dashboard.
web/src/components/BlastStepTimeline/predicates.ts (Fix D)¶
Re-ordered getFailureText candidates so authoritative error fields
(job.error, output.error, output.message, customStatus.error,
customStatus.message) are checked before per-step error / output /
last_output. Previously a populated step.last_output masked a real
job.error.
api/tasks/blast/progress.py (Fix F)¶
_merge_progress_payload now sweeps every other step entry whose status
is running and demotes it to failed (with success=False,
source="orphan_inferred", completed_at=now, propagated error) whenever
the parent merge writes status="failed". Mirror of the existing
status != "failed" upgrade sweep that promotes earlier running steps to
completed.
api/tasks/blast/reconcile_task.py (Fix G)¶
The FAILURE/REVOKED and worker_lost branches now go through
_blast._update_state(...) instead of repo.update(...) directly. The
direct update wrote top-level status="failed" but never re-ran
_merge_progress_payload, which is exactly what left the orphan
submitting step spinning in the timeline.
web/src/components/BlastStepTimeline/stepState.ts (Fix E — defensive)¶
getTimelineStepState now checks FAILURE_PHASES.has(phase) before
isStepRunning(step). A step left in status: "running" while the parent
is in a failure phase resolves to error / done / skipped instead of
active, so even if the backend payload merge is bypassed (legacy rows,
external writers, future regressions) the UI cannot render an infinite
spinner.
web/src/components/BlastStepTimeline/StepLogSection.tsx (Fix H — dedup)¶
buildStepLog for submitting embeds the JobState snapshot under
--- Live Console Output ---. StepLogSection also subscribes to the SSE
event bus and used to append a second --- Live Stream --- block, so the
same lines rendered twice (the snapshot is just a debounced sample of the
same stream). New stripConsoleOutputBlock(log) helper removes the
snapshot block when SSE has events for the step, then a single
--- Live Console Output --- block is appended from the live stream.
web/src/components/BlastStepTimeline/constants.ts (Fix I — submitted mapping)¶
PHASE_TO_STEP now maps submitted → "running". submitted is the
transit phase between submit completion and the first poll_running_status
tick reporting pods=Running (10-30 s). Without the mapping, every step
resolved to pending and the timeline showed no active spinner during the
gap, looking dead even though the worker was waiting on K8s. Added a
matching PHASE_MESSAGES["submitted"] so the BLAST Run step shows
"Job accepted by AKS. Waiting for pods to start running..." instead of the
default placeholder.
scripts/dev/local-run.sh (Fix J — reap stale celery workers/beat)¶
Live monitoring caught the same _tail_text AttributeError reappearing
after the original Fix A was already applied. Root cause: orphaned celery
worker processes from earlier worker: start runs (the supervisor was
killed, but the prefork master + forks were reparented to PID 1 and kept
consuming from Redis with stale, pre-fix code). New submits landed
randomly on whichever worker was first, so the bug appeared intermittent.
The worker) and beat) branches now pkill -TERM / pkill -KILL any
matching python3 -m celery -A api.celery_app:celery_app worker and
celery -A api.celery_app beat processes before exec-ing the wrapper —
mirroring the api) branch's port-lock guard. Without this, every
worker: start invocation could silently add a duplicate worker.
web/src/components/BlastStepTimeline/buildStepLog.ts (Fix K — misleading polls text)¶
The running done branch previously rendered
✓ BLAST completed after ${polls ?? "?"} polls (~${(polls ?? 0) * 30}s).,
producing the literal text "after ? polls (~0s)" for short runs whose
state row never carried a polls field. Now the elapsed value comes from
the closed step's duration_ms (always present) and the polls phrase is
omitted unless polls is a positive number. The exporting_results
done branch also dropped its "(no output)" placeholder so empty export
logs no longer look like a missing capture.
web/src/components/BlastStepTimeline/StepLogSection.tsx (Fix L — open by default)¶
isOpen = expanded[step.key] ?? (state === "active" || state === "error")
keeps only the currently-active or failed step expanded by default. An
earlier iteration auto-expanded every non-pending step (so completed
runs landed with every step open), but the user found that too busy;
collapsed-by-default for done/skipped steps is the canonical look, with
the user toggling individual steps as needed.
web/src/pages/BlastResults.tsx (Fix L2 — auto-switch to Descriptions on completion)¶
A previousPhaseRef watches effectivePhase; on the rising edge of
completed while the user is still on tab=run (typically because the
page auto-routed them there during submit), the URL is rewritten to
tab=descriptions so the analytics view shows up the moment results
are ready. The transition only fires once per phase change, so a user
manually navigating back to run on an already-completed job is NOT
flipped away again.
web/src/hooks/useStickToBottom.ts + web/src/pages/blastResults/ExecutionStepsCard.tsx (Fix M — auto follow tail)¶
New window-scope useStickToBottom({ version, enabled }) hook
implements the "follow the tail" pattern: scroll to the bottom on mount
and on every version change while the user remains anchored within
96 px of the bottom; pause auto-scroll once the user manually scrolls
up; re-arm when they scroll back to the bottom. ExecutionStepsCard
composes the version from (phase, updated_at, submitting.log_line_count)
so every new log line or step transition triggers a scroll. The user
lands on the latest output without manually scrolling and stays glued to
new output as the run progresses.
web/src/pages/blastResults/useBlastResultsState.ts (Fix N — terminal backfill polling)¶
refetchInterval used to return false the moment the job entered any
terminal phase. But the reconcile beat writes trailing artefacts —
notably the K8s pod log tail into running.last_output — AFTER
phase=completed, so the dashboard stayed pinned to the partial
pre-reconcile snapshot (the "BLAST completed after ? polls" placeholder
without pod logs). Now refetchInterval returns
TERMINAL_BACKFILL_POLL_INTERVAL_MS = 10s until updated_at ages
beyond TERMINAL_BACKFILL_WINDOW_MS = 5 min, at which point polling
stops entirely. Combined with Fix K, completed runs now show their full
K8s pod log tail. Both jobQuery and executionStepsQuery now also set
refetchOnWindowFocus: true + refetchOnMount: "always" so a user
returning to the page hours later still pulls fresh state instead of the
tail-end cached snapshot.
api/routes/blast/jobs.py (Fix P1 — execution-steps endpoint prefers live state)¶
The /api/blast/jobs/{id}/execution-steps route used to return the
persisted snapshot blob first and fall back to live state only when the
blob was missing. The snapshot is written ONCE by
finalize_job_artifacts at the moment the job reaches a terminal phase;
trailing artefacts (K8s pod log tails on running.last_output, …) can
be backfilled later but the snapshot is NEVER re-written automatically.
That made the dashboard show stale "BLAST completed." with no pod logs
even though running.last_output had the tail. Now the endpoint:
prefers live Table state, falls back to the persisted blob only if the
Table read raises, and 404s if both are unavailable. Locked by
test_blast_execution_steps_route.py (3 tests).
api/tasks/blast_artifacts.py (Fix P3 — self-retry on empty pod log capture)¶
finalize_job_artifacts previously upserted artifact_state="ready"
even when persist_completed_job_pod_logs returned {} — leaving a
frozen snapshot with no pod logs forever, because
artifact_build_should_enqueue then refuses to re-enqueue the task. The
function now accepts a pod_log_attempt kwarg (defaults to 1) and, when
the persistence returned empty, schedules a delayed self-retry via
apply_async(countdown=60) up to _POD_LOG_RETRY_MAX = 3 attempts.
K8s pods often haven't finished flushing stdout when the job container
exits; the delayed retries give them 60-180 s to flush their tail
before the artefact bundle is considered final.
Validation evidence¶
uv run pytest -q api/tests
943 passed in 27.65s
uv run pytest -q api/tests/test_sanitise.py api/tests/test_terminal_exec.py \
api/tests/test_blast_tasks.py
140 passed in 13.69s
cd web && npm test -- predicates
✓ src/components/BlastStepTimeline/predicates.test.ts (4 tests) 2ms
Test Files 1 passed (1)
Tests 4 passed (4)
cd web && npm run build
✓ built in 7.05s
uv run ruff check api/tasks/blast/__init__.py api/services/sanitise.py \
terminal/exec_server.py \
api/tests/test_sanitise.py api/tests/test_terminal_exec.py \
api/tests/test_blast_tasks.py
All checks passed!
New regression tests¶
| Test | File | Guards |
|---|---|---|
test_submit_task_helpers_are_reexported_on_blast_package |
api/tests/test_blast_tasks.py |
Cross-reference every _blast.X name used by submit_task and friends against api.tasks.blast.__all__ / hasattr so a missing re-export fails at import-time instead of mid-submit. |
test_merge_progress_payload_demotes_orphan_running_steps_on_failed_update |
api/tests/test_blast_tasks.py |
Lock the orphan-running sweep on status="failed" merges. |
test_strips_ansi_csi_color_codes, test_strips_ansi_with_cursor_and_erase_sequences |
api/tests/test_sanitise.py |
Lock ANSI stripping into the sanitise contract. |
test_child_env_forces_pythonunbuffered |
api/tests/test_terminal_exec.py |
Stub az that echoes PYTHONUNBUFFERED and assert the spawned subprocess sees =1. |
getFailureText prefers job.error (+ 3 siblings) |
web/src/components/BlastStepTimeline/predicates.test.ts |
Lock the new candidate ordering. |
never renders a spinner when the parent job is in a failure phase |
web/src/components/BlastStepTimeline/stepState.test.ts |
Lock that an orphan submitting.status="running" no longer produces active under phase="failed". |
activates BLAST Run while phase is the transit submitted |
web/src/components/BlastStepTimeline/stepState.test.ts |
Lock PHASE_TO_STEP["submitted"] = "running" so the timeline does not sit silent for 10-30 s after submit. |
stripConsoleOutputBlock (+ 3 siblings) |
web/src/components/BlastStepTimeline/StepLogSection.test.ts |
Lock the SSE / snapshot de-duplication. |
Manual evidence captured during the failing run¶
- Job state snapshot (
GET /api/blast/jobs/04d04496-…) immediately after the crash showed: status="failed",error="module 'api.tasks.blast' has no attribute '_tail_text'",output.steps.submitting.last_outputcontaining raw\x1b[33m…\x1b[0mcodes — exactly the string the UI rendered as the failure message.submitting.started_at=17:28:04,updated_at=17:29:01(one state write in 57 s) withlog_line_count=5→ confirms the pipe-buffering symptom Fix B targets..logs/monitor/run-022727Z.log(kept locally, not checked in) captured the full sequence includingAttributeError: module 'api.tasks.blast' has no attribute '_tail_text'in the worker traceback.