2026-05-21 — Persist k8s pod logs at BLAST job finalization¶
Motivation¶
After 2026-05-21 BLAST log discovery fix
the live SSE stream now finds blast / results-export / init-ssd pods
correctly while a job is running, but those pod logs were still live-only.
Once the SSE client disconnected or the job reached a terminal phase, the
real BLAST execution output evaporated — only staging_db.last_output
(the submit task's azcopy stdout) survived. That is the literal second
half of the user complaint "실행 로그가 azcopy 로그 정도만 나오는데".
User-facing change¶
When a BLAST job transitions to a terminal phase (completed / failed),
the artifact finalizer now fetches a tail of every matching pod/container
log via the Kubernetes API and persists it to:
- Per-step
last_outputsummary inpayload._progress.steps.<phase>(truncated to ~6 KB with head + tail markers if long). This is what the Running / Exporting Results step log block shows in the UI. - Chunked log artifacts in the platform Storage account at
{job_id}/execution-steps/logs/<phase>/<seq>.json(≤ 100 events per chunk). These mirror the format already used by the submit task and keep the full tail (up to 200 lines per container) for forensic review even when the inline summary is truncated.
Containers included in the inline summary are limited to the "primary" producers per phase so the panel stays focused:
| Phase | Primary containers |
|---|---|
running |
blast, results-export |
exporting_results |
results-export |
staging_db |
get-blastdb, import-query-batches |
warming_up |
vmtouch |
Other containers (logger, sidecar shims) still land in the chunked
artifacts but stay out of the summary blob.
API / IaC diff summary¶
- New module
api/services/job_logs/persist.py
with
persist_completed_job_pod_logs(credential, state). Idempotent; never raises (all I/O failures degrade to loggedINFO). - api/services/job_logs/k8s.py:
added
fetch_k8s_pod_log_tail(...)— afollow=falsevariant ofstream_k8s_log_linesthat safely fetches the tail of terminated pods. - api/tasks/blast_artifacts.py:
finalize_job_artifactscalls the persist helper beforewrite_execution_steps_snapshotso the snapshot picks up the mergedlast_outputblobs. Errors are caught and logged. - No Bicep / IaC changes. No new dependency. No browser code change.
Security¶
- All Kubernetes calls go through the existing
_get_k8s_sessionhelper (shared user-assigned MI, no Run Command, charter §11). - Lines are sanitised via
api.services.sanitise.sanitise(...)and truncated to 4 000 chars per line before persistence — same contract as the live SSE path. - Storage writes go through the existing
write_execution_log_chunk/repo.updatecodepaths which already use the platform MI and private endpoints.publicNetworkAccessremainsDisabled.
Validation¶
uv run pytest -q api/tests/test_job_log_persist.py→ 4 new tests pass (groups by phase, writes chunks, skips when inputs missing, skips when no targets, does not clobber longer existinglast_output).uv run pytest -q api/tests→ 864 passed (full suite).uv run ruff check api→ clean.
Out of scope¶
- Re-fetching logs after pod garbage-collection. Kubernetes deletes
completed pods after ~6 h by default. If the finalizer runs after the
pod is gone, the helper logs an
INFOand falls back to whatever was already persisted by the live SSE path. A "log retention via CronJob → blob copy" pattern would close that window but is a separate IaC change. - Streaming live
last_outputupdates while the job is still running. Persistence is intentionally a single tail snapshot at finalize; live monitoring continues to flow through SSE as before.