Bound worker memory: exec_server output cap + Celery lifecycle limits
Motivation
Three orthogonal sources of unbounded growth across the worker / terminal sidecars:
1. exec_server._run_buffered used proc.communicate(), which loads the child's full stdout + stderr into the terminal sidecar's RAM (then another copy when we decode to UTF-8). A verbose elastic-blast submit or az --debug run could emit tens of MB and OOM the sidecar.
2. celery_app.conf had no worker_max_tasks_per_child, so a worker process accumulated allocator fragmentation plus one-shot dependency leaks (XML parsers, gzip buffers, K8s clients, Azure SDK pipelines) indefinitely. Steady BLAST traffic pushes the worker RSS into multi-GB territory before any restart.
3. celery_app.conf had no task_time_limit / task_soft_time_limit, so a hung terminal_exec stream, stuck Kubernetes wait, or runaway Storage call held a worker slot forever.
User-facing change
None directly. Steady-state worker / terminal RSS stays bounded. The HTTP
response from terminal_exec.run() now also carries stdout_truncated /
stderr_truncated booleans so callers can degrade cleanly if output was
capped.
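A minimal sketch of how a caller might degrade when output was capped. It assumes the response is a dict with a "stdout" key next to the new flags; that shape and the helper name summarize_exec_result are hypothetical and not taken from this PR.

```python
def summarize_exec_result(result: dict) -> str:
    """Return stdout for display, with a marker listing any capped streams.

    Assumption: `result` is the decoded response dict. Only the
    stdout_truncated / stderr_truncated flags come from this PR; the
    "stdout" key is a guess at the response shape.
    """
    capped = []
    # .get() tolerates older responses that predate the flags.
    if result.get("stdout_truncated"):
        capped.append("stdout")
    if result.get("stderr_truncated"):
        capped.append("stderr")

    text = result.get("stdout", "")
    if capped:
        text += "\n[output truncated: " + ", ".join(capped) + "]"
    return text
```

Reading the flags with .get() keeps the caller compatible with servers that have not been upgraded yet.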
API / IaC diff
- terminal/exec_server.py
  - _run_output_max_bytes() resolves the cap from EXEC_RUN_MAX_OUTPUT_BYTES (default 8 MiB) at request time, so ops can change the limit without a redeploy and tests can override it per call.
  - _drain_capped(pipe, cap) reads from a pipe until EOF, keeping at most cap bytes; over-cap bytes are discarded, but the pipe keeps being drained so the child does not block writing to a full pipe.
  - _run_buffered replaces proc.communicate() with two reader threads feeding _drain_capped. The response gains stdout_truncated / stderr_truncated flags.
- api/celery_app.py
  - worker_max_tasks_per_child=200 (override via CELERY_WORKER_MAX_TASKS_PER_CHILD).
  - task_soft_time_limit=3300 / task_time_limit=3600 (55 min soft limit, 1 h hard ceiling).
  - result_expires=3600 so Redis db 1 does not retain stale dicts (preemptively addresses #27 in the same config block).
- api/tests/test_terminal_exec.py adds test_run_truncates_stdout_above_cap to lock in the 64 KiB cap path end-to-end.
Validation
- uv run pytest -q api/tests/test_terminal_exec.py: 15 passed (new cap test included; concurrency/timeout coverage unchanged).
- uv run ruff check api/celery_app.py terminal/exec_server.py api/tests/test_terminal_exec.py: clean.