Dashboard polling bottleneck removal
Date: 2026-05-22
Motivation
The dashboard's auto-refresh loop fans out 15–20 parallel queries per tick (every 30 s by default, configurable down to 5 s). Profiling the request path surfaced three avoidable hot spots that did not need any change in UX, API contracts, or polling cadence to fix:
- `RequestIdMiddleware` was draining the entire response body iterator into memory and rebuilding the response for every `/api/*` request so the per-request HTTP inspector panel could capture the payload. The high-volume monitor / blast polling GETs (AKS, storage, ACR, jobs, databases) paid this buffering cost on every tick even though their inspector value is low: the dashboard refetches them constantly and the same payload dominates the ring buffer, pushing one-shot calls (POST submit, DELETE) out.
- `k8s_warmup_status` made up to 18 sequential Kubernetes API calls per invocation (6 top-level reads + node list + pod list + up to 12 pod log tails), called once every 60 s per cluster from `useClusterDbChips`. All six top-level reads are independent of each other; the per-pod log fetches are likewise independent.
- `_get_k8s_session` built a brand-new `requests.Session` on every call, wrote the CA / client cert / client key to fresh `NamedTemporaryFile` handles, then unlinked them when the caller closed the session. The dashboard hits 7+ K8s helpers per tick, so this was 7+ session constructions + ~14 temp-file writes/deletes per tick per cluster with zero connection pool reuse across calls.
None of these required changing polling intervals, API contracts, or the inspector behaviour for the calls users actually care about.
User-facing change
- Faster dashboard refresh ticks for AKS / storage / ACR / jobs cards (response body is no longer materialised in the middleware for those GETs; HTTP keep-alive is reused across K8s reads).
- `k8s_warmup_status` wall time goes from "sum of 18 serial GETs" to "max of one parallel fan-out", which gives a visibly snappier warmup chip strip on clusters with many warmup pods.
- HTTP inspector panel still captures every POST/PUT/PATCH/DELETE and every non-polling GET. The polling GETs listed in `_INSPECTOR_EXCLUDE_GET_PREFIXES` are simply not buffered.
No UI changes, no API schema changes, no polling cadence changes, no new configuration required.
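The capture rule described in this section can be sketched roughly as follows. This is an illustration only, not the code in `api/main.py`: the function name `should_capture`, the `startswith` matching, and the `/api/` guard are assumptions. Only the prefix list, the `method="POST"` default, and the `None` normalisation come from this note.

```python
from typing import Optional

# Prefixes as listed in this note. Matching by str.startswith is an assumption.
_INSPECTOR_EXCLUDE_GET_PREFIXES = (
    "/api/monitor/aks",
    "/api/monitor/storage",
    "/api/monitor/acr",
    "/api/monitor/terminal",
    "/api/monitor/cluster",
    "/api/monitor/jobs",
    "/api/blast/jobs",
    "/api/blast/databases",
    "/api/warmup",
    "/api/me",
)


def should_capture(path: str, method: Optional[str] = "POST") -> bool:
    """Hypothetical stand-in for _inspector_should_capture.

    Returns True when the inspector should buffer the response body.
    """
    # Assumption: only /api/* traffic is ever inspected.
    if not path.startswith("/api/"):
        return False
    # A missing method falls back to the non-GET default, so None never
    # reaches .upper().
    normalised = (method or "POST").upper()
    if normalised != "GET":
        return True  # POST/PUT/PATCH/DELETE are always captured
    # GETs are captured unless they hit a high-volume polling prefix.
    return not path.startswith(_INSPECTOR_EXCLUDE_GET_PREFIXES)
```

With this shape, a single-argument call such as `should_capture("/api/blast/jobs")` behaves like a non-GET call and is still captured, which is how old call sites would keep their previous behaviour.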
API / IaC diff summary
- `api/main.py`
  - New `_INSPECTOR_EXCLUDE_GET_PREFIXES` tuple covering high-volume polling GETs (`/api/monitor/aks`, `/api/monitor/storage`, `/api/monitor/acr`, `/api/monitor/terminal`, `/api/monitor/cluster`, `/api/monitor/jobs`, `/api/blast/jobs`, `/api/blast/databases`, `/api/warmup`, `/api/me`).
  - `_inspector_should_capture(path, method="POST")`: method-aware overload. Old single-arg call sites keep working (default treats them as non-GET).
  - `RequestIdMiddleware` passes the request method through so the polling GET exclusion can fire.
- `api/services/k8s_monitoring.py`
  - `k8s_warmup_status` fans out six independent reads via `ThreadPoolExecutor(max_workers=6)`. Phase-2 dependents (`_mark_stale_warmup_nodes`, `_warmup_pods_and_logs`) also run in parallel once the warmup-jobs response is in.
  - `_warmup_pods_and_logs` parallelises up to 12 pod-log fetches with `ThreadPoolExecutor(max_workers=min(12, len(pod_names)))`.
- `api/services/k8s/client.py`
  - New `_K8sSessionEntry` + `_K8S_SESSION_POOL` keyed by `(subscription_id, resource_group, cluster_name, admin)` with a 300 s TTL (override via `K8S_SESSION_POOL_TTL_SECONDS`).
  - Per-entry TTL is clamped by both the kubeconfig material's own `expires_at` and (for Bearer-auth sessions) the AAD token's `expires_on` minus a 60 s safety margin, so a pooled session never outlives its underlying credentials.
  - Pool size capped at `_K8S_SESSION_POOL_MAX_ENTRIES = 32`; when the cap is exceeded the entry closest to expiry is evicted first.
  - When the effective TTL collapses to non-positive (e.g. the AAD token is about to expire), `_get_k8s_session` hands out a one-shot non-pooled session whose `close()` does a real teardown + temp-file unlink, preserving the historical contract for callers that use `try: ... finally: session.close()`.
  - `_get_k8s_session` returns a pooled session on hits; `session.close()` is overridden to a no-op for pooled sessions so existing `try: ... finally: session.close()` call sites release back to the pool instead of tearing down the connection pool + temp files.
  - `reset_k8s_session_pool()` test helper + `atexit` drain so interpreter exit unlinks temp files.
  - `_retire_entry` reuses `requests.Session.close` directly to bypass the no-op override.
- `api/services/k8s_monitoring.py`
  - Re-exports `reset_k8s_session_pool` for symmetry with `reset_k8s_credential_cache`.
- `api/conftest.py`
  - Autouse fixture now also calls `reset_k8s_credential_cache()` + `reset_k8s_session_pool()` before and after each test so the pool cannot leak state across tests.
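The TTL clamp and the eviction order for the session pool can be illustrated with two small pure functions. This is a sketch under stated assumptions: the names `effective_ttl`, `pick_eviction_victim`, and `PoolEntry` are hypothetical, and timestamps are assumed to be epoch seconds. The 300 s default, the 60 s safety margin, and the "closest to expiry goes first" rule are taken from the summary above.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

POOL_TTL_SECONDS = 300.0             # default pool TTL from this note
TOKEN_SAFETY_MARGIN_SECONDS = 60.0   # AAD token safety margin from this note

# (subscription_id, resource_group, cluster_name, admin)
PoolKey = Tuple[str, str, str, bool]


@dataclass
class PoolEntry:
    """Hypothetical stand-in for _K8sSessionEntry."""
    session: object
    expires_at: float  # absolute epoch seconds


def effective_ttl(
    now: float,
    material_expires_at: float,
    token_expires_on: Optional[float] = None,
    pool_ttl: float = POOL_TTL_SECONDS,
) -> float:
    """Clamp the pool TTL by the lifetime of the underlying credentials."""
    ttl = min(pool_ttl, material_expires_at - now)
    if token_expires_on is not None:
        # Bearer-auth sessions also respect the token expiry minus the margin.
        ttl = min(ttl, token_expires_on - TOKEN_SAFETY_MARGIN_SECONDS - now)
    return ttl


def pick_eviction_victim(pool: Dict[PoolKey, PoolEntry]) -> Optional[PoolKey]:
    """Return the key of the entry closest to expiry, or None if empty."""
    if not pool:
        return None
    return min(pool, key=lambda key: pool[key].expires_at)
```

A non-positive result from `effective_ttl` corresponds to the case where the caller would receive a one-shot, non-pooled session instead of a pooled one.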
Validation
- `uv run pytest -q api/tests`: 1067 passed in 32.69 s (was 1022 baseline; +45 new focused regression tests across two hardening rounds).
- `uv run ruff check api`: clean.
- New focused tests:
  - `api/tests/test_inspector_exclude.py`: method-aware exclusion, backward-compat single-arg call sites, polling vs non-polling GETs.
  - `api/tests/test_k8s_session_pool.py`: pool reuse, key isolation (admin / cluster), TTL clamp by material expiry, TTL clamp by AAD token expiry with safety margin, throwaway path actually unlinks temp files, max-entries eviction picks the soonest-expiring entry, pooled `close()` is a no-op until `reset_k8s_session_pool()` retires the entry, eviction never holds the pool lock during retire IO, throwaway `close()` is idempotent, `K8S_SESSION_POOL_MAX_ENTRIES` env override clamps into [1, 4096].
  - `api/tests/test_k8s_warmup_status_parallel.py`: every expected Kubernetes URL is issued exactly once across the parallel fan-out, non-200 responses degrade to empty defaults instead of raising, pod-log fan-out finishes in well under serial time.
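The fan-out behaviour those warmup tests check (each URL issued once, non-200 degrading to an empty default) can be sketched as below. This is not the code in `k8s_warmup_status`: the helper name `fetch_all` and the `get(url) -> (status_code, payload)` callable are assumptions made to keep the example self-contained, and the empty default is assumed to be `{}`.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

# Assumed shape of the HTTP helper: returns (status_code, decoded_payload).
Getter = Callable[[str], Tuple[int, dict]]


def fetch_all(urls: List[str], get: Getter, max_workers: int = 6) -> Dict[str, dict]:
    """Issue one independent GET per URL in parallel.

    Non-200 responses degrade to an empty dict instead of raising.
    """
    if not urls:
        # ThreadPoolExecutor rejects max_workers=0, so return early.
        return {}

    def read(url: str) -> dict:
        status, payload = get(url)
        return payload if status == 200 else {}

    workers = min(max_workers, len(urls))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in input order, so zip pairs them up.
        return dict(zip(urls, pool.map(read, urls)))
```

Because the reads are independent, wall time is bounded by the slowest request in the batch rather than by the sum of all requests, as long as the batch fits within `max_workers`.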
Critical-review hardening (round 2)
After the initial implementation a second critical-review pass turned up five real risks; each was fixed and locked in by a regression test:
- Lock held during IO: `_get_k8s_session`'s slow path used to call `_retire_entry(...)` (TCP teardown + temp-file unlink) while holding `_K8S_SESSION_POOL_LOCK`. Every other `_get_k8s_session` caller across every cluster blocked on that IO. Fixed by collecting victims under the lock and retiring them after release. Covered by `test_pool_lock_released_during_retire_io`.
- `atexit` deadlock risk: `_atexit_drain_pool` used a blocking acquire; daemon threads forcibly terminated during interpreter shutdown could hold the lock and deadlock the atexit chain. Now uses `lock.acquire(blocking=False)` and silently skips on contention.
- Hardcoded pool cap: `_K8S_SESSION_POOL_MAX_ENTRIES = 32` was not configurable, inconsistent with the TTL helpers. Now routed through `_k8s_session_pool_max_entries()` which honours `K8S_SESSION_POOL_MAX_ENTRIES` clamped into `[1, 4096]`. Covered by `test_max_entries_env_override`.
- `reset_k8s_session_pool` swallows sibling failures: the retire loop now isolates per-entry exceptions so one bad entry cannot strand the rest.
- `_inspector_should_capture(method=None)`: defensive normalisation so a caller forwarding an unset header cannot crash the middleware with `AttributeError: 'NoneType' object has no attribute 'upper'`.
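The first two fixes follow a common locking pattern: decide what to evict while holding the lock, do the slow teardown after releasing it, and never block on the lock during interpreter exit. A minimal sketch, assuming a hypothetical `SessionPool` class with a caller-supplied `retire` callback (the real module uses module-level state and `_retire_entry`):

```python
import threading
from typing import Callable, Dict, List, Tuple


class SessionPool:
    """Hypothetical pool illustrating 'collect under lock, retire after'."""

    def __init__(self, max_entries: int, retire: Callable[[object], None]) -> None:
        self._lock = threading.Lock()
        self._entries: Dict[str, Tuple[float, object]] = {}  # key -> (expires_at, session)
        self._max_entries = max_entries
        self._retire = retire  # slow IO: TCP teardown, temp-file unlink

    def put(self, key: str, session: object, expires_at: float) -> None:
        victims: List[object] = []
        with self._lock:
            self._entries[key] = (expires_at, session)
            while len(self._entries) > self._max_entries:
                # Evict the entry closest to expiry first.
                soonest = min(self._entries, key=lambda k: self._entries[k][0])
                victims.append(self._entries.pop(soonest)[1])
        # The lock is released here, so other callers are not blocked on IO.
        for victim in victims:
            self._retire(victim)

    def drain_at_exit(self) -> bool:
        """Drain without blocking; return False if the lock is contended."""
        if not self._lock.acquire(blocking=False):
            return False  # skip silently rather than risk a shutdown deadlock
        try:
            victims = [session for _, session in self._entries.values()]
            self._entries.clear()
        finally:
            self._lock.release()
        for victim in victims:
            try:
                self._retire(victim)
            except Exception:
                pass  # one bad entry must not strand the rest
        return True
```

Registering `drain_at_exit` with `atexit.register` would give the shutdown behaviour described above; the per-entry `try`/`except` mirrors the failure isolation added to `reset_k8s_session_pool`.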
No frontend or infra changes — Tier 1 + Tier 2a validation only, no redeploy required per .github/copilot-instructions.md §13.