2026-05-15 — Sidecars event-driven animation: critical hardening pass¶
Motivation¶
Post-implementation review of the new event-driven Sidecars animation (2026-05-15-sidecars-event-driven-animation.md) turned up five real defects that would surface under sustained load, adversarial network conditions, accessibility settings, or normal multi-tab usage. None were show-stoppers in light dev use, but each would cause a memory or latency regression in a longer-running session.
| # | Severity | Defect |
|---|---|---|
| L1 | leak | Users with prefers-reduced-motion: reduce get CSS animation: none, so onAnimationEnd never fires. Spawned particles accumulate in React state forever. |
| L2 | leak | Same root cause for hidden / collapsed / off-screen cards (display: none, background tab, dashboard collapsed): no animation event → no cleanup. |
| H1 | bottleneck | emit() used a 0.5 s socket timeout and only disabled the client on from_url failure. Each in-flight request paid up to 500 ms when Redis was flaky, and drain() had no parallel circuit-breaker. |
| H2 | race | Both the SSE stream and the /sidecars polling-fallback drained the same Redis hash atomically. With multiple concurrent readers (multi-tab, polling-on-error, monitoring probes) events scatter across consumers and badges silently drop. |
| M1 | re-render | setLastCounts(rawCounts) always created a new object reference, forcing a re-render every snapshot tick even when counts hadn't changed. |
Fixes (backend)¶
- Circuit breaker tuning (
api/services/event_emitter.py): - Defaults dropped to
EVENT_EMIT_CONNECT_TIMEOUT_SECONDS=0.05/EVENT_EMIT_SOCKET_TIMEOUT_SECONDS=0.05(override via env). Loopback Redis is sub-millisecond; anything above 50 ms is already a failure we'd rather record than wait for. _FAILURE_COOLDOWN_SECONDS=5.0keyed ontime.monotonic(). Trips on anyRedisErrorfrom_get_client,emit, ordrainand suppresses further attempts until the window elapses._int_env/_float_envhelpers swallow malformed env values and fall back to safe defaults — prevents a typo'd env var from crashing import._normalise_count(count)clamps atEVENT_EMIT_MAX_COUNT=1000so a misbehaving caller can't pin Redis writing huge integers.drain()now also arms the breaker on Redis error — same code path asemit(). Stops the next 5 s SSE tick from re-paying the full timeout while Redis is misbehaving.- No-drain snapshot mode (
api/services/sidecar_metrics.py,api/routes/monitor.py): collect_snapshot(..., drain_events: bool = True). WhenFalse, the function returns an all-zero events dict without ever touching the Redis hash.GET /api/monitor/sidecars(the SPA's initial mount + polling fallback) now calls withdrain_events=False. The SSE stream/api/monitor/sidecars/eventskeeps draining — it is the canonical drainer. This eliminates the poll-vs-SSE half of H2: only the live SSE consumer pulls events out of Redis.- In-process SSE fan-out broadcaster (
api/routes/monitor.py,_SidecarBroadcaster): - One background
asyncio.Taskowns the Redis drain. Every 5 s it callscollect_snapshot(drain_events=True)once and pushes the pre-serialised SSE frame into every subscriber's boundedasyncio.Queue. Two browser tabs (or any number of EventSource consumers) now see the same event counts every tick — neither can steal from the other. - Lifecycle: first
subscribe()spawns the drain task; lastunsubscribe()cancels it (no Redis traffic when nobody's watching). FastAPI lifespan callsclose()to wake any in-flight consumers with a sentinel before the process exits. - Slow-subscriber policy: per-queue cap is 8 frames; on overflow the oldest frame is dropped so the freshest snapshot still gets through. Currentness > completeness for a monitoring UI.
- Initial snapshot for new subscribers is captured with
drain_events=Falseunder the broadcaster lock, so a connecting tab never steals a tick from the running drain.
Fixes (frontend)¶
- Reduced-motion + visibility-aware spawning
(
web/src/components/cards/SidecarsCard.tsx): - New
useReducedMotion()andusePageVisible()hooks. useEventParticlesearly-returns before spawning DOM particles when either flag is set. ThelastCountsbadge still updates so the per-row counter stays accurate.- Hard timeout cleanup (the L1/L2 fix proper):
- Each spawned particle gets a
setTimeout(remove, PARTICLE_LIFETIME_MS + delay*1000).onAnimationEndis still wired up; the timer is the belt to the animation suspenders. Idempotentremovemakes both paths safe. timersRefis aMap<number, Handle>; an unmount-cleanup effect clears all pending timers.- Queue hard cap:
PARTICLE_QUEUE_HARD_CAP = 64. New particles drop the oldest when over the cap (FIFO). Even pathological bursts (Redis hash returning 1000 events for one row) can't grow the React state past 64 nodes per row. - Stable counts:
setLastCountsonly fires when at least one field actually changed, killing the no-op re-render every 5 s.
Tests¶
api/tests/test_event_emitter.py(now 13 tests, all pass):test_invalid_tuning_env_values_fall_backtest_emit_clamps_large_countstest_drain_clamps_bad_counter_valuestest_emit_opens_cooldown_after_redis_errortest_drain_failure_opens_cooldown(new — drain breaker symmetry)test_collect_snapshot_skips_drain_when_disabled(new — H2 race fix)api/tests/test_sidecar_broadcaster.py(new — 4 tests, all pass):test_two_subscribers_see_identical_frames— proves H2 fan-out.test_drain_task_stops_when_last_subscriber_leaves— no Redis traffic when the dashboard is closed.test_slow_subscriber_drops_oldest_not_block_others— slow consumers can't deadlock the broadcaster.test_close_wakes_subscribers_with_sentinel— lifespan shutdown is graceful.- Full backend suite:
uv run pytest -q api/tests→ 120 passed. npm run build(inweb/) → clean.
Validation evidence¶
End-to-end against the local 6/6 compose stack (http://localhost:18080/):
$ docker exec elb-control-local-redis-1 redis-cli -n 2 DEL sidecar:events
$ for i in $(seq 1 20); do
curl -fsS -X POST http://127.0.0.1:18080/api/health/celery/enqueue-noop >/dev/null &
done
$ for i in $(seq 1 8); do curl -fsS http://127.0.0.1:18080/api/me >/dev/null & done
$ wait
$ docker exec elb-control-local-redis-1 redis-cli -n 2 HGETALL sidecar:events
row1 28 # 20 enqueue-noop + 8 me — every emit landed
row2 20 # 20 enqueue-noop — every emit landed
Browser DOM right after a 30-request burst against the running SPA:
i.e. the row badge and (during the 1.6 s window) particle render. CPU
on the affected sidecars correspondingly spiked (api 95% / 56%,
worker 13.2%, beat 13.6%).
Multi-subscriber proof (in-process Python harness opening two SSE streams against the running compose stack, then bursting 15 + 7 requests):
consumer A non-zero frames: [{'row1': 22, 'row2': 15, 'row3': 0, 'row4': 0}]
consumer B non-zero frames: [{'row1': 22, 'row2': 15, 'row3': 0, 'row4': 0}]
identical? True
Both consumers see the same counts — the broadcaster is the sole
drainer, no event was stolen from either side. (22 = 15 enqueue + 7 me
on row1, 15 = 15 enqueue on row2.)
Out of scope¶
- No new dependencies.
- No legacy/ touched.
- No SAS issuance, no Storage public-access change.