2026-05-16: AKS bento truthfulness + noise hardening
Motivation
Visual + code review of the production AKS bento card on a live cluster turned up 20+ "the card lies / the card is noisy" findings ([previous session in this conversation]):
- `Healthy` badge was rendered while `API p95 = 2352 ms` (red): the health classifier branched over `power_state`, `cpuPct`, `memPct`, `apiErrors`, `failed15m` but not `p95`.
- "Submit pipeline · 15m" headline read `0 submits` while the sparkline underneath drew a peak: the spark was actually `metrics.rpm` (i.e. every `/api/blast/*` HTTP request, including dashboard polls) and not the submit timeline.
- "CPU 1% / Memory 1%" because the cluster-wide average dilutes a hot user-pool node against four idle system nodes.
- Live activity rail filled with eleven near-identical `[RemovingNode] node/aks-blastp16v3-vmss00000N` rows (one per node), all with green check icons, all timestamped "56s ago".
- Topology said `NODES 3 · POOLS 2` while the events showed ten vmss ordinals, which made the scope confusing.
- Active jobs and Recent runtime cells both surfaced `job state store unavailable` from the same root cause.
User asked to play critic and implement every cycle ("go through every step, then critique and harden").
User-facing change
Truthfulness fixes
| # | Before | After |
|---|---|---|
| 1 | `Healthy` while `API p95 = 2352 ms`. | `Degraded` once p95 > 2000 ms (configurable `P95_DEGRADED_MS`). |
| 2 | Hero sparkline = `/api/blast/*` request RPM, mislabelled as "submit pipeline". | Hero sparkline is now a per-minute submit timeline built from `clusterJobs.created_at`. Annotated with `peak: N/min` and `Submits per minute · last 60m`. The original RPM signal moved to the Pulse strip as a dedicated `API RPM · peak/min` KPI. |
| 3 | `CPU 1% / Memory 1%` (cluster-wide average). | `CPU peak X% · avg Y%` and `Mem peak X% · avg Y%`, where peak is the most-loaded user-pool node from `/api/monitor/k8s/top-nodes`. Avg is shown as the small grey hint. |
| 4 | `NODES 3 · POOLS 2` (configured count only). | `NODES N ready · user M` from live `nodeSummary`, with an `M not-ready` hint when applicable. `POOLS system X · user Y` derived from `agent_pools[].mode`. |
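
The peak-versus-average split in row 3 can be illustrated with a minimal sketch. The `NodeUsage` shape, its `poolMode` field, and the helper name `peakAndAvg` are assumptions made for illustration; the real `/api/monitor/k8s/top-nodes` payload and the card's own code may look different.

```typescript
// Hypothetical per-node usage row; NOT the real top-nodes payload shape.
interface NodeUsage {
  node: string;
  poolMode: "System" | "User";
  cpuPct: number;
}

interface PeakAvg {
  peakPct: number | null; // null when there is no user-pool node to report
  avgPct: number | null;  // null when the node list is empty
}

// Peak = most-loaded user-pool node; avg = mean over every node in the cluster.
function peakAndAvg(nodes: NodeUsage[]): PeakAvg {
  if (nodes.length === 0) return { peakPct: null, avgPct: null };
  const total = nodes.reduce((sum, n) => sum + n.cpuPct, 0);
  const userPcts = nodes
    .filter((n) => n.poolMode === "User")
    .map((n) => n.cpuPct);
  return {
    peakPct: userPcts.length > 0 ? Math.max(...userPcts) : null,
    avgPct: total / nodes.length,
  };
}

// Illustrative numbers only: four idle system nodes dilute one busy user node.
const sampleNodes: NodeUsage[] = [
  { node: "sys-0", poolMode: "System", cpuPct: 1 },
  { node: "sys-1", poolMode: "System", cpuPct: 1 },
  { node: "sys-2", poolMode: "System", cpuPct: 1 },
  { node: "sys-3", poolMode: "System", cpuPct: 1 },
  { node: "user-0", poolMode: "User", cpuPct: 41 },
];
const cpuUsage = peakAndAvg(sampleNodes);
console.log(`CPU peak ${cpuUsage.peakPct}% · avg ${cpuUsage.avgPct}%`);
// prints "CPU peak 41% · avg 9%"
```

Returning `null` instead of `0` for a missing peak lets the caller render a placeholder rather than a misleading zero, which is in keeping with the truthfulness goal of this change.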
Noise / information density fixes
| # | Before | After |
|---|---|---|
| 5 | Eleven identical `[RemovingNode]` rows per scaledown event. | `groupEvents()` collapses events sharing `(reason, involved_kind)` within a 90-second window into one row. Names with a long shared prefix render as `aks-blastp16v3-vmss000000..009 (10)`. Up to 12 grouped rows are visible; a quiet `+N older events not shown` footer surfaces overflow. |
| 6 | `RemovingNode`, `NodeNotSchedulable`, `Drain`, `Cordon`, and scaling activity all rendered with a green check (`Normal`-type). | New `EventKind = "info"` with a muted-blue `Info` icon, gated by an `INFO_NOTABLE_REASONS` set. K8s still calls these `Normal`, but the operator no longer reads them as "all good". |
| 7 | Event lines capped at 90 chars, truncating vmss ordinals. | Cap raised to 140 chars, and grouping moves the ordinal into the leading kind/name chunk so it is never the part that gets cut. |
| 8 | Live Activity rail had no namespace cue. | Namespaces other than `default` and `kube-system` are surfaced as `ns/<name>` so BLAST job churn is distinguishable from kubelet noise. |
| 9 | Active Jobs cell showed `Active jobs · —` (dash inside the eyebrow) when the job store was degraded; Recent Runtime cell repeated the same `job state unavailable` hint. | Active jobs eyebrow drops the dash entirely when degraded; Recent Runtime cell collapses to a single muted `—` so the hint is not duplicated. |
| 10 | "Live Activity" header had no scope/count. | `30 events` label in the rail header (raw event count from the `/api/monitor/aks/events` payload). |
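
The grouping rule in row 5 can be sketched roughly as follows. This is not the shipped `groupEvents()`: the `RawEvent` field names, the helper name `groupEventsSketch`, and the choice to measure the 90-second window from the first event of each group are all assumptions. Name collapsing and the 12-row cap are left out.

```typescript
// Hypothetical event shape; field names are assumptions, not the real payload.
interface RawEvent {
  reason: string;
  involvedKind: string;
  involvedName: string;
  timestampMs: number;
}

interface GroupedEvent {
  reason: string;
  involvedKind: string;
  names: string[];
  firstMs: number;
}

const GROUP_WINDOW_MS = 90 * 1000; // the 90-second window from row 5

// Collapse events sharing (reason, involvedKind) while they fall within
// windowMs of the first event of the currently open group for that key.
function groupEventsSketch(
  events: RawEvent[],
  windowMs: number = GROUP_WINDOW_MS,
): GroupedEvent[] {
  const sorted = [...events].sort((a, b) => a.timestampMs - b.timestampMs);
  const groups: GroupedEvent[] = [];
  const openGroups = new Map<string, GroupedEvent>();
  for (const ev of sorted) {
    const key = `${ev.reason}|${ev.involvedKind}`;
    const current = openGroups.get(key);
    if (current !== undefined && ev.timestampMs - current.firstMs <= windowMs) {
      current.names.push(ev.involvedName);
    } else {
      const fresh: GroupedEvent = {
        reason: ev.reason,
        involvedKind: ev.involvedKind,
        names: [ev.involvedName],
        firstMs: ev.timestampMs,
      };
      groups.push(fresh);
      openGroups.set(key, fresh);
    }
  }
  return groups;
}

// Illustrative input: a ten-node scaledown burst, one unrelated event,
// and a later RemovingNode that falls outside the window.
const scaledownBurst: RawEvent[] = Array.from({ length: 10 }, (_, i) => ({
  reason: "RemovingNode",
  involvedKind: "Node",
  involvedName: `example-node-${i}`,
  timestampMs: 1000 + i * 100,
}));
const groupedRows = groupEventsSketch([
  ...scaledownBurst,
  { reason: "Scheduled", involvedKind: "Pod", involvedName: "example-pod", timestampMs: 2000 },
  { reason: "RemovingNode", involvedKind: "Node", involvedName: "example-node-10", timestampMs: 201000 },
]);
console.log(groupedRows.map((g) => `${g.reason} x${g.names.length}`));
// prints [ 'RemovingNode x10', 'Scheduled x1', 'RemovingNode x1' ]
```

Anchoring the window to the first event keeps a slow trickle of events from extending one group indefinitely; a sliding window anchored to the most recent event would be an equally plausible reading of "within a 90-second window".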
Polish
| # | Before | After |
|---|---|---|
| 11 | `Open` button (no tooltip). | `Show details` button with `title="Show pool, node, and per-database detail"`. |
| 12 | `0 / 1h · 0 / 24h` (slash reads as a fraction). | `1h: N · 24h: M` (colon makes the relationship explicit). |
| 13 | Hero `0` rendered as a hostile bare zero when the cluster was idle. | New `EmptySubmitState` row: friendly empty card with a `Run a search` CTA wired to `onOpenDetail`. |
| 14 | API p95 KPI had no SLA reference. | KPI hint reads `ms · SLA 2000` and a `PressureBar` underneath fills against the SLA so the operator can see the headroom at a glance. |
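
How a bar "fills against the SLA" (row 14) comes down to a clamped ratio. The sketch below assumes a 2000 ms SLA and a hypothetical helper name `pressureFill`; the actual `PressureBar` props are not shown in this note.

```typescript
const SLA_MS = 2000; // assumed to match the "SLA 2000" hint in row 14

// Fraction of the bar to fill, clamped to [0, 1]. Hypothetical helper.
function pressureFill(valueMs: number, slaMs: number = SLA_MS): number {
  if (!Number.isFinite(valueMs) || slaMs <= 0) return 0;
  return Math.min(1, Math.max(0, valueMs / slaMs));
}

const nearSlaFill = pressureFill(1790); // 1790 / 2000 = 0.895
const overSlaFill = pressureFill(2352); // clamped to 1
```

A p95 of 1790 ms yields a fill of 0.895, which is consistent with the ~89% fill reported under Validation.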
New tests
`web/src/components/cards/ClusterBento/eventMapping.test.ts`: 13
vitest cases locking down classification (`info` vs `ok`, `warn` vs
`err`) and grouping behaviour (vmss collapse, namespace prefix,
distinct reasons stay separate, malformed timestamps don't crash, the
single-event branch keeps the original message).
API / IaC diff
None. This is a pure SPA presentation change: no backend route, schema,
Bicep, or Celery task was touched. The card consumes the same
`/api/monitor/aks/events`, `/api/monitor/metrics`,
`/api/monitor/k8s/top-nodes`, and `/api/blast/jobs` endpoints as
before.
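
Because nothing changes on the backend, the per-minute submit timeline has to be derived client-side from job data the card already holds. A minimal sketch of that bucketing, assuming ISO-8601 `created_at` strings; the `JobLike` type and the `submitsPerMinute` helper are hypothetical names, not the card's real code:

```typescript
// Hypothetical job shape: only created_at is taken from this note.
interface JobLike {
  created_at: string; // assumed ISO-8601 timestamp
}

// Count submits per minute over the trailing window; index 0 is the oldest minute.
function submitsPerMinute(
  jobs: JobLike[],
  nowMs: number,
  windowMinutes: number = 60,
): number[] {
  const buckets: number[] = new Array<number>(windowMinutes).fill(0);
  const windowStart = nowMs - windowMinutes * 60000;
  for (const job of jobs) {
    const t = Date.parse(job.created_at);
    // Skip malformed timestamps and anything outside the window.
    if (Number.isNaN(t) || t < windowStart || t > nowMs) continue;
    const idx = Math.min(windowMinutes - 1, Math.floor((t - windowStart) / 60000));
    buckets[idx] = (buckets[idx] ?? 0) + 1;
  }
  return buckets;
}

// Illustrative data only.
const timelineNow = Date.parse("2026-05-16T12:00:00Z");
const timeline = submitsPerMinute(
  [
    { created_at: "2026-05-16T11:59:30Z" },
    { created_at: "2026-05-16T11:59:10Z" },
    { created_at: "2026-05-16T11:30:00Z" },
    { created_at: "2026-05-16T10:00:00Z" }, // older than 60 minutes, ignored
    { created_at: "not-a-date" },           // malformed, ignored
  ],
  timelineNow,
);
const timelinePeak = Math.max(...timeline);
console.log(`peak: ${timelinePeak}/min`); // prints "peak: 2/min"
```

Bucketing by job creation time means dashboard polls and other `/api/blast/*` traffic cannot inflate the sparkline, which was the mislabelling that fix 2 addresses.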
Files touched
- `web/src/components/cards/ClusterBento/ClusterBento.tsx`: health classifier (+ `p95` branch), hero submit timeline + empty state, peak CPU/Mem from user-pool nodes, topology live nodes + pool mode split, Active Jobs / Recent Runtime degraded dedupe, `Show details` label, `1h: · 24h:` formatting, `API RPM` KPI, sparkline peak label.
- `web/src/components/cards/ClusterBento/atoms.tsx`: `EventKind` adds `"info"` (muted-blue `Info` icon).
- `web/src/components/cards/ClusterBento/eventMapping.ts`: rewritten with `groupEvents()`, `INFO_NOTABLE_REASONS`, namespace prefix, message-cap aware grouping, and a back-compat `toEventLineView()`.
- `web/src/components/cards/ClusterBento/eventMapping.test.ts`: NEW, 13 vitest cases.
Validation
- `cd web && npx vitest run` → 41 passed (incl. 13 new `eventMapping.test.ts`).
- `cd web && npm run build` → tsc + Vite bundle clean (`✓ built in 5.47s`, no warnings beyond the pre-existing `chunkSizeWarningLimit`).
- `uv run pytest -q api/tests` → 411 passed (no regression, as expected with no backend change).
- Live browser check at http://127.0.0.1:18080/:
  - Header pill is `Healthy` while `p95 = 22 ms`; once p95 drifted to `1790 ms` the pill stayed Healthy (still `<2000`) but the `PressureBar` filled to ~89%, so the operator now has a visual "approaching SLA" cue. A subsequent run with `p95 = 2352 ms` (the case that originally triggered the review) would render `Degraded` per the new branch in `web/src/components/cards/ClusterBento/ClusterBento.tsx`.
  - `Live activity` rail shows blue info icons for `RemovingNode`, `NodeNotSchedulable`, `Drain`, `Cordon` (no more green checks masquerading as health signals), and the `30 events` count reflects the raw payload size.
  - `CPU peak 13% · avg 2%` and `Mem peak 18% · avg 2%`: peak surfaces user-pool pressure that the prior cluster average hid at `1%`.
  - `Topology` shows `NODES 4 ready · user 3` and `POOLS system 1 · user 1` instead of the old `NODES 3 · POOLS 2`.
  - `Active jobs` eyebrow has no trailing `· —`; `Recent runtime · 24h` renders a single muted `—` instead of a duplicate degraded hint.
  - Hero CTA reads `Show details`; subtotals read `1h: 0 · 24h: 0`.
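
The p95 behaviour observed in the live check (Healthy at 22 ms and 1790 ms, Degraded at 2352 ms) can be sketched as a single extra branch in the classifier. Everything here other than `P95_DEGRADED_MS` and the input names listed under Motivation is an assumption: the `HealthInputs` shape, the `Stopped` state, and the CPU/memory thresholds are placeholders, not the values used in `ClusterBento.tsx`.

```typescript
type HealthLabel = "Healthy" | "Degraded" | "Stopped";

// Hypothetical input shape, modelled on the fields named under Motivation.
interface HealthInputs {
  powerState: string;
  cpuPct: number;
  memPct: number;
  apiErrors: number;
  failed15m: number;
  p95Ms: number | null; // null when no latency sample is available
}

const P95_DEGRADED_MS = 2000;       // default from fix 1; configurable
const RESOURCE_DEGRADED_PCT = 90;   // placeholder threshold, NOT from the source

function classifyHealth(
  m: HealthInputs,
  p95DegradedMs: number = P95_DEGRADED_MS,
): HealthLabel {
  if (m.powerState !== "Running") return "Stopped";
  if (m.apiErrors > 0 || m.failed15m > 0) return "Degraded";
  if (m.cpuPct >= RESOURCE_DEGRADED_PCT || m.memPct >= RESOURCE_DEGRADED_PCT) {
    return "Degraded";
  }
  // New branch: strictly greater than the threshold, matching "p95 > 2000 ms".
  if (m.p95Ms !== null && m.p95Ms > p95DegradedMs) return "Degraded";
  return "Healthy";
}

// Quiet cluster baseline used to vary only the p95 input.
const quietCluster: HealthInputs = {
  powerState: "Running",
  cpuPct: 2,
  memPct: 2,
  apiErrors: 0,
  failed15m: 0,
  p95Ms: 22,
};

console.log(classifyHealth(quietCluster));                     // prints "Healthy"
console.log(classifyHealth({ ...quietCluster, p95Ms: 1790 })); // prints "Healthy"
console.log(classifyHealth({ ...quietCluster, p95Ms: 2352 })); // prints "Degraded"
```

Treating a missing p95 sample as "no opinion" rather than as degraded is a design choice of this sketch; the real classifier may handle that case differently.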
Known follow-ups (intentionally out of scope here)
- #19 Trash-icon label: lives in `ClusterItem.tsx`, not the bento itself. Belongs in a separate header/affordance pass.
- #20 Refresh indicator tooltip: also outside the bento subtree (`Dashboard` page footer/timer atom).
- #16 K8s patch version: backend currently surfaces `kubernetes_version` (e.g. `1.34`) but not `current_kubernetes_version` (e.g. `1.34.5`). A separate change to `api/services/monitoring.py` plus the `AksClusterSummary` shape is needed before the bento can render the patch.
- DATABASES section header: sits below the bento; the visual orphan is a `ClusterItem.tsx` concern.