Production Hardening Wave 2¶
Date: 2026-05-13
Motivation¶
A broad critique pass found recurring reliability and recovery risks across the browser control plane: transient Azure/API failures can interrupt user workflows, render errors can strand users on a blank route, and hardening work needs a traceable catalog rather than one-off fixes.
User-facing Change¶
- Authenticated API requests now include a client request ID for traceability.
- Transient API failures now retry with bounded exponential backoff,
Retry-Aftersupport, and per-attempt timeouts. - User-cancelled requests are respected before any network attempt starts.
- The global error screen now offers separate recovery actions: copy details, retry render, reload, or return to the dashboard.
API / IaC Diff Summary¶
web/src/api/resilience.tsadds shared request timeout, retry, retry-delay, and request-id helpers.web/src/api/client.tsroutes all authenticated API calls through the shared resilience helper.web/src/api/resilience.test.tscovers retryability,Retry-Afterparsing, backoff, transient recovery, and cancellation.web/src/components/ErrorBoundary.tsximproves user recovery after render failures.web/eslint.config.jsrestores ESLint 9 flat-config validation for TypeScript and React Hooks rules.- Hook dependency fixes stabilize shortcut, ACR, storage, job-list, and submit-page derived state.
- No backend API contract or IaC changes.
Critique Catalog¶
Functionality¶
- Add typed response validation for monitoring endpoints.
- Add typed response validation for BLAST job endpoints.
- Add typed response validation for terminal endpoints.
- Add typed response validation for storage endpoints.
- Add typed response validation for ACR endpoints.
- Show partial data when one dashboard resource fails.
- Add job retry from failed terminal states.
- Add failed-job clone-to-new-search action.
- Add downloadable orchestrator history.
- Add downloadable step logs.
- Add completed-job export integrity checks.
- Add result-file checksum display when available.
- Add result preview fallback for compressed outputs.
- Add explicit no-results-after-completion diagnosis.
- Add duplicate submission detection.
- Add dry-run validation before cloud submission.
- Add BLAST config preview diff before submit.
- Add database/query molecule compatibility preflight.
- Add storage container existence preflight.
- Add ACR image tag mismatch preflight.
- Add AKS cluster readiness preflight.
- Add quota warning before node provisioning.
- Add region/SKU availability validation.
- Add public network access propagation countdown.
- Add safe resume for interrupted job status polling.
- Add terminal az-login stale warning.
- Add terminal cloud-init completion proof.
- Add terminal tool-version verification display.
- Add VM password copy-once acknowledgement.
- Add safer resource-group selection confirmation.
UI / UX¶
- Add per-card stale-data labels.
- Add per-card retry buttons for failed loads.
- Add per-card collapsible details for errors.
- Add global disconnected/API-unavailable banner.
- Add skeleton states for all async panels.
- Keep page headers stable during loading.
- Keep action buttons stable during polling.
- Disable destructive actions while mutation is pending.
- Add confirmation reason text for destructive actions.
- Add keyboard focus management for dialogs.
- Add accessible labels to icon-only buttons.
- Add tooltip text for unfamiliar actions.
- Add copy feedback to every clipboard action.
- Add progress labels for long-running mutations.
- Add last successful refresh timestamp.
- Add failed refresh timestamp.
- Add query polling pause while tab is hidden.
- Add reduced-motion fallback for spinners/rings.
- Add mobile layout checks for dense result tables.
- Add responsive wrapping for long resource IDs.
- Add line wrapping for long Azure errors.
- Add visual distinction between warning and fatal states.
- Add empty-state variants for setup, loading, failed, and terminal.
- Add consistent status chip labels across pages.
- Add dashboard first-run workspace picker recovery.
- Add direct dashboard recovery from route render failures.
- Add user-readable auth setup missing state.
- Add explicit session-expired state.
- Add action-specific loading labels.
- Add result export loading state per format.
Reliability¶
- Add API request retry for 408.
- Add API request retry for 429.
- Add API request retry for 500.
- Add API request retry for 502.
- Add API request retry for 503.
- Add API request retry for 504.
- Respect
Retry-Afterseconds. - Respect
Retry-AfterHTTP dates. - Bound retry delays.
- Add jitter to retry delays.
- Add per-attempt request timeout.
- Respect caller cancellation before fetch.
- Avoid retrying auth failures.
- Avoid retrying RBAC failures.
- Avoid retrying missing resources.
- Add client request IDs to API calls.
- Preserve existing request headers.
- Preserve caller abort signals.
- Avoid timeout timer leaks during backoff.
- Add tests for retry status selection.
- Add tests for retry-after parsing.
- Add tests for transient response recovery.
- Add tests for cancellation behavior.
- Add tests for deterministic backoff.
- Add shared API error formatting for transient failures.
- Add background polling backoff after repeated failures.
- Add circuit breaker for repeated API outages.
- Add idempotency keys for mutation requests.
- Add retry-safe export downloads.
- Add retry-safe monitoring calls.
Recovery¶
- Add global render error boundary.
- Add copyable render error details.
- Add render retry without full reload.
- Add full reload recovery action.
- Add dashboard escape hatch from error screen.
- Add stable failed-job terminal UI.
- Hide Cancel for terminal failed jobs.
- Expand inferred failed execution step.
- Mark post-failure steps skipped.
- Show failed-step-specific results empty state.
- Suppress stale terminal-state toasts on first load.
- Recover from protected export link failures with authenticated downloads.
- Add export failure toast with API message.
- Add request timeout recovery message.
- Add network failure recovery message.
Security / Operations¶
- Validate all HTTP triggers have bearer validation.
- Audit raw
fetchusage in components. - Avoid leaking subscription IDs in user-visible logs where not needed.
- Sanitize SAS tokens from logs and previews.
- Sanitize bearer tokens from logs and previews.
- Sanitize Key Vault secret URIs when copied into logs.
- Keep SSH NSG limited to caller IP.
- Delete terminal secrets during teardown.
- Keep storage public network access temporary.
- Add public network access watchdog cleanup.
- Add audit events for destructive actions.
- Add correlation between frontend request ID and backend logs.
- Add CI lint configuration for ESLint 9.
- Add SWA smoke test after deploy.
- Add Function App health smoke test after deploy.
Implemented In This Wave¶
- 61-83: shared API retry, timeout, cancellation, request-id, and tests.
- 91-95: improved global render recovery actions.
- 104-105: user-facing transient/network error handling is now backed by retry before surfacing failures.
- 118: ESLint 9 validation is restored with TypeScript and React Hooks checks.
Validation¶
npm run lint: passed.npm run test: passed, 6 Vitest tests.npm run build: passed.pytest -q api/tests: passed, 13 tests.npm audit --audit-level=high: no high/critical vulnerabilities reported; npm still reports existing moderate Vite/esbuild development-server advisories that require a breakingnpm audit fix --forceupgrade.azd deploy web --no-prompt: deployed tohttps://kind-coast-0eb698500.7.azurestaticapps.net/.- Browser smoke check on
/blast/jobs/job-8e7f852e3406: page loads, keepsJob Failed at Warmup, keeps Cancel hidden, expands the Warmup failure log, and shows the Warmup-specific no-results state.