Skip to content

Production Hardening Wave 2

Date: 2026-05-13

Motivation

A broad critique pass found recurring reliability and recovery risks across the browser control plane: transient Azure/API failures can interrupt user workflows, render errors can strand users on a blank route, and hardening work needs a traceable catalog rather than one-off fixes.

User-facing Change

  • Authenticated API requests now include a client request ID for traceability.
  • Transient API failures now retry with bounded exponential backoff, Retry-After support, and per-attempt timeouts.
  • User-cancelled requests are respected before any network attempt starts.
  • The global error screen now offers separate recovery actions: copy details, retry render, reload, or return to the dashboard.

API / IaC Diff Summary

  • web/src/api/resilience.ts adds shared request timeout, retry, retry-delay, and request-id helpers.
  • web/src/api/client.ts routes all authenticated API calls through the shared resilience helper.
  • web/src/api/resilience.test.ts covers retryability, Retry-After parsing, backoff, transient recovery, and cancellation.
  • web/src/components/ErrorBoundary.tsx improves user recovery after render failures.
  • web/eslint.config.js restores ESLint 9 flat-config validation for TypeScript and React Hooks rules.
  • Hook dependency fixes stabilize shortcut, ACR, storage, job-list, and submit-page derived state.
  • No backend API contract or IaC changes.

Critique Catalog

Functionality

  1. Add typed response validation for monitoring endpoints.
  2. Add typed response validation for BLAST job endpoints.
  3. Add typed response validation for terminal endpoints.
  4. Add typed response validation for storage endpoints.
  5. Add typed response validation for ACR endpoints.
  6. Show partial data when one dashboard resource fails.
  7. Add job retry from failed terminal states.
  8. Add failed-job clone-to-new-search action.
  9. Add downloadable orchestrator history.
  10. Add downloadable step logs.
  11. Add completed-job export integrity checks.
  12. Add result-file checksum display when available.
  13. Add result preview fallback for compressed outputs.
  14. Add explicit no-results-after-completion diagnosis.
  15. Add duplicate submission detection.
  16. Add dry-run validation before cloud submission.
  17. Add BLAST config preview diff before submit.
  18. Add database/query molecule compatibility preflight.
  19. Add storage container existence preflight.
  20. Add ACR image tag mismatch preflight.
  21. Add AKS cluster readiness preflight.
  22. Add quota warning before node provisioning.
  23. Add region/SKU availability validation.
  24. Add public network access propagation countdown.
  25. Add safe resume for interrupted job status polling.
  26. Add terminal az-login stale warning.
  27. Add terminal cloud-init completion proof.
  28. Add terminal tool-version verification display.
  29. Add VM password copy-once acknowledgement.
  30. Add safer resource-group selection confirmation.

UI / UX

  1. Add per-card stale-data labels.
  2. Add per-card retry buttons for failed loads.
  3. Add per-card collapsible details for errors.
  4. Add global disconnected/API-unavailable banner.
  5. Add skeleton states for all async panels.
  6. Keep page headers stable during loading.
  7. Keep action buttons stable during polling.
  8. Disable destructive actions while mutation is pending.
  9. Add confirmation reason text for destructive actions.
  10. Add keyboard focus management for dialogs.
  11. Add accessible labels to icon-only buttons.
  12. Add tooltip text for unfamiliar actions.
  13. Add copy feedback to every clipboard action.
  14. Add progress labels for long-running mutations.
  15. Add last successful refresh timestamp.
  16. Add failed refresh timestamp.
  17. Add query polling pause while tab is hidden.
  18. Add reduced-motion fallback for spinners/rings.
  19. Add mobile layout checks for dense result tables.
  20. Add responsive wrapping for long resource IDs.
  21. Add line wrapping for long Azure errors.
  22. Add visual distinction between warning and fatal states.
  23. Add empty-state variants for setup, loading, failed, and terminal.
  24. Add consistent status chip labels across pages.
  25. Add dashboard first-run workspace picker recovery.
  26. Add direct dashboard recovery from route render failures.
  27. Add user-readable auth setup missing state.
  28. Add explicit session-expired state.
  29. Add action-specific loading labels.
  30. Add result export loading state per format.

Reliability

  1. Add API request retry for 408.
  2. Add API request retry for 429.
  3. Add API request retry for 500.
  4. Add API request retry for 502.
  5. Add API request retry for 503.
  6. Add API request retry for 504.
  7. Respect Retry-After seconds.
  8. Respect Retry-After HTTP dates.
  9. Bound retry delays.
  10. Add jitter to retry delays.
  11. Add per-attempt request timeout.
  12. Respect caller cancellation before fetch.
  13. Avoid retrying auth failures.
  14. Avoid retrying RBAC failures.
  15. Avoid retrying missing resources.
  16. Add client request IDs to API calls.
  17. Preserve existing request headers.
  18. Preserve caller abort signals.
  19. Avoid timeout timer leaks during backoff.
  20. Add tests for retry status selection.
  21. Add tests for retry-after parsing.
  22. Add tests for transient response recovery.
  23. Add tests for cancellation behavior.
  24. Add tests for deterministic backoff.
  25. Add shared API error formatting for transient failures.
  26. Add background polling backoff after repeated failures.
  27. Add circuit breaker for repeated API outages.
  28. Add idempotency keys for mutation requests.
  29. Add retry-safe export downloads.
  30. Add retry-safe monitoring calls.

Recovery

  1. Add global render error boundary.
  2. Add copyable render error details.
  3. Add render retry without full reload.
  4. Add full reload recovery action.
  5. Add dashboard escape hatch from error screen.
  6. Add stable failed-job terminal UI.
  7. Hide Cancel for terminal failed jobs.
  8. Expand inferred failed execution step.
  9. Mark post-failure steps skipped.
  10. Show failed-step-specific results empty state.
  11. Suppress stale terminal-state toasts on first load.
  12. Recover from protected export link failures with authenticated downloads.
  13. Add export failure toast with API message.
  14. Add request timeout recovery message.
  15. Add network failure recovery message.

Security / Operations

  1. Validate all HTTP triggers have bearer validation.
  2. Audit raw fetch usage in components.
  3. Avoid leaking subscription IDs in user-visible logs where not needed.
  4. Sanitize SAS tokens from logs and previews.
  5. Sanitize bearer tokens from logs and previews.
  6. Sanitize Key Vault secret URIs when copied into logs.
  7. Keep SSH NSG limited to caller IP.
  8. Delete terminal secrets during teardown.
  9. Keep storage public network access temporary.
  10. Add public network access watchdog cleanup.
  11. Add audit events for destructive actions.
  12. Add correlation between frontend request ID and backend logs.
  13. Add CI lint configuration for ESLint 9.
  14. Add SWA smoke test after deploy.
  15. Add Function App health smoke test after deploy.

Implemented In This Wave

  • 61-83: shared API retry, timeout, cancellation, request-id, and tests.
  • 91-95: improved global render recovery actions.
  • 104-105: user-facing transient/network error handling is now backed by retry before surfacing failures.
  • 118: ESLint 9 validation is restored with TypeScript and React Hooks checks.

Validation

  • npm run lint: passed.
  • npm run test: passed, 6 Vitest tests.
  • npm run build: passed.
  • pytest -q api/tests: passed, 13 tests.
  • npm audit --audit-level=high: no high/critical vulnerabilities reported; npm still reports existing moderate Vite/esbuild development-server advisories that require a breaking npm audit fix --force upgrade.
  • azd deploy web --no-prompt: deployed to https://kind-coast-0eb698500.7.azurestaticapps.net/.
  • Browser smoke check on /blast/jobs/job-8e7f852e3406: page loads, keeps Job Failed at Warmup, keeps Cancel hidden, expands the Warmup failure log, and shows the Warmup-specific no-results state.