WAF six-stage hardening pass¶
Motivation¶
The dashboard already had two performance-focused hardening batches in flight. This pass reviewed the same working tree against a six-stage Well-Architected Framework lens: reliability, security, cost optimization, operational excellence, performance efficiency, and sustainability/efficiency. The goal was to improve resilience and operator safety without changing user-facing features, API schemas, or deployment topology.
User-facing change¶
No feature behaviour changes. Misconfigured environments now fail earlier with clearer errors, high-volume buffers have stronger caps, and operational logs are more useful when shutdown or cache paths misbehave.
Six-stage critique and hardening summary¶
- Reliability
- Capped Redis wait attempts so extreme
REDIS_WAIT_TIMEOUTvalues cannot produce unbounded startup log spam. - Added tunable Storage health-probe cache TTLs and explicit cache-reset logging.
- Added Celery timeout ordering validation so hard timeouts cannot fire before soft timeouts.
-
Added a lower-bound guard for NCBI preview HTTP timeout.
-
Security
- Added a 100 MiB startup guard for
MAX_REQUEST_BODY_BYTES. - Added a lower-bound guard for
OPENAPI_RATE_LIMIT_MAX_KEYS. - Added an upper-bound guard for terminal
EXEC_TOKENsize. -
Bounded the JWKS cache tenant count and logged single-flight wait expiry.
-
Cost Optimization
- Added a Celery result-backend TTL upper bound to protect Redis memory.
- Documented Storage artifact snapshot byte caps as cost controls.
-
Preserved the previously added Storage/result listing caps and Docker build context exclusions.
-
Operational Excellence
- Added stack traces to app shutdown debug/warning paths.
- Improved
postprovision.shrequired-env failure guidance. -
Kept request-detail capture opt-in and documented correlation behaviour in the middleware header.
-
Performance Efficiency
- Made BLAST log SSE queue size tunable.
- Made Storage probe TTLs tunable for probe amplification control.
-
Bounded UI animation event retention with a hard cap.
-
Sustainability / Efficiency
- Expanded terminal
.dockerignoreto shrink build contexts. - Kept terminal toolchain versions pinned and included in the base-image content hash so rebuilds happen only when inputs change.
Concrete improvement inventory (20)¶
REDIS_WAIT_MAX_ATTEMPTSbounds Redis startup wait attempts.- Redis startup fatal logs now include the number of attempts.
- Storage probe OK TTL is configurable by environment.
- Storage probe degraded TTL is configurable by environment.
- Storage probe cache resets emit a log line.
- Celery soft/hard task timeout ordering is validated at startup.
- NCBI preview HTTP timeout has a safe lower bound.
- Request body hard cap has a 100 MiB startup guard.
- OpenAPI rate-limit key capacity has a safe lower bound.
- Terminal exec shared token has a maximum length guard.
- JWKS tenant cache has a maximum size.
- JWKS single-flight wait expiry is logged.
- Celery result-backend TTL has a 2-hour upper bound.
- UI event retention has a hard cap even if env is oversized.
- Job artifact byte caps are documented as Storage cost controls.
- App shutdown paths include stack traces in debug/warning logs.
- Request ID correlation contract is documented in middleware.
postprovision.shexplains how to recover missing azd env values.- BLAST log SSE queue size is environment-tunable with a safe floor.
- Terminal build context excludes more generated/cache artifacts.
API / IaC diff¶
No API schema changes and no Bicep changes in this WAF pass. Changes are limited to runtime guardrails, environment validation, logging, Docker ignore metadata, and documentation.
Validation evidence¶
uv run pytest -q api/tests/test_smoke.py api/tests/test_request_metrics_detail.py api/tests/test_openapi_rate_limit.py api/tests/test_storage_data.py api/tests/test_me_route.py— 129 passed.uv run pytest -q api/tests— 1374 passed.- Focused
uv run ruff checkon touched Python files — clean. cd web && npm run build— passed.az bicep build --file infra/main.bicep --stdout >/tmp/elb-main-bicep-build.json— passed with only the Azure CLI Bicep version-upgrade warning.bash -n scripts/dev/postprovision.sh scripts/dev/quick-deploy.sh scripts/dev/terminal-base-image.sh scripts/dev/acr-build-access.sh— passed.git diff --check— clean.