Container Apps Architecture Reference¶
This document is the authoritative reference for the shipped ElasticBLAST
control-plane architecture on Azure Container Apps: the bundled
ca-elb-dashboard Container App with six sidecars, the cost model, the storage
network isolation rules, the browser ↔ storage proxy contract, and the
identity / RBAC layout. It replaced the original Azure Functions backend; the
legacy tree has been deleted from the repository.
TL;DR
The deployed control plane is one Azure Container App
(ca-elb-dashboard) pinned to minReplicas: 1, maxReplicas: 1 with
six sidecars: frontend (nginx + React/Vite SPA), api (FastAPI on
uvicorn at :8080), worker and beat (Celery), redis (in-revision
broker), and terminal (loopback ttyd with the elastic-blast toolchain).
The retired Azure Functions tree was deleted from the repository on
2026-05-19; new work goes in api/.
Decision Summary¶
Use this target shape:
- Bundle the React SPA into the same Container App as a sixth sidecar named frontend (nginx serving the built dist/). The Static Web App resource (Microsoft.Web/staticSites) goes away.
- Replace the Function App backend with one Azure Container App that bundles six sidecar containers in the same revision: frontend (nginx), api (FastAPI), worker (Celery), beat (Celery beat), redis (Redis 7 alpine), and terminal (interactive shell with elastic-blast toolchain). All six share the same network namespace, so the worker reaches the broker at 127.0.0.1:6379, the api proxies the browser terminal to 127.0.0.1:7681, and the api reverse-proxies non-/api/* requests to the frontend at 127.0.0.1:8081.
- Use Celery beat for scheduled work (BLAST schedules, DB refresh checks, periodic monitoring). No Container Apps Jobs and no Service Bus scheduled messages.
- No managed database. All durable state (job registry, audit log, schedule records, command history) is persisted to Azure Storage (blob and table) using managed-identity auth.
- No separate Redis VM. Redis runs as an in-revision sidecar. It is ephemeral (no AOF, no Azure Files mount); the broker queue is rebuilt from the jobstate table by the beat reconciler on revision restart.
- No separate Remote Terminal VM. The browser-accessible operator shell is a sidecar in the same Container App. The api sidecar terminates the WebSocket from the browser (after MSAL + role check) and proxies it to a loopback ttyd instance inside the terminal sidecar. /home/azureuser is ephemeral; user files stage to workload Storage via azcopy rather than to a local mount.
- Move platform resources behind VNet integration and private endpoints.
- Hard requirement, day 1: every Storage account in scope has publicNetworkAccess: Disabled. The Container App is the only client that can reach platform Storage, and it does so exclusively over private endpoints from inside the platform VNet. No bypass: AzureServices workaround, no temporary public-window toggle for control-plane traffic.
- Use one user-assigned managed identity for the Container App. The six sidecars share it. The only other identity is id-elb-openapi for the AKS workload.
Cost-minimisation choice¶
The control plane workload is low traffic and operator-driven. Splitting it
into separate Container Apps + a Redis VM + a Remote Terminal VM + a Static
Web App is over-provisioned. Bundling all processes into one Container App
with minReplicas: 1, maxReplicas: 1 makes the whole stack one billable
unit at a small fixed size (2.25 vCPU / 4.5 GiB total split across the six
sidecars; the terminal image carries the elastic-blast toolchain so it shares
the largest allocation with api and worker, while frontend nginx needs almost
nothing).
Trade-offs:
- The whole app restarts when any one container image changes. Acceptable because the API surface is small and the deploy pipeline is single-tenant.
- API and worker cannot scale horizontally because beat must be a singleton and Redis state must stay co-located. Acceptable for current and projected traffic; if scale-out is ever needed, split beat (and Redis) into a separate app first.
- In-flight Celery tasks are lost on revision restart. Mitigated by per-task Storage state rows + the periodic reconciler (run by beat), which re-dispatches tasks that were observed as running but whose worker disappeared. (Earlier revisions backed the Redis queue with an Azure Files AOF mount; that was dropped because SMB mounts require a Storage account key, which conflicts with allowSharedKeyAccess: false. See infra/modules/storageState.bicep for the rationale.)
Do not move the control plane into AKS as the first target. AKS is the workload plane for ElasticBLAST. Hosting the control plane outside AKS keeps recovery, upgrades, and cluster troubleshooting independent from the cluster being managed.
Explicitly out of scope (do not re-introduce)¶
| Removed | Reason | Replacement |
|---|---|---|
| Azure Service Bus | Adds a managed dependency we no longer need once the worker model is Celery-based. | Celery + in-revision Redis sidecar. |
| Cosmos DB / Azure Database for PostgreSQL | A managed database is over-scoped for the document/append workloads this control plane has. Adds cost and operational surface. | Azure Storage (blob for documents, table for indexed queries). |
| Azure Cache for Redis (managed) | Cost. Broker is internal-only and does not need geo-replication, AAD, or managed patching. | Redis 7 alpine sidecar inside the Container App. |
| Self-hosted Redis VM (vm-elb-redis) | Adds a VM, NIC, NSG, subnet, MI, and nightly backup job. | Redis sidecar in the same Container App revision; ephemeral, queue rebuilt from the jobstate table by the beat reconciler on restart. |
| Container Apps Jobs for scheduled work | Running two scheduling systems (jobs + beat) is redundant. | Celery beat sidecar. |
| Separate ca-control-api, ca-control-worker, ca-control-beat apps | Three Container Apps means three billable revisions and three managed identities. | Single ca-elb-dashboard Container App with six sidecars. |
Resources to Create¶
Authoritative list for infra/ planning. Use this table when sizing cost
estimates or writing new Bicep modules.
| Resource | Type | Purpose | New / Existing |
|---|---|---|---|
| Container Apps Environment | Microsoft.App/managedEnvironments | VNet-integrated runtime for the Container App | New |
| ca-elb-dashboard | Microsoft.App/containerApps | Single Container App with six sidecar containers: frontend, api, worker, beat, redis, terminal. Pinned to minReplicas: 1, maxReplicas: 1. Public ingress targets the api sidecar on :8080. | New |
| Platform Storage account | Microsoft.Storage/storageAccounts | Job state (table), audit (append blob), schedules (blob), command history (blob) | Re-purposed existing |
| Workload Storage account | Microsoft.Storage/storageAccounts | ElasticBLAST blast-db, queries, results | Existing |
| Container Registry | Microsoft.ContainerRegistry/registries | App + ElasticBLAST images (including the new elb-frontend and elb-terminal images) | Existing |
| Key Vault | Microsoft.KeyVault/vaults | Secrets, app configuration references | Existing |
| AKS cluster | Microsoft.ContainerService/managedClusters | ElasticBLAST workload | Existing |
| Platform VNet | Microsoft.Network/virtualNetworks | Subnets: snet-containerapps, snet-private-endpoints, snet-aks | New |
| Private endpoints | Microsoft.Network/privateEndpoints | Key Vault, Storage (blob + table), ACR | New |
| Private DNS zones | Microsoft.Network/privateDnsZones | privatelink.vaultcore.azure.net, privatelink.blob.core.windows.net, privatelink.table.core.windows.net, privatelink.azurecr.io | New |
| User-assigned managed identities | Microsoft.ManagedIdentity/userAssignedIdentities | id-elb-dashboard-* (shared by all six sidecars), id-elb-openapi (AKS Workload Identity) | New |
| Log Analytics + Application Insights | Microsoft.OperationalInsights/workspaces + Microsoft.Insights/components | Logs, metrics, traces | Existing |
Not created: Azure Service Bus, Azure Cosmos DB, Azure Database for PostgreSQL, Azure Cache for Redis, dedicated Redis VM, dedicated Redis subnet/NSG/MI, Remote Terminal VM, terminal subnet, terminal NSG, terminal admin password secret, terminal MI, Azure Bastion, Azure Static Web Apps.
CPU and Memory Sizing¶
Container Apps allocates CPU and memory per container, per replica, and the sum across containers in one replica must satisfy the platform's constraints.
Container Apps allocation rules (Consumption / Workload-profile Consumption)¶
- Minimum per container: 0.25 vCPU + 0.5 GiB.
- Increments: 0.25 vCPU + 0.5 GiB.
- The replica-total ratio must be 1 vCPU : 2 GiB. (e.g. 0.5 vCPU → 1.0 GiB, 2.25 vCPU → 4.5 GiB.)
- Max per replica on Consumption profile: 4 vCPU / 8 GiB.
- Dedicated workload profiles (D4 / D8 / D16 / E-series) allow up to the profile's node capacity per replica and finer increments (down to 0.1 vCPU / 0.1 GiB).
Reference: Microsoft Docs, "Containers in Azure Container Apps", under "Allocations" (sums across all containers in a replica must respect the ratio).
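The allocation rules listed above can be checked mechanically before a deploy. The following is a minimal sketch, not part of the shipped tooling: the function name `check_replica` and the input shape (container name mapped to decimal strings) are assumptions made for illustration. It uses exact fractions so that 0.25 increments are not subject to floating-point error.

```python
from fractions import Fraction

MIN_CPU = Fraction(1, 4)   # 0.25 vCPU is both the minimum and the increment
GIB_PER_VCPU = 2           # replica-total ratio of 1 vCPU : 2 GiB
MAX_CPU = Fraction(4)      # Consumption-profile per-replica cap (4 vCPU / 8 GiB)


def check_replica(containers: dict[str, tuple[str, str]]) -> list[str]:
    """Return a list of rule violations; an empty list means the allocation is valid.

    `containers` maps a container name to (vcpu, memory_gib) as decimal strings.
    """
    problems = []
    min_mem = MIN_CPU * GIB_PER_VCPU  # 0.5 GiB
    total_cpu = Fraction(0)
    total_mem = Fraction(0)
    for name, (cpu_s, mem_s) in containers.items():
        cpu, mem = Fraction(cpu_s), Fraction(mem_s)
        if cpu < MIN_CPU or mem < min_mem:
            problems.append(f"{name}: below the 0.25 vCPU / 0.5 GiB minimum")
        if cpu % MIN_CPU != 0 or mem % min_mem != 0:
            problems.append(f"{name}: not a multiple of 0.25 vCPU / 0.5 GiB")
        total_cpu += cpu
        total_mem += mem
    if total_mem != total_cpu * GIB_PER_VCPU:
        problems.append("replica total breaks the 1 vCPU : 2 GiB ratio")
    if total_cpu > MAX_CPU:
        problems.append("replica total exceeds 4 vCPU / 8 GiB")
    return problems
```

For example, `check_replica({"api": ("0.5", "1.0"), "redis": ("0.25", "0.5")})` returns an empty list, while giving one container 0.5 vCPU with only 0.5 GiB reports the ratio violation.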
Initial allocation per sidecar¶
Sized for the steady-state operator workload (low concurrency, occasional BLAST submit / DB warmup). Revise after the first week of production telemetry; resize is a revision swap with no downtime.
| Sidecar | vCPU | Memory | Sizing reasoning |
|---|---|---|---|
| frontend (nginx:alpine) | 0.25 | 0.5 GiB | Static files; a few QPS at most. The minimum allocation is already overkill. |
| api (FastAPI) | 0.5 | 1.0 GiB | Handles JSON requests, the WebSocket terminal proxy, and the streaming upload/download proxy (1 MiB chunks, 4 MiB block uploads, semaphore-capped to 4 concurrent transfers). 0.5 vCPU is sized for the proxy bursts; idle steady-state will be much lower. |
| worker (Celery) | 0.5 | 1.0 GiB | Runs Azure SDK pollers, ARM/AKS calls, and az acr build orchestration. CPU spikes during ACR build dispatch and AKS provision but is mostly waiting on long-running Azure operations. |
| beat (Celery beat) | 0.25 | 0.5 GiB | Scheduler thread + Storage poller for schedule definitions. Trivial. |
| redis (redis:7-alpine) | 0.25 | 0.5 GiB | Single-node broker for control-plane traffic. Ephemeral (no AOF); the queue is rebuilt from the jobstate table by the beat reconciler on revision restart. Memory grows with queue depth; 0.5 GiB is enough for hundreds of thousands of pending tasks. |
| terminal (Ubuntu + elastic-blast toolchain) | 0.5 | 1.0 GiB | Bash + tmux + python + occasional kubectl/az/azcopy. Carries the heaviest image, but at runtime it is mostly idle waiting for the operator to type. |
| Replica total | 2.25 | 4.5 GiB | Satisfies the 1 vCPU : 2 GiB ratio. Within Consumption-profile per-replica max (4 / 8). |
If any sidecar regularly hits its CPU limit (visible in App Insights as
Container CPU Usage Percent saturating), bump that sidecar in 0.25 vCPU /
0.5 GiB increments and bump another sidecar down by the same amount, or grow
the replica total (still respecting the 1:2 ratio). The bundled topology has
no horizontal scale-out (minReplicas: 1, maxReplicas: 1); vertical resize
is the only knob.
What if the bundle outgrows 4 vCPU / 8 GiB?¶
Two paths, in preference order:
- Move to a Workload-Profile Dedicated node (D4 → 4 vCPU / 16 GiB, D8 → 8 / 32, etc.). This raises the per-replica cap and lets the bundle keep its single-revision semantics.
- Split a hot sidecar into its own Container App (likely candidates: the api for proxy load, then the worker for ARM throughput). This breaks the single-revision invariant but unlocks maxReplicas > 1.
Do not raise replica count on the bundled app; that would duplicate the
beat singleton and break Redis state locality.
Cost Estimate (Korea Central, USD, monthly)¶
Numbers are based on Azure Retail Prices API for koreacentral, May 2026.
Verify in the Azure Pricing Calculator
before publishing to stakeholders.
Per-second meters used (confirmed)¶
| Meter | Unit price |
|---|---|
| Standard vCPU Active Usage | $0.000024 / vCPU-second |
| Standard vCPU Idle Usage | $0.000003 / vCPU-second |
| Standard Memory (active and idle, same price) | $0.000003 / GiB-second |
| Standard Requests | $0.40 per 1,000,000 requests |
| Dedicated Plan Management (workload-profile environment fee) | $0.10 / hour ≈ $72 / month |
Free monthly grant per subscription: 180,000 vCPU-seconds + 360,000 GiB-seconds + 2,000,000 requests.
Always-on math for the bundled app (2.25 vCPU / 4.5 GiB)¶
Per month at 30 days:
- vCPU-seconds = 2.25 × 86,400 × 30 = 5,832,000
- GiB-seconds = 4.5 × 86,400 × 30 = 11,664,000
After applying the free grant:
- Charged vCPU-seconds = 5,832,000 − 180,000 = 5,652,000
- Charged GiB-seconds = 11,664,000 − 360,000 = 11,304,000
Three duty-cycle scenarios:
| Active fraction | vCPU cost | Memory cost | Total per-second meters |
|---|---|---|---|
| 0% (always idle, hypothetical floor) | 5,652,000 × $0.000003 = $16.96 | 11,304,000 × $0.000003 = $33.91 | ~$50.87 |
| 5% (realistic operator workload) | 0.05 × 5,652,000 × $0.000024 + 0.95 × 5,652,000 × $0.000003 = $6.78 + $16.11 = $22.89 | $33.91 | ~$56.80 |
| 100% (worst case, never happens for this workload) | 5,652,000 × $0.000024 = $135.65 | $33.91 | ~$169.56 |
Requests are negligible (a few thousand /day at most → free tier covers it).
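The duty-cycle arithmetic above can be reproduced with a few lines of Python. This is a sketch for re-running the numbers with a different size or active fraction; the function name `monthly_usage_cost` is hypothetical, the prices and free grant are copied from the tables in this section, and the free grant is subtracted before the active/idle split, as the tables do.

```python
SECONDS_PER_MONTH = 86_400 * 30   # 30-day month, as in the math above

# Per-second meter prices (USD) and the free monthly grant from this section.
VCPU_ACTIVE = 0.000024
VCPU_IDLE = 0.000003
MEMORY = 0.000003
FREE_VCPU_SECONDS = 180_000
FREE_GIB_SECONDS = 360_000


def monthly_usage_cost(vcpu: float, gib: float, active_fraction: float) -> float:
    """Per-second meter cost in USD for one always-on replica over 30 days."""
    if not 0.0 <= active_fraction <= 1.0:
        raise ValueError("active_fraction must be between 0 and 1")
    # Subtract the free grant first, never going below zero.
    vcpu_seconds = max(vcpu * SECONDS_PER_MONTH - FREE_VCPU_SECONDS, 0)
    gib_seconds = max(gib * SECONDS_PER_MONTH - FREE_GIB_SECONDS, 0)
    # Blend the active and idle vCPU prices by the duty cycle.
    vcpu_rate = active_fraction * VCPU_ACTIVE + (1 - active_fraction) * VCPU_IDLE
    return vcpu_seconds * vcpu_rate + gib_seconds * MEMORY
```

Calling `monthly_usage_cost(2.25, 4.5, 0.05)` gives roughly 56.80, matching the 5% row. Request charges and the environment fee are left out because they are accounted for separately.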
Two deployment options¶
Option A — Workload-profiles plan with the Consumption profile (recommended). This is required to host the Container Apps Environment inside the platform VNet with private endpoints to Storage and Key Vault. The plan adds the "Dedicated Plan Management" fee even when the workloads run on the Consumption profile (no dedicated node).
| Line item | Monthly | Notes |
|---|---|---|
| Per-second usage (5% active scenario) | ~$57 | from the table above |
| Workload-profile environment fee | $72 | $0.10/hour × 720 hours |
| Platform Storage (table + append blobs for state) | ~$1 | low transactions |
| Container-Apps-side total | ~$130 / month | excludes ACR ($20, already paid) and workload Storage |
Option B — Consumption-only plan (no workload-profile fee). Cheaper, but VNet integration support is more limited and does not cover the day-1 private-storage requirement on every Azure region. Use only if you verify in your subscription that Consumption-only environments can sit in the platform VNet AND reach Key Vault / Storage private endpoints — otherwise you cannot satisfy the Storage Network Isolation invariant.
| Line item | Monthly | Notes |
|---|---|---|
| Per-second usage (5% active scenario) | ~$57 | same math |
| Environment fee | $0 | Consumption-only has none |
| Platform Storage | ~$1 | same |
| Container-Apps-side total | ~$58 / month | only if VNet + private endpoints actually work in this mode |
The plan defaults to Option A because the Storage Network Isolation rule is non-negotiable.
What can move the cost number¶
- Active fraction. Real operator usage tends to be < 1% active on average. At 0.5% active the per-second meters drop to ~$51, for an Option A total of ~$124 / month.
- Resize. Halving CPU on api and worker (to 0.25 each) brings the replica total to 1.75 vCPU / 3.5 GiB and trims ~$13 / month, but eats the proxy headroom. Wait for telemetry before resizing.
- Use Front Door for the SPA (optional, not in the day-1 plan). Adds ~$35 / month for Front Door Standard plus per-GB egress; gains a CDN.
Storage Network Isolation & Browser ↔ Storage Proxy¶
Extracted to its own page
The Storage network-isolation requirements and the Browser ↔ Storage proxy contract are the load-bearing security spec of the control plane. They moved into a standalone reference so they can be cited and audited without scrolling through the rest of this document.
Read it next: Storage Network Isolation & Browser ↔ Storage Proxy.
The 30-second summary that the rest of this document depends on:
- Every workload Storage account stays publicNetworkAccess: Disabled in production. No code path enables it, even temporarily.
- The browser never receives a SAS token. The api sidecar is the only Storage client the browser sees; uploads/downloads stream through it (1 MiB download chunks, 4 MiB block uploads, max 4 concurrent transfers).
Target Architecture¶
Browser
|
| HTTPS (TLS terminated by Container Apps ingress)
| + MSAL access token on /api/* and the WebSocket upgrade
v
Container Apps Environment, VNet integrated
|
+-- ca-elb-dashboard (one Container App, one revision, one replica)
|
+-- container: api (FastAPI, public ingress on :8080)
| - serves /api/* directly
| - reverse-proxies everything else to 127.0.0.1:8081 (frontend)
| - upgrades /api/terminal/ws to a duplex copy with 127.0.0.1:7681 (terminal)
+-- container: frontend (nginx:alpine, listens on 127.0.0.1:8081)
| - serves the built /usr/share/nginx/html (Vite dist/)
| - SPA navigation fallback to /index.html for non-asset paths
| - immutable cache for /assets/*, no-cache for /index.html
+-- container: worker (Celery worker, no ingress)
+-- container: beat (Celery beat, no ingress)
+-- container: redis (redis:7-alpine, listens on 127.0.0.1:6379)
| - ephemeral (no AOF, no Azure Files mount)
| - queue rebuilt from `jobstate` table by beat reconciler on restart
|
+-- container: terminal (ttyd + bash + elastic-blast toolchain,
listens on 127.0.0.1:7681)
- /home/azureuser is ephemeral; user files stage to workload
Storage via azcopy (no Azure Files mount)
All six sidecars share:
- the same network namespace
- api reverse-proxies non-/api/* requests to frontend at 127.0.0.1:8081
- api upgrades /api/terminal/ws to terminal's loopback ttyd at 127.0.0.1:7681
- worker reaches Redis at 127.0.0.1:6379
- the same user-assigned managed identity (id-elb-dashboard-*)
- the same lifecycle (start, stop, restart together)
Private endpoints and managed identity
|
+-- Key Vault
+-- Storage accounts (platform + workload)
+-- Azure Container Registry
+-- AKS private or restricted API server
Component Plan¶
| Component | Target service | Purpose | Notes |
|---|---|---|---|
| ca-elb-dashboard | Azure Container Apps | Single Container App, six sidecars | minReplicas: 1, maxReplicas: 1. Public ingress only on the api container. |
| frontend sidecar | Container in ca-elb-dashboard | nginx:alpine serving the built React SPA dist/ | Listens on 127.0.0.1:8081. SPA navigation fallback to /index.html. Security headers (CSP, HSTS, X-Frame-Options, etc.) move from staticwebapp.config.json into nginx.conf. Image tag matches the SPA build hash so cache-busting is automatic across revisions. |
| api sidecar | Container in ca-elb-dashboard | FastAPI HTTP API on Python 3.12 + reverse proxy for non-/api/* to the frontend sidecar | Owns the public /api/* contract. Public ingress restricted (Container Apps ingress with optional allowedCidrs). Forwards requests that do not match /api/* to 127.0.0.1:8081. Terminates the browser WebSocket and proxies it to the terminal sidecar's loopback ttyd after MSAL + tenant-role check. |
| worker sidecar | Container in ca-elb-dashboard | Celery worker | Pulls from redis://127.0.0.1:6379/0. Writes progress to Storage. |
| beat sidecar | Container in ca-elb-dashboard | Celery beat scheduler | Reads schedule definitions from Storage. Singleton by construction (one container, one replica). |
| redis sidecar | Container in ca-elb-dashboard | Broker + result backend | redis:7-alpine. Binds to 127.0.0.1 only. Ephemeral (no AOF, no Azure Files mount); the broker queue is rebuilt from the jobstate table by the beat reconciler on revision restart. |
| terminal sidecar | Container in ca-elb-dashboard | Browser-accessible operator shell with the elastic-blast toolchain | Image based on Ubuntu 24.04 with azure-cli, kubectl, azcopy, python3.12, primer3, tmux, git, jq, make, and the elastic_blast package + venv pre-installed. Runs ttyd -p 7681 -i 127.0.0.1 -W tmux new -A -s elb so each browser session attaches to the same persistent tmux. /home/azureuser is ephemeral; user files stage to workload Storage via azcopy rather than to a local mount. Authenticates to ARM with id-elb-dashboard-* via the env-injected MSI endpoint. |
| Job state | Azure Storage table + blob | Job registry, audit log, command history, schedule records | Table for indexed lookups (PartitionKey=job_id); blob (append) for audit trail; blob for large request/response payloads. |
| Secrets | Azure Key Vault | App configuration references and any future SSH material | Use private endpoint and RBAC. Keep purge protection enabled. No VM admin password is stored anywhere because there is no VM. |
| Runtime storage | Azure Storage | Query, config, DB, and result blobs | Use private endpoints, HNS where needed, and managed identity auth. |
| Images | Azure Container Registry | App containers (frontend, api, worker, beat, terminal) and ElasticBLAST images | Disable anonymous pulls. Use private endpoint where supported by environment. |
| Workload cluster | AKS | ElasticBLAST compute plane | Keep Workload Identity and Blob CSI. Prefer private cluster or authorized IP ranges. |
| Observability | App Insights plus Log Analytics | Logs, metrics, traces, audit | Use shared job_id, task_id, and correlation_id fields across sidecar logs. Each sidecar emits its own log stream. |
Service Boundaries¶
All six sidecars run in the same Container App revision. Boundaries below describe the responsibilities of each container, not separate Azure resources.
frontend sidecar¶
Responsibilities:
- Serve the built React SPA (web/dist/) over loopback HTTP on 127.0.0.1:8081.
- Provide SPA navigation fallback (any non-asset path that 404s on disk → serve /index.html with 200).
- Apply the security headers that today live in web/staticwebapp.config.json: X-Content-Type-Options: nosniff, X-Frame-Options: DENY, Referrer-Policy: strict-origin-when-cross-origin, Strict-Transport-Security: max-age=31536000; includeSubDomains, and the Content-Security-Policy. These move from the SWA config into nginx.conf.
- Serve /assets/* with Cache-Control: public, immutable, max-age=31536000 (Vite hashes asset filenames). Serve /index.html with Cache-Control: no-cache so a redeploy is picked up immediately.
- Run as non-root, no shell, no extra packages. nginx:alpine with a short custom config (fallback, headers, cache rules) baked into the image.
Image build (elb-frontend:<tag>):
- Multi-stage Dockerfile: stage 1 runs npm ci && npm run build against web/; stage 2 is FROM nginx:alpine and copies web/dist/ into /usr/share/nginx/html plus the custom nginx.conf.
- Image tag = the SPA build hash so cache busting is automatic across revisions.
- No managed identity needed; the container makes no outbound calls.
api sidecar¶
Responsibilities:
- Validate MSAL bearer tokens on /api/* and on the WebSocket upgrade.
- Authorize requests against the caller identity and configured tenant.
- Serve fast read endpoints for dashboard state.
- Create command records in Storage and dispatch Celery tasks via redis://127.0.0.1:6379/0.
- Return 202 Accepted for long-running operations with the Celery task_id and the job_id written to Storage.
- Expose status endpoints backed by Storage state, not by Celery's transient task result API.
- Reverse-proxy non-/api/* requests to the frontend sidecar at 127.0.0.1:8081. This is the only routing rule the api needs: if the path starts with /api/, handle it; otherwise forward verbatim (preserve method, headers, body, query string) to the frontend.
The API should not block on Azure SDK long-running pollers except for small, bounded reads. Any operation expected to exceed the frontend proxy timeout is dispatched as a Celery task.
worker sidecar¶
Responsibilities:
- Run a Celery worker process that pulls tasks from redis://127.0.0.1:6379/0.
- Execute tasks idempotently (use job_id as the idempotency key, guarded by status transitions in Storage).
- Use Azure SDK pollers for VM, AKS, ACR, Storage, and Key Vault operations.
- Persist each step transition to the Storage state document.
- Append audit events for security-relevant operations.
- Use Celery autoretry_for + exponential backoff with explicit retryability decisions.
- Clean up network exposure and temporary storage access in finally paths or task_failure signals.
Start with one worker process consuming a single default queue. Use named
queues (azure, blast, storage) only when there is real contention; even
then, all consumers run inside the same worker container because horizontal
scale-out is not available in this topology.
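Idempotent execution guarded by status transitions can be illustrated without Celery or Azure. In this sketch, `InMemoryState` is a hypothetical stand-in for the Storage state table, and the `succeeded` and `failed` status names are assumptions (the document names only `queued` and `running`). A real table-backed version would presumably also need a conditional write so that two workers cannot both win the same transition.

```python
# Allowed status transitions; anything not listed is rejected.
ALLOWED = {
    "queued": {"running"},
    "running": {"succeeded", "failed"},
}


class InMemoryState:
    """Hypothetical stand-in for the Storage state table, keyed by job_id."""

    def __init__(self) -> None:
        self.rows: dict[str, str] = {}

    def transition(self, job_id: str, new_status: str) -> bool:
        """Apply the transition if it is allowed; return False otherwise."""
        current = self.rows.get(job_id)
        if new_status not in ALLOWED.get(current, set()):
            return False
        self.rows[job_id] = new_status
        return True


def run_once(state: InMemoryState, job_id: str, work) -> bool:
    """Run `work` only if this call wins the queued -> running transition."""
    if not state.transition(job_id, "running"):
        return False  # duplicate delivery or already finished: do nothing
    try:
        work()
    except Exception:
        state.transition(job_id, "failed")
        raise
    state.transition(job_id, "succeeded")
    return True
```

A second delivery of the same job_id returns False without running the work again. The sketch leaves retry out: re-running a failed job would need an explicit transition back to queued.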
beat sidecar¶
Responsibilities:
- Run a single Celery beat process.
- Read schedule definitions from Storage (custom scheduler implementation that reads from a blob/table on startup and on a short interval) so that schedules survive container restarts without an external database.
- Dispatch periodic tasks: AKS health snapshot, ACR tag drift check, storage access window auto-close reconciler, dead-letter scan, in-flight task reconciler (re-dispatch tasks observed as running whose worker disappeared).
- Singleton by construction: one container, one replica.
redis sidecar¶
Responsibilities:
- redis:7-alpine (or pinned digest), no auth required because the listener is bound to 127.0.0.1 and is not reachable from outside the replica.
- Runs with --save '' and --appendonly no. No AOF, no RDB, no Azure Files mount. SMB mounts in Container Apps require a Storage account key, which conflicts with the allowSharedKeyAccess: false invariant on the platform Storage account; see infra/modules/storageState.bicep.
- No outbound traffic; lifecycle managed entirely by the Container App.
This sidecar is a single point of failure for queued work within one revision.
Mitigation: tasks in flight are visible in Storage state (the jobstate
table), and the beat reconciler re-dispatches tasks that were observed as
running but whose worker disappeared.
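The mitigation above could look roughly like the following. How the reconciler decides that a worker has disappeared is not specified in this document, so the sketch assumes each jobstate row carries a `heartbeat` timestamp and treats a running row as orphaned once that timestamp is older than a threshold. The row shape, the five-minute default, and the function names are all hypothetical.

```python
from datetime import datetime, timedelta, timezone


def find_orphans(rows: list[dict], now: datetime, stale_after: timedelta) -> list[str]:
    """Return job_ids marked running whose last heartbeat is older than stale_after."""
    return [
        row["job_id"]
        for row in rows
        if row["status"] == "running" and now - row["heartbeat"] > stale_after
    ]


def reconcile(rows, dispatch, now=None, stale_after=timedelta(minutes=5)) -> int:
    """Re-dispatch every orphaned job and return how many were re-dispatched."""
    now = now or datetime.now(timezone.utc)
    orphans = find_orphans(rows, now, stale_after)
    for job_id in orphans:
        dispatch(job_id)  # e.g. a Celery task's delay(), passed in by the caller
    return len(orphans)
```

Passing `dispatch` in as a callable keeps the selection logic testable without a broker. Rows still in `queued` when Redis restarts would need the same treatment; the sketch covers only the running case described above.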
terminal sidecar¶
This replaces the previous Remote Terminal VM. The user gets a browser-based
shell with the full elastic-blast toolchain, reached only through the api
sidecar's authenticated WebSocket proxy.
Image build (elb-terminal:<tag>, pushed to the platform ACR):
- Base: ubuntu:24.04.
- Apt: azure-cli, kubectl (or installed via direct binary download for version pinning), azcopy, python3.12, python3.12-venv, python3-pip, primer3, git, make, jq, unzip, curl, tmux, ttyd.
- Pre-installed Python deps: requirements/test.txt from dotnetpower/elastic-blast-azure, the Azure mgmt SDKs (azure-mgmt-resource, azure-mgmt-network, azure-mgmt-compute, azure-mgmt-storage, azure-mgmt-containerregistry, azure-mgmt-containerservice, azure-mgmt-authorization, azure-mgmt-msi, azure-mgmt-monitor), and the elastic_blast package itself (installed --no-build-isolation --no-deps exactly like the cloud-init script does today). Versions pinned in the IMAGE_TAGS table so a single bump propagates atomically.
- /etc/profile.d/elb-env.sh exports PYTHONPATH=src:$PYTHONPATH, AZCOPY_AUTO_LOGIN_TYPE=MSI, ELB_SKIP_DB_VERIFY=true, ELB_DISABLE_AUTO_SHUTDOWN=1.
- Entry point: ttyd -p 7681 -i 127.0.0.1 -W tmux new -A -s elb.
  - -i 127.0.0.1 binds to loopback so only the api sidecar (same network namespace) can reach it.
  - -W makes the shell writable (default ttyd is read-only).
  - tmux new -A -s elb attaches every browser session to a single persistent tmux session called elb, so refreshing the browser does not lose work and multiple browser tabs share state. tmux also keeps long-running elastic-blast submit from dying when the WebSocket drops.
Auth and authorization on the WebSocket:
- Browser opens wss://<api-host>/api/terminal/ws with the MSAL access token in the Sec-WebSocket-Protocol header (or as a ?token= query parameter with a short-lived API-issued one-time-use ticket; see verification).
- The api sidecar validates the token, requires the caller to hold a tenant role such as elb-operator, and only then upgrades the WebSocket and starts a duplex copy with the loopback ttyd.
- Per-session correlation id (session_id) is logged at upgrade and on close, with owner_oid and tenant_id.
- Idle-timeout: api closes the WebSocket after 30 minutes of no activity in either direction. tmux survives so reconnecting resumes the same session.
Azure auth from inside the terminal:
- Container Apps exposes a managed-identity endpoint to the workload (IDENTITY_ENDPOINT and IDENTITY_HEADER env vars). The shell startup script runs az login --identity (or, if the user prefers their own identity, az login --use-device-code). The MOTD explains both options.
- AZCOPY_AUTO_LOGIN_TYPE=MSI means azcopy picks up the same identity.
- kubectl uses kubeconfig generated by az aks get-credentials --admin (or via aksAadAuth once the cluster is configured for AAD); the AKS permissions on id-elb-dashboard-* cover this.
Persistence:
- /home/azureuser is ephemeral. There is no Azure Files SMB mount, because SMB mounts require a Storage account key and the platform Storage account runs with allowSharedKeyAccess: false (see infra/modules/storageState.bicep).
- User query files, downloaded result snippets, and similar artefacts stage to workload Storage via azcopy instead of to a local mount.
- The cloned elastic-blast-azure repo, the venv, and the pre-installed toolchain all live inside the container image; they are immutable per revision and do not depend on a writable home directory.
- ~/.azure/ and ~/.kube/config are regenerated on each session: the startup script runs az login --identity against the MI endpoint and az aks get-credentials against the workload cluster.
Lifecycle:
- Starts and stops with the rest of the Container App revision. There is no per-user provisioning, no per-VM cloud-init wait, and no admin password to reveal.
- Resource limits: 0.5 vCPU / 1 GiB initial; revisit after the first real user session that runs an elastic-blast submit. The terminal carries the largest image in the bundle because it includes the toolchain; its allocation matches api and worker.
What this sidecar intentionally does NOT carry (do not re-introduce; the left column is the retired Remote Terminal VM model preserved as a guardrail):
| Retired (VM model) | Replacement in the sidecar model |
|---|---|
| Ubuntu 24.04 VM (vm-elb-terminal) | elb-terminal:<tag> container in ca-elb-dashboard |
| 10-15 min cloud-init bootstrap (apt, pip, clone, venv, defender-onboarding retry) | Image build does this once at CI time. Cold start is whatever the container engine takes (seconds). |
| azure-cli, kubectl, azcopy, git, make, jq, python3.12, primer3, tmux installed via cloud-init | All baked into the image at build time, with retry / failure handling moved to CI |
| ~/elastic-blast-azure clone + venv + pip install -r requirements/test.txt + pip install --no-build-isolation --no-deps elastic_blast | All baked into the image; venv at /opt/elb/venv. |
| azure-mgmt-* SDKs installed via cloud-init | Baked into the image |
| /etc/profile.d/elb-env.sh env vars | Same content baked into the image |
| elb-az-login-mi script that runs az login --identity against IMDS | Same script runs from the image; uses Container Apps' MI endpoint instead of IMDS. The end result (az account show works) is identical. |
| MOTD with onboarding hints | Same MOTD baked into the image |
| SSH on port 22 + 443 | Removed. No SSH. Browser → api WebSocket → ttyd. |
| Port 22 / Port 443 in sshd_config | Removed. |
| Per-VM admin password generated and stored in Key Vault, revealed once via /api/terminal/{vm}/password | Removed. No password. Access is gated by MSAL + tenant role on the WebSocket upgrade. |
| NSG with AllowSSH rule scoped to caller IP via /api/terminal/{vm}/open-ssh | Removed. No NSG, no IP allow-list. |
| /api/terminal/{vm}/start (start the VM) | Removed. Terminal lifecycle is the Container App revision lifecycle; stopping the terminal would mean stopping the whole control plane. |
| /api/terminal/{vm}/stop (deallocate the VM) | Removed for the same reason. |
| /api/terminal/{vm}/destroy (delete VM, NIC, IP, KV secret) | Replaced by container-image redeploy. There is no per-user resource to delete. |
| /api/terminal/{vm}/health (power state, cloud-init progress, reachability) | Replaced by the Container App revision health and a cheap /api/terminal/health ping that checks tcp://127.0.0.1:7681 from the api sidecar. |
| /api/terminal/provision Durable orchestrator (RG, network, KV, password, VM, RBAC, cloud-init poll) | Removed. Provisioning is azd up + revision rollout. The first time the platform is deployed there is one-time AKS workload-identity / RBAC setup, but no per-user provisioning. |
| Persistent /home/azureuser on the OS disk | Ephemeral /home/azureuser; user files stage to workload Storage via azcopy. |
| Operator runbook step: "wait for cloud-init", "open NSG to your IP", "reveal password", "ssh in" | Operator runbook step: "open the Terminal tab in the dashboard". |
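
The loopback check behind /api/terminal/health can be sketched as a plain TCP connect from the api sidecar. This is a minimal sketch, not the shipped handler: the function names, the 0.5 s timeout, and the response bodies are assumptions; only the target (127.0.0.1:7681) comes from the table above.

```python
import socket


def terminal_sidecar_reachable(host: str = "127.0.0.1", port: int = 7681,
                               timeout_s: float = 0.5) -> bool:
    """Return True if a TCP connection to the terminal sidecar can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


def terminal_health(host: str = "127.0.0.1", port: int = 7681) -> tuple[int, dict]:
    """Hypothetical handler body: returns (HTTP status code, JSON body)."""
    if terminal_sidecar_reachable(host, port):
        return 200, {"terminal": "ok"}
    return 503, {"terminal": "unreachable"}
```

A connect-only probe stays cheap because it never speaks the ttyd protocol; it only confirms that something is listening on the loopback port.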
Verification:
- A test that opening wss://<api-host>/api/terminal/ws without a token returns 401; without the required tenant role returns 403; with both succeeds and returns a working bash prompt.
- A test that two concurrent browser tabs see the same tmux session and that closing one tab does not kill the other or kill any process started in the shared session.
- A test that running az account show from the terminal sidecar returns the id-elb-dashboard-* identity by default, and that running az login --use-device-code lets the user override with their own identity for the duration of the session (without leaking back into the shared tmux for other users; sessions are per-tmux-window, and the docs make this explicit).
- A test that kubectl get nodes, azcopy ls, and elastic-blast --help all work without further setup.
- A test that the api sidecar refuses to upgrade the WebSocket when the terminal sidecar's loopback port is unreachable, returning a 503 with a clear "terminal sidecar unhealthy" message.
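
The status codes in the checks above imply a gate that runs before the WebSocket upgrade. The sketch below models that decision as a pure function so it can be unit-tested without a browser. The function name, the UpgradeDecision type, the role name used in the usage line, and the order of the checks (authentication, then role, then sidecar health) are assumptions; the source only fixes the 401 / 403 / 503 outcomes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class UpgradeDecision:
    status: int   # 101 means "proceed with the WebSocket upgrade"
    reason: str


def decide_terminal_upgrade(token_roles: Optional[set[str]],
                            required_role: str,
                            terminal_reachable: bool) -> UpgradeDecision:
    """Decide whether the api sidecar should upgrade /api/terminal/ws.

    token_roles is None when no valid token was presented; otherwise it is
    the set of tenant roles carried by the validated token.
    """
    if token_roles is None:
        return UpgradeDecision(401, "missing or invalid token")
    if required_role not in token_roles:
        return UpgradeDecision(403, "required tenant role not present")
    if not terminal_reachable:
        return UpgradeDecision(503, "terminal sidecar unhealthy")
    return UpgradeDecision(101, "switching protocols")
```

For example, `decide_terminal_upgrade({"Terminal.User"}, "Terminal.User", False)` yields the 503 case, where `Terminal.User` is a placeholder role name.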
Command and State Model¶
Replace Durable Functions with an explicit Celery task model backed by Storage.
HTTP POST /api/blast/submit
  -> validate request
  -> write Storage state row: PartitionKey=job_id, status=queued
  -> dispatch Celery task: submit_blast.delay(job_id=...)
  -> return 202 + { job_id, task_id }

ca-elb-dashboard / worker sidecar pulls task from Redis sidecar (127.0.0.1:6379)
  -> update Storage: status=running, phase=checking_vm
  -> execute steps with autoretry_for + exponential backoff
  -> append audit event after each step
  -> update Storage: status=completed or failed
  -> on failure, run cleanup compensations (close storage window, etc.)
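
The flow above can be exercised without any Azure or Celery dependency. The sketch below is a stand-in, not the shipped code: in-memory structures replace the Storage table, the audit blob, and the Redis-backed queue, all function and variable names are hypothetical, and retries with backoff are deliberately left out because the document assigns them to Celery's autoretry_for.

```python
import uuid
from datetime import datetime, timezone

STATE: dict[str, dict] = {}          # stand-in for the job-state table
AUDIT: list[dict] = []               # stand-in for the audit append blob
QUEUE: list[tuple[str, str]] = []    # stand-in for the Redis-backed Celery queue


def _now() -> str:
    return datetime.now(timezone.utc).isoformat()


def submit(job_id: str) -> tuple[int, dict]:
    """Mirror of POST /api/blast/submit: persist state first, then dispatch."""
    task_id = str(uuid.uuid4())
    STATE[job_id] = {"status": "queued", "phase": None, "task_id": task_id,
                     "error_code": None, "created_at": _now(), "updated_at": _now()}
    QUEUE.append((task_id, job_id))
    return 202, {"job_id": job_id, "task_id": task_id}


def run_next(steps, compensations) -> None:
    """Mirror of the worker: run steps in order, audit each, compensate on failure.

    steps is a list of (phase_name, callable); compensations is a list of
    callables that undo side effects (for example closing a storage window).
    """
    _task_id, job_id = QUEUE.pop(0)
    row = STATE[job_id]
    row.update(status="running", updated_at=_now())
    try:
        for phase, step in steps:
            row.update(phase=phase, updated_at=_now())
            step(job_id)
            AUDIT.append({"job_id": job_id, "phase": phase, "event": "ok"})
        row.update(status="completed", updated_at=_now())
    except Exception as exc:
        AUDIT.append({"job_id": job_id, "phase": row["phase"], "event": "failed"})
        row.update(status="failed", error_code=type(exc).__name__, updated_at=_now())
        for undo in compensations:
            undo(job_id)
```

Writing the state row before dispatching means a job that never reaches the queue is still visible as queued, which is easier to detect than a task with no record.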
Recommended Storage layout (platform storage account):
| Container / table | Purpose | Format |
|---|---|---|
| job-state (table) | Indexed lookup of current job status | PartitionKey=job_id, RowKey="current", columns: status, phase, owner_oid, tenant_id, created_at, updated_at, task_id, error_code |
| job-history (table) | Per-step transitions (queryable by job) | PartitionKey=job_id, RowKey=ulid(timestamp), columns: phase, event, payload_blob_uri |
| job-payloads (blob, append) | Sanitised request and result payloads, large step outputs | One append-blob per job_id; immutable once status is terminal |
| audit (blob, append) | Security-relevant events (storage open/close, role assignment changes, terminal lifecycle) | Daily-rolled append blobs, JSON Lines |
| schedules (blob) | Celery beat schedule definitions | Single JSON blob, versioned by ETag |
| dead-letter (blob) | Tasks that exhausted retries | One blob per failure, includes task name, args (sanitised), traceback |
State document shape (table row, JSON-encoded payload column for variable
fields):
{
"PartitionKey": "<job_id>",
"RowKey": "current",
"type": "blast_job",
"tenant_id": "...",
"owner_oid": "...",
"status": "queued|running|completed|failed|cancelled",
"phase": "checking_vm|opening_storage|uploading|submitting|polling|closing_storage",
"created_at": "2026-05-14T00:00:00Z",
"updated_at": "2026-05-14T00:00:00Z",
"task_id": "celery-uuid",
"error_code": null,
"payload_blob_uri": "https://stelb*/job-payloads/<job_id>.jsonl"
}
Keep request payloads sanitised. Do not store bearer tokens, SAS URLs, VM passwords, or raw command output that may contain secrets in any Storage artifact.
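
One way to enforce this rule is a sanitiser that every payload passes through before it is written. The sketch below is illustrative only: the key patterns, the regular expressions, and the `[REDACTED]` marker are assumptions, and it does not attempt to scrub arbitrary secrets from raw command output, which is better excluded outright.

```python
import re
from typing import Any

REDACTED = "[REDACTED]"
# Field names whose values are dropped wholesale (over-redaction is the safe side).
_SECRET_KEYS = re.compile(r"(password|secret|token|authorization)", re.IGNORECASE)
# "Bearer <token>" anywhere inside a string value.
_BEARER = re.compile(r"\bBearer\s+[A-Za-z0-9\-._~+/]+=*", re.IGNORECASE)
# The signature query parameter of a SAS URL.
_SAS_SIG = re.compile(r"([?&]sig=)[^&\s\"']+", re.IGNORECASE)


def sanitise(value: Any, key: str = "") -> Any:
    """Return a copy of value with secret-looking fields and substrings redacted."""
    if key and _SECRET_KEYS.search(key):
        return REDACTED
    if isinstance(value, dict):
        return {k: sanitise(v, str(k)) for k, v in value.items()}
    if isinstance(value, list):
        return [sanitise(v) for v in value]
    if isinstance(value, str):
        value = _BEARER.sub("Bearer " + REDACTED, value)
        return _SAS_SIG.sub(r"\1" + REDACTED, value)
    return value
```

Redacting only the sig parameter keeps the rest of the URL readable for debugging while making the link unusable.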
Why Storage instead of a database¶
- Workload is append-mostly with single-key lookups (job_id).
- Consistency model needed is single-row ETag updates, not multi-row transactions.
- Storage tables are billed per operation, with no minimum throughput.
- A future move to Cosmos DB or PostgreSQL is straightforward because the repository layer hides the storage shape.
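
The single-row ETag update mentioned above amounts to a compare-and-swap on one entity. The sketch below shows the pattern against an in-memory stand-in for the table; the class, the exception, and the transition helper are all hypothetical and only mimic the read, modify, conditional-replace cycle that an ETag-aware Storage client would perform.

```python
import itertools


class PreconditionFailed(Exception):
    """Raised when the caller's ETag no longer matches the stored row."""


class InMemoryTable:
    """Stand-in for a table keyed by (PartitionKey, RowKey) with ETags."""

    def __init__(self) -> None:
        self._rows: dict[tuple[str, str], tuple[dict, str]] = {}
        self._counter = itertools.count(1)

    def _next_etag(self) -> str:
        return f"etag-{next(self._counter)}"

    def insert(self, pk: str, rk: str, entity: dict) -> str:
        etag = self._next_etag()
        self._rows[(pk, rk)] = (dict(entity), etag)
        return etag

    def get(self, pk: str, rk: str) -> tuple[dict, str]:
        entity, etag = self._rows[(pk, rk)]
        return dict(entity), etag

    def replace(self, pk: str, rk: str, entity: dict, if_match: str) -> str:
        _, current = self._rows[(pk, rk)]
        if current != if_match:
            raise PreconditionFailed(f"{pk}/{rk} changed since it was read")
        etag = self._next_etag()
        self._rows[(pk, rk)] = (dict(entity), etag)
        return etag


def transition(table: InMemoryTable, job_id: str, expected: str, new: str,
               attempts: int = 3) -> bool:
    """Move a job from expected to new status; re-read and retry on ETag conflict."""
    for _ in range(attempts):
        entity, etag = table.get(job_id, "current")
        if entity["status"] != expected:
            return False
        entity["status"] = new
        try:
            table.replace(job_id, "current", entity, if_match=etag)
            return True
        except PreconditionFailed:
            continue
    return False
```

Because each job's current state lives in exactly one row, this is the only concurrency primitive the state model needs.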
Runtime Plan (Networking · Identity · Storage · AKS · Smoke)¶
Extracted to its own page
The networking subnets / private DNS, the shared user-assigned managed identity + RBAC matrix, the Storage account rules, the AKS plan, and the post-deploy smoke checklist used to live at the end of this page. They are now a standalone operator reference so they can be cited and audited independently of the Container Apps topology.
Read it next: Runtime Plan — Networking, Identity, Storage, AKS.