Google Cloud Run
This page covers deploying the API to Cloud Run, including the settings that matter for performance under concurrent load. The recommended settings were validated by load testing: tuning request concurrency alone gave roughly 13x higher throughput and 10x lower latency.
Prerequisites
- The gcloud CLI installed and authenticated (gcloud auth login).
- A GCP project with billing enabled.
- The required APIs enabled:
gcloud services enable run.googleapis.com cloudbuild.googleapis.com \
artifactregistry.googleapis.com --project=YOUR_PROJECT
Container port
The image listens on the port given by the PORT environment variable, falling back to
8000 when PORT is unset. Cloud Run sets PORT automatically (8080 by default), so the
container needs no extra flag.
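The port fallback can be pictured with a minimal Python sketch. This is an illustration only, not the image's actual entrypoint: resolve_port and DEFAULT_PORT are hypothetical names, and the real container may resolve the port in its start command instead.

```python
import os

DEFAULT_PORT = 8000  # fallback described above for runs outside Cloud Run


def resolve_port(environ=None):
    """Return the port to bind: PORT if set and non-empty, else DEFAULT_PORT.

    `environ` is injectable so the logic can be checked without touching
    the real process environment (hypothetical helper, not the app's code).
    """
    env = os.environ if environ is None else environ
    raw = env.get("PORT", "").strip()
    return int(raw) if raw else DEFAULT_PORT
```

With Cloud Run's default of PORT=8080 the function returns 8080; with no PORT in the environment it returns 8000.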
Deploy from source
--source uploads the repo, builds the image with Cloud Build using the Dockerfile,
pushes it to Artifact Registry, then deploys. No local Docker is needed.
gcloud run deploy epydemix-api \
--source . \
--project=YOUR_PROJECT \
--region=us-east1 \
--cpu=2 --memory=2Gi \
--concurrency=2 \
--min-instances=1 --max-instances=20 \
--timeout=120 \
--allow-unauthenticated
On success it prints the service URL, for example
https://epydemix-api-XXXXXXXXX.us-east1.run.app.
Performance tuning
The simulation endpoint is CPU-bound. One SIR simulation (Nsim=20) takes about 1.5s, and each uvicorn worker runs one at a time. With 2 workers on 2 vCPUs, a single instance sustains about 1.3 simulations per second.
- --concurrency=2 is the key setting and must match the worker and vCPU count. The Cloud Run default is 80, which routes up to 80 simulations to one 2-worker instance. They queue behind the workers and latency climbs to 30 to 60 seconds. Setting concurrency to 2 means each instance only takes what it can run in parallel, so Cloud Run scales out to more instances instead of overloading one.
- --cpu=2 must agree with uvicorn --workers 2. More workers than vCPUs only time-share the CPU and add no throughput.
- --max-instances=20 sets the throughput ceiling: 20 instances times ~1.3/s is about 26 simulations per second. Raise it if you need more.
- --min-instances=1 keeps a warm instance so the first request avoids a cold start (container boot, epydemix import, population cache warm). Set it higher before a known spike such as a workshop.
- --timeout=120 is the Cloud Run request timeout. The app also self-limits to 60s.
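The figures in this section follow from simple arithmetic, sketched below in Python. It is a back-of-the-envelope model, not a measurement: it assumes every simulation takes the same 1.5s, that workers are never idle while requests wait, and that all queued requests arrive at once. The function names are made up for this example.

```python
import math


def instance_throughput(workers, sim_seconds):
    """Simulations per second one instance sustains when each worker runs one at a time."""
    return workers / sim_seconds


def fleet_ceiling(max_instances, workers, sim_seconds):
    """Upper bound on simulations per second across all instances."""
    return max_instances * instance_throughput(workers, sim_seconds)


def worst_case_latency(concurrency, workers, sim_seconds):
    """Seconds until the last of `concurrency` simultaneous requests finishes.

    Requests are served in waves of `workers` at a time, so the last one
    waits for every earlier wave plus its own run.
    """
    waves = math.ceil(concurrency / workers)
    return waves * sim_seconds


per_instance = instance_throughput(workers=2, sim_seconds=1.5)              # about 1.33/s
ceiling = fleet_ceiling(max_instances=20, workers=2, sim_seconds=1.5)       # about 26.7/s
default_tail = worst_case_latency(concurrency=80, workers=2, sim_seconds=1.5)  # 60.0 s
tuned_tail = worst_case_latency(concurrency=2, workers=2, sim_seconds=1.5)     # 1.5 s
```

Under these assumptions, the default concurrency of 80 puts the last queued request at about 60s, in line with the 30 to 60 second latencies described above, while concurrency 2 keeps every request at roughly one simulation's run time.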
Verify
URL=https://epydemix-api-XXXXXXXXX.us-east1.run.app
curl -s "$URL/api/v1/health"
A single SIR simulation (Nsim=20) should return HTTP 200 in about 1.5s.
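The same health check can be scripted when a timing figure is wanted alongside the status. The sketch below uses only the Python standard library and the health path from the curl command above. timed_get and check_health are hypothetical helper names, and the opener parameter exists only so the logic can be exercised without a live service.

```python
import time
import urllib.request


def timed_get(url, opener=urllib.request.urlopen, timeout=10.0):
    """GET `url` and return (HTTP status, elapsed seconds).

    With the default opener, 4xx/5xx responses raise urllib.error.HTTPError
    rather than returning a status.
    """
    start = time.perf_counter()
    with opener(url, timeout=timeout) as response:
        response.read()  # drain the body so the timing covers the full response
        status = response.status
    return status, time.perf_counter() - start


def check_health(base_url, opener=urllib.request.urlopen):
    """Return (is_healthy, elapsed seconds) for the service's health endpoint."""
    url = base_url.rstrip("/") + "/api/v1/health"
    status, elapsed = timed_get(url, opener)
    return status == 200, elapsed
```

Calling check_health with the service URL printed by the deploy command returns a pair such as (True, elapsed), where elapsed depends on network conditions and on whether an instance is already warm.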
Cost management
--min-instances=1 keeps a container running continuously, which bills around the clock.
When not actively serving, scale to zero or delete:
# scale to zero (keep the service, stop idle billing):
gcloud run services update epydemix-api --project=YOUR_PROJECT --region=us-east1 --min-instances=0
# delete entirely:
gcloud run services delete epydemix-api --project=YOUR_PROJECT --region=us-east1
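To get a feel for what an always-on instance reserves, the sketch below counts the vCPU-seconds and GiB-seconds held by the minimum instances over a period. It deliberately contains no prices: rates vary by region and billing mode, so multiply the results by the current figures from the Cloud Run pricing page. The function name and the 30-day default are assumptions made for this example.

```python
SECONDS_PER_DAY = 86_400


def always_on_resource_seconds(min_instances, vcpus, memory_gib, days=30):
    """Resource-seconds reserved by instances that never scale down.

    Returns vCPU-seconds and GiB-seconds only; converting them to a cost
    requires the current per-second rates, which are not included here.
    """
    seconds = min_instances * days * SECONDS_PER_DAY
    return {
        "vcpu_seconds": seconds * vcpus,
        "gib_seconds": seconds * memory_gib,
    }


# The deployment above: 1 warm instance with 2 vCPUs and 2 GiB over 30 days.
monthly = always_on_resource_seconds(min_instances=1, vcpus=2, memory_gib=2)
# monthly == {"vcpu_seconds": 5184000, "gib_seconds": 5184000}
```

Setting min_instances to 0 makes both totals zero, which is the effect of the scale-to-zero command above on idle usage.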
Redeploy a prebuilt image
To redeploy with changed settings without rebuilding, point at the existing image in Artifact Registry:
gcloud run deploy epydemix-api \
--image us-east1-docker.pkg.dev/YOUR_PROJECT/cloud-run-source-deploy/epydemix-api:latest \
--project=YOUR_PROJECT --region=us-east1 \
--cpu=2 --memory=2Gi --concurrency=2 \
--min-instances=1 --max-instances=20 --timeout=120 --allow-unauthenticated
Terraform
Terraform does not build images. Build and push first
(gcloud builds submit --tag ... or docker push), then reference the image:
resource "google_cloud_run_v2_service" "api" {
  name     = "epydemix-api"
  location = "us-east1"

  template {
    scaling {
      min_instance_count = 1
      max_instance_count = 20
    }

    max_instance_request_concurrency = 2 # the key knob
    timeout                          = "120s"

    containers {
      image = "us-east1-docker.pkg.dev/YOUR_PROJECT/REPO/epydemix-api:latest"

      resources {
        limits = { cpu = "2", memory = "2Gi" }
      }
    }
  }
}