Deploy Gemma 4 on Cloud Run
Run Gemma 4 as a serverless container on Google Cloud Run. Pay-per-request pricing with GPU support using Cloud Run GPU preview.
Why Cloud Run for Gemma 4?
Cloud Run is Google Cloud's fully managed serverless container platform. You push a Docker image, configure a few parameters, and Cloud Run handles scaling from zero to thousands of requests with no cluster management.
Cloud Run now supports GPU accelerators (NVIDIA L4, currently in preview), which makes it possible to run Gemma 4 inference as a true serverless workload. You pay only while instances are running; once the service scales to zero, idle time is free.
The approach in this guide uses Ollama as the inference backend inside the container. Ollama handles downloading the model, serving it over HTTP, and running pre-quantized GGUF builds, which keeps the Dockerfile straightforward.
Architecture
Internet → Cloud Run Service (Ollama + Gemma 4) → Ollama API (/api/chat)

Cloud Run scales the container to zero between requests. On a cold start, the container starts the Ollama server; the model weights are already baked into the image so there is no download delay at runtime.
Prerequisites
- Google Cloud project with billing enabled
- gcloud CLI installed and authenticated
- Docker installed locally (for building the image)
- Cloud Run GPU quota enabled (request via the Cloud Run quota page in the console)
Enable the Cloud Run and Artifact Registry APIs:
gcloud services enable run.googleapis.com artifactregistry.googleapis.com \
--project=YOUR_PROJECT_ID

Dockerfile
Create a Dockerfile that extends the official Ollama image, bakes the model weights into the image at build time, and adds a startup script:
FROM ollama/ollama:latest

# Cloud Run routes traffic to port 8080 by default; Ollama's default is 11434
ENV OLLAMA_HOST=0.0.0.0:8080

# Pull the model during the build so the weights land in an image layer
# instead of being downloaded at container startup
RUN ollama serve & sleep 5 && ollama pull gemma4:4b

COPY start.sh /start.sh
RUN chmod +x /start.sh

# Override the base image's ollama entrypoint so the script itself runs
ENTRYPOINT ["/start.sh"]

Create start.sh alongside the Dockerfile:
#!/bin/bash
ollama serve &
wait

In the RUN line, the sleep 5 gives the temporary Ollama server time to initialise before the pull command runs; because this happens at build time, the weights ship inside the image. In start.sh, wait keeps the container alive: if the background ollama serve process dies, wait exits and Cloud Run restarts the container.
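Before pushing, the image can be smoke-tested locally. This is a sketch; it assumes Docker is available and that Ollama inside the container listens on port 8080 (adjust the port mapping if your image keeps Ollama's default 11434):

```shell
docker build -t ollama-gemma4-local .
docker run -d --name gemma4-test -p 8080:8080 ollama-gemma4-local
sleep 10
# List the models baked into the image; gemma4:4b should appear
curl http://localhost:8080/api/tags
docker rm -f gemma4-test
```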
Build and Push the Image
Build the image and push it to Google Artifact Registry:
# Create an Artifact Registry repository (one-time setup)
gcloud artifacts repositories create gemma4 \
--repository-format=docker \
--location=us-central1
# Configure Docker authentication
gcloud auth configure-docker us-central1-docker.pkg.dev
# Build and push
docker build -t us-central1-docker.pkg.dev/YOUR_PROJECT/gemma4/ollama-gemma4:latest .
docker push us-central1-docker.pkg.dev/YOUR_PROJECT/gemma4/ollama-gemma4:latest

Deploy to Cloud Run with GPU
Deploy the container to Cloud Run, requesting an L4 GPU:
gcloud run deploy gemma4-ollama \
--image us-central1-docker.pkg.dev/YOUR_PROJECT/gemma4/ollama-gemma4:latest \
--region us-central1 \
--gpu 1 \
--gpu-type nvidia-l4 \
--memory 16Gi \
--cpu 4 \
--concurrency 4 \
--timeout 300 \
--no-cpu-throttling \
--allow-unauthenticated

Key flags:
- --gpu 1 --gpu-type nvidia-l4 — attach one L4 GPU (24 GB VRAM, sufficient for Gemma 4 E4B in GGUF format)
- --memory 16Gi — system RAM for the container process
- --concurrency 4 — Cloud Run will route up to 4 simultaneous requests to one container instance before scaling out
- --timeout 300 — allow up to 5 minutes for long LLM generations
- --no-cpu-throttling — required for GPU services; CPU stays allocated for the life of each instance
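After the deploy finishes, the assigned URL can be read back with gcloud. A sketch, using the gemma4-ollama service name from the deploy command above:

```shell
# Print the HTTPS URL Cloud Run assigned to the service
gcloud run services describe gemma4-ollama \
  --region us-central1 \
  --format 'value(status.url)'
```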
Query the Endpoint
Once deployed, Cloud Run provides a public HTTPS URL. Test it with curl:
curl -X POST https://YOUR-SERVICE-URL/api/chat \
-H "Content-Type: application/json" \
-d '{
"model": "gemma4:4b",
"messages": [{"role": "user", "content": "What is Gemma 4?"}],
"stream": false
}'

Or use the OpenAI-compatible endpoint that recent Ollama releases expose:
curl -X POST https://YOUR-SERVICE-URL/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gemma4:4b",
"messages": [{"role": "user", "content": "Explain Gemma 4 briefly"}]
}'

Cold Start Considerations
The main drawback of Cloud Run for LLM inference is cold start latency. When Cloud Run scales from zero, it must:
- Pull the container image (mitigated by keeping the image in the same region as the service)
- Start the Ollama server (~2 seconds)
- Load the model weights into GPU VRAM (~5–10 seconds for the 4B GGUF model)
Total cold start is typically 15–30 seconds for the E4B model. To mitigate this, set --min-instances 1 to keep one warm instance running at all times — this adds a small always-on GPU cost.
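The min-instances mitigation can be applied to an existing service without redeploying. A sketch, again assuming the gemma4-ollama service name used in this guide:

```shell
# Keep one warm instance to avoid cold starts (billed even while idle)
gcloud run services update gemma4-ollama \
  --region us-central1 \
  --min-instances 1
```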
Pricing
Cloud Run GPU pricing (approximate, us-central1):
- L4 GPU: ~$0.80/hour while active
- CPU + memory: standard Cloud Run rates
With zero-scaling and moderate traffic (a few hundred requests per day), monthly costs are typically well under $50.
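As a rough sanity check on that figure, here is the GPU-time arithmetic for an illustrative workload of 300 requests per day at about 4 seconds of GPU-active time each (assumed numbers, not measurements):

```shell
# Illustrative GPU cost estimate: requests/day x seconds/request x hourly rate
awk 'BEGIN {
  reqs_per_day  = 300    # assumed traffic
  secs_per_req  = 4      # assumed GPU-active time per request
  rate_per_hour = 0.80   # approximate L4 rate, us-central1
  gpu_hours_day = reqs_per_day * secs_per_req / 3600
  printf "GPU cost/month: $%.2f\n", gpu_hours_day * rate_per_hour * 30
}'
```

At that volume the GPU line item comes to roughly $8 per month, consistent with the well-under-$50 estimate above; CPU, memory, and request charges add a smaller amount on top.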