Deploy Gemma 4 on Cloud Run
Run Gemma 4 as a serverless container on Google Cloud Run. Pay-per-request pricing with GPU support using Cloud Run GPU preview.
Why Cloud Run for Gemma 4?
Cloud Run is Google Cloud's fully managed serverless container platform. You push a Docker image, configure a few parameters, and Cloud Run handles scaling from zero to thousands of requests with no cluster management.
Cloud Run now supports GPU accelerators (NVIDIA L4, currently in preview), which makes it possible to run Gemma 4 inference as a true serverless workload. You pay only while instances are running; once the service scales to zero, idle time is free.
The approach in this guide uses Ollama as the inference backend inside the container. Ollama handles downloading the model, serving it over HTTP, and running pre-quantized GGUF builds, which keeps the Dockerfile straightforward.
Architecture
Internet → Cloud Run Service (Ollama + Gemma 4) → Ollama API (/api/chat)

Cloud Run scales the container to zero between requests. On a cold start, the container starts the Ollama server; the model weights are already baked into the image so there is no download delay at runtime.
Prerequisites
- Google Cloud project with billing enabled
- gcloud CLI installed and authenticated
- Docker installed locally (for building the image)
- Cloud Run GPU quota enabled (request via the Cloud Run quota page in the console)
Enable the Cloud Run and Artifact Registry APIs:
gcloud services enable run.googleapis.com artifactregistry.googleapis.com \
--project=YOUR_PROJECT_ID

Dockerfile
Create a Dockerfile that extends the official Ollama image, bakes the model weights into the image at build time, and adds a startup script:
FROM ollama/ollama:latest

# Cloud Run routes traffic to port 8080 by default; Ollama's default is 11434
ENV OLLAMA_HOST=0.0.0.0:8080

# Pull the model during the build so the weights land in an image layer
# instead of being downloaded at container startup
RUN ollama serve & sleep 5 && ollama pull gemma4:4b

COPY start.sh /start.sh
RUN chmod +x /start.sh

# Override the base image's ollama entrypoint so the script itself runs
ENTRYPOINT ["/start.sh"]

Create start.sh alongside the Dockerfile:
#!/bin/bash
ollama serve &
wait

In the RUN line, the sleep 5 gives the temporary Ollama server time to initialise before the pull command runs; because this happens at build time, the weights ship inside the image. In start.sh, wait keeps the container alive: if the background ollama serve process dies, wait exits and Cloud Run restarts the container.
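Before pushing, the image can be smoke-tested locally. This is a sketch; it assumes Docker is available and that Ollama inside the container listens on port 8080 (adjust the port mapping if your image keeps Ollama's default 11434):

```shell
docker build -t ollama-gemma4-local .
docker run -d --name gemma4-test -p 8080:8080 ollama-gemma4-local
sleep 10
# List the models baked into the image; gemma4:4b should appear
curl http://localhost:8080/api/tags
docker rm -f gemma4-test
```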
Build and Push the Image
Build the image and push it to Google Artifact Registry:
# Create an Artifact Registry repository (one-time setup)
gcloud artifacts repositories create gemma4 \
--repository-format=docker \
--location=us-central1
# Configure Docker authentication
gcloud auth configure-docker us-central1-docker.pkg.dev
# Build and push
docker build -t us-central1-docker.pkg.dev/YOUR_PROJECT/gemma4/ollama-gemma4:latest .
docker push us-central1-docker.pkg.dev/YOUR_PROJECT/gemma4/ollama-gemma4:latest

Deploy to Cloud Run with GPU
Deploy the container to Cloud Run, requesting an L4 GPU:
gcloud run deploy gemma4-ollama \
--image us-central1-docker.pkg.dev/YOUR_PROJECT/gemma4/ollama-gemma4:latest \
--region us-central1 \
--gpu 1 \
--gpu-type nvidia-l4 \
--memory 16Gi \
--cpu 4 \
--concurrency 4 \
--timeout 300 \
--no-cpu-throttling \
--allow-unauthenticated

Key flags:
- --gpu 1 --gpu-type nvidia-l4 — attach one L4 GPU (24 GB VRAM, sufficient for Gemma 4 E4B in GGUF format)
- --memory 16Gi — system RAM for the container process
- --concurrency 4 — Cloud Run will route up to 4 simultaneous requests to one container instance before scaling out
- --timeout 300 — allow up to 5 minutes for long LLM generations
- --no-cpu-throttling — required for GPU services; CPU stays allocated for the life of each instance
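After the deploy finishes, the assigned URL can be read back with gcloud. A sketch, using the gemma4-ollama service name from the deploy command above:

```shell
# Print the HTTPS URL Cloud Run assigned to the service
gcloud run services describe gemma4-ollama \
  --region us-central1 \
  --format 'value(status.url)'
```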
Query the Endpoint
Once deployed, Cloud Run provides a public HTTPS URL. Test it with curl:
curl -X POST https://YOUR-SERVICE-URL/api/chat \
-H "Content-Type: application/json" \
-d '{
"model": "gemma4:4b",
"messages": [{"role": "user", "content": "What is Gemma 4?"}],
"stream": false
}'

Or use the OpenAI-compatible endpoint that recent Ollama releases expose:
curl -X POST https://YOUR-SERVICE-URL/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gemma4:4b",
"messages": [{"role": "user", "content": "Explain Gemma 4 briefly"}]
}'

Cold Start Considerations
The main drawback of Cloud Run for LLM inference is cold start latency. When Cloud Run scales from zero, it must:
- Pull the container image (mitigated by keeping the image in the same region as the service)
- Start the Ollama server (~2 seconds)
- Load the model weights into GPU VRAM (~5–10 seconds for the 4B GGUF model)
Total cold start is typically 15–30 seconds for the E4B model. To mitigate this, set --min-instances 1 to keep one warm instance running at all times — this adds a small always-on GPU cost.
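The min-instances mitigation can be applied to an existing service without redeploying. A sketch, again assuming the gemma4-ollama service name used in this guide:

```shell
# Keep one warm instance to avoid cold starts (billed even while idle)
gcloud run services update gemma4-ollama \
  --region us-central1 \
  --min-instances 1
```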
Pricing
Cloud Run GPU pricing (approximate, us-central1):
- L4 GPU: ~$0.80/hour while active
- CPU + memory: standard Cloud Run rates
With zero-scaling and moderate traffic (a few hundred requests per day), monthly costs are typically well under $50.
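As a rough sanity check on that figure, here is the GPU-time arithmetic for an illustrative workload of 300 requests per day at about 4 seconds of GPU-active time each (assumed numbers, not measurements):

```shell
# Illustrative GPU cost estimate: requests/day x seconds/request x hourly rate
awk 'BEGIN {
  reqs_per_day  = 300    # assumed traffic
  secs_per_req  = 4      # assumed GPU-active time per request
  rate_per_hour = 0.80   # approximate L4 rate, us-central1
  gpu_hours_day = reqs_per_day * secs_per_req / 3600
  printf "GPU cost/month: $%.2f\n", gpu_hours_day * rate_per_hour * 30
}'
```

At that volume the GPU line item comes to roughly $8 per month, consistent with the well-under-$50 estimate above; CPU, memory, and request charges add a smaller amount on top.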