Deploy Gemma 4 on Google Kubernetes Engine (GKE)
Run Gemma 4 inference at scale on GKE with GPU node pools: a Kubernetes deployment with vLLM, autoscaling, and resource limits.
Why GKE for Gemma 4?
Google Kubernetes Engine (GKE) is the right platform when you need Gemma 4 at scale with full control over hardware, networking, and scheduling. Unlike Cloud Run (which is serverless) or Vertex AI (which abstracts the cluster), GKE gives you direct access to the Kubernetes control plane — enabling advanced configurations like GPU node pools, custom schedulers, multi-model serving, and fine-grained resource quotas.
This guide deploys Gemma 4 using vLLM — a high-throughput inference engine — behind a Kubernetes Service, with a HorizontalPodAutoscaler for traffic-driven scaling.
Prerequisites
- A GKE cluster with an NVIDIA GPU node pool (L4 or A100 recommended)
- kubectl configured to point at the cluster
- gcloud CLI installed and authenticated
- A Hugging Face account with a token that has access to google/gemma-4-4b-it
Create a GPU Node Pool
If your cluster does not yet have GPU nodes, add a node pool:
```shell
gcloud container node-pools create gpu-pool \
  --cluster=YOUR_CLUSTER_NAME \
  --zone=us-central1-a \
  --machine-type=g2-standard-8 \
  --accelerator=type=nvidia-l4,count=1 \
  --num-nodes=1 \
  --enable-autoscaling \
  --min-nodes=0 \
  --max-nodes=4
```

The g2-standard-8 machine type includes one NVIDIA L4 GPU (24 GB VRAM), which is sufficient for the Gemma 4 4B model. For the 31B model, use a2-highgpu-1g (A100 40 GB) instead. Note that pods cannot schedule onto GPU nodes until the NVIDIA driver is present: recent gcloud versions can install it for you via the gpu-driver-version sub-option of --accelerator; otherwise apply Google's driver-installer DaemonSet from the container-engine-accelerators repository.
Store the Hugging Face Token as a Secret
Gemma 4 is a gated model on Hugging Face. Store the token as a Kubernetes secret so it does not appear in your manifests:
```shell
kubectl create secret generic hf-secret \
  --from-literal=token=hf_YOUR_TOKEN_HERE
```

Kubernetes Deployment Manifest
Save the following as gemma4-deployment.yaml:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gemma4-vllm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gemma4-vllm
  template:
    metadata:
      labels:
        app: gemma4-vllm
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - "--model"
            - "google/gemma-4-4b-it"
            - "--max-model-len"
            - "8192"
          resources:
            limits:
              nvidia.com/gpu: "1"
          env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-secret
                  key: token
          ports:
            - containerPort: 8000
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 10
```

The readinessProbe ensures Kubernetes does not route traffic to a pod until vLLM has finished loading the model weights (~60 seconds on first start).
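If the cluster mixes GPU types, you can pin the pods to the L4 pool using GKE's standard accelerator node label. A minimal sketch to add under spec.template.spec, assuming the nvidia-l4 node pool created earlier:

```yaml
nodeSelector:
  cloud.google.com/gke-accelerator: nvidia-l4
```

GKE taints GPU nodes, but it injects the matching nvidia.com/gpu toleration automatically for pods that request the resource, so no explicit tolerations block should be needed.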
Service Manifest
Expose the deployment internally (or externally) with a Kubernetes Service. Save the following as gemma4-service.yaml:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: gemma4-vllm-svc
spec:
  selector:
    app: gemma4-vllm
  ports:
    - port: 8000
      targetPort: 8000
  type: ClusterIP
```

For external access, change type: ClusterIP to type: LoadBalancer and GKE will provision a Cloud Load Balancer with a public IP.
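If the endpoint should be reachable from elsewhere in your VPC but not from the public internet, GKE also supports internal load balancers through a service annotation. A sketch of the relevant fields (the annotation is GKE's documented one; combine with the manifest above):

```yaml
metadata:
  name: gemma4-vllm-svc
  annotations:
    networking.gke.io/load-balancer-type: "Internal"
spec:
  type: LoadBalancer
```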
Apply the Manifests
```shell
kubectl apply -f gemma4-deployment.yaml
kubectl apply -f gemma4-service.yaml

# Watch pods come up
kubectl get pods -l app=gemma4-vllm -w
```

Horizontal Pod Autoscaler
Add automatic scaling based on CPU utilisation (as a proxy for request load):
```shell
kubectl autoscale deployment gemma4-vllm \
  --cpu-percent=60 \
  --min=1 \
  --max=4
```

For more precise GPU-utilisation-based scaling, install the DCGM Exporter and configure the HPA to use custom metrics from Prometheus.
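The imperative command above can also be kept in version control as a declarative manifest using the autoscaling/v2 API. This sketch mirrors the same target and bounds:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: gemma4-vllm
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: gemma4-vllm
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
```

One caveat: CPU-utilisation targets only work if the container declares resources.requests.cpu, and the deployment above sets only a GPU limit. Add a CPU request to the vLLM container before relying on this HPA.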
Query the Deployed Service
From within the cluster (or via kubectl port-forward):
```shell
kubectl port-forward svc/gemma4-vllm-svc 8000:8000
```

Then in another terminal:
```shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-4-4b-it",
    "messages": [{"role": "user", "content": "Explain GKE briefly"}],
    "max_tokens": 256
  }'
```

Model Caching with Persistent Volumes
On every pod restart, vLLM re-downloads the model from Hugging Face, which takes several minutes and incurs egress costs. Mount a PersistentVolumeClaim to cache the weights:
```yaml
volumes:
  - name: hf-cache
    persistentVolumeClaim:
      claimName: hf-cache-pvc
containers:
  - name: vllm
    volumeMounts:
      - name: hf-cache
        mountPath: /root/.cache/huggingface
```

Create the PVC with at least 20 Gi for the 4B model or 70 Gi for the 31B model. Use the ReadWriteOnce access mode for single-node access, or ReadOnlyMany if multiple pods need to share the same cached weights.
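The volume mount above references a claim named hf-cache-pvc, which must exist before the pod can start. A minimal sketch, assuming GKE's default standard-rwo storage class (which provisions a zonal balanced Persistent Disk):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: hf-cache-pvc
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: standard-rwo
  resources:
    requests:
      storage: 20Gi
```

Bump the storage request to 70Gi if you plan to cache the 31B model's weights.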
GPU Node Pool Auto-provisioning
Enable GKE Node Auto-provisioning to let the cluster automatically create GPU node pools when pods with nvidia.com/gpu resource requests cannot be scheduled:
```shell
gcloud container clusters update YOUR_CLUSTER_NAME \
  --enable-autoprovisioning \
  --max-accelerator type=nvidia-l4,count=8 \
  --max-cpu 32 \
  --max-memory 128
```

This is ideal for batch inference workloads where GPU nodes should scale to zero when idle and scale up on demand.