Deploy Gemma 4 on Google Kubernetes Engine (GKE)
Run Gemma 4 inference at scale on GKE with GPU node pools: a Kubernetes deployment with vLLM, autoscaling, and resource limits.
Why GKE for Gemma 4?
Google Kubernetes Engine (GKE) is the right platform when you need Gemma 4 at scale with full control over hardware, networking, and scheduling. Unlike Cloud Run (which is serverless) or Vertex AI (which abstracts the cluster), GKE gives you direct access to the Kubernetes control plane — enabling advanced configurations like GPU node pools, custom schedulers, multi-model serving, and fine-grained resource quotas.
This guide deploys Gemma 4 using vLLM — a high-throughput inference engine — behind a Kubernetes Service, with a HorizontalPodAutoscaler for traffic-driven scaling.
Prerequisites
- A GKE cluster with an NVIDIA GPU node pool (L4 or A100 recommended)
- kubectl configured to point at the cluster
- gcloud CLI installed and authenticated
- A Hugging Face account with a token that has access to google/gemma-4-4b-it
Create a GPU Node Pool
If your cluster does not yet have GPU nodes, add a node pool:
```shell
gcloud container node-pools create gpu-pool \
  --cluster=YOUR_CLUSTER_NAME \
  --zone=us-central1-a \
  --machine-type=g2-standard-8 \
  --accelerator=type=nvidia-l4,count=1 \
  --num-nodes=1 \
  --enable-autoscaling \
  --min-nodes=0 \
  --max-nodes=4
```

The g2-standard-8 machine type includes one NVIDIA L4 GPU (24 GB VRAM), which is sufficient for the Gemma 4 4B model. For the 31B model, use a2-highgpu-1g (A100 40 GB) instead. Note that pods cannot schedule onto GPU nodes until the NVIDIA driver is present: recent gcloud versions can install it for you via the gpu-driver-version sub-option of --accelerator; otherwise apply Google's driver-installer DaemonSet from the container-engine-accelerators repository.
Store the Hugging Face Token as a Secret
Gemma 4 is a gated model on Hugging Face. Store the token as a Kubernetes secret so it does not appear in your manifests:
```shell
kubectl create secret generic hf-secret \
  --from-literal=token=hf_YOUR_TOKEN_HERE
```

Kubernetes Deployment Manifest
Save the following as gemma4-deployment.yaml:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gemma4-vllm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gemma4-vllm
  template:
    metadata:
      labels:
        app: gemma4-vllm
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - "--model"
            - "google/gemma-4-4b-it"
            - "--max-model-len"
            - "8192"
          resources:
            limits:
              nvidia.com/gpu: "1"
          env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-secret
                  key: token
          ports:
            - containerPort: 8000
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 10
```

The readinessProbe ensures Kubernetes does not route traffic to a pod until vLLM has finished loading the model weights (~60 seconds on first start).
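If the cluster mixes GPU types, you can pin the pods to the L4 pool using GKE's standard accelerator node label. A minimal sketch to add under spec.template.spec, assuming the nvidia-l4 node pool created earlier:

```yaml
nodeSelector:
  cloud.google.com/gke-accelerator: nvidia-l4
```

GKE taints GPU nodes, but it injects the matching nvidia.com/gpu toleration automatically for pods that request the resource, so no explicit tolerations block should be needed.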
Service Manifest
Expose the deployment internally (or externally) with a Kubernetes Service. Save the following as gemma4-service.yaml:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: gemma4-vllm-svc
spec:
  selector:
    app: gemma4-vllm
  ports:
    - port: 8000
      targetPort: 8000
  type: ClusterIP
```

For external access, change type: ClusterIP to type: LoadBalancer and GKE will provision a Cloud Load Balancer with a public IP.
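If the endpoint should be reachable from elsewhere in your VPC but not from the public internet, GKE also supports internal load balancers through a service annotation. A sketch of the relevant fields (the annotation is GKE's documented one; combine with the manifest above):

```yaml
metadata:
  name: gemma4-vllm-svc
  annotations:
    networking.gke.io/load-balancer-type: "Internal"
spec:
  type: LoadBalancer
```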
Apply the Manifests
```shell
kubectl apply -f gemma4-deployment.yaml
kubectl apply -f gemma4-service.yaml

# Watch pods come up
kubectl get pods -l app=gemma4-vllm -w
```

Horizontal Pod Autoscaler
Add automatic scaling based on CPU utilisation (as a proxy for request load):
```shell
kubectl autoscale deployment gemma4-vllm \
  --cpu-percent=60 \
  --min=1 \
  --max=4
```

For more precise GPU-utilisation-based scaling, install the DCGM Exporter and configure the HPA to use custom metrics from Prometheus.
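The imperative command above can also be kept in version control as a declarative manifest using the autoscaling/v2 API. This sketch mirrors the same target and bounds:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: gemma4-vllm
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: gemma4-vllm
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
```

One caveat: CPU-utilisation targets only work if the container declares resources.requests.cpu, and the deployment above sets only a GPU limit. Add a CPU request to the vLLM container before relying on this HPA.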
Query the Deployed Service
From within the cluster (or via kubectl port-forward):
```shell
kubectl port-forward svc/gemma4-vllm-svc 8000:8000
```

Then in another terminal:
```shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-4-4b-it",
    "messages": [{"role": "user", "content": "Explain GKE briefly"}],
    "max_tokens": 256
  }'
```

Model Caching with Persistent Volumes
On every pod restart, vLLM re-downloads the model from Hugging Face, which takes several minutes and incurs egress costs. Mount a PersistentVolumeClaim to cache the weights:
```yaml
volumes:
  - name: hf-cache
    persistentVolumeClaim:
      claimName: hf-cache-pvc
containers:
  - name: vllm
    volumeMounts:
      - name: hf-cache
        mountPath: /root/.cache/huggingface
```

Create the PVC with at least 20 Gi for the 4B model or 70 Gi for the 31B model. Use the ReadWriteOnce access mode for single-node access, or ReadOnlyMany if multiple pods need to share the same cached weights.
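The volume mount above references a claim named hf-cache-pvc, which must exist before the pod can start. A minimal sketch, assuming GKE's default standard-rwo storage class (which provisions a zonal balanced Persistent Disk):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: hf-cache-pvc
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: standard-rwo
  resources:
    requests:
      storage: 20Gi
```

Bump the storage request to 70Gi if you plan to cache the 31B model's weights.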
GPU Node Pool Auto-provisioning
Enable GKE Node Auto-provisioning to let the cluster automatically create GPU node pools when pods with nvidia.com/gpu resource requests cannot be scheduled:
```shell
gcloud container clusters update YOUR_CLUSTER_NAME \
  --enable-autoprovisioning \
  --max-accelerator type=nvidia-l4,count=8 \
  --max-cpu 32 \
  --max-memory 128
```

This is ideal for batch inference workloads where GPU nodes should scale to zero when idle and scale up on demand.