Deploy Gemma 4 on Vertex AI
Deploy Gemma 4 models on Google Cloud Vertex AI for enterprise inference. Covers Model Garden setup, custom prediction, and scaling.
What is Vertex AI?
Vertex AI is Google Cloud's unified machine learning platform. It provides managed infrastructure for training, fine-tuning, and serving ML models at enterprise scale. The Model Garden feature lists curated open models — including Gemma 4 — that can be deployed to a managed prediction endpoint with a few clicks or API calls.
Compared to the Gemini API, Vertex AI gives you more control: you choose the hardware (GPU type, region, replica count), you can bring custom serving containers, and you get VPC-level network isolation suitable for regulated industries.
Prerequisites
Before deploying Gemma 4 on Vertex AI you need:
- A Google Cloud project with billing enabled
- The Vertex AI API enabled (`aiplatform.googleapis.com`)
- The `gcloud` CLI installed and authenticated
- The `google-cloud-aiplatform` Python SDK
Enable the API with:

```shell
gcloud services enable aiplatform.googleapis.com \
  --project=YOUR_PROJECT_ID
```

Install the SDK:

```shell
pip install google-cloud-aiplatform
```

Find Gemma 4 in Model Garden
- Open the Google Cloud Console.
- Navigate to Vertex AI > Model Garden.
- Search for Gemma 4.
- Click the model card and select Deploy.
The console will guide you through choosing a region, a GPU machine type, and the number of replicas. For the E4B model, an n1-standard-4 machine with a single NVIDIA T4 GPU is sufficient. For the 31B model, choose an a2-highgpu-1g (A100 40 GB) or larger.
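For scripted deployments, the sizing guidance above can be captured in a small lookup. This is a sketch using the values from this guide; `hardware_for` is our own helper, not part of any Google SDK:

```python
# Machine shapes recommended in this guide, keyed by model variant.
SIZING = {
    "gemma-4-4b-it": {
        "machine_type": "n1-standard-4",
        "accelerator_type": "NVIDIA_TESLA_T4",
        "accelerator_count": 1,
    },
    "gemma-4-31b-it": {
        "machine_type": "a2-highgpu-1g",
        "accelerator_type": "NVIDIA_TESLA_A100",
        "accelerator_count": 1,
    },
}

def hardware_for(model_id: str) -> dict:
    """Return the recommended machine shape for a Gemma 4 variant."""
    try:
        return SIZING[model_id]
    except KeyError:
        raise ValueError(f"no sizing recommendation for {model_id!r}")
```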
Deploy via Python SDK
For programmatic or CI/CD deployments, use the Vertex AI Python SDK to upload and deploy the model:
```python
from google.cloud import aiplatform

aiplatform.init(project="your-project", location="us-central1")

# Deploy Gemma 4 from Model Garden
model = aiplatform.Model(
    model_name="publishers/google/models/gemma4@gemma-4-4b-it",
)

endpoint = model.deploy(
    machine_type="n1-standard-4",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
    min_replica_count=1,
    max_replica_count=3,
)

print(f"Endpoint resource name: {endpoint.resource_name}")
```

The `max_replica_count=3` setting enables autoscaling: Vertex AI adds replicas as traffic grows and removes them when it drops, so you only pay for what you use.
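In CI/CD pipelines it can help to validate the replica bounds before any cloud call is made. The sketch below builds the keyword arguments for `Model.deploy()`; the `deploy_config` helper is ours, not part of the SDK:

```python
def deploy_config(machine_type: str, accelerator_type: str,
                  accelerator_count: int = 1,
                  min_replicas: int = 1, max_replicas: int = 3) -> dict:
    """Build and sanity-check keyword arguments for Model.deploy()."""
    if not 1 <= min_replicas <= max_replicas:
        raise ValueError("need 1 <= min_replicas <= max_replicas")
    return {
        "machine_type": machine_type,
        "accelerator_type": accelerator_type,
        "accelerator_count": accelerator_count,
        "min_replica_count": min_replicas,
        "max_replica_count": max_replicas,
    }
```

Used as `endpoint = model.deploy(**deploy_config("n1-standard-4", "NVIDIA_TESLA_T4"))`, the misconfiguration surfaces as an immediate `ValueError` rather than a failed deployment minutes later.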
Query the Deployed Endpoint
Once the endpoint is deployed (typically 5–10 minutes), query it via the SDK:
```python
from google.cloud import aiplatform

aiplatform.init(project="your-project", location="us-central1")

endpoint = aiplatform.Endpoint(
    "projects/YOUR_PROJECT/locations/us-central1/endpoints/YOUR_ENDPOINT_ID"
)

response = endpoint.predict(
    instances=[{"inputs": "What is Gemma 4?"}]
)
print(response.predictions[0])
```

Or with a chat-style prompt:
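Chat prompts use Gemma's turn markup (`<start_of_turn>` / `<end_of_turn>`). A small helper keeps the markup out of every call site; `gemma_chat_prompt` is an illustrative name, not an SDK function:

```python
def gemma_chat_prompt(user_message: str) -> str:
    """Wrap a user message in Gemma's turn markup, leaving the model
    turn open so the model generates the reply."""
    return (
        f"<start_of_turn>user\n{user_message}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )
```

Pass the result as the `inputs` field of an instance, e.g. `instances=[{"inputs": gemma_chat_prompt("Explain transformers briefly")}]`.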
```python
response = endpoint.predict(
    instances=[{
        "inputs": "<start_of_turn>user\nExplain transformers briefly<end_of_turn>\n<start_of_turn>model\n"
    }],
    parameters={
        "max_new_tokens": 512,
        "temperature": 0.7,
    },
)
print(response.predictions[0])
```

Autoscaling Configuration
Vertex AI autoscaling is configured at deploy time via `min_replica_count` and `max_replica_count`. For production workloads you will also want to set a traffic split if you are running multiple model versions on one endpoint:
```python
endpoint.deploy(
    model=model,
    machine_type="a2-highgpu-1g",
    accelerator_type="NVIDIA_TESLA_A100",
    accelerator_count=1,
    min_replica_count=1,
    max_replica_count=5,
    traffic_percentage=100,
)
```

Vertex AI uses CPU utilization and request queue depth as autoscaling signals. For GPU workloads, requests queue until a replica is free; additional replicas spin up within 2–3 minutes when the queue remains non-empty.
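The queue-driven behavior described above can be illustrated with a toy rule. This is a conceptual sketch, not Vertex AI's actual autoscaling algorithm: add a replica while requests queue, shed one when the queue drains, and stay within the configured bounds:

```python
def scale_decision(queue_depth: int, replicas: int,
                   min_replicas: int, max_replicas: int) -> int:
    """Toy autoscaler step: grow while the queue is non-empty,
    shrink when it is empty, clamped to [min_replicas, max_replicas]."""
    if queue_depth > 0 and replicas < max_replicas:
        return replicas + 1
    if queue_depth == 0 and replicas > min_replicas:
        return replicas - 1
    return replicas
```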
GPU Options and Sizing
| Gemma 4 Model | Recommended GPU | Machine Type |
|---|---|---|
| E4B (gemma-4-4b-it) | NVIDIA T4 | n1-standard-4 |
| 31B (gemma-4-31b-it) | NVIDIA A100 40 GB | a2-highgpu-1g |
| 31B (high throughput) | 2× A100 80 GB | a2-ultragpu-2g |
Choose a region where the required GPU quota is available. us-central1 has the broadest availability.
Pricing
Vertex AI charges for:
- Compute: per-hour rate for the machine and GPU type while the endpoint has at least one replica running
- Prediction requests: a small per-1,000-request fee
Scale `min_replica_count` to zero if the endpoint does not need to serve traffic 24/7. Note that cold starts with a zero-replica configuration take 3–5 minutes for large models.
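The billing model above is simple arithmetic. The sketch below takes caller-supplied rates because actual prices vary by region and change over time; consult the Vertex AI pricing page for current figures:

```python
def monthly_cost(hourly_machine_rate: float, avg_replicas: float,
                 requests: int, per_1k_request_fee: float) -> float:
    """Estimate monthly endpoint cost: replica-hours of compute plus
    a per-1,000-request prediction fee. All rates are placeholders."""
    HOURS_PER_MONTH = 730  # average hours in a month
    compute = hourly_machine_rate * avg_replicas * HOURS_PER_MONTH
    prediction = (requests / 1000) * per_1k_request_fee
    return round(compute + prediction, 2)
```

With a hypothetical $2.00/hour machine running two replicas around the clock, compute dominates: prediction fees only matter at very high request volumes.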
When to Use Vertex AI
Vertex AI is the best choice when you:
- Need enterprise SLAs, VPC isolation, and audit logging
- Want tight integration with other Google Cloud services (BigQuery, Cloud Storage, Pub/Sub)
- Need to run custom inference containers or multi-model ensembles
- Have compliance requirements that prevent using the shared Gemini API infrastructure