Deploy Gemma 4 on Vertex AI
Deploy Gemma 4 models on Google Cloud Vertex AI for enterprise inference. Covers Model Garden setup, custom prediction, and scaling.
What is Vertex AI?
Vertex AI is Google Cloud's unified machine learning platform. It provides managed infrastructure for training, fine-tuning, and serving ML models at enterprise scale. The Model Garden feature lists curated open models — including Gemma 4 — that can be deployed to a managed prediction endpoint with a few clicks or API calls.
Compared to the Gemini API, Vertex AI gives you more control: you choose the hardware (GPU type, region, replica count), you can bring custom serving containers, and you get VPC-level network isolation suitable for regulated industries.
Prerequisites
Before deploying Gemma 4 on Vertex AI you need:
- A Google Cloud project with billing enabled
- The Vertex AI API enabled (`aiplatform.googleapis.com`)
- The `gcloud` CLI installed and authenticated
- The `google-cloud-aiplatform` Python SDK
Enable the API with:

```shell
gcloud services enable aiplatform.googleapis.com \
  --project=YOUR_PROJECT_ID
```

Install the SDK:

```shell
pip install google-cloud-aiplatform
```

Find Gemma 4 in Model Garden
- Open the Google Cloud Console.
- Navigate to Vertex AI > Model Garden.
- Search for Gemma 4.
- Click the model card and select Deploy.
The console will guide you through choosing a region, a GPU machine type, and the number of replicas. For the E4B model, an n1-standard-4 machine with a single NVIDIA T4 GPU is sufficient. For the 31B model, choose an a2-highgpu-1g (A100 40 GB) or larger.
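For scripted deployments, the sizing guidance above can be captured in a small lookup. This is a sketch using the values from this guide; `hardware_for` is our own helper, not part of any Google SDK:

```python
# Machine shapes recommended in this guide, keyed by model variant.
SIZING = {
    "gemma-4-4b-it": {
        "machine_type": "n1-standard-4",
        "accelerator_type": "NVIDIA_TESLA_T4",
        "accelerator_count": 1,
    },
    "gemma-4-31b-it": {
        "machine_type": "a2-highgpu-1g",
        "accelerator_type": "NVIDIA_TESLA_A100",
        "accelerator_count": 1,
    },
}

def hardware_for(model_id: str) -> dict:
    """Return the recommended machine shape for a Gemma 4 variant."""
    try:
        return SIZING[model_id]
    except KeyError:
        raise ValueError(f"no sizing recommendation for {model_id!r}")
```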
Deploy via Python SDK
For programmatic or CI/CD deployments, use the Vertex AI Python SDK to upload and deploy the model:
```python
from google.cloud import aiplatform

aiplatform.init(project="your-project", location="us-central1")

# Deploy Gemma 4 from Model Garden
model = aiplatform.Model(
    model_name="publishers/google/models/gemma4@gemma-4-4b-it",
)

endpoint = model.deploy(
    machine_type="n1-standard-4",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
    min_replica_count=1,
    max_replica_count=3,
)

print(f"Endpoint resource name: {endpoint.resource_name}")
```

The `max_replica_count=3` setting enables autoscaling: Vertex AI adds replicas as traffic grows and removes them when it drops, so you only pay for what you use.
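In CI/CD pipelines it can help to validate the replica bounds before any cloud call is made. The sketch below builds the keyword arguments for `Model.deploy()`; the `deploy_config` helper is ours, not part of the SDK:

```python
def deploy_config(machine_type: str, accelerator_type: str,
                  accelerator_count: int = 1,
                  min_replicas: int = 1, max_replicas: int = 3) -> dict:
    """Build and sanity-check keyword arguments for Model.deploy()."""
    if not 1 <= min_replicas <= max_replicas:
        raise ValueError("need 1 <= min_replicas <= max_replicas")
    return {
        "machine_type": machine_type,
        "accelerator_type": accelerator_type,
        "accelerator_count": accelerator_count,
        "min_replica_count": min_replicas,
        "max_replica_count": max_replicas,
    }
```

Used as `endpoint = model.deploy(**deploy_config("n1-standard-4", "NVIDIA_TESLA_T4"))`, the misconfiguration surfaces as an immediate `ValueError` rather than a failed deployment minutes later.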
Query the Deployed Endpoint
Once the endpoint is deployed (typically 5–10 minutes), query it via the SDK:
```python
from google.cloud import aiplatform

aiplatform.init(project="your-project", location="us-central1")

endpoint = aiplatform.Endpoint(
    "projects/YOUR_PROJECT/locations/us-central1/endpoints/YOUR_ENDPOINT_ID"
)

response = endpoint.predict(
    instances=[{"inputs": "What is Gemma 4?"}]
)
print(response.predictions[0])
```

Or with a chat-style prompt:
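Chat prompts use Gemma's turn markup (`<start_of_turn>` / `<end_of_turn>`). A small helper keeps the markup out of every call site; `gemma_chat_prompt` is an illustrative name, not an SDK function:

```python
def gemma_chat_prompt(user_message: str) -> str:
    """Wrap a user message in Gemma's turn markup, leaving the model
    turn open so the model generates the reply."""
    return (
        f"<start_of_turn>user\n{user_message}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )
```

Pass the result as the `inputs` field of an instance, e.g. `instances=[{"inputs": gemma_chat_prompt("Explain transformers briefly")}]`.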
```python
response = endpoint.predict(
    instances=[{
        "inputs": "<start_of_turn>user\nExplain transformers briefly<end_of_turn>\n<start_of_turn>model\n"
    }],
    parameters={
        "max_new_tokens": 512,
        "temperature": 0.7,
    },
)
print(response.predictions[0])
```

Autoscaling Configuration
Vertex AI autoscaling is configured at deploy time via `min_replica_count` and `max_replica_count`. For production workloads you will also want to set a traffic split if you are running multiple model versions on one endpoint:
```python
endpoint.deploy(
    model=model,
    machine_type="a2-highgpu-1g",
    accelerator_type="NVIDIA_TESLA_A100",
    accelerator_count=1,
    min_replica_count=1,
    max_replica_count=5,
    traffic_percentage=100,
)
```

Vertex AI uses CPU utilization and request queue depth as autoscaling signals. For GPU workloads, requests queue until a replica is free; additional replicas spin up within 2–3 minutes when the queue remains non-empty.
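The queue-driven behavior described above can be illustrated with a toy rule. This is a conceptual sketch, not Vertex AI's actual autoscaling algorithm: add a replica while requests queue, shed one when the queue drains, and stay within the configured bounds:

```python
def scale_decision(queue_depth: int, replicas: int,
                   min_replicas: int, max_replicas: int) -> int:
    """Toy autoscaler step: grow while the queue is non-empty,
    shrink when it is empty, clamped to [min_replicas, max_replicas]."""
    if queue_depth > 0 and replicas < max_replicas:
        return replicas + 1
    if queue_depth == 0 and replicas > min_replicas:
        return replicas - 1
    return replicas
```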
GPU Options and Sizing
| Gemma 4 Model | Recommended GPU | Machine Type |
|---|---|---|
| E4B (gemma-4-4b-it) | NVIDIA T4 | n1-standard-4 |
| 31B (gemma-4-31b-it) | NVIDIA A100 40 GB | a2-highgpu-1g |
| 31B (high throughput) | 2× A100 80 GB | a2-ultragpu-2g |
Choose a region where the required GPU quota is available. us-central1 has the broadest availability.
Pricing
Vertex AI charges for:
- Compute: per-hour rate for the machine and GPU type while the endpoint has at least one replica running
- Prediction requests: a small per-1,000-request fee
Scale `min_replica_count` to zero if the endpoint does not need to serve traffic 24/7. Note that cold starts with a zero-replica configuration take 3–5 minutes for large models.
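The billing model above is simple arithmetic. The sketch below takes caller-supplied rates because actual prices vary by region and change over time; consult the Vertex AI pricing page for current figures:

```python
def monthly_cost(hourly_machine_rate: float, avg_replicas: float,
                 requests: int, per_1k_request_fee: float) -> float:
    """Estimate monthly endpoint cost: replica-hours of compute plus
    a per-1,000-request prediction fee. All rates are placeholders."""
    HOURS_PER_MONTH = 730  # average hours in a month
    compute = hourly_machine_rate * avg_replicas * HOURS_PER_MONTH
    prediction = (requests / 1000) * per_1k_request_fee
    return round(compute + prediction, 2)
```

With a hypothetical $2.00/hour machine running two replicas around the clock, compute dominates: prediction fees only matter at very high request volumes.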
When to Use Vertex AI
Vertex AI is the best choice when you:
- Need enterprise SLAs, VPC isolation, and audit logging
- Want tight integration with other Google Cloud services (BigQuery, Cloud Storage, Pub/Sub)
- Need to run custom inference containers or multi-model ensembles
- Have compliance requirements that prevent using the shared Gemini API infrastructure