# Deploy Gemma 4 with Gemini API
Access Gemma 4 models through Google's Gemini API. No infrastructure needed — get an API key and start using Gemma 4 instantly.
## What is the Gemini API?
The Gemini API is Google's managed inference service for its AI models, including Gemma 4. Unlike self-hosted options such as vLLM or Ollama, the Gemini API requires zero infrastructure — you obtain an API key, make HTTP requests, and Google handles all compute, scaling, and model serving behind the scenes.
This makes it the fastest path to production for teams that want to use Gemma 4 without managing GPU instances, containers, or server uptime.
## Available Gemma 4 Model IDs
The Gemini API exposes Gemma 4 under the following model IDs:
| Model ID | Parameters | Context |
|---|---|---|
| `gemma-4-4b-it` | 4 B (E4B) | 8 K |
| `gemma-4-31b-it` | 31 B | 32 K |
Use `gemma-4-4b-it` for latency-sensitive applications and `gemma-4-31b-it` when you need stronger reasoning or longer context handling.
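That guidance can be sketched as a small helper. The model IDs come from the table above; the selection logic itself is an illustrative heuristic, not official guidance:

```python
def choose_gemma_model(needs_long_context: bool, latency_sensitive: bool) -> str:
    """Pick a Gemma 4 model ID based on workload needs (illustrative heuristic)."""
    # The 31B model offers longer context (32K) and stronger reasoning;
    # the 4B model is the better fit when latency matters most.
    if needs_long_context or not latency_sensitive:
        return "gemma-4-31b-it"
    return "gemma-4-4b-it"

print(choose_gemma_model(needs_long_context=False, latency_sensitive=True))  # gemma-4-4b-it
print(choose_gemma_model(needs_long_context=True, latency_sensitive=True))   # gemma-4-31b-it
```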
## Get an API Key
1. Open Google AI Studio.
2. Sign in with your Google account.
3. Click **Get API key** in the left sidebar.
4. Create a new key or copy an existing one.
Store the key as an environment variable — never hard-code it in source files:
```shell
export GEMINI_API_KEY=your_key_here
```

## Python Example
Install the Google Generative AI SDK:
```shell
pip install google-generativeai
```

Then call Gemma 4 with a few lines of Python:
```python
import os

import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemma-4-4b-it")
response = model.generate_content("Explain Gemma 4 thinking mode")
print(response.text)
```

For a multi-turn conversation, start a chat session; the SDK tracks the message history for you:
```python
chat = model.start_chat(history=[])
response = chat.send_message("What are the Gemma 4 model variants?")
print(response.text)

follow_up = chat.send_message("Which one should I use for coding tasks?")
print(follow_up.text)
```

## REST Example
If you prefer to call the API directly without the SDK, use this curl command:
```shell
curl -X POST \
  "https://generativelanguage.googleapis.com/v1beta/models/gemma-4-4b-it:generateContent?key=YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"contents": [{"parts": [{"text": "What is Gemma 4?"}]}]}'
```

The response is a JSON object. Extract the generated text from `candidates[0].content.parts[0].text`.
## Streaming Responses
For real-time streaming output, use `generate_content_async` with `stream=True`:
```python
import asyncio
import os

import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemma-4-4b-it")

async def stream_response():
    # Each chunk arrives as soon as the model generates it.
    async for chunk in await model.generate_content_async(
        "Write a detailed explanation of attention mechanisms",
        stream=True,
    ):
        print(chunk.text, end="", flush=True)

asyncio.run(stream_response())
```

## Configuration Parameters
Control generation behaviour with a `GenerationConfig` object:
```python
import os

import google.generativeai as genai
from google.generativeai.types import GenerationConfig

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemma-4-4b-it")

config = GenerationConfig(
    temperature=0.7,
    top_p=0.9,
    max_output_tokens=1024,
)
response = model.generate_content(
    "Summarise the key features of Gemma 4",
    generation_config=config,
)
print(response.text)
```

## Rate Limits and Pricing
The Gemini API has a free tier with generous rate limits for development and low-traffic applications. As of 2025, the free tier allows 15 requests per minute (RPM) and 1 million tokens per day for Gemma 4 models.
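To stay under the per-minute limit, a minimal client-side guard can be used before each request. This is a sketch of a sliding-window rate limiter, not part of the SDK; the default of 15 requests per minute matches the free-tier figure above, which may change:

```python
import time
from collections import deque

class RateLimiter:
    """Block until a request slot is free within a sliding time window."""

    def __init__(self, max_requests: int = 15, window_seconds: float = 60.0):
        self.max_requests = max_requests
        self.window = window_seconds
        self.timestamps: deque = deque()  # monotonic times of recent requests

    def acquire(self) -> None:
        now = time.monotonic()
        # Drop timestamps that have aged out of the window.
        while self.timestamps and now - self.timestamps[0] >= self.window:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.max_requests:
            # Sleep until the oldest request leaves the window.
            time.sleep(self.window - (now - self.timestamps[0]))
        self.timestamps.append(time.monotonic())

limiter = RateLimiter(max_requests=15)
# Call limiter.acquire() before each generate_content request.
```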
Paid usage (via Google AI Studio billing or Google Cloud) removes most rate limit constraints and scales with your usage. Check Google AI Studio pricing for the latest figures — Gemma 4 is significantly cheaper per token than closed-source alternatives of similar capability.
## When to Use the Gemini API
The Gemini API is the right choice when you:
- Need to prototype or ship quickly without managing infrastructure
- Have unpredictable traffic that benefits from auto-scaling
- Prefer pay-per-token billing over reserved GPU capacity
- Are already on Google Cloud and want a managed solution
For workloads with high and predictable throughput, a self-hosted vLLM or GKE deployment typically offers lower per-token cost at scale.
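The trade-off can be made concrete with a back-of-the-envelope comparison. Every number below is a hypothetical placeholder, not a published rate; substitute your actual prices and traffic:

```python
# Hypothetical inputs -- replace with your real prices and throughput.
api_price_per_million_tokens = 0.10    # USD per 1M tokens, managed API (placeholder)
gpu_hourly_cost = 2.50                 # USD per hour, one self-hosted GPU (placeholder)
tokens_per_month = 50_000_000_000      # steady monthly throughput

api_monthly = tokens_per_month / 1_000_000 * api_price_per_million_tokens
self_hosted_monthly = gpu_hourly_cost * 24 * 30  # one always-on instance

print(f"Managed API:  ${api_monthly:,.0f}/month")
print(f"Self-hosted:  ${self_hosted_monthly:,.0f}/month")
# At high, steady throughput the fixed GPU cost wins; at low or bursty
# traffic the pay-per-token API is cheaper.
```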