# Deploy Gemma 4 with Gemini API
Access Gemma 4 models through Google's Gemini API. No infrastructure needed — get an API key and start using Gemma 4 instantly.
## What is the Gemini API?
The Gemini API is Google's managed inference service for its AI models, including Gemma 4. Unlike self-hosted options such as vLLM or Ollama, the Gemini API requires zero infrastructure — you obtain an API key, make HTTP requests, and Google handles all compute, scaling, and model serving behind the scenes.
This makes it the fastest path to production for teams that want to use Gemma 4 without managing GPU instances, containers, or server uptime.
## Available Gemma 4 Model IDs
The Gemini API exposes Gemma 4 under the following model IDs:
| Model ID | Parameters | Context |
|---|---|---|
| `gemma-4-4b-it` | 4 B (E4B) | 8 K |
| `gemma-4-31b-it` | 31 B | 32 K |
Use `gemma-4-4b-it` for latency-sensitive applications and `gemma-4-31b-it` when you need stronger reasoning or longer context handling.
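That guidance can be sketched as a small helper. The model IDs come from the table above; the selection logic itself is an illustrative heuristic, not official guidance:

```python
def choose_gemma_model(needs_long_context: bool, latency_sensitive: bool) -> str:
    """Pick a Gemma 4 model ID based on workload needs (illustrative heuristic)."""
    # The 31B model offers longer context (32K) and stronger reasoning;
    # the 4B model is the better fit when latency matters most.
    if needs_long_context or not latency_sensitive:
        return "gemma-4-31b-it"
    return "gemma-4-4b-it"

print(choose_gemma_model(needs_long_context=False, latency_sensitive=True))  # gemma-4-4b-it
print(choose_gemma_model(needs_long_context=True, latency_sensitive=True))   # gemma-4-31b-it
```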
## Get an API Key
1. Open Google AI Studio.
2. Sign in with your Google account.
3. Click **Get API key** in the left sidebar.
4. Create a new key or copy an existing one.
Store the key as an environment variable — never hard-code it in source files:
```shell
export GEMINI_API_KEY=your_key_here
```

## Python Example
Install the Google Generative AI SDK:
```shell
pip install google-generativeai
```

Then call Gemma 4 with a few lines of Python:
```python
import os

import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemma-4-4b-it")
response = model.generate_content("Explain Gemma 4 thinking mode")
print(response.text)
```

For a multi-turn conversation, start a chat session; the SDK tracks the message history for you:
```python
chat = model.start_chat(history=[])
response = chat.send_message("What are the Gemma 4 model variants?")
print(response.text)

follow_up = chat.send_message("Which one should I use for coding tasks?")
print(follow_up.text)
```

## REST Example
If you prefer to call the API directly without the SDK, use this curl command:
```shell
curl -X POST \
  "https://generativelanguage.googleapis.com/v1beta/models/gemma-4-4b-it:generateContent?key=YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"contents": [{"parts": [{"text": "What is Gemma 4?"}]}]}'
```

The response is a JSON object. Extract the generated text from `candidates[0].content.parts[0].text`.
## Streaming Responses
For real-time streaming output, use `generate_content_async` with `stream=True`:
```python
import asyncio
import os

import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemma-4-4b-it")

async def stream_response():
    # Each chunk arrives as soon as the model generates it.
    async for chunk in await model.generate_content_async(
        "Write a detailed explanation of attention mechanisms",
        stream=True,
    ):
        print(chunk.text, end="", flush=True)

asyncio.run(stream_response())
```

## Configuration Parameters
Control generation behaviour with a `GenerationConfig` object:
```python
import os

import google.generativeai as genai
from google.generativeai.types import GenerationConfig

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemma-4-4b-it")

config = GenerationConfig(
    temperature=0.7,
    top_p=0.9,
    max_output_tokens=1024,
)
response = model.generate_content(
    "Summarise the key features of Gemma 4",
    generation_config=config,
)
print(response.text)
```

## Rate Limits and Pricing
The Gemini API has a free tier with generous rate limits for development and low-traffic applications. As of 2025, the free tier allows 15 requests per minute (RPM) and 1 million tokens per day for Gemma 4 models.
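To stay under the per-minute limit, a minimal client-side guard can be used before each request. This is a sketch of a sliding-window rate limiter, not part of the SDK; the default of 15 requests per minute matches the free-tier figure above, which may change:

```python
import time
from collections import deque

class RateLimiter:
    """Block until a request slot is free within a sliding time window."""

    def __init__(self, max_requests: int = 15, window_seconds: float = 60.0):
        self.max_requests = max_requests
        self.window = window_seconds
        self.timestamps: deque = deque()  # monotonic times of recent requests

    def acquire(self) -> None:
        now = time.monotonic()
        # Drop timestamps that have aged out of the window.
        while self.timestamps and now - self.timestamps[0] >= self.window:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.max_requests:
            # Sleep until the oldest request leaves the window.
            time.sleep(self.window - (now - self.timestamps[0]))
        self.timestamps.append(time.monotonic())

limiter = RateLimiter(max_requests=15)
# Call limiter.acquire() before each generate_content request.
```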
Paid usage (via Google AI Studio billing or Google Cloud) removes most rate limit constraints and scales with your usage. Check Google AI Studio pricing for the latest figures — Gemma 4 is significantly cheaper per token than closed-source alternatives of similar capability.
## When to Use the Gemini API
The Gemini API is the right choice when you:
- Need to prototype or ship quickly without managing infrastructure
- Have unpredictable traffic that benefits from auto-scaling
- Prefer pay-per-token billing over reserved GPU capacity
- Are already on Google Cloud and want a managed solution
For workloads with high and predictable throughput, a self-hosted vLLM or GKE deployment typically offers lower per-token cost at scale.
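The trade-off can be made concrete with a back-of-the-envelope comparison. Every number below is a hypothetical placeholder, not a published rate; substitute your actual prices and traffic:

```python
# Hypothetical inputs -- replace with your real prices and throughput.
api_price_per_million_tokens = 0.10    # USD per 1M tokens, managed API (placeholder)
gpu_hourly_cost = 2.50                 # USD per hour, one self-hosted GPU (placeholder)
tokens_per_month = 50_000_000_000      # steady monthly throughput

api_monthly = tokens_per_month / 1_000_000 * api_price_per_million_tokens
self_hosted_monthly = gpu_hourly_cost * 24 * 30  # one always-on instance

print(f"Managed API:  ${api_monthly:,.0f}/month")
print(f"Self-hosted:  ${self_hosted_monthly:,.0f}/month")
# At high, steady throughput the fixed GPU cost wins; at low or bursty
# traffic the pay-per-token API is cheaper.
```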