How to Run Gemma 4 with Ollama
Step-by-step guide to running all Gemma 4 models locally with Ollama. Covers Gemma 4 E4B, E2B, and 27B setup on Mac, Linux, and Windows.
Ollama is the easiest way to run Gemma 4 locally. With a single command, Ollama downloads a quantized build of the Gemma 4 model and starts a local API server that speaks the OpenAI-compatible chat completions format.
Prerequisites
- macOS: Apple Silicon (M1/M2/M3/M4) or Intel Mac with 8GB+ RAM
- Linux: Ubuntu 20.04+ with NVIDIA GPU (8GB+ VRAM) or CPU-only mode
- Windows: WSL2 with Ubuntu, or native Windows with NVIDIA GPU
Install Ollama from ollama.ai. The installer handles everything automatically.
Running Gemma 4 E4B with Ollama
Gemma 4 E4B is the recommended starting point for Ollama users. It runs on 8GB of VRAM and starts responding in under 2 seconds on modern hardware.
```
ollama run gemma4:4b
```

Ollama automatically downloads a quantized build of Gemma 4 E4B. The first run takes 2-5 minutes depending on your connection speed.
Running Gemma 4 E2B with Ollama
For systems with less than 8GB of VRAM or unified memory, use Gemma 4 E2B:
```
ollama run gemma4:2b
```

Gemma 4 E2B requires only 3GB of VRAM in Q4 quantization and can run entirely on CPU if needed.
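The 3GB figure follows from a quick back-of-envelope calculation. A minimal sketch, where the ~4.5 bits-per-weight value is an assumed approximation for Q4 formats once quantization scale/zero-point overhead is included:

```python
def q4_weight_gib(params_billion: float, bits_per_weight: float = 4.5) -> float:
    # Assumed approximation: Q4 stores ~4.5 bits per weight,
    # including quantization overhead.
    return params_billion * 1e9 * bits_per_weight / 8 / 2**30

# ~2B parameters at Q4 comes to roughly 1 GiB of weights; the KV cache
# and runtime buffers account for the rest of the ~3GB requirement.
print(round(q4_weight_gib(2), 2))
```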
Running Gemma 4 with Thinking Mode
Gemma 4 E4B and 27B support Thinking Mode, which enables extended reasoning for complex problems:
```
ollama run gemma4:4b-thinking
```

In Thinking Mode, Gemma 4 produces internal reasoning traces (wrapped in <think> tags) before giving its final answer.
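If you consume Thinking Mode output programmatically, you will usually want to separate the reasoning trace from the final answer. A minimal sketch, assuming the <think> tag format described above (split_thinking is a hypothetical helper, not part of the Ollama API):

```python
import re

def split_thinking(text: str) -> tuple[str, str]:
    # Hypothetical helper: collect the <think>...</think> reasoning traces,
    # then strip them out to leave only the final answer.
    thoughts = "\n".join(re.findall(r"<think>(.*?)</think>", text, re.DOTALL)).strip()
    answer = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
    return thoughts, answer

thoughts, answer = split_thinking("<think>2+2=4</think>The answer is 4.")
print(answer)
```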
Using the Ollama API
Once Gemma 4 is running, you can query it via the OpenAI-compatible API:
```
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma4:4b",
    "messages": [{"role": "user", "content": "Explain Gemma 4 thinking mode"}]
  }'
```

Ollama + Python
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
response = client.chat.completions.create(
    model="gemma4:4b",
    messages=[{"role": "user", "content": "Hello from Gemma 4!"}],
)
print(response.choices[0].message.content)
```

Available Gemma 4 Tags on Ollama
| Tag | Model | VRAM | Notes |
|---|---|---|---|
| gemma4:2b | Gemma 4 E2B | 3 GB | Text only |
| gemma4:4b | Gemma 4 E4B | 5 GB | Multimodal |
| gemma4:4b-thinking | Gemma 4 E4B Thinking | 5 GB | Extended reasoning |
| gemma4:27b | Gemma 4 27B Q4 | 18 GB | Best quality |
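When scripting setup across machines with different GPUs, the table above can drive model selection. A minimal sketch using the approximate VRAM figures listed there (pick_tag is a hypothetical helper, not part of Ollama):

```python
# Approximate VRAM requirements from the tag table above (GB).
TAG_VRAM_GB = {
    "gemma4:2b": 3,
    "gemma4:4b": 5,
    "gemma4:4b-thinking": 5,
    "gemma4:27b": 18,
}

def pick_tag(available_vram_gb: float) -> str:
    # Hypothetical helper: choose the largest tag that fits in VRAM;
    # fall back to the smallest model, which can run on CPU.
    candidates = [t for t, v in TAG_VRAM_GB.items() if v <= available_vram_gb]
    if not candidates:
        return "gemma4:2b"
    return max(candidates, key=TAG_VRAM_GB.get)

print(pick_tag(8))
```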
Troubleshooting
Model won't start: Check available VRAM with nvidia-smi (or Activity Monitor on macOS). Try a smaller model, or pull a more aggressively quantized variant of the tag (for example, a q4_0 build) if one is available.
Slow responses: Ensure your GPU is being used. Run ollama run gemma4:4b and check if GPU utilization spikes in nvidia-smi or Activity Monitor.
Port already in use: Ollama runs on port 11434 by default. Check with lsof -i :11434.
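If lsof is unavailable, the same check can be done from Python. A short sketch assuming Ollama's default port of 11434 (port_in_use is a hypothetical helper):

```python
import socket

def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    # A successful TCP connect means something (such as a running
    # Ollama server) is already listening on the port.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(0.5)
        return s.connect_ex((host, port)) == 0

print(port_in_use(11434))  # True if a server is listening locally
```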