
How to Run Gemma 4 with Ollama

Step-by-step guide to running all Gemma 4 models locally with Ollama. Covers Gemma 4 E4B, E2B, and 26B setup on Mac, Linux, and Windows.


Ollama is the easiest way to run Gemma 4 locally. With a single command, Ollama downloads the Gemma 4 model, handles quantization, and starts a local API server compatible with the OpenAI API format.

Prerequisites

  • macOS: Apple Silicon (M1/M2/M3/M4) or Intel Mac with 8GB+ RAM
  • Linux: Ubuntu 20.04+ with NVIDIA GPU (8GB+ VRAM) or CPU-only mode
  • Windows: WSL2 with Ubuntu, or native Windows with NVIDIA GPU

Install Ollama from ollama.ai. The installer handles everything automatically.
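Once the installer finishes, you can sanity-check that the CLI landed on your PATH before pulling any models. A minimal sketch in Python (the `ollama_installed` helper is illustrative, not part of Ollama; running `ollama --version` in a terminal works just as well):

```python
import shutil

def ollama_installed():
    """Return True if the `ollama` binary is discoverable on PATH."""
    return shutil.which("ollama") is not None

print(ollama_installed())
```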

Running Gemma 4 E4B with Ollama

Gemma 4 E4B is the recommended starting point for Ollama users. It runs on 8GB of VRAM and starts responding in under 2 seconds on modern hardware.

ollama run gemma4:4b

Ollama will automatically download a pre-quantized build of the Gemma 4 E4B model. The first run takes 2-5 minutes depending on your connection speed.

Running Gemma 4 E2B with Ollama

For systems with less than 8GB of VRAM or unified memory, use Gemma 4 E2B:

ollama run gemma4:2b

Gemma 4 E2B requires only 3GB of VRAM at Q4 quantization and can run entirely on CPU if needed.

Running Gemma 4 with Thinking Mode

Gemma 4 E4B and 31B support Thinking Mode, which enables extended reasoning for complex problems:

ollama run gemma4:4b-thinking

In Thinking Mode, Gemma 4 produces internal reasoning traces (marked with <think> tags) before giving its final answer.
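If you consume Thinking Mode output programmatically, you will usually want to separate the reasoning trace from the final answer. A sketch in Python, assuming a single <think>...</think> block as described above (`split_thinking` is a hypothetical helper, not an Ollama API):

```python
import re

def split_thinking(text):
    """Split a Thinking Mode response into (reasoning_trace, final_answer).

    Assumes the reasoning is wrapped in one <think>...</think> block;
    responses without the tags pass through unchanged.
    """
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if not match:
        return "", text.strip()
    trace = match.group(1).strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return trace, answer

trace, answer = split_thinking("<think>2+2 is 4</think>The answer is 4.")
print(answer)  # -> The answer is 4.
```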

Using the Ollama API

Once Gemma 4 is running, you can query it via the OpenAI-compatible API:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma4:4b",
    "messages": [{"role": "user", "content": "Explain Gemma 4 thinking mode"}]
  }'
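The same request works from Python's standard library alone, with no extra packages. A sketch that assumes the server above is running on the default port and degrades gracefully if it is not (the `chat` helper is illustrative):

```python
import json
import urllib.error
import urllib.request

def chat(prompt, model="gemma4:4b", base_url="http://localhost:11434"):
    """Send one chat turn to Ollama's OpenAI-compatible endpoint.

    Returns the reply text, or None if the server is not reachable.
    """
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(request, timeout=120) as response:
            payload = json.load(response)
        return payload["choices"][0]["message"]["content"]
    except (urllib.error.URLError, OSError):
        return None  # Ollama is not running or the port is blocked

reply = chat("Explain Gemma 4 thinking mode")
print(reply if reply is not None else "Ollama is not reachable on localhost:11434")
```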

Ollama + Python

from openai import OpenAI

# Point the client at the local Ollama server. The api_key argument is
# required by the client but ignored by Ollama, so any placeholder works.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
response = client.chat.completions.create(
    model="gemma4:4b",
    messages=[{"role": "user", "content": "Hello from Gemma 4!"}]
)
print(response.choices[0].message.content)

Available Gemma 4 Tags on Ollama

Tag                  Model                  VRAM   Notes
gemma4:2b            Gemma 4 E2B            3 GB   Text only
gemma4:4b            Gemma 4 E4B            5 GB   Multimodal
gemma4:4b-thinking   Gemma 4 E4B Thinking   5 GB   Extended reasoning
gemma4:27b           Gemma 4 31B Q4         18 GB  Best quality
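The VRAM column above can drive a simple model picker. A sketch with the requirements hard-coded from the table (`largest_fitting_tag` is an illustrative helper, not an Ollama feature):

```python
# VRAM needs per tag, taken from the table above (GB, Q4 quantization).
GEMMA4_TAGS = {
    "gemma4:2b": 3,
    "gemma4:4b": 5,
    "gemma4:4b-thinking": 5,
    "gemma4:27b": 18,
}

def largest_fitting_tag(vram_gb):
    """Return the most demanding Gemma 4 tag that fits in vram_gb, or None."""
    fitting = [(need, tag) for tag, need in GEMMA4_TAGS.items() if need <= vram_gb]
    return max(fitting)[1] if fitting else None

print(largest_fitting_tag(4))   # -> gemma4:2b
print(largest_fitting_tag(24))  # -> gemma4:27b
```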

Troubleshooting

Model won't start: Check available VRAM with nvidia-smi (or Activity Monitor on macOS). Try a smaller model, or pull a more aggressively quantized variant of the tag.

Slow responses: Ensure your GPU is being used. Run ollama run gemma4:4b and check if GPU utilization spikes in nvidia-smi or Activity Monitor.

Port already in use: Ollama runs on port 11434 by default. Check with lsof -i :11434.
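The same port check can be done from Python before launching Ollama (the `ollama_port_open` helper is illustrative; `lsof -i :11434` gives the same answer from the shell):

```python
import socket

def ollama_port_open(host="localhost", port=11434):
    """Return True if something is already listening on Ollama's default port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        return sock.connect_ex((host, port)) == 0

print(ollama_port_open())
```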

Run, deploy, and debug Gemma 4 models. Built for fast-moving developers.
© 2026 gemma4.dev All Rights Reserved.