How to Run Gemma 4 with Ollama
Step-by-step guide to running all Gemma 4 models locally with Ollama. Covers Gemma 4 E4B, E2B, and 27B setup on Mac, Linux, and Windows.
Ollama is the easiest way to run Gemma 4 locally. With a single command, Ollama downloads a quantized build of the Gemma 4 model and starts a local API server that speaks the OpenAI-compatible chat completions format.
Prerequisites
- macOS: Apple Silicon (M1/M2/M3/M4) or Intel Mac with 8GB+ RAM
- Linux: Ubuntu 20.04+ with NVIDIA GPU (8GB+ VRAM) or CPU-only mode
- Windows: WSL2 with Ubuntu, or native Windows with NVIDIA GPU
Install Ollama from ollama.ai. The installer handles everything automatically.
Running Gemma 4 E4B with Ollama
Gemma 4 E4B is the recommended starting point for Ollama users. It runs on 8GB of VRAM and starts responding in under 2 seconds on modern hardware.
```
ollama run gemma4:4b
```

Ollama automatically downloads a quantized build of Gemma 4 E4B. The first run takes 2-5 minutes depending on your connection speed.
Running Gemma 4 E2B with Ollama
For systems with less than 8GB of VRAM or unified memory, use Gemma 4 E2B:
```
ollama run gemma4:2b
```

Gemma 4 E2B requires only 3GB of VRAM in Q4 quantization and can run entirely on CPU if needed.
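The 3GB figure follows from a quick back-of-envelope calculation. A minimal sketch, where the ~4.5 bits-per-weight value is an assumed approximation for Q4 formats once quantization scale/zero-point overhead is included:

```python
def q4_weight_gib(params_billion: float, bits_per_weight: float = 4.5) -> float:
    # Assumed approximation: Q4 stores ~4.5 bits per weight,
    # including quantization overhead.
    return params_billion * 1e9 * bits_per_weight / 8 / 2**30

# ~2B parameters at Q4 comes to roughly 1 GiB of weights; the KV cache
# and runtime buffers account for the rest of the ~3GB requirement.
print(round(q4_weight_gib(2), 2))
```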
Running Gemma 4 with Thinking Mode
Gemma 4 E4B and 27B support Thinking Mode, which enables extended reasoning for complex problems:
```
ollama run gemma4:4b-thinking
```

In Thinking Mode, Gemma 4 produces internal reasoning traces (wrapped in <think> tags) before giving its final answer.
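If you consume Thinking Mode output programmatically, you will usually want to separate the reasoning trace from the final answer. A minimal sketch, assuming the <think> tag format described above (split_thinking is a hypothetical helper, not part of the Ollama API):

```python
import re

def split_thinking(text: str) -> tuple[str, str]:
    # Hypothetical helper: collect the <think>...</think> reasoning traces,
    # then strip them out to leave only the final answer.
    thoughts = "\n".join(re.findall(r"<think>(.*?)</think>", text, re.DOTALL)).strip()
    answer = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
    return thoughts, answer

thoughts, answer = split_thinking("<think>2+2=4</think>The answer is 4.")
print(answer)
```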
Using the Ollama API
Once Gemma 4 is running, you can query it via the OpenAI-compatible API:
```
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma4:4b",
    "messages": [{"role": "user", "content": "Explain Gemma 4 thinking mode"}]
  }'
```

Ollama + Python
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
response = client.chat.completions.create(
    model="gemma4:4b",
    messages=[{"role": "user", "content": "Hello from Gemma 4!"}],
)
print(response.choices[0].message.content)
```

Available Gemma 4 Tags on Ollama
| Tag | Model | VRAM | Notes |
|---|---|---|---|
| gemma4:2b | Gemma 4 E2B | 3 GB | Text only |
| gemma4:4b | Gemma 4 E4B | 5 GB | Multimodal |
| gemma4:4b-thinking | Gemma 4 E4B Thinking | 5 GB | Extended reasoning |
| gemma4:27b | Gemma 4 27B Q4 | 18 GB | Best quality |
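When scripting setup across machines with different GPUs, the table above can drive model selection. A minimal sketch using the approximate VRAM figures listed there (pick_tag is a hypothetical helper, not part of Ollama):

```python
# Approximate VRAM requirements from the tag table above (GB).
TAG_VRAM_GB = {
    "gemma4:2b": 3,
    "gemma4:4b": 5,
    "gemma4:4b-thinking": 5,
    "gemma4:27b": 18,
}

def pick_tag(available_vram_gb: float) -> str:
    # Hypothetical helper: choose the largest tag that fits in VRAM;
    # fall back to the smallest model, which can run on CPU.
    candidates = [t for t, v in TAG_VRAM_GB.items() if v <= available_vram_gb]
    if not candidates:
        return "gemma4:2b"
    return max(candidates, key=TAG_VRAM_GB.get)

print(pick_tag(8))
```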
Troubleshooting
Model won't start: Check available VRAM with nvidia-smi (or Activity Monitor on macOS). Try a smaller model, or pull a more aggressively quantized variant of the tag (for example, a q4_0 build) if one is available.
Slow responses: Ensure your GPU is being used. Run ollama run gemma4:4b and check if GPU utilization spikes in nvidia-smi or Activity Monitor.
Port already in use: Ollama runs on port 11434 by default. Check with lsof -i :11434.
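If lsof is unavailable, the same check can be done from Python. A short sketch assuming Ollama's default port of 11434 (port_in_use is a hypothetical helper):

```python
import socket

def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    # A successful TCP connect means something (such as a running
    # Ollama server) is already listening on the port.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(0.5)
        return s.connect_ex((host, port)) == 0

print(port_in_use(11434))  # True if a server is listening locally
```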