gemma4.dev
Fix: Can't Disable Gemma 4 Thinking Mode

How to disable Gemma 4's thinking mode when <think> tags appear in responses. Covers Ollama and Hugging Face Transformers solutions.

Symptom

Every response from your Gemma 4 model includes <think> and </think> tags wrapping a chain-of-thought block before the actual answer:

<think>
Let me think about this step by step...
...
</think>
The answer is...

Adding instructions like "do not think" or "respond directly" to the system prompt has no effect. The thinking output keeps appearing.
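If you want to confirm the symptom programmatically, for example in a test harness that checks which model variant a deployment is serving, a quick regex check is enough. A minimal sketch; the helper name is ours:

```python
import re

def has_thinking_block(text: str) -> bool:
    """True if the response contains a <think>...</think> span.

    DOTALL lets "." match newlines, so multi-line thinking blocks are found.
    """
    return re.search(r"<think>.*?</think>", text, flags=re.DOTALL) is not None
```

A response that trips this check on every prompt is a strong sign you are running a thinking variant, not that your prompt is wrong.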

Why It Happens

You are running a thinking model variant — a checkpoint specifically trained to produce chain-of-thought reasoning. In Ollama, this is the *-thinking tag (e.g., gemma4:4b-thinking). In other runtimes, the weights may be labeled with -thinking or -it-thinking in the filename.

Thinking mode is baked into the model weights at training time. It is not a runtime parameter you can toggle on or off. The model was trained to always produce <think>...</think> blocks, so no system prompt instruction can reliably suppress it.

Fix 1: Switch to the Standard (Non-Thinking) Model

The most reliable fix is to load the correct model variant. With Ollama:

# Instead of:
ollama run gemma4:4b-thinking

# Use:
ollama run gemma4:4b

The standard gemma4:4b model does not produce thinking output. Use the -thinking variants only when you specifically need extended reasoning on complex tasks.
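The same variant choice applies if you call Ollama's HTTP API (`/api/generate`) instead of the CLI: the `model` field in the request body selects the tag. A minimal payload builder as a sketch, with the tag names taken from the commands above (they are assumptions if your local tags differ):

```python
def build_ollama_request(prompt: str, thinking: bool = False) -> dict:
    """Build a JSON payload for Ollama's /api/generate endpoint.

    Thinking is baked into the weights, so picking the right model tag
    here is the whole fix. Tag names follow the CLI examples above.
    """
    model = "gemma4:4b-thinking" if thinking else "gemma4:4b"
    return {"model": model, "prompt": prompt, "stream": False}
```

POST the resulting dict as JSON to `http://localhost:11434/api/generate`.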

Fix 2: Strip <think> Tags from Output in Code

If you need to keep the thinking model (for example, because you want its improved reasoning quality but do not want to display the thinking steps), filter the output in your application:

import re

def strip_thinking(text: str) -> str:
    # DOTALL lets "." span newlines, so multi-line thinking blocks match;
    # the non-greedy ".*?" stops at the first closing tag.
    return re.sub(r'<think>.*?</think>', '', text, flags=re.DOTALL).strip()

# Usage:
raw_response = model.generate(prompt)
clean_response = strip_thinking(raw_response)

This approach lets you benefit from the thinking model's reasoning quality while presenting only the final answer to users.
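If you stream tokens to the user, the one-shot regex cannot run until the full response has arrived. A streaming-safe variant drops thinking text incrementally instead; this is a sketch (the class name and the hold-back policy are ours) that handles tags split across chunk boundaries:

```python
class ThinkingFilter:
    """Incrementally removes <think>...</think> spans from streamed text."""

    OPEN, CLOSE = "<think>", "</think>"

    def __init__(self) -> None:
        self._buf = ""
        self._inside = False

    def feed(self, chunk: str) -> str:
        """Consume one streamed chunk; return the displayable text in it."""
        self._buf += chunk
        out = []
        while True:
            if self._inside:
                i = self._buf.find(self.CLOSE)
                if i == -1:
                    # Close tag not seen yet: discard thinking text, but keep
                    # a tail that could be the start of a split "</think>".
                    self._buf = self._buf[-(len(self.CLOSE) - 1):]
                    break
                self._buf = self._buf[i + len(self.CLOSE):]
                self._inside = False
            else:
                i = self._buf.find(self.OPEN)
                if i == -1:
                    # Emit everything except a tail that could be the start
                    # of a split "<think>"; flush() releases it at stream end.
                    hold = len(self.OPEN) - 1
                    out.append(self._buf[:-hold])
                    if len(self._buf) > hold:
                        self._buf = self._buf[-hold:]
                    break
                out.append(self._buf[:i])
                self._buf = self._buf[i + len(self.OPEN):]
                self._inside = True
        return "".join(out)

    def flush(self) -> str:
        """Release any held-back tail once the stream is complete."""
        tail, self._buf = ("" if self._inside else self._buf), ""
        return tail
```

Call `feed()` on each chunk as it arrives and `flush()` once after the stream ends; concatenating the return values yields the answer with thinking blocks removed.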

Fix 3: Hugging Face Generation Config

If you are using Gemma 4 directly via the Hugging Face transformers library, check whether the model's generation config exposes a thinking_mode parameter. If it does:

output = model.generate(
    inputs,
    thinking_mode=False,
    max_new_tokens=512,
)

Note: support for this parameter depends on the specific model version and transformers version. Verify in the model card on Hugging Face whether it is available.
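To avoid passing an unsupported keyword on versions that lack it, you can probe the generation config before calling generate. A small sketch; the `thinking_mode` name is the same assumption as in the snippet above:

```python
def supports_thinking_flag(generation_config) -> bool:
    """Check whether a generation config exposes a `thinking_mode` field.

    Hypothetical helper: the parameter name is taken from the example above
    and may not exist in your transformers or model version.
    """
    return hasattr(generation_config, "thinking_mode")

# Usage sketch (assumes a loaded transformers model):
# if supports_thinking_flag(model.generation_config):
#     output = model.generate(inputs, thinking_mode=False, max_new_tokens=512)
# else:
#     output = model.generate(inputs, max_new_tokens=512)  # fall back to Fix 2
```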

Important: System Prompts Do Not Work

A common attempted workaround is adding a system prompt instruction like:

You are a helpful assistant. Do not show your thinking. Respond directly without <think> tags.

This does not reliably suppress thinking output. The model may follow the instruction occasionally but will revert to generating thinking tokens in most cases. Model variant selection (Fix 1) is the only reliable method to prevent thinking output entirely.

Related

  • Gemma 4 Thinking Mode Guide
  • Running Gemma 4 with llama.cpp