# Thinking Mode
How to enable and control Gemma 4's extended reasoning — and when it actually helps.
Gemma 4 supports an extended reasoning mode that generates an internal chain-of-thought before producing a final answer. This significantly improves accuracy on mathematical reasoning, multi-step logic, and code debugging tasks.
## How it works
When thinking mode is active, Gemma 4:
- Generates a `<think>...</think>` block containing its reasoning steps
- Produces the final answer after the closing tag
- The think block is typically hidden in UI wrappers but is available in the raw API response
For example, a raw response might look like:

````text
<think>
Let me work through this step by step.
The user wants to sort a list of integers...
</think>
Here's the solution:

```python
sorted_list = sorted(numbers)
```
````
## Enabling thinking mode
Thinking mode is controlled via the system prompt or a special token depending on the runtime.
### Ollama
```bash
ollama run gemma4:4b
>>> /set system "Think step by step before answering."
```

### Python / Hugging Face
```python
messages = [
    {
        "role": "system",
        "content": "You are a careful reasoning assistant. Think step by step."
    },
    {
        "role": "user",
        "content": "What is 17 × 23 + 41?"
    }
]
```

### Explicit thinking token (advanced)
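For illustration, here is a minimal sketch of how such a messages list could be flattened into a single prompt string. The turn markers below are assumptions modeled on earlier Gemma chat formats; in practice, prefer the tokenizer's own `apply_chat_template` method.

```python
def to_prompt(messages: list[dict]) -> str:
    """Flatten a chat-style messages list into one prompt string.

    The <start_of_turn>/<end_of_turn> markers are illustrative; real
    runtimes should use tokenizer.apply_chat_template instead.
    """
    parts = [
        f"<start_of_turn>{m['role']}\n{m['content']}<end_of_turn>"
        for m in messages
    ]
    # Leave the model's turn open so generation continues from here.
    parts.append("<start_of_turn>model\n")
    return "\n".join(parts)

messages = [
    {"role": "system", "content": "You are a careful reasoning assistant. Think step by step."},
    {"role": "user", "content": "What is 17 × 23 + 41?"},
]
prompt = to_prompt(messages)
```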
Some fine-tuned versions support a dedicated token:
```
<|thinking|>
{your prompt here}
<|/thinking|>
```

Check the specific model card on Hugging Face for token support.
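When the token is supported, the wrapping is mechanical; a small helper (hypothetical name) keeps it in one place:

```python
THINK_OPEN = "<|thinking|>"
THINK_CLOSE = "<|/thinking|>"

def wrap_thinking(prompt: str) -> str:
    """Wrap a prompt in the dedicated thinking tokens.

    Only applies to checkpoints whose model card documents these tokens.
    """
    return f"{THINK_OPEN}\n{prompt}\n{THINK_CLOSE}"
```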
## When to use it
| Task | Thinking mode | Reason |
|---|---|---|
| Math word problems | ✓ Yes | Multi-step arithmetic benefits from scratchpad |
| Code debugging | ✓ Yes | Root cause analysis improves significantly |
| Simple factual Q&A | ✗ No | Overhead without benefit |
| Creative writing | ✗ No | Reduces fluency |
| Classification | ✗ No | Direct answers are faster and equally accurate |
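In application code, the table above can be mirrored by a simple gate that enables thinking mode only for task types where it helps. The task-type names here are illustrative, not part of any API:

```python
# Task types where thinking mode measurably helps, per the table above.
THINKING_TASKS = {"math", "debugging"}

def use_thinking(task_type: str) -> bool:
    """Return True if thinking mode should be enabled for this task type."""
    return task_type in THINKING_TASKS
```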
## Performance impact
Thinking mode increases:
- Token count: 2–5× more tokens generated
- Latency: Proportional to extra tokens
- VRAM: Minimal increase (context is already allocated)
On E4B with Ollama, a typical thinking-mode response takes 8–15 seconds vs 2–4 seconds without.
Thinking tokens count toward your context window. On the E4B (32K context), long think blocks can consume a significant portion of the available window for multi-turn conversations.
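A back-of-envelope sketch of that overhead, using the ranges above (all concrete numbers are illustrative assumptions, not benchmarks):

```python
# Illustrative estimate of thinking-mode cost on a 32K-context model.
BASE_TOKENS = 150          # assumed length of a direct answer
THINKING_MULTIPLIER = 4    # within the 2-5x range quoted above
TOKENS_PER_SECOND = 40     # assumed decode speed; varies by hardware
CONTEXT_WINDOW = 32_000    # E4B context size

thinking_tokens = BASE_TOKENS * THINKING_MULTIPLIER
extra_latency_s = (thinking_tokens - BASE_TOKENS) / TOKENS_PER_SECOND
context_share = thinking_tokens / CONTEXT_WINDOW
```

Under these assumptions a single thinking-mode turn adds roughly 11 seconds of decode time and consumes about 2% of the context window, which compounds quickly over a multi-turn conversation.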
## Stripping think blocks
If you're building an application, strip the think block before displaying to users:
```python
import re

def strip_thinking(response: str) -> str:
    return re.sub(r'<think>.*?</think>', '', response, flags=re.DOTALL).strip()
```