# Thinking Mode

How to enable and control Gemma 4's extended reasoning — and when it actually helps.

Gemma 4 supports an extended reasoning mode that generates an internal chain-of-thought before producing a final answer. This significantly improves accuracy on mathematical reasoning, multi-step logic, and code debugging tasks.

## How it works

When thinking mode is active, Gemma 4:

  1. Generates a `<think>...</think>` block containing its reasoning steps
  2. Produces the final answer after the closing tag
  3. Leaves the think block in the raw API response; UI wrappers typically hide it
A raw response therefore looks like this:

```
<think>
Let me work through this step by step.
The user wants to sort a list of integers...
</think>

Here's the solution:

sorted_list = sorted(numbers)
```

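The structure above can be parsed mechanically. A minimal sketch of splitting a raw response into its reasoning and answer parts (the `split_thinking` helper is illustrative, not part of any runtime API):

```python
import re

def split_thinking(response: str) -> tuple[str, str]:
    """Split a raw response into (reasoning, answer).

    Assumes the runtime emits at most one <think>...</think> block,
    as described above; this helper is a sketch, not a library function.
    """
    match = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
    if not match:
        # No think block: the whole response is the answer.
        return "", response.strip()
    reasoning = match.group(1).strip()
    answer = response[match.end():].strip()
    return reasoning, answer
```

Keeping the reasoning around (rather than discarding it immediately) is useful for logging and for debugging prompts that produce wrong answers.
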
## Enabling thinking mode

Thinking mode is controlled via the system prompt or a special token depending on the runtime.

### Ollama

```bash
ollama run gemma4:4b
>>> /set system "Think step by step before answering."
```

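The same system prompt can be sent through Ollama's HTTP API (`POST /api/chat`). A sketch of the request payload, assuming a local server on the default port:

```python
import json

# Request body for Ollama's chat endpoint
# (POST http://localhost:11434/api/chat).
# The system message enables step-by-step reasoning,
# mirroring the REPL example above.
payload = {
    "model": "gemma4:4b",
    "messages": [
        {"role": "system", "content": "Think step by step before answering."},
        {"role": "user", "content": "What is 17 × 23 + 41?"},
    ],
    "stream": False,  # return one complete response instead of a token stream
}

body = json.dumps(payload)
```

With `stream` set to `False`, the full response (including any think block) arrives in a single JSON object, which makes post-processing simpler.
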
### Python / Hugging Face

```python
messages = [
    {
        "role": "system",
        "content": "You are a careful reasoning assistant. Think step by step."
    },
    {
        "role": "user",
        "content": "What is 17 × 23 + 41?"
    }
]
```

### Explicit thinking token (advanced)

Some fine-tuned versions support a dedicated token:

```
<|thinking|>
{your prompt here}
<|/thinking|>
```
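
If the model card confirms token support, wrapping a prompt is a one-liner. `wrap_thinking` below is an illustrative helper, not a library function:

```python
def wrap_thinking(prompt: str) -> str:
    # Wrap a prompt in the dedicated thinking tokens.
    # Only valid for fine-tunes whose model card lists these tokens;
    # on other checkpoints the literal tokens will confuse the model.
    return f"<|thinking|>\n{prompt}\n<|/thinking|>"
```
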

Check the specific model card on Hugging Face for token support.

## When to use it

| Task | Thinking mode | Reason |
| --- | --- | --- |
| Math word problems | ✓ Yes | Multi-step arithmetic benefits from a scratchpad |
| Code debugging | ✓ Yes | Root-cause analysis improves significantly |
| Simple factual Q&A | ✗ No | Overhead without benefit |
| Creative writing | ✗ No | Reduces fluency |
| Classification | ✗ No | Direct answers are faster and equally accurate |
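
In an application that handles mixed workloads, the table above can be encoded as a simple routing rule. The task labels here are illustrative:

```python
# Whether to enable thinking mode, keyed by task category
# (derived from the table above; labels are illustrative).
THINKING_TASKS = {
    "math": True,            # multi-step arithmetic benefits from a scratchpad
    "debugging": True,       # root-cause analysis improves
    "factual_qa": False,     # overhead without benefit
    "creative": False,       # reduces fluency
    "classification": False, # direct answers are faster and as accurate
}

def use_thinking(task: str) -> bool:
    # Default to False for unknown tasks: thinking mode costs
    # tokens and latency, so opt in only where it pays off.
    return THINKING_TASKS.get(task, False)
```
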

## Performance impact

Thinking mode increases:

  • Token count: 2–5× more tokens generated
  • Latency: Proportional to extra tokens
  • VRAM: Minimal increase (context is already allocated)

On E4B with Ollama, a typical thinking-mode response takes 8–15 seconds vs 2–4 seconds without.

Thinking tokens count toward your context window. On the E4B (32K context), long think blocks can consume a significant portion of the available window for multi-turn conversations.
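
The budget impact is easy to estimate. A back-of-the-envelope sketch using the figures above (2–5× token overhead, 32K window on E4B); the helper and numbers are illustrative:

```python
CONTEXT_WINDOW = 32_000  # E4B context size, per the note above
THINK_MULTIPLIER = 3     # mid-range of the 2-5x token overhead

def turns_until_full(tokens_per_answer: int,
                     multiplier: int = THINK_MULTIPLIER) -> int:
    """Rough number of turns before responses alone fill the window."""
    tokens_per_turn = tokens_per_answer * multiplier
    return CONTEXT_WINDOW // tokens_per_turn

# e.g. 400-token answers with 3x overhead -> about 26 turns
```

In practice you would also count prompt tokens, but even this rough figure shows why long-running chats should strip or summarize think blocks between turns.
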

## Truncating think blocks

If you're building an application, strip the think block before displaying the response to users:

```python
import re

def strip_thinking(response: str) -> str:
    """Remove a <think>...</think> block and surrounding whitespace."""
    # strip_thinking("<think>steps</think>\n\nAnswer") -> "Answer"
    return re.sub(r'<think>.*?</think>', '', response, flags=re.DOTALL).strip()
```
