# Thinking Mode

How to enable and control Gemma 4's extended reasoning — and when it actually helps.

Gemma 4 supports an extended reasoning mode that generates an internal chain-of-thought before producing a final answer. This significantly improves accuracy on mathematical reasoning, multi-step logic, and code debugging tasks.

## How it works

When thinking mode is active, Gemma 4:

  1. Generates a `<think>...</think>` block containing its reasoning steps
  2. Produces the final answer after the closing tag
  3. Leaves the think block in the raw API response; UI wrappers typically hide it
A raw response therefore looks like this:

```
<think>
Let me work through this step by step.
The user wants to sort a list of integers...
</think>

Here's the solution:

sorted_list = sorted(numbers)
```

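The structure above can be parsed mechanically. A minimal sketch of splitting a raw response into its reasoning and answer parts (the `split_thinking` helper is illustrative, not part of any runtime API):

```python
import re

def split_thinking(response: str) -> tuple[str, str]:
    """Split a raw response into (reasoning, answer).

    Assumes the runtime emits at most one <think>...</think> block,
    as described above; this helper is a sketch, not a library function.
    """
    match = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
    if not match:
        # No think block: the whole response is the answer.
        return "", response.strip()
    reasoning = match.group(1).strip()
    answer = response[match.end():].strip()
    return reasoning, answer
```

Keeping the reasoning around (rather than discarding it immediately) is useful for logging and for debugging prompts that produce wrong answers.
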
## Enabling thinking mode

Thinking mode is controlled via the system prompt or a special token depending on the runtime.

### Ollama

```bash
ollama run gemma4:4b
>>> /set system "Think step by step before answering."
```

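The same system prompt can be sent through Ollama's HTTP API (`POST /api/chat`). A sketch of the request payload, assuming a local server on the default port:

```python
import json

# Request body for Ollama's chat endpoint
# (POST http://localhost:11434/api/chat).
# The system message enables step-by-step reasoning,
# mirroring the REPL example above.
payload = {
    "model": "gemma4:4b",
    "messages": [
        {"role": "system", "content": "Think step by step before answering."},
        {"role": "user", "content": "What is 17 × 23 + 41?"},
    ],
    "stream": False,  # return one complete response instead of a token stream
}

body = json.dumps(payload)
```

With `stream` set to `False`, the full response (including any think block) arrives in a single JSON object, which makes post-processing simpler.
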
### Python / Hugging Face

```python
messages = [
    {
        "role": "system",
        "content": "You are a careful reasoning assistant. Think step by step."
    },
    {
        "role": "user",
        "content": "What is 17 × 23 + 41?"
    }
]
```

### Explicit thinking token (advanced)

Some fine-tuned versions support a dedicated token:

```
<|thinking|>
{your prompt here}
<|/thinking|>
```
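
If the model card confirms token support, wrapping a prompt is a one-liner. `wrap_thinking` below is an illustrative helper, not a library function:

```python
def wrap_thinking(prompt: str) -> str:
    # Wrap a prompt in the dedicated thinking tokens.
    # Only valid for fine-tunes whose model card lists these tokens;
    # on other checkpoints the literal tokens will confuse the model.
    return f"<|thinking|>\n{prompt}\n<|/thinking|>"
```
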

Check the specific model card on Hugging Face for token support.

## When to use it

| Task | Thinking mode | Reason |
| --- | --- | --- |
| Math word problems | ✓ Yes | Multi-step arithmetic benefits from a scratchpad |
| Code debugging | ✓ Yes | Root-cause analysis improves significantly |
| Simple factual Q&A | ✗ No | Overhead without benefit |
| Creative writing | ✗ No | Reduces fluency |
| Classification | ✗ No | Direct answers are faster and equally accurate |
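
In an application that handles mixed workloads, the table above can be encoded as a simple routing rule. The task labels here are illustrative:

```python
# Whether to enable thinking mode, keyed by task category
# (derived from the table above; labels are illustrative).
THINKING_TASKS = {
    "math": True,            # multi-step arithmetic benefits from a scratchpad
    "debugging": True,       # root-cause analysis improves
    "factual_qa": False,     # overhead without benefit
    "creative": False,       # reduces fluency
    "classification": False, # direct answers are faster and as accurate
}

def use_thinking(task: str) -> bool:
    # Default to False for unknown tasks: thinking mode costs
    # tokens and latency, so opt in only where it pays off.
    return THINKING_TASKS.get(task, False)
```
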

## Performance impact

Thinking mode increases:

  • Token count: 2–5× more tokens generated
  • Latency: Proportional to extra tokens
  • VRAM: Minimal increase (context is already allocated)

On E4B with Ollama, a typical thinking-mode response takes 8–15 seconds vs 2–4 seconds without.

Thinking tokens count toward your context window. On the E4B (32K context), long think blocks can consume a significant portion of the available window for multi-turn conversations.
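
The budget impact is easy to estimate. A back-of-the-envelope sketch using the figures above (2–5× token overhead, 32K window on E4B); the helper and numbers are illustrative:

```python
CONTEXT_WINDOW = 32_000  # E4B context size, per the note above
THINK_MULTIPLIER = 3     # mid-range of the 2-5x token overhead

def turns_until_full(tokens_per_answer: int,
                     multiplier: int = THINK_MULTIPLIER) -> int:
    """Rough number of turns before responses alone fill the window."""
    tokens_per_turn = tokens_per_answer * multiplier
    return CONTEXT_WINDOW // tokens_per_turn

# e.g. 400-token answers with 3x overhead -> about 26 turns
```

In practice you would also count prompt tokens, but even this rough figure shows why long-running chats should strip or summarize think blocks between turns.
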

## Truncating think blocks

If you're building an application, strip the think block before displaying the response to users:

```python
import re

def strip_thinking(response: str) -> str:
    """Remove a <think>...</think> block and surrounding whitespace."""
    # strip_thinking("<think>steps</think>\n\nAnswer") -> "Answer"
    return re.sub(r'<think>.*?</think>', '', response, flags=re.DOTALL).strip()
```
