gemma4.dev

Gemma 4 Thinking Mode — How Extended Reasoning Works

Learn how to enable and use Gemma 4's Thinking Mode for complex reasoning tasks. Covers available models, prompt format, and output parsing.

What is Thinking Mode?

Thinking Mode is Gemma 4's extended reasoning feature. When enabled, the model generates an internal chain-of-thought before producing its final answer. This inner monologue appears between <think> and </think> tags in the raw output — the model "thinks out loud" before committing to a response.
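As an illustration, the raw output for a simple arithmetic question might look like the following (the actual trace will vary from run to run):

```
<think>
The user wants 17 × 24. 17 × 24 = 17 × 20 + 17 × 4 = 340 + 68 = 408.
Double-check: 24 × 17 = 240 + 168 = 408. Consistent.
</think>
17 × 24 = 408.
```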

This is distinct from standard prompting where the model generates tokens directly toward an answer. In Thinking Mode the model can explore multiple solution paths, backtrack, and verify intermediate steps — the same technique used in reasoning-focused models like DeepSeek-R1 and QwQ.

The result is meaningfully better performance on tasks that require multi-step logic: math problems, code debugging, scientific reasoning, and complex planning.

Which Gemma 4 Models Support Thinking Mode?

Not all Gemma 4 variants include Thinking Mode. The feature is available on:

Model                                Thinking Mode   Notes
Gemma 4 E4B (gemma4:4b-thinking)     Yes             Best balance of speed and reasoning
Gemma 4 31B (gemma4:31b-thinking)    Yes             Strongest reasoning; requires 48 GB+ VRAM
Gemma 4 E2B                          No              Too small for extended reasoning
Gemma 4 26B A4B (sparse)             No              Standard mode only

The -thinking suffix on Ollama model tags signals a Thinking Mode checkpoint. These are separate model weights, not a runtime flag applied to the standard weights.

Enable with Ollama

Pull and run the thinking variant:

ollama pull gemma4:4b-thinking
ollama run gemma4:4b-thinking

At the interactive prompt, ask a reasoning-heavy question:

>>> Solve step by step: a train leaves city A at 60 mph. Another leaves city B
    at 90 mph toward A. The cities are 300 miles apart. When do they meet?

You will see the <think> block appear before the final answer in the raw stream.
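If you consume the stream programmatically rather than at the interactive prompt, you may want to suppress the thinking trace on the fly and show users only the final answer as it arrives. Below is a minimal sketch of an incremental filter; the <think>/</think> tag names match the format described above, but the chunk boundaries and class name are illustrative, not part of any Ollama API:

```python
class ThinkFilter:
    """Incrementally drop everything between <think> and </think>
    from a streamed output, emitting only the final answer text."""

    OPEN, CLOSE = "<think>", "</think>"

    def __init__(self):
        self.buffer = ""       # unemitted text, may end in a partial tag
        self.in_think = False  # currently inside a <think> block?

    def feed(self, chunk: str) -> str:
        self.buffer += chunk
        out = []
        while True:
            if self.in_think:
                end = self.buffer.find(self.CLOSE)
                if end == -1:
                    # discard thinking text, but keep a tail in case
                    # </think> is split across two chunks
                    self.buffer = self.buffer[-len(self.CLOSE):]
                    break
                self.buffer = self.buffer[end + len(self.CLOSE):]
                self.in_think = False
            else:
                start = self.buffer.find(self.OPEN)
                if start == -1:
                    # emit all but a possible partial "<think>" at the end
                    safe = len(self.buffer) - len(self.OPEN)
                    if safe > 0:
                        out.append(self.buffer[:safe])
                        self.buffer = self.buffer[safe:]
                    break
                out.append(self.buffer[:start])
                self.buffer = self.buffer[start + len(self.OPEN):]
                self.in_think = True
        return "".join(out)

    def flush(self) -> str:
        """Call once after the stream ends to release any held text."""
        out, self.buffer = self.buffer, ""
        return "" if self.in_think else out
```

Feed each streamed chunk through feed() and call flush() at the end; the thinking trace never reaches the user, even when a tag straddles a chunk boundary.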

Enable with Hugging Face Transformers

When loading from Hugging Face, use the instruction-tuned thinking checkpoint and ensure thinking_mode=True is set in the generation config:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "google/gemma-4-4b-it"  # thinking weights
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "user", "content": "What is the derivative of x^3 * sin(x)?"}
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # append the assistant turn header
    return_tensors="pt",
    return_dict=True,
).to(model.device)

model.generation_config.thinking_mode = True  # enable extended reasoning

outputs = model.generate(
    **inputs,
    max_new_tokens=2048,
    do_sample=False,
)
# Decode only the newly generated tokens, keeping the <think> tags intact
new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
raw_output = tokenizer.decode(new_tokens, skip_special_tokens=False)
print(raw_output)

Parsing the Thinking Output

The raw output contains both the thinking trace and the final answer. For production use you typically want to strip the thinking section and return only the final answer to end users:

import re

def extract_response(raw_output: str) -> tuple[str, str]:
    """Separate Gemma 4 thinking trace from final answer."""
    think_match = re.search(r'<think>(.*?)</think>', raw_output, re.DOTALL)
    thinking = think_match.group(1).strip() if think_match else ""
    response = re.sub(r'<think>.*?</think>', '', raw_output, flags=re.DOTALL).strip()
    return thinking, response

thinking, answer = extract_response(raw_output)
print("=== Thinking ===")
print(thinking[:500], "...")  # truncate for brevity
print("\n=== Answer ===")
print(answer)

You can also log the thinking trace for debugging or display it as a collapsible "Show reasoning" section in your UI.

When to Use Thinking Mode

Thinking Mode is most valuable for:

  • Mathematics — algebra, calculus, geometry, combinatorics
  • Code debugging — tracing through logic errors, identifying off-by-one bugs
  • Multi-step reasoning — word problems, logical deduction, constraint satisfaction
  • Scientific analysis — hypothesis testing, data interpretation, experimental design
  • Complex planning — travel itineraries, project scheduling, dependency resolution

When Not to Use Thinking Mode

Avoid Thinking Mode when:

  • Latency matters — thinking generates hundreds to thousands of extra tokens before the answer, adding seconds of delay
  • The question is simple — factual lookups, short summaries, and straightforward instructions do not benefit from extended reasoning and waste tokens
  • Token cost is a concern — each thinking token costs the same as an output token; for high-volume APIs this adds up quickly

A good heuristic: if a human expert could answer confidently in under 10 seconds, use standard mode. If the problem requires scratch paper, use Thinking Mode.
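That heuristic can be applied in code with a simple router that picks a model tag from surface features of the prompt. The keyword list, word-count threshold, and the standard-model tag gemma4:4b below are illustrative assumptions, not an official classifier:

```python
# Illustrative markers of reasoning-heavy prompts (assumed, not exhaustive)
REASONING_HINTS = (
    "step by step", "prove", "derive", "debug",
    "how many", "solve", "schedule", "why does",
)

def pick_model(prompt: str) -> str:
    """Route reasoning-heavy prompts to the thinking variant and
    everything else to the standard model (heuristic sketch)."""
    p = prompt.lower()
    # Long prompts tend to carry multi-step problems, hence the word cap
    if any(hint in p for hint in REASONING_HINTS) or len(p.split()) > 80:
        return "gemma4:4b-thinking"
    return "gemma4:4b"  # assumed tag for the standard checkpoint
```

In practice you would tune the hint list against your own traffic; the point is that a cheap pre-check can keep simple queries off the slower thinking model.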

Token Budget

Thinking Mode can consume a very large number of tokens for hard problems. To prevent runaway generation, set max_new_tokens to a reasonable ceiling. For most reasoning tasks 2048–4096 new tokens covers both the thinking trace and the answer. For very hard math problems you may need 8192 or more.
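Because generation can stop at the ceiling mid-trace, it is worth checking whether the model actually closed its thinking block before trusting the parsed answer. A small check, reusing the <think>/</think> format from above:

```python
def is_truncated(raw_output: str) -> bool:
    """True if the output opened a <think> block but never closed it,
    which usually means max_new_tokens ran out mid-reasoning."""
    return "<think>" in raw_output and "</think>" not in raw_output

# A truncated trace contains no final answer to extract; retry with a
# larger budget (e.g. double max_new_tokens) rather than parsing it.
```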
