Gemma 4 Prompt Formatting Guide

This guide covers the correct chat template and system prompt format for Gemma 4 models, along with the common formatting mistakes that degrade output quality.

Why Prompt Formatting Matters

Every instruction-tuned language model is fine-tuned with a specific prompt format. When you deviate from that format at inference time — even slightly — the model behaves as if it is reading garbled input. Common symptoms of incorrect formatting include: the model repeating the prompt back, generating incomplete sentences, mixing languages, or producing generic low-quality output.

Gemma 4 uses its own chat template. It is different from the Llama 2 [INST] format, the ChatML <|im_start|> format, and the Mistral format. Using the wrong template is one of the most common reasons people report "Gemma 4 doesn't work well."

The Gemma 4 Chat Template

Gemma 4's template uses <start_of_turn> and <end_of_turn> tokens as turn delimiters. The raw format looks like this:

<start_of_turn>user
Your message here<end_of_turn>
<start_of_turn>model

The model's response is generated after the final <start_of_turn>model prefix. Do not add an <end_of_turn> after that prefix: the turn is left open so the model can write its reply, and the model normally closes the turn itself when it decides to stop.

A multi-turn conversation:

<start_of_turn>user
What is Gemma 4?<end_of_turn>
<start_of_turn>model
Gemma 4 is a family of open-weight language models from Google.<end_of_turn>
<start_of_turn>user
Which variant should I use for local inference?<end_of_turn>
<start_of_turn>model
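
To make the turn structure concrete, here is a minimal sketch that renders a message list into the raw format shown above. It is for illustration only: the function name render_gemma_prompt and the mapping of an "assistant" role to the model turn are assumptions of this sketch, and it deliberately leaves out <bos>. The tokenizer's own template remains the authority on the exact string.

```python
def render_gemma_prompt(messages, add_generation_prompt=True):
    """Render chat messages into the raw turn format (illustration only)."""
    # Assumption: an OpenAI-style "assistant" role maps to the "model" turn.
    role_names = {"user": "user", "assistant": "model", "model": "model"}
    parts = []
    for message in messages:
        role = role_names[message["role"]]  # raises KeyError for other roles
        parts.append(f"<start_of_turn>{role}\n{message['content']}<end_of_turn>\n")
    if add_generation_prompt:
        # Leave the final model turn open so the model writes the reply.
        parts.append("<start_of_turn>model\n")
    return "".join(parts)


conversation = [
    {"role": "user", "content": "What is Gemma 4?"},
    {
        "role": "assistant",
        "content": "Gemma 4 is a family of open-weight language models from Google.",
    },
    {"role": "user", "content": "Which variant should I use for local inference?"},
]
print(render_gemma_prompt(conversation))
```

Printing the result for the conversation above reproduces the multi-turn layout, ending with an open model turn.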

Let the Tokenizer Handle It

You should almost never construct the raw template string by hand. The Hugging Face tokenizer for every Gemma 4 model includes a chat_template field that handles formatting automatically via apply_chat_template:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-4-4b-it")

messages = [
    {"role": "user", "content": "What is the capital of France?"}
]

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)
# <start_of_turn>user
# What is the capital of France?<end_of_turn>
# <start_of_turn>model

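To get input tensors ready for model.generate(), leave tokenize at its default of True and pass return_tensors="pt" together with return_dict=True:

# Assumes `model` has already been loaded and placed on its device.
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,  # the default, shown here for clarity
    return_tensors="pt",
    return_dict=True,
    add_generation_prompt=True,
).to(model.device)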

System Prompt Placement

Gemma 4 does not have a dedicated system role in its chat template the way some other models do. The correct way to include a system prompt is to prepend it to the first user message:

messages = [
    {
        "role": "user",
        "content": "System: You are a helpful assistant that responds concisely.\n\nWhat is the boiling point of water?"
    }
]
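
If your application already stores conversations with a separate system message, the prepending step can be automated. The sketch below is one possible helper, not part of any library: the name fold_system_prompt is hypothetical, and the "System: " prefix simply mirrors the example above.

```python
import copy


def fold_system_prompt(messages):
    """Return a copy of messages with a leading system message merged
    into the first user message."""
    if not messages or messages[0]["role"] != "system":
        return copy.deepcopy(messages)
    system_text = messages[0]["content"]
    rest = copy.deepcopy(messages[1:])  # never mutate the caller's list
    if not rest or rest[0]["role"] != "user":
        raise ValueError("a system message must be followed by a user message")
    # Same layout as the hand-written example: prefix, blank line, question.
    rest[0]["content"] = f"System: {system_text}\n\n{rest[0]['content']}"
    return rest


chat = [
    {"role": "system", "content": "You are a helpful assistant that responds concisely."},
    {"role": "user", "content": "What is the boiling point of water?"},
]
folded = fold_system_prompt(chat)
print(folded[0]["content"])
```

The folded list contains only user and model turns, so it can be passed to apply_chat_template like any other message list.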

Alternatively, some Gemma 4 deployments support a system role by placing the system content before the first user turn in the raw template:

<start_of_turn>system
You are a helpful assistant.<end_of_turn>
<start_of_turn>user
What is the boiling point of water?<end_of_turn>
<start_of_turn>model

Before relying on this form, check the model card on Hugging Face to confirm that the specific variant you are using was fine-tuned with system turn support.

Common Formatting Mistakes

Using the Llama format

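The Llama 2 prompt format wraps user messages in [INST] / [/INST] markers (Llama 3 uses a different, header-based format). These markers are not part of Gemma 4's template, so the model sees them as ordinary text: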

# Wrong — Llama format
[INST] What is Gemma 4? [/INST]

# Correct — Gemma 4 format
<start_of_turn>user
What is Gemma 4?<end_of_turn>
<start_of_turn>model

Using `<s>` as a BOS token

Llama models use <s> as the beginning-of-sequence token. Gemma 4's BOS token is <bos>. A literal <s> at the start of your prompt does not act as a BOS token for Gemma 4; it only adds stray input at the very start of the sequence, where the model was not trained to expect it:

# Wrong
<s><start_of_turn>user

# Correct (let the tokenizer add <bos> automatically)
<start_of_turn>user

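When using apply_chat_template, the tokenizer adds <bos> automatically. Do not add it manually. If you render the prompt to a string with tokenize=False and then tokenize that string yourself, check that <bos> does not end up in the token sequence twice.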

Forgetting `add_generation_prompt=True`

Without add_generation_prompt=True, apply_chat_template does not append <start_of_turn>model\n at the end. The model then has no cue to start generating a response and may continue the conversation as a user instead.
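
The three mistakes above can be caught with a simple string check before inference. The following is a heuristic sketch only: lint_gemma_prompt and the FOREIGN_MARKERS list are hypothetical names introduced here, and a prompt that legitimately discusses these markers in its content would be flagged as well.

```python
# Markers from other chat templates that should not appear as delimiters.
FOREIGN_MARKERS = ("[INST]", "[/INST]", "<|im_start|>", "<s>")


def lint_gemma_prompt(prompt):
    """Return a list of likely formatting problems in a rendered prompt string."""
    problems = []
    for marker in FOREIGN_MARKERS:
        if marker in prompt:
            problems.append(f"foreign template marker {marker!r}")
    if prompt.count("<bos>") > 1:
        problems.append("more than one <bos> token")
    if not prompt.endswith("<start_of_turn>model\n"):
        # Typical result of forgetting add_generation_prompt=True.
        problems.append("prompt does not end with an open model turn")
    return problems


good = "<start_of_turn>user\nWhat is Gemma 4?<end_of_turn>\n<start_of_turn>model\n"
bad = "<s>[INST] What is Gemma 4? [/INST]"
print(lint_gemma_prompt(good))
print(lint_gemma_prompt(bad))
```

An empty list means none of these particular problems were found; it does not prove the prompt matches the model's template in every detail.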

Tool Calling Format

When using function calling (tool use), apply_chat_template with a tools argument handles the tool schema injection automatically:

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string"}
                },
                "required": ["city"],
            },
        },
    }
]

inputs = tokenizer.apply_chat_template(
    messages,
    tools=tools,
    return_tensors="pt",
    return_dict=True,
    add_generation_prompt=True,
).to(model.device)

The tokenizer inserts the tool schema into the prompt in the format Gemma 4 was trained on. Do not manually serialise the tool definitions as JSON and insert them into the user message — the placement and formatting must match the training distribution exactly.
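
Because the tokenizer consumes the tool definitions as given, a malformed schema is easiest to catch before it reaches apply_chat_template. The sketch below checks one tool definition against the shape used in the example above. The function name check_tool_schema is hypothetical, and the set of keys it insists on is an assumption modelled on that example rather than a rule enforced by the tokenizer.

```python
def check_tool_schema(tool):
    """Return a list of structural problems in one tool definition."""
    problems = []
    if tool.get("type") != "function":
        problems.append('"type" should be "function"')
    function = tool.get("function", {})
    for key in ("name", "description", "parameters"):
        if key not in function:
            problems.append(f'"function" is missing "{key}"')
    parameters = function.get("parameters", {})
    properties = parameters.get("properties", {})
    # Every required parameter should also be declared under "properties".
    for name in parameters.get("required", []):
        if name not in properties:
            problems.append(f'required parameter "{name}" is not declared')
    return problems


weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}
print(check_tool_schema(weather_tool))
```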

Verify Your Prompt

To inspect the exact string that will be sent to the model, print the decoded prompt before running inference:

prompt_str = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
print(repr(prompt_str))

Comparing this output against the expected format is the fastest way to debug formatting issues.
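
When the difference is a single missing newline or token, comparing by eye is error-prone. One way to automate the comparison is a line-by-line diff using the standard library's difflib, sketched below. The helper name diff_prompts is hypothetical, and the expected string is whatever reference format you trust for your model variant.

```python
import difflib


def diff_prompts(expected, actual):
    """Return a unified diff of two prompt strings, with each line passed
    through repr() so newlines and other whitespace are visible."""
    expected_lines = [repr(line) for line in expected.splitlines(keepends=True)]
    actual_lines = [repr(line) for line in actual.splitlines(keepends=True)]
    return list(
        difflib.unified_diff(
            expected_lines, actual_lines, "expected", "actual", lineterm=""
        )
    )


expected_prompt = "<start_of_turn>user\nHi<end_of_turn>\n<start_of_turn>model\n"
# Simulates a prompt rendered without add_generation_prompt=True.
actual_prompt = "<start_of_turn>user\nHi<end_of_turn>\n"

for line in diff_prompts(expected_prompt, actual_prompt):
    print(line)
```

An empty diff means the two strings are identical; lines starting with "-" are present only in the expected prompt, and lines starting with "+" only in the actual one.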