Gemma 4 Prompt Formatting Guide
Correct chat template and system prompt format for Gemma 4 models. Avoid common formatting mistakes that cause poor Gemma 4 output quality.
Why Prompt Formatting Matters
Every instruction-tuned language model is fine-tuned with a specific prompt format. When you deviate from that format at inference time — even slightly — the model behaves as if it is reading garbled input. Common symptoms of incorrect formatting include: the model repeating the prompt back, generating incomplete sentences, mixing languages, or producing generic low-quality output.
Gemma 4 uses its own chat template. It is different from the Llama 2 [INST] format, the ChatML <|im_start|> format, and the Mistral format. Using the wrong template is one of the most common reasons people report "Gemma 4 doesn't work well."
The Gemma 4 Chat Template
Gemma 4's template uses <start_of_turn> and <end_of_turn> tokens as turn delimiters. The raw format looks like this:
<start_of_turn>user
Your message here<end_of_turn>
<start_of_turn>model
The model's response is generated after the final <start_of_turn>model prefix. Do not include an <end_of_turn> at the end, because the model generates until it decides to stop.
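To make the structure concrete, here is a minimal pure-Python sketch that renders a message list into the raw format shown above. The helper name render_gemma_prompt is hypothetical, and the mapping of the "assistant" role to "model" is an assumption carried over from common chat APIs. It handles plain user and model turns only; in practice the tokenizer's own chat template should be treated as the source of truth.

```python
def render_gemma_prompt(messages, add_generation_prompt=True):
    """Render chat messages into the raw turn format described in this guide.

    Illustrative only: `render_gemma_prompt` is a hypothetical helper,
    not part of any library.
    """
    # Assumption: chat APIs usually call the model's role "assistant",
    # while the raw template labels that turn "model".
    role_names = {"user": "user", "assistant": "model", "model": "model"}
    parts = []
    for message in messages:
        role = role_names[message["role"]]  # KeyError on unsupported roles
        parts.append(f"<start_of_turn>{role}\n{message['content']}<end_of_turn>\n")
    if add_generation_prompt:
        # Open a model turn and leave it unclosed so the model writes the reply.
        parts.append("<start_of_turn>model\n")
    return "".join(parts)


conversation = [
    {"role": "user", "content": "What is Gemma 4?"},
    {"role": "assistant", "content": "Gemma 4 is a family of open-weight language models from Google."},
    {"role": "user", "content": "Which variant should I use for local inference?"},
]
prompt = render_gemma_prompt(conversation)
print(prompt)
```

Printing the result is a quick way to see where each turn opens and closes, and that the final model turn is left open.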
A multi-turn conversation:
<start_of_turn>user
What is Gemma 4?<end_of_turn>
<start_of_turn>model
Gemma 4 is a family of open-weight language models from Google.<end_of_turn>
<start_of_turn>user
Which variant should I use for local inference?<end_of_turn>
<start_of_turn>model
Let the Tokenizer Handle It
You should almost never construct the raw template string by hand. The Hugging Face tokenizer for every Gemma 4 model includes a chat_template field that handles formatting automatically via apply_chat_template:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("google/gemma-4-4b-it")
messages = [
{"role": "user", "content": "What is the capital of France?"}
]
prompt = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
)
print(prompt)
# <start_of_turn>user
# What is the capital of France?<end_of_turn>
# <start_of_turn>model
Leave tokenize at its default of True and pass return_tensors="pt" with return_dict=True to get input tensors ready for model.generate():
# `model` is assumed to be loaded already; it is not defined in this snippet.
inputs = tokenizer.apply_chat_template(
messages,
return_tensors="pt",
return_dict=True,
add_generation_prompt=True,
).to(model.device)
System Prompt Placement
Gemma 4 does not have a dedicated system role in its chat template the way some other models do. The correct way to include a system prompt is to prepend it to the first user message:
messages = [
{
"role": "user",
"content": "System: You are a helpful assistant that responds concisely.\n\nWhat is the boiling point of water?"
}
]
Alternatively, some Gemma 4 deployments support a system role by placing the system content before the first user turn in the raw template:
<start_of_turn>system
You are a helpful assistant.<end_of_turn>
<start_of_turn>user
What is the boiling point of water?<end_of_turn>
<start_of_turn>model
Check whether the specific model variant you are using has been fine-tuned with system turn support by looking at its model card on Hugging Face.
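The first approach, prepending the system text to the first user message, can be automated for message lists that arrive with a separate system entry. The sketch below assumes the "System: ..." prefix used in the example above; fold_system_into_user is a hypothetical helper name, not a library function.

```python
def fold_system_into_user(messages):
    """Return a new message list with a leading system message merged
    into the first user message, using the "System: ..." prefix.

    `fold_system_into_user` is a hypothetical helper for illustration.
    """
    if not messages or messages[0]["role"] != "system":
        return list(messages)  # nothing to fold
    system, rest = messages[0], messages[1:]
    if not rest or rest[0]["role"] != "user":
        raise ValueError("a system message must be followed by a user message")
    merged = {
        "role": "user",
        "content": f"System: {system['content']}\n\n{rest[0]['content']}",
    }
    # Build a new list so the caller's messages are left untouched.
    return [merged] + list(rest[1:])


messages = [
    {"role": "system", "content": "You are a helpful assistant that responds concisely."},
    {"role": "user", "content": "What is the boiling point of water?"},
]
folded = fold_system_into_user(messages)
print(folded[0]["content"])
```

The folded list can then be passed to apply_chat_template in place of the original messages.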
Common Formatting Mistakes
Using the Llama format
The Llama 2 prompt format uses [INST] / [/INST] tokens (Llama 3 uses a different, header-based format). Neither is meaningful to Gemma 4:
# Wrong — Llama format
[INST] What is Gemma 4? [/INST]
# Correct — Gemma 4 format
<start_of_turn>user
What is Gemma 4?<end_of_turn>
<start_of_turn>model
Using <s> as a BOS token
Llama 2 models use <s> as the beginning-of-sequence token. Gemma 4's BOS token is <bos>. Adding <s> at the start of your prompt puts text there that Gemma 4 was not trained to see in that position, and every later token is then conditioned on it:
# Wrong
<s><start_of_turn>user
# Correct (let the tokenizer add <bos> automatically)
<start_of_turn>user
When using apply_chat_template, the tokenizer adds <bos> automatically. Do not add it manually.
Forgetting add_generation_prompt=True
Without add_generation_prompt=True, apply_chat_template does not append <start_of_turn>model\n at the end. The model then has no cue to start generating a response and may continue the conversation as a user instead.
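The mistakes in this section can be caught mechanically before a prompt reaches the model. The following is a heuristic sketch: find_prompt_problems and the marker list are hypothetical names chosen for illustration, and a plain substring search will also flag these markers when they appear legitimately inside user content.

```python
# Markers from other models' templates that this guide flags as mistakes.
FOREIGN_MARKERS = ("[INST]", "[/INST]", "<|im_start|>", "<|im_end|>", "<s>")
GENERATION_CUE = "<start_of_turn>model\n"


def find_prompt_problems(prompt):
    """Return a list of human-readable problems found in a raw prompt string.

    A heuristic sketch, not an official validator: an empty list only
    means none of these specific mistakes were detected.
    """
    problems = []
    for marker in FOREIGN_MARKERS:
        if marker in prompt:
            problems.append(f"foreign template marker {marker!r} found")
    if prompt.count("<bos>") > 1:
        problems.append("more than one <bos> token")
    if not prompt.endswith(GENERATION_CUE):
        problems.append("prompt does not end with the model generation cue")
    return problems


bad = "<s>[INST] What is Gemma 4? [/INST]"
good = "<start_of_turn>user\nWhat is Gemma 4?<end_of_turn>\n<start_of_turn>model\n"
print(find_prompt_problems(bad))
print(find_prompt_problems(good))
```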
Tool Calling Format
When using function calling (tool use), apply_chat_template with a tools argument handles the tool schema injection automatically:
tools = [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current weather for a city",
"parameters": {
"type": "object",
"properties": {
"city": {"type": "string"}
},
"required": ["city"],
},
},
}
]
inputs = tokenizer.apply_chat_template(
messages,
tools=tools,
return_tensors="pt",
return_dict=True,
add_generation_prompt=True,
).to(model.device)
The tokenizer inserts the tool schema into the prompt in the format Gemma 4 was trained on. Do not manually serialise the tool definitions as JSON and insert them into the user message: the placement and formatting must match the training distribution exactly.
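A malformed tool definition can be hard to spot once it has been rendered into the prompt, so a small structural check before calling apply_chat_template may help. The sketch below covers only the shape used in the example above. check_tool_schema is a hypothetical helper, not part of transformers, and it is not a full JSON Schema validator.

```python
def check_tool_schema(tool):
    """Return a list of structural problems in one tool definition.

    Checks only the dictionary shape used in this guide's example.
    """
    problems = []
    if tool.get("type") != "function":
        problems.append('"type" should be "function"')
    function = tool.get("function", {})
    for key in ("name", "description", "parameters"):
        if key not in function:
            problems.append(f'"function" is missing {key!r}')
    parameters = function.get("parameters", {})
    properties = parameters.get("properties", {})
    # Every required parameter must also be declared under "properties".
    for name in parameters.get("required", []):
        if name not in properties:
            problems.append(f"required parameter {name!r} is not declared in properties")
    return problems


weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}
print(check_tool_schema(weather_tool))
```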
Verify Your Prompt
To inspect the exact string that will be sent to the model, print the rendered prompt before running inference:
prompt_str = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
)
print(repr(prompt_str))
Comparing this output against the expected format is the fastest way to debug formatting issues.
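When the two strings are long, a line-by-line diff makes the mismatch easier to find than reading them side by side. This is a minimal sketch using the standard library's difflib; diff_prompts is a hypothetical helper, and the expected string here is written by hand from the format described in this guide.

```python
import difflib


def diff_prompts(expected, actual):
    """Return a unified diff of two prompt strings as a list of lines.

    Each prompt line is shown via repr() so trailing spaces and
    newlines stay visible in the diff.
    """
    expected_lines = [repr(line) for line in expected.splitlines(keepends=True)]
    actual_lines = [repr(line) for line in actual.splitlines(keepends=True)]
    return list(
        difflib.unified_diff(
            expected_lines,
            actual_lines,
            fromfile="expected",
            tofile="actual",
            lineterm="",
        )
    )


expected = "<start_of_turn>user\nHi<end_of_turn>\n<start_of_turn>model\n"
actual = "<start_of_turn>user\nHi<end_of_turn>\n"  # generation cue is missing
for line in diff_prompts(expected, actual):
    print(line)
```

An empty result means the two strings are identical; any line prefixed with "-" is present in the expected prompt but absent from the actual one.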