# Run Gemma 4 with llama.cpp
Build and run Gemma 4 GGUF models with llama.cpp. Covers compilation for CPU, CUDA, and Apple Metal, plus quantization options for all Gemma 4 variants.
llama.cpp is the most flexible runtime for Gemma 4. It compiles to native code for CPU (no GPU required), CUDA, Apple Metal, and Vulkan. It is the best choice when you need maximum control over quantization or are deploying Gemma 4 on diverse hardware.
## Build llama.cpp

```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
```

CPU-only build:

```bash
make -j$(nproc)
```

CUDA build (NVIDIA):

```bash
make -j$(nproc) GGML_CUDA=1
```

Apple Metal build (`nproc` is not available on macOS, so use `sysctl` instead):

```bash
make -j$(sysctl -n hw.ncpu) GGML_METAL=1
```

## Download Gemma 4 GGUF
```bash
pip install huggingface_hub
huggingface-cli download bartowski/gemma-4-4b-it-GGUF \
  --include "*Q4_K_M*" \
  --local-dir ./models
```

Available GGUF models for Gemma 4:
| Model | Hugging Face repo | VRAM (approx.) |
|---|---|---|
| E2B Q4_K_M | bartowski/gemma-4-2b-it-GGUF | 2 GB |
| E4B Q4_K_M | bartowski/gemma-4-4b-it-GGUF | 3 GB |
| 26B A4B Q4_K_M | bartowski/gemma-4-26b-a4b-it-GGUF | 14 GB |
| 31B Q4_K_M | bartowski/gemma-4-31b-it-GGUF | 18 GB |
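The same filtered download can be scripted with the `huggingface_hub` Python API instead of the CLI. A sketch, using the repo IDs from the table above (the `download_args` helper is ours, not part of the library):

```python
# Sketch: fetch one quantization of a Gemma 4 GGUF repo from Python.
# Mirrors the huggingface-cli example: filter by pattern, save locally.

def download_args(repo_id: str, quant: str = "Q4_K_M",
                  local_dir: str = "./models") -> dict:
    """Build keyword arguments for huggingface_hub.snapshot_download."""
    return {
        "repo_id": repo_id,
        "allow_patterns": [f"*{quant}*"],  # same filter as --include
        "local_dir": local_dir,
    }

if __name__ == "__main__":
    # Requires: pip install huggingface_hub
    from huggingface_hub import snapshot_download

    args = download_args("bartowski/gemma-4-4b-it-GGUF")
    print(args)
    # Uncomment to actually download (several GB):
    # snapshot_download(**args)
```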
## Run Gemma 4 Interactively
```bash
./llama-cli -m models/gemma-4-4b-it-Q4_K_M.gguf \
  --chat-template gemma \
  -n 512 \
  -p "You are a helpful assistant."
```

Always use `--chat-template gemma` when running Gemma 4. Without it, Gemma 4 produces garbled output because its BOS/EOS tokens differ from the Llama defaults.
## Run Gemma 4 as an API Server
```bash
./llama-server \
  -m models/gemma-4-4b-it-Q4_K_M.gguf \
  --chat-template gemma \
  --port 8080 \
  -ngl 99
```

The `-ngl 99` flag offloads all layers to the GPU (Metal or CUDA). Remove it for CPU-only Gemma 4 inference.
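llama-server exposes an OpenAI-compatible `/v1/chat/completions` endpoint, so any HTTP client can talk to it. A minimal stdlib sketch against the server started above (the `chat_request` helper is illustrative, not part of llama.cpp):

```python
# Minimal client for llama-server's OpenAI-compatible chat endpoint.
# Host/port match the llama-server command above.
import json
import urllib.request

def chat_request(prompt: str,
                 url: str = "http://localhost:8080/v1/chat/completions"
                 ) -> urllib.request.Request:
    """Build the HTTP request; the caller decides when to send it."""
    body = json.dumps({
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }).encode()
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )

if __name__ == "__main__":
    req = chat_request("Explain quantization in one sentence.")
    with urllib.request.urlopen(req) as resp:  # requires a running server
        print(json.loads(resp.read())["choices"][0]["message"]["content"])
```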
## Gemma 4 Thinking Mode in llama.cpp
Thinking mode is triggered via the system prompt — no special flag is needed:
```bash
./llama-cli -m models/gemma-4-4b-it-Q4_K_M.gguf \
  --chat-template gemma \
  -sys "Enable extended thinking for complex questions." \
  -n 1024
```

Gemma 4 will produce internal reasoning traces before giving its final answer, identical to the behaviour seen in the hosted API.
## Quantization Guide
| Quant | Size (E4B) | Notes |
|---|---|---|
| Q2_K | 1.8 GB | Fast, lower quality |
| Q4_K_M | 3.0 GB | Recommended |
| Q6_K | 3.9 GB | High quality |
| Q8_0 | 4.7 GB | Near-lossless |
Q4_K_M strikes the best balance between Gemma 4 output quality and memory usage for most hardware. Q8_0 is useful when you want near-lossless Gemma 4 inference and have sufficient VRAM.
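The sizes in the table roughly track each scheme's bits per weight. A back-of-the-envelope estimator (the bpw figures are approximate nominal values; real GGUF files run somewhat larger because some tensors, such as embeddings and the output head, stay at higher precision, so treat the result as a lower bound):

```python
# Rough GGUF size estimate: parameters x bits-per-weight / 8.
# bpw values are approximate nominal figures for llama.cpp quant schemes.
BITS_PER_WEIGHT = {"Q2_K": 2.6, "Q4_K_M": 4.8, "Q6_K": 6.6, "Q8_0": 8.5}

def estimate_gb(params_billions: float, quant: str) -> float:
    """Lower-bound file size in GB for a given parameter count."""
    bits = params_billions * 1e9 * BITS_PER_WEIGHT[quant]
    return round(bits / 8 / 1e9, 1)

for q in BITS_PER_WEIGHT:
    print(q, estimate_gb(4.0, q), "GB")  # e.g. Q4_K_M -> 2.4 GB
```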
## Troubleshooting
- **`<unused24>` tokens appear in output:** the `--chat-template gemma` flag is missing. See the fix guide for details.
- **CUDA out of memory:** reduce the `-ngl` value to offload fewer Gemma 4 layers to the GPU. Start with `-ngl 20` and increase until you hit the limit.
- **Build fails on macOS:** make sure the Xcode Command Line Tools are installed with `xcode-select --install`, then retry the Metal build.