# Run Gemma 4 with llama.cpp
Build and run Gemma 4 GGUF models with llama.cpp. Covers compilation for CPU, CUDA, and Apple Metal, plus quantization options for all Gemma 4 variants.
llama.cpp is the most flexible runtime for Gemma 4. It compiles to native code for CPU (no GPU required), CUDA, Apple Metal, and Vulkan. It is the best choice when you need maximum control over quantization or are deploying Gemma 4 on diverse hardware.
## Build llama.cpp

```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
```

CPU-only build:

```bash
make -j$(nproc)
```

CUDA build (NVIDIA):

```bash
make -j$(nproc) GGML_CUDA=1
```

Apple Metal build (`nproc` is not available on macOS, so use `sysctl` instead):

```bash
make -j$(sysctl -n hw.ncpu) GGML_METAL=1
```

## Download Gemma 4 GGUF
```bash
pip install huggingface_hub
huggingface-cli download bartowski/gemma-4-4b-it-GGUF \
  --include "*Q4_K_M*" \
  --local-dir ./models
```

Available GGUF models for Gemma 4:
| Model | Hugging Face repo | VRAM (approx.) |
|---|---|---|
| E2B Q4_K_M | bartowski/gemma-4-2b-it-GGUF | 2 GB |
| E4B Q4_K_M | bartowski/gemma-4-4b-it-GGUF | 3 GB |
| 26B A4B Q4_K_M | bartowski/gemma-4-26b-a4b-it-GGUF | 14 GB |
| 31B Q4_K_M | bartowski/gemma-4-31b-it-GGUF | 18 GB |
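The same filtered download can be scripted with the `huggingface_hub` Python API instead of the CLI. A sketch, using the repo IDs from the table above (the `download_args` helper is ours, not part of the library):

```python
# Sketch: fetch one quantization of a Gemma 4 GGUF repo from Python.
# Mirrors the huggingface-cli example: filter by pattern, save locally.

def download_args(repo_id: str, quant: str = "Q4_K_M",
                  local_dir: str = "./models") -> dict:
    """Build keyword arguments for huggingface_hub.snapshot_download."""
    return {
        "repo_id": repo_id,
        "allow_patterns": [f"*{quant}*"],  # same filter as --include
        "local_dir": local_dir,
    }

if __name__ == "__main__":
    # Requires: pip install huggingface_hub
    from huggingface_hub import snapshot_download

    args = download_args("bartowski/gemma-4-4b-it-GGUF")
    print(args)
    # Uncomment to actually download (several GB):
    # snapshot_download(**args)
```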
## Run Gemma 4 Interactively
```bash
./llama-cli -m models/gemma-4-4b-it-Q4_K_M.gguf \
  --chat-template gemma \
  -n 512 \
  -p "You are a helpful assistant."
```

Always use `--chat-template gemma` when running Gemma 4. Without it, Gemma 4 produces garbled output because its BOS/EOS tokens differ from the Llama defaults.
## Run Gemma 4 as an API Server
```bash
./llama-server \
  -m models/gemma-4-4b-it-Q4_K_M.gguf \
  --chat-template gemma \
  --port 8080 \
  -ngl 99
```

The `-ngl 99` flag offloads all layers to the GPU (Metal or CUDA). Remove it for CPU-only Gemma 4 inference.
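llama-server exposes an OpenAI-compatible `/v1/chat/completions` endpoint, so any HTTP client can talk to it. A minimal stdlib sketch against the server started above (the `chat_request` helper is illustrative, not part of llama.cpp):

```python
# Minimal client for llama-server's OpenAI-compatible chat endpoint.
# Host/port match the llama-server command above.
import json
import urllib.request

def chat_request(prompt: str,
                 url: str = "http://localhost:8080/v1/chat/completions"
                 ) -> urllib.request.Request:
    """Build the HTTP request; the caller decides when to send it."""
    body = json.dumps({
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }).encode()
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )

if __name__ == "__main__":
    req = chat_request("Explain quantization in one sentence.")
    with urllib.request.urlopen(req) as resp:  # requires a running server
        print(json.loads(resp.read())["choices"][0]["message"]["content"])
```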
## Gemma 4 Thinking Mode in llama.cpp
Thinking mode is triggered via the system prompt — no special flag is needed:
```bash
./llama-cli -m models/gemma-4-4b-it-Q4_K_M.gguf \
  --chat-template gemma \
  -sys "Enable extended thinking for complex questions." \
  -n 1024
```

Gemma 4 will produce internal reasoning traces before giving its final answer, identical to the behaviour seen in the hosted API.
## Quantization Guide
| Quant | Size (E4B) | Notes |
|---|---|---|
| Q2_K | 1.8 GB | Fast, lower quality |
| Q4_K_M | 3.0 GB | Recommended |
| Q6_K | 3.9 GB | High quality |
| Q8_0 | 4.7 GB | Near-lossless |
Q4_K_M strikes the best balance between Gemma 4 output quality and memory usage for most hardware. Q8_0 is useful when you want near-lossless Gemma 4 inference and have sufficient VRAM.
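The sizes in the table roughly track each scheme's bits per weight. A back-of-the-envelope estimator (the bpw figures are approximate nominal values; real GGUF files run somewhat larger because some tensors, such as embeddings and the output head, stay at higher precision, so treat the result as a lower bound):

```python
# Rough GGUF size estimate: parameters x bits-per-weight / 8.
# bpw values are approximate nominal figures for llama.cpp quant schemes.
BITS_PER_WEIGHT = {"Q2_K": 2.6, "Q4_K_M": 4.8, "Q6_K": 6.6, "Q8_0": 8.5}

def estimate_gb(params_billions: float, quant: str) -> float:
    """Lower-bound file size in GB for a given parameter count."""
    bits = params_billions * 1e9 * BITS_PER_WEIGHT[quant]
    return round(bits / 8 / 1e9, 1)

for q in BITS_PER_WEIGHT:
    print(q, estimate_gb(4.0, q), "GB")  # e.g. Q4_K_M -> 2.4 GB
```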
## Troubleshooting
- **`<unused24>` tokens appear in output:** the `--chat-template gemma` flag is missing. See the fix guide for details.
- **CUDA out of memory:** reduce the `-ngl` value to offload fewer Gemma 4 layers to the GPU. Start with `-ngl 20` and increase until you hit the limit.
- **Build fails on macOS:** make sure the Xcode Command Line Tools are installed with `xcode-select --install`, then retry the Metal build.