
Run Gemma 4 with llama.cpp

Build and run Gemma 4 GGUF models with llama.cpp. Covers compilation for CPU, CUDA, and Apple Metal, plus quantization options for all Gemma 4 variants.

llama.cpp is the most flexible runtime for Gemma 4. It compiles to native code for CPU (no GPU required), CUDA, Apple Metal, and Vulkan. It is the best choice when you need maximum control over quantization or are deploying Gemma 4 on diverse hardware.

Build llama.cpp

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

CPU-only build:

make -j$(nproc)

CUDA build (NVIDIA):

make -j$(nproc) GGML_CUDA=1

Apple Metal build:

make -j$(nproc) GGML_METAL=1
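
Whichever backend you build, it is worth sanity-checking that the binaries used in the rest of this guide were actually produced. A minimal sketch, run from the llama.cpp directory:

```shell
# Confirm the build produced the two binaries used in the rest of this guide.
for bin in llama-cli llama-server; do
  if [ -x "./$bin" ]; then
    echo "$bin: ok"
  else
    echo "$bin: missing (build may have failed)"
  fi
done
```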

Download Gemma 4 GGUF

pip install huggingface_hub
huggingface-cli download bartowski/gemma-4-4b-it-GGUF \
  --include "*Q4_K_M*" \
  --local-dir ./models

Available GGUF models for Gemma 4:

Model            Tag                                VRAM
E2B Q4_K_M       bartowski/gemma-4-2b-it-GGUF       2 GB
E4B Q4_K_M       bartowski/gemma-4-4b-it-GGUF       3 GB
26B A4B Q4_K_M   bartowski/gemma-4-26b-a4b-it-GGUF  14 GB
31B Q4_K_M       bartowski/gemma-4-31b-it-GGUF      18 GB

Run Gemma 4 Interactively

./llama-cli -m models/gemma-4-4b-it-Q4_K_M.gguf \
  --chat-template gemma \
  -n 512 \
  -p "You are a helpful assistant."

Always use --chat-template gemma when running Gemma 4. Without it, Gemma 4 produces garbled output because its BOS/EOS tokens differ from the Llama defaults.
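
For reference, the template wraps each turn in Gemma's turn markers. A sketch of the raw prompt layout, assuming Gemma 4 keeps the `<start_of_turn>`/`<end_of_turn>` convention of earlier Gemma releases:

```shell
# Raw prompt layout produced by --chat-template gemma (sketch; the turn
# markers shown are the ones used by earlier Gemma releases).
PROMPT='<start_of_turn>user
You are a helpful assistant.<end_of_turn>
<start_of_turn>model
'
printf '%s' "$PROMPT"
```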

Run Gemma 4 as an API Server

./llama-server \
  -m models/gemma-4-4b-it-Q4_K_M.gguf \
  --chat-template gemma \
  --port 8080 \
  -ngl 99

The -ngl 99 flag offloads all layers to the GPU (Metal or CUDA). Remove it for CPU-only Gemma 4 inference.
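
Once the server is up, it speaks an OpenAI-compatible HTTP API. A minimal chat request body, as a sketch (the "model" field is informational here, since llama-server serves whichever GGUF it was started with):

```shell
# Minimal OpenAI-style payload for llama-server's /v1/chat/completions endpoint.
PAYLOAD='{"model":"gemma-4-4b-it","messages":[{"role":"user","content":"Hello"}],"max_tokens":64}'
echo "$PAYLOAD"
```

Send it with, for example, `curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d "$PAYLOAD"`.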

Gemma 4 Thinking Mode in llama.cpp

Thinking mode is triggered via the system prompt — no special flag is needed:

./llama-cli -m models/gemma-4-4b-it-Q4_K_M.gguf \
  --chat-template gemma \
  -sys "Enable extended thinking for complex questions." \
  -n 1024

Gemma 4 will produce internal reasoning traces before giving its final answer, identical to the behaviour seen in the hosted API.

Quantization Guide

Quant    Size (E4B)   Notes
Q2_K     1.8 GB       Fast, lower quality
Q4_K_M   3.0 GB       Recommended
Q6_K     3.9 GB       High quality
Q8_0     4.7 GB       Near-lossless

Q4_K_M strikes the best balance between Gemma 4 output quality and memory usage for most hardware. Q8_0 is useful when you want near-lossless Gemma 4 inference and have sufficient VRAM.
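
As a rough rule of thumb, the on-disk size is about parameters × bits-per-weight ÷ 8; real GGUF files run somewhat higher because of metadata, quantization scales, and tensors kept at higher precision. A sketch of that arithmetic (`estimate_gb` is a hypothetical helper, not a llama.cpp tool):

```shell
# Lower-bound GGUF size estimate: parameters (billions) x bits per weight / 8 = GB.
# estimate_gb is a hypothetical helper, not part of llama.cpp.
estimate_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}
estimate_gb 4 8    # a 4B-parameter model at ~8 bits per weight (Q8_0-class)
```

The table's 4.7 GB for Q8_0 sits above this 4.0 GB floor for exactly the overhead reasons above.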

Troubleshooting

<unused24> tokens appear in output: This means --chat-template gemma is missing. See the fix guide for details.

CUDA out of memory: Reduce the -ngl value to offload fewer Gemma 4 layers to the GPU. Start with -ngl 20 and increase until you hit the limit.

Build fails on macOS: Make sure Xcode Command Line Tools are installed with xcode-select --install, then retry the Metal build.

© 2026 gemma4.dev All Rights Reserved.