Run Gemma 4 with GGUF (llama.cpp / LM Studio)
Download and run Gemma 4 GGUF quantized models for use with llama.cpp, LM Studio, and any GGUF-compatible runtime. Includes quantization guide.
GGUF is the file format used by llama.cpp, LM Studio, Jan, and other local inference tools. GGUF Gemma 4 models are pre-quantized, making them easy to download and run without GPU-specific setup.
Where to Download Gemma 4 GGUF Files
The best source for Gemma 4 GGUF models is Hugging Face. Search for gemma-4 with the GGUF filter, or use the Bartowski quantizations, which are well tested:
```shell
# Using huggingface-cli
pip install huggingface_hub
huggingface-cli download bartowski/gemma-4-4b-it-GGUF --include "*Q4_K_M*"
```

Quantization Levels for Gemma 4
| Quant | Quality | Size (4B) | Recommended For |
|---|---|---|---|
| Q2_K | Low | 1.8 GB | RAM-constrained laptops |
| Q4_K_M | Good | 3.0 GB | Most users |
| Q6_K | High | 3.9 GB | Accuracy-sensitive tasks |
| Q8_0 | Best | 4.7 GB | 8GB+ VRAM, max quality |
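As a quick sanity check against the sizes in the table, you can compare a quant's file size plus runtime overhead to your available memory. This is a rough sketch: the function name is illustrative, and the 2 GB overhead figure is an assumption covering KV cache and runtime buffers, not a llama.cpp constant.

```shell
# Rough fit check: model file size plus runtime overhead (KV cache,
# buffers) must fit in available memory. The 2 GB overhead is an
# assumption; actual usage grows with context length.
fits_in_ram() {
  local model_gb=$1 avail_gb=$2 overhead_gb=2
  if [ $((model_gb + overhead_gb)) -le "$avail_gb" ]; then
    echo yes
  else
    echo no
  fi
}

fits_in_ram 3 8   # Q4_K_M (~3.0 GB) on an 8 GB machine → yes
fits_in_ram 5 4   # Q8_0 (~4.7 GB) on a 4 GB machine → no
```

In practice, leave headroom beyond this estimate if you plan to use long contexts.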
Run Gemma 4 with llama.cpp
```shell
# Clone and build llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make -j$(nproc)

# Run Gemma 4 4B
./llama-cli -m gemma-4-4b-it-Q4_K_M.gguf \
  -p "You are a helpful AI assistant." \
  --chat-template gemma \
  -n 512
```

Run Gemma 4 with LM Studio
- Open LM Studio and search for gemma4
- Select the GGUF model size appropriate for your hardware
- Download and load the model
- Use the built-in chat interface or the local server for API access
LM Studio automatically handles the Gemma 4 chat template and system prompt format.
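LM Studio's local server speaks the OpenAI-compatible API, by default on port 1234. A minimal request might look like the sketch below; the prompt and max_tokens are illustrative, and the server must first be started from LM Studio's server tab.

```shell
# Request body for a chat completion (contents are illustrative).
REQ='{"messages":[{"role":"user","content":"Summarize GGUF in one sentence."}],"max_tokens":128}'

# LM Studio's local server defaults to port 1234.
curl -s http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$REQ" || echo "request failed (is the server running?)"
```

Because the API is OpenAI-compatible, existing OpenAI client libraries can point at this endpoint by changing only the base URL.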
Gemma 4 Chat Template in GGUF
When using llama.cpp directly, always specify the Gemma chat template:
```shell
./llama-server -m gemma-4-4b-it-Q4_K_M.gguf \
  --chat-template gemma \
  --port 8080
```

Without the --chat-template gemma flag, the model may produce malformed responses due to incorrect token formatting.