Run Gemma 4 with GGUF (llama.cpp / LM Studio)
Download and run Gemma 4 GGUF quantized models for use with llama.cpp, LM Studio, and any GGUF-compatible runtime. Includes quantization guide.
GGUF is the file format used by llama.cpp, LM Studio, Jan, and other local inference tools. GGUF Gemma 4 models are pre-quantized, making them easy to download and run without GPU-specific setup.
Where to Download Gemma 4 GGUF Files
The best source for Gemma 4 GGUF models is Hugging Face. Search for gemma-4 with the GGUF filter, or use the Bartowski quantizations, which are well tested:
```shell
# Using huggingface-cli
pip install huggingface_hub
huggingface-cli download bartowski/gemma-4-4b-it-GGUF --include "*Q4_K_M*"
```

Quantization Levels for Gemma 4
| Quant | Quality | Size (4B) | Recommended For |
|---|---|---|---|
| Q2_K | Low | 1.8 GB | RAM-constrained laptops |
| Q4_K_M | Good | 3.0 GB | Most users |
| Q6_K | High | 3.9 GB | Accuracy-sensitive tasks |
| Q8_0 | Best | 4.7 GB | 8GB+ VRAM, max quality |
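As a quick sanity check against the sizes in the table, you can compare a quant's file size plus runtime overhead to your available memory. This is a rough sketch: the function name is illustrative, and the 2 GB overhead figure is an assumption covering KV cache and runtime buffers, not a llama.cpp constant.

```shell
# Rough fit check: model file size plus runtime overhead (KV cache,
# buffers) must fit in available memory. The 2 GB overhead is an
# assumption; actual usage grows with context length.
fits_in_ram() {
  local model_gb=$1 avail_gb=$2 overhead_gb=2
  if [ $((model_gb + overhead_gb)) -le "$avail_gb" ]; then
    echo yes
  else
    echo no
  fi
}

fits_in_ram 3 8   # Q4_K_M (~3.0 GB) on an 8 GB machine → yes
fits_in_ram 5 4   # Q8_0 (~4.7 GB) on a 4 GB machine → no
```

In practice, leave headroom beyond this estimate if you plan to use long contexts.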
Run Gemma 4 with llama.cpp
```shell
# Clone and build llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make -j$(nproc)

# Run Gemma 4 4B
./llama-cli -m gemma-4-4b-it-Q4_K_M.gguf \
  -p "You are a helpful AI assistant." \
  --chat-template gemma \
  -n 512
```

Run Gemma 4 with LM Studio
- Open LM Studio and search for gemma4
- Select the GGUF model size appropriate for your hardware
- Download and load the model
- Use the built-in chat interface or the local server for API access
LM Studio automatically handles the Gemma 4 chat template and system prompt format.
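LM Studio's local server speaks the OpenAI-compatible API, by default on port 1234. A minimal request might look like the sketch below; the prompt and max_tokens are illustrative, and the server must first be started from LM Studio's server tab.

```shell
# Request body for a chat completion (contents are illustrative).
REQ='{"messages":[{"role":"user","content":"Summarize GGUF in one sentence."}],"max_tokens":128}'

# LM Studio's local server defaults to port 1234.
curl -s http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$REQ" || echo "request failed (is the server running?)"
```

Because the API is OpenAI-compatible, existing OpenAI client libraries can point at this endpoint by changing only the base URL.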
Gemma 4 Chat Template in GGUF
When using llama.cpp directly, always specify the Gemma chat template:
```shell
./llama-server -m gemma-4-4b-it-Q4_K_M.gguf \
  --chat-template gemma \
  --port 8080
```

Without the --chat-template gemma flag, the model may produce malformed responses due to incorrect token formatting.