Quantization

Q4, Q8, FP16, and GGUF explained. How to choose the right quantization for your hardware.
Quantization stores model weights at reduced numerical precision, letting larger models fit in less VRAM with minimal quality loss.
Formats at a glance
| Format | Bits/weight | VRAM vs FP16 | Quality loss | Use when |
|---|---|---|---|---|
| FP16 | 16 | 1× (baseline) | None | You have enough VRAM |
| Q8_0 | 8 | ~0.5× | Negligible | Tight on VRAM, want max quality |
| Q4_K_M | 4 | ~0.25× | Minor | Best balance — recommended default |
| Q2_K | 2 | ~0.12× | Significant | CPU-only, very constrained hardware |
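The VRAM column above follows directly from bits per weight: weight memory is roughly parameter count × bits ÷ 8. A minimal sketch (an illustrative helper, not from any library):

```python
def weight_vram_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory needed for model weights alone, in GB.

    Ignores KV cache and activation overhead, which add several GB on top.
    """
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# A 27B model: FP16 baseline vs. a 4-bit K-quant
# (K-quants average ~4.5 effective bits/weight, hence slightly over 0.25x)
fp16_gb = weight_vram_gb(27, 16)   # 54.0 GB
q4_gb = weight_vram_gb(27, 4.5)    # ~15.2 GB
```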
GGUF quantization (llama.cpp / Ollama)
GGUF files use a naming convention: Q{bits}_{variant}.
- Q4_K_M — 4-bit, K-quant, medium. Recommended default.
- Q4_K_S — 4-bit, K-quant, small. Slightly lower quality, smaller file.
- Q8_0 — 8-bit. Near-lossless, ~2× larger than Q4.
- Q4_0 — Legacy 4-bit. Lower quality than K-quants.
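The Q{bits}_{variant} scheme is mechanical enough to parse. A quick sketch (illustrative only; `parse_quant_name` is not part of llama.cpp or Ollama):

```python
import re

def parse_quant_name(name: str) -> dict:
    """Split a GGUF quant name like 'Q4_K_M' into bits and variant."""
    m = re.fullmatch(r"Q(\d+)_([A-Z0-9_]+)", name)
    if not m:
        raise ValueError(f"not a recognized quant name: {name}")
    return {"bits": int(m.group(1)), "variant": m.group(2)}

parse_quant_name("Q4_K_M")  # {'bits': 4, 'variant': 'K_M'}
parse_quant_name("Q8_0")    # {'bits': 8, 'variant': '0'}
```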
```shell
# Pull a specific quantization with Ollama
ollama pull gemma4:4b-q8_0
ollama pull gemma4:4b-q4_K_M
```

Choosing quantization
Start with Q4_K_M. It gives you:
- ~25% of full-precision VRAM
- ~95–98% of full-precision benchmark scores on most tasks
Move to Q8 if:
- You're doing precise mathematical reasoning
- You notice visible quality degradation on your specific workload
- You have the VRAM headroom
Move to Q2 only if:
- You're on CPU with very limited RAM
- Speed is more important than quality
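The guidance above can be condensed into a small decision helper (the flag names and the mapping are illustrative assumptions drawn from this page, not official recommendations):

```python
def pick_quant(cpu_only_low_ram: bool = False,
               precise_math: bool = False,
               vram_headroom: bool = False) -> str:
    """Map the decision rules above onto a GGUF quant name."""
    if cpu_only_low_ram:
        return "Q2_K"    # very constrained hardware: speed over quality
    if precise_math or vram_headroom:
        return "Q8_0"    # near-lossless when you can afford it
    return "Q4_K_M"      # recommended default

pick_quant()                        # 'Q4_K_M'
pick_quant(precise_math=True)       # 'Q8_0'
pick_quant(cpu_only_low_ram=True)   # 'Q2_K'
```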
Gemma 4's MoE architecture (26B A4B) is particularly efficient at Q4. The sparse activation pattern means quantization errors affect fewer computations per token.
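A back-of-envelope illustration of that point, assuming K-quants average roughly 4.5 effective bits per weight (an approximation; exact rates vary by tensor):

```python
# 26B total parameters, ~4B active per token (MoE sparse activation).
total_params = 26e9
active_params = 4e9
bits_per_weight = 4.5  # assumed effective rate for Q4 K-quants

# All experts must sit in memory...
weights_gb = total_params * bits_per_weight / 8 / 1e9   # ~14.6 GB
# ...but only the active subset is read per token, so quantization
# error touches far fewer computations per forward pass.
active_gb = active_params * bits_per_weight / 8 / 1e9   # ~2.3 GB
```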
Hugging Face bitsandbytes
For Python-based inference, use bitsandbytes for in-memory quantization:
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-4b",
    quantization_config=quant_config,
    device_map="auto",
)
```

AWQ and GPTQ
For production serving (vLLM), pre-quantized AWQ and GPTQ checkpoints give better throughput than runtime quantization:
```shell
# vLLM with AWQ
vllm serve google/gemma-4-27b-it-AWQ --quantization awq
```

AWQ typically outperforms GPTQ at the same bit-width on modern GPUs.