Quantization

Q4, Q8, FP16, and GGUF explained. How to choose the right quantization for your hardware.
Quantization reduces the precision of model weights so that larger models fit in less VRAM, with minimal quality loss.

Formats at a glance

| Format | Bits/weight | VRAM vs FP16 | Quality loss | Use when |
|--------|-------------|--------------|--------------|----------|
| FP16   | 16 | 1× (baseline) | None | You have enough VRAM |
| Q8_0   | 8  | ~0.5×  | Negligible  | Tight on VRAM, want max quality |
| Q4_K_M | 4  | ~0.25× | Minor       | Best balance (recommended default) |
| Q2_K   | 2  | ~0.12× | Significant | CPU-only, very constrained hardware |
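The VRAM ratios in the table follow directly from bits per weight. A quick back-of-the-envelope estimate for weight memory (an illustrative helper, not part of any runtime; it ignores KV cache and activation overhead):

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB: parameter count x bits per weight.

    Ignores KV cache and activations, which add overhead on top.
    """
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# A 4B-parameter model at different precisions:
fp16 = weight_memory_gb(4, 16)  # 8.0 GB
q8   = weight_memory_gb(4, 8)   # 4.0 GB
q4   = weight_memory_gb(4, 4)   # 2.0 GB
```

This is why Q4 lands near 0.25× of FP16: 4 bits is a quarter of 16.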

GGUF quantization (llama.cpp / Ollama)

GGUF files use a naming convention: Q{bits}_{variant}.

  • Q4_K_M — 4-bit, K-quant, medium. Recommended default.
  • Q4_K_S — 4-bit, K-quant, small. Slightly lower quality, smaller file.
  • Q8_0 — 8-bit. Near-lossless, ~2× larger than Q4.
  • Q4_0 — Legacy 4-bit. Lower quality than K-quants.
# Pull a specific quantization with Ollama
ollama pull gemma4:4b-q8_0
ollama pull gemma4:4b-q4_K_M

Choosing quantization

Start with Q4_K_M. It gives you:

  • ~25% of full-precision VRAM
  • ~95–98% of full-precision benchmark scores on most tasks

Move to Q8 if:

  • You're doing precise mathematical reasoning
  • You notice visible quality degradation on your specific workload
  • You have the VRAM headroom

Move to Q2 only if:

  • You're on CPU with very limited RAM
  • Speed is more important than quality
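The guidance above reduces to a simple rule of thumb. A hypothetical helper that encodes it, using the VRAM ratios from the table (the thresholds and function are illustrative, not from any library):

```python
def pick_quant(vram_gb: float, model_fp16_gb: float) -> str:
    """Rule-of-thumb quantization choice: prefer FP16, then fall back
    through Q8_0 and Q4_K_M; Q2_K is a last resort."""
    if vram_gb >= model_fp16_gb:
        return "FP16"
    if vram_gb >= model_fp16_gb * 0.5:   # Q8 is ~half of FP16
        return "Q8_0"
    if vram_gb >= model_fp16_gb * 0.25:  # Q4 is ~a quarter of FP16
        return "Q4_K_M"
    return "Q2_K"

# A 27B-class model (~52 GB at FP16) on a 24 GB GPU:
print(pick_quant(24, 52))  # -> Q4_K_M
```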

Gemma 4's MoE architecture (26B A4B) is particularly efficient at Q4. The sparse activation pattern means quantization errors affect fewer computations per token.

Hugging Face bitsandbytes

For Python-based inference, use bitsandbytes for in-memory quantization:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(load_in_4bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-4b",
    quantization_config=quant_config,
    device_map="auto",
)
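For 4-bit loading, bitsandbytes also supports the NF4 data type with a higher-precision compute dtype, which usually preserves more quality than plain int4. A sketch extending the config above (same checkpoint name as in the previous example; whether it applies to your setup depends on your transformers/bitsandbytes versions):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 weights with bf16 compute; double quantization also
# compresses the quantization constants themselves.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-4b",
    quantization_config=quant_config,
    device_map="auto",
)
```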

AWQ and GPTQ

For production serving (vLLM), pre-quantized AWQ and GPTQ checkpoints give better throughput than runtime quantization:

# vLLM with AWQ
vllm serve google/gemma-4-27b-it-AWQ --quantization awq

AWQ typically outperforms GPTQ at the same bit-width on modern GPUs.
