Quantization

Q4, Q8, FP16, and GGUF explained. How to choose the right quantization for your hardware.
Quantization reduces the precision of model weights so that larger models fit in less VRAM, with minimal quality loss.

Formats at a glance

| Format | Bits/weight | VRAM vs FP16 | Quality loss | Use when |
|--------|-------------|--------------|--------------|----------|
| FP16   | 16 | 1× (baseline) | None | You have enough VRAM |
| Q8_0   | 8  | ~0.5×  | Negligible  | Tight on VRAM, want max quality |
| Q4_K_M | 4  | ~0.25× | Minor       | Best balance (recommended default) |
| Q2_K   | 2  | ~0.12× | Significant | CPU-only, very constrained hardware |
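The VRAM ratios in the table follow directly from bits per weight. A quick back-of-the-envelope estimate for weight memory (an illustrative helper, not part of any runtime; it ignores KV cache and activation overhead):

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB: parameter count x bits per weight.

    Ignores KV cache and activations, which add overhead on top.
    """
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# A 4B-parameter model at different precisions:
fp16 = weight_memory_gb(4, 16)  # 8.0 GB
q8   = weight_memory_gb(4, 8)   # 4.0 GB
q4   = weight_memory_gb(4, 4)   # 2.0 GB
```

This is why Q4 lands near 0.25× of FP16: 4 bits is a quarter of 16.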

GGUF quantization (llama.cpp / Ollama)

GGUF files use a naming convention: Q{bits}_{variant}.

  • Q4_K_M — 4-bit, K-quant, medium. Recommended default.
  • Q4_K_S — 4-bit, K-quant, small. Slightly lower quality, smaller file.
  • Q8_0 — 8-bit. Near-lossless, ~2× larger than Q4.
  • Q4_0 — Legacy 4-bit. Lower quality than K-quants.
# Pull a specific quantization with Ollama
ollama pull gemma4:4b-q8_0
ollama pull gemma4:4b-q4_K_M

Choosing quantization

Start with Q4_K_M. It gives you:

  • ~25% of full-precision VRAM
  • ~95–98% of full-precision benchmark scores on most tasks

Move to Q8 if:

  • You're doing precise mathematical reasoning
  • You notice visible quality degradation on your specific workload
  • You have the VRAM headroom

Move to Q2 only if:

  • You're on CPU with very limited RAM
  • Speed is more important than quality
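The guidance above reduces to a simple rule of thumb. A hypothetical helper that encodes it, using the VRAM ratios from the table (the thresholds and function are illustrative, not from any library):

```python
def pick_quant(vram_gb: float, model_fp16_gb: float) -> str:
    """Rule-of-thumb quantization choice: prefer FP16, then fall back
    through Q8_0 and Q4_K_M; Q2_K is a last resort."""
    if vram_gb >= model_fp16_gb:
        return "FP16"
    if vram_gb >= model_fp16_gb * 0.5:   # Q8 is ~half of FP16
        return "Q8_0"
    if vram_gb >= model_fp16_gb * 0.25:  # Q4 is ~a quarter of FP16
        return "Q4_K_M"
    return "Q2_K"

# A 27B-class model (~52 GB at FP16) on a 24 GB GPU:
print(pick_quant(24, 52))  # -> Q4_K_M
```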

Gemma 4's MoE architecture (26B A4B) is particularly efficient at Q4. The sparse activation pattern means quantization errors affect fewer computations per token.

Hugging Face bitsandbytes

For Python-based inference, use bitsandbytes for in-memory quantization:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(load_in_4bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-4b",
    quantization_config=quant_config,
    device_map="auto",
)
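For 4-bit loading, bitsandbytes also supports the NF4 data type with a higher-precision compute dtype, which usually preserves more quality than plain int4. A sketch extending the config above (same checkpoint name as in the previous example; whether it applies to your setup depends on your transformers/bitsandbytes versions):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 weights with bf16 compute; double quantization also
# compresses the quantization constants themselves.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-4b",
    quantization_config=quant_config,
    device_map="auto",
)
```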

AWQ and GPTQ

For production serving (vLLM), pre-quantized AWQ and GPTQ checkpoints give better throughput than runtime quantization:

# vLLM with AWQ
vllm serve google/gemma-4-27b-it-AWQ --quantization awq

AWQ typically outperforms GPTQ at the same bit-width on modern GPUs.
