Run Gemma 4 with Hugging Face Transformers
How to load and run Gemma 4 models using the Hugging Face Transformers library in Python. Covers all Gemma 4 variants with CUDA and CPU examples.
Hugging Face Transformers gives you full control over Gemma 4 inference. This approach is best for Python pipelines, fine-tuning, and integration with the broader ML ecosystem.
Prerequisites
```shell
pip install transformers torch accelerate
```

For CUDA support, install PyTorch with a CUDA build:
```shell
pip install torch --index-url https://download.pytorch.org/whl/cu124
```

Authenticate with Hugging Face
Gemma 4 requires accepting Google's Gemma Terms of Use. Visit huggingface.co/google/gemma-4-4b-it and accept the terms, then authenticate:
```shell
huggingface-cli login
```

Load Gemma 4 E4B
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "google/gemma-4-4b-it"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half the memory of FP32; standard for inference
    device_map="auto",           # place layers on available GPUs, spill to CPU if needed
)
```

Run Inference
```python
messages = [
    {"role": "user", "content": "What are the key features of Gemma 4?"}
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # append the assistant turn marker so the model responds
    return_tensors="pt",
    return_dict=True,
).to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512)

# outputs[0] contains the prompt followed by the completion;
# slice off the prompt tokens before decoding.
response = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
)
print(response)
```

Gemma 4 with 4-bit Quantization (bitsandbytes)
For systems with 8GB VRAM, use 4-bit quantization to run Gemma 4 E4B:
```shell
pip install bitsandbytes
```

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize into BF16 for matmuls
)

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-4b-it",
    quantization_config=bnb_config,
    device_map="auto",
)
```

Available Gemma 4 Model IDs
| Model ID | Params | Notes |
|---|---|---|
| google/gemma-4-2b-it | 2.1B | Instruction-tuned E2B |
| google/gemma-4-4b-it | 4.4B | Instruction-tuned E4B |
| google/gemma-4-26b-a4b-it | 26.1B | MoE instruction-tuned |
| google/gemma-4-31b-it | 31B | Instruction-tuned flagship |
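For scripts that need to choose a variant automatically, the table can be mirrored as a small lookup keyed by parameter count. The sketch below uses a hypothetical `pick_model` helper (not part of any library) that returns the largest model whose weights fit a VRAM budget; the footprint is estimated as parameters × bytes per parameter, which ignores activation and KV-cache overhead.

```python
# Hypothetical lookup: Gemma 4 model IDs and their parameter counts
# (from the table above).
GEMMA4_MODELS = {
    "google/gemma-4-2b-it": 2.1e9,
    "google/gemma-4-4b-it": 4.4e9,
    "google/gemma-4-26b-a4b-it": 26.1e9,
    "google/gemma-4-31b-it": 31e9,
}

def pick_model(vram_gb: float, bytes_per_param: float = 2.0):
    """Return the largest model whose weights fit in vram_gb, or None.

    bytes_per_param: 2.0 for BF16, 0.5 for 4-bit quantization.
    """
    fitting = [
        (params, model_id)
        for model_id, params in GEMMA4_MODELS.items()
        if params * bytes_per_param / 1e9 <= vram_gb
    ]
    return max(fitting)[1] if fitting else None

print(pick_model(24))       # 24 GB GPU, BF16   -> google/gemma-4-4b-it
print(pick_model(24, 0.5))  # 24 GB GPU, 4-bit  -> google/gemma-4-31b-it
```

Weight memory alone understates real usage, so treat the result as an upper bound on model size rather than a guarantee that generation will fit.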
Running Gemma 4 31B on a Single GPU
Gemma 4 31B needs roughly 62GB of VRAM for its weights alone in BF16 (31B parameters × 2 bytes). For a single RTX 3090 (24GB), use 4-bit quantization:
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-31b-it",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)
```