Run Gemma 4 with Hugging Face Transformers
How to load and run Gemma 4 models using the Hugging Face Transformers library in Python. Covers all Gemma 4 variants with CUDA and CPU examples.
Hugging Face Transformers gives you full control over Gemma 4 inference. This approach is best for Python pipelines, fine-tuning, and integration with the broader ML ecosystem.
Prerequisites
```shell
pip install transformers torch accelerate
```

For CUDA support, install PyTorch with a CUDA build:
```shell
pip install torch --index-url https://download.pytorch.org/whl/cu124
```

Authenticate with Hugging Face
Gemma 4 requires accepting Google's Gemma Terms of Use. Visit huggingface.co/google/gemma-4-4b-it and accept the terms, then authenticate:
```shell
huggingface-cli login
```

Load Gemma 4 E4B
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "google/gemma-4-4b-it"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half the memory of FP32; standard for inference
    device_map="auto",           # place layers on available GPUs, spill to CPU if needed
)
```

Run Inference
```python
messages = [
    {"role": "user", "content": "What are the key features of Gemma 4?"}
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # append the assistant turn marker so the model responds
    return_tensors="pt",
    return_dict=True,
).to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512)

# outputs[0] contains the prompt followed by the completion;
# slice off the prompt tokens before decoding.
response = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
)
print(response)
```

Gemma 4 with 4-bit Quantization (bitsandbytes)
For systems with 8GB VRAM, use 4-bit quantization to run Gemma 4 E4B:
```shell
pip install bitsandbytes
```

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize into BF16 for matmuls
)

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-4b-it",
    quantization_config=bnb_config,
    device_map="auto",
)
```

Available Gemma 4 Model IDs
| Model ID | Params | Notes |
|---|---|---|
| google/gemma-4-2b-it | 2.1B | Instruction-tuned E2B |
| google/gemma-4-4b-it | 4.4B | Instruction-tuned E4B |
| google/gemma-4-26b-a4b-it | 26.1B | MoE instruction-tuned |
| google/gemma-4-31b-it | 31B | Instruction-tuned flagship |
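For scripts that need to choose a variant automatically, the table can be mirrored as a small lookup keyed by parameter count. The sketch below uses a hypothetical `pick_model` helper (not part of any library) that returns the largest model whose weights fit a VRAM budget; the footprint is estimated as parameters × bytes per parameter, which ignores activation and KV-cache overhead.

```python
# Hypothetical lookup: Gemma 4 model IDs and their parameter counts
# (from the table above).
GEMMA4_MODELS = {
    "google/gemma-4-2b-it": 2.1e9,
    "google/gemma-4-4b-it": 4.4e9,
    "google/gemma-4-26b-a4b-it": 26.1e9,
    "google/gemma-4-31b-it": 31e9,
}

def pick_model(vram_gb: float, bytes_per_param: float = 2.0):
    """Return the largest model whose weights fit in vram_gb, or None.

    bytes_per_param: 2.0 for BF16, 0.5 for 4-bit quantization.
    """
    fitting = [
        (params, model_id)
        for model_id, params in GEMMA4_MODELS.items()
        if params * bytes_per_param / 1e9 <= vram_gb
    ]
    return max(fitting)[1] if fitting else None

print(pick_model(24))       # 24 GB GPU, BF16   -> google/gemma-4-4b-it
print(pick_model(24, 0.5))  # 24 GB GPU, 4-bit  -> google/gemma-4-31b-it
```

Weight memory alone understates real usage, so treat the result as an upper bound on model size rather than a guarantee that generation will fit.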
Running Gemma 4 31B on a Single GPU
Gemma 4 31B needs roughly 62GB of VRAM for its weights alone in BF16 (31B parameters × 2 bytes). For a single RTX 3090 (24GB), use 4-bit quantization:
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-31b-it",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)
```