
Run Gemma 4 with Hugging Face Transformers

How to load and run Gemma 4 models using the Hugging Face Transformers library in Python. Covers all Gemma 4 variants, with CUDA and 4-bit quantization examples.


Hugging Face Transformers gives you full control over Gemma 4 inference. This approach is best for Python pipelines, fine-tuning, and integration with the broader ML ecosystem.

Prerequisites

pip install transformers torch accelerate

For CUDA support, install PyTorch with CUDA:

pip install torch --index-url https://download.pytorch.org/whl/cu124
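To confirm the CUDA wheel is active before loading a model, a quick sanity check (output varies by machine):

```python
# Verify that this PyTorch build can see a CUDA device.
import torch

print(torch.__version__)          # should end in +cu124 for the CUDA wheel
print(torch.cuda.is_available())  # True if a usable GPU was found
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```

If `is_available()` prints False, the CPU-only wheel is installed or the driver is missing.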

Authenticate with Hugging Face

Gemma 4 requires accepting Google's Gemma Terms of Use. Visit huggingface.co/google/gemma-4-4b-it and accept the terms, then authenticate:

huggingface-cli login
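For scripts and CI, where an interactive login is awkward, recent versions of huggingface_hub also read a token from the HF_TOKEN environment variable. A minimal sketch (the token value is a placeholder):

```python
# Non-interactive alternative to `huggingface-cli login`:
# huggingface_hub picks up HF_TOKEN from the environment automatically.
import os

os.environ["HF_TOKEN"] = "hf_xxxxxxxxxxxx"  # placeholder -- substitute your own token
```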

Load Gemma 4 E4B

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "google/gemma-4-4b-it"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

Run Inference

messages = [
    {"role": "user", "content": "What are the key features of Gemma 4?"}
]

inputs = tokenizer.apply_chat_template(
    messages, return_tensors="pt", return_dict=True
).to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512)

response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(response)
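The slice in the decode step works because `generate` returns the prompt tokens followed by the newly generated tokens, so dropping the first `input_ids.shape[-1]` positions leaves only the model's reply. A toy illustration with plain lists (the token IDs are made up):

```python
# generate() echoes the prompt before the completion; slicing it off
# is what isolates the assistant's reply.
prompt_ids = [101, 7592, 102]          # pretend these encode the user message
generated = prompt_ids + [2054, 2003]  # prompt + newly generated tokens
reply_ids = generated[len(prompt_ids):]
print(reply_ids)  # [2054, 2003]
```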

Gemma 4 with 4-bit Quantization (bitsandbytes)

For systems with 8GB VRAM, use 4-bit quantization to run Gemma 4 E4B:

pip install bitsandbytes

from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-4b-it",
    quantization_config=bnb_config,
    device_map="auto",
)
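To see why 4-bit fits in 8GB, a rough back-of-envelope estimate of the weight footprint (this counts weights only, ignoring KV cache and activations, so treat it as a lower bound rather than an official sizing tool):

```python
# Approximate VRAM used by model weights alone:
# params * bits-per-weight / 8 bytes, converted to GiB.
def weight_vram_gb(params_billions, bits_per_weight):
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1024**3

print(round(weight_vram_gb(4.4, 16), 1))  # BF16:  ~8.2 GiB
print(round(weight_vram_gb(4.4, 4), 1))   # 4-bit: ~2.0 GiB
```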

Available Gemma 4 Model IDs

Model ID                    Params   Notes
google/gemma-4-2b-it        2.1B     Instruction-tuned E2B
google/gemma-4-4b-it        4.4B     Instruction-tuned E4B
google/gemma-4-26b-a4b-it   26.1B    MoE instruction-tuned
google/gemma-4-31b-it       31B      Instruction-tuned flagship
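A small helper can turn the table above into a model picker for a given VRAM budget. The thresholds are rough estimates (params × 2 bytes for BF16 weights, no KV cache), not official requirements:

```python
# Pick the largest listed Gemma 4 model whose BF16 weights fit in vram_gb.
MODELS = [
    ("google/gemma-4-2b-it", 2.1),
    ("google/gemma-4-4b-it", 4.4),
    ("google/gemma-4-26b-a4b-it", 26.1),
    ("google/gemma-4-31b-it", 31.0),
]

def pick_model(vram_gb):
    best = None
    for model_id, params_b in MODELS:  # list is ordered smallest to largest
        if params_b * 2 <= vram_gb:    # BF16 = 2 bytes per parameter
            best = model_id
    return best

print(pick_model(24))  # google/gemma-4-4b-it -- 26B+ won't fit in BF16
```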

Running Gemma 4 31B on a Single GPU

Gemma 4 31B needs roughly 62GB of VRAM for its weights alone in BF16 (31B parameters × 2 bytes). For a single RTX 3090 (24GB), use 4-bit quantization:

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-31b-it",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)
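If even the 4-bit weights don't fit, `from_pretrained` also accepts a `max_memory` mapping that accelerate uses to split layers between the GPU and CPU RAM. A sketch of the extra argument (the capacity values are assumptions for a 24GB card, not measured limits):

```python
# Per-device caps for device_map="auto"; accelerate offloads whatever
# exceeds the GPU 0 cap to CPU RAM (slower, but it runs).
max_memory = {0: "22GiB", "cpu": "48GiB"}  # leave headroom below 24GB
# pass max_memory=max_memory to AutoModelForCausalLM.from_pretrained(...)
```

Leaving a couple of GiB of headroom on the GPU avoids out-of-memory errors from the KV cache during generation.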