Fine-Tune Gemma 4 with Unsloth QLoRA
Fine-tune Gemma 4 models using Unsloth's QLoRA on a single consumer GPU. Covers setup, dataset preparation, training config, and merging weights.
What is QLoRA?
QLoRA (Quantized Low-Rank Adaptation) is a parameter-efficient fine-tuning technique that makes it practical to fine-tune large language models on consumer hardware. It combines two ideas:
- 4-bit quantization — the base model weights are loaded in 4-bit NF4 format, dramatically reducing VRAM usage
- LoRA (Low-Rank Adaptation) — instead of updating all model parameters, LoRA inserts small trainable "adapter" matrices into attention layers. Only these adapters are updated during training; the frozen quantized base weights are never modified
The result: you can fine-tune Gemma 4 E4B on a GPU with as little as 16 GB of VRAM, whereas full fine-tuning would require 80+ GB.
Unsloth is an open-source library that provides optimised CUDA kernels for QLoRA training. It reduces memory usage by a further 30–50% compared to vanilla PEFT and speeds up training by 2–5x.
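The adapter mechanism described above can be sketched in plain Python with toy shapes (these are illustrative dimensions, not Gemma 4's real ones): the frozen base weight W is never touched, only the low-rank factors A and B train, and the effective weight is W + (alpha / r) · (B @ A).

```python
def matmul(M, N):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(M[i][k] * N[k][j] for k in range(len(N)))
             for j in range(len(N[0]))] for i in range(len(M))]

d_out, d_in, r, alpha = 4, 4, 2, 16   # hypothetical tiny shapes

W = [[1.0 if i == j else 0.0 for j in range(d_in)] for i in range(d_out)]  # frozen base
A = [[0.1] * d_in for _ in range(r)]   # trainable, r x d_in
B = [[0.1] * r for _ in range(d_out)]  # trainable, d_out x r

scale = alpha / r
BA = matmul(B, A)
W_eff = [[W[i][j] + scale * BA[i][j] for j in range(d_in)]
         for i in range(d_out)]

# LoRA trains r * (d_in + d_out) weights instead of d_out * d_in,
# a huge saving when the real dimensions are thousands wide and r is small
trainable = r * (d_in + d_out)
```

Only A and B (and their optimiser state) consume training memory, which is why the quantized base model can stay frozen in 4-bit.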
Prerequisites
- NVIDIA GPU with 16 GB+ VRAM (RTX 3090, RTX 4090, A10, A100, etc.)
- Python 3.10 or newer
- CUDA 12.1 or newer
- A Hugging Face account with `HF_TOKEN` set — Gemma 4 is a gated model
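Before downloading the gated weights, it can help to confirm the token is actually visible to Python. A minimal pre-flight check (the environment-variable names are the standard Hugging Face ones):

```python
import os

def has_hf_token() -> bool:
    """True if a Hugging Face token is available via the environment."""
    return bool(os.environ.get("HF_TOKEN") or os.environ.get("HUGGING_FACE_HUB_TOKEN"))

if not has_hf_token():
    print("No HF token found; set HF_TOKEN or run `huggingface-cli login`.")
```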
Install Dependencies
```bash
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps trl peft accelerate bitsandbytes
```

For systems with older CUDA versions, consult the Unsloth installation guide for the correct pip index.
Load the Model with Unsloth
```python
from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="google/gemma-4-4b-it",
    max_seq_length=2048,
    dtype=None,         # Auto-detect: bfloat16 on Ampere+
    load_in_4bit=True,  # Enable 4-bit quantization
)
```

`dtype=None` lets Unsloth pick the best dtype for your GPU. On Ampere and newer (RTX 30-series+), it selects bfloat16.
Apply LoRA Adapters
```python
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                  # LoRA rank — higher = more capacity, more memory
    target_modules=[       # Which attention/MLP projections to adapt
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha=16,         # Scaling factor — typically equal to r
    lora_dropout=0,        # Unsloth recommends 0 for speed
    bias="none",
    use_gradient_checkpointing="unsloth",  # Saves VRAM at the cost of some speed
    random_state=42,
)
```

`r=16` is a good default. Increase to 32 or 64 if the task is complex or your GPU has spare VRAM. Decrease to 8 if you are running out of memory.
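To make the rank/memory trade-off concrete, here is a rough trainable-parameter count as a function of `r`. The layer shapes and depth below are hypothetical stand-ins, not Gemma 4's real dimensions:

```python
def lora_param_count(shapes, r):
    """Each adapted (d_out x d_in) projection adds r * (d_in + d_out) weights."""
    return sum(r * (d_in + d_out) for d_out, d_in in shapes)

# Assumed shapes for one transformer block: four attention projections
# plus gate/up/down MLP projections (illustrative sizes only)
block = [(2048, 2048)] * 4 + [(8192, 2048)] * 2 + [(2048, 8192)]
n_blocks = 30  # assumed depth

for r in (8, 16, 32):
    total = lora_param_count(block, r) * n_blocks
    print(f"r={r:2d}: ~{total / 1e6:.1f}M trainable parameters")
```

Doubling `r` doubles the adapter's parameter count and its optimiser state, which is why dropping to `r=8` helps when VRAM is tight.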
Prepare Your Dataset
Use the datasets library to load or create your training data. Gemma 4 expects conversations in the standard chat format:
```python
from datasets import Dataset

# Example: instruction-following dataset
data = [
    {
        "conversations": [
            {"role": "user", "content": "Translate to French: Hello, world!"},
            {"role": "assistant", "content": "Bonjour, le monde!"},
        ]
    },
    # ... more examples
]
dataset = Dataset.from_list(data)

def format_example(example):
    """Apply Gemma 4 chat template to each conversation."""
    return {
        "text": tokenizer.apply_chat_template(
            example["conversations"],
            tokenize=False,
            add_generation_prompt=False,
        )
    }

dataset = dataset.map(format_example)
```

Train with TRL SFTTrainer
TRL's SFTTrainer (Supervised Fine-Tuning Trainer) handles the training loop, gradient accumulation, and checkpoint saving:
```python
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,  # Effective batch size = 8
        warmup_steps=10,
        num_train_epochs=3,
        learning_rate=2e-4,
        fp16=False,
        bf16=True,                      # Use bfloat16 on Ampere+
        logging_steps=10,
        output_dir="./gemma4-qlora-output",
        save_strategy="epoch",
        optim="adamw_8bit",             # 8-bit Adam saves ~2 GB VRAM
    ),
)

trainer.train()
```

On an RTX 4090 with a batch size of 2 and gradient accumulation of 4, expect approximately 3–5 seconds per step for the E4B model.
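Those figures make it easy to estimate a full run. A back-of-envelope sketch, where the dataset size is an assumption and seconds-per-step is the midpoint of the range above:

```python
import math

n_examples = 10_000               # hypothetical dataset size
per_device_batch, grad_accum = 2, 4
seconds_per_step = 4              # midpoint of the 3-5 s range
epochs = 3

effective_batch = per_device_batch * grad_accum
steps_per_epoch = math.ceil(n_examples / effective_batch)
total_steps = steps_per_epoch * epochs
print(f"{total_steps} steps, about {total_steps * seconds_per_step / 3600:.1f} h")
```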
Save the LoRA Adapter
After training, save the lightweight LoRA adapter weights (typically 50–200 MB depending on rank):
```python
model.save_pretrained("./gemma4-lora-adapter")
tokenizer.save_pretrained("./gemma4-lora-adapter")
```

To share the adapter on Hugging Face Hub:
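The size range quoted above follows directly from the adapter's parameter count: LoRA weights are stored in 16-bit, so bytes ≈ params × 2. A rough sketch, where the parameter count is an assumed figure for an `r=16` adapter:

```python
lora_weights = 22_600_000              # assumed r=16 trainable-parameter count
size_mb = lora_weights * 2 / 1024**2   # 2 bytes per 16-bit weight
print(f"~{size_mb:.0f} MB on disk")
```

Since the count scales linearly with rank, `r=32` or `r=64` adapters land roughly in the 85–170 MB range.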
```python
model.push_to_hub("your-username/gemma4-4b-finetuned")
tokenizer.push_to_hub("your-username/gemma4-4b-finetuned")
```

Merge and Export to GGUF
To use the fine-tuned model with Ollama or llama.cpp, merge the LoRA adapter back into the base model and export to GGUF format:
```python
# Merge LoRA into base model (dequantises weights first)
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./gemma4-merged")
tokenizer.save_pretrained("./gemma4-merged")
```

Then convert to GGUF using llama.cpp's conversion script:
```bash
python llama.cpp/convert_hf_to_gguf.py ./gemma4-merged \
    --outfile gemma4-finetuned-f16.gguf \
    --outtype f16

# convert_hf_to_gguf.py cannot emit k-quants like q4_k_m directly;
# quantise the f16 file with llama.cpp's quantize tool
./llama.cpp/llama-quantize gemma4-finetuned-f16.gguf gemma4-finetuned.gguf Q4_K_M
```

Load the GGUF file into Ollama:
```bash
# Modelfile contents:
#   FROM ./gemma4-finetuned.gguf
ollama create my-gemma4 -f Modelfile
```

Memory Usage Reference
| Configuration | VRAM Usage (approx.) |
|---|---|
| E4B, 4-bit, r=16, batch=2 | ~10 GB |
| E4B, 4-bit, r=16, batch=4 | ~14 GB |
| E4B, 4-bit, r=32, batch=2 | ~12 GB |
| 31B, 4-bit, r=16, batch=1 | ~28 GB (requires A100) |
Use `nvidia-smi` during training to monitor actual VRAM usage and adjust `per_device_train_batch_size` or `r` accordingly.