Fine-Tune Gemma 4 with Unsloth QLoRA
Fine-tune Gemma 4 models using Unsloth's QLoRA on a single consumer GPU. Covers setup, dataset preparation, training config, and merging weights.
What is QLoRA?
QLoRA (Quantized Low-Rank Adaptation) is a parameter-efficient fine-tuning technique that makes it practical to fine-tune large language models on consumer hardware. It combines two ideas:
- 4-bit quantization — the base model weights are loaded in 4-bit NF4 format, dramatically reducing VRAM usage
- LoRA (Low-Rank Adaptation) — instead of updating all model parameters, LoRA inserts small trainable "adapter" matrices into attention layers. Only these adapters are updated during training; the frozen quantized base weights are never modified
The result: you can fine-tune Gemma 4 E4B on a GPU with as little as 16 GB of VRAM, whereas full fine-tuning would require 80+ GB.
Unsloth is an open-source library that provides optimised CUDA kernels for QLoRA training. It reduces memory usage by a further 30–50% compared to vanilla PEFT and speeds up training by 2–5x.
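The adapter mechanism described above can be sketched in plain Python with toy shapes (these are illustrative dimensions, not Gemma 4's real ones): the frozen base weight W is never touched, only the low-rank factors A and B train, and the effective weight is W + (alpha / r) · (B @ A).

```python
def matmul(M, N):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(M[i][k] * N[k][j] for k in range(len(N)))
             for j in range(len(N[0]))] for i in range(len(M))]

d_out, d_in, r, alpha = 4, 4, 2, 16   # hypothetical tiny shapes

W = [[1.0 if i == j else 0.0 for j in range(d_in)] for i in range(d_out)]  # frozen base
A = [[0.1] * d_in for _ in range(r)]   # trainable, r x d_in
B = [[0.1] * r for _ in range(d_out)]  # trainable, d_out x r

scale = alpha / r
BA = matmul(B, A)
W_eff = [[W[i][j] + scale * BA[i][j] for j in range(d_in)]
         for i in range(d_out)]

# LoRA trains r * (d_in + d_out) weights instead of d_out * d_in,
# a huge saving when the real dimensions are thousands wide and r is small
trainable = r * (d_in + d_out)
```

Only A and B (and their optimiser state) consume training memory, which is why the quantized base model can stay frozen in 4-bit.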
Prerequisites
- NVIDIA GPU with 16 GB+ VRAM (RTX 3090, RTX 4090, A10, A100, etc.)
- Python 3.10 or newer
- CUDA 12.1 or newer
- A Hugging Face account with `HF_TOKEN` set — Gemma 4 is a gated model
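Before downloading the gated weights, it can help to confirm the token is actually visible to Python. A minimal pre-flight check (the environment-variable names are the standard Hugging Face ones):

```python
import os

def has_hf_token() -> bool:
    """True if a Hugging Face token is available via the environment."""
    return bool(os.environ.get("HF_TOKEN") or os.environ.get("HUGGING_FACE_HUB_TOKEN"))

if not has_hf_token():
    print("No HF token found; set HF_TOKEN or run `huggingface-cli login`.")
```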
Install Dependencies
```bash
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps trl peft accelerate bitsandbytes
```

For systems with older CUDA versions, consult the Unsloth installation guide for the correct pip index.
Load the Model with Unsloth
```python
from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="google/gemma-4-4b-it",
    max_seq_length=2048,
    dtype=None,         # Auto-detect: bfloat16 on Ampere+
    load_in_4bit=True,  # Enable 4-bit quantization
)
```

`dtype=None` lets Unsloth pick the best dtype for your GPU. On Ampere and newer (RTX 30-series+), it selects bfloat16.
Apply LoRA Adapters
```python
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                  # LoRA rank — higher = more capacity, more memory
    target_modules=[       # Which attention/MLP projections to adapt
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha=16,         # Scaling factor — typically equal to r
    lora_dropout=0,        # Unsloth recommends 0 for speed
    bias="none",
    use_gradient_checkpointing="unsloth",  # Saves VRAM at the cost of some speed
    random_state=42,
)
```

`r=16` is a good default. Increase to 32 or 64 if the task is complex or your GPU has spare VRAM. Decrease to 8 if you are running out of memory.
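To make the rank/memory trade-off concrete, here is a rough trainable-parameter count as a function of `r`. The layer shapes and depth below are hypothetical stand-ins, not Gemma 4's real dimensions:

```python
def lora_param_count(shapes, r):
    """Each adapted (d_out x d_in) projection adds r * (d_in + d_out) weights."""
    return sum(r * (d_in + d_out) for d_out, d_in in shapes)

# Assumed shapes for one transformer block: four attention projections
# plus gate/up/down MLP projections (illustrative sizes only)
block = [(2048, 2048)] * 4 + [(8192, 2048)] * 2 + [(2048, 8192)]
n_blocks = 30  # assumed depth

for r in (8, 16, 32):
    total = lora_param_count(block, r) * n_blocks
    print(f"r={r:2d}: ~{total / 1e6:.1f}M trainable parameters")
```

Doubling `r` doubles the adapter's parameter count and its optimiser state, which is why dropping to `r=8` helps when VRAM is tight.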
Prepare Your Dataset
Use the datasets library to load or create your training data. Gemma 4 expects conversations in the standard chat format:
```python
from datasets import Dataset

# Example: instruction-following dataset
data = [
    {
        "conversations": [
            {"role": "user", "content": "Translate to French: Hello, world!"},
            {"role": "assistant", "content": "Bonjour, le monde!"},
        ]
    },
    # ... more examples
]
dataset = Dataset.from_list(data)

def format_example(example):
    """Apply Gemma 4 chat template to each conversation."""
    return {
        "text": tokenizer.apply_chat_template(
            example["conversations"],
            tokenize=False,
            add_generation_prompt=False,
        )
    }

dataset = dataset.map(format_example)
```

Train with TRL SFTTrainer
TRL's SFTTrainer (Supervised Fine-Tuning Trainer) handles the training loop, gradient accumulation, and checkpoint saving:
```python
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,  # Effective batch size = 8
        warmup_steps=10,
        num_train_epochs=3,
        learning_rate=2e-4,
        fp16=False,
        bf16=True,                      # Use bfloat16 on Ampere+
        logging_steps=10,
        output_dir="./gemma4-qlora-output",
        save_strategy="epoch",
        optim="adamw_8bit",             # 8-bit Adam saves ~2 GB VRAM
    ),
)

trainer.train()
```

On an RTX 4090 with a batch size of 2 and gradient accumulation of 4, expect approximately 3–5 seconds per step for the E4B model.
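Those figures make it easy to estimate a full run. A back-of-envelope sketch, where the dataset size is an assumption and seconds-per-step is the midpoint of the range above:

```python
import math

n_examples = 10_000               # hypothetical dataset size
per_device_batch, grad_accum = 2, 4
seconds_per_step = 4              # midpoint of the 3-5 s range
epochs = 3

effective_batch = per_device_batch * grad_accum
steps_per_epoch = math.ceil(n_examples / effective_batch)
total_steps = steps_per_epoch * epochs
print(f"{total_steps} steps, about {total_steps * seconds_per_step / 3600:.1f} h")
```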
Save the LoRA Adapter
After training, save the lightweight LoRA adapter weights (typically 50–200 MB depending on rank):
```python
model.save_pretrained("./gemma4-lora-adapter")
tokenizer.save_pretrained("./gemma4-lora-adapter")
```

To share the adapter on Hugging Face Hub:
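The size range quoted above follows directly from the adapter's parameter count: LoRA weights are stored in 16-bit, so bytes ≈ params × 2. A rough sketch, where the parameter count is an assumed figure for an `r=16` adapter:

```python
lora_weights = 22_600_000              # assumed r=16 trainable-parameter count
size_mb = lora_weights * 2 / 1024**2   # 2 bytes per 16-bit weight
print(f"~{size_mb:.0f} MB on disk")
```

Since the count scales linearly with rank, `r=32` or `r=64` adapters land roughly in the 85–170 MB range.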
```python
model.push_to_hub("your-username/gemma4-4b-finetuned")
tokenizer.push_to_hub("your-username/gemma4-4b-finetuned")
```

Merge and Export to GGUF
To use the fine-tuned model with Ollama or llama.cpp, merge the LoRA adapter back into the base model and export to GGUF format:
```python
# Merge LoRA into base model (dequantises weights first)
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./gemma4-merged")
tokenizer.save_pretrained("./gemma4-merged")
```

Then convert to GGUF using llama.cpp's conversion script:
```bash
python llama.cpp/convert_hf_to_gguf.py ./gemma4-merged \
    --outfile gemma4-finetuned-f16.gguf \
    --outtype f16

# convert_hf_to_gguf.py cannot emit k-quants like q4_k_m directly;
# quantise the f16 file with llama.cpp's quantize tool
./llama.cpp/llama-quantize gemma4-finetuned-f16.gguf gemma4-finetuned.gguf Q4_K_M
```

Load the GGUF file into Ollama:
```bash
# Modelfile contents:
#   FROM ./gemma4-finetuned.gguf
ollama create my-gemma4 -f Modelfile
```

Memory Usage Reference
| Configuration | VRAM Usage (approx.) |
|---|---|
| E4B, 4-bit, r=16, batch=2 | ~10 GB |
| E4B, 4-bit, r=16, batch=4 | ~14 GB |
| E4B, 4-bit, r=32, batch=2 | ~12 GB |
| 31B, 4-bit, r=16, batch=1 | ~28 GB (requires A100) |
Use `nvidia-smi` during training to monitor actual VRAM usage and adjust `per_device_train_batch_size` or `r` accordingly.