
Gemma 4 Memory Requirements — VRAM Guide for All Models

Exact VRAM requirements for every Gemma 4 model at BF16 full precision and Q4 quantization, with hardware recommendations for common GPUs and Apple Silicon Macs.

Overview

Before downloading a Gemma 4 model to run locally, you should know how much memory it needs. All Gemma 4 models support 4-bit and 8-bit quantization, which dramatically reduces VRAM usage at a modest quality cost. This guide lists exact memory requirements for each Gemma 4 model at full precision (BF16) and at the most popular quantization level (Q4).

Gemma 4 memory requirements vary significantly between models. The smallest Gemma 4 variant (E2B) runs in just 2GB of VRAM with Q4 quantization — fitting on integrated graphics and older mobile GPUs. The largest Gemma 4 model (31B) requires 64GB for full precision, but drops to a more accessible 18GB with Q4, opening it up to RTX 3090 owners.

BF16 Full Precision Memory Requirements

BF16 (Brain Float 16) is the native precision for Gemma 4 models. Use BF16 when you need maximum output quality and have sufficient VRAM available.

Model | Parameters | Min VRAM (BF16)
Gemma 4 E2B | 2.1B | 5 GB
Gemma 4 E4B | 4.4B | 10 GB
Gemma 4 26B A4B | 26.1B (MoE) | 28 GB
Gemma 4 31B | 31B | 64 GB

Q4 Quantized Memory Requirements

Q4 quantization (4-bit weights) is the recommended option for most local Gemma 4 deployments. It cuts VRAM usage by approximately 75% compared to BF16, with only a 3–5% reduction in benchmark quality.

Model | Min VRAM (Q4) | Fits On
Gemma 4 E2B Q4 | 2 GB | Any modern GPU
Gemma 4 E4B Q4 | 4 GB | GTX 1660, RTX 3060
Gemma 4 26B A4B Q4 | 14 GB | RTX 3090, RTX 4090
Gemma 4 31B Q4 | 18 GB | RTX 3090 24GB+
Tip: Sweet spot for local development

For most local development workflows, Gemma 4 E4B Q4 at 4GB VRAM is the sweet spot. It fits on any modern GPU and retains 95%+ of the full-precision quality. If you have an RTX 3060 or newer, start here.

Recommended Hardware by Gemma 4 Model

Not sure which Gemma 4 model fits your machine? Use this table to find the best match for your hardware. All recommendations use Q4 quantization unless otherwise noted.

Hardware | Recommended Gemma 4 Model
Apple MacBook Air M2 (8GB) | Gemma 4 E2B Q4
Apple MacBook Pro M3 (16GB) | Gemma 4 E4B Q4
Apple MacBook Pro M3 Max (48GB) | Gemma 4 26B A4B Q4
NVIDIA RTX 3070 (8GB) | Gemma 4 E4B Q4
NVIDIA RTX 3090 (24GB) | Gemma 4 26B A4B Q4
NVIDIA A100 (80GB) | Gemma 4 31B BF16

What is Quantization?

Quantization reduces the bit-width of model weights — from 16-bit (BF16) to 8-bit (Q8) or 4-bit (Q4). For Gemma 4 models, Q4 quantization typically reduces quality by 3–5% on standard benchmarks while cutting memory usage by approximately 75%. Q8 quantization offers a middle ground: roughly 50% memory reduction with less than 1% quality loss.
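As a back-of-the-envelope check, weight memory is roughly parameter count times bytes per weight; the published minimums on this page run a little higher because they include runtime overhead (KV cache, activations). A minimal sketch of the weight-only estimate:

```python
def est_weight_gb(params_b: float, bits_per_weight: int) -> float:
    """Weight-only memory estimate: billions of params x bytes per weight."""
    return params_b * bits_per_weight / 8  # GB (decimal)

# Gemma 4 31B: ~62 GB at BF16 and ~15.5 GB at Q4 - close to the table's
# 64 GB / 18 GB minimums once runtime overhead is added.
bf16 = est_weight_gb(31, 16)  # 62.0
q4 = est_weight_gb(31, 4)     # 15.5
```

The same estimate at 8 bits gives ~31 GB for Gemma 4 31B, consistent with the FP8 figure quoted below.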

When you run a Gemma 4 model through Ollama, it automatically selects Q4_K_M quantization — a variant of Q4 that applies different quantization levels to different weight layers, prioritizing accuracy for the most sensitive parts of the Gemma 4 architecture. This is the recommended quantization format for all Gemma 4 models running on consumer hardware.

For Gemma 4 models running in production on server-grade hardware, use BF16 or FP8 precision (where supported) to maximize output quality. FP8 is supported on H100 GPUs and cuts memory roughly in half compared to BF16, making Gemma 4 31B fit in 32GB of VRAM at near-full quality.

Warning: MoE memory requirements explained

The Gemma 4 26B A4B model uses a Mixture-of-Experts (MoE) architecture. While it has 26.1B total parameters, only approximately 4B are active per token during inference. However, all weights must still be loaded into VRAM — the active parameter count does not reduce memory requirements. Plan for 28GB BF16 or 14GB Q4 when deploying Gemma 4 26B A4B.
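The total-versus-active distinction is easy to get wrong when budgeting. A sketch using this page's Q4 rate of roughly half a byte per weight (the helper is illustrative):

```python
def weights_gb(params_b: float, bits: int = 4) -> float:
    """Q4 weight footprint in GB: billions of params x bytes per weight."""
    return params_b * bits / 8

# Wrong mental model: sizing by ACTIVE parameters (~4B) suggests ~2 GB...
naive_gb = weights_gb(4.0)     # 2.0 GB - underestimates by ~11 GB
# ...but every expert stays resident, so size by TOTAL parameters.
actual_gb = weights_gb(26.1)   # ~13 GB, in line with the 14 GB Q4 minimum
```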

Memory Requirements for Apple Silicon

Apple Silicon Macs use unified memory — the same memory pool serves both CPU and GPU workloads. This is distinct from discrete GPU VRAM. When running Gemma 4 on Apple Silicon, count the full system memory, not just the GPU allocation. A MacBook Pro with 16GB unified memory has roughly 12–13GB available for model weights after macOS overhead, making Gemma 4 E4B Q4 (4GB) comfortable and Gemma 4 E4B BF16 (10GB) feasible.
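The unified-memory budgeting above can be sketched as a quick fit check. The macOS overhead value is an assumption drawn from this page's rough 12–13GB-of-16GB estimate, not a fixed number:

```python
def usable_unified_gb(total_gb: float, os_overhead_gb: float = 3.5) -> float:
    """Memory left for model weights after a rough macOS overhead (assumed)."""
    return total_gb - os_overhead_gb

def fits(model_gb: float, total_gb: float) -> bool:
    """True if a model's footprint fits in the usable unified memory."""
    return model_gb <= usable_unified_gb(total_gb)

# 16 GB MacBook Pro: E4B Q4 (4 GB) is comfortable, E4B BF16 (10 GB) feasible.
```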

Apple Silicon benefits from the Metal Performance Shaders (MPS) backend in llama.cpp and Ollama. Gemma 4 inference on M-series chips is significantly faster than CPU-only inference on x86 machines with the same memory, making Apple Silicon an excellent platform for running Gemma 4 E4B and Gemma 4 E2B locally.

Related Pages

  • Gemma 4 E4B: the daily driver — 8GB VRAM minimum
  • Gemma 4 E2B: ultra-light — runs in 2GB with Q4
  • Run Gemma 4 with Ollama: automatic quantization selection
  • Gemma 4 Models Comparison: full feature comparison of all Gemma 4 variants