
Deploy Gemma 4 with vLLM

This guide shows how to deploy Gemma 4 models with vLLM for high-throughput production inference, covering installation, serving, and OpenAI-compatible API setup.

What is vLLM?

vLLM is an open-source, high-throughput inference engine designed for production LLM serving. It was developed at UC Berkeley and has become the de facto standard for deploying large language models at scale.

The key innovation in vLLM is PagedAttention — a memory management technique inspired by operating system virtual memory that allows the KV cache to be stored in non-contiguous blocks. Combined with continuous batching, this means vLLM can process many concurrent requests without wasting GPU memory on padding or pre-allocated buffers.

Compared to a naive Hugging Face generate() loop, vLLM achieves 10–25x higher throughput on the same hardware. It also exposes an OpenAI-compatible REST API out of the box, so any application already using the OpenAI SDK can switch to a self-hosted Gemma 4 endpoint with a one-line change.

Prerequisites

Before serving Gemma 4 with vLLM, make sure you have:

  • GPU: NVIDIA GPU with CUDA support
    • Gemma 4 E4B (google/gemma-4-4b-it): 8 GB VRAM minimum (16 GB recommended)
    • Gemma 4 31B (google/gemma-4-31b-it): 48 GB VRAM for full precision, or 24 GB with AWQ quantization
  • Python: 3.9 or newer
  • CUDA: 12.1 or newer (12.4 recommended for best performance)
  • Hugging Face account: Gemma 4 is a gated model — you need to accept the license on the Hugging Face model page and set HF_TOKEN in your environment

Export your token before running any commands:

export HF_TOKEN=hf_your_token_here

Installation

Install vLLM with pip. A virtual environment is strongly recommended:

pip install vllm

For the latest nightly features (including the most recent Gemma 4 optimizations), you can install from the nightly index:

pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly

Serve Gemma 4 E4B

The E4B variant is the best starting point — it fits comfortably on a single consumer GPU with 16 GB of VRAM and delivers strong reasoning performance relative to its size.

vllm serve google/gemma-4-4b-it \
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 8192

vLLM will download the model weights from Hugging Face on first run (using your HF_TOKEN), start the PagedAttention engine, and begin listening for requests. You should see a message like INFO: Application startup complete when it is ready.
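If you are scripting the launch, you can poll the server's /health endpoint instead of watching the log for the startup message. A minimal sketch using only the standard library (the endpoint path and default port match the serve command above; verify them against your installed vLLM version):

```python
import time
import urllib.request
import urllib.error

def wait_for_server(url: str = "http://localhost:8000/health",
                    timeout: float = 300.0) -> bool:
    """Poll the vLLM /health endpoint until it returns 200 or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; retry shortly
        time.sleep(1)
    return False
```

wait_for_server() blocks until the endpoint answers or the timeout expires, so a launch script can start vLLM in the background, wait, and only then begin sending traffic.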

Serve Gemma 4 31B

The 31B model requires significantly more VRAM. You have two options:

Option 1 — Two GPUs with tensor parallelism (48 GB total VRAM):

vllm serve google/gemma-4-31b-it \
  --max-model-len 32768 \
  --tensor-parallel-size 2

Option 2 — Single 24 GB GPU with AWQ quantization:

vllm serve google/gemma-4-31b-it \
  --max-model-len 32768 \
  --quantization awq

AWQ (Activation-aware Weight Quantization) reduces the model to 4-bit precision with minimal quality loss, making the 31B model practical on a single high-end consumer or workstation GPU.
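The saving is easy to estimate from first principles: weight memory scales linearly with bits per parameter, so 31B parameters need roughly 62 GB at 16-bit precision but only about 15.5 GB at 4-bit (KV cache and activations come on top of this). A quick back-of-the-envelope calculation:

```python
def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate weight memory in GB: parameters * bits, divided by 8 bits per byte."""
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

# 31B parameters: bf16 (16-bit) vs AWQ (4-bit)
print(weight_memory_gb(31, 16))  # 62.0 GB
print(weight_memory_gb(31, 4))   # 15.5 GB
```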

Query the vLLM API

vLLM exposes a fully OpenAI-compatible /v1/chat/completions endpoint. You can test it with curl immediately after the server starts:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-4-4b-it",
    "messages": [{"role": "user", "content": "Hello from vLLM!"}],
    "max_tokens": 256
  }'

The response format is identical to what the OpenAI API returns, including the choices, usage, and model fields.
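If you prefer not to add any dependencies, the same request can be made with only the Python standard library. This sketch builds the JSON payload, posts it, and pulls the generated text out of the choices field (the URL and model name match the serve command above):

```python
import json
import urllib.request

API_URL = "http://localhost:8000/v1/chat/completions"

def build_payload(model: str, prompt: str, max_tokens: int = 256) -> bytes:
    """Encode an OpenAI-style chat completion request body."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }).encode("utf-8")

def extract_reply(response_body: dict) -> str:
    """Pull the assistant text out of an OpenAI-style response."""
    return response_body["choices"][0]["message"]["content"]

def chat(prompt: str, model: str = "google/gemma-4-4b-it") -> str:
    """Send one chat turn to a running vLLM server and return the reply text."""
    req = urllib.request.Request(
        API_URL,
        data=build_payload(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return extract_reply(json.load(resp))
```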

Python Client

Because vLLM implements the OpenAI API spec, you can use the official openai Python package to talk to it. Just override base_url to point to your vLLM server:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="token-abc123")
response = client.chat.completions.create(
    model="google/gemma-4-4b-it",
    messages=[{"role": "user", "content": "Explain Gemma 4 briefly"}]
)
print(response.choices[0].message.content)

The api_key value is ignored by vLLM by default (it accepts any non-empty string), but you can enable token-based authentication with --api-key if you need to secure the endpoint.

vLLM with Docker

For production deployments, running vLLM inside Docker is the recommended approach. The official vllm/vllm-openai image bundles CUDA, the Python environment, and vLLM itself:

docker run --gpus all -p 8000:8000 \
  -e HF_TOKEN=$HF_TOKEN \
  vllm/vllm-openai:latest \
  --model google/gemma-4-4b-it

Add --ipc=host if you encounter shared memory errors with larger batch sizes. To persist the model weights across container restarts, mount a volume to /root/.cache/huggingface:

docker run --gpus all -p 8000:8000 \
  -e HF_TOKEN=$HF_TOKEN \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model google/gemma-4-4b-it

Performance Tips

A few flags that have the biggest impact on throughput and memory usage:

  • --max-model-len — Caps the maximum sequence length (prompt + output). Lowering this reduces KV cache memory and allows more concurrent requests. Start with 8192 and increase only if your workload needs longer contexts.
  • --gpu-memory-utilization 0.9 — By default vLLM reserves 90% of GPU memory for the KV cache. You can increase this to 0.95 on dedicated inference machines where no other GPU processes are running.
  • --quantization awq — Reduces weights to 4-bit precision. Requires the model to have a pre-quantized AWQ checkpoint, or you must quantize it yourself with autoawq beforehand.
  • --enable-chunked-prefill — Splits long prompts into chunks so they do not starve ongoing decode requests. Useful when mixing short and very long prompts.
  • --max-num-seqs — Sets the maximum number of sequences processed concurrently. Tune this together with --max-model-len to match your GPU memory budget.
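The last two flags interact: worst-case KV cache memory is roughly sequences × tokens × per-token cost, where each token stores a K and a V tensor for every layer. A rough estimator can help you pick values before launching; note the layer and head counts below are hypothetical placeholders for illustration, not published Gemma 4 architecture figures:

```python
def kv_cache_gb(max_model_len: int, max_num_seqs: int,
                num_layers: int, num_kv_heads: int,
                head_dim: int, dtype_bytes: int = 2) -> float:
    """Worst-case KV cache size: 2 tensors (K and V) per layer for every token."""
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    return max_model_len * max_num_seqs * bytes_per_token / 1e9

# Hypothetical config: 32 layers, 8 KV heads, head_dim 128, fp16 cache
print(kv_cache_gb(8192, 64, 32, 8, 128))  # ~68.7 GB worst case
```

In practice PagedAttention allocates cache blocks on demand, so actual usage only hits this ceiling when every sequence runs to the full context length, but the estimate shows why lowering --max-model-len or --max-num-seqs frees so much memory.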

Hugging Face Token Requirement

Gemma 4 is a gated model on Hugging Face. You must visit the model card for google/gemma-4-4b-it or google/gemma-4-31b-it, accept the Google terms of use, and then generate a Hugging Face access token with at least read permission.

Set the token in your shell before starting vLLM:

export HF_TOKEN=hf_your_token_here

Or pass it directly when using Docker via -e HF_TOKEN=$HF_TOKEN. Without a valid token, the model download will fail with a 401 Unauthorized error.
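A small preflight check can catch a missing or malformed token before a slow server launch fails with a 401. This sketch only verifies that the environment variable is set and has the usual hf_ prefix; it does not validate the token against Hugging Face:

```python
import os

def check_hf_token() -> str:
    """Return the HF token from the environment, or raise with a clear message."""
    token = os.environ.get("HF_TOKEN", "")
    if not token:
        raise RuntimeError("HF_TOKEN is not set; export it before starting vLLM")
    if not token.startswith("hf_"):
        raise RuntimeError("HF_TOKEN does not look like a Hugging Face "
                           "access token (expected 'hf_' prefix)")
    return token
```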
