# Multimodal Inputs

How Gemma 4 processes images, and how to pass vision inputs across different runtimes.
Gemma 4 E4B, 26B A4B, and 31B support image inputs natively. The E2B model is text-only.
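Because only some variants accept images, it can help to validate requests before dispatching them to a runtime. A minimal sketch, assuming the variant split described above; the model identifiers and the `validate_request` helper are illustrative, not part of any official API:

```python
# Illustrative capability check based on the variant list above.
# The model identifiers here are assumptions, not official names.
VISION_CAPABLE = {"gemma4-e4b", "gemma4-26b-a4b", "gemma4-31b"}
TEXT_ONLY = {"gemma4-e2b"}

def validate_request(model: str, images: list) -> None:
    """Raise early instead of letting the runtime fail on unsupported input."""
    if images and model in TEXT_ONLY:
        raise ValueError(f"{model} is text-only and cannot accept images")

validate_request("gemma4-e4b", ["img.png"])  # passes silently
try:
    validate_request("gemma4-e2b", ["img.png"])
except ValueError as e:
    print(e)
```

Failing fast on the client side gives a clearer error than whatever the runtime would report after the request is already in flight.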
## Supported input types
- JPEG, PNG, WebP, GIF (first frame only)
- Base64-encoded inline images
- URL references (runtime-dependent)
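Since only the first frame of a GIF is used, you may want to extract that frame yourself and send a static image instead. A sketch using Pillow; the in-memory GIF here is a stand-in for a real animation file:

```python
from io import BytesIO

from PIL import Image

# Build a tiny two-frame GIF in memory (stand-in for a real animated file).
frames = [Image.new("RGB", (32, 32), color) for color in ("red", "blue")]
buf = BytesIO()
frames[0].save(buf, format="GIF", save_all=True, append_images=frames[1:])

# Extract the first frame explicitly and re-encode it as PNG before
# passing it to the model, so the behavior is unambiguous.
gif = Image.open(BytesIO(buf.getvalue()))
gif.seek(0)                       # position at the first frame
first_frame = gif.convert("RGB")  # drop GIF palette/transparency quirks
out = BytesIO()
first_frame.save(out, format="PNG")
print(first_frame.size)  # → (32, 32)
```

Converting up front also avoids surprises if a runtime handles animated GIFs differently.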
## Ollama

```shell
ollama run gemma4:4b "What is in this image?" --image ./screenshot.png
```

Via the API:

```python
import base64

import ollama

with open("image.png", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode()

response = ollama.chat(
    model="gemma4:4b",
    messages=[{
        "role": "user",
        "content": "Describe this chart.",
        "images": [img_b64],
    }]
)
print(response["message"]["content"])
```

## Hugging Face Transformers
```python
from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image

processor = AutoProcessor.from_pretrained("google/gemma-4-4b-it")
model = AutoModelForCausalLM.from_pretrained("google/gemma-4-4b-it")

image = Image.open("diagram.png")
inputs = processor(
    text="<image>\nExplain the architecture shown.",
    images=image,
    return_tensors="pt"
)

output = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(output[0], skip_special_tokens=True))
```

## Token budget for images
Each image consumes vision tokens from your context window. Approximate costs:
| Image size | Tokens consumed |
|---|---|
| 224×224 | ~256 tokens |
| 512×512 | ~1024 tokens |
| 1024×1024 | ~4096 tokens |
Large images in long conversations can rapidly fill the context window. Resize images to the smallest useful resolution before passing them to the model.
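One way to automate that resizing is to estimate token cost from pixel count and downscale until the estimate fits a budget. A sketch assuming the roughly one-token-per-256-pixels ratio implied by the larger entries in the table above; that ratio is an approximation read off the table, not an exact tokenizer rule:

```python
import math

from PIL import Image

# Approximate ratio inferred from the table above (e.g. 512x512 ≈ 1024
# tokens). This is an assumption, not a documented tokenizer constant.
PIXELS_PER_TOKEN = 256

def fit_to_token_budget(img: Image.Image, max_tokens: int) -> Image.Image:
    """Downscale img (preserving aspect ratio) until its estimated
    vision-token cost is within max_tokens."""
    est_tokens = (img.width * img.height) / PIXELS_PER_TOKEN
    if est_tokens <= max_tokens:
        return img
    scale = math.sqrt(max_tokens * PIXELS_PER_TOKEN / (img.width * img.height))
    new_size = (max(1, int(img.width * scale)), max(1, int(img.height * scale)))
    return img.resize(new_size, Image.LANCZOS)

big = Image.new("RGB", (2048, 2048))
small = fit_to_token_budget(big, max_tokens=1024)
print(small.size)  # → (512, 512)
```

Scaling by the square root of the token ratio keeps the aspect ratio while hitting the pixel budget, since token cost grows with area rather than edge length.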
## What works well
- Screenshots and UIs — Gemma 4 accurately identifies UI elements and describes layouts
- Charts and graphs — Can extract data points and describe trends
- Code screenshots — Can read and reproduce code from screenshots
- Documents — OCR-quality text extraction from printed or typed documents
## Limitations
- No video support (use first frame extraction)
- No real-time camera input
- Maximum ~4 images per request recommended for stable inference
- The E2B model does not support image inputs
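If a task involves more images than the recommended maximum, one workable pattern is to split the list into separate requests. A minimal sketch; the `batch_images` helper is illustrative, not part of any runtime API:

```python
# Split a long image list into requests of at most four images each,
# per the stability recommendation above. Helper name is illustrative.
MAX_IMAGES_PER_REQUEST = 4

def batch_images(images: list, limit: int = MAX_IMAGES_PER_REQUEST) -> list:
    """Chunk images into lists of at most `limit` items, preserving order."""
    return [images[i:i + limit] for i in range(0, len(images), limit)]

batches = batch_images([f"page_{n}.png" for n in range(10)])
print([len(b) for b in batches])  # → [4, 4, 2]
```

Each batch can then be sent as its own request; note that the model sees no context across batches unless you carry forward a summary in the conversation.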