# Multimodal Inputs

How Gemma 4 processes images, and how to pass vision inputs across different runtimes.
Gemma 4 E4B, 26B A4B, and 31B support image inputs natively. The E2B model is text-only.
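Because only some variants accept images, it can help to validate requests before dispatching them to a runtime. A minimal sketch, assuming the variant split described above; the model identifiers and the `validate_request` helper are illustrative, not part of any official API:

```python
# Illustrative capability check based on the variant list above.
# The model identifiers here are assumptions, not official names.
VISION_CAPABLE = {"gemma4-e4b", "gemma4-26b-a4b", "gemma4-31b"}
TEXT_ONLY = {"gemma4-e2b"}

def validate_request(model: str, images: list) -> None:
    """Raise early instead of letting the runtime fail on unsupported input."""
    if images and model in TEXT_ONLY:
        raise ValueError(f"{model} is text-only and cannot accept images")

validate_request("gemma4-e4b", ["img.png"])  # passes silently
try:
    validate_request("gemma4-e2b", ["img.png"])
except ValueError as e:
    print(e)
```

Failing fast on the client side gives a clearer error than whatever the runtime would report after the request is already in flight.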
## Supported input types
- JPEG, PNG, WebP, GIF (first frame only)
- Base64-encoded inline images
- URL references (runtime-dependent)
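Since only the first frame of a GIF is used, you may want to extract that frame yourself and send a static image instead. A sketch using Pillow; the in-memory GIF here is a stand-in for a real animation file:

```python
from io import BytesIO

from PIL import Image

# Build a tiny two-frame GIF in memory (stand-in for a real animated file).
frames = [Image.new("RGB", (32, 32), color) for color in ("red", "blue")]
buf = BytesIO()
frames[0].save(buf, format="GIF", save_all=True, append_images=frames[1:])

# Extract the first frame explicitly and re-encode it as PNG before
# passing it to the model, so the behavior is unambiguous.
gif = Image.open(BytesIO(buf.getvalue()))
gif.seek(0)                       # position at the first frame
first_frame = gif.convert("RGB")  # drop GIF palette/transparency quirks
out = BytesIO()
first_frame.save(out, format="PNG")
print(first_frame.size)  # → (32, 32)
```

Converting up front also avoids surprises if a runtime handles animated GIFs differently.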
## Ollama

```shell
ollama run gemma4:4b "What is in this image?" --image ./screenshot.png
```

Via the API:

```python
import base64

import ollama

with open("image.png", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode()

response = ollama.chat(
    model="gemma4:4b",
    messages=[{
        "role": "user",
        "content": "Describe this chart.",
        "images": [img_b64],
    }]
)
print(response["message"]["content"])
```

## Hugging Face Transformers
```python
from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image

processor = AutoProcessor.from_pretrained("google/gemma-4-4b-it")
model = AutoModelForCausalLM.from_pretrained("google/gemma-4-4b-it")

image = Image.open("diagram.png")
inputs = processor(
    text="<image>\nExplain the architecture shown.",
    images=image,
    return_tensors="pt"
)

output = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(output[0], skip_special_tokens=True))
```

## Token budget for images
Each image consumes vision tokens from your context window. Approximate costs:
| Image size | Tokens consumed |
|---|---|
| 224×224 | ~256 tokens |
| 512×512 | ~1024 tokens |
| 1024×1024 | ~4096 tokens |
Large images in long conversations can rapidly fill the context window. Resize images to the smallest useful resolution before passing them to the model.
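One way to automate that resizing is to estimate token cost from pixel count and downscale until the estimate fits a budget. A sketch assuming the roughly one-token-per-256-pixels ratio implied by the larger entries in the table above; that ratio is an approximation read off the table, not an exact tokenizer rule:

```python
import math

from PIL import Image

# Approximate ratio inferred from the table above (e.g. 512x512 ≈ 1024
# tokens). This is an assumption, not a documented tokenizer constant.
PIXELS_PER_TOKEN = 256

def fit_to_token_budget(img: Image.Image, max_tokens: int) -> Image.Image:
    """Downscale img (preserving aspect ratio) until its estimated
    vision-token cost is within max_tokens."""
    est_tokens = (img.width * img.height) / PIXELS_PER_TOKEN
    if est_tokens <= max_tokens:
        return img
    scale = math.sqrt(max_tokens * PIXELS_PER_TOKEN / (img.width * img.height))
    new_size = (max(1, int(img.width * scale)), max(1, int(img.height * scale)))
    return img.resize(new_size, Image.LANCZOS)

big = Image.new("RGB", (2048, 2048))
small = fit_to_token_budget(big, max_tokens=1024)
print(small.size)  # → (512, 512)
```

Scaling by the square root of the token ratio keeps the aspect ratio while hitting the pixel budget, since token cost grows with area rather than edge length.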
## What works well
- Screenshots and UIs — Gemma 4 accurately identifies UI elements and describes layouts
- Charts and graphs — Can extract data points and describe trends
- Code screenshots — Can read and reproduce code from screenshots
- Documents — OCR-quality text extraction from printed or typed documents
## Limitations
- No video support (use first frame extraction)
- No real-time camera input
- Maximum ~4 images per request recommended for stable inference
- The E2B model does not support image inputs
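If a task involves more images than the recommended maximum, one workable pattern is to split the list into separate requests. A minimal sketch; the `batch_images` helper is illustrative, not part of any runtime API:

```python
# Split a long image list into requests of at most four images each,
# per the stability recommendation above. Helper name is illustrative.
MAX_IMAGES_PER_REQUEST = 4

def batch_images(images: list, limit: int = MAX_IMAGES_PER_REQUEST) -> list:
    """Chunk images into lists of at most `limit` items, preserving order."""
    return [images[i:i + limit] for i in range(0, len(images), limit)]

batches = batch_images([f"page_{n}.png" for n in range(10)])
print([len(b) for b in batches])  # → [4, 4, 2]
```

Each batch can then be sent as its own request; note that the model sees no context across batches unless you carry forward a summary in the conversation.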