gemma4.dev · MkSaaS Docs
Gemma 4 Developer Hub › Concepts

Multimodal Inputs

How Gemma 4 processes images, and how to pass vision inputs across different runtimes.

Gemma 4 E4B, 26B A4B, and 31B support image inputs natively. The E2B model is text-only.

Supported input types

  • JPEG, PNG, WebP, GIF (first frame only)
  • Base64-encoded inline images
  • URL references (runtime-dependent)

Ollama

The Ollama CLI has no image flag; reference the file by its path inside the prompt and Ollama attaches it automatically:

ollama run gemma4:4b "What is in this image? ./screenshot.png"

Via API:

import ollama, base64

with open("image.png", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode()

response = ollama.chat(
    model="gemma4:4b",
    messages=[{
        "role": "user",
        "content": "Describe this chart.",
        "images": [img_b64],
    }]
)
print(response["message"]["content"])

Hugging Face Transformers

from transformers import AutoProcessor, AutoModelForImageTextToText
from PIL import Image

processor = AutoProcessor.from_pretrained("google/gemma-4-4b-it")
# Vision-language checkpoints load through the image-text-to-text class;
# AutoModelForCausalLM would not load the vision tower.
model = AutoModelForImageTextToText.from_pretrained("google/gemma-4-4b-it")

image = Image.open("diagram.png")
inputs = processor(
    text="<image>\nExplain the architecture shown.",
    images=image,
    return_tensors="pt"
)

output = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, not the echoed prompt.
print(processor.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))

Token budget for images

Each image consumes vision tokens from your context window. Approximate costs:

Image size     Tokens consumed
224×224        ~256
512×512        ~1024
1024×1024      ~4096

Large images in long conversations can rapidly fill the context window. Resize images to the smallest useful resolution before passing them to the model.
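One way to enforce that cap is to downscale before encoding. A minimal sketch using Pillow; the `prepare_image` helper and the 512-pixel cap are illustrative choices, not part of any runtime's API:

```python
import base64
import io

from PIL import Image

def prepare_image(img, max_side=512):
    """Downscale so the longest side is at most max_side (aspect ratio
    preserved, never upscaled), then return the result base64-encoded
    as PNG for inline use."""
    img = img.convert("RGB")
    img.thumbnail((max_side, max_side))  # resizes in place
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    return base64.b64encode(buf.getvalue()).decode("ascii")
```

Going by the table above, shrinking a 1024×1024 screenshot to 512×512 drops its cost from roughly 4096 to roughly 1024 vision tokens.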

What works well

  • Screenshots and UIs — Gemma 4 accurately identifies UI elements and describes layouts
  • Charts and graphs — Can extract data points and describe trends
  • Code screenshots — Can read and reproduce code from screenshots
  • Documents — OCR-quality text extraction from printed or typed documents

Limitations

  • No video support (use first frame extraction)
  • No real-time camera input
  • Maximum ~4 images per request recommended for stable inference
  • The E2B model does not support image inputs
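Since only the first GIF frame is used (and video is unsupported), it can help to make the frame selection explicit before sending. A sketch with Pillow; `first_frame` is a hypothetical helper, not a runtime API:

```python
from PIL import Image

def first_frame(path):
    """Return frame 0 of a (possibly animated) GIF as an RGB image,
    mirroring the 'first frame only' handling described above."""
    with Image.open(path) as im:
        im.seek(0)  # make the frame choice explicit
        return im.convert("RGB")
```

The same pattern applies to video: extract a representative frame with your tool of choice and pass it as a still image.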
