Using Moondream with Transformers

Basic and advanced features of Moondream on Hugging Face Transformers.

Prerequisites

First, install the dependencies:

pip install "transformers>=4.51.1" "torch>=2.7.0" "accelerate>=1.10.0" "Pillow>=11.0.0"

Basic Usage

Moondream provides four core skills: image captioning, visual question answering, object detection, and visual pointing.

Captioning does not require a prompt; just give Moondream an image and it will produce the caption right away. For fine-grained control, you can set the desired length of the caption, as well as adjust sampling settings.

Parameters:

  • image: PIL.Image or encoded image - The image to process.
  • length: str - Caption detail level: "short", "normal", or "long" (default: "normal").
  • stream: bool - Whether to stream the response token by token (default: False).
  • settings: dict - Optional settings with:
    • temperature: float - Controls randomness in generation (default: 0.5).
    • max_tokens: int - Max number of tokens that can be generated (default: 768).
    • top_p: float - Nucleus sampling threshold: sample only from the smallest set of most likely tokens whose cumulative probability is >= top_p (default: 0.3).

from transformers import AutoModelForCausalLM
from PIL import Image
import torch

# Load the model
model = AutoModelForCausalLM.from_pretrained(
    "vikhyatk/moondream2",
    trust_remote_code=True,
    dtype=torch.bfloat16,
    device_map="mps",  # "cuda" on NVIDIA GPUs
)

# Load your image
image = Image.open("path/to/your/image.jpg")

# Optionally set sampling settings
settings = {"temperature": 0.5, "max_tokens": 768, "top_p": 0.3}

# Generate a short caption
short_result = model.caption(
    image,
    length="short",
    settings=settings,
)
print(short_result)

Example Response:

{
  "caption": "Four brown horses with red and black harnesses plow a field, guided by two men in western attire, with a majestic mountain range in the background."
}
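
The remaining skills (visual question answering, object detection, and visual pointing) follow the same calling pattern as captioning. Here is a minimal sketch reusing the model and image loaded above; the detect and point calls, and their return keys ("objects" and "points"), are assumed from the moondream2 model card:

# Visual question answering
answer = model.query(image, "How many horses are in the image?")["answer"]
print(answer)

# Object detection: bounding boxes for the requested object (return key assumed)
objects = model.detect(image, "horse")["objects"]
print(f"Found {len(objects)} horse(s)")

# Visual pointing: center points for the requested object (return key assumed)
points = model.point(image, "horse")["points"]
print(f"Found {len(points)} point(s)")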

Moondream3

Sign up for early access to start using Moondream3. Currently, only NVIDIA GPUs with 24GB+ of memory are supported; quantized and Apple Silicon versions are coming soon.

model = AutoModelForCausalLM.from_pretrained(
    "moondream/moondream3-preview",
    trust_remote_code=True,
    dtype=torch.bfloat16,
    device_map="cuda",
)

Advanced Features

Streaming

Both caption and query support streaming, yielding tokens as soon as they are generated so you can display them immediately instead of waiting for the full response.

model = AutoModelForCausalLM.from_pretrained(
    "vikhyatk/moondream2",
    revision="2025-01-09",
    trust_remote_code=True,
    device_map="mps",  # "cuda" on NVIDIA GPUs
)

image = Image.open("path/to/your/image.jpg")

for t in model.caption(image, length="normal", stream=True)["caption"]:
    print(t, end="", flush=True)
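
Query streams the same way; a short sketch with an illustrative question, reusing the model and image from above (the stream keyword is assumed to mirror caption):

for t in model.query(image, "How many horses are in the image?", stream=True)["answer"]:
    print(t, end="", flush=True)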

Compile

If you are using the model to make multiple inference calls, compiling can noticeably improve generation speed. This is especially true for Moondream3, as it uses FlexAttention. After the model has been created, call model.compile().

model = AutoModelForCausalLM.from_pretrained(
    "moondream/moondream3-preview",
    trust_remote_code=True,
    dtype=torch.bfloat16,
    device_map="cuda",
)

model.compile()
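
Compilation is typically lazy, so the first inference after model.compile() can take noticeably longer while kernels are built. A rough sketch of warming up before timing; the warm-up-then-measure pattern is illustrative, not part of the Moondream API:

import time

image = Image.open("path/to/your/image.jpg")

model.caption(image, length="short")  # warm-up call; may trigger compilation

start = time.perf_counter()
model.caption(image, length="short")
print(f"caption after warm-up: {time.perf_counter() - start:.2f}s")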

Reuse Image Encoding

If you are using an image for multiple inferences, it can be beneficial to reuse the image encoding. Since encoding makes up a large portion of the time Moondream takes to generate a response, reusing it allows you to quickly generate multiple responses for the same image.

image = Image.open("path/to/your/image.jpg")
encoded_image = model.encode_image(image)

# Reuse the encoded image for each inference
print(model.caption(encoded_image, length="short")["caption"])
print(model.query(encoded_image, "How many people are in the image?")["answer"])
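
To gauge the savings on your hardware, you can compare a call that encodes the image from scratch against one that reuses the cached encoding; a rough timing sketch (the question is illustrative):

import time

start = time.perf_counter()
print(model.query(image, "What colors are present?")["answer"])  # re-encodes the image
print(f"with re-encoding: {time.perf_counter() - start:.2f}s")

start = time.perf_counter()
print(model.query(encoded_image, "What colors are present?")["answer"])  # reuses the encoding
print(f"with cached encoding: {time.perf_counter() - start:.2f}s")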