Advanced Transformers Integration

Using Moondream with Transformers

Use this guide to run the latest Moondream model with the Transformers library from Hugging Face.

from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image
 
# Initialize the model
model = AutoModelForCausalLM.from_pretrained(
    "vikhyatk/moondream2",
    revision="2025-01-09",
    trust_remote_code=True,
    # Uncomment to run on a GPU (requires: pip install accelerate)
    # device_map={"": "cuda"}
)
 
# Load your image
image = Image.open("<PATH_TO_YOUR_IMAGE>")
 
# 1. Image Captioning
print("Short caption:")
print(model.caption(image, length="short")["caption"])
 
print("\nDetailed caption:")
for t in model.caption(image, length="normal", stream=True)["caption"]:
    print(t, end="", flush=True)
 
# 2. Visual Question Answering
print("\nAsking questions about the image:")
print(model.query(image, "How many people are in the image?")["answer"])
 
# 3. Object Detection
print("\nDetecting objects:")
objects = model.detect(image, "face")["objects"]
print(f"Found {len(objects)} face(s)")
 
# 4. Visual Pointing
print("\nLocating objects:")
points = model.point(image, "person")["points"]
print(f"Found {len(points)} person(s)")

Additional Features

Direct Integration

Use Moondream directly with the Hugging Face Transformers library for maximum flexibility (a short sketch follows this list)

  • Import and use like any HF model
  • Full control over model parameters
  • Native PyTorch integration
  • Support for custom pipelines
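
Since the model loads as a standard AutoModelForCausalLM, it behaves like any other PyTorch module. The sketch below shows that native integration; it assumes nothing beyond the usual nn.Module surface (parameter iteration, eval mode, no-grad inference):

import torch

# The loaded model is a regular torch.nn.Module, so standard
# PyTorch idioms apply directly.
model.eval()  # inference mode: disables dropout and similar layers

# Inspect the parameter count like any other HF model
n_params = sum(p.numel() for p in model.parameters())
print(f"Parameters: {n_params / 1e9:.2f}B")

# Run inference without tracking gradients
with torch.no_grad():
    print(model.caption(image, length="short")["caption"])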

GPU Acceleration

Optional CUDA support for faster inference on GPU devices (see the load sketch after this list)

  • Enable GPU acceleration with one line
  • Automatic mixed precision support
  • Batch processing for efficiency
  • Memory-efficient inference
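
A minimal GPU load might look like the sketch below. It assumes a CUDA device is available; torch_dtype is a standard from_pretrained argument, but whether half precision suits your accuracy needs is an assumption to verify:

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "vikhyatk/moondream2",
    revision="2025-01-09",
    trust_remote_code=True,
    device_map={"": "cuda"},    # place all weights on the first GPU (needs accelerate)
    torch_dtype=torch.float16,  # assumption: fp16 is acceptable for your use case
)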

Streaming Support

Stream outputs for real-time responses in caption and detect modes (sketched below)

  • Real-time caption generation
  • Progressive object detection
  • Low-latency responses
  • Memory-efficient streaming
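
The streaming pattern from the walkthrough above generalizes to any consumer; the sketch below accumulates the streamed caption while printing tokens as they arrive:

# Stream the caption, printing tokens as they arrive and
# keeping them for later use
chunks = []
for chunk in model.caption(image, length="normal", stream=True)["caption"]:
    chunks.append(chunk)
    print(chunk, end="", flush=True)
print()
full_caption = "".join(chunks)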

Full API Access

Access to all core features: captioning, querying, detection, and pointing (output handling is sketched below)

  • Generate image captions
  • Answer visual questions
  • Detect objects and faces
  • Point to specific objects
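
The sketch below consumes the structured outputs from detect and point. It assumes each detected object is a dict with normalized x_min/y_min/x_max/y_max bounds and each point is a dict with normalized x/y coordinates; check the model card for your revision to confirm the exact schema:

# Assumed schema: coordinates normalized to the 0-1 range
for obj in model.detect(image, "face")["objects"]:
    print(f"face: ({obj['x_min']:.2f}, {obj['y_min']:.2f}) "
          f"to ({obj['x_max']:.2f}, {obj['y_max']:.2f})")

for pt in model.point(image, "person")["points"]:
    print(f"person at ({pt['x']:.2f}, {pt['y']:.2f})")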

Tips

If your device has enough VRAM, you can load multiple instances of the model and run several inferences at the same time; each Moondream instance uses 4-5 GB of VRAM. To do this, initialize the model multiple times:

model = AutoModelForCausalLM.from_pretrained(
    "vikhyatk/moondream2",
    revision="2025-01-09",
    trust_remote_code=True,
    device_map={"": "cuda"},
)
 
model2 = AutoModelForCausalLM.from_pretrained(
    "vikhyatk/moondream2",
    revision="2025-01-09",
    trust_remote_code=True,
    device_map={"": "cuda"},
)
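
With two independent instances loaded, you can overlap inferences with a thread pool. A minimal sketch; image1 and image2 are hypothetical pre-loaded PIL images:

from concurrent.futures import ThreadPoolExecutor

# image1 / image2 are placeholder names for images you have loaded
with ThreadPoolExecutor(max_workers=2) as pool:
    f1 = pool.submit(model.caption, image1, length="short")
    f2 = pool.submit(model2.caption, image2, length="short")
    print(f1.result()["caption"])
    print(f2.result()["caption"])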

The model re-encodes your image on every call. If you plan to run multiple inferences on the same image, it is best to encode it once and reuse the encoded result:

image = Image.open("<PATH_TO_YOUR_IMAGE>")
encoded_image = model.encode_image(image)
 
# Reuse the encoded image for each inference
print(model.caption(encoded_image, length="short")["caption"])
print(model.query(encoded_image, "How many people are in the image?")["answer"])
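
A rough way to see the saving is to time repeated queries against the raw image versus the pre-encoded one; this sketch uses only the calls shown above:

import time

start = time.perf_counter()
for _ in range(5):
    model.query(image, "How many people are in the image?")  # re-encodes every call
print(f"raw image:     {time.perf_counter() - start:.1f}s")

start = time.perf_counter()
for _ in range(5):
    model.query(encoded_image, "How many people are in the image?")  # reuses the encoding
print(f"encoded image: {time.perf_counter() - start:.1f}s")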