Using Moondream with Transformers
Use this guide to run the latest Moondream model with the Transformers library from Hugging Face.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image

# Initialize the model
model = AutoModelForCausalLM.from_pretrained(
    "vikhyatk/moondream2",
    revision="2025-01-09",
    trust_remote_code=True,
    # Uncomment for GPU acceleration & pip install accelerate
    # device_map={"": "cuda"}
)

# Load your image
image = Image.open("<PATH_TO_YOUR_IMAGE>")

# 1. Image Captioning
print("Short caption:")
print(model.caption(image, length="short")["caption"])

print("\nDetailed caption:")
for t in model.caption(image, length="normal", stream=True)["caption"]:
    print(t, end="", flush=True)

# 2. Visual Question Answering
print("\nAsking questions about the image:")
print(model.query(image, "How many people are in the image?")["answer"])

# 3. Object Detection
print("\nDetecting objects:")
objects = model.detect(image, "face")["objects"]
print(f"Found {len(objects)} face(s)")

# 4. Visual Pointing
print("\nLocating objects:")
points = model.point(image, "person")["points"]
print(f"Found {len(points)} person(s)")
```
Additional Features
Direct Integration
Use Moondream directly with the Hugging Face Transformers library for maximum flexibility (a minimal loading sketch follows the list below).
- Import and use like any HF model
- Full control over model parameters
- Native PyTorch integration
- Support for custom pipelines
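Because Moondream loads like any other Transformers checkpoint, the usual PyTorch controls apply. The sketch below illustrates this; the explicit `.to(device)`, `eval()`, and `no_grad()` calls are ordinary PyTorch options shown for illustration, not Moondream requirements.

```python
import torch
from transformers import AutoModelForCausalLM
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load Moondream like any other Hugging Face checkpoint; trust_remote_code is
# required because the repository ships the model's custom code.
model = AutoModelForCausalLM.from_pretrained(
    "vikhyatk/moondream2",
    revision="2025-01-09",
    trust_remote_code=True,
)
model.to(device)   # standard PyTorch device placement
model.eval()       # inference mode, as with any nn.Module

image = Image.open("<PATH_TO_YOUR_IMAGE>")
with torch.no_grad():  # no gradients are needed for inference
    print(model.caption(image, length="short")["caption"])
```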
GPU Acceleration
Optional CUDA support for faster inference on GPU devices (see the GPU sketch after this list).
- Enable GPU acceleration with one line
- Automatic mixed precision support
- Batch processing for efficiency
- Memory-efficient inference
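A minimal GPU sketch, assuming a CUDA device and the optional `accelerate` package for `device_map` support. The `torch.autocast` context is shown only as one common way to run mixed-precision inference and is not required.

```python
# Requires: pip install accelerate (for device_map support)
import torch
from transformers import AutoModelForCausalLM
from PIL import Image

# One extra argument places the whole model on the GPU.
model = AutoModelForCausalLM.from_pretrained(
    "vikhyatk/moondream2",
    revision="2025-01-09",
    trust_remote_code=True,
    device_map={"": "cuda"},
)

image = Image.open("<PATH_TO_YOUR_IMAGE>")

# Optional: mixed-precision inference on CUDA via autocast.
with torch.inference_mode(), torch.autocast("cuda"):
    print(model.query(image, "How many people are in the image?")["answer"])
```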
Streaming Support
Stream outputs for real-time responses in caption and detect modes (see the streaming sketch after this list).
- Real-time caption generation
- Progressive object detection
- Low-latency responses
- Memory-efficient streaming
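Streaming uses the `stream=True` flag shown in the quick-start example. A minimal sketch that prints a caption chunk by chunk as it is generated, assuming `model` and `image` are loaded as in the quick-start code above:

```python
import sys

# With stream=True, the "caption" value yields text chunks as they are
# produced, so output can be shown to the user with low latency.
for chunk in model.caption(image, length="normal", stream=True)["caption"]:
    sys.stdout.write(chunk)
    sys.stdout.flush()
print()
```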
Full API Access
Access to all core features: captioning, querying, detection, and pointing (a combined sketch follows the list below).
- Generate image captions
- Answer visual questions
- Detect objects and faces
- Point to specific objects
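All four capabilities map to the `caption`, `query`, `detect`, and `point` methods used in the quick-start example. A short sketch that exercises each, assuming `model` and `image` are loaded as above (the question text is just an illustration):

```python
# Captioning, visual Q&A, detection, and pointing on one image.
caption = model.caption(image, length="short")["caption"]
answer = model.query(image, "What is the main subject doing?")["answer"]
faces = model.detect(image, "face")["objects"]
people = model.point(image, "person")["points"]

print(caption)
print(answer)
print(f"Found {len(faces)} face(s) and {len(people)} person(s)")
```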
Tips
You can load multiple instances of the model on a single device if it has enough VRAM; Moondream uses roughly 4-5 GB of VRAM per instance. This lets you run several inferences at the same time. To do this, simply initialize the model more than once:
```python
model = AutoModelForCausalLM.from_pretrained(
    "vikhyatk/moondream2",
    revision="2025-01-09",
    trust_remote_code=True,
    device_map={"": "cuda"},
)

model2 = AutoModelForCausalLM.from_pretrained(
    "vikhyatk/moondream2",
    revision="2025-01-09",
    trust_remote_code=True,
    device_map={"": "cuda"},
)
```
Each call to the model automatically re-encodes your image. If you plan to run multiple inferences on the same image, encode it once and reuse the encoded image for each call:
```python
image = Image.open("<PATH_TO_YOUR_IMAGE>")
encoded_image = model.encode_image(image)

# Reuse the encoded image for each inference
print(model.caption(encoded_image, length="short")["caption"])
print(model.query(encoded_image, "How many people are in the image?")["answer"])
```