Using Moondream with Transformers
Basic and advanced usage of Moondream with Hugging Face Transformers.
Prerequisites
First, install the dependencies:
pip install "transformers>=4.51.1" "torch>=2.7.0" "accelerate>=1.10.0" "Pillow>=11.0.0"
Basic Usage
Moondream provides four core skills: image captioning, visual question answering, object detection, and visual pointing.
- Caption
- Query
- Detect
- Point
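The snippets below assume a model and an image loaded roughly as follows; this is a minimal sketch using the Moondream 2 checkpoint and revision from the streaming example later on this page, with a placeholder image path:
from transformers import AutoModelForCausalLM
from PIL import Image

model = AutoModelForCausalLM.from_pretrained(
    "vikhyatk/moondream2",
    revision="2025-01-09",
    trust_remote_code=True,
    device_map="cuda",  # "mps" on Apple Silicon
)
image = Image.open("path/to/your/image.jpg")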
Captioning does not require a prompt; just give Moondream an image and it will produce the caption right away. For fine-grained control, you can set the desired length of the caption, as well as adjust sampling settings.
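As a minimal sketch, using the "short" and "normal" length values that appear in the later examples on this page:
# Caption the image at two different lengths
print(model.caption(image, length="short")["caption"])
print(model.caption(image, length="normal")["caption"])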
The query skill can be used to ask open-ended questions about images. In reasoning mode, Moondream may decide to ground its thoughts by pointing out parts of the image. These grounded points are visible in the model's chain of thought.
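A minimal query sketch, mirroring the call used in the image-encoding example below (the question is just an example):
# Ask an open-ended question about the image
answer = model.query(image, "What is happening in this image?")["answer"]
print(answer)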
The detect skill provides bounding boxes for objects in an image based on a prompt. Bounding box coordinates are normalized to the image dimensions: values range from 0 to 1, where 1 corresponds to the right or bottom edge of the image.
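A minimal detection sketch; the prompt is an example, and it assumes each result is returned under "objects" with normalized x_min, y_min, x_max, and y_max keys:
# Detect all faces in the image
objects = model.detect(image, "face")["objects"]
for obj in objects:
    print(obj["x_min"], obj["y_min"], obj["x_max"], obj["y_max"])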
The point skill marks objects in an image based on a prompt. Point coordinates are normalized to the image dimensions: values range from 0 to 1, where 1 corresponds to the right or bottom edge of the image.
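A minimal pointing sketch; the prompt is an example, and it assumes each point is returned under "points" with normalized x and y keys:
# Point at every person in the image
points = model.point(image, "person")["points"]
for point in points:
    print(point["x"], point["y"])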
Moondream3
Sign up for early access to start using Moondream3. Currently, only NVIDIA GPUs with 24GB+ of memory are supported; quantized and Apple Silicon versions are coming soon.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "moondream/moondream3-preview",
    trust_remote_code=True,
    dtype=torch.bfloat16,
    device_map="cuda",
)
Advanced Features
Streaming
Caption and Query both support streaming, displaying tokens immediately as they are generated.
from transformers import AutoModelForCausalLM
from PIL import Image

model = AutoModelForCausalLM.from_pretrained(
    "vikhyatk/moondream2",
    revision="2025-01-09",
    trust_remote_code=True,
    device_map="mps",  # "cuda" for NVIDIA GPUs
)

image = Image.open("path/to/your/image.jpg")

# Stream the caption token by token as it is generated
for t in model.caption(image, length="normal", stream=True)["caption"]:
    print(t, end="", flush=True)
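Streaming a query answer follows the same pattern; a minimal sketch, assuming the streamed tokens are exposed under the same "answer" key as the non-streaming call:
for t in model.query(image, "What is happening in this image?", stream=True)["answer"]:
    print(t, end="", flush=True)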
Compile
If you are using the model to make multiple inference calls, compiling can noticeably improve generation speed. This is especially true for Moondream3, as it uses FlexAttention. After the model has been created, call model.compile().
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "moondream/moondream3-preview",
    trust_remote_code=True,
    dtype=torch.bfloat16,
    device_map="cuda",
)
model.compile()
Reuse Image Encoding
If you are using an image for multiple inferences, it can be beneficial to reuse the image encoding. Since encoding makes up a large portion of the time Moondream takes to generate a response, reusing it allows you to quickly generate multiple responses for the same image.
from PIL import Image

# `model` is a Moondream model loaded as in the earlier examples
image = Image.open("path/to/your/image.jpg")
encoded_image = model.encode_image(image)

# Reuse the encoded image for each inference
print(model.caption(encoded_image, length="short")["caption"])
print(model.query(encoded_image, "How many people are in the image?")["answer"])