Using Moondream with Transformers

Basic and advanced features of Moondream on Hugging Face Transformers.

Prerequisites

First, install the dependencies:

pip install "transformers>=4.51.1" "torch>=2.7.0" "accelerate>=1.10.0" "Pillow>=11.0.0"

Basic Usage

Moondream provides four core skills: image captioning, visual question answering, object detection, and visual pointing.

Captioning does not require a prompt; just give Moondream an image and it will produce the caption right away. For fine-grained control, you can set the desired length of the caption, as well as adjust sampling settings.

Parameters:

  • image: PIL.Image or encoded image - The image to process.
  • length: str - Caption detail level: "short", "normal", or "long" (default: "normal").
  • stream: bool - Whether to stream the response token by token (default: False).
  • settings: dict - Optional settings with:
    • temperature: float - Controls randomness in generation (default: 0.5).
    • max_tokens: int - Max number of tokens that can be generated (default: 768).
    • top_p: float - Nucleus sampling threshold: sample only from the smallest set of most likely tokens whose cumulative probability is >= top_p (default: 0.3).

from transformers import AutoModelForCausalLM
from PIL import Image
import torch

# Load the model
model = AutoModelForCausalLM.from_pretrained(
    "vikhyatk/moondream2",
    trust_remote_code=True,
    dtype=torch.bfloat16,
    device_map="mps",  # "cuda" on NVIDIA GPUs
)

# Load your image
image = Image.open("path/to/your/image.jpg")

# Optionally set sampling settings
settings = {"temperature": 0.5, "max_tokens": 768, "top_p": 0.3}

# Generate a short caption
short_result = model.caption(
    image,
    length="short",
    settings=settings,
)
print(short_result)

Example Response:

{
  "caption": "Four brown horses with red and black harnesses plow a field, guided by two men in western attire, with a majestic mountain range in the background."
}
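
The remaining skills (visual question answering, object detection, and visual pointing) follow the same calling pattern as captioning. Here is a minimal sketch reusing the model and image loaded above; the detect and point calls, and their return keys ("objects" and "points"), are assumed from the moondream2 model card:

# Visual question answering
answer = model.query(image, "How many horses are in the image?")["answer"]
print(answer)

# Object detection: bounding boxes for the requested object (return key assumed)
objects = model.detect(image, "horse")["objects"]
print(f"Found {len(objects)} horse(s)")

# Visual pointing: center points for the requested object (return key assumed)
points = model.point(image, "horse")["points"]
print(f"Found {len(points)} point(s)")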

Moondream3

Sign up for early access to start using Moondream3. Currently, only NVIDIA GPUs with 24GB+ of memory are supported; quantized and Apple Silicon versions are coming soon.

model = AutoModelForCausalLM.from_pretrained(
    "moondream/moondream3-preview",
    trust_remote_code=True,
    dtype=torch.bfloat16,
    device_map="cuda",
)

Advanced Features

Streaming

Both caption and query support streaming, yielding tokens as soon as they are generated so you can display them immediately instead of waiting for the full response.

model = AutoModelForCausalLM.from_pretrained(
    "vikhyatk/moondream2",
    revision="2025-01-09",
    trust_remote_code=True,
    device_map="mps",  # "cuda" on NVIDIA GPUs
)

image = Image.open("path/to/your/image.jpg")

for t in model.caption(image, length="normal", stream=True)["caption"]:
    print(t, end="", flush=True)
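
Query streams the same way; a short sketch with an illustrative question, reusing the model and image from above (the stream keyword is assumed to mirror caption):

for t in model.query(image, "How many horses are in the image?", stream=True)["answer"]:
    print(t, end="", flush=True)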

Compile

If you are using the model to make multiple inference calls, compiling can noticeably improve generation speed. This is especially true for Moondream3, as it uses FlexAttention. After the model has been created, call model.compile().

model = AutoModelForCausalLM.from_pretrained(
    "moondream/moondream3-preview",
    trust_remote_code=True,
    dtype=torch.bfloat16,
    device_map="cuda",
)

model.compile()
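
Compilation is typically lazy, so the first inference after model.compile() can take noticeably longer while kernels are built. A rough sketch of warming up before timing; the warm-up-then-measure pattern is illustrative, not part of the Moondream API:

import time

image = Image.open("path/to/your/image.jpg")

model.caption(image, length="short")  # warm-up call; may trigger compilation

start = time.perf_counter()
model.caption(image, length="short")
print(f"caption after warm-up: {time.perf_counter() - start:.2f}s")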

Reuse Image Encoding

If you are using an image for multiple inferences, it can be beneficial to reuse the image encoding. Since encoding makes up a large portion of the time Moondream takes to generate a response, reusing it allows you to quickly generate multiple responses for the same image.

image = Image.open("path/to/your/image.jpg")
encoded_image = model.encode_image(image)

# Reuse the encoded image for each inference
print(model.caption(encoded_image, length="short")["caption"])
print(model.query(encoded_image, "How many people are in the image?")["answer"])
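
To gauge the savings on your hardware, you can compare a call that encodes the image from scratch against one that reuses the cached encoding; a rough timing sketch (the question is illustrative):

import time

start = time.perf_counter()
print(model.query(image, "What colors are present?")["answer"])  # re-encodes the image
print(f"with re-encoding: {time.perf_counter() - start:.2f}s")

start = time.perf_counter()
print(model.query(encoded_image, "What colors are present?")["answer"])  # reuses the encoding
print(f"with cached encoding: {time.perf_counter() - start:.2f}s")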