Python SDK

pip install moondream

Creating a finetune

New finetune

import moondream as md

ft = md.ft(
    api_key="your-api-key",
    name="my-finetune",
    rank=8,
)
| Parameter | Type | Required | Description |
|---|---|---|---|
| api_key | str | yes | Moondream API key |
| name | str | yes* | Unique name (alphanumeric, dots, hyphens, underscores) |
| rank | int | yes* | LoRA rank: 8, 16, 24, or 32 |
| endpoint | str | no | API endpoint (default: https://api.moondream.ai/v1/tuning) |

*Required when creating a new finetune.

If a finetune with the same name and rank already exists, the existing one is returned. If the name exists with a different rank, the server returns 409 Conflict.
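Because the call is create-or-get, it is safe to re-run. A quick sketch of that behavior:

ft_a = md.ft(api_key="your-api-key", name="my-finetune", rank=8)
ft_b = md.ft(api_key="your-api-key", name="my-finetune", rank=8)
assert ft_a.finetune_id == ft_b.finetune_id  # same finetune, not a new one

# md.ft(api_key="your-api-key", name="my-finetune", rank=16) would instead
# fail with 409 Conflict, since the name already exists at rank 8.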

Existing finetune

ft = md.ft(
    api_key="your-api-key",
    finetune_id="01HXYZ...",
)

Cannot combine finetune_id with name or rank.

Properties

ft.finetune_id  # "01HXYZ..."
ft.name         # "my-finetune"
ft.rank         # 8

Generating rollouts

ft.rollouts()

Generate rollouts for a single request.

response = ft.rollouts(
    "query",
    image=pil_image,
    question="What color is the car?",
    num_rollouts=4,
    settings={"temperature": 1.0, "max_tokens": 128},
)
| Parameter | Type | Required | Description |
|---|---|---|---|
| skill | str | yes | "query", "point", or "detect" |
| image | PIL.Image or EncodedImage | depends | Required for point/detect, optional for query |
| question | str | depends | Required for query |
| object | str | depends | Required for point/detect |
| num_rollouts | int | no | Number of attempts, 1–16 (default: 1) |
| settings | dict | no | Sampling settings (see Settings) |
| reasoning | bool | no | Enable reasoning (default: false, query only) |
| spatial_refs | list | no | Spatial references as [x, y] or [x_min, y_min, x_max, y_max] (query only) |
| ground_truth | dict | no | For automatic reward computation (point/detect only, see Ground truth) |

Returns a dict:

{
    "request": { ... },        # Pass back to train_step unchanged
    "rollouts": [
        {
            "skill": "query",
            "output": {"answer": "red"},
            ...                # Opaque training metadata
        }
    ],
    "rewards": [0.8, ...]      # Present if ground_truth was provided, otherwise None
}

Rollout output by skill:

| Skill | Output |
|---|---|
| query | {"answer": "...", "reasoning": {...}} (reasoning only if enabled) |
| point | {"points": [{"x": 0.52, "y": 0.31}, ...]} |
| detect | {"objects": [{"x_min": 0.1, "y_min": 0.2, "x_max": 0.4, "y_max": 0.6}, ...]} |
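These shapes can be consumed directly when scoring rollouts yourself. For example, with the query response above:

answers = [r["output"]["answer"] for r in response["rollouts"]]

# The same pattern applies to the other skills:
# points_per_rollout = [r["output"]["points"] for r in response["rollouts"]]   # point
# boxes_per_rollout = [r["output"]["objects"] for r in response["rollouts"]]   # detect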

ft.rollout_stream()

Generate rollouts concurrently in background threads, yielding results as they complete. This overlaps rollout generation with training — while you process one result, the next batch is already in flight.

requests = (
    (example, {
        "skill": "query",
        "image": example["image"],
        "question": "What is this?",
        "num_rollouts": 4,
        "settings": {"temperature": 1.0},
    })
    for example in training_data
)

for context, response in ft.rollout_stream(requests):
    rewards = compute_rewards(context, response)
    ft.train_step([{
        "mode": "rl",
        "request": response["request"],
        "rollouts": response["rollouts"],
        "rewards": rewards,
    }])
| Parameter | Type | Default | Description |
|---|---|---|---|
| requests | iterable | required | Iterable of (context, kwargs_dict) tuples |
| max_concurrency | int | 4 | Maximum parallel requests |
| buffer_size | int | 8 | Maximum buffered results |

Each kwargs_dict is unpacked as **kwargs to ft.rollouts(). The context is passed through untouched so you can pair responses with ground-truth labels.

Results are in completion order, not submission order.
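The compute_rewards helper in the example above is user code, not part of the SDK. One possible sketch, assuming each example dict passed as context carries its ground-truth answer under a hypothetical "label" key:

def compute_rewards(context, response):
    # Hypothetical reward function: 1.0 for an exact case-insensitive match
    # against the ground-truth answer, 0.0 otherwise. It must return one
    # reward per rollout, in the same order as response["rollouts"].
    target = context["label"].strip().lower()
    return [
        1.0 if r["output"]["answer"].strip().lower() == target else 0.0
        for r in response["rollouts"]
    ]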

Training

ft.train_step()

Apply one training step.

step = ft.train_step(groups, lr=2e-4)
| Parameter | Type | Default | Description |
|---|---|---|---|
| groups | list | required | RL and/or SFT group dicts |
| lr | float | 2e-4 | Learning rate |

Returns:

{
    "step": 12,           # Training step number
    "applied": True,      # Whether weights were updated
    "kl": 0.031,          # KL divergence (RL)
    "router_kl": 0.004,   # Router KL divergence
    "grad_norm": 1.42,    # Gradient norm
    "reward_mean": 0.75,  # Mean reward (RL)
    "reward_std": 0.18,   # Reward std dev (RL)
    "sft_loss": None,     # SFT loss (SFT)
}

RL groups

Pass rollout responses back with rewards:

ft.train_step([{
    "mode": "rl",
    "request": response["request"],
    "rollouts": response["rollouts"],
    "rewards": [0.8, 0.3, 0.6, 0.5],
}])
  • Pass request and rollouts back unchanged from the rollouts response
  • rewards must match the length and order of rollouts

SFT groups

Provide the correct answer directly. The request is a skill request dict, not a rollouts response.

Query:

ft.train_step([{
    "mode": "sft",
    "request": {
        "skill": "query",
        "image": pil_image,
        "question": "What country is this?",
    },
    "target": {"answer": "United States"},
}])

With reasoning enabled:

ft.train_step([{
    "mode": "sft",
    "request": {
        "skill": "query",
        "image": pil_image,
        "question": "What country is this?",
        "reasoning": True,
    },
    "target": {
        "answer": "United States",
        "reasoning": {"text": "The road markings and signs match the US."},
    },
}])

Point:

ft.train_step([{
    "mode": "sft",
    "request": {
        "skill": "point",
        "image": pil_image,
        "object": "the red button",
    },
    "target": {"points": [{"x": 0.52, "y": 0.31}]},
}])

Point targets can also use bounding boxes:

"target": {"boxes": [{"x_min": 0.45, "y_min": 0.22, "x_max": 0.58, "y_max": 0.39}]}

Detect:

ft.train_step([{
    "mode": "sft",
    "request": {
        "skill": "detect",
        "image": pil_image,
        "object": "vehicles",
    },
    "target": {"boxes": [
        {"x_min": 0.10, "y_min": 0.20, "x_max": 0.40, "y_max": 0.60},
    ]},
}])

You can mix RL and SFT groups and different skills in the same train_step call. PIL images in SFT requests are encoded automatically.
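For instance, a single call might combine an RL group from a rollouts response with an SFT group for a known-hard example (a sketch; response, rewards, and pil_image come from the earlier examples):

step = ft.train_step([
    {
        "mode": "rl",
        "request": response["request"],
        "rollouts": response["rollouts"],
        "rewards": rewards,
    },
    {
        "mode": "sft",
        "request": {"skill": "point", "image": pil_image, "object": "the red button"},
        "target": {"points": [{"x": 0.52, "y": 0.31}]},
    },
], lr=2e-4)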

Metrics

ft.log_metrics()

Log custom metrics for a training step:

result = ft.log_metrics(
    step=step["step"],
    metrics={"eval/accuracy": 0.85, "eval/f1": 0.82},
)
# {"ok": True, "step": 12, "logged_count": 2}
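A common pattern is to run a deterministic evaluation pass after a training step and log the result. A sketch, assuming eval_data is a list of (image, question, answer) tuples you supply:

correct = 0
for image, question, answer in eval_data:
    r = ft.rollouts("query", image=image, question=question,
                    settings={"temperature": 0})  # deterministic for eval
    if r["rollouts"][0]["output"]["answer"].strip().lower() == answer.lower():
        correct += 1

ft.log_metrics(step=step["step"],
               metrics={"eval/accuracy": correct / len(eval_data)})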

Checkpoints

ft.save_checkpoint()

Save the current checkpoint. Only saved checkpoints can be used for inference.

result = ft.save_checkpoint()
checkpoint = result["checkpoint"]
# {"checkpoint_id": "01JXYZ...", "finetune_id": "01HXYZ...", "step": 100, ...}

ft.list_checkpoints()

result = ft.list_checkpoints(limit=50, cursor=None)
for cp in result["checkpoints"]:
    print(f"step={cp['step']} id={cp['checkpoint_id']}")
# result["has_more"], result["next_cursor"] for pagination
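To walk all checkpoints, follow next_cursor until has_more is false (a sketch using only the documented fields):

cursor = None
while True:
    page = ft.list_checkpoints(limit=50, cursor=cursor)
    for cp in page["checkpoints"]:
        print(f"step={cp['step']} id={cp['checkpoint_id']}")
    if not page["has_more"]:
        break
    cursor = page["next_cursor"]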

ft.delete_checkpoint()

ft.delete_checkpoint(step=100)

Deleting the latest checkpoint prevents resuming training.

Inference

ft.model()

Get the model ID for a saved checkpoint:

model_id = ft.model(step=100)
# "moondream3-preview/01HXYZ...@100"

Use this with the model parameter on any inference endpoint:

| Endpoint | Description |
|---|---|
| /v1/query | Question answering |
| /v1/caption | Image captioning |
| /v1/detect | Object detection |
| /v1/point | Point localization |
| /v1/batch | Batch processing |

Only saved checkpoints can be used for inference.
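As a sketch, calling /v1/query over HTTP with the checkpoint's model ID might look like the following. The X-Moondream-Auth header and base64 image_url field follow general Moondream API conventions, but are assumptions here; check the inference docs for the exact payload.

import base64, io, requests

buf = io.BytesIO()
pil_image.save(buf, format="JPEG")
image_b64 = base64.b64encode(buf.getvalue()).decode()

resp = requests.post(
    "https://api.moondream.ai/v1/query",
    headers={"X-Moondream-Auth": "your-api-key"},
    json={
        "model": ft.model(step=100),  # finetuned checkpoint from above
        "image_url": f"data:image/jpeg;base64,{image_b64}",
        "question": "What color is the car?",
    },
)
print(resp.json())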

Cleanup

ft.delete()

Delete the finetune and all its checkpoints:

ft.delete()

Settings

Rollout requests accept a settings dict:

| Field | Type | Default | Description |
|---|---|---|---|
| temperature | float | 1.0 | Randomness (0 = deterministic) |
| top_p | float | 1.0 | Nucleus sampling threshold |
| max_tokens | int | 128 (query/point), 256 (detect) | Maximum output length |
| max_objects | int | 50 | Maximum detected objects (detect only) |

All fields are optional.

Use high temperature (e.g. 1.0) during training for diverse rollouts. Use temperature=0 for evaluation.
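In practice that usually means keeping two settings dicts and switching between them:

train_settings = {"temperature": 1.0, "top_p": 1.0}  # diverse rollouts for RL
eval_settings = {"temperature": 0}                   # deterministic outputs for scoring

response = ft.rollouts("query", image=pil_image,
                       question="What color is the car?",
                       settings=eval_settings)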

Ground truth

For point and detect skills, provide ground truth to have the server compute rewards automatically.

Point

With coordinates:

ft.rollouts("point", image=img, object="the button",
ground_truth={"points": [{"x": 0.52, "y": 0.31}]})

With bounding boxes (reward based on whether the predicted point falls inside):

ft.rollouts("point", image=img, object="the button",
ground_truth={"boxes": [{"x_min": 0.1, "y_min": 0.2, "x_max": 0.4, "y_max": 0.6}]})

Detect

ft.rollouts("detect", image=img, object="vehicles",
ground_truth={"boxes": [
{"x_min": 0.1, "y_min": 0.2, "x_max": 0.4, "y_max": 0.6},
]})

All coordinates are normalized to 0–1. Ground truth is not supported for query — compute rewards yourself.