Python SDK
pip install moondream
Creating a finetune
New finetune
import moondream as md
ft = md.ft(
    api_key="your-api-key",
    name="my-finetune",
    rank=8,
)
| Parameter | Type | Required | Description |
|---|---|---|---|
| api_key | str | yes | Moondream API key |
| name | str | yes* | Unique name (alphanumeric, dots, hyphens, underscores) |
| rank | int | yes* | LoRA rank: 8, 16, 24, or 32 |
| endpoint | str | no | API endpoint (default: https://api.moondream.ai/v1/tuning) |
*Required when creating a new finetune.
If a finetune with the same name and rank already exists, the existing one is returned. If the name exists with a different rank, the server returns 409 Conflict.
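Because creation is get-or-create on the (name, rank) pair, constructing the handle twice is safe. A minimal sketch (the exact exception raised on a rank mismatch is SDK-specific and not shown here):

ft_again = md.ft(
    api_key="your-api-key",
    name="my-finetune",
    rank=8,
)
assert ft_again.finetune_id == ft.finetune_id  # same finetune returned, not a new one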
Existing finetune
ft = md.ft(
    api_key="your-api-key",
    finetune_id="01HXYZ...",
)
finetune_id cannot be combined with name or rank.
Properties
ft.finetune_id # "01HXYZ..."
ft.name # "my-finetune"
ft.rank # 8
Generating rollouts
ft.rollouts()
Generate rollouts for a single request.
response = ft.rollouts(
    "query",
    image=pil_image,
    question="What color is the car?",
    num_rollouts=4,
    settings={"temperature": 1.0, "max_tokens": 128},
)
| Parameter | Type | Required | Description |
|---|---|---|---|
| skill | str | yes | "query", "point", or "detect" |
| image | PIL.Image or EncodedImage | depends | Required for point/detect, optional for query |
| question | str | depends | Required for query |
| object | str | depends | Required for point/detect |
| num_rollouts | int | no | Number of attempts, 1–16 (default: 1) |
| settings | dict | no | Sampling settings (see Settings) |
| reasoning | bool | no | Enable reasoning (default: false, query only) |
| spatial_refs | list | no | Spatial references as [x, y] or [x_min, y_min, x_max, y_max] (query only) |
| ground_truth | dict | no | For automatic reward computation (point/detect only, see Ground truth) |
Returns a dict:
{
    "request": { ... },    # Pass back to train_step unchanged
    "rollouts": [
        {
            "skill": "query",
            "output": {"answer": "red"},
            ...                # Opaque training metadata
        }
    ],
    "rewards": [0.8, ...]  # Present if ground_truth was provided, otherwise None
}
Rollout output by skill:
| Skill | Output |
|---|---|
| query | {"answer": "...", "reasoning": {...}} (reasoning only if enabled) |
| point | {"points": [{"x": 0.52, "y": 0.31}, ...]} |
| detect | {"objects": [{"x_min": 0.1, "y_min": 0.2, "x_max": 0.4, "y_max": 0.6}, ...]} |
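A common pattern is to score these outputs yourself. A minimal sketch for query, assuming an exact-match label that you supply (any reward scheme in [0, 1] works):

expected = "red"  # illustrative label; use your own ground-truth answer
rewards = [
    1.0 if r["output"]["answer"].strip().lower() == expected else 0.0
    for r in response["rollouts"]
]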
ft.rollout_stream()
Generate rollouts concurrently in background threads, yielding results as they complete. This overlaps rollout generation with training — while you process one result, the next batch is already in flight.
requests = (
    (example, {
        "skill": "query",
        "image": example["image"],
        "question": "What is this?",
        "num_rollouts": 4,
        "settings": {"temperature": 1.0},
    })
    for example in training_data
)
for context, response in ft.rollout_stream(requests):
    rewards = compute_rewards(context, response)
    ft.train_step([{
        "mode": "rl",
        "request": response["request"],
        "rollouts": response["rollouts"],
        "rewards": rewards,
    }])
| Parameter | Type | Default | Description |
|---|---|---|---|
| requests | iterable | required | Iterable of (context, kwargs_dict) tuples |
| max_concurrency | int | 4 | Maximum parallel requests |
| buffer_size | int | 8 | Maximum buffered results |
Each kwargs_dict is unpacked as **kwargs to ft.rollouts(). The context is passed through untouched so you can pair responses with ground-truth labels.
Results are in completion order, not submission order.
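The compute_rewards helper above is user code, not part of the SDK. A minimal sketch for query, assuming each training example carries an answer label:

def compute_rewards(context, response):
    # Exact-match reward per rollout. The "answer" key on the context is an
    # assumption about your dataset, not an SDK requirement.
    expected = context["answer"].strip().lower()
    return [
        1.0 if r["output"]["answer"].strip().lower() == expected else 0.0
        for r in response["rollouts"]
    ]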
Training
ft.train_step()
Apply one training step.
step = ft.train_step(groups, lr=2e-4)
| Parameter | Type | Default | Description |
|---|---|---|---|
| groups | list | required | RL and/or SFT group dicts |
| lr | float | 2e-4 | Learning rate |
Returns:
{
    "step": 12,           # Training step number
    "applied": True,      # Whether weights were updated
    "kl": 0.031,          # KL divergence (RL)
    "router_kl": 0.004,   # Router KL divergence
    "grad_norm": 1.42,    # Gradient norm
    "reward_mean": 0.75,  # Mean reward (RL)
    "reward_std": 0.18,   # Reward std dev (RL)
    "sft_loss": None,     # SFT loss (SFT)
}
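These fields make a convenient progress line during training (RL fields shown; they may be None for SFT-only steps):

step = ft.train_step(groups, lr=2e-4)
print(f"step {step['step']} applied={step['applied']} "
      f"kl={step['kl']} reward_mean={step['reward_mean']}")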
RL groups
Pass rollout responses back with rewards:
ft.train_step([{
    "mode": "rl",
    "request": response["request"],
    "rollouts": response["rollouts"],
    "rewards": [0.8, 0.3, 0.6, 0.5],
}])
- Pass request and rollouts back unchanged from the rollouts response.
- rewards must match the length and order of rollouts.
SFT groups
Provide the correct answer directly. The request is a skill request dict, not a rollouts response.
Query:
ft.train_step([{
    "mode": "sft",
    "request": {
        "skill": "query",
        "image": pil_image,
        "question": "What country is this?",
    },
    "target": {"answer": "United States"},
}])
With reasoning enabled:
ft.train_step([{
    "mode": "sft",
    "request": {
        "skill": "query",
        "image": pil_image,
        "question": "What country is this?",
        "reasoning": True,
    },
    "target": {
        "answer": "United States",
        "reasoning": {"text": "The road markings and signs match the US."},
    },
}])
Point:
ft.train_step([{
    "mode": "sft",
    "request": {
        "skill": "point",
        "image": pil_image,
        "object": "the red button",
    },
    "target": {"points": [{"x": 0.52, "y": 0.31}]},
}])
Point targets can also use bounding boxes:
"target": {"boxes": [{"x_min": 0.45, "y_min": 0.22, "x_max": 0.58, "y_max": 0.39}]}
Detect:
ft.train_step([{
    "mode": "sft",
    "request": {
        "skill": "detect",
        "image": pil_image,
        "object": "vehicles",
    },
    "target": {"boxes": [
        {"x_min": 0.10, "y_min": 0.20, "x_max": 0.40, "y_max": 0.60},
    ]},
}])
You can mix RL and SFT groups and different skills in the same train_step call. PIL images in SFT requests are encoded automatically.
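For example, one call combining an RL group from a rollouts response with an SFT point group (a sketch; any documented group shapes can be mixed):

ft.train_step([
    {
        "mode": "rl",
        "request": response["request"],
        "rollouts": response["rollouts"],
        "rewards": response["rewards"],
    },
    {
        "mode": "sft",
        "request": {"skill": "point", "image": pil_image, "object": "the red button"},
        "target": {"points": [{"x": 0.52, "y": 0.31}]},
    },
], lr=2e-4)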
Metrics
ft.log_metrics()
Log custom metrics for a training step:
result = ft.log_metrics(
    step=step["step"],
    metrics={"eval/accuracy": 0.85, "eval/f1": 0.82},
)
# {"ok": True, "step": 12, "logged_count": 2}
Checkpoints
ft.save_checkpoint()
Save the current weights as a checkpoint. Only saved checkpoints can be used for inference.
result = ft.save_checkpoint()
checkpoint = result["checkpoint"]
# {"checkpoint_id": "01JXYZ...", "finetune_id": "01HXYZ...", "step": 100, ...}
ft.list_checkpoints()
result = ft.list_checkpoints(limit=50, cursor=None)
for cp in result["checkpoints"]:
    print(f"step={cp['step']} id={cp['checkpoint_id']}")
# result["has_more"], result["next_cursor"] for pagination
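To walk every page, a minimal cursor loop using only the documented fields:

cursor = None
while True:
    result = ft.list_checkpoints(limit=50, cursor=cursor)
    for cp in result["checkpoints"]:
        print(f"step={cp['step']} id={cp['checkpoint_id']}")
    if not result["has_more"]:
        break
    cursor = result["next_cursor"]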
ft.delete_checkpoint()
ft.delete_checkpoint(step=100)
Deleting the latest checkpoint prevents resuming training.
Inference
ft.model()
Get the model ID for a saved checkpoint:
model_id = ft.model(step=100)
# "moondream3-preview/01HXYZ...@100"
Use this with the model parameter on any inference endpoint:
| Endpoint | Description |
|---|---|
| /v1/query | Question answering |
| /v1/caption | Image captioning |
| /v1/detect | Object detection |
| /v1/point | Point localization |
| /v1/batch | Batch processing |
Only saved checkpoints can be used for inference.
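As an illustration, a raw HTTP call to /v1/query with the checkpoint-backed model. This is a sketch assuming the standard Moondream request shape (an X-Moondream-Auth header and an image_url carrying a base64 data URL); adapt to your client:

import requests

model_id = ft.model(step=100)
resp = requests.post(
    "https://api.moondream.ai/v1/query",
    headers={"X-Moondream-Auth": "your-api-key"},
    json={
        "model": model_id,                          # checkpoint-backed model from ft.model()
        "image_url": "data:image/jpeg;base64,...",  # placeholder data URL
        "question": "What color is the car?",
    },
)
print(resp.json()["answer"])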
Cleanup
ft.delete()
Delete the finetune and all its checkpoints:
ft.delete()
Settings
Rollout requests accept a settings dict:
| Field | Type | Default | Description |
|---|---|---|---|
| temperature | float | 1.0 | Randomness (0 = deterministic) |
| top_p | float | 1.0 | Nucleus sampling threshold |
| max_tokens | int | 128 (query/point), 256 (detect) | Maximum output length |
| max_objects | int | 50 | Maximum detected objects (detect only) |
All fields are optional.
Use high temperature (e.g. 1.0) during training for diverse rollouts. Use temperature=0 for evaluation.
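For example (an illustrative split; all fields are optional):

train_settings = {"temperature": 1.0, "top_p": 1.0, "max_tokens": 128}
eval_settings = {"temperature": 0}  # deterministic outputs for evaluation

response = ft.rollouts("query", image=pil_image, question="What is this?",
                       num_rollouts=4, settings=train_settings)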
Ground truth
For point and detect skills, provide ground truth to have the server compute rewards automatically.
Point
With coordinates:
ft.rollouts("point", image=img, object="the button",
ground_truth={"points": [{"x": 0.52, "y": 0.31}]})
With bounding boxes (reward based on whether the predicted point falls inside):
ft.rollouts("point", image=img, object="the button",
ground_truth={"boxes": [{"x_min": 0.1, "y_min": 0.2, "x_max": 0.4, "y_max": 0.6}]})
Detect
ft.rollouts("detect", image=img, object="vehicles",
ground_truth={"boxes": [
{"x_min": 0.1, "y_min": 0.2, "x_max": 0.4, "y_max": 0.6},
]})
All coordinates are normalized to 0–1. Ground truth is not supported for query — compute rewards yourself.
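Putting this together, server-computed rewards can feed train_step directly (all shapes as documented above):

response = ft.rollouts(
    "point",
    image=img,
    object="the button",
    num_rollouts=4,
    ground_truth={"points": [{"x": 0.52, "y": 0.31}]},
)
ft.train_step([{
    "mode": "rl",
    "request": response["request"],
    "rollouts": response["rollouts"],
    "rewards": response["rewards"],  # computed server-side from ground_truth
}])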