Finetuning
Finetuning lets you customize Moondream for your specific use case, training it to be better at exactly what you need. Training runs entirely in Moondream Cloud — no GPU required on your end.
Training modes
RL (reinforcement learning) — the model generates multiple attempts, you score them, and it learns to produce more of what scores well.
SFT (supervised finetuning) — you provide correct answers directly, and the model learns to reproduce them.
Start with RL in most cases. It's more sample-efficient, handles ambiguity naturally, and lets you encode exactly what "good" means for your use case.
Consider SFT when you have high-quality, unambiguous labels with a single correct answer per input, or when the model can't yet perform the task well enough for RL to work. In the latter case, use SFT to bootstrap, then switch to RL to refine.
You can combine both modes in the same training step.
| Aspect | RL | SFT |
|---|---|---|
| What you provide | Scoring function | Correct answers |
| Best when | Labels are fuzzy, expensive, or ambiguous | You have exact labels |
| Multiple valid outputs | Handles naturally | Requires a single correct answer |
| Sample efficiency | Higher | Lower |
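To make the contrast concrete, here is a minimal sketch of what you provide in each mode. The function body and the dict shape are illustrative assumptions, not the SDK's actual interface: for RL you supply a scoring function, for SFT you supply the correct answer itself.

```python
# RL: you provide a scoring function. It receives one rollout (an
# output attempt) and returns a reward; higher is better. The criteria
# below are invented for illustration.
def score_description(rollout: str) -> float:
    reward = 0.0
    if "waterproof" in rollout.lower():
        reward += 1.0  # mentions the attribute we care about
    if len(rollout.split()) <= 40:
        reward += 0.5  # prefer concise descriptions
    return reward

# SFT: you provide the correct answer directly, paired with its prompt.
# This dict shape is a placeholder, not the SDK's wire format.
sft_example = {
    "prompt": "What brand is this product?",
    "target": "Acme",
}
```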
When to use RL finetuning
RL finetuning shines when:
- You can recognize good output more easily than you can write it. You know a good product description when you see one, but writing perfect examples for every case is impractical.
- Multiple answers are acceptable. "The red car" and "a red vehicle on the left" are both correct. This also applies to spatial outputs like bounding boxes and points, where order doesn't matter (see the scoring sketch after this list).
- You want to encode preferences. Maybe you want the model to be more conservative, more detailed, or to prioritize certain types of objects.
- You have limited training data. RL is more sample-efficient: it extracts more signal from each example by generating and scoring multiple attempts.
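For spatial skills, the scoring function can make that order insensitivity explicit. Below is a minimal sketch, assuming boxes arrive as (x_min, y_min, x_max, y_max) tuples (an assumption, not a format the API guarantees): it greedily matches each predicted box to the closest unmatched reference box by IoU and returns an F1-style reward, so shuffling the output order never changes the score.

```python
def iou(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x1 - x0) * max(0.0, y1 - y0)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def detection_reward(predicted, reference, iou_threshold=0.5):
    """F1-style reward in [0, 1], insensitive to box order."""
    if not predicted or not reference:
        return 1.0 if not predicted and not reference else 0.0
    unmatched = list(reference)
    hits = 0
    for box in predicted:
        best = max(unmatched, key=lambda ref: iou(box, ref), default=None)
        if best is not None and iou(box, best) >= iou_threshold:
            hits += 1
            unmatched.remove(best)
    precision, recall = hits / len(predicted), hits / len(reference)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Weighting precision against recall differently in the final line is also how you would tune the tradeoff mentioned in the table below.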
| Task | Why RL works well |
|---|---|
| Medical image questions | Multiple valid phrasings; domain expert scores accuracy |
| Manufacturing defect pointing | "Correct" depends on context — ignore minor scratches, focus on cracks |
| Security footage detection | Tune precision/recall tradeoff for your needs |
| Product descriptions | Many good descriptions exist; easier to rate than write |
The model needs a starting point
RL works best when the model can already somewhat perform the task — it reinforces existing good behaviors. If the model can't do the task at all, start with SFT to teach the basics, then switch to RL.
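As a rough orchestration sketch of that bootstrap-then-refine pattern (with sft_step and rl_step as hypothetical stand-ins for the SDK's training calls, injected here as callables so the sketch stays self-contained):

```python
# Hypothetical two-phase schedule: SFT until the model can do the task
# at all, then RL to refine. Both step functions are placeholders.
def bootstrap_then_refine(examples, sft_step, rl_step,
                          sft_steps=200, rl_steps=800):
    # Phase 1 (SFT): learn to reproduce known-good targets.
    for i in range(sft_steps):
        ex = examples[i % len(examples)]
        sft_step(ex["image"], ex["prompt"], ex["target"])
    # Phase 2 (RL): generate attempts and reinforce what scores well.
    for i in range(rl_steps):
        ex = examples[i % len(examples)]
        rl_step(ex["image"], ex["prompt"])
```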
Supported skills
| Skill | What it does | Output | Example |
|---|---|---|---|
| query | Answer questions about images | Text | "What brand is this product?" |
| point | Locate objects in an image | x, y coordinates | "Point to the defects" |
| detect | Locate objects in an image | Bounding boxes | "Detect all vehicles" |
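For orientation, this is how the three skills look as plain inference calls with the public moondream Python client (pip install moondream). The method names and return shapes are taken from that client and may differ from the finetuning SDK covered in the Next steps links.

```python
import moondream as md
from PIL import Image

model = md.vl(api_key="your-api-key")  # placeholder key
image = Image.open("product.jpg")      # any local image

# query → text
answer = model.query(image, "What brand is this product?")["answer"]

# point → x, y coordinates
points = model.point(image, "defects")["points"]

# detect → bounding boxes
boxes = model.detect(image, "vehicles")["objects"]
```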
Key concepts
| Term | Meaning |
|---|---|
| Rollout | One output attempt from the model. |
| Reward | A score you assign to a rollout (higher = better). |
| Finetune | Finetuned weights layered on top of the base model via LoRA. Identified by a unique finetune_id. |
| Checkpoint | A saved model state at a specific training step. Only saved checkpoints can be used for inference. |
| Train step | One update to the model based on scored rollouts or supervised targets. |
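These terms fit together in one short orchestration loop. A minimal sketch, with generate_rollouts, apply_train_step, and save_checkpoint as hypothetical placeholders for the real SDK surface:

```python
def run_rl_training(dataset, score, generate_rollouts, apply_train_step,
                    save_checkpoint, steps=500, n_rollouts=8,
                    checkpoint_every=100):
    for step in range(steps):
        example = dataset[step % len(dataset)]
        # Rollouts: several output attempts for the same input.
        rollouts = generate_rollouts(example, n=n_rollouts)
        # Rewards: your scores for each rollout; higher is better.
        rewards = [score(r, example) for r in rollouts]
        # Train step: one model update from the scored rollouts.
        apply_train_step(rollouts, rewards)
        # Checkpoints: only saved states can be served for inference.
        if (step + 1) % checkpoint_every == 0:
            save_checkpoint(step=step + 1)
```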
Architecture
| Your code | Moondream Cloud |
|---|---|
| Provides training data | Generates model outputs |
| Scores outputs (RL) or provides targets (SFT) | Applies training updates |
| Orchestrates training loop | Manages checkpoints |
Your orchestration code runs anywhere — no GPU required.
Next steps
- Quickstart — Train a model end-to-end in Python
- Python SDK — Full SDK reference
- HTTP API reference — Wire format for all endpoints