
Finetuning

Finetuning lets you customize Moondream for your specific use case, training it to be better at exactly what you need. Training runs entirely in Moondream Cloud — no GPU required on your end.

Training modes

RL (reinforcement learning) — the model generates multiple attempts, you score them, and it learns to produce more of what scores well.

SFT (supervised finetuning) — you provide correct answers directly, and the model learns to reproduce them.

Start with RL in most cases. It's more sample-efficient, handles ambiguity naturally, and lets you encode exactly what "good" means for your use case.

Consider SFT when you have high-quality, unambiguous labels with a single correct answer per input, or when the model can't yet perform the task well enough for RL to work: use SFT to bootstrap the basics, then switch to RL to refine.

You can combine both modes in the same training step.
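
For example, a single step might carry both scored rollouts and labeled targets. The sketch below is illustrative only: the client object (`ft`), the `train_step` signature, and the payload shapes are assumptions, not the documented API.

```python
# Illustrative sketch -- `ft` and `train_step` are hypothetical names,
# not the documented Moondream Cloud API.

# RL half: rollouts the model already generated, with rewards you assigned.
scored_rollouts = [
    {"rollout_id": "r-001", "reward": 1.0},  # strong attempt
    {"rollout_id": "r-002", "reward": 0.2},  # weak attempt
]

# SFT half: inputs paired with known-correct answers.
targets = [
    {"image": "part_017.jpg",
     "prompt": "Point to the cracks",
     "answer": [{"x": 0.41, "y": 0.63}]},
]

# One update consumes both kinds of signal at once.
ft.train_step(rl=scored_rollouts, sft=targets)
```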

| Aspect | RL | SFT |
| --- | --- | --- |
| What you provide | Scoring function | Correct answers |
| Best when | Labels are fuzzy, expensive, or ambiguous | You have exact labels |
| Multiple valid outputs | Handles naturally | Requires a single correct answer |
| Sample efficiency | Higher | Lower |

When to use RL finetuning

RL finetuning shines when:

  • You can recognize good output more easily than you can write it. You know a good product description when you see one, but writing perfect examples for every case is impractical.

  • Multiple answers are acceptable. "The red car" and "a red vehicle on the left" are both correct. This also applies to spatial outputs like bounding boxes and points, where order doesn't matter.

  • You want to encode preferences. Maybe you want the model to be more conservative, more detailed, or to prioritize certain types of objects (the reward sketch after this list shows one way to encode a preference).

  • You have limited training data. RL is more sample-efficient — it extracts more signal from each example by generating and scoring multiple attempts.
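
As a concrete example, here is a self-contained reward for a pointing task. It matches predictions to expected points regardless of order, and its weights encode a preference: false positives cost more than misses, nudging the model toward conservative pointing. This is a sketch; the tolerance and weights are arbitrary, not recommended values.

```python
import math

def point_reward(predicted, expected, tol=0.05,
                 fp_penalty=0.5, miss_penalty=0.25):
    """Score predicted points against expected points (normalized coords).

    Order-insensitive: each expected point claims the nearest unused
    prediction within `tol`. Leftover predictions are false positives;
    unclaimed expected points are misses. fp_penalty > miss_penalty
    encodes a preference for precision over recall.
    """
    unused = list(predicted)
    hits = 0
    for ex in expected:
        best = min(unused, default=None,
                   key=lambda p: math.dist((p["x"], p["y"]), (ex["x"], ex["y"])))
        if best is not None and math.dist(
                (best["x"], best["y"]), (ex["x"], ex["y"])) <= tol:
            hits += 1
            unused.remove(best)
    score = hits - fp_penalty * len(unused) - miss_penalty * (len(expected) - hits)
    return score / max(len(expected), 1)

# One hit plus one spurious point: 1 - 0.5*1 = 0.5
print(point_reward(
    predicted=[{"x": 0.40, "y": 0.62}, {"x": 0.90, "y": 0.10}],
    expected=[{"x": 0.41, "y": 0.63}],
))
```

A rollout that points at every blemish scores worse than one that points only at the real defects, so training reinforces the behavior you actually want.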

| Task | Why RL works well |
| --- | --- |
| Medical image questions | Multiple valid phrasings; a domain expert scores accuracy |
| Manufacturing defect pointing | "Correct" depends on context: ignore minor scratches, focus on cracks |
| Security footage detection | Tune the precision/recall tradeoff for your needs |
| Product descriptions | Many good descriptions exist; easier to rate than write |

The model needs a starting point

RL works best when the model can already somewhat perform the task — it reinforces existing good behaviors. If the model can't do the task at all, start with SFT to teach the basics, then switch to RL.
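
A sketch of that two-phase schedule, under the same assumptions as the earlier snippet (the `ft` client and its method names are hypothetical, and the phase lengths are arbitrary):

```python
# Hypothetical sketch -- ft.train_step and ft.rollout are assumed names,
# not the documented API.

# Phase 1: SFT to teach the basics from labeled examples.
for batch in labeled_batches:
    ft.train_step(sft=batch)

# Phase 2: RL to refine, now that rollouts are worth scoring.
for prompts in rl_batches:
    rollouts = ft.rollout(prompts)  # model generates attempts
    # score() is your scoring function, e.g. point_reward above
    scored = [{"rollout_id": r["id"], "reward": score(r)} for r in rollouts]
    ft.train_step(rl=scored)
```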

Supported skills

| Skill | What it does | Output | Example |
| --- | --- | --- | --- |
| query | Answer questions about images | Text | "What brand is this product?" |
| point | Locate objects in an image | x, y coordinates | "Point to the defects" |
| detect | Locate objects in an image | Bounding boxes | "Detect all vehicles" |
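
At inference time, the three skills look roughly like this with the moondream Python client. A sketch: the return shapes in the comments are assumptions to check against the client docs.

```python
import moondream as md
from PIL import Image

model = md.vl(api_key="your-api-key")  # cloud inference client
image = Image.open("shelf.jpg")

# query: free-text answer
model.query(image, "What brand is this product?")
# -> {"answer": "..."}

# point: x, y coordinates (normalized to 0-1)
model.point(image, "defects")
# -> {"points": [{"x": 0.41, "y": 0.63}, ...]}

# detect: bounding boxes (normalized to 0-1)
model.detect(image, "vehicles")
# -> {"objects": [{"x_min": ..., "y_min": ..., "x_max": ..., "y_max": ...}, ...]}
```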

Key concepts

| Term | Meaning |
| --- | --- |
| Rollout | One output attempt from the model |
| Reward | A score you assign to a rollout (higher = better) |
| Finetune | Finetuned weights layered on top of the base model via LoRA, identified by a unique `finetune_id` |
| Checkpoint | A saved model state at a specific training step; only saved checkpoints can be used for inference |
| Train step | One update to the model based on scored rollouts or supervised targets |
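
Because only saved checkpoints can serve inference, a typical flow snapshots the finetune and then addresses that checkpoint by its finetune_id. The calls below are assumed names, not the documented API.

```python
# Hypothetical sketch -- save_checkpoint and the finetune_id parameter
# are assumed names, not the documented API.
checkpoint = ft.save_checkpoint()  # saved state at the current train step

# Later: run inference against the saved checkpoint rather than the
# live training state.
answer = model.query(image, "What brand is this product?",
                     finetune_id=ft.finetune_id)
```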

Architecture

| Your code | Moondream Cloud |
| --- | --- |
| Provides training data | Generates model outputs |
| Scores outputs (RL) or provides targets (SFT) | Applies training updates |
| Orchestrates the training loop | Manages checkpoints |

Your orchestration code runs anywhere — no GPU required.
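
Put together, the loop your code runs might look like the sketch below. Everything GPU-bound (rollout generation, weight updates, checkpoint storage) happens in Moondream Cloud; your side only supplies data and scores. As above, the client calls are assumed names, not the documented API.

```python
# Hypothetical sketch of the orchestration loop.
for step, prompts in enumerate(rl_batches):
    rollouts = ft.rollout(prompts)  # cloud generates outputs
    scored = [{"rollout_id": r["id"], "reward": score(r)}
              for r in rollouts]    # your code scores them
    ft.train_step(rl=scored)        # cloud applies the update

    if step % 50 == 0:
        ft.save_checkpoint()        # cloud manages checkpoints
```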

Next steps