Is Your LLM Task Worth Fine-Tuning?

Score one workflow and decide whether to test open-weight fine-tuning, use prompting, use RAG (retrieval-augmented generation), or narrow the task first.

Score your task

Fine-tuning is not mainly about adding knowledge. It is about teaching repeated behavior. Use this rubric to decide whether your task is ready for a small test.

Checklist

Readiness

Use this before building a dataset or buying GPU time.

Exact input: I can define what the model receives.
Exact output: I can define what the model should produce.
Judgment: A human can quickly judge whether the output is correct.
Examples: I have, or can create, around 200 good examples.
Value: A smaller model would be useful if it worked well enough.
Pain: Current API cost, latency, privacy, or dependency is painful.
Alternatives: I have considered whether prompting or RAG is simpler.
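If you want to keep the checklist next to your project notes, the sketch below shows one way to tally it. The key names and the `readiness_summary` helper are hypothetical, made up for illustration; the page does not publish how it turns checked items into a readiness note, so this only counts items and lists what is still missing.

```python
# Hypothetical short keys for the seven checklist items above.
CHECKLIST = {
    "exact_input": "I can define what the model receives.",
    "exact_output": "I can define what the model should produce.",
    "judgment": "A human can quickly judge whether the output is correct.",
    "examples": "I have, or can create, around 200 good examples.",
    "value": "A smaller model would be useful if it worked well enough.",
    "pain": "Current API cost, latency, privacy, or dependency is painful.",
    "alternatives": "I have considered whether prompting or RAG is simpler.",
}


def readiness_summary(checked):
    """Return (number of items checked, list of unchecked keys in checklist order)."""
    checked = set(checked)
    unknown = checked - set(CHECKLIST)
    if unknown:
        raise ValueError(f"Unknown checklist items: {sorted(unknown)}")
    missing = [key for key in CHECKLIST if key not in checked]
    return len(CHECKLIST) - len(missing), missing


if __name__ == "__main__":
    count, missing = readiness_summary({"exact_input", "exact_output", "judgment"})
    print(f"{count} of {len(CHECKLIST)} items checked")
    for key in missing:
        print(f"Still open: {CHECKLIST[key]}")
```

The list of unchecked items is usually more useful than the count, because each open item names the work to do before a test run.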

Next move

Start with the scorecard

The checklist helps confirm the decision once you have a score.

Rule of thumb: fix unclear outputs before tuning, fix weak examples before hyperparameters, and use RAG when the task mainly needs fresh or private knowledge.
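The rule of thumb reads as an ordered set of checks, which can be sketched as a small function. This is an illustration, not the scorecard's actual logic: the function name, the three boolean inputs, and the choice to test the RAG condition first are all assumptions.

```python
def next_fix(outputs_clear, examples_strong, needs_fresh_or_private_knowledge):
    """Suggest what to work on next, following the rule of thumb.

    Assumption: the knowledge check runs first, because a task that mainly
    needs fresh or private knowledge is pointed at RAG rather than tuning.
    """
    if needs_fresh_or_private_knowledge:
        return "use RAG"
    if not outputs_clear:
        # Unclear outputs come before any tuning work.
        return "fix unclear outputs"
    if not examples_strong:
        # Weak examples come before hyperparameters.
        return "fix weak examples"
    return "tune hyperparameters"


if __name__ == "__main__":
    print(next_fix(outputs_clear=False, examples_strong=False,
                   needs_fresh_or_private_knowledge=False))
```

Each earlier check blocks the later ones, which mirrors the "before" wording in the rule: there is little point adjusting hyperparameters while the target output is still undefined.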

Want to run the experiment?

This scorecard helps you decide whether fine-tuning is worth testing. The full workshop takes the next step: dataset, QLoRA training, before/after evaluation, and next-step diagnosis.

Attend the workshop