LLM Training and Evaluation

Training and evaluation are separate concerns. Before fine-tuning is considered, most agent problems should first be attacked with better prompts, retrieval, tools, and evals.

Evaluation Building Blocks

  • Task-specific examples with expected outcomes.
  • Retrieval test cases for recall and relevance.
  • Prompt regression suites (a minimal sketch follows this list).
  • Human review for edge cases and subjective quality.
  • Automated metrics for latency, cost, groundedness, and format compliance.
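
A prompt regression suite can be as small as a list of cases checked on every change. The sketch below is illustrative, not a prescribed harness: `call_model` is a stand-in for whatever model client your stack uses, and substring matching stands in for whatever expected-outcome assertion fits your tasks.

```python
# A minimal prompt regression sketch. `call_model` is a placeholder,
# not a real API.
import time
from dataclasses import dataclass

@dataclass
class EvalCase:
    name: str
    prompt: str
    expected_substrings: list[str]   # task-specific expected outcome
    max_latency_s: float = 5.0       # automated latency budget

def call_model(prompt: str) -> str:
    # Placeholder: swap in your actual model client here.
    return "Paris is the capital of France."

def run_suite(cases: list[EvalCase]) -> list[dict]:
    results = []
    for case in cases:
        start = time.monotonic()
        output = call_model(case.prompt)
        latency = time.monotonic() - start
        passed = all(s.lower() in output.lower() for s in case.expected_substrings)
        results.append({
            "case": case.name,
            "passed": passed and latency <= case.max_latency_s,
            "latency_s": round(latency, 3),
        })
    return results

suite = [EvalCase("capital-fr", "What is the capital of France?", ["Paris"])]
for result in run_suite(suite):
    print(result)  # e.g. {'case': 'capital-fr', 'passed': True, 'latency_s': 0.0}
```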

Dataset Preparation

Datasets should be cleaned, permissioned, and versioned. Keep source provenance for every example, and avoid mixing internal or private examples into public-facing evaluation sets.
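
One way to make provenance and permissioning concrete is a record schema with an explicit visibility field and a content hash as the version identifier. The field names below are illustrative assumptions, not a standard schema:

```python
# Sketch of a versioned, permissioned dataset record.
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class DatasetRecord:
    input: str
    expected: str
    source: str       # provenance: where the example came from
    visibility: str   # "internal" or "public" — keeps sets unmixed

def record_id(record: DatasetRecord) -> str:
    """Content hash doubles as a stable version identifier."""
    payload = json.dumps(asdict(record), sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()[:12]

def split_by_visibility(records: list[DatasetRecord]) -> dict[str, list[DatasetRecord]]:
    """Prevent internal examples from leaking into public-facing eval sets."""
    buckets: dict[str, list[DatasetRecord]] = {"internal": [], "public": []}
    for r in records:
        buckets[r.visibility].append(r)
    return buckets
```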

Fine-Tuning vs RAG

Fine-tuning is useful when the model needs a repeatable behavior, style, or domain-specific transformation. RAG is usually better when the system needs current, source-grounded knowledge.
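
The structural difference shows up in the call pattern: with RAG, the knowledge travels in the prompt at request time, while fine-tuning bakes the behavior into the weights. A minimal sketch, assuming placeholder `call_model` and `retrieve` functions and a hypothetical tuned-model identifier:

```python
# Sketch contrasting the two call patterns. `call_model`, `retrieve`, and
# the "my-tuned-model" identifier are placeholders, not real APIs.

def call_model(prompt: str, model: str = "base-model") -> str:
    # Placeholder: swap in your actual model client here.
    return f"[{model}] answer to: {prompt[:40]}"

def retrieve(query: str) -> list[str]:
    # Placeholder retriever; in practice this queries your vector store.
    return ["Doc: The 2024 policy caps refunds at 30 days."]

def answer_with_rag(query: str) -> str:
    # Current, source-grounded knowledge travels in the prompt at call time.
    context = "\n".join(retrieve(query))
    return call_model(f"Answer using only these sources:\n{context}\n\nQ: {query}")

def answer_with_fine_tune(query: str) -> str:
    # Repeatable behavior is baked into tuned weights; the prompt stays bare.
    return call_model(query, model="my-tuned-model")
```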

Continuous Improvement

Production evaluation should include real failure cases, user feedback, and trace analysis. The best eval set evolves with observed usage.
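
One low-effort way to make the eval set evolve with usage is to fold flagged production traces back in as regression cases. A sketch, assuming hypothetical trace fields (`user_feedback`, `error`, `input`, `output`) rather than any particular tool's export format:

```python
# Sketch: folding observed production failures back into the eval set.
# The trace field names are assumptions, not a real tracing schema.

def failures_to_eval_cases(traces: list[dict]) -> list[dict]:
    """Turn flagged traces (negative feedback, errors) into regression cases."""
    cases = []
    for t in traces:
        if t.get("user_feedback") == "negative" or t.get("error"):
            cases.append({
                "prompt": t["input"],
                "bad_output": t["output"],     # behavior the model should stop exhibiting
                "source": f"trace:{t['id']}",  # provenance back to the original trace
            })
    return cases

# Example: one flagged trace becomes one new regression case.
traces = [{"id": "t-123", "input": "Summarize Q3", "output": "wrong summary",
           "user_feedback": "negative"}]
print(failures_to_eval_cases(traces))
```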

Related pages: Langfuse Observability and RAG and Knowledge Pipelines.