LLM Training and Evaluation
Training and evaluation are separate concerns. Most agent problems are better solved first with improved prompts, retrieval, tools, and evals; fine-tuning should be considered only after those levers are exhausted.
Evaluation Building Blocks
- Task-specific examples with expected outcomes.
- Retrieval test cases for recall and relevance.
- Prompt regression suites (a minimal sketch follows this list).
- Human review for edge cases and subjective quality.
- Automated metrics for latency, cost, groundedness, and format compliance.
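A prompt regression suite can be as small as a list of cases replayed against the model on every change. The sketch below assumes nothing about your stack: `call_model` is a placeholder for whatever client you use, and the checks (substring presence, a length bound as a crude format-compliance proxy) are deliberately minimal.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Case:
    name: str
    prompt: str
    must_contain: list[str]   # substrings the answer is expected to include
    max_chars: int = 2000     # crude format-compliance bound

def run_suite(cases: list[Case], call_model: Callable[[str], str]) -> list[str]:
    """Replay each case against the model and return human-readable failures."""
    failures = []
    for case in cases:
        answer = call_model(case.prompt)
        for needle in case.must_contain:
            if needle.lower() not in answer.lower():
                failures.append(f"{case.name}: missing {needle!r}")
        if len(answer) > case.max_chars:
            failures.append(f"{case.name}: answer exceeds {case.max_chars} chars")
    return failures

# Usage with a stub; swap the lambda for a real client call.
cases = [Case("refund-policy", "What is our refund window?", ["30 days"])]
print(run_suite(cases, call_model=lambda p: "Refunds are accepted within 30 days."))
```

An empty returned list means the suite passed; anything else is a regression to triage before shipping the prompt change.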
Dataset Preparation
Datasets should be cleaned, permissioned, and versioned. Record source provenance for every example, and avoid mixing internal or private examples into public-facing evaluation sets.
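One way to make provenance and permissioning enforceable is to carry them as fields on every record. The schema below is illustrative, not a standard; the field names are assumptions.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class Record:
    id: str
    text: str
    source: str           # provenance, e.g. "support-tickets-2024-q3" (illustrative)
    visibility: str       # "internal" or "public"
    dataset_version: str  # bump whenever the underlying data changes

def write_public_eval_set(records: list[Record], path: str) -> None:
    """Write only public-visibility records as JSONL, provenance included."""
    with open(path, "w") as f:
        for r in records:
            if r.visibility == "public":
                f.write(json.dumps(asdict(r)) + "\n")
```

With visibility stored per record, the public eval set is produced by a filter rather than by hand, which makes accidental leakage of internal examples much harder.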
Fine-Tuning vs RAG
Fine-tuning is useful when the model needs a repeatable behavior, style, or domain-specific transformation, such as always emitting a fixed summary format. RAG is usually better when the system needs current, source-grounded knowledge, such as answers drawn from this week's documentation.
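To make the distinction concrete, the sketch below contrasts the artifacts each approach consumes. The chat-style message format and the `search` retriever are assumptions for illustration, not a specific vendor's API.

```python
# Fine-tuning artifact: supervised pairs that teach a repeatable transformation
# (here, an assumed house style for changelog entries).
finetune_example = {
    "messages": [
        {"role": "user", "content": "Summarize: fixed crash in export dialog"},
        {"role": "assistant", "content": "Fixed: export dialog no longer crashes."},
    ]
}

# RAG artifact: fresh, source-grounded context assembled at query time.
# `search` is a hypothetical retriever over your document index.
def build_rag_prompt(question: str, search) -> str:
    passages = search(question, top_k=3)  # assumed signature: returns dicts
    context = "\n\n".join(f"[{p['source']}] {p['text']}" for p in passages)
    return f"Answer using only the sources below.\n\n{context}\n\nQuestion: {question}"
```

The asymmetry is the point: the fine-tuning pair is baked in at training time, while the RAG prompt is rebuilt from current sources on every request.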
Continuous Improvement
Production evaluation should include real failure cases, user feedback, and trace analysis. The best eval set evolves with observed usage: today's production failure becomes tomorrow's regression case.
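One workable loop is to promote each reviewed production failure into a case in the regression suite sketched earlier. A minimal sketch, assuming hypothetical trace fields; map them from whatever your tracing tool (e.g. Langfuse) actually records.

```python
def trace_to_case(trace: dict) -> dict:
    """Promote a logged production failure into a replayable regression case."""
    return {
        "name": f"regression-{trace['trace_id']}",
        "prompt": trace["input"],
        "must_contain": trace.get("expected_facts", []),  # curated by a reviewer
    }

failed_trace = {
    "trace_id": "abc123",
    "input": "What is our refund window?",
    "expected_facts": ["30 days"],  # added during human review of the failure
}
print(trace_to_case(failed_trace))
```

The human-review step matters here: the expected facts come from a reviewer judging the failed trace, not from the model's own (wrong) output.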
Related pages: Langfuse Observability, RAG and Knowledge Pipelines.