ADR-0004: Claude as the RAGAS Evaluation Judge
Date: 2026-03-10
Status: Accepted
Deciders: Rafael Ruiz (CTO)
Context
The EvaluateRAG endpoint runs RAGAS metrics to measure the quality of the RAG pipeline. RAGAS metrics (faithfulness, answer_relevancy, context_recall, context_precision) require an LLM judge to score answers against ground truth and context.
The production LLM is Ollama qwen2.5:1.5b — a small, locally-hosted model optimized for AVAP code generation speed. Using it as the evaluation judge creates a conflict of interest (scoring the system's answers with the same model that generated them) and a quality concern (small models produce unreliable evaluation scores).
Decision
Use Claude (claude-sonnet-4-20250514) as the RAGAS evaluation judge, accessed via the Anthropic API.
The production Ollama LLM is still used for answer generation during evaluation (to measure real-world pipeline quality). Only the scoring step uses Claude.
This requires ANTHROPIC_API_KEY to be set. The EvaluateRAG endpoint fails with an explicit error if the key is missing.
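A minimal sketch of the fail-fast key check; the function name and error message are illustrative, not the actual endpoint code:

```python
import os


def require_anthropic_key() -> str:
    """Fail fast with an explicit error when the judge key is missing."""
    key = os.environ.get("ANTHROPIC_API_KEY")
    if not key:
        raise RuntimeError(
            "EvaluateRAG requires ANTHROPIC_API_KEY: the RAGAS judge "
            "(Claude) cannot be called without it."
        )
    return key
```

Checking at request time (rather than at startup) keeps the key optional for deployments that never run evaluations.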
Rationale
Separation of generation and evaluation
Using a different model for generation and evaluation is standard practice in LLM system evaluation. The evaluation judge must be:
- Independent — not the same model being measured
- High-capability — capable of nuanced faithfulness and relevancy judgements
- Deterministic — consistent scores across runs (achieved via `temperature=0`)
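These three properties can be wired up through LangChain's Anthropic integration. A sketch, assuming `ragas` and `langchain-anthropic` are installed; the imports are deferred so the module loads even when evaluation dependencies are absent:

```python
def build_judge(model_name: str = "claude-sonnet-4-20250514"):
    """Return a RAGAS-compatible judge backed by Claude.

    Deferred imports: ragas and langchain-anthropic are only needed
    when an evaluation actually runs.
    """
    from langchain_anthropic import ChatAnthropic
    from ragas.llms import LangchainLLMWrapper

    # temperature=0 keeps judge scores consistent across evaluation runs
    return LangchainLLMWrapper(ChatAnthropic(model=model_name, temperature=0))
```

The wrapper is what RAGAS metrics accept as their `llm`, so the production Ollama model never participates in scoring.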
Why Claude specifically?
- Claude Sonnet-class models score among the highest on LLM-as-judge benchmarks for English and multilingual evaluation tasks
- The AVAP knowledge base contains bilingual content (Spanish + English); Claude handles both reliably
- The Anthropic SDK is already available in the dependency stack (`langchain-anthropic`)
Cost implications
Claude is called only during explicit EvaluateRAG invocations, not during production queries. Cost per evaluation run depends on dataset size. For 50 questions at standard RAGAS prompt lengths, estimated cost is < $0.50 using Sonnet pricing.
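The < $0.50 figure can be sanity-checked with a back-of-envelope calculation. The per-token prices and per-question token budgets below are assumptions for illustration (verify against Anthropic's published Sonnet rates); the input budget folds all four metrics' judge calls together:

```python
# Assumed Sonnet pricing in USD per million tokens -- check current
# Anthropic pricing before relying on these numbers.
PRICE_IN_PER_M = 3.00
PRICE_OUT_PER_M = 15.00


def estimate_eval_cost(questions: int,
                       in_tokens_per_q: int = 2000,
                       out_tokens_per_q: int = 150) -> float:
    """Rough cost of one RAGAS run; judge prompts are dominated by the
    retrieved context passed in for scoring."""
    cost_in = questions * in_tokens_per_q * PRICE_IN_PER_M / 1_000_000
    cost_out = questions * out_tokens_per_q * PRICE_OUT_PER_M / 1_000_000
    return cost_in + cost_out


# 50 questions at the assumed budgets stays under the $0.50 estimate
cost_50 = estimate_eval_cost(50)
```

If RAGAS is configured to make several judge calls per metric, the per-question budgets above should be scaled accordingly.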
Consequences
- `ANTHROPIC_API_KEY` and `ANTHROPIC_MODEL` become required configuration for the evaluation feature.
- Evaluation runs incur external API costs. This should be factored into the evaluation cadence policy.
- The `judge_model` field in `EvalResponse` records which Claude version was used, enabling score comparisons across model versions over time.
- If Anthropic's API is unreachable or rate-limited, `EvaluateRAG` will fail. This is acceptable since evaluation is a batch operation, not a real-time user-facing feature.
- Any change to `ANTHROPIC_MODEL` may alter scoring distributions. Historical eval scores are only comparable when the same judge model was used.