# ADR-0004: Claude as the RAGAS Evaluation Judge

**Date:** 2026-03-10
**Status:** Accepted
**Deciders:** Rafael Ruiz (CTO), MrHouston Engineering

---

## Context

The `EvaluateRAG` endpoint runs RAGAS metrics to measure the quality of the RAG pipeline. RAGAS metrics (`faithfulness`, `answer_relevancy`, `context_recall`, `context_precision`) require an LLM judge to score answers against ground truth and context.

The production LLM is Ollama `qwen2.5:1.5b` — a small, locally hosted model optimized for AVAP code-generation speed. Using it as the evaluation judge creates a conflict of interest (scoring a system with the same model that generated its answers) and a quality concern (small models produce unreliable evaluation scores).

---

## Decision

Use **Claude (`claude-sonnet-4-20250514`) as the RAGAS evaluation judge**, accessed via the Anthropic API.

The production Ollama LLM is still used for **answer generation** during evaluation (to measure real-world pipeline quality). Only the scoring step uses Claude.

This requires `ANTHROPIC_API_KEY` to be set. The `EvaluateRAG` endpoint fails with an explicit error if the key is missing.

---

## Rationale

### Separation of generation and evaluation

Using different models for generation and evaluation is standard practice in LLM system evaluation. The evaluation judge must be:

1. **Independent** — not the same model being measured
2. **High-capability** — capable of nuanced faithfulness and relevancy judgements
3. **Deterministic** — consistent scores across runs (achieved via `temperature=0`)

### Why Claude specifically?
- Claude Sonnet-class models score among the highest on LLM-as-judge benchmarks for English and multilingual evaluation tasks
- The AVAP knowledge base contains bilingual content (Spanish + English); Claude handles both reliably
- The Anthropic SDK is already available in the dependency stack (`langchain-anthropic`)

### Cost implications

Claude is called only during explicit `EvaluateRAG` invocations, not during production queries. Cost per evaluation run depends on dataset size. For 50 questions at standard RAGAS prompt lengths, the estimated cost is under $0.50 at Sonnet pricing.

---

## Consequences

- `ANTHROPIC_API_KEY` and `ANTHROPIC_MODEL` become required configuration for the evaluation feature.
- Evaluation runs incur external API costs. This should be factored into the evaluation cadence policy.
- The `judge_model` field in `EvalResponse` records which Claude version was used, enabling score comparisons across judge versions over time.
- If Anthropic's API is unreachable or rate-limited, `EvaluateRAG` will fail. This is acceptable because evaluation is a batch operation, not a real-time user-facing feature.
- Any change to `ANTHROPIC_MODEL` may alter scoring distributions. Historical eval scores are only comparable when the same judge model was used.
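---

## Appendix: judge wiring sketch

A minimal sketch of the decision above — fail fast when the Anthropic configuration is missing, then wrap Claude (`temperature=0`) as the RAGAS judge. The helper names and error message are illustrative, not the actual `EvaluateRAG` implementation; the `ChatAnthropic` and `LangchainLLMWrapper` calls follow the public APIs of `langchain-anthropic` and `ragas`.

```python
import os


def require_anthropic_config() -> dict:
    """Fail fast with an explicit error if the judge config is missing.

    Illustrative helper: mirrors the ADR's requirement that EvaluateRAG
    refuses to run without ANTHROPIC_API_KEY rather than silently falling
    back to the production Ollama model.
    """
    api_key = os.environ.get("ANTHROPIC_API_KEY")
    if not api_key:
        raise RuntimeError(
            "EvaluateRAG requires ANTHROPIC_API_KEY: the RAGAS judge runs "
            "on Claude, not the production Ollama model."
        )
    return {
        "api_key": api_key,
        # Falls back to the judge version pinned in this ADR.
        "model": os.environ.get("ANTHROPIC_MODEL", "claude-sonnet-4-20250514"),
    }


def build_judge():
    """Wrap Claude as the RAGAS judge; temperature=0 for repeatable scores."""
    # Imported lazily so the config guard is usable without these extras.
    from langchain_anthropic import ChatAnthropic
    from ragas.llms import LangchainLLMWrapper

    cfg = require_anthropic_config()
    chat = ChatAnthropic(model=cfg["model"], temperature=0, api_key=cfg["api_key"])
    return LangchainLLMWrapper(chat)
```

In use, the wrapper would be passed as the `llm=` argument to `ragas.evaluate(...)` while the answers under evaluation are still produced by the production Ollama pipeline, keeping generation and scoring on separate models.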
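## Appendix: cost back-of-envelope

To make the "< $0.50 for 50 questions" estimate reproducible, here is the arithmetic behind it. The per-call token counts below are rough assumptions, not measured values, and actual RAGAS prompts vary with context length; the $3 / $15 per-million-token input/output prices are Claude Sonnet list prices as of this ADR's date.

```python
def eval_cost_usd(
    questions: int,
    metrics: int = 4,          # faithfulness, answer_relevancy, context_recall, context_precision
    in_tokens: int = 500,      # assumed average prompt size per judge call
    out_tokens: int = 50,      # assumed average judgement size per call
    in_price: float = 3.0,     # USD per million input tokens (Sonnet)
    out_price: float = 15.0,   # USD per million output tokens (Sonnet)
) -> float:
    """Rough cost of one evaluation run: one judge call per question per metric."""
    calls = questions * metrics
    return calls * (in_tokens * in_price + out_tokens * out_price) / 1_000_000


# 50 questions x 4 metrics = 200 judge calls -> $0.45 under these assumptions
print(f"${eval_cost_usd(50):.2f}")
```

Doubling the assumed prompt size roughly doubles the figure, which is why the cadence policy mentioned under Consequences matters more than the per-run cost itself.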