assistance-engine/docs/ADR/ADR-0004-claude-eval-judge.md


ADR-0004: Claude as the RAGAS Evaluation Judge

Date: 2026-03-10
Status: Accepted
Deciders: Rafael Ruiz (CTO)


Context

The EvaluateRAG endpoint runs RAGAS metrics to measure the quality of the RAG pipeline. RAGAS metrics (faithfulness, answer_relevancy, context_recall, context_precision) require an LLM judge to score answers against ground truth and context.

The production LLM is Ollama qwen2.5:1.5b, a small, locally hosted model optimized for AVAP code-generation speed. Using it as the evaluation judge creates both a conflict of interest (scoring a system's output with the same model that produced it) and a quality concern (small models produce unreliable evaluation scores).


Decision

Use Claude (claude-sonnet-4-20250514) as the RAGAS evaluation judge, accessed via the Anthropic API.

The production Ollama LLM is still used for answer generation during evaluation (to measure real-world pipeline quality). Only the scoring step uses Claude.
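The split described above can be sketched as a loop that calls two independent models. This is an illustrative stub only: `generate_answer` and `judge_score` stand in for the real Ollama and Claude calls, and the record shape is assumed, not taken from the actual endpoint.

```python
from typing import Callable, Dict, List


def run_evaluation(
    questions: List[str],
    generate_answer: Callable[[str], str],     # production Ollama LLM (system under test)
    judge_score: Callable[[str, str], float],  # Claude judge (independent scorer)
) -> List[Dict[str, object]]:
    """Generate answers with the production model, score them with the judge."""
    results = []
    for q in questions:
        answer = generate_answer(q)       # the measured pipeline produces the answer
        score = judge_score(q, answer)    # only this step touches the Anthropic API
        results.append({"question": q, "answer": answer, "score": score})
    return results
```

The point of the shape is that swapping the judge never changes what the production model generates, so the generation path being measured stays identical to production.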

This requires ANTHROPIC_API_KEY to be set. The EvaluateRAG endpoint fails with an explicit error if the key is missing.
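A fail-fast check for the key might look like the following. The helper and error names are hypothetical; the real endpoint's error type and wording may differ.

```python
import os
from typing import Mapping


class MissingAPIKeyError(RuntimeError):
    """Raised when evaluation is requested without an Anthropic key configured."""


def require_anthropic_key(env: Mapping[str, str] = os.environ) -> str:
    """Return the API key, or raise an explicit error before any RAGAS work starts."""
    key = env.get("ANTHROPIC_API_KEY")
    if not key:
        raise MissingAPIKeyError(
            "EvaluateRAG requires ANTHROPIC_API_KEY: the RAGAS judge runs on Claude."
        )
    return key
```

Failing before the run starts avoids wasting generation-side compute on an evaluation whose scoring step cannot complete.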


Rationale

Separation of generation and evaluation

Using a different model for generation and evaluation is standard practice in LLM system evaluation. The evaluation judge must be:

  1. Independent — not the same model being measured
  2. High-capability — capable of nuanced faithfulness and relevancy judgments
  3. Deterministic — consistent scores across runs (approximated via temperature=0)
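The third requirement is typically pinned in configuration. A minimal sketch, assuming a frozen config object (the field names here are illustrative, not the service's actual settings API):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeConfig:
    """Judge parameters pinned for reproducible scoring."""
    model: str = "claude-sonnet-4-20250514"  # value of ANTHROPIC_MODEL
    temperature: float = 0.0                 # minimizes run-to-run score variance
    max_tokens: int = 1024                   # assumed ceiling for judge rationales


DEFAULT_JUDGE = JudgeConfig()
```

Freezing the dataclass makes accidental per-run parameter drift a hard error rather than a silent scoring change.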

Why Claude specifically?

  • Claude Sonnet-class models score among the highest on LLM-as-judge benchmarks for English and multilingual evaluation tasks
  • The AVAP knowledge base contains bilingual content (Spanish + English); Claude handles both reliably
  • The Anthropic SDK is already available in the dependency stack (langchain-anthropic)

Cost implications

Claude is called only during explicit EvaluateRAG invocations, not during production queries. Cost per evaluation run depends on dataset size. For 50 questions at standard RAGAS prompt lengths, estimated cost is < $0.50 using Sonnet pricing.
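The estimate above can be reproduced with back-of-the-envelope arithmetic. The per-question token counts and the Sonnet pricing ($3 / $15 per million input/output tokens) are assumptions, not measurements:

```python
INPUT_PRICE_PER_MTOK = 3.00    # assumed Sonnet input price, USD per million tokens
OUTPUT_PRICE_PER_MTOK = 15.00  # assumed Sonnet output price, USD per million tokens


def eval_run_cost(questions: int,
                  input_tokens_per_q: int = 1500,   # assumed judge prompt size, all metrics
                  output_tokens_per_q: int = 150):  # assumed judge output size, all metrics
    """Rough USD cost of one EvaluateRAG run under the assumptions above."""
    input_cost = questions * input_tokens_per_q * INPUT_PRICE_PER_MTOK / 1_000_000
    output_cost = questions * output_tokens_per_q * OUTPUT_PRICE_PER_MTOK / 1_000_000
    return input_cost + output_cost
```

Under these assumptions, 50 questions come to roughly $0.34, consistent with the < $0.50 figure; doubling the dataset or the prompt size scales the cost linearly.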


Consequences

  • ANTHROPIC_API_KEY and ANTHROPIC_MODEL become required configuration for the evaluation feature.
  • Evaluation runs incur external API costs. This should be factored into the evaluation cadence policy.
  • The judge_model field in EvalResponse records which Claude version was used, enabling score comparisons across model versions over time.
  • If Anthropic's API is unreachable or rate-limited, EvaluateRAG will fail. This is acceptable since evaluation is a batch operation, not a real-time user-facing feature.
  • Any change to ANTHROPIC_MODEL may alter scoring distributions. Historical eval scores are only comparable when the same judge model was used.
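The comparability constraint in the last bullet suggests grouping historical scores by judge before any trend analysis. A sketch, assuming each run record carries the `judge_model` field from EvalResponse plus a metric value (the record shape is assumed):

```python
from collections import defaultdict
from typing import Dict, List


def scores_by_judge(runs: List[dict]) -> Dict[str, List[float]]:
    """Bucket faithfulness scores by judge model so comparisons stay within one judge."""
    grouped: Dict[str, List[float]] = defaultdict(list)
    for run in runs:
        grouped[run["judge_model"]].append(run["faithfulness"])
    return dict(grouped)
```

Comparing scores only within a bucket avoids mistaking a judge-model upgrade for a real change in pipeline quality.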