# ADR-0004: Claude as the RAGAS Evaluation Judge
**Date:** 2026-03-10
**Status:** Accepted
**Deciders:** Rafael Ruiz (CTO), MrHouston Engineering
---
## Context
The `EvaluateRAG` endpoint runs RAGAS metrics to measure the quality of the RAG pipeline. RAGAS metrics (`faithfulness`, `answer_relevancy`, `context_recall`, `context_precision`) require an LLM judge to score answers against ground truth and context.
The production LLM is Ollama `qwen2.5:1.5b` — a small, locally-hosted model optimized for AVAP code generation speed. Using it as the evaluation judge creates a conflict of interest (a system's outputs should not be scored by the same model that produced them) and a quality concern (small models produce unreliable evaluation scores).
---
## Decision
Use **Claude (`claude-sonnet-4-20250514`) as the RAGAS evaluation judge**, accessed via the Anthropic API.
The production Ollama LLM is still used for **answer generation** during evaluation (to measure real-world pipeline quality). Only the scoring step uses Claude.
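This generation/scoring split can be sketched as a small routing helper. The function name `model_for_role`, the `ModelConfig` shape, and the generation temperature are illustrative assumptions; the model identifiers are the ones named in this ADR:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelConfig:
    provider: str
    model: str
    temperature: float

# Illustrative routing: answer generation stays on the production Ollama
# model, only RAGAS scoring goes to Claude (temperature=0 for stable scores).
# The generation temperature of 0.7 is an assumption, not a documented value.
ROLES = {
    "generation": ModelConfig(provider="ollama",
                              model="qwen2.5:1.5b", temperature=0.7),
    "judge": ModelConfig(provider="anthropic",
                         model="claude-sonnet-4-20250514", temperature=0.0),
}

def model_for_role(role: str) -> ModelConfig:
    """Look up which backend serves a given pipeline role."""
    return ROLES[role]
```

Keeping the routing in one table makes the separation auditable: any code path that scores answers must resolve through the `judge` entry, never `generation`.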
This requires `ANTHROPIC_API_KEY` to be set. The `EvaluateRAG` endpoint fails with an explicit error if the key is missing.
---
## Rationale
### Separation of generation and evaluation
Using a different model for generation and evaluation is standard practice in LLM system evaluation. The evaluation judge must be:
1. **Independent** — not the same model being measured
2. **High-capability** — capable of nuanced faithfulness and relevancy judgements
3. **Deterministic** — consistent scores across repeated runs (approximated via `temperature=0`; greedy sampling removes most, though not all, run-to-run variance)
### Why Claude specifically?
- Claude Sonnet-class models score among the highest on LLM-as-judge benchmarks for English and multilingual evaluation tasks
- The AVAP knowledge base contains bilingual content (Spanish + English); Claude handles both reliably
- The Anthropic SDK is already available in the dependency stack (`langchain-anthropic`)
### Cost implications
Claude is called only during explicit `EvaluateRAG` invocations, not during production queries. Cost per evaluation run depends on dataset size. For 50 questions at standard RAGAS prompt lengths, estimated cost is < $0.50 using Sonnet pricing.
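The estimate can be reproduced with a back-of-envelope calculation. The per-call token counts and the $3 / $15 per million token Sonnet prices below are assumptions for illustration, not measured values:

```python
# Rough cost model for one EvaluateRAG run (all numbers are assumptions).
QUESTIONS = 50
METRICS = 4                    # faithfulness, answer_relevancy,
                               # context_recall, context_precision
INPUT_TOKENS_PER_CALL = 500    # assumed RAGAS prompt size per metric call
OUTPUT_TOKENS_PER_CALL = 50    # assumed judge response size
PRICE_IN = 3 / 1_000_000       # assumed Sonnet $/input token
PRICE_OUT = 15 / 1_000_000     # assumed Sonnet $/output token

def eval_run_cost(questions: int = QUESTIONS) -> float:
    """Estimated USD cost of one evaluation run under the assumptions above."""
    per_call = (INPUT_TOKENS_PER_CALL * PRICE_IN
                + OUTPUT_TOKENS_PER_CALL * PRICE_OUT)
    return questions * METRICS * per_call
```

Under these assumptions a 50-question run costs about $0.45, consistent with the < $0.50 figure above; doubling prompt sizes would push it past that bound, so the assumptions should be revisited if RAGAS prompts grow.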
---
## Consequences
- `ANTHROPIC_API_KEY` and `ANTHROPIC_MODEL` become required configuration for the evaluation feature.
- Evaluation runs incur external API costs. This should be factored into the evaluation cadence policy.
- The `judge_model` field in `EvalResponse` records which Claude version was used, enabling score comparisons across model versions over time.
- If Anthropic's API is unreachable or rate-limited, `EvaluateRAG` will fail. This is acceptable since evaluation is a batch operation, not a real-time user-facing feature.
- Any change to `ANTHROPIC_MODEL` may alter scoring distributions. Historical eval scores are only comparable when the same judge model was used.
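The last point can be enforced mechanically. This sketch assumes `EvalResponse` carries the `judge_model` field described above; the helper name `scores_comparable` and the metric fields shown are illustrative:

```python
from dataclasses import dataclass

@dataclass
class EvalResponse:
    faithfulness: float
    answer_relevancy: float
    judge_model: str  # e.g. "claude-sonnet-4-20250514"

def scores_comparable(a: EvalResponse, b: EvalResponse) -> bool:
    """Historical eval scores are only comparable under the same judge model."""
    return a.judge_model == b.judge_model
```

A dashboard or regression check can call this guard before plotting two runs on the same axis, so a judge-model upgrade never masquerades as a pipeline regression (or improvement).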