# ADR-0004: Claude as the RAGAS Evaluation Judge
**Date:** 2026-03-10
**Status:** Accepted
**Deciders:** Rafael Ruiz (CTO), MrHouston Engineering
---
## Context
The `EvaluateRAG` endpoint runs RAGAS metrics to measure the quality of the RAG pipeline. RAGAS metrics (`faithfulness`, `answer_relevancy`, `context_recall`, `context_precision`) require an LLM judge to score answers against ground truth and context.
The production LLM is Ollama `qwen2.5:1.5b` — a small, locally-hosted model optimized for AVAP code generation speed. Using it as the evaluation judge creates a conflict of interest (a system's outputs should not be scored by the same model that produced them) and a quality concern (small models produce unreliable evaluation scores).
---
## Decision
Use **Claude (`claude-sonnet-4-20250514`) as the RAGAS evaluation judge**, accessed via the Anthropic API.
The production Ollama LLM is still used for **answer generation** during evaluation (to measure real-world pipeline quality). Only the scoring step uses Claude.
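This generation/scoring split can be sketched as a small routing helper. The function name `model_for_role`, the `ModelConfig` shape, and the generation temperature are illustrative assumptions; the model identifiers are the ones named in this ADR:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelConfig:
    provider: str
    model: str
    temperature: float

# Illustrative routing: answer generation stays on the production Ollama
# model, only RAGAS scoring goes to Claude (temperature=0 for stable scores).
# The generation temperature of 0.7 is an assumption, not a documented value.
ROLES = {
    "generation": ModelConfig(provider="ollama",
                              model="qwen2.5:1.5b", temperature=0.7),
    "judge": ModelConfig(provider="anthropic",
                         model="claude-sonnet-4-20250514", temperature=0.0),
}

def model_for_role(role: str) -> ModelConfig:
    """Look up which backend serves a given pipeline role."""
    return ROLES[role]
```

Keeping the routing in one table makes the separation auditable: any code path that scores answers must resolve through the `judge` entry, never `generation`.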
This requires `ANTHROPIC_API_KEY` to be set. The `EvaluateRAG` endpoint fails with an explicit error if the key is missing.
---
## Rationale
### Separation of generation and evaluation
Using a different model for generation and evaluation is standard practice in LLM system evaluation. The evaluation judge must be:
1. **Independent** — not the same model being measured
2. **High-capability** — capable of nuanced faithfulness and relevancy judgements
3. **Deterministic** — consistent scores across repeated runs (approximated via `temperature=0`; greedy sampling removes most, though not all, run-to-run variance)
### Why Claude specifically?
- Claude Sonnet-class models score among the highest on LLM-as-judge benchmarks for English and multilingual evaluation tasks
- The AVAP knowledge base contains bilingual content (Spanish + English); Claude handles both reliably
- The Anthropic SDK is already available in the dependency stack (`langchain-anthropic`)
### Cost implications
Claude is called only during explicit `EvaluateRAG` invocations, not during production queries. Cost per evaluation run depends on dataset size. For 50 questions at standard RAGAS prompt lengths, estimated cost is < $0.50 using Sonnet pricing.
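The estimate can be reproduced with a back-of-envelope calculation. The per-call token counts and the $3 / $15 per million token Sonnet prices below are assumptions for illustration, not measured values:

```python
# Rough cost model for one EvaluateRAG run (all numbers are assumptions).
QUESTIONS = 50
METRICS = 4                    # faithfulness, answer_relevancy,
                               # context_recall, context_precision
INPUT_TOKENS_PER_CALL = 500    # assumed RAGAS prompt size per metric call
OUTPUT_TOKENS_PER_CALL = 50    # assumed judge response size
PRICE_IN = 3 / 1_000_000       # assumed Sonnet $/input token
PRICE_OUT = 15 / 1_000_000     # assumed Sonnet $/output token

def eval_run_cost(questions: int = QUESTIONS) -> float:
    """Estimated USD cost of one evaluation run under the assumptions above."""
    per_call = (INPUT_TOKENS_PER_CALL * PRICE_IN
                + OUTPUT_TOKENS_PER_CALL * PRICE_OUT)
    return questions * METRICS * per_call
```

Under these assumptions a 50-question run costs about $0.45, consistent with the < $0.50 figure above; doubling prompt sizes would push it past that bound, so the assumptions should be revisited if RAGAS prompts grow.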
---
## Consequences
- `ANTHROPIC_API_KEY` and `ANTHROPIC_MODEL` become required configuration for the evaluation feature.
- Evaluation runs incur external API costs. This should be factored into the evaluation cadence policy.
- The `judge_model` field in `EvalResponse` records which Claude version was used, enabling score comparisons across model versions over time.
- If Anthropic's API is unreachable or rate-limited, `EvaluateRAG` will fail. This is acceptable since evaluation is a batch operation, not a real-time user-facing feature.
- Any change to `ANTHROPIC_MODEL` may alter scoring distributions. Historical eval scores are only comparable when the same judge model was used.
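The last point can be enforced mechanically. This sketch assumes `EvalResponse` carries the `judge_model` field described above; the helper name `scores_comparable` and the metric fields shown are illustrative:

```python
from dataclasses import dataclass

@dataclass
class EvalResponse:
    faithfulness: float
    answer_relevancy: float
    judge_model: str  # e.g. "claude-sonnet-4-20250514"

def scores_comparable(a: EvalResponse, b: EvalResponse) -> bool:
    """Historical eval scores are only comparable under the same judge model."""
    return a.judge_model == b.judge_model
```

A dashboard or regression check can call this guard before plotting two runs on the same axis, so a judge-model upgrade never masquerades as a pipeline regression (or improvement).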