# ADR-0004: Claude as the RAGAS Evaluation Judge
**Date:** 2026-03-10
**Status:** Accepted
**Deciders:** Rafael Ruiz (CTO)
---
## Context
The `EvaluateRAG` endpoint runs RAGAS metrics to measure the quality of the RAG pipeline. RAGAS metrics (`faithfulness`, `answer_relevancy`, `context_recall`, `context_precision`) require an LLM judge to score answers against ground truth and context.
The production LLM is Ollama `qwen2.5:1.5b`, a small, locally hosted model optimized for AVAP code-generation speed. Using it as the evaluation judge creates both a conflict of interest (scoring outputs with the same model that produced them) and a quality concern (small models produce unreliable evaluation scores).
---
## Decision
Use **Claude (`claude-sonnet-4-20250514`) as the RAGAS evaluation judge**, accessed via the Anthropic API.
The production Ollama LLM is still used for **answer generation** during evaluation (to measure real-world pipeline quality). Only the scoring step uses Claude.
This requires `ANTHROPIC_API_KEY` to be set. The `EvaluateRAG` endpoint fails with an explicit error if the key is missing.
---
## Rationale
### Separation of generation and evaluation
Using a different model for generation and evaluation is standard practice in LLM system evaluation. The evaluation judge must be:
1. **Independent** — not the same model being measured
2. **High-capability** — capable of nuanced faithfulness and relevancy judgments
3. **Deterministic** — consistent scores across runs (achieved via `temperature=0`)
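The three requirements above map directly onto how the judge is wired into RAGAS. A configuration sketch, assuming the `ragas` and `langchain-anthropic` package APIs; `dataset` is a placeholder for the evaluation set, with answers produced by the production Ollama model:

```python
# Sketch only — assumes ragas' LangchainLLMWrapper and langchain-anthropic.
from langchain_anthropic import ChatAnthropic
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

# Independent, high-capability judge; temperature=0 keeps scores
# consistent across runs (the "deterministic" requirement above).
judge = LangchainLLMWrapper(
    ChatAnthropic(model="claude-sonnet-4-20250514", temperature=0)
)

# `dataset` rows carry question, contexts, ground_truth, and an `answer`
# column generated by the production Ollama qwen2.5:1.5b pipeline.
result = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_recall, context_precision],
    llm=judge,
)
```

Passing the judge via `llm=` keeps the scoring model fully separate from the generation model, which only appears in the dataset's `answer` column.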
### Why Claude specifically?
- Claude Sonnet-class models score among the highest on LLM-as-judge benchmarks for English and multilingual evaluation tasks
- The AVAP knowledge base contains bilingual content (Spanish + English); Claude handles both reliably
- The Anthropic SDK is already available in the dependency stack (`langchain-anthropic`)
### Cost implications
Claude is called only during explicit `EvaluateRAG` invocations, not during production queries. Cost per evaluation run depends on dataset size. For 50 questions at standard RAGAS prompt lengths, estimated cost is < $0.50 using Sonnet pricing.
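That estimate can be sanity-checked with a back-of-the-envelope calculation. The token budgets and Sonnet prices below are illustrative assumptions, not measured values:

```python
def estimate_eval_cost(
    n_questions: int,
    input_tokens_per_q: int = 2_500,   # assumed: judge prompts across all metrics
    output_tokens_per_q: int = 150,    # assumed: short structured verdicts
    usd_per_m_input: float = 3.0,      # assumed Sonnet input price per 1M tokens
    usd_per_m_output: float = 15.0,    # assumed Sonnet output price per 1M tokens
) -> float:
    """Rough per-run judge cost in USD for one EvaluateRAG invocation."""
    per_question = (
        input_tokens_per_q * usd_per_m_input
        + output_tokens_per_q * usd_per_m_output
    ) / 1_000_000
    return n_questions * per_question
```

Under these assumptions, 50 questions come to roughly $0.49, consistent with the < $0.50 estimate; doubling prompt length or dataset size scales the figure linearly.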
---
## Consequences
- `ANTHROPIC_API_KEY` and `ANTHROPIC_MODEL` become required configuration for the evaluation feature.
- Evaluation runs incur external API costs. This should be factored into the evaluation cadence policy.
- The `judge_model` field in `EvalResponse` records which Claude version was used, enabling score comparisons across model versions over time.
- If Anthropic's API is unreachable or rate-limited, `EvaluateRAG` will fail. This is acceptable since evaluation is a batch operation, not a real-time user-facing feature.
- Any change to `ANTHROPIC_MODEL` may alter scoring distributions. Historical eval scores are only comparable when the same judge model was used.