# ADR-0007: Mandatory Syntactic Validation Layer (MSVL) for RAG Evaluation

**Date:** 2026-04-06
**Status:** Proposed
**Deciders:** Rafael Ruiz (CTO), Pablo (AI Team)
**Related ADRs:** ADR-0003 (Hybrid Retrieval RRF), ADR-0004 (Claude as RAGAS Evaluation Judge), ADR-0005 (Embedding Model Selection), ADR-0006 (Reward Algorithm for Dataset Synthesis)

---

## Context
### The evaluation campaign that triggered this ADR
On 2026-04-06, the AI Team ran six evaluation suites using `EvaluateRAG` on the 50-question golden dataset, covering three embedding models across two index configurations each. All six runs returned a verdict of **ACCEPTABLE** from the RAGAS pipeline. The scores are reproduced below:
| Embedding Model | Index | Faithfulness | Answer Relevancy | Context Recall | Context Precision | Global Score |
|---|---|---|---|---|---|---|
| qwen3-0.6B-emb | avap-knowledge-v2-qwen | 0.5329 | 0.8393 | 0.5449 | 0.5843 | **0.6254** |
| qwen3-0.6B-emb | avap-docs-test-v4-qwen | 0.5781 | 0.8472 | 0.6451 | 0.6633 | **0.6834** |
| bge-m3 | avap-knowledge-v2-bge | 0.5431 | 0.8507 | 0.5057 | 0.5689 | **0.6171** |
| bge-m3 | avap-docs-test-v4-bge | 0.5843 | 0.8400 | 0.6067 | 0.6384 | **0.6681** |
| harrier-oss-v1:0.6b | avap-knowledge-v2-harrier | 0.5328 | 0.8424 | 0.4898 | 0.5634 | **0.6071** |
| harrier-oss-v1:0.6b | avap-docs-test-v4-harrier | 0.6829 | 0.8457 | 0.6461 | 0.6688 | **0.7109** |
*Judge model for all runs: `claude-sonnet-4-20250514`. All runs: 50 questions, Docker container on shared EC2.*
### Why these scores are not valid for architectural decisions
Manual inspection of the `answer_preview` fields reveals a systematic pattern that invalidates all six verdicts: **models are generating syntactically invalid AVAP code while receiving ACCEPTABLE scores from the LLM judge.**
The root cause is architectural. The RAGAS judge (Claude Sonnet) evaluates *semantic coherence* — whether the answer is logically consistent with the retrieved context. It does not evaluate *syntactic validity* — whether the generated code would execute on the PLATON kernel. For a proprietary DSL like AVAP, these two properties are independent. A response can score high on faithfulness while containing complete Go syntax.
**Forensic analysis of the six evaluation traces** identifies three distinct failure modes.
#### Failure Mode 1 — Foreign language injection
Models produce complete syntax from Go, Python, or JavaScript inside code blocks labelled `avap`. These responses are not AVAP and would fail at parse time.
| Entry | Model / Index | Language injected | Evidence |
|---|---|---|---|
| GD-V-009 | harrier / avap-knowledge-v2 | **Go** | `package main`, `import "fmt"`, `func main()` inside an `avap` block |
| GD-V-009 | qwen3 / avap-knowledge-v2 | **Go** | `package main`, `import (..."fmt"...)` |
| GD-C-003 | harrier / avap-knowledge-v2 | **Python** | `for i in range(1, 6):` with Python dict literal |
| GD-C-003 | bge-m3 / avap-knowledge-v2 | **Python** | `for i in range(1, 6):` with `# Build the JSON object` comment |
| GD-C-004 | bge-m3 / avap-knowledge-v2 | **JavaScript** | `let allowedRoles = ["admin", ...]`, `.includes(rol)` |
| GD-V-007 | qwen3 / avap-docs-test-v4 | **JS / PHP / Python** | `foreach(item in items)`, Python `print()` |
GD-V-009 is the most critical case. The question asks about AVAP goroutine scope. The model answers with a complete Go program. Claude-Sonnet scored this ACCEPTABLE because the prose surrounding the code is semantically consistent with the retrieved context — the code block itself is never validated.
#### Failure Mode 2 — Hallucinated AVAP commands
Models invent command names that do not exist in the AVAP grammar. These are not foreign languages — they appear syntactically plausible — but would fail at the parser's symbol resolution stage.
| Invented command | Observed in | Real AVAP equivalent |
|---|---|---|
| `getSHA256(x)` | qwen3 | `encodeSHA256(origen, destino)` |
| `generateSHA256Hash(x)` | bge-m3, harrier | `encodeSHA256(origen, destino)` |
| `readParam("x")` | qwen3, bge-m3 | `addParam("x", destino)` |
| `ifParam("x", dest)` | qwen3 | `addParam("x", dest)` + `if(...)` |
| `returnResult(x)` | bge-m3 | `addResult(x)` |
| `getTimeStamp(...)` | qwen3 | `getDateTime(...)` |
| `except(e)` | qwen3 | `exception(e)` |
| `getListParamList(...)` | harrier | Does not exist |
| `variableFromJSON(...)` | harrier | Does not exist |
| `confirmPassword(...)` | bge-m3 | Does not exist |
| `httpGet(...)` | bge-m3 | `RequestGet(...)` |
#### Failure Mode 3 — Structural foreign syntax
Beyond identifiable code blocks, some responses embed structural constructs that are not part of the AVAP grammar: curly-brace function bodies, `while` loops, `let`/`var` declarations, `for`/`foreach` statements. These appear in entries where no foreign language is explicitly named.
#### Summary by model and index
| Model | Index | Foreign syntax (entries) | Hallucinated cmds (entries) | Estimated invalid / 50 |
|---|---|---|---|---|
| qwen3-0.6B-emb | avap-knowledge-v2 | 3 | 2 | ~5 (10%) |
| qwen3-0.6B-emb | avap-docs-test-v4 | 3 | 3 | ~6 (12%) |
| bge-m3 | avap-knowledge-v2 | 6 | 3 | ~8 (16%) |
| bge-m3 | avap-docs-test-v4 | 5 | 1 | ~6 (12%) |
| harrier-oss-v1:0.6b | avap-knowledge-v2 | 2 | 3 | ~5 (10%) |
| harrier-oss-v1:0.6b | avap-docs-test-v4 | 1 | 0 | ~1 (2%) |
*Counts are conservative lower bounds: `answer_preview` fields are truncated at ~300 characters. Full response bodies may contain additional violations not visible in the preview.*
### Relative ordering within this campaign
The data supports a *relative* — not absolute — ordering. **harrier / avap-docs-test-v4** shows the fewest syntactic violations and the highest global score (0.7109). It is the least-bad model in this run. However, this does not make it production-ready: a model that generates correct AVAP in 98% of responses can still fail for a user on a critical code generation query.
**bge-m3** failures are predominantly well-known foreign syntaxes (Python, JavaScript), which makes them identifiable without a parser. **qwen3** introduces invented commands that look like valid AVAP idioms (`ifParam`, `getSHA256`, `getTimeStamp`) — these are harder to detect precisely because they are superficially plausible.
The CTO's conclusion: no model can be selected or rejected based on these six runs. The measurement instrument is not fit for purpose.
### Evaluation environment issues identified in this campaign
Three additional issues compromise reproducibility independently of model quality:
**Mixed execution environments.** Some team members ran `run_evaluation` from local notebooks. Notebook runs do not record temperature or random seeds, making score reproduction impossible across machines and Python environments.
**Undocumented index re-creation.** Bugs were discovered in the existing indices and they were re-indexed with corrected pipelines (`avap_ingestor.py` for `avap-knowledge-v2-*`, `elasticsearch_ingestion.py` for `avap-docs-test-v4-*`). The pre-processing delta between old and new indices was not documented before the evaluation was run, making it impossible to determine whether score differences reflect model quality or index quality.
**BM25 contamination in embedding comparisons.** The pipeline uses Hybrid Retrieval (BM25 + kNN, per ADR-0003). When the goal is to compare embedding models, BM25 acts as a confounding variable: a weaker embedding model can compensate with BM25 recall, masking the true quality differential. Evaluations intended to select an embedding model require a kNN-only retrieval mode that does not exist yet.
### The few-shot gap
The 190 validated AVAP examples from ADR-0006 are not currently injected into the generation prompt. The syntactic failure rates above — roughly 2 to 16% of responses per run — are consistent with a model that has no valid AVAP examples in its prompt context and falls back on pre-training distributions. This is the expected behaviour of a base LLM encountering an unfamiliar DSL without few-shot grounding.
---
## Decision
Establish the **Mandatory Syntactic Validation Layer (MSVL)** as a non-optional prerequisite gate in the `EvaluateRAG` pipeline. Any evaluation score produced without MSVL is classified as non-binding and cannot be cited in architectural decisions.
### 1. Parser integration in `EvaluateRAG`
Every code block in a generated response must be submitted to the AVAP Parser via gRPC before RAGAS scoring. The parser returns a binary result: `VALID` or `INVALID` with a failure category (`unknown_token`, `unexpected_construct`, `foreign_keyword`, `syntax_error`).
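A minimal sketch of this gate, assuming a thin client around the parser's gRPC endpoint — the `parse` callable and the `ParseResult` shape below are placeholders; the real stub is generated from the parser's proto:

```python
import re
from dataclasses import dataclass
from typing import Optional

# Failure categories returned by the parser, per this section.
FAILURE_CATEGORIES = {"unknown_token", "unexpected_construct",
                      "foreign_keyword", "syntax_error"}

@dataclass
class ParseResult:
    valid: bool                     # VALID / INVALID
    category: Optional[str] = None  # one of FAILURE_CATEGORIES when invalid

def extract_code_blocks(answer: str) -> list:
    """Pull every fenced code block out of a generated answer.

    Unlabelled blocks are extracted too: Failure Mode 1 shows foreign code
    routinely appears inside blocks labelled `avap`, so the label is not
    trusted.
    """
    return [m.group(1)
            for m in re.finditer(r"```[\w-]*\n(.*?)```", answer, re.DOTALL)]

def validate_answer(answer: str, parse) -> list:
    """Submit each extracted block to the parser (gRPC in production)."""
    return [parse(block) for block in extract_code_blocks(answer)]
```

The judge never sees these results; they feed the score override in §2.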
### 2. `syntactic_validity` as an independent metric
Introduce `syntactic_validity` (float 0.0–1.0): the fraction of code-bearing responses that pass parser validation within a run. This metric is reported alongside RAGAS scores, not as a replacement.
For entries that fail parser validation, `faithfulness` and `answer_relevancy` are **overridden to 0.0** regardless of the LLM judge's qualitative assessment. The raw RAGAS scores are preserved in the evaluation record for audit.
```
final_faithfulness(entry) =
    0.0                          if parser(entry) = INVALID
    ragas_faithfulness(entry)    otherwise

final_answer_relevancy(entry) =
    0.0                            if parser(entry) = INVALID
    ragas_answer_relevancy(entry)  otherwise
```
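The override rule can be sketched in Python; the per-entry dict layout is an assumption about `EvaluateRAG` internals, but the scoring rule is exactly the one above, and the same pass yields the `syntactic_validity` fraction:

```python
from statistics import mean

def apply_msvl_override(entries):
    """Apply the MSVL override in place; return syntactic_validity.

    Each entry dict carries RAGAS scores plus `parser_valid`: True/False for
    code-bearing responses, None for prose-only answers. Raw judge scores are
    preserved under `raw_*` keys for audit.
    """
    for e in entries:
        e["raw_faithfulness"] = e["faithfulness"]
        e["raw_answer_relevancy"] = e["answer_relevancy"]
        if e.get("parser_valid") is False:   # parser returned INVALID
            e["faithfulness"] = 0.0
            e["answer_relevancy"] = 0.0
    code_bearing = [e for e in entries if e.get("parser_valid") is not None]
    if not code_bearing:
        return None                          # no code in this run
    return mean(1.0 if e["parser_valid"] else 0.0 for e in code_bearing)
```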
### 3. Parser SLA and fallback policy
The AVAP Parser gRPC service must respond within 2 seconds per call. If the parser is unreachable or times out, the evaluation run is **aborted** with an explicit error. Silent fallback to RAGAS-only scoring is prohibited.
```python
if parser_status == UNAVAILABLE:
    raise EvaluationAbortedError(
        "AVAP Parser unreachable. MSVL cannot be bypassed."
    )
```
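The SLA itself can be enforced with a plain timeout around the parser call. The sketch below uses a stdlib executor and a generic `parse` callable in place of the real gRPC stub (which would set a per-call deadline instead):

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as CallTimeout

class EvaluationAbortedError(RuntimeError):
    """Raised when MSVL cannot run; silent fallback is prohibited."""

def parse_with_sla(parse, block, sla_seconds=2.0):
    """Call the parser under the 2 s SLA; abort loudly on timeout or outage."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(parse, block)
    try:
        return future.result(timeout=sla_seconds)
    except CallTimeout:
        raise EvaluationAbortedError(
            "AVAP Parser exceeded the 2 s SLA. MSVL cannot be bypassed.")
    except ConnectionError:
        raise EvaluationAbortedError(
            "AVAP Parser unreachable. MSVL cannot be bypassed.")
    finally:
        pool.shutdown(wait=False)
```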
### 4. Standardised evaluation protocol
Local notebook environments are **prohibited** for official evaluation reports. All evaluations cited in architectural decisions must be executed within the `EvaluateRAG` Docker container in Staging with:
- Fixed random seeds via `EVAL_SEED` environment variable
- `temperature=0` for all generation calls
- `ANTHROPIC_MODEL` pinned to a specific version string, not `latest`
- Index version and the exact ingestion pipeline used documented in the evaluation record *before* the run starts
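A preflight check in this spirit might look like the following; the variable names match the bullets above, while the pinned-version heuristic (a dated suffix such as `claude-sonnet-4-20250514`) is an illustrative assumption:

```python
import os

REQUIRED_VARS = ("EVAL_SEED", "ANTHROPIC_MODEL")

def preflight_check(env=None):
    """Return a list of problems; an official run may start only if empty."""
    env = dict(os.environ) if env is None else env
    problems = [f"{var} is not set" for var in REQUIRED_VARS if not env.get(var)]
    model = env.get("ANTHROPIC_MODEL", "")
    # Heuristic: a pinned model string carries a date (claude-sonnet-4-20250514).
    if model and ("latest" in model or not any(c.isdigit() for c in model)):
        problems.append(
            "ANTHROPIC_MODEL must be pinned to a versioned string, not `latest`")
    return problems
```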
### 5. Few-shot context injection
The 190 validated AVAP examples from ADR-0006 must be injected as few-shot context into the generation prompt. Injection protocol:
- Examples are selected by **semantic similarity** to the current query (top-K retrieval from the validated pool), not injected wholesale
- K defaults to 5; effective K per run is logged in the evaluation record
- If the few-shot retrieval service is unavailable, the run proceeds without injection and this is flagged as `few_shot_injection: degraded` in the report
This directly targets Failure Mode 1: a model that has seen 5 valid AVAP examples before generating code is substantially less likely to default to Go or Python syntax.
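Top-K selection could be as simple as cosine ranking over the pool's precomputed embeddings. A self-contained sketch, assuming `pool` holds `(example_text, embedding)` pairs embedded with the same model as the query:

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def select_few_shot(query_vec, pool, k=5):
    """Pick the top-K validated examples most similar to the query."""
    ranked = sorted(pool, key=lambda ex: cosine(query_vec, ex[1]), reverse=True)
    return [text for text, _ in ranked[:k]]
```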
### 6. Embedding-only evaluation mode
A separate `knn_only` retrieval mode must be implemented in `EvaluateRAG` for evaluations whose explicit purpose is embedding model comparison. This mode disables BM25 and uses only kNN retrieval. Results from this mode are tagged `retrieval_mode: knn_only` and are not comparable with standard hybrid retrieval scores. This mode must be used for any future embedding model selection decision.
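Assuming an Elasticsearch-style request body with placeholder field names (`content`, `embedding` — the real mappings live in the ingestion pipelines), the two modes differ only in whether the BM25 leg is present:

```python
def build_retrieval_query(query_text, query_vector, mode="hybrid", k=10):
    """Sketch of the retrieval request body for the two modes."""
    knn = {"field": "embedding", "query_vector": query_vector,
           "k": k, "num_candidates": 10 * k}
    if mode == "knn_only":
        return {"knn": knn}  # no BM25 leg: embedding quality is isolated
    return {  # hybrid: BM25 + kNN, fused downstream via RRF (ADR-0003)
        "query": {"match": {"content": query_text}},
        "knn": knn,
    }
```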
### 7. Statistical measurement requirements
| Requirement | Specification | Rationale |
|---|---|---|
| **Bootstrap stability** | N ≥ 5 runs per suite | N=3 leaves only 2 degrees of freedom (n−1) for variance estimation; N=5 is the minimum to detect bimodal operating modes |
| **Reported statistics** | Mean (μ) and standard deviation (σ) | Single-run scores cannot be used for decision-making |
| **Leakage audit** | Token distribution analysis per model | Quantifies how much syntactic correctness derives from pre-training bias vs. AVAP documentation retrieval |
| **Syntactic confusion matrix** | Parse failures broken down by category and question ID | Identifies which AVAP constructs (`startLoop`, `ormAccess`, `encodeSHA256`, etc.) require additional documentation or few-shot coverage |
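The first two rows translate directly into an aggregation helper; a sketch, assuming the input list holds one global score per run:

```python
from statistics import mean, stdev

def summarize_runs(global_scores, min_runs=5):
    """Aggregate per-run global scores into the reportable statistics.

    Refuses to summarize fewer than `min_runs` runs, per the bootstrap
    stability requirement.
    """
    if len(global_scores) < min_runs:
        raise ValueError(f"need >= {min_runs} runs, got {len(global_scores)}")
    return {"mean": mean(global_scores),
            "stdev": stdev(global_scores),
            "n": len(global_scores)}
```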
---
## Rationale
### Why 0.0 override rather than a graduated penalty?
For AVAP, syntactic validity is a binary production gate: code either executes on the PLATON kernel or it does not. A graduated penalty would imply partial credit for non-executable output, which has no operational meaning. The override to 0.0 aligns the metric with the actual production outcome. Raw RAGAS scores are preserved for post-hoc analysis if the policy needs to be revised.
### Why abort on parser unavailability rather than degrade?
Silent fallback to RAGAS-only scoring produces evaluation reports that are visually identical to MSVL-validated reports. The purpose of the layer is to prevent false positives. An infrastructure failure that silently removes the gate defeats that purpose entirely. Failing loudly is the only policy consistent with the layer's goal.
### Why few-shot injection by similarity rather than full pool injection?
Injecting all 190 examples wholesale would consume the majority of the generation context window, compressing the retrieved documentation that RAGAS evaluates. Similarity-based top-K selection preserves the most relevant examples while protecting retrieval context fidelity. Coverage of rare construct combinations depends on query distribution — this is measurable via the confusion matrix.
### Why N ≥ 5 runs?
`temperature=0` reduces run-to-run variance but does not eliminate it. Retrieval non-determinism from kNN approximate search and prompt token ordering effects can produce different results at zero temperature. N=3 leaves only 2 degrees of freedom (n−1) for variance estimation, making σ highly unstable. N=5 is the minimum that allows detection of a bimodal distribution (two distinct operating modes) with elementary statistical reliability.
---
## Status of prior evaluations
The six evaluation runs from 2026-04-06 documented in this ADR's Context section are **classified as non-binding**. They may be cited as qualitative evidence of relative model behaviour but cannot be used to select an embedding model for production.
Any evaluation report generated before this ADR's acceptance date that does not include a `syntactic_validity` score is retroactively classified as non-binding.
---
## Alternatives Considered
| Alternative | Rejected because |
|---|---|
| **Post-hoc validation** (flag but do not override scores) | Does not prevent false positives from propagating into decision metrics |
| **Raise RAGAS threshold to ≥ 0.80** | A model could pass at 0.80 with 10% Go injection; does not address the structural misalignment between semantic scoring and syntactic validity |
| **Manual code review per evaluation run** | Not reproducible or scalable; reintroduces evaluator subjectivity |
| **Fine-tuning with AVAP-only data** | Addresses the generation problem but not the measurement problem; MSVL is needed regardless |
| **Disable BM25 for all evaluations** | Removes a production component defined in ADR-0003; the correct solution is an explicit `knn_only` mode for embedding comparisons, not removing hybrid retrieval globally |
---
## Consequences
**Positive:**
- Eliminates the false positive class definitively demonstrated in this ADR's Context: semantically coherent but syntactically invalid responses will no longer receive ACCEPTABLE verdicts.
- `syntactic_validity` becomes a first-class longitudinal metric enabling tracking of DSL fidelity independently of semantic quality.
- Standardised Docker execution with documented seeds ensures scores are reproducible and comparable across team members and time.
- The syntactic confusion matrix creates a direct feedback loop into documentation priorities and few-shot pool expansion.
**Negative:**
- Evaluation latency increases by one gRPC call per generated response. At the 2-second SLA ceiling, a 50-question dataset adds up to ~100 seconds per run in the worst case.
- The AVAP Parser becomes a hard dependency of the evaluation pipeline and must be versioned and kept in sync with the LRM. Parser upgrades may alter score comparability across historical runs.
- N ≥ 5 runs multiplies evaluation cost (API calls, compute time) approximately 5×.
- The `knn_only` retrieval mode and the few-shot retrieval service are engineering work not currently scheduled.
---
## Open Questions
1. **Acceptance threshold for `syntactic_validity`:** This ADR defines how to measure syntactic validity but does not specify the minimum score required for production readiness. A subsequent amendment must define this threshold (e.g., `syntactic_validity ≥ 0.95` for `CODE_GENERATION` questions) before MSVL scores can be used as a hard CI/CD gate.
2. **Parser version pinning policy:** When a parser upgrade changes accepted constructs, historical scores become incomparable. A policy for when upgrades require re-running historical evaluations has not been defined.
3. **Few-shot pool adequacy for the confusion matrix tail:** Whether 190 examples provide adequate coverage of rare construct combinations visible in the confusion matrix has not been empirically tested.
4. **BM25 contamination remediation for existing results:** The `knn_only` evaluation mode should be scheduled before the next embedding model comparison campaign to produce a clean comparative baseline.