
ADR-0007: Mandatory Syntactic Validation Layer (MSVL) for RAG Evaluation

Date: 2026-04-06
Status: Proposed
Deciders: Rafael Ruiz (CTO), Pablo (AI Team)
Related ADRs: ADR-0003 (Hybrid Retrieval RRF), ADR-0004 (Claude as RAGAS Evaluation Judge), ADR-0005 (Embedding Model Selection), ADR-0006 (Reward Algorithm for Dataset Synthesis)


Context

The evaluation campaign that triggered this ADR

On 2026-04-06, the AI Team ran six evaluation suites using EvaluateRAG on the 50-question golden dataset, covering three embedding models across two index configurations each. All six runs returned a verdict of ACCEPTABLE from the RAGAS pipeline. The scores are reproduced below:

| Embedding Model | Index | Faithfulness | Answer Relevancy | Context Recall | Context Precision | Global Score |
|---|---|---|---|---|---|---|
| qwen3-0.6B-emb | avap-knowledge-v2-qwen | 0.5329 | 0.8393 | 0.5449 | 0.5843 | 0.6254 |
| qwen3-0.6B-emb | avap-docs-test-v4-qwen | 0.5781 | 0.8472 | 0.6451 | 0.6633 | 0.6834 |
| bge-m3 | avap-knowledge-v2-bge | 0.5431 | 0.8507 | 0.5057 | 0.5689 | 0.6171 |
| bge-m3 | avap-docs-test-v4-bge | 0.5843 | 0.8400 | 0.6067 | 0.6384 | 0.6681 |
| harrier-oss-v1:0.6b | avap-knowledge-v2-harrier | 0.5328 | 0.8424 | 0.4898 | 0.5634 | 0.6071 |
| harrier-oss-v1:0.6b | avap-docs-test-v4-harrier | 0.6829 | 0.8457 | 0.6461 | 0.6688 | 0.7109 |

Judge model for all runs: claude-sonnet-4-20250514. All runs: 50 questions, Docker container on shared EC2.

Why these scores are not valid for architectural decisions

Manual inspection of the answer_preview fields reveals a systematic pattern that invalidates all six verdicts: models are generating syntactically invalid AVAP code while receiving ACCEPTABLE scores from the LLM judge.

The root cause is architectural. The RAGAS judge (Claude Sonnet) evaluates semantic coherence — whether the answer is logically consistent with the retrieved context. It does not evaluate syntactic validity — whether the generated code would execute on the PLATON kernel. For a proprietary DSL like AVAP, these two properties are independent. A response can score high on faithfulness while containing complete Go syntax.

Forensic analysis of the six evaluation traces identifies three distinct failure modes.

Failure Mode 1 — Foreign language injection

Models produce complete syntax from Go, Python, or JavaScript inside code blocks labelled avap. These responses are not AVAP and would fail at parse time.

| Entry | Model / Index | Language injected | Evidence |
|---|---|---|---|
| GD-V-009 | harrier / avap-knowledge-v2 | Go | `package main`, `import "fmt"`, `func main()` inside an `avap` block |
| GD-V-009 | qwen3 / avap-knowledge-v2 | Go | `package main`, `import (..."fmt"...)` |
| GD-C-003 | harrier / avap-knowledge-v2 | Python | `for i in range(1, 6):` with Python dict literal |
| GD-C-003 | bge-m3 / avap-knowledge-v2 | Python | `for i in range(1, 6):` with `# Build the JSON object` comment |
| GD-C-004 | bge-m3 / avap-knowledge-v2 | JavaScript | `let allowedRoles = ["admin", ...]`, `.includes(rol)` |
| GD-V-007 | qwen3 / avap-docs-test-v4 | JS / PHP / Python | `foreach(item in items)`, Python `print()` |

GD-V-009 is the most critical case. The question asks about AVAP goroutine scope. The model answers with a complete Go program. Claude-Sonnet scored this ACCEPTABLE because the prose surrounding the code is semantically consistent with the retrieved context — the code block itself is never validated.

Failure Mode 2 — Hallucinated AVAP commands

Models invent command names that do not exist in the AVAP grammar. These are not foreign languages — they appear syntactically plausible — but would fail at the parser's symbol resolution stage.

| Invented command | Observed in | Real AVAP equivalent |
|---|---|---|
| getSHA256(x) | qwen3 | encodeSHA256(origen, destino) |
| generateSHA256Hash(x) | bge-m3, harrier | encodeSHA256(origen, destino) |
| readParam("x") | qwen3, bge-m3 | addParam("x", destino) |
| ifParam("x", dest) | qwen3 | addParam("x", dest) + if(...) |
| returnResult(x) | bge-m3 | addResult(x) |
| getTimeStamp(...) | qwen3 | getDateTime(...) |
| except(e) | qwen3 | exception(e) |
| getListParamList(...) | harrier | Does not exist |
| variableFromJSON(...) | harrier | Does not exist |
| confirmPassword(...) | bge-m3 | Does not exist |
| httpGet(...) | bge-m3 | RequestGet(...) |

Failure Mode 3 — Structural foreign syntax

Beyond identifiable code blocks, some responses embed structural constructs that are not part of the AVAP grammar: curly-brace function bodies, while loops, let/var declarations, for/foreach statements. These appear in entries where no foreign language is explicitly named.

Summary by model and index

| Model | Index | Foreign syntax (entries) | Hallucinated cmds (entries) | Estimated invalid / 50 |
|---|---|---|---|---|
| qwen3-0.6B-emb | avap-knowledge-v2 | 3 | 2 | ~5 (10%) |
| qwen3-0.6B-emb | avap-docs-test-v4 | 3 | 3 | ~6 (12%) |
| bge-m3 | avap-knowledge-v2 | 6 | 3 | ~8 (16%) |
| bge-m3 | avap-docs-test-v4 | 5 | 1 | ~6 (12%) |
| harrier-oss-v1:0.6b | avap-knowledge-v2 | 2 | 3 | ~5 (10%) |
| harrier-oss-v1:0.6b | avap-docs-test-v4 | 1 | 0 | ~1 (2%) |

Counts are conservative lower bounds: answer_preview fields are truncated at ~300 characters. Full response bodies may contain additional violations not visible in the preview.

Relative ordering within this campaign

The data supports a relative — not absolute — ordering. harrier / avap-docs-test-v4 shows the fewest syntactic violations and the highest global score (0.7109). It is the least-bad model in this run. However, this does not make it production-ready: a model that generates correct AVAP in 98% of responses can still fail for a user on a critical code generation query.

bge-m3 failures are predominantly well-known foreign syntaxes (Python, JavaScript), which makes them identifiable without a parser. qwen3 introduces invented commands that look like valid AVAP idioms (ifParam, getSHA256, getTimeStamp) — these are harder to detect precisely because they are superficially plausible.

The CTO's conclusion: no model can be selected or rejected based on these six runs. The measurement instrument is not fit for purpose.

Evaluation environment issues identified in this campaign

Three additional issues compromise reproducibility independently of model quality:

Mixed execution environments. Some team members ran run_evaluation from local notebooks. Notebook runs do not record temperature or random seeds, making score reproduction impossible across machines and Python environments.

Undocumented index re-creation. Bugs were discovered in the existing indices and they were re-indexed with corrected pipelines (avap_ingestor.py for avap-knowledge-v2-*, elasticsearch_ingestion.py for avap-docs-test-v4-*). The pre-processing delta between old and new indices was not documented before the evaluation was run, making it impossible to determine whether score differences reflect model quality or index quality.

BM25 contamination in embedding comparisons. The pipeline uses Hybrid Retrieval (BM25 + kNN, per ADR-0003). When the goal is to compare embedding models, BM25 acts as a confounding variable: a weaker embedding model can compensate with BM25 recall, masking the true quality differential. Evaluations intended to select an embedding model require a kNN-only retrieval mode that does not exist yet.

The few-shot gap

The 190 validated AVAP examples from ADR-0006 are not currently injected into the generation prompt. The syntactic failure rates above (roughly 2 to 16% of responses per run, per the summary table) are consistent with a model that has no valid AVAP examples in its prompt context and falls back on pre-training distributions. This is the expected behaviour of a base LLM encountering an unfamiliar DSL without few-shot grounding.


Decision

Establish the Mandatory Syntactic Validation Layer (MSVL) as a non-optional prerequisite gate in the EvaluateRAG pipeline. Any evaluation score produced without MSVL is classified as non-binding and cannot be cited in architectural decisions.

1. Parser integration in EvaluateRAG

Every code block in a generated response must be submitted to the AVAP Parser via gRPC before RAGAS scoring. The parser returns a binary result: VALID or INVALID with a failure category (unknown_token, unexpected_construct, foreign_keyword, syntax_error).
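A minimal sketch of the client-side contract, assuming a Python harness. The stub method name `Validate` and its response fields are hypothetical; the real service and message types live in the parser's proto definition:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FailureCategory(Enum):
    """Failure categories defined by this ADR for INVALID results."""
    UNKNOWN_TOKEN = "unknown_token"
    UNEXPECTED_CONSTRUCT = "unexpected_construct"
    FOREIGN_KEYWORD = "foreign_keyword"
    SYNTAX_ERROR = "syntax_error"


@dataclass(frozen=True)
class ParseResult:
    """Binary parser verdict; `category` is set only when invalid."""
    valid: bool
    category: Optional[FailureCategory] = None


def validate_code_block(code: str, parser_stub) -> ParseResult:
    """Submit one extracted code block to the AVAP Parser.

    `parser_stub.Validate` is a placeholder for the real gRPC call;
    only the VALID/INVALID-plus-category shape comes from the ADR.
    """
    response = parser_stub.Validate(code)
    if response.valid:
        return ParseResult(valid=True)
    return ParseResult(valid=False, category=FailureCategory(response.category))
```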

2. syntactic_validity as an independent metric

Introduce syntactic_validity (a float in [0.0, 1.0]): the fraction of code-bearing responses that pass parser validation within a run. This metric is reported alongside RAGAS scores, not as a replacement.

For entries that fail parser validation, faithfulness and answer_relevancy are overridden to 0.0 regardless of the LLM judge's qualitative assessment. The raw RAGAS scores are preserved in the evaluation record for audit.

final_faithfulness(entry) =
    0.0                             if parser(entry) = INVALID
    ragas_faithfulness(entry)       otherwise

final_answer_relevancy(entry) =
    0.0                             if parser(entry) = INVALID
    ragas_answer_relevancy(entry)   otherwise
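The metric and the override above can be sketched as follows; entry field names (`has_code`, `parser_verdict`, etc.) are illustrative, not the pipeline's actual schema:

```python
def syntactic_validity(entries: list[dict]) -> float:
    """Fraction of code-bearing entries whose code passed the parser.

    Convention chosen here: a run with no code-bearing entries scores 1.0.
    """
    code_bearing = [e for e in entries if e["has_code"]]
    if not code_bearing:
        return 1.0
    passed = sum(e["parser_verdict"] == "VALID" for e in code_bearing)
    return passed / len(code_bearing)


def apply_msvl_override(entry: dict) -> dict:
    """Zero out faithfulness and answer_relevancy for parser-invalid
    entries, preserving the raw RAGAS scores for audit."""
    invalid = entry["parser_verdict"] == "INVALID"
    return {
        **entry,
        "raw_faithfulness": entry["faithfulness"],
        "raw_answer_relevancy": entry["answer_relevancy"],
        "faithfulness": 0.0 if invalid else entry["faithfulness"],
        "answer_relevancy": 0.0 if invalid else entry["answer_relevancy"],
    }
```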

3. Parser SLA and fallback policy

The AVAP Parser gRPC service must respond within 2 seconds per call. If the parser is unreachable or times out, the evaluation run is aborted with an explicit error. Silent fallback to RAGAS-only scoring is prohibited.

if parser_status == UNAVAILABLE:
    raise EvaluationAbortedError(
        "AVAP Parser unreachable. MSVL cannot be bypassed."
    )
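A runnable version of the abort policy might look like the sketch below. `parser_call` stands in for whatever callable wraps the gRPC request; in the real client, gRPC's own per-call deadline (`timeout=` on the stub method) is the natural enforcement mechanism:

```python
class EvaluationAbortedError(RuntimeError):
    """Raised when MSVL cannot run; the evaluation must not continue."""


def validate_with_sla(code: str, parser_call, timeout_s: float = 2.0):
    """Call the parser under the 2-second SLA; fail loudly, never degrade.

    Any transport failure (unreachable service, deadline exceeded) is
    converted into a hard abort rather than a silent RAGAS-only fallback.
    """
    try:
        return parser_call(code, timeout=timeout_s)
    except Exception as exc:
        raise EvaluationAbortedError(
            "AVAP Parser unreachable. MSVL cannot be bypassed."
        ) from exc
```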

4. Standardised evaluation protocol

Local notebook environments are prohibited for official evaluation reports. All evaluations cited in architectural decisions must be executed within the EvaluateRAG Docker container in Staging with:

  • Fixed random seeds via EVAL_SEED environment variable
  • temperature=0 for all generation calls
  • ANTHROPIC_MODEL pinned to a specific version string, not latest
  • Index version and the exact ingestion pipeline used documented in the evaluation record before the run starts
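One way to make the protocol self-enforcing is a preflight check that refuses to start a run when any requirement is missing. EVAL_SEED and ANTHROPIC_MODEL come from this ADR; the function name and parameter shape are a sketch:

```python
import os


def preflight_check(generation_params: dict) -> None:
    """Refuse to start an official evaluation unless the protocol holds.

    Mirrors the protocol bullets: seed set, temperature=0, model pinned.
    """
    if "EVAL_SEED" not in os.environ:
        raise RuntimeError("EVAL_SEED must be set for reproducibility")
    if generation_params.get("temperature") != 0:
        raise RuntimeError("Official runs require temperature=0")
    model = os.environ.get("ANTHROPIC_MODEL", "")
    if not model or model.endswith("latest"):
        raise RuntimeError("ANTHROPIC_MODEL must be pinned to a version string")
```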

5. Few-shot context injection

The 190 validated AVAP examples from ADR-0006 must be injected as few-shot context into the generation prompt. Injection protocol:

  • Examples are selected by semantic similarity to the current query (top-K retrieval from the validated pool), not injected wholesale
  • K defaults to 5; effective K per run is logged in the evaluation record
  • If the few-shot retrieval service is unavailable, the run proceeds without injection and this is flagged as few_shot_injection: degraded in the report

This directly targets Failure Mode 1: a model that has seen 5 valid AVAP examples before generating code is substantially less likely to default to Go or Python syntax.
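The top-K selection step can be sketched with plain cosine similarity; the real service presumably delegates to a vector store, so the pool representation here (embedding, text) is an assumption:

```python
import math


def top_k_examples(query_vec, pool, k=5):
    """Pick the K validated examples most similar to the query.

    `pool` is a list of (embedding, example_text) pairs drawn from the
    190-example validated pool; embeddings are plain float lists here.
    """
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
        return dot / norm if norm else 0.0

    # Rank the whole pool by similarity, keep only the top K texts.
    ranked = sorted(pool, key=lambda p: cosine(query_vec, p[0]), reverse=True)
    return [text for _, text in ranked[:k]]
```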

6. Embedding-only evaluation mode

A separate knn_only retrieval mode must be implemented in EvaluateRAG for evaluations whose explicit purpose is embedding model comparison. This mode disables BM25 and uses only kNN retrieval. Results from this mode are tagged retrieval_mode: knn_only and are not comparable with standard hybrid retrieval scores. This mode must be used for any future embedding model selection decision.
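Assuming the indices live in Elasticsearch (consistent with elasticsearch_ingestion.py above), the mode switch amounts to dropping the BM25 clause from the request body. Field names (`embedding`, `content`) and sizes are placeholders:

```python
from enum import Enum


class RetrievalMode(Enum):
    HYBRID = "hybrid"        # BM25 + kNN, per ADR-0003
    KNN_ONLY = "knn_only"    # embedding-comparison mode, per this ADR


def build_es_query(query_vector, query_text, mode):
    """Sketch of an Elasticsearch 8.x request body per retrieval mode."""
    knn = {
        "field": "embedding",
        "query_vector": query_vector,
        "k": 10,
        "num_candidates": 100,
    }
    if mode is RetrievalMode.KNN_ONLY:
        return {"knn": knn}  # no BM25 clause: embeddings stand alone
    return {"knn": knn, "query": {"match": {"content": query_text}}}
```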

7. Statistical measurement requirements

| Requirement | Specification | Rationale |
|---|---|---|
| Bootstrap stability | N ≥ 5 runs per suite | N=3 leaves only 2 degrees of freedom for variance estimation; N=5 is the minimum to detect bimodal operating modes |
| Reported statistics | Mean (μ) and standard deviation (σ) | Single-run scores cannot be used for decision-making |
| Leakage audit | Token distribution analysis per model | Quantifies how much syntactic correctness derives from pre-training bias vs. AVAP documentation retrieval |
| Syntactic confusion matrix | Parse failures broken down by category and question ID | Identifies which AVAP constructs (startLoop, ormAccess, encodeSHA256, etc.) require additional documentation or few-shot coverage |
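The aggregation requirement is simple to enforce in the reporting layer; a sketch using the standard library, with the N ≥ 5 floor baked in:

```python
import statistics


def summarize_runs(global_scores: list[float]) -> dict:
    """Aggregate one suite's global scores across its repeated runs.

    Rejects suites below the N >= 5 floor required for decision-making.
    """
    if len(global_scores) < 5:
        raise ValueError("Decision-grade suites require N >= 5 runs")
    return {
        "mean": statistics.mean(global_scores),
        "stdev": statistics.stdev(global_scores),  # sample stdev, N-1 dof
        "n": len(global_scores),
    }
```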

Rationale

Why 0.0 override rather than a graduated penalty?

For AVAP, syntactic validity is a binary production gate: code either executes on the PLATON kernel or it does not. A graduated penalty would imply partial credit for non-executable output, which has no operational meaning. The override to 0.0 aligns the metric with the actual production outcome. Raw RAGAS scores are preserved for post-hoc analysis if the policy needs to be revised.

Why abort on parser unavailability rather than degrade?

Silent fallback to RAGAS-only scoring produces evaluation reports that are visually identical to MSVL-validated reports. The purpose of the layer is to prevent false positives. An infrastructure failure that silently removes the gate defeats that purpose entirely. Failing loudly is the only policy consistent with the layer's goal.

Why few-shot injection by similarity rather than full pool injection?

Injecting all 190 examples wholesale would consume the majority of the generation context window, compressing the retrieved documentation that RAGAS evaluates. Similarity-based top-K selection preserves the most relevant examples while protecting retrieval context fidelity. Coverage of rare construct combinations depends on query distribution — this is measurable via the confusion matrix.

Why N ≥ 5 runs?

temperature=0 reduces run-to-run variance but does not eliminate it. Retrieval non-determinism from kNN approximate search and prompt token ordering effects can produce different results at zero temperature. N=3 leaves only 2 degrees of freedom for variance estimation, too few for a stable estimate. N=5 is the minimum that allows detection of a bimodal distribution (two distinct operating modes) with elementary statistical reliability.


Status of prior evaluations

The six evaluation runs from 2026-04-06 documented in this ADR's Context section are classified as non-binding. They may be cited as qualitative evidence of relative model behaviour but cannot be used to select an embedding model for production.

Any evaluation report generated before this ADR's acceptance date that does not include a syntactic_validity score is retroactively classified as non-binding.


Alternatives Considered

| Alternative | Rejected because |
|---|---|
| Post-hoc validation (flag but do not override scores) | Does not prevent false positives from propagating into decision metrics |
| Raise RAGAS threshold to ≥ 0.80 | A model could pass at 0.80 with 10% Go injection; does not address the structural misalignment between semantic scoring and syntactic validity |
| Manual code review per evaluation run | Not reproducible or scalable; reintroduces evaluator subjectivity |
| Fine-tuning with AVAP-only data | Addresses the generation problem but not the measurement problem; MSVL is needed regardless |
| Disable BM25 for all evaluations | Removes a production component defined in ADR-0003; the correct solution is an explicit knn_only mode for embedding comparisons, not removing hybrid retrieval globally |

Consequences

Positive:

  • Eliminates the false positive class definitively demonstrated in this ADR's Context: semantically coherent but syntactically invalid responses will no longer receive ACCEPTABLE verdicts.
  • syntactic_validity becomes a first-class longitudinal metric enabling tracking of DSL fidelity independently of semantic quality.
  • Standardised Docker execution with documented seeds ensures scores are reproducible and comparable across team members and time.
  • The syntactic confusion matrix creates a direct feedback loop into documentation priorities and few-shot pool expansion.

Negative:

  • Evaluation latency increases by one gRPC call per generated response. At the 2-second SLA ceiling for a 50-question dataset, this adds up to approximately 100 seconds per run in the worst case.
  • The AVAP Parser becomes a hard dependency of the evaluation pipeline and must be versioned and kept in sync with the LRM. Parser upgrades may alter score comparability across historical runs.
  • N ≥ 5 runs multiplies evaluation cost (API calls, compute time) approximately 5×.
  • The knn_only retrieval mode and the few-shot retrieval service are engineering work not currently scheduled.

Open Questions

  1. Acceptance threshold for syntactic_validity: This ADR defines how to measure syntactic validity but does not specify the minimum score required for production readiness. A subsequent amendment must define this threshold (e.g., syntactic_validity ≥ 0.95 for CODE_GENERATION questions) before MSVL scores can be used as a hard CI/CD gate.

  2. Parser version pinning policy: When a parser upgrade changes accepted constructs, historical scores become incomparable. A policy for when upgrades require re-running historical evaluations has not been defined.

  3. Few-shot pool adequacy for the confusion matrix tail: Whether 190 examples provide adequate coverage of rare construct combinations visible in the confusion matrix has not been empirically tested.

  4. BM25 contamination remediation for existing results: The knn_only evaluation mode should be scheduled before the next embedding model comparison campaign to produce a clean comparative baseline.