# ADR-0007: Mandatory Syntactic Validation Layer (MSVL) for RAG Evaluation

**Date:** 2026-04-06

**Status:** Proposed

**Deciders:** Rafael Ruiz (CTO), Pablo (AI Team)

**Related ADRs:** ADR-0003 (Hybrid Retrieval RRF), ADR-0004 (Claude as RAGAS Evaluation Judge), ADR-0005 (Embedding Model Selection), ADR-0006 (Reward Algorithm for Dataset Synthesis)

---
## Context

### The evaluation campaign that triggered this ADR

On 2026-04-06, the AI Team ran six evaluation suites using `EvaluateRAG` on the 50-question golden dataset, covering three embedding models across two index configurations each. All six runs returned a verdict of **ACCEPTABLE** from the RAGAS pipeline. The scores are reproduced below:

| Embedding Model | Index | Faithfulness | Answer Relevancy | Context Recall | Context Precision | Global Score |
|---|---|---|---|---|---|---|
| qwen3-0.6B-emb | avap-knowledge-v2-qwen | 0.5329 | 0.8393 | 0.5449 | 0.5843 | **0.6254** |
| qwen3-0.6B-emb | avap-docs-test-v4-qwen | 0.5781 | 0.8472 | 0.6451 | 0.6633 | **0.6834** |
| bge-m3 | avap-knowledge-v2-bge | 0.5431 | 0.8507 | 0.5057 | 0.5689 | **0.6171** |
| bge-m3 | avap-docs-test-v4-bge | 0.5843 | 0.8400 | 0.6067 | 0.6384 | **0.6681** |
| harrier-oss-v1:0.6b | avap-knowledge-v2-harrier | 0.5328 | 0.8424 | 0.4898 | 0.5634 | **0.6071** |
| harrier-oss-v1:0.6b | avap-docs-test-v4-harrier | 0.6829 | 0.8457 | 0.6461 | 0.6688 | **0.7109** |

*Judge model for all runs: `claude-sonnet-4-20250514`. All runs: 50 questions, Docker container on shared EC2.*
### Why these scores are not valid for architectural decisions

Manual inspection of the `answer_preview` fields reveals a systematic pattern that invalidates all six verdicts: **models are generating syntactically invalid AVAP code while receiving ACCEPTABLE scores from the LLM judge.**

The root cause is architectural. The RAGAS judge (Claude Sonnet) evaluates *semantic coherence* — whether the answer is logically consistent with the retrieved context. It does not evaluate *syntactic validity* — whether the generated code would execute on the PLATON kernel. For a proprietary DSL like AVAP, these two properties are independent. A response can score high on faithfulness while containing complete Go syntax.

**Forensic analysis of the six evaluation traces** identifies three distinct failure modes.
#### Failure Mode 1 — Foreign language injection

Models produce complete Go, Python, or JavaScript syntax inside code blocks labelled `avap`. These responses are not AVAP and would fail at parse time.
| Entry | Model / Index | Language injected | Evidence |
|---|---|---|---|
| GD-V-009 | harrier / avap-knowledge-v2 | **Go** | `package main`, `import "fmt"`, `func main()` inside an `avap` block |
| GD-V-009 | qwen3 / avap-knowledge-v2 | **Go** | `package main`, `import (..."fmt"...)` |
| GD-C-003 | harrier / avap-knowledge-v2 | **Python** | `for i in range(1, 6):` with Python dict literal |
| GD-C-003 | bge-m3 / avap-knowledge-v2 | **Python** | `for i in range(1, 6):` with `# Build the JSON object` comment |
| GD-C-004 | bge-m3 / avap-knowledge-v2 | **JavaScript** | `let allowedRoles = ["admin", ...]`, `.includes(rol)` |
| GD-V-007 | qwen3 / avap-docs-test-v4 | **JS / PHP / Python** | `foreach(item in items)`, Python `print()` |
GD-V-009 is the most critical case. The question asks about AVAP goroutine scope. The model answers with a complete Go program. Claude Sonnet scored this ACCEPTABLE because the prose surrounding the code is semantically consistent with the retrieved context — the code block itself is never validated.
#### Failure Mode 2 — Hallucinated AVAP commands

Models invent command names that do not exist in the AVAP grammar. These are not foreign languages — they appear syntactically plausible — but would fail at the parser's symbol resolution stage.
| Invented command | Observed in | Real AVAP equivalent |
|---|---|---|
| `getSHA256(x)` | qwen3 | `encodeSHA256(origen, destino)` |
| `generateSHA256Hash(x)` | bge-m3, harrier | `encodeSHA256(origen, destino)` |
| `readParam("x")` | qwen3, bge-m3 | `addParam("x", destino)` |
| `ifParam("x", dest)` | qwen3 | `addParam("x", dest)` + `if(...)` |
| `returnResult(x)` | bge-m3 | `addResult(x)` |
| `getTimeStamp(...)` | qwen3 | `getDateTime(...)` |
| `except(e)` | qwen3 | `exception(e)` |
| `getListParamList(...)` | harrier | Does not exist |
| `variableFromJSON(...)` | harrier | Does not exist |
| `confirmPassword(...)` | bge-m3 | Does not exist |
| `httpGet(...)` | bge-m3 | `RequestGet(...)` |
#### Failure Mode 3 — Structural foreign syntax

Beyond identifiable code blocks, some responses embed structural constructs that are not part of the AVAP grammar: curly-brace function bodies, `while` loops, `let`/`var` declarations, `for`/`foreach` statements. These appear in entries where no foreign language is explicitly named.
#### Summary by model and index
| Model | Index | Foreign syntax (entries) | Hallucinated cmds (entries) | Estimated invalid / 50 |
|---|---|---|---|---|
| qwen3-0.6B-emb | avap-knowledge-v2 | 3 | 2 | ~5 (10%) |
| qwen3-0.6B-emb | avap-docs-test-v4 | 3 | 3 | ~6 (12%) |
| bge-m3 | avap-knowledge-v2 | 6 | 3 | ~8 (16%) |
| bge-m3 | avap-docs-test-v4 | 5 | 1 | ~6 (12%) |
| harrier-oss-v1:0.6b | avap-knowledge-v2 | 2 | 3 | ~5 (10%) |
| harrier-oss-v1:0.6b | avap-docs-test-v4 | 1 | 0 | ~1 (2%) |

*Counts are conservative lower bounds: `answer_preview` fields are truncated at ~300 characters. Full response bodies may contain additional violations not visible in the preview.*
### Relative ordering within this campaign

The data supports a *relative* — not absolute — ordering. **harrier / avap-docs-test-v4** shows the fewest syntactic violations and the highest global score (0.7109). It is the least-bad model in this run. However, this does not make it production-ready: a model that generates correct AVAP in 98% of responses can still fail a user on a critical code-generation query.

**bge-m3** failures are predominantly well-known foreign syntaxes (Python, JavaScript), which makes them identifiable without a parser. **qwen3** introduces invented commands that look like valid AVAP idioms (`ifParam`, `getSHA256`, `getTimeStamp`) — these are harder to detect precisely because they are superficially plausible.

The CTO's conclusion: no model can be selected or rejected based on these six runs. The measurement instrument is not fit for purpose.
### Evaluation environment issues identified in this campaign

Three additional issues compromise reproducibility independently of model quality:

**Mixed execution environments.** Some team members ran `run_evaluation` from local notebooks. Notebook runs do not record temperature or random seeds, making score reproduction impossible across machines and Python environments.

**Undocumented index re-creation.** Bugs were discovered in the existing indices, which were re-indexed with corrected pipelines (`avap_ingestor.py` for `avap-knowledge-v2-*`, `elasticsearch_ingestion.py` for `avap-docs-test-v4-*`). The pre-processing delta between old and new indices was not documented before the evaluation was run, making it impossible to determine whether score differences reflect model quality or index quality.

**BM25 contamination in embedding comparisons.** The pipeline uses Hybrid Retrieval (BM25 + kNN, per ADR-0003). When the goal is to compare embedding models, BM25 acts as a confounding variable: a weaker embedding model can compensate with BM25 recall, masking the true quality differential. Evaluations intended to select an embedding model require a kNN-only retrieval mode that does not exist yet.
### The few-shot gap

The 190 validated AVAP examples from ADR-0006 are not currently injected into the generation prompt. The syntactic failure rates above — roughly 2% to 16% of responses per run — are consistent with a model that has no valid AVAP examples in its prompt context and falls back on pre-training distributions. This is the expected behaviour of a base LLM encountering an unfamiliar DSL without few-shot grounding.
---

## Decision

Establish the **Mandatory Syntactic Validation Layer (MSVL)** as a non-optional prerequisite gate in the `EvaluateRAG` pipeline. Any evaluation score produced without MSVL is classified as non-binding and cannot be cited in architectural decisions.
### 1. Parser integration in `EvaluateRAG`

Every code block in a generated response must be submitted to the AVAP Parser via gRPC before RAGAS scoring. The parser returns a binary result: `VALID`, or `INVALID` with a failure category (`unknown_token`, `unexpected_construct`, `foreign_keyword`, `syntax_error`).
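As a sketch of this gate, the block extraction and the parser contract can be mocked locally. Everything below is illustrative: the real verdict comes from the AVAP Parser gRPC service, and the keyword heuristic is only a stand-in for it.

```python
import re

# Local sketch of the gate: extract fenced `avap` blocks and run a mock
# validator. The keyword heuristic below is purely illustrative; the real
# check is the AVAP Parser gRPC call.
FOREIGN_KEYWORDS = {"package", "import", "def", "func", "let", "const", "foreach"}

def extract_code_blocks(answer: str) -> list:
    """Pull every fenced block labelled `avap` out of a generated answer."""
    return re.findall(r"```avap\n(.*?)```", answer, flags=re.DOTALL)

def validate_block(block: str):
    """Mock parser call: ('VALID', None) or ('INVALID', failure_category)."""
    for token in re.findall(r"[A-Za-z_]+", block):
        if token in FOREIGN_KEYWORDS:
            return ("INVALID", "foreign_keyword")
    return ("VALID", None)

answer = (
    "Use this:\n```avap\naddParam(\"user\", dest)\n```\n"
    "Not this:\n```avap\npackage main\n```"
)
results = [validate_block(b) for b in extract_code_blocks(answer)]
print(results)  # [('VALID', None), ('INVALID', 'foreign_keyword')]
```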
### 2. `syntactic_validity` as an independent metric

Introduce `syntactic_validity` (float 0.0–1.0): the fraction of code-bearing responses that pass parser validation within a run. This metric is reported alongside RAGAS scores, not as a replacement.

For entries that fail parser validation, `faithfulness` and `answer_relevancy` are **overridden to 0.0** regardless of the LLM judge's qualitative assessment. The raw RAGAS scores are preserved in the evaluation record for audit.
```
final_faithfulness(entry) =
    0.0                        if parser(entry) = INVALID
    ragas_faithfulness(entry)  otherwise

final_answer_relevancy(entry) =
    0.0                            if parser(entry) = INVALID
    ragas_answer_relevancy(entry)  otherwise
```
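The override and the `syntactic_validity` fraction reduce to a few lines. This is a sketch against an assumed per-entry record shape; the field names are illustrative, not the actual `EvaluateRAG` schema.

```python
# Sketch of the Section 2 override: raw RAGAS scores stay in the record,
# only the final metrics are zeroed for parser-invalid entries.
def apply_msvl(entries: list) -> dict:
    for e in entries:
        invalid = e["parser_verdict"] == "INVALID"
        e["final_faithfulness"] = 0.0 if invalid else e["ragas_faithfulness"]
        e["final_answer_relevancy"] = 0.0 if invalid else e["ragas_answer_relevancy"]
    # syntactic_validity: fraction of code-bearing responses that pass.
    code_bearing = [e for e in entries if e["has_code"]]
    valid = sum(e["parser_verdict"] == "VALID" for e in code_bearing)
    return {"syntactic_validity": valid / len(code_bearing)}

entries = [
    {"has_code": True, "parser_verdict": "VALID",
     "ragas_faithfulness": 0.81, "ragas_answer_relevancy": 0.90},
    {"has_code": True, "parser_verdict": "INVALID",
     "ragas_faithfulness": 0.77, "ragas_answer_relevancy": 0.88},
]
print(apply_msvl(entries))  # {'syntactic_validity': 0.5}
```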
### 3. Parser SLA and fallback policy

The AVAP Parser gRPC service must respond within 2 seconds per call. If the parser is unreachable or times out, the evaluation run is **aborted** with an explicit error. Silent fallback to RAGAS-only scoring is prohibited.
```python
if parser_status == UNAVAILABLE:
    raise EvaluationAbortedError(
        "AVAP Parser unreachable. MSVL cannot be bypassed."
    )
```
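A minimal sketch of the abort policy, assuming a client object with a `parse(code, timeout)` method (hypothetical; the real service is gRPC and would use its own deadline mechanism):

```python
# Abort-on-unavailability sketch. `DeadParser` simulates an unreachable
# service; in production the failure would surface as a gRPC deadline or
# connectivity error.
class EvaluationAbortedError(RuntimeError):
    pass

def validate_or_abort(client, code: str, timeout_s: float = 2.0):
    try:
        return client.parse(code, timeout=timeout_s)
    except (TimeoutError, ConnectionError) as exc:
        # No silent fallback: an unreachable parser kills the whole run.
        raise EvaluationAbortedError(
            "AVAP Parser unreachable. MSVL cannot be bypassed."
        ) from exc

class DeadParser:  # stand-in for an unreachable service
    def parse(self, code, timeout):
        raise ConnectionError("connection refused")

try:
    validate_or_abort(DeadParser(), "addResult(x)")
except EvaluationAbortedError as e:
    print(e)  # AVAP Parser unreachable. MSVL cannot be bypassed.
```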
### 4. Standardised evaluation protocol

Local notebook environments are **prohibited** for official evaluation reports. All evaluations cited in architectural decisions must be executed within the `EvaluateRAG` Docker container in Staging with:

- Fixed random seeds via `EVAL_SEED` environment variable
- `temperature=0` for all generation calls
- `ANTHROPIC_MODEL` pinned to a specific version string, not `latest`
- Index version and the exact ingestion pipeline used documented in the evaluation record *before* the run starts
### 5. Few-shot context injection

The 190 validated AVAP examples from ADR-0006 must be injected as few-shot context into the generation prompt. Injection protocol:

- Examples are selected by **semantic similarity** to the current query (top-K retrieval from the validated pool), not injected wholesale
- K defaults to 5; the effective K per run is logged in the evaluation record
- If the few-shot retrieval service is unavailable, the run proceeds without injection and this is flagged as `few_shot_injection: degraded` in the report

This directly targets Failure Mode 1: a model that has seen 5 valid AVAP examples before generating code is substantially less likely to default to Go or Python syntax.
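The selection rule can be sketched with plain cosine similarity. The vectors and pool contents below are toy stand-ins; the real pipeline would embed query and examples with the embedding model under evaluation.

```python
import math

# Top-K few-shot selection by cosine similarity over a pre-embedded pool.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def select_few_shot(query_vec, pool, k=5):
    """pool: list of (example_id, vector). Returns the K most similar ids."""
    ranked = sorted(pool, key=lambda p: cosine(query_vec, p[1]), reverse=True)
    return [ex_id for ex_id, _ in ranked[:k]]

pool = [
    ("loop_json", [1.0, 0.1]),
    ("orm_check", [0.0, 1.0]),
    ("http_get", [0.9, 0.3]),
]
print(select_few_shot([1.0, 0.0], pool, k=2))  # ['loop_json', 'http_get']
```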
### 6. Embedding-only evaluation mode

A separate `knn_only` retrieval mode must be implemented in `EvaluateRAG` for evaluations whose explicit purpose is embedding model comparison. This mode disables BM25 and uses only kNN retrieval. Results from this mode are tagged `retrieval_mode: knn_only` and are not comparable with standard hybrid retrieval scores. This mode must be used for any future embedding model selection decision.
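For illustration, the two modes differ only in whether a BM25 `match` clause accompanies the kNN clause. The request bodies below assume the Elasticsearch kNN search API shape, with placeholder field names that are not the actual index mappings.

```python
# Illustrative retrieval request bodies for the two modes. Field names
# ("embedding", "content") and k / num_candidates values are assumptions.
def build_query(query_text, query_vector, mode="hybrid"):
    knn = {
        "field": "embedding",
        "query_vector": query_vector,
        "k": 10,
        "num_candidates": 100,
    }
    if mode == "knn_only":
        return {"knn": knn}  # BM25 disabled: pure embedding comparison
    # hybrid: BM25 text match alongside kNN (fused downstream, per ADR-0003)
    return {"query": {"match": {"content": query_text}}, "knn": knn}

q = build_query("goroutine scope in AVAP", [0.1, 0.2], mode="knn_only")
print("query" in q, "knn" in q)  # False True
```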
### 7. Statistical measurement requirements

| Requirement | Specification | Rationale |
|---|---|---|
| **Bootstrap stability** | N ≥ 5 runs per suite | N=3 leaves only 2 degrees of freedom for variance estimation; N=5 is the minimum to detect bimodal operating modes |
| **Reported statistics** | Mean (μ) and standard deviation (σ) | Single-run scores cannot be used for decision-making |
| **Leakage audit** | Token distribution analysis per model | Quantifies how much syntactic correctness derives from pre-training bias vs. AVAP documentation retrieval |
| **Syntactic confusion matrix** | Parse failures broken down by category and question ID | Identifies which AVAP constructs (`startLoop`, `ormAccess`, `encodeSHA256`, etc.) require additional documentation or few-shot coverage |
---

## Rationale

### Why 0.0 override rather than a graduated penalty?

For AVAP, syntactic validity is a binary production gate: code either executes on the PLATON kernel or it does not. A graduated penalty would imply partial credit for non-executable output, which has no operational meaning. The override to 0.0 aligns the metric with the actual production outcome. Raw RAGAS scores are preserved for post-hoc analysis if the policy needs to be revised.

### Why abort on parser unavailability rather than degrade?

Silent fallback to RAGAS-only scoring produces evaluation reports that are visually identical to MSVL-validated reports. The purpose of the layer is to prevent false positives. An infrastructure failure that silently removes the gate defeats that purpose entirely. Failing loudly is the only policy consistent with the layer's goal.

### Why few-shot injection by similarity rather than full pool injection?

Injecting all 190 examples wholesale would consume the majority of the generation context window, compressing the retrieved documentation that RAGAS evaluates. Similarity-based top-K selection preserves the most relevant examples while protecting retrieval context fidelity. Coverage of rare construct combinations depends on query distribution — this is measurable via the confusion matrix.

### Why N ≥ 5 runs?

`temperature=0` reduces run-to-run variance but does not eliminate it. Retrieval non-determinism from kNN approximate search and prompt token ordering effects can produce different results at zero temperature. N=3 leaves only 2 degrees of freedom for variance estimation. N=5 is the minimum that allows detection of a bimodal distribution (two distinct operating modes) with elementary statistical reliability.
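The reporting rule itself reduces to elementary statistics over the per-run global scores; a sketch with illustrative values:

```python
import statistics

# Per-suite reporting for N >= 5 runs: mean and sample standard deviation,
# never a single-run score. The values below are illustrative.
run_scores = [0.7109, 0.7042, 0.7187, 0.7095, 0.7130]  # global score, 5 runs

mu = statistics.mean(run_scores)
sigma = statistics.stdev(run_scores)  # sample std dev: N-1 = 4 degrees of freedom
print(f"global score: mu={mu:.4f}, sigma={sigma:.4f}")
```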
---

## Status of prior evaluations

The six evaluation runs from 2026-04-06 documented in this ADR's Context section are **classified as non-binding**. They may be cited as qualitative evidence of relative model behaviour but cannot be used to select an embedding model for production.

Any evaluation report generated before this ADR's acceptance date that does not include a `syntactic_validity` score is retroactively classified as non-binding.

---
## Alternatives Considered

| Alternative | Rejected because |
|---|---|
| **Post-hoc validation** (flag but do not override scores) | Does not prevent false positives from propagating into decision metrics |
| **Raise RAGAS threshold to ≥ 0.80** | A model could pass at 0.80 with 10% Go injection; does not address the structural misalignment between semantic scoring and syntactic validity |
| **Manual code review per evaluation run** | Not reproducible or scalable; reintroduces evaluator subjectivity |
| **Fine-tuning with AVAP-only data** | Addresses the generation problem but not the measurement problem; MSVL is needed regardless |
| **Disable BM25 for all evaluations** | Removes a production component defined in ADR-0003; the correct solution is an explicit `knn_only` mode for embedding comparisons, not removing hybrid retrieval globally |

---
## Consequences

**Positive:**

- Eliminates the false positive class definitively demonstrated in this ADR's Context: semantically coherent but syntactically invalid responses will no longer receive ACCEPTABLE verdicts.
- `syntactic_validity` becomes a first-class longitudinal metric enabling tracking of DSL fidelity independently of semantic quality.
- Standardised Docker execution with documented seeds ensures scores are reproducible and comparable across team members and time.
- The syntactic confusion matrix creates a direct feedback loop into documentation priorities and few-shot pool expansion.

**Negative:**

- Evaluation latency increases by one gRPC call per generated response. At the 2-second SLA, a 50-question run takes up to approximately 100 seconds longer in the worst case.
- The AVAP Parser becomes a hard dependency of the evaluation pipeline and must be versioned and kept in sync with the LRM. Parser upgrades may alter score comparability across historical runs.
- N ≥ 5 runs multiplies evaluation cost (API calls, compute time) approximately 5×.
- The `knn_only` retrieval mode and the few-shot retrieval service are engineering work not currently scheduled.

---
## Open Questions

1. **Acceptance threshold for `syntactic_validity`:** This ADR defines how to measure syntactic validity but does not specify the minimum score required for production readiness. A subsequent amendment must define this threshold (e.g., `syntactic_validity ≥ 0.95` for `CODE_GENERATION` questions) before MSVL scores can be used as a hard CI/CD gate.

2. **Parser version pinning policy:** When a parser upgrade changes accepted constructs, historical scores become incomparable. A policy for when upgrades require re-running historical evaluations has not been defined.

3. **Few-shot pool adequacy for the confusion matrix tail:** Whether 190 examples provide adequate coverage of rare construct combinations visible in the confusion matrix has not been empirically tested.

4. **BM25 contamination remediation for existing results:** The `knn_only` evaluation mode should be scheduled before the next embedding model comparison campaign to produce a clean comparative baseline.
# RP-0001: Pre-Implementation Validation for ADR-0007 (MSVL)

**Date:** 2026-04-06

**Status:** Proposed

**Author:** Rafael Ruiz (CTO)

**Executed by:** AI Team (Pablo)

**Related ADR:** ADR-0007 (Mandatory Syntactic Validation Layer)

**Input data:** 6 evaluation runs from 2026-04-06 (`evaluation_*.json`)

---
## Purpose

ADR-0007 defines four implementation decisions: parser gRPC integration, dynamic few-shot injection, the N=5 Docker protocol, and the syntactic confusion matrix. Before assigning engineering work to any of these, two questions must be answered empirically with data the team already has:

1. **Is the syntactic failure rate structurally predictable?** If failures concentrate in specific question categories or construct types, the few-shot pool and documentation effort can be targeted. If failures are random, the problem is model capability and few-shot injection may not be sufficient.

2. **Does few-shot injection reduce foreign syntax injection before we build the parser gate?** The few-shot change is cheap to test manually. If it eliminates the majority of violations, the urgency profile of the remaining ADR-0007 decisions changes.

These two experiments must be completed and their results reviewed before any ADR-0007 implementation work begins.

---
## Experiment 1 — Syntactic Failure Taxonomy

### Hypothesis

Syntactic violations are not uniformly distributed across question categories. `CODE_GENERATION` questions will show significantly higher violation rates than `RETRIEVAL` or `CONVERSATIONAL` questions, because code generation requires the model to produce executable AVAP syntax rather than describe it. Within `CODE_GENERATION`, violations will concentrate in construct combinations not covered by the current few-shot pool (loops with JSON building, ORM with error handling, multi-function scripts).

### What we already know from the 2026-04-06 runs

The forensic analysis in ADR-0007 identified the following confirmed violations across 6 runs (300 entries total):

| Failure type | Confirmed cases | Dominant model |
|---|---|---|
| Foreign language injection (Go, Python, JS) | 20 | bge-m3, qwen3 |
| Hallucinated AVAP commands | 12 | qwen3, harrier |
| Structural foreign syntax (while, let, var, for) | 8 | bge-m3, qwen3 |

**Critical limitation:** `answer_preview` fields are truncated at ~300 characters. The full response bodies are not available in the JSON files. The counts above are lower bounds.
### Method

**Step 1 — Full response body access.** Re-run `EvaluateRAG` for the two best-performing configurations (harrier/avap-docs-test-v4 and qwen3/avap-docs-test-v4) with a modified `evaluate.py` that logs the complete `answer` field, not just `answer_preview`. This is a one-line change. Run once, not N=5 — this is exploratory, not a benchmarking run.
**Step 2 — Manual taxonomy on CODE_GENERATION entries.** For all 20 `GD-C-*` entries per run, classify each response into one of four categories:

| Category | Definition |
|---|---|
| `VALID_AVAP` | All constructs present are valid AVAP. May be incomplete but syntactically correct. |
| `FOREIGN_SYNTAX` | Contains identifiable syntax from Go, Python, or JavaScript. |
| `INVENTED_COMMAND` | Uses a command name that does not exist in the LRM. |
| `STRUCTURAL_ERROR` | Grammatically wrong AVAP (wrong argument count, missing `end()`, wrong block nesting). |

One reviewer, consistent criteria. The goal is not perfect precision — it is identifying whether `CODE_GENERATION` failures are concentrated in specific construct types.
**Step 3 — Construct-level breakdown.** For every `FOREIGN_SYNTAX` or `INVENTED_COMMAND` entry in `CODE_GENERATION`, record which AVAP construct the question required and which the model failed on. Use the question text to infer the target construct:

| Question | Target construct |
|---|---|
| GD-C-003 (loop + JSON) | `startLoop` + `AddVariableToJSON` |
| GD-C-005 (GET + error handling) | `RequestGet` + `try/exception` |
| GD-C-011 (ORM table check) | `ormCheckTable` + `ormCreateTable` |
| GD-C-014 (list length) | `getListLen` + `itemFromList` |
| ... | ... |
**Step 4 — Cross-model comparison.** Compare the taxonomy distributions between harrier and qwen3 on the same index. If one model shows a qualitatively different failure profile (e.g., harrier fails on ORM, qwen3 fails on loops), the few-shot pool composition matters more than the pool size.
### Success criteria

The experiment is conclusive if it produces one of these two findings:

**Finding A (concentrated failures):** ≥ 70% of `CODE_GENERATION` violations occur in ≤ 5 distinct construct combinations. This means the few-shot pool can be targeted and the ADR-0007 few-shot injection decision is high-leverage.

**Finding B (distributed failures):** Violations are spread across ≥ 10 distinct construct combinations with no clear concentration. This means the model lacks general AVAP grammar coverage and few-shot injection alone will be insufficient — the parser gate becomes the primary defence, not a secondary one.
### Output

A one-page table: construct × model × failure type × count. This table becomes the first version of the syntactic confusion matrix specified in ADR-0007 Section 7, produced without any infrastructure changes.
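The output table is a straightforward tally over the manually labelled entries. A sketch, with illustrative records standing in for the real labels:

```python
from collections import Counter

# Build the construct x model x failure-type count table from labelled
# entries. The records below are illustrative, not real labels.
labelled = [
    {"construct": "startLoop+AddVariableToJSON", "model": "qwen3",
     "failure": "FOREIGN_SYNTAX"},
    {"construct": "startLoop+AddVariableToJSON", "model": "harrier",
     "failure": "INVENTED_COMMAND"},
    {"construct": "RequestGet+try/exception", "model": "qwen3",
     "failure": "FOREIGN_SYNTAX"},
]
matrix = Counter(
    (e["construct"], e["model"], e["failure"]) for e in labelled
)
for (construct, model, failure), n in sorted(matrix.items()):
    print(f"{construct:<30} {model:<8} {failure:<17} {n}")
```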
### Estimated effort

2–3 hours. One `evaluate.py` modification (log full answer), two evaluation runs (no N=5, no seeds required — exploratory), one manual taxonomy pass on ~40 entries.
---

## Experiment 2 — Few-Shot Injection A/B

### Hypothesis

Injecting 5 semantically similar AVAP examples from the 190-example pool into the generation prompt will reduce foreign syntax injection in `CODE_GENERATION` entries by ≥ 60% compared to the current baseline (no few-shot context). The reduction will be measurable manually without a parser gate, because the most severe violations (complete Go programs, Python `for` loops) are visually identifiable.
### Dependency

Experiment 2 should be run **after** Experiment 1. The construct-level breakdown from Experiment 1 informs which few-shot examples to select: if GD-C-003 (loop + JSON) fails consistently, the few-shot examples injected for that query should include `bucle_1_10.avap` and `construccion_dinamica_de_objeto.avap` from the LRM pool, not generic examples.
### Method

**Step 1 — Build the few-shot retrieval function.** Using `src/utils/emb_factory.py` (already exists), embed the 190 examples from `docs/LRM/*.avap`. For each query in the golden dataset, retrieve the top-5 most similar examples by cosine similarity. Log which examples are selected per query.
**Step 2 — Modify the generation prompt.** Add a few-shot block before the user query in `prompts.py`. Format:

```
The following are valid AVAP code examples. Use them as syntactic reference.

--- Example 1 ---
{few_shot_example_1}

--- Example 2 ---
{few_shot_example_2}

[...up to 5]

Now answer the following question using only valid AVAP syntax:
{query}
```
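Assembling that prompt is mechanical; a sketch (the function name and call shape are illustrative, not the actual `prompts.py` API):

```python
# Build the few-shot block in the format above and append the query.
def build_few_shot_prompt(query: str, examples: list) -> str:
    blocks = "\n\n".join(
        f"--- Example {i} ---\n{ex}" for i, ex in enumerate(examples, start=1)
    )
    return (
        "The following are valid AVAP code examples. "
        "Use them as syntactic reference.\n\n"
        f"{blocks}\n\n"
        "Now answer the following question using only valid AVAP syntax:\n"
        f"{query}"
    )

prompt = build_few_shot_prompt(
    "Build a 1-10 loop",
    ["startLoop(i, 1, 10)", "addResult(i)"],
)
print(prompt)
```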
**Step 3 — Run one evaluation pass** on the same two configurations as Experiment 1 (harrier/avap-docs-test-v4 and qwen3/avap-docs-test-v4). Log full response bodies. This is still exploratory — no N=5, no seeds.

**Step 4 — Manual comparison.** Apply the same taxonomy from Experiment 1 to the new responses. Count `FOREIGN_SYNTAX` and `INVENTED_COMMAND` entries before and after few-shot injection.

**Step 5 — RAGAS delta.** Compare global scores between baseline and few-shot runs. A few-shot injection that reduces syntactic violations but also reduces RAGAS scores significantly would indicate that the few-shot context is consuming context window at the expense of retrieval quality — this informs the K parameter decision in ADR-0007.
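The Step 4/5 bookkeeping can be sketched directly against the thresholds this protocol defines (≥ 60% violation reduction, > 0.05 RAGAS drop); the counts and scores below are illustrative:

```python
# Interpret a baseline-vs-few-shot comparison against the protocol thresholds.
def interpret(baseline_violations, fewshot_violations,
              baseline_ragas, fewshot_ragas):
    reduction = 1 - fewshot_violations / baseline_violations
    notes = []
    if reduction >= 0.60:
        notes.append("few-shot is high-leverage")
    elif reduction >= 0.20:
        notes.append("few-shot helps but is insufficient alone")
    else:
        notes.append("parser gate is the primary defence")
    if baseline_ragas - fewshot_ragas > 0.05:
        notes.append("context window competition: reduce K")
    return reduction, notes

# 10 -> 3 violations is a 70% reduction; the RAGAS drop (0.012) is below 0.05.
print(interpret(10, 3, 0.7109, 0.6989))
```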
### Success criteria

| Result | Interpretation | Implication for ADR-0007 |
|---|---|---|
| Foreign syntax violations drop ≥ 60% | Few-shot injection is high-leverage | Prioritise few-shot implementation before parser gate |
| Foreign syntax violations drop 20–60% | Few-shot helps but is insufficient alone | Implement both in parallel |
| Foreign syntax violations drop < 20% | Model lacks AVAP grammar at a fundamental level | Parser gate is the primary defence; few-shot pool needs expansion or the base model needs replacement |
| RAGAS global score drops > 0.05 | Context window competition is real | Reduce K or implement dynamic context window management |
### Output

A 2×2 grid (model × condition: baseline / few-shot), with the violation count and RAGAS global score reported in each cell. Plus the few-shot retrieval log showing which examples were selected for which queries — this is the raw input for pool quality analysis.

### Estimated effort

4–6 hours. Embedding the 190 examples + retrieval function (~2h), prompt modification (~30min), two evaluation runs (~2h), manual taxonomy pass (~1h).
---

## Decision gate

Both experiments feed a **go/no-go decision** for ADR-0007 implementation:

| Scenario | Decision |
|---|---|
| Exp 1: concentrated failures + Exp 2: ≥ 60% reduction | Implement few-shot first, parser gate second. The few-shot pool composition (informed by the confusion matrix) is the highest-leverage action. |
| Exp 1: concentrated failures + Exp 2: < 60% reduction | Implement parser gate and few-shot in parallel. The concentrated failure profile informs pool expansion. |
| Exp 1: distributed failures + Exp 2: any result | Parser gate is the primary defence. Few-shot injection is a secondary measure. The base model may need re-evaluation. |
| Both experiments inconclusive | Run Experiment 1 with full response bodies and a second annotator before proceeding. |

This decision gate replaces the need for an architectural meeting to assign priorities — the data makes the priority order self-evident.
---

## What this protocol does not answer

- Whether the AVAP Parser gRPC service can handle the throughput of N=5 evaluation runs (50 queries × 5 runs = 250 parser calls). That requires a load test on the parser service, not an evaluation run.
- Whether 190 examples are sufficient to cover the confusion matrix tail. That requires the confusion matrix from Experiment 1 to exist first.
- The minimum `syntactic_validity` threshold for production readiness. That requires at least one MSVL-validated run with known-good and known-bad models to calibrate.

These three questions are explicitly deferred to the post-implementation phase of ADR-0007.
---

## Timeline

| Step | Owner | Estimated duration |
|---|---|---|
| Experiment 1: `evaluate.py` modification + 2 runs | Pablo (AI Team) | 1 day |
| Experiment 1: manual taxonomy + confusion matrix draft | Pablo (AI Team) | 1 day |
| Experiment 2: few-shot retrieval function + prompt modification | Pablo (AI Team) | 1 day |
| Experiment 2: 2 runs + manual comparison | Pablo (AI Team) | 1 day |
| Results review and go/no-go decision | Rafael Ruiz (CTO) + Pablo | 1 meeting |

**Total: 4 working days before any infrastructure change from ADR-0007 is scheduled.**