# RP-0001: Pre-Implementation Validation for ADR-0007 (MSVL)

**Date:** 2026-04-06
**Status:** Proposed
**Author:** Rafael Ruiz (CTO)
**Executed by:** AI Team (Pablo)
**Related ADR:** ADR-0007 (Mandatory Syntactic Validation Layer)
**Input data:** 6 evaluation runs from 2026-04-06 (`evaluation_*.json`)

---

## Purpose

ADR-0007 defines four implementation decisions: parser gRPC integration, dynamic few-shot injection, N=5 Docker protocol, and syntactic confusion matrix. Before assigning engineering work to any of these, two questions must be answered empirically with data the team already has:

1. **Is the syntactic failure rate structurally predictable?** If failures concentrate in specific question categories or construct types, the few-shot pool and documentation effort can be targeted. If failures are random, the problem is model capability and few-shot injection may not be sufficient.

2. **Does few-shot injection reduce foreign syntax injection before we build the parser gate?** The few-shot change is cheap to test manually. If it eliminates the majority of violations, the urgency profile of the remaining ADR-0007 decisions changes.

These two experiments must be completed and their results reviewed before any ADR-0007 implementation work begins.

---

## Experiment 1 — Syntactic Failure Taxonomy

### Hypothesis

Syntactic violations are not uniformly distributed across question categories. `CODE_GENERATION` questions will show significantly higher violation rates than `RETRIEVAL` or `CONVERSATIONAL` questions, because code generation requires the model to produce executable AVAP syntax rather than describe it. Within `CODE_GENERATION`, violations will concentrate in construct combinations not covered by the current few-shot pool (loops with JSON building, ORM with error handling, multi-function scripts).

### What we already know from the 2026-04-06 runs

The forensic analysis in ADR-0007 identified the following confirmed violations across 6 runs (300 entries total):

| Failure type | Confirmed cases | Dominant model |
|---|---|---|
| Foreign language injection (Go, Python, JS) | 20 | bge-m3, qwen3 |
| Hallucinated AVAP commands | 12 | qwen3, harrier |
| Structural foreign syntax (while, let, var, for) | 8 | bge-m3, qwen3 |

**Critical limitation:** `answer_preview` fields are truncated at ~300 characters. The full response bodies are not available in the JSON files. The counts above are lower bounds.

### Method

**Step 1 — Full response body access.** Re-run `EvaluateRAG` for the two best-performing configurations (harrier/avap-docs-test-v4 and qwen3/avap-docs-test-v4) with a modified `evaluate.py` that logs the complete `answer` field, not just `answer_preview`. This is a one-line change. Run once, not N=5 — this is exploratory, not a benchmarking run.
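The `evaluate.py` change might look like the following sketch. The record-building function and its surrounding code are assumptions; only the `answer` and `answer_preview` field names come from the evaluation JSON files:

```python
# Hypothetical sketch of the evaluate.py change: keep the truncated
# preview for readability, but also persist the complete answer so the
# taxonomy pass can inspect full response bodies.
def build_entry(question_id: str, answer: str) -> dict:
    return {
        "question_id": question_id,
        "answer_preview": answer[:300],  # existing truncated field
        "answer": answer,                # new: full response body
    }
```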

**Step 2 — Manual taxonomy on CODE_GENERATION entries.** For all 20 `GD-C-*` entries per run, classify each response into one of four categories:

| Category | Definition |
|---|---|
| `VALID_AVAP` | All constructs present are valid AVAP. May be incomplete but syntactically correct. |
| `FOREIGN_SYNTAX` | Contains identifiable syntax from Go, Python, or JavaScript. |
| `INVENTED_COMMAND` | Uses a command name that does not exist in the LRM. |
| `STRUCTURAL_ERROR` | Grammatically wrong AVAP (wrong argument count, missing `end()`, wrong block nesting). |

One reviewer, consistent criteria. The goal is not perfect precision — it is identifying whether `CODE_GENERATION` failures are concentrated in specific construct types.
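To speed up that single-reviewer pass, a rough pre-sort can surface the clearest `FOREIGN_SYNTAX` candidates first. A minimal sketch; the patterns below are illustrative examples of the violations named in the table above, not a complete grammar check:

```python
import re

# Illustrative pre-sort for the manual taxonomy pass: flag responses that
# contain obvious foreign-language constructs so the reviewer sees the
# strongest FOREIGN_SYNTAX candidates first.
FOREIGN_PATTERNS = [
    r"\bfunc\s+\w+\s*\(",            # Go function definition
    r"\bpackage\s+main\b",           # Go program header
    r"\bdef\s+\w+\s*\(",             # Python function definition
    r"\bfor\s+\w+\s+in\s+",          # Python for-in loop
    r"\b(let|var|const)\s+\w+\s*=",  # JavaScript declarations
    r"\bwhile\s*\(",                 # while loop (not valid AVAP)
]

def flag_foreign_syntax(answer: str) -> list[str]:
    """Return the foreign-syntax patterns found in a response body."""
    return [p for p in FOREIGN_PATTERNS if re.search(p, answer)]
```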

**Step 3 — Construct-level breakdown.** For every `FOREIGN_SYNTAX` or `INVENTED_COMMAND` entry in `CODE_GENERATION`, record which AVAP construct the question required and which the model failed on. Use the question text to infer the target construct:

| Question | Target construct |
|---|---|
| GD-C-003 (loop + JSON) | `startLoop` + `AddVariableToJSON` |
| GD-C-005 (GET + error handling) | `RequestGet` + `try/exception` |
| GD-C-011 (ORM table check) | `ormCheckTable` + `ormCreateTable` |
| GD-C-014 (list length) | `getListLen` + `itemFromList` |
| ... | ... |

**Step 4 — Cross-model comparison.** Compare the taxonomy distributions between harrier and qwen3 on the same index. If one model shows a qualitatively different failure profile (e.g., harrier fails on ORM, qwen3 fails on loops), the few-shot pool composition matters more than the pool size.

### Success criteria

The experiment is conclusive if it produces one of these two findings:

**Finding A (concentrated failures):** ≥ 70% of `CODE_GENERATION` violations occur in ≤ 5 distinct construct combinations. This means the few-shot pool can be targeted and the ADR-0007 few-shot injection decision is high-leverage.
**Finding B (distributed failures):** Violations are spread across ≥ 10 distinct construct combinations with no clear concentration. This means the model lacks general AVAP grammar coverage and few-shot injection alone will be insufficient — the parser gate becomes the primary defence, not a secondary one.
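The two findings reduce to a concentration check over the construct labels recorded in Step 3. A minimal sketch of the decision rule, with the function name and label format as assumptions:

```python
from collections import Counter

# Sketch of the Finding A / Finding B decision rule: given one construct
# combination label per violation, check whether the top 5 combinations
# account for at least 70% of all violations (Finding A), or whether the
# violations spread across 10+ distinct combinations (Finding B).
def classify_finding(violation_constructs: list[str]) -> str:
    counts = Counter(violation_constructs)
    top5 = sum(n for _, n in counts.most_common(5))
    if top5 / len(violation_constructs) >= 0.70:
        return "A (concentrated)"
    if len(counts) >= 10:
        return "B (distributed)"
    return "inconclusive"
```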

### Output

A one-page table: construct × model × failure type × count. This table becomes the first version of the syntactic confusion matrix specified in ADR-0007 Section 7, produced without any infrastructure changes.

### Estimated effort

2–3 hours. One `evaluate.py` modification (log full answer), two evaluation runs (no N=5, no seeds required — exploratory), one manual taxonomy pass on ~40 entries.

---

## Experiment 2 — Few-Shot Injection A/B

### Hypothesis

Injecting 5 semantically similar AVAP examples from the 190-example pool into the generation prompt will reduce foreign syntax injection in `CODE_GENERATION` entries by ≥ 60% compared to the current baseline (no few-shot context). The reduction will be measurable manually without a parser gate, because the most severe violations (complete Go programs, Python `for` loops) are visually identifiable.

### Dependency

Experiment 2 should be run **after** Experiment 1. The construct-level breakdown from Experiment 1 informs which few-shot examples to select: if GD-C-003 (loop + JSON) fails consistently, the few-shot examples injected for that query should include `bucle_1_10.avap` and `construccion_dinamica_de_objeto.avap` from the LRM pool, not generic examples.

### Method

**Step 1 — Build the few-shot retrieval function.** Using `src/utils/emb_factory.py` (already exists), embed the 190 examples from `docs/LRM/*.avap`. For each query in the golden dataset, retrieve the top-5 most similar examples by cosine similarity. Log which examples are selected per query.
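A minimal sketch of the retrieval step, assuming `src/utils/emb_factory.py` yields one vector per example. Its real API is not shown here; the function below operates on precomputed vectors:

```python
import numpy as np

# Sketch of the top-5 few-shot retrieval: cosine similarity between the
# query embedding and the embeddings of the 190 LRM examples.
def top5_examples(query_vec: np.ndarray,
                  example_vecs: np.ndarray,
                  example_names: list[str]) -> list[str]:
    """Return the names of the 5 examples most similar to the query."""
    # Cosine similarity: normalise rows, then a dot product suffices.
    q = query_vec / np.linalg.norm(query_vec)
    e = example_vecs / np.linalg.norm(example_vecs, axis=1, keepdims=True)
    sims = e @ q
    best = np.argsort(sims)[::-1][:5]
    return [example_names[i] for i in best]
```

Logging the returned names per query gives the retrieval log the Output section asks for.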
**Step 2 — Modify the generation prompt.** Add a few-shot block before the user query in `prompts.py`. Format:
```
The following are valid AVAP code examples. Use them as syntactic reference.

--- Example 1 ---
{few_shot_example_1}

--- Example 2 ---
{few_shot_example_2}

[...up to 5]

Now answer the following question using only valid AVAP syntax:
{query}
```
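The block above can be assembled with a small helper; the function name and its exact place in `prompts.py` are assumptions:

```python
# Sketch of the prompts.py addition: render the few-shot block in the
# format shown above. Caps the injected examples at 5.
def build_few_shot_prompt(examples: list[str], query: str) -> str:
    lines = ["The following are valid AVAP code examples. "
             "Use them as syntactic reference.", ""]
    for i, example in enumerate(examples[:5], start=1):
        lines.append(f"--- Example {i} ---")
        lines.append(example)
        lines.append("")
    lines.append("Now answer the following question "
                 "using only valid AVAP syntax:")
    lines.append(query)
    return "\n".join(lines)
```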
**Step 3 — Run one evaluation pass** on the same two configurations as Experiment 1 (harrier/avap-docs-test-v4 and qwen3/avap-docs-test-v4). Log full response bodies. This is still exploratory — no N=5, no seeds.
**Step 4 — Manual comparison.** Apply the same taxonomy from Experiment 1 to the new responses. Count `FOREIGN_SYNTAX` and `INVENTED_COMMAND` entries before and after few-shot injection.
**Step 5 — RAGAS delta.** Compare global scores between baseline and few-shot runs. A few-shot injection that reduces syntactic violations but also reduces RAGAS scores significantly would indicate that the few-shot context is consuming context window at the expense of retrieval quality — this informs the K parameter decision in ADR-0007.
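The Step 5 comparison is a per-metric subtraction. A sketch, with the metric key names as placeholders for whatever the RAGAS report actually emits:

```python
# Sketch of the Step 5 comparison between baseline and few-shot runs.
def score_delta(baseline: dict[str, float],
                few_shot: dict[str, float]) -> dict[str, float]:
    """Per-metric delta (few-shot minus baseline); negative = regression."""
    return {m: round(few_shot[m] - baseline[m], 4)
            for m in baseline if m in few_shot}

def context_window_warning(delta: dict[str, float],
                           threshold: float = -0.05) -> bool:
    """Flag the ADR-0007 context-window concern: any metric drops > 0.05."""
    return any(d < threshold for d in delta.values())
```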

### Success criteria

| Result | Interpretation | Implication for ADR-0007 |
|---|---|---|
| Foreign syntax violations drop ≥ 60% | Few-shot injection is high-leverage | Prioritise few-shot implementation before parser gate |
| Foreign syntax violations drop 20–60% | Few-shot helps but is insufficient alone | Implement both in parallel |
| Foreign syntax violations drop < 20% | Model lacks AVAP grammar at a fundamental level | Parser gate is the primary defence; few-shot pool needs expansion or the base model needs replacement |
| RAGAS global score drops > 0.05 | Context window competition is real | Reduce K or implement dynamic context window management |

### Output

A 2×2 table (model × condition, baseline vs. few-shot) reporting the violation count and RAGAS global score in each cell, plus the few-shot retrieval log showing which examples were selected for which queries — this is the raw input for pool quality analysis.

### Estimated effort

4–6 hours. Embedding the 190 examples + retrieval function (~2h), prompt modification (~30min), two evaluation runs (~2h), manual taxonomy pass (~1h).

---

## Decision gate

Both experiments feed a **go/no-go decision** for ADR-0007 implementation:

| Scenario | Decision |
|---|---|
| Exp 1: concentrated failures + Exp 2: ≥ 60% reduction | Implement few-shot first, parser gate second. The few-shot pool composition (informed by the confusion matrix) is the highest-leverage action. |
| Exp 1: concentrated failures + Exp 2: < 60% reduction | Implement parser gate and few-shot in parallel. The concentrated failure profile informs pool expansion. |
| Exp 1: distributed failures + Exp 2: any result | Parser gate is the primary defence. Few-shot injection is a secondary measure. The base model may need re-evaluation. |
| Both experiments inconclusive | Run Experiment 1 with full response bodies and a second annotator before proceeding. |

This decision gate replaces the need for an architectural meeting to assign priorities — the data makes the priority order self-evident.

---

## What this protocol does not answer

- Whether the AVAP Parser gRPC service can handle the throughput of N=5 evaluation runs (50 queries × 5 runs = 250 parser calls). That requires a load test on the parser service, not an evaluation run.
- Whether 190 examples are sufficient to cover the confusion matrix tail. That requires the confusion matrix from Experiment 1 to exist first.
- The minimum `syntactic_validity` threshold for production readiness. That requires at least one MSVL-validated run with known-good and known-bad models to calibrate.

These three questions are explicitly deferred to the post-implementation phase of ADR-0007.

---

## Timeline

| Step | Owner | Estimated duration |
|---|---|---|
| Experiment 1: `evaluate.py` modification + 2 runs | Pablo (AI Team) | 1 day |
| Experiment 1: manual taxonomy + confusion matrix draft | Pablo (AI Team) | 1 day |
| Experiment 2: few-shot retrieval function + prompt modification | Pablo (AI Team) | 1 day |
| Experiment 2: 2 runs + manual comparison | Pablo (AI Team) | 1 day |
| Results review and go/no-go decision | Rafael Ruiz (CTO) + Pablo | 1 meeting |

**Total: 4 working days before any infrastructure change from ADR-0007 is scheduled.**