
# RP-0001: Pre-Implementation Validation for ADR-0007 (MSVL)
**Date:** 2026-04-06
**Status:** Proposed
**Author:** Rafael Ruiz (CTO)
**Executed by:** AI Team (Pablo)
**Related ADR:** ADR-0007 (Mandatory Syntactic Validation Layer)
**Input data:** 6 evaluation runs from 2026-04-06 (`evaluation_*.json`)

---
## Purpose
ADR-0007 defines four implementation decisions: parser gRPC integration, dynamic few-shot injection, N=5 Docker protocol, and syntactic confusion matrix. Before assigning engineering work to any of these, two questions must be answered empirically with data the team already has:
1. **Is the syntactic failure rate structurally predictable?** If failures concentrate in specific question categories or construct types, the few-shot pool and documentation effort can be targeted. If failures are random, the problem is model capability and few-shot injection may not be sufficient.
2. **Does few-shot injection reduce foreign syntax injection before we build the parser gate?** The few-shot change is cheap to test manually. If it eliminates the majority of violations, the urgency profile of the remaining ADR-0007 decisions changes.
These two experiments must be completed and their results reviewed before any ADR-0007 implementation work begins.

---
## Experiment 1 — Syntactic Failure Taxonomy
### Hypothesis
Syntactic violations are not uniformly distributed across question categories. `CODE_GENERATION` questions will show significantly higher violation rates than `RETRIEVAL` or `CONVERSATIONAL` questions, because code generation requires the model to produce executable AVAP syntax rather than describe it. Within `CODE_GENERATION`, violations will concentrate in construct combinations not covered by the current few-shot pool (loops with JSON building, ORM with error handling, multi-function scripts).
### What we already know from the 2026-04-06 runs
The forensic analysis in ADR-0007 identified the following confirmed violations across 6 runs (300 entries total):
| Failure type | Confirmed cases | Dominant model |
|---|---|---|
| Foreign language injection (Go, Python, JS) | 20 | bge-m3, qwen3 |
| Hallucinated AVAP commands | 12 | qwen3, harrier |
| Structural foreign syntax (while, let, var, for) | 8 | bge-m3, qwen3 |
**Critical limitation:** `answer_preview` fields are truncated at ~300 characters. The full response bodies are not available in the JSON files. The counts above are lower bounds.
### Method
**Step 1 — Full response body access.** Re-run `EvaluateRAG` for the two best-performing configurations (harrier/avap-docs-test-v4 and qwen3/avap-docs-test-v4) with a modified `evaluate.py` that logs the complete `answer` field, not just `answer_preview`. This is a one-line change. Run once, not N=5 — this is exploratory, not a benchmarking run.
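The logging change can be sketched as follows. This is a hypothetical shape: the protocol does not show `evaluate.py`'s internals, so the `entry` schema and output path are assumptions; the point is persisting the full `answer` alongside the existing truncated preview.

```python
import json

def log_entry(entry, out_path):
    """Append one evaluation entry with the complete answer body.

    Hypothetical sketch: the real `evaluate.py` structure is not shown in
    this protocol, so the `entry` keys are assumptions. The key change is
    persisting `answer` in full instead of only a ~300-char preview.
    """
    record = {
        "question_id": entry["question_id"],
        "answer": entry["answer"],                 # full body, no truncation
        "answer_preview": entry["answer"][:300],   # kept for compatibility
    }
    with open(out_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```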
**Step 2 — Manual taxonomy on CODE_GENERATION entries.** For all 20 `GD-C-*` entries per run, classify each response into one of four categories:
| Category | Definition |
|---|---|
| `VALID_AVAP` | All constructs present are valid AVAP. May be incomplete but syntactically correct. |
| `FOREIGN_SYNTAX` | Contains identifiable syntax from Go, Python, or JavaScript. |
| `INVENTED_COMMAND` | Uses a command name that does not exist in the LRM. |
| `STRUCTURAL_ERROR` | Grammatically wrong AVAP (wrong argument count, missing `end()`, wrong block nesting). |
One reviewer, consistent criteria. The goal is not perfect precision — it is identifying whether `CODE_GENERATION` failures are concentrated in specific construct types.
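To speed the single-reviewer pass, a heuristic pre-screen can flag likely `FOREIGN_SYNTAX` candidates before manual review. The keyword list below is an illustrative assumption drawn from the violation types already confirmed (while, let, var, for), not a validated AVAP grammar check; every flag still needs human confirmation.

```python
import re

# Keywords legal in Go/Python/JS that signal foreign syntax when they
# appear in AVAP output. Illustrative list only -- extend it from the
# confirmed violations in the forensic analysis.
FOREIGN_PATTERNS = [
    r"\bwhile\b", r"\blet\b", r"\bvar\b", r"\bfor\b",
    r"\bdef\b", r"\bfunc\b", r"\bimport\b", r"=>",
]

def flag_foreign_syntax(answer: str) -> list[str]:
    """Return the foreign-syntax markers found in a response, if any."""
    return [p for p in FOREIGN_PATTERNS if re.search(p, answer)]
```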
**Step 3 — Construct-level breakdown.** For every `FOREIGN_SYNTAX` or `INVENTED_COMMAND` entry in `CODE_GENERATION`, record which AVAP construct the question required and which the model failed on. Use the question text to infer the target construct:
| Question | Target construct |
|---|---|
| GD-C-003 (loop + JSON) | `startLoop` + `AddVariableToJSON` |
| GD-C-005 (GET + error handling) | `RequestGet` + `try/exception` |
| GD-C-011 (ORM table check) | `ormCheckTable` + `ormCreateTable` |
| GD-C-014 (list length) | `getListLen` + `itemFromList` |
| ... | ... |
**Step 4 — Cross-model comparison.** Compare the taxonomy distributions between harrier and qwen3 on the same index. If one model shows a qualitatively different failure profile (e.g., harrier fails on ORM, qwen3 fails on loops), the few-shot pool composition matters more than the pool size.
### Success criteria
The experiment is conclusive if it produces one of these two findings:
**Finding A (concentrated failures):** ≥ 70% of `CODE_GENERATION` violations occur in ≤ 5 distinct construct combinations. This means the few-shot pool can be targeted and the ADR-0007 few-shot injection decision is high-leverage.
**Finding B (distributed failures):** Violations are spread across ≥ 10 distinct construct combinations with no clear concentration. This means the model lacks general AVAP grammar coverage and few-shot injection alone will be insufficient — the parser gate becomes the primary defence, not a secondary one.
### Output
A one-page table: construct × model × failure type × count. This table becomes the first version of the syntactic confusion matrix specified in ADR-0007 Section 7, produced without any infrastructure changes.
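Aggregating the manual taxonomy into that table is mechanical. A minimal sketch, assuming the taxonomy pass produces rows with `construct`, `model`, and `failure_type` fields (a hypothetical schema, since the protocol only fixes the table's axes):

```python
from collections import Counter

def confusion_counts(rows):
    """Aggregate taxonomy rows into (construct, model, failure_type) counts.

    `rows` is assumed to be the manual taxonomy output: dicts with
    `construct`, `model`, and `failure_type` keys (hypothetical schema).
    """
    return Counter((r["construct"], r["model"], r["failure_type"]) for r in rows)

def render_table(counts):
    """Render the counts as a Markdown table, the RP's one-page output."""
    lines = ["| Construct | Model | Failure type | Count |", "|---|---|---|---|"]
    for (construct, model, ftype), n in sorted(counts.items()):
        lines.append(f"| {construct} | {model} | {ftype} | {n} |")
    return "\n".join(lines)
```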
### Estimated effort
2–3 hours. One `evaluate.py` modification (log full answer), two evaluation runs (no N=5, no seeds required — exploratory), one manual taxonomy pass on ~40 entries.

---
## Experiment 2 — Few-Shot Injection A/B
### Hypothesis
Injecting 5 semantically similar AVAP examples from the 190-example pool into the generation prompt will reduce foreign syntax injection in `CODE_GENERATION` entries by ≥ 60% compared to the current baseline (no few-shot context). The reduction will be measurable manually without a parser gate, because the most severe violations (complete Go programs, Python `for` loops) are visually identifiable.
### Dependency
Experiment 2 should be run **after** Experiment 1. The construct-level breakdown from Experiment 1 informs which few-shot examples to select: if GD-C-003 (loop + JSON) fails consistently, the few-shot examples injected for that query should include `bucle_1_10.avap` and `construccion_dinamica_de_objeto.avap` from the LRM pool, not generic examples.
### Method
**Step 1 — Build the few-shot retrieval function.** Using `src/utils/emb_factory.py` (already exists), embed the 190 examples from `docs/LRM/*.avap`. For each query in the golden dataset, retrieve the top-5 most similar examples by cosine similarity. Log which examples are selected per query.
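The retrieval step is plain cosine similarity over a small matrix. A minimal sketch with NumPy; producing the embedding vectors is assumed to go through `src/utils/emb_factory.py`, whose interface this protocol does not show:

```python
import numpy as np

def top_k_examples(query_vec, example_vecs, example_names, k=5):
    """Return the names of the k examples most cosine-similar to the query.

    `example_vecs` is an (N, d) matrix of embeddings for the 190 LRM
    examples. Embedding itself is assumed to come from emb_factory.py
    and is not shown here.
    """
    q = query_vec / np.linalg.norm(query_vec)
    m = example_vecs / np.linalg.norm(example_vecs, axis=1, keepdims=True)
    sims = m @ q                         # cosine similarity per example
    top = np.argsort(sims)[::-1][:k]     # indices of the k best matches
    return [example_names[i] for i in top]
```

Logging the returned names per query, as Step 1 requires, is then a one-line addition at the call site.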
**Step 2 — Modify the generation prompt.** Add a few-shot block before the user query in `prompts.py`. Format:
```
The following are valid AVAP code examples. Use them as syntactic reference.
--- Example 1 ---
{few_shot_example_1}
--- Example 2 ---
{few_shot_example_2}
[...up to 5]
Now answer the following question using only valid AVAP syntax:
{query}
```
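The block above can be assembled with a small helper. The function name and placement are assumptions — the protocol only says the change lives in `prompts.py`, whose existing structure is not shown:

```python
def build_few_shot_prompt(query: str, examples: list[str]) -> str:
    """Assemble the few-shot generation prompt in the format above.

    Hypothetical helper: the real change belongs in prompts.py. Caps the
    injected examples at 5, matching the '[...up to 5]' in the template.
    """
    parts = ["The following are valid AVAP code examples. "
             "Use them as syntactic reference."]
    for i, example in enumerate(examples[:5], start=1):
        parts.append(f"--- Example {i} ---\n{example}")
    parts.append("Now answer the following question using only valid "
                 "AVAP syntax:\n" + query)
    return "\n".join(parts)
```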
**Step 3 — Run one evaluation pass** on the same two configurations as Experiment 1 (harrier/avap-docs-test-v4 and qwen3/avap-docs-test-v4). Log full response bodies. This is still exploratory — no N=5, no seeds.
**Step 4 — Manual comparison.** Apply the same taxonomy from Experiment 1 to the new responses. Count `FOREIGN_SYNTAX` and `INVENTED_COMMAND` entries before and after few-shot injection.
**Step 5 — RAGAS delta.** Compare global scores between baseline and few-shot runs. A few-shot injection that reduces syntactic violations but also reduces RAGAS scores significantly would indicate that the few-shot context is consuming context window at the expense of retrieval quality — this informs the K parameter decision in ADR-0007.
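Steps 4 and 5 reduce to two numbers per model: the relative drop in violations and the RAGAS delta. A minimal sketch of that comparison (the function name and the flat violation counts are assumptions about how the taxonomy results get tallied):

```python
def compare_runs(base_violations, fs_violations, base_ragas, fs_ragas):
    """Compute the two deltas that drive the Experiment 2 reading.

    Returns the relative violation reduction (0..1) and the RAGAS drop.
    Per the success criteria, a RAGAS drop above 0.05 flags context-window
    competition between few-shot examples and retrieved context.
    """
    reduction = (base_violations - fs_violations) / base_violations
    ragas_drop = base_ragas - fs_ragas
    return reduction, ragas_drop
```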
### Success criteria
| Result | Interpretation | Implication for ADR-0007 |
|---|---|---|
| Foreign syntax violations drop ≥ 60% | Few-shot injection is high-leverage | Prioritise few-shot implementation before parser gate |
| Foreign syntax violations drop 20–60% | Few-shot helps but is insufficient alone | Implement both in parallel |
| Foreign syntax violations drop < 20% | Model lacks AVAP grammar at a fundamental level | Parser gate is the primary defence; few-shot pool needs expansion or the base model needs replacement |
| RAGAS global score drops > 0.05 | Context window competition is real | Reduce K or implement dynamic context window management |
### Output
A 2×2 comparison table (model × condition, baseline vs. few-shot), each cell reporting the violation count and the RAGAS global score. Plus the few-shot retrieval log showing which examples were selected for which queries — this is the raw input for pool quality analysis.
### Estimated effort
4–6 hours. Embedding the 190 examples + retrieval function (~2h), prompt modification (~30min), two evaluation runs (~2h), manual taxonomy pass (~1h).

---
## Decision gate
Both experiments feed a **go/no-go decision** for ADR-0007 implementation:
| Scenario | Decision |
|---|---|
| Exp 1: concentrated failures + Exp 2: ≥ 60% reduction | Implement few-shot first, parser gate second. The few-shot pool composition (informed by the confusion matrix) is the highest-leverage action. |
| Exp 1: concentrated failures + Exp 2: < 60% reduction | Implement parser gate and few-shot in parallel. The concentrated failure profile informs pool expansion. |
| Exp 1: distributed failures + Exp 2: any result | Parser gate is the primary defence. Few-shot injection is a secondary measure. The base model may need re-evaluation. |
| Both experiments inconclusive | Run Experiment 1 with full response bodies and a second annotator before proceeding. |
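The gate above can be transcribed as a small lookup. This is a direct restatement of the table, not additional policy; the return strings paraphrase the decision column and the `conclusive` flag covers the final row:

```python
def adr_0007_decision(exp1_concentrated, reduction, conclusive=True):
    """Map experiment outcomes to the go/no-go decision in the gate table.

    exp1_concentrated: Finding A (True) vs Finding B (False) from Exp 1.
    reduction: relative foreign-syntax drop from Exp 2 (0..1).
    """
    if not conclusive:
        return "Re-run Experiment 1 with full bodies and a second annotator"
    if not exp1_concentrated:
        return "Parser gate primary; few-shot secondary; re-evaluate base model"
    if reduction >= 0.60:
        return "Few-shot first, parser gate second"
    return "Parser gate and few-shot in parallel"
```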
This decision gate replaces the need for an architectural meeting to assign priorities: the data makes the priority order self-evident.

---
## What this protocol does not answer
- Whether the AVAP Parser gRPC service can handle the throughput of N=5 evaluation runs (50 queries × 5 runs = 250 parser calls). That requires a load test on the parser service, not an evaluation run.
- Whether 190 examples are sufficient to cover the confusion matrix tail. That requires the confusion matrix from Experiment 1 to exist first.
- The minimum `syntactic_validity` threshold for production readiness. That requires at least one MSVL-validated run with known-good and known-bad models to calibrate.
These three questions are explicitly deferred to the post-implementation phase of ADR-0007.

---
## Timeline
| Step | Owner | Estimated duration |
|---|---|---|
| Experiment 1: `evaluate.py` modification + 2 runs | Pablo (AI Team) | 1 day |
| Experiment 1: manual taxonomy + confusion matrix draft | Pablo (AI Team) | 1 day |
| Experiment 2: few-shot retrieval function + prompt modification | Pablo (AI Team) | 1 day |
| Experiment 2: 2 runs + manual comparison | Pablo (AI Team) | 1 day |
| Results review and go/no-go decision | Rafael Ruiz (CTO) + Pablo | 1 meeting |
**Total: 4 working days before any infrastructure change from ADR-0007 is scheduled.**