
# ADR-0006: Reward Algorithm for Self-Improving Dataset Synthesis
**Date:** 2026-03-25
**Status:** Under Evaluation — Primary comparison: Candidate A vs Candidate E vs Candidate F
**Deciders:** Rafael Ruiz (CTO), MrHouston Engineering (AI Team)
**Research lead:** Ivar Zapata
---
## Context
The AVAP dataset synthesis pipeline (Track A) generates AVAP code examples using a large language model, filtered by a three-stage quality pipeline: parser validation (Stage 1), Execution Coverage Score (Stage 2), and semantic novelty (Stage 3). The current pipeline has two structural limitations that the reward mechanism must address.
### Limitation 1 — Static generation
Each batch is generated from the same static prompt (LRM + category description). The generator has no memory of what it has already produced and no model of what "good" looks like for the constructs it hasn't explored yet.
### Limitation 2 — Distribution bias (the fundamental problem)
The generator (Claude Sonnet) has its own internal distribution over what AVAP code "looks like", derived from its training on mainstream languages. It naturally gravitates toward the simplest patterns — linear code, basic conditionals, single-construct examples — because those are closest to what it knows. Any reward mechanism based on selecting the best from what the model spontaneously produces and feeding those back as few-shots **amplifies this bias**: the pool fills with what the model does easily, and the model never explores what it does poorly.
This is not model collapse in the classical sense (weights are not updated), but it is **cumulative distribution bias** — the effective generation distribution narrows toward the model's comfort zone with each iteration.
### The correct framing
The solution is not to reward what the model produces spontaneously. It is to **specify externally what must be produced** and evaluate quality relative to that specification. Coverage of the DSL's grammar space must be guaranteed by construction, not hoped for through probabilistic exploration.
---
## Decision
**Conduct a primary comparative evaluation of Candidate A (CW-Reward, reward-driven pool), Candidate E (MAP-Elites, externally-specified coverage cells), and Candidate F (MAP-Elites with ConstructPrior transfer from real production code)** before selecting the production algorithm. Candidates B, C, D are secondary alternatives evaluated only if none of A, E, or F meets quality thresholds.
The fundamental research question has two layers:
1. **Does forced external specification of construct combinations produce a less biased, higher-quality dataset than reward-driven spontaneous exploration?** (A vs E)
2. **Does seeding cell selection with real production code co-occurrence distributions further improve coverage quality and downstream RAG performance over blind MAP-Elites?** (E vs F)
---
## Candidate Analysis
### Candidate A — CW-Reward (Composite Weighted Reward)
**Algorithm class:** In-context reward — no parameter updates.
**Mechanism:** A composite reward is computed for each parser-valid example:
```
reward(e) = w_ecs · ECS(e) + w_novelty · Jaccard_novelty(e, Pool) + w_tests · test_quality(e)
```
High-reward examples enter a GoldPool (top-K). The pool is injected as few-shot context in subsequent generation calls. Coverage summary steers the prompt toward underrepresented constructs.
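The mechanism can be sketched directly. In this minimal sketch, `cw_reward` uses the A1 baseline weights from the grid later in this ADR, and `GoldPool` is a hypothetical top-K container; the production pool implementation is not specified here.

```python
import heapq

def cw_reward(ecs, novelty, tests, w_ecs=0.50, w_novelty=0.35, w_tests=0.15):
    """Composite weighted reward; defaults mirror config A1 (balanced baseline)."""
    return w_ecs * ecs + w_novelty * novelty + w_tests * tests

class GoldPool:
    """Hypothetical top-K pool of high-reward examples for few-shot injection."""
    def __init__(self, k=32):
        self.k = k
        self._heap = []       # min-heap of (reward, tiebreak, example)
        self._tiebreak = 0    # insertion counter so examples are never compared

    def offer(self, example, reward):
        entry = (reward, self._tiebreak, example)
        self._tiebreak += 1
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, entry)
        elif reward > self._heap[0][0]:
            heapq.heapreplace(self._heap, entry)  # evict current minimum

    def few_shots(self):
        """Highest-reward examples first, for prompt injection."""
        return [example for _, _, example in sorted(self._heap, reverse=True)]
```

Because the pool only ever admits what already scores well, this sketch also makes the bias risk below concrete: nothing in `offer` asks whether the pool is representative of the grammar space.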
**Known bias risk:** The pool amplifies the model's natural generation distribution. Examples that are easy for the model (simple patterns, single constructs) tend to enter the pool first and persist. The Jaccard novelty metric penalises structural similarity but cannot detect semantic simplicity — two examples with different node type sets can both be trivially shallow.
**Appropriate when:** The base LLM has strong prior knowledge of the target language (mainstream languages). For AVAP, where the model has zero prior knowledge, the bias risk is materially higher.
---
### Candidate E — MAP-Elites with Externally-Defined Coverage Cells (Proposed Primary)
**Algorithm class:** Quality-Diversity algorithm — no parameter updates, coverage guaranteed by construction.
**Core insight:** Instead of rewarding the best examples from spontaneous generation, define the coverage space externally from the grammar and direct the generator to fill specific cells. The model's distribution bias is neutralised because it is never asked to "explore freely" — it is always given a precise specification.
**Coverage space definition:**
The behavior space is defined over **pairs and trios of AVAP node types** drawn from the full grammar vocabulary. Each cell represents a construct combination that must be represented in the dataset:
```
Cell key = frozenset of 2 or 3 AVAP node types
Cell value = (best_example_so_far, quality_score)
Example cells:
{"startLoop", "ormAccessSelect"} → best example using both
{"try", "go", "RequestPost"} → best example using all three
{"function", "if_mode2", "encodeSHA256"} → best example using all three
```
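Cell keys are order-independent construct sets, so the whole map can be enumerated from the grammar vocabulary. A sketch, with a placeholder list standing in for the 38 AVAP node types:

```python
from itertools import combinations
from math import comb

# Placeholder names; the real list comes from the AVAP grammar vocabulary.
NODE_TYPES = [f"node_{i}" for i in range(38)]

def build_cells(node_types):
    """Enumerate all pair and trio cells as frozensets (order-independent keys).

    Each value is (best_example_so_far, quality_score), initially empty.
    """
    cells = {}
    for r in (2, 3):
        for combo in combinations(node_types, r):
            cells[frozenset(combo)] = (None, 0.0)
    return cells

cells = build_cells(NODE_TYPES)
assert len(cells) == comb(38, 2) + comb(38, 3)  # 703 + 8,436 = 9,139
```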
**Space size:**
- Pairs: C(38, 2) = 703 cells
- Trios: C(38, 3) = 8,436 cells
- Total: 9,139 cells
With 5,000 examples targeted, average coverage is ~0.55 examples per cell — statistical coverage of pairwise and triadic construct combinations is achievable with a focused cell selection strategy. Full coverage of high-prior cells is expected within budget; tail cells are addressed in Phase 3.
**Generation protocol:**
```
1. SELECT target cell:
- Empty cells first (exploration phase)
- Then lowest-quality cells (exploitation phase)
- Interleave: every 10 calls, select a cell adjacent to a
recently improved cell (local neighborhood search)
2. SPECIFY in the prompt:
"Generate an AVAP example that MUST use ALL of these constructs:
{cell_constructs}. Use additional constructs where natural."
3. VALIDATE:
a. Parser: syntactically valid? (Stage 1)
b. Construct presence: all cell constructs in AST? (cell gate)
c. If both pass → compute cell quality score
4. UPDATE cell:
If quality > current cell quality → replace cell entry
```
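Steps 1–4 condense into a short loop. This sketch assumes hypothetical `generate`, `parse_ok`, and `cell_quality` callables (standing in for the LLM call, the Stage 1 parser, and the quality score); the every-10-calls neighborhood interleave is omitted for brevity.

```python
import random

def select_cell(cells):
    """Step 1: empty cells first (exploration), then lowest-quality (exploitation)."""
    empty = [key for key, (example, _) in cells.items() if example is None]
    if empty:
        return random.choice(empty)
    return min(cells, key=lambda key: cells[key][1])

def run_iteration(cells, generate, parse_ok, cell_quality):
    """One SELECT -> SPECIFY -> VALIDATE -> UPDATE pass. Returns (cell, updated)."""
    cell = select_cell(cells)
    prompt = ("Generate an AVAP example that MUST use ALL of these constructs: "
              f"{sorted(cell)}. Use additional constructs where natural.")
    code, constructs = generate(prompt)      # hypothetical: code text + AST node types
    if not parse_ok(code) or not cell.issubset(constructs):
        return cell, False                   # discarded; cell stays available
    quality = cell_quality(code, cell, constructs)
    if quality > cells[cell][1]:
        cells[cell] = (code, quality)        # replace cell entry
        return cell, True
    return cell, False
```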
**Cell quality score:**
```
cell_quality(e, cell) =
construct_fidelity(e, cell) # fraction of cell constructs actually present
+ α · bonus_constructs(e, cell) # extra constructs beyond cell specification
+ β · test_quality(e) # quality of test assertions
+ γ · code_length_norm(e) # normalised code length (longer = richer)
```
`construct_fidelity` is the primary gate: an example that does not contain all cell constructs scores 0 regardless of other criteria.
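As a function, the score and its gate look as follows. In this sketch `alpha=0.2` mirrors config E2, while `beta` and `gamma` are illustrative placeholders (the ADR does not fix them), and `bonus_constructs` is read as a count of extra constructs beyond the cell specification.

```python
def cell_quality(constructs_present, cell, test_quality, code_length_norm,
                 alpha=0.2, beta=0.3, gamma=0.1):
    """Cell quality score; alpha matches config E2, beta/gamma are placeholders."""
    fidelity = len(cell & constructs_present) / len(cell)
    if fidelity < 1.0:
        return 0.0                            # construct_fidelity is the primary gate
    bonus = len(constructs_present - cell)    # extra constructs beyond the cell spec
    return fidelity + alpha * bonus + beta * test_quality + gamma * code_length_norm
```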
**Why this eliminates distribution bias:**
The model is never asked what it "wants" to generate. It receives a precise specification: "you must use these three constructs." If it produces something that satisfies the specification, it enters the map. If not, it is discarded and the cell remains available for the next attempt. The coverage trajectory is determined by the cell selection strategy, not by the model's natural distribution.
The only residual bias is the model's ability to satisfy arbitrary construct specifications — some cells may be harder to fill than others. This is empirically measurable (fill rate per cell) and is itself a research finding about the generator's capabilities.
**Appropriate when:** The target language is novel or partially unknown to the generator. The external specification mechanism compensates for the model's lack of prior knowledge.
---
### Candidate F — MAP-Elites with ConstructPrior Transfer (Proposed Disruptive Extension)
**Algorithm class:** Quality-Diversity algorithm with informed cell selection — no parameter updates, coverage guaranteed by construction.
**Core insight:** Candidate E specifies *which* constructs must appear but treats all cells as equally valuable. Real production code does not use constructs uniformly: some combinations (e.g., `ormAccessSelect` + `try`) appear in virtually every real API endpoint; others (e.g., `encodeSHA256` + `startLoop`) appear rarely. A golden dataset that mirrors production code distributions will retrieve more relevant examples for real developer queries. The ConstructPrior module transfers this knowledge from large public codebases to weight MAP-Elites cell selection.
**ConstructPrior design:**
```
ConstructPrior = weighted combination of 4 domain sources:
Source 1 — The Stack (BigCode, 50% weight)
Filter: paths matching /api/, /routes/, /handlers/, /endpoints/
Languages: Python, Go, JavaScript/TypeScript, Java
Process: extract function-level code blocks → map language constructs
to AVAP semantic equivalents → compute co-occurrence frequency
per (construct_a, construct_b) and (construct_a, construct_b, construct_c)
Rationale: real microservice API code; largest and most representative source
Source 2 — CodeSearchNet (30% weight)
Filter: semantic search for "api endpoint", "http handler", "database query"
Languages: Python, Go, Java, JavaScript
Process: same mapping pipeline as Source 1
Rationale: function-docstring pairs provide semantic context for mapping quality
Source 3 — HumanEval-X Go (10% weight)
Filter: problems using goroutines, channels, wait groups
Process: map Go concurrency primitives → AVAP {go, gather, startLoop}
Rationale: AVAP's concurrency model mirrors Go's; coverage of concurrent patterns
Source 4 — Spider SQL Dataset (10% weight)
Filter: multi-table joins, aggregations, nested queries
Process: map SQL operations → AVAP {ormAccessSelect, ormAccessInsert, ormAccessUpdate}
Rationale: AVAP ORM constructs semantically equivalent to SQL clauses
```
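At its core, each source's processing pipeline reduces to counting construct co-occurrences per code block. A sketch, assuming each block has already been mapped to a set of AVAP constructs by the (heuristic) mapping table below, and using max-normalization as one plausible reading of "normalized to [0, 1] within each source":

```python
from collections import Counter
from itertools import combinations

def cooccurrence_freqs(blocks):
    """blocks: iterable of AVAP construct sets, one per function-level code block.

    Returns pair and trio co-occurrence frequencies, normalized to [0, 1]
    within this source by dividing by the peak count.
    """
    counts = Counter()
    for constructs in blocks:
        for r in (2, 3):
            for combo in combinations(sorted(constructs), r):
                counts[frozenset(combo)] += 1
    if not counts:
        return {}
    peak = max(counts.values())
    return {key: count / peak for key, count in counts.items()}
```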
**Construct mapping table (AVAP ← source constructs):**
| AVAP construct | Python equivalent | Go equivalent | SQL equivalent |
|---|---|---|---|
| `ormAccessSelect` | `cursor.fetchall()`, `session.query()` | `db.Query()`, `rows.Scan()` | `SELECT` |
| `ormAccessInsert` | `session.add()`, `cursor.execute(INSERT)` | `db.Exec(INSERT)` | `INSERT INTO` |
| `ormAccessUpdate` | `session.merge()`, `cursor.execute(UPDATE)` | `db.Exec(UPDATE)` | `UPDATE` |
| `RequestGet` | `requests.get()`, `httpx.get()` | `http.Get()`, `client.Get()` | — |
| `RequestPost` | `requests.post()`, `httpx.post()` | `http.Post()`, `client.Post()` | — |
| `startLoop` | `for item in list:` | `for _, v := range` | `CURSOR LOOP` |
| `go` + `gather` | `asyncio.gather()`, `ThreadPoolExecutor` | `go func()`, `sync.WaitGroup` | — |
| `try` + `exception` | `try: except:` | `if err != nil` | — |
| `encodeSHA256` | `hashlib.sha256()` | `sha256.New()` | — |
| `function` | `def func():` | `func name()` | `CREATE FUNCTION` |
**Cell weighting formula:**
```
cell_prior_weight(cell) =
Σ_{s ∈ Sources} weight_s · freq_s(cell_constructs)
where freq_s(cell) = co-occurrence frequency of the construct set in source s,
normalized to [0, 1] within each source.
Cells with prior_weight = 0 (no source coverage) receive a minimum weight ε = 0.05
to ensure all cells remain reachable.
```
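The weighting formula translates directly to code, using the source weights defined above and the ε floor:

```python
SOURCE_WEIGHTS = {"stack": 0.5, "codesearchnet": 0.3,
                  "humaneval_x_go": 0.1, "spider": 0.1}
EPSILON = 0.05  # minimum weight so zero-coverage cells remain reachable

def cell_prior_weight(cell, source_freqs,
                      source_weights=SOURCE_WEIGHTS, eps=EPSILON):
    """source_freqs: {source_name: {frozenset(constructs): freq in [0, 1]}}."""
    weight = sum(source_weights[s] * source_freqs.get(s, {}).get(cell, 0.0)
                 for s in source_weights)
    return max(weight, eps)
```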
**Modified cell selection with ConstructPrior:**
```
PHASE 1 (exploration):
Select empty cells, weighted by cell_prior_weight.
High-prior cells filled first — these are patterns real developers use.
PHASE 2 (exploitation):
Select lowest-quality filled cells, UCB-weighted,
also weighted by cell_prior_weight.
High-prior, low-quality cells prioritized for improvement.
PHASE 3 (tail coverage):
Cells with prior_weight = ε are visited last, after all
production-relevant cells reach quality > 0.7.
Ensures complete mathematical coverage without wasting
early generation budget on rare combinations.
```
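The three phases can be sketched as one selection function. Phase 2's UCB weighting is simplified here to prior-weighted sampling over low-quality cells, an assumption, since this ADR does not pin down the exact UCB form:

```python
import random

def select_cell_with_prior(cells, prior, eps=0.05, phase3_threshold=0.7):
    """cells: {cell: (best_example, quality)}; prior: {cell: weight, floor eps}."""
    def pick(candidates):
        return random.choices(candidates,
                              weights=[prior[c] for c in candidates], k=1)[0]

    production = [c for c in cells if prior[c] > eps]
    tail = [c for c in cells if prior[c] <= eps]

    # Phase 1 (exploration): empty production-relevant cells, prior-weighted
    empty = [c for c in production if cells[c][0] is None]
    if empty:
        return pick(empty)
    # Phase 2 (exploitation): improve low-quality production-relevant cells
    weak = [c for c in production if cells[c][1] <= phase3_threshold]
    if weak:
        return pick(weak)
    # Phase 3 (tail coverage): rare combinations, visited last
    remaining = [c for c in tail
                 if cells[c][0] is None or cells[c][1] <= phase3_threshold]
    return random.choice(remaining) if remaining else random.choice(list(cells))
```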
**Why this is disruptive:**
1. **First formal connection between DSL dataset synthesis and production code distributions.** Prior dataset synthesis work (MBPP, HumanEval, APPS) uses human-authored problems or scrapes competitive programming sites. For novel DSLs with no prior human authors, this approach provides the first principled method to bootstrap coverage from semantically equivalent languages.
2. **Eliminates the uniform sampling assumption.** Standard Quality-Diversity algorithms treat all niches as equally valuable. The ConstructPrior breaks this assumption: cells that correspond to real production patterns are assigned higher value, producing a dataset whose distribution mirrors real developer usage rather than mathematical combinatorial completeness.
3. **Zero human annotation required.** The prior is derived automatically from public datasets under permissive licenses (The Stack: Apache 2.0; CodeSearchNet: MIT; HumanEval-X: MIT; Spider: CC BY-SA 4.0).
4. **Residual bias is semantic, not structural.** Candidate E's residual bias is the model's ability to satisfy arbitrary construct specifications (some cells may be hard to fill). Candidate F's residual bias is the construct mapping quality (how faithfully Python/Go/SQL constructs map to AVAP equivalents). The latter is measurable, improvable, and fully transparent.
**Expected improvement over Candidate E:**
- RAGAS Composite: +0.03 to +0.08 (hypothesis: production-weighted cells retrieve more relevant examples for real queries)
- Distribution entropy: similar or slightly lower than E (intentionally non-uniform — mirrors production distribution)
- Downstream task success: +5% to +15% on held-out real developer queries (hypothesis: high-prior cells produce examples that match actual query patterns)
**Appropriate when:** Target DSL has identifiable semantic equivalents in mainstream languages, and a production-weighted dataset is preferred over a mathematically uniform one.
---
### Out of Scope — Fine-tuning Approaches (GRPO, DPO)
Gradient-based approaches (GRPO, DPO) address a **different problem**: fine-tuning the inference model after the dataset is built. This ADR concerns dataset synthesis algorithm design. Fine-tuning the inference model is a separate architectural decision, tracked separately, and is not evaluated here.
Per-iteration fine-tuning of the generator (training the generator on its own outputs between batches) is explicitly rejected as a design choice. Iteratively training a model on its own outputs produces cumulative distribution narrowing. The generator (Claude API) and any future inference model must be trained on separate, independently validated datasets.
---
### Candidate D — UCB Bandit over Coverage Regions
**Algorithm class:** Multi-armed bandit.
Coverage regions are arms. UCB selects which region to target via exploration-exploitation tradeoff. Theoretically well-understood convergence guarantees but does not provide construct-level specification — it targets regions, not specific combinations. Less precise than Candidate E.
**Superseded by Candidate E** for the same computational cost with stronger guarantees.
---
## Comparative Summary
| Property | A: CW-Reward | E: MAP-Elites | F: MAP-Elites+Prior |
|---|---|---|---|
| Distribution bias risk | **High** | **None** | **None** |
| Coverage guarantee | Probabilistic | **By construction** | **By construction** |
| Production code alignment | None | None | **Yes (weighted)** |
| LLM parameter updates | No | No | No |
| GPU requirement | None | None | None |
| Works with API-only LLM | Yes | Yes | Yes |
| Interpretability | High | **Very high** | **Very high** |
| Implementation complexity | Low | Medium | **Medium-High** |
| Convergence guarantee | No | **Yes (fill rate)** | **Yes (fill rate)** |
| Residual bias | Model distribution | Cell fill difficulty | Mapping quality |
| External data required | No | No | Yes (public, free) |
| Novel contribution | Low | Medium | **High** |
---
## Evaluation Protocol
### Phase 1 — Candidate A vs Candidate E vs Candidate F
Run all three candidates for 500 generated examples each, same LRM, same parser, same Stage 1 filter. Fixed random seed for reproducibility.
**Primary metrics:**
| Metric | Definition | Expected winner |
|---|---|---|
| Cell fill rate | Fraction of 9,139 cells with ≥1 example (E/F only) | E≈F by construction |
| Coverage breadth | Distinct node types covered / total | E≈F |
| Distribution uniformity | Entropy of node type frequency distribution | E (flatter = better) |
| Production alignment | KL divergence between dataset and ConstructPrior distribution | **F** (by design) |
| Mean cell quality | Average quality score across filled cells | TBD empirically |
| Parser pass rate trend | Pass rate across iterations | A (if few-shots help) |
| Downstream RAGAS | RAGAS Composite on 50 held-out AVAP queries | **Primary decision signal** |
**Distribution uniformity** is the key metric for bias detection (A vs E). Plot node type frequency as a histogram. Candidate A will show a long-tail distribution. Candidate E should show a near-uniform distribution. Candidate F will show a production-weighted distribution (intentionally non-uniform — this is a feature, not a bug).
**Production alignment** is the key metric for F vs E. A dataset with low KL divergence from ConstructPrior produces examples that match real developer usage patterns. If RAGAS(F) > RAGAS(E), this validates the transfer prior hypothesis.
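Both diagnostics are standard information-theoretic measures over node-type counts; a sketch (the smoothing constant is an implementation choice to keep zero-count cells finite):

```python
from math import log2

def entropy_bits(counts):
    """Shannon entropy of a node-type frequency distribution, in bits."""
    total = sum(counts.values())
    probs = [c / total for c in counts.values() if c > 0]
    return -sum(p * log2(p) for p in probs)

def kl_divergence_bits(p_counts, q_counts, smooth=1e-9):
    """KL(P || Q) in bits, e.g. P = dataset, Q = ConstructPrior distribution."""
    keys = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values()) + smooth * len(keys)
    q_total = sum(q_counts.values()) + smooth * len(keys)
    kl = 0.0
    for key in keys:
        p = (p_counts.get(key, 0) + smooth) / p_total
        q = (q_counts.get(key, 0) + smooth) / q_total
        kl += p * log2(p / q)
    return kl
```

A perfectly uniform distribution over N node types gives entropy log2(N) bits, so the 3.0-bit selection threshold corresponds to the spread of roughly 8 equally frequent node types.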
**Selection criterion:**
- A vs E: Candidate E wins if entropy > 3.0 bits AND RAGAS(E) ≥ RAGAS(A).
- E vs F: Candidate F wins if RAGAS(F) > RAGAS(E) by margin ≥ 0.02.
- If F wins both comparisons, F is the production algorithm.
- Fallback: if RAGAS margin F vs E < 0.02, use E (simpler, no external data dependency).
---
## Weight and Hyperparameter Grids
### Candidate A weight grid
| Config | w_ecs | w_novelty | w_tests | Hypothesis |
|---|---|---|---|---|
| A1 | 0.50 | 0.35 | 0.15 | Balanced (baseline) |
| A2 | 0.70 | 0.20 | 0.10 | Coverage-heavy |
| A3 | 0.30 | 0.60 | 0.10 | Novelty-heavy |
| A4 | 0.85 | 0.00 | 0.15 | No novelty (ablation) |
A4 is the critical ablation: does novelty weighting reduce distribution bias, or is ECS alone sufficient?
### Candidate E hyperparameter grid
| Config | Cell size | Selection strategy | α (bonus constructs) |
|---|---|---|---|
| E1 | Pairs only | Empty-first | 0.2 |
| E2 | Pairs + Trios | Empty-first | 0.2 |
| E3 | Pairs + Trios | UCB-weighted | 0.2 |
| E4 | Pairs + Trios | Empty-first | 0.5 |
E2 is the baseline. E3 tests whether UCB cell selection improves quality over simple empty-first ordering. E4 tests whether a higher bonus for extra constructs produces richer examples.
### Candidate F hyperparameter grid
| Config | Prior sources | Phase 3 threshold | ε (tail minimum) | Mapping strictness |
|---|---|---|---|---|
| F1 | All 4 sources (50/30/10/10) | q > 0.7 | 0.05 | Lenient (keyword match) |
| F2 | All 4 sources (50/30/10/10) | q > 0.7 | 0.05 | Strict (AST-level match) |
| F3 | Stack only (100%) | q > 0.7 | 0.05 | Lenient |
| F4 | All 4 sources (50/30/10/10) | q > 0.5 | 0.10 | Lenient |
F1 is the baseline. F2 tests whether strict construct mapping (requiring AST-level evidence vs keyword presence) improves prior quality. F3 is the ablation: does the multi-source mixture add value over The Stack alone? F4 tests earlier phase transition and higher minimum tail weight.
---
## Open Questions for the Scientific Team
1. **Cell selection with difficulty weighting:** Some cells may be intrinsically hard to fill (e.g., combining `go` + `avapConnector` + `ormAccessSelect` in a single coherent example). Should the cell selection strategy account for historical fill difficulty, or treat all cells equally?
2. **Cross-cell quality:** An example generated for cell {A, B} may also be a high-quality example for cell {A, C} if it happens to use C as well. Should examples be indexed against all cells they satisfy, or only the cell they were generated for?
3. **Minimum example length per cell:** Short examples (3–5 lines) can technically satisfy a cell specification with minimal semantic content. Should a minimum code complexity threshold (e.g., minimum AST depth, minimum number of statements) be required for cell admission?
4. **Cell retirement:** Once a cell reaches quality score > 0.90, should it be retired from the selection pool to focus generation effort on harder cells?
5. **Generalisation to KCL:** The KCL grammar has different node types. Does the MAP-Elites cell space need to be redefined per language, or can a universal cell structure be derived from shared construct categories (type_definition, validation, control_flow, io)?
6. **ConstructPrior mapping quality:** The construct mapping (e.g., Python `session.query()` → AVAP `ormAccessSelect`) is heuristic. Should mapping quality be validated against a small manually annotated equivalence set before running the full generation pipeline? If the mapping is noisy, the prior weights may be misleading — a high-frequency Python pattern that maps incorrectly to a rare AVAP pattern would over-weight a non-representative cell.
7. **Prior refresh cadence:** The Stack and CodeSearchNet are static snapshots. If AVAP adoption grows and native AVAP code becomes available, should the ConstructPrior be retrained on AVAP-native data, effectively transitioning from transfer learning to self-supervised learning? Define the minimum corpus size threshold at which native data supersedes the cross-language prior.
---
## Consequences
- `generate_mbap_v2.py` is rewritten to implement Candidate F (MAP-Elites + ConstructPrior) as the primary algorithm. Candidate E (MAP-Elites without prior) is available via `--mode map-elites`. Candidate A (CW-Reward) is available via `--mode reward`. All three modes use identical parser, stage filters, and cell definitions to ensure fair comparison.
- A `ConstructPrior` module (`construct_prior.py`) handles multi-source data download, construct extraction, language-to-AVAP mapping, and co-occurrence matrix construction. This module is isolated from the core MAP-Elites loop and can be updated independently.
- The construct mapping table (language construct → AVAP equivalent) is maintained as a versioned configuration file (`construct_map.yaml`) and must not be modified after generation begins for a given dataset version.
- Results must be documented in `research/reward/` before this ADR is closed. Required artefacts: entropy histograms for A/E/F, KL divergence plots, RAGAS Composite comparison table, cell fill rate heatmaps.
- Any change to cell definitions, quality metrics, or the construct mapping table requires full dataset regeneration.
- Per-iteration fine-tuning of the generator is rejected and will not be re-evaluated without new evidence addressing the distribution narrowing risk.