
ADR-0006: Reward Algorithm for Self-Improving Dataset Synthesis

Date: 2026-03-25
Status: Under Evaluation — Primary comparison: Candidate A vs Candidate E vs Candidate F
Deciders: Rafael Ruiz (CTO), MrHouston Engineering (AI Team)
Research lead: Ivar Zapata


Context

The AVAP dataset synthesis pipeline (Track A) generates AVAP code examples using a large language model, filtered by a three-stage quality pipeline: parser validation (Stage 1), Execution Coverage Score (Stage 2), and semantic novelty (Stage 3). The current pipeline has two structural limitations that the reward mechanism must address.

Limitation 1 — Static generation

Each batch is generated from the same static prompt (LRM + category description). The generator has no memory of what it has already produced and no model of what "good" looks like for the constructs it hasn't explored yet.

Limitation 2 — Distribution bias (the fundamental problem)

The generator (Anthropic's claude-sonnet) has its own internal distribution over what AVAP code "looks like", derived from its training on mainstream languages. It naturally gravitates toward the simplest patterns — linear code, basic conditionals, single-construct examples — because those are closest to what it knows. Any reward mechanism that selects the best of what the model spontaneously produces and feeds it back as few-shots amplifies this bias: the pool fills with what the model does easily, and the model never explores what it does poorly.

This is not model collapse in the classical sense (weights are not updated), but it is cumulative distribution bias — the effective generation distribution narrows toward the model's comfort zone with each iteration.

The correct framing

The solution is not to reward what the model produces spontaneously. It is to specify externally what must be produced and evaluate quality relative to that specification. Coverage of the DSL's grammar space must be guaranteed by construction, not hoped for through probabilistic exploration.


Decision

Conduct a primary comparative evaluation of Candidate A (CW-Reward, reward-driven pool), Candidate E (MAP-Elites, externally-specified coverage cells), and Candidate F (MAP-Elites with ConstructPrior transfer from real production code) before selecting the production algorithm. Candidates B, C, D are secondary alternatives evaluated only if none of A, E, or F meets quality thresholds.

The fundamental research question has two layers:

  1. Does forced external specification of construct combinations produce a less biased, higher-quality dataset than reward-driven spontaneous exploration? (A vs E)
  2. Does seeding cell selection with real production code co-occurrence distributions further improve coverage quality and downstream RAG performance over blind MAP-Elites? (E vs F)

Candidate Analysis

Candidate A — CW-Reward (Composite Weighted Reward)

Algorithm class: In-context reward — no parameter updates.

Mechanism: A composite reward is computed for each parser-valid example:

reward(e) = w_ecs · ECS(e) + w_novelty · Jaccard_novelty(e, Pool) + w_tests · test_quality(e)

High-reward examples enter a GoldPool (top-K). The pool is injected as few-shot context in subsequent generation calls. Coverage summary steers the prompt toward underrepresented constructs.
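The mechanism above can be sketched as follows. This is a minimal illustration, not the pipeline's implementation: the metric values are assumed to arrive precomputed from Stages 2–3, and `GoldPool` with its `k` default is a hypothetical stand-in for the real top-K pool.

```python
import heapq
import itertools

def cw_reward(ecs, novelty, tests, w_ecs=0.50, w_novelty=0.35, w_tests=0.15):
    # reward(e) = w_ecs * ECS(e) + w_novelty * Jaccard_novelty(e, Pool)
    #           + w_tests * test_quality(e)
    # Defaults follow config A1 (balanced baseline).
    return w_ecs * ecs + w_novelty * novelty + w_tests * tests

class GoldPool:
    """Top-K pool of high-reward examples, injected as few-shot context."""
    def __init__(self, k=32):
        self.k = k
        self._heap = []              # min-heap of (reward, seq, example)
        self._seq = itertools.count()

    def offer(self, example, reward):
        entry = (reward, next(self._seq), example)
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, entry)
        elif reward > self._heap[0][0]:
            heapq.heapreplace(self._heap, entry)  # evict current minimum

    def few_shots(self):
        # Highest-reward examples first, for prompt injection.
        return [ex for _, _, ex in sorted(self._heap, reverse=True)]
```

Note that nothing in this loop counteracts the bias risk described below: whatever the model produces easily is exactly what accumulates in the pool.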

Known bias risk: The pool amplifies the model's natural generation distribution. Examples that are easy for the model (simple patterns, single constructs) tend to enter the pool first and persist. The Jaccard novelty metric penalises structural similarity but cannot detect semantic simplicity — two examples with different node type sets can both be trivially shallow.

Appropriate when: The base LLM has strong prior knowledge of the target language (mainstream languages). For AVAP, where the model has zero prior knowledge, the bias risk is materially higher.


Candidate E — MAP-Elites with Externally-Defined Coverage Cells (Proposed Primary)

Algorithm class: Quality-Diversity algorithm — no parameter updates, coverage guaranteed by construction.

Core insight: Instead of rewarding the best examples from spontaneous generation, define the coverage space externally from the grammar and direct the generator to fill specific cells. The model's distribution bias is neutralised because it is never asked to "explore freely" — it is always given a precise specification.

Coverage space definition:

The behavior space is defined over pairs and trios of AVAP node types drawn from the full grammar vocabulary. Each cell represents a construct combination that must be represented in the dataset:

Cell key   = frozenset of 2 or 3 AVAP node types
Cell value = (best_example_so_far, quality_score)

Example cells:
  {"startLoop", "ormAccessSelect"}          → best example using both
  {"try", "go", "RequestPost"}              → best example using all three
  {"function", "if_mode2", "encodeSHA256"}  → best example using all three

Space size:

  • Pairs: C(38, 2) = 703 cells
  • Trios: C(38, 3) = 8,436 cells
  • Total: 9,139 cells
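A sketch of the cell-space construction, confirming the counts above. `NODE_TYPES` is a placeholder vocabulary standing in for the 38 real AVAP node types:

```python
from itertools import combinations
from math import comb

# Placeholder vocabulary; the real pipeline uses the 38 AVAP node types.
NODE_TYPES = [f"node_{i}" for i in range(38)]

def build_cells(vocab):
    """Empty MAP-Elites map: frozenset cell key -> (best_example, quality)."""
    cells = {}
    for r in (2, 3):                      # pairs and trios
        for combo in combinations(vocab, r):
            cells[frozenset(combo)] = (None, 0.0)
    return cells

cells = build_cells(NODE_TYPES)
assert len(cells) == comb(38, 2) + comb(38, 3) == 703 + 8436  # 9,139 cells
```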

With 5,000 examples targeted, average coverage is ~0.55 examples per cell — statistical coverage of pairwise and triadic construct combinations is achievable with a focused cell-selection strategy. Full coverage of high-prior cells is expected within budget; tail cells are addressed in Phase 3.

Generation protocol:

1. SELECT target cell:
     - Empty cells first (exploration phase)
     - Then lowest-quality cells (exploitation phase)
     - Interleave: every 10 calls, select a cell adjacent to a
       recently improved cell (local neighborhood search)

2. SPECIFY in the prompt:
     "Generate an AVAP example that MUST use ALL of these constructs:
      {cell_constructs}. Use additional constructs where natural."

3. VALIDATE:
     a. Parser: syntactically valid? (Stage 1)
     b. Construct presence: all cell constructs in AST? (cell gate)
     c. If both pass → compute cell quality score

4. UPDATE cell:
     If quality > current cell quality → replace cell entry
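One iteration of the protocol can be sketched as below. `generate_for`, `parse_ok`, `constructs_of`, and `quality` are hypothetical hooks standing in for the LLM call, the Stage 1 parser, AST construct extraction, and the cell quality score; the interleaved neighborhood search of step 1 is omitted for brevity.

```python
import random

def select_cell(cells, rng=random):
    """Empty cells first (exploration), then lowest-quality (exploitation)."""
    empty = [k for k, (ex, _) in cells.items() if ex is None]
    if empty:
        return rng.choice(empty)
    return min(cells, key=lambda k: cells[k][1])

def step(cells, generate_for, parse_ok, constructs_of, quality):
    cell = select_cell(cells)
    example = generate_for(cell)            # prompt: MUST use all cell constructs
    if not parse_ok(example):               # Stage 1: syntactic validity
        return cells
    if not cell <= constructs_of(example):  # cell gate: all constructs present
        return cells
    q = quality(example, cell)
    if q > cells[cell][1]:                  # elitist update: keep the best only
        cells[cell] = (example, q)
    return cells
```

A failed attempt leaves the cell unchanged, so it remains available for the next selection pass.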

Cell quality score:

cell_quality(e, cell) =
    construct_fidelity(e, cell)    # fraction of cell constructs actually present
  + α · bonus_constructs(e, cell)  # extra constructs beyond cell specification
  + β · test_quality(e)            # quality of test assertions
  + γ · code_length_norm(e)        # normalised code length (longer = richer)

construct_fidelity is the primary gate: an example that does not contain all cell constructs scores 0 regardless of other criteria.
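The score and its gate can be expressed compactly. In this sketch `alpha` follows the E2 baseline (0.2), while `beta` and `gamma` are illustrative values not fixed by this ADR:

```python
def cell_quality(constructs, cell, test_q, length_norm,
                 alpha=0.2, beta=0.3, gamma=0.1):
    """constructs: set of node types in the example's AST.
    alpha matches the E2 baseline; beta and gamma are illustrative."""
    fidelity = len(cell & constructs) / len(cell)
    if fidelity < 1.0:    # primary gate: any missing cell construct scores 0
        return 0.0
    bonus = len(constructs - cell)   # extra constructs beyond the cell spec
    return fidelity + alpha * bonus + beta * test_q + gamma * length_norm
```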

Why this eliminates distribution bias:

The model is never asked what it "wants" to generate. It receives a precise specification: "you must use these three constructs." If it produces something that satisfies the specification, it enters the map. If not, it is discarded and the cell remains available for the next attempt. The coverage trajectory is determined by the cell selection strategy, not by the model's natural distribution.

The only residual bias is the model's ability to satisfy arbitrary construct specifications — some cells may be harder to fill than others. This is empirically measurable (fill rate per cell) and is itself a research finding about the generator's capabilities.

Appropriate when: The target language is novel or partially unknown to the generator. The external specification mechanism compensates for the model's lack of prior knowledge.


Candidate F — MAP-Elites with ConstructPrior Transfer (Proposed Disruptive Extension)

Algorithm class: Quality-Diversity algorithm with informed cell selection — no parameter updates, coverage guaranteed by construction.

Core insight: Candidate E specifies which constructs must appear but treats all cells as equally valuable. Real production code does not use constructs uniformly: some combinations (e.g., ormAccessSelect + try) appear in virtually every real API endpoint; others (e.g., encodeSHA256 + startLoop) appear rarely. A golden dataset that mirrors production code distributions will retrieve more relevant examples for real developer queries. The ConstructPrior module transfers this knowledge from large public codebases to weight MAP-Elites cell selection.

ConstructPrior design:

ConstructPrior = weighted combination of 4 domain sources:

  Source 1 — The Stack (BigCode, 50% weight)
    Filter: paths matching /api/, /routes/, /handlers/, /endpoints/
    Languages: Python, Go, JavaScript/TypeScript, Java
    Process: extract function-level code blocks → map language constructs
             to AVAP semantic equivalents → compute co-occurrence frequency
             per (construct_a, construct_b) and (construct_a, construct_b, construct_c)
    Rationale: real microservice API code; largest and most representative source

  Source 2 — CodeSearchNet (30% weight)
    Filter: semantic search for "api endpoint", "http handler", "database query"
    Languages: Python, Go, Java, JavaScript
    Process: same mapping pipeline as Source 1
    Rationale: function-docstring pairs provide semantic context for mapping quality

  Source 3 — HumanEval-X Go (10% weight)
    Filter: problems using goroutines, channels, wait groups
    Process: map Go concurrency primitives → AVAP {go, gather, startLoop}
    Rationale: AVAP's concurrency model mirrors Go's; coverage of concurrent patterns

  Source 4 — Spider SQL Dataset (10% weight)
    Filter: multi-table joins, aggregations, nested queries
    Process: map SQL operations → AVAP {ormAccessSelect, ormAccessInsert, ormAccessUpdate}
    Rationale: AVAP ORM constructs semantically equivalent to SQL clauses

Construct mapping table (AVAP ← source constructs):

| AVAP construct  | Python equivalent                       | Go equivalent              | SQL equivalent  |
|-----------------|-----------------------------------------|----------------------------|-----------------|
| ormAccessSelect | cursor.fetchall(), session.query()      | db.Query(), rows.Scan()    | SELECT          |
| ormAccessInsert | session.add(), cursor.execute(INSERT)   | db.Exec(INSERT)            | INSERT INTO     |
| ormAccessUpdate | session.merge(), cursor.execute(UPDATE) | db.Exec(UPDATE)            | UPDATE          |
| RequestGet      | requests.get(), httpx.get()             | http.Get(), client.Get()   |                 |
| RequestPost     | requests.post(), httpx.post()           | http.Post(), client.Post() |                 |
| startLoop       | for item in list:                       | for _, v := range          | CURSOR LOOP     |
| go + gather     | asyncio.gather(), ThreadPoolExecutor    | go func(), sync.WaitGroup  |                 |
| try + exception | try: except:                            | if err != nil              |                 |
| encodeSHA256    | hashlib.sha256()                        | sha256.New()               |                 |
| function        | def func():                             | func name()                | CREATE FUNCTION |

Cell weighting formula:

cell_prior_weight(cell) =
    Σ_{s ∈ Sources} weight_s · freq_s(cell_constructs)

  where freq_s(cell) = co-occurrence frequency of the construct set in source s,
  normalized to [0, 1] within each source.

  Cells with prior_weight = 0 (no source coverage) receive a minimum weight ε = 0.05
  to ensure all cells remain reachable.
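A sketch of the weighting formula with the source weights from the design above. The `freqs` data layout (per-source dictionaries of normalized co-occurrence frequencies) is an assumption about how `construct_prior.py` might expose its matrices:

```python
SOURCE_WEIGHTS = {"the_stack": 0.50, "codesearchnet": 0.30,
                  "humaneval_x_go": 0.10, "spider": 0.10}
EPSILON = 0.05  # minimum weight so zero-prior cells stay reachable

def cell_prior_weight(cell, freqs, weights=SOURCE_WEIGHTS, eps=EPSILON):
    """freqs: {source_name: {frozenset cell key: freq normalized to [0, 1]}}"""
    w = sum(weights[s] * freqs.get(s, {}).get(cell, 0.0) for s in weights)
    return max(w, eps)
```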

Modified cell selection with ConstructPrior:

PHASE 1 (exploration):
  Select empty cells, weighted by cell_prior_weight.
  High-prior cells filled first — these are patterns real developers use.

PHASE 2 (exploitation):
  Select lowest-quality filled cells, UCB-weighted,
  also weighted by cell_prior_weight.
  High-prior, low-quality cells are prioritized, concentrating
  improvement effort on production-relevant patterns.

PHASE 3 (tail coverage):
  Cells with prior_weight = ε are visited last, after all
  production-relevant cells reach quality > 0.7.
  Ensures complete mathematical coverage without wasting
  early generation budget on rare combinations.
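The three phases can be sketched as a single selection function. This is illustrative only: the UCB term of Phase 2 is omitted for brevity, and the thresholds follow config F1 (q > 0.7, ε = 0.05):

```python
import random

def select_cell_with_prior(cells, prior, eps=0.05, q_threshold=0.7,
                           rng=random):
    """cells: {key: (example, quality)}; prior: {key: prior weight}."""
    empty = [k for k, (ex, _) in cells.items() if ex is None]
    phase1 = [k for k in empty if prior[k] > eps]
    if phase1:   # Phase 1: empty, production-relevant cells, prior-weighted
        return rng.choices(phase1, weights=[prior[k] for k in phase1])[0]
    weak = [k for k, (ex, q) in cells.items()
            if ex is not None and q <= q_threshold]
    if weak:     # Phase 2: low-quality filled cells, prior-weighted
        return rng.choices(weak, weights=[prior[k] for k in weak])[0]
    if empty:    # Phase 3: tail cells (prior == eps), visited last
        return rng.choice(empty)
    return min(cells, key=lambda k: cells[k][1])
```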

Why this is disruptive:

  1. First formal connection between DSL dataset synthesis and production code distributions. Prior dataset synthesis work (MBPP, HumanEval, APPS) uses human-authored problems or scrapes competitive programming sites. For novel DSLs with no prior human authors, this approach provides the first principled method to bootstrap coverage from semantically equivalent languages.

  2. Eliminates the uniform sampling assumption. Standard Quality-Diversity algorithms treat all niches as equally valuable. The ConstructPrior breaks this assumption: cells that correspond to real production patterns are assigned higher value, producing a dataset whose distribution mirrors real developer usage rather than mathematical combinatorial completeness.

  3. Zero human annotation required. The prior is derived automatically from public datasets under permissive licenses (The Stack: Apache 2.0; CodeSearchNet: MIT; HumanEval-X: MIT; Spider: CC BY-SA 4.0).

  4. Residual bias is semantic, not structural. Candidate E's residual bias is the model's ability to satisfy arbitrary construct specifications (some cells may be hard to fill). Candidate F's residual bias is the construct mapping quality (how faithfully Python/Go/SQL constructs map to AVAP equivalents). The latter is measurable, improvable, and fully transparent.

Expected improvement over Candidate E:

  • RAGAS Composite: +0.03–0.08 (hypothesis: production-weighted cells retrieve more relevant examples for real queries)
  • Distribution entropy: similar or slightly lower than E (intentionally non-uniform — mirrors production distribution)
  • Downstream task success: +5–15% on held-out real developer queries (hypothesis: high-prior cells produce examples that match actual query patterns)

Appropriate when: Target DSL has identifiable semantic equivalents in mainstream languages, and a production-weighted dataset is preferred over a mathematically uniform one.


Out of Scope — Fine-tuning Approaches (GRPO, DPO)

Gradient-based approaches (GRPO, DPO) address a different problem: fine-tuning the inference model after the dataset is built. This ADR concerns dataset synthesis algorithm design. Fine-tuning the inference model is a separate architectural decision, tracked separately, and is not evaluated here.

Per-iteration fine-tuning of the generator (training the generator on its own outputs between batches) is explicitly rejected as a design choice. Iteratively training a model on its own outputs produces cumulative distribution narrowing. The generator (Claude API) and any future inference model must be trained on separate, independently validated datasets.


Candidate D — UCB Bandit over Coverage Regions

Algorithm class: Multi-armed bandit.

Coverage regions are arms. UCB selects which region to target via exploration-exploitation tradeoff. Theoretically well-understood convergence guarantees but does not provide construct-level specification — it targets regions, not specific combinations. Less precise than Candidate E.

Superseded by Candidate E, which provides stronger guarantees at the same computational cost.


Comparative Summary

| Property                  | A: CW-Reward       | E: MAP-Elites        | F: MAP-Elites+Prior |
|---------------------------|--------------------|----------------------|---------------------|
| Distribution bias risk    | High               | None                 | None                |
| Coverage guarantee        | Probabilistic      | By construction      | By construction     |
| Production code alignment | None               | None                 | Yes (weighted)      |
| LLM parameter updates     | No                 | No                   | No                  |
| GPU requirement           | None               | None                 | None                |
| Works with API-only LLM   | Yes                | Yes                  | Yes                 |
| Interpretability          | High               | Very high            | Very high           |
| Implementation complexity | Low                | Medium               | Medium-High         |
| Convergence guarantee     | No                 | Yes (fill rate)      | Yes (fill rate)     |
| Residual bias             | Model distribution | Cell fill difficulty | Mapping quality     |
| External data required    | No                 | No                   | Yes (public, free)  |
| Novel contribution        | Low                | Medium               | High                |

Evaluation Protocol

Phase 1 — Candidate A vs Candidate E vs Candidate F

Run all three candidates for 500 generated examples each, same LRM, same parser, same Stage 1 filter. Fixed random seed for reproducibility.

Primary metrics:

| Metric                   | Definition                                                    | Expected winner         |
|--------------------------|---------------------------------------------------------------|-------------------------|
| Cell fill rate           | Fraction of 9,139 cells with ≥1 example (E/F only)            | E≈F by construction     |
| Coverage breadth         | Distinct node types covered / total                           | E≈F                     |
| Distribution uniformity  | Entropy of node type frequency distribution                   | E (flatter = better)    |
| Production alignment     | KL divergence between dataset and ConstructPrior distribution | F (by design)           |
| Mean cell quality        | Average quality score across filled cells                     | TBD empirically         |
| Parser pass rate trend   | Pass rate across iterations                                   | A (if few-shots help)   |
| Downstream RAGAS         | RAGAS Composite on 50 held-out AVAP queries                   | Primary decision signal |

Distribution uniformity is the key metric for bias detection (A vs E). Plot node type frequency as a histogram. Candidate A will show a long-tail distribution. Candidate E should show a near-uniform distribution. Candidate F will show a production-weighted distribution (intentionally non-uniform — this is a feature, not a bug).

Production alignment is the key metric for F vs E. A dataset with low KL divergence from ConstructPrior produces examples that match real developer usage patterns. If RAGAS(F) > RAGAS(E), this validates the transfer prior hypothesis.
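The two distribution metrics can be sketched directly from their definitions; the smoothing constant in the KL computation is an assumed detail, not specified by this ADR:

```python
from math import log2

def entropy_bits(node_type_counts):
    """Shannon entropy (bits) of a node-type frequency distribution."""
    total = sum(node_type_counts.values())
    ps = [c / total for c in node_type_counts.values() if c]
    return -sum(p * log2(p) for p in ps)

def kl_divergence(p_counts, q_counts, smooth=1e-9):
    """KL(P || Q) in bits; smooth avoids division by zero on unseen keys."""
    keys = set(p_counts) | set(q_counts)
    pt, qt = sum(p_counts.values()), sum(q_counts.values())
    return sum(
        (p_counts.get(k, 0) / pt)
        * log2((p_counts.get(k, 0) / pt + smooth)
               / (q_counts.get(k, 0) / qt + smooth))
        for k in keys if p_counts.get(k, 0)
    )
```

A perfectly uniform distribution over the 38 node types would score log2(38) ≈ 5.25 bits, which puts the 3.0-bit threshold below in context.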

Selection criterion:

  • A vs E: Candidate E wins if entropy > 3.0 bits AND RAGAS(E) ≥ RAGAS(A).
  • E vs F: Candidate F wins if RAGAS(F) > RAGAS(E) by margin ≥ 0.02.
  • If F wins both comparisons, F is the production algorithm.
  • Fallback: if RAGAS margin F vs E < 0.02, use E (simpler, no external data dependency).
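One literal reading of the criteria above, expressed as a decision function (the function name and argument layout are illustrative):

```python
def select_algorithm(entropy_e, ragas_a, ragas_e, ragas_f,
                     entropy_min=3.0, margin=0.02):
    """Returns the production algorithm per the A/E/F selection criteria."""
    e_beats_a = entropy_e > entropy_min and ragas_e >= ragas_a
    f_beats_e = ragas_f > ragas_e + margin
    if e_beats_a and f_beats_e:
        return "F"
    if e_beats_a:
        return "E"   # fallback: simpler, no external data dependency
    return "A"
```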

Weight and Hyperparameter Grids

Candidate A weight grid

| Config | w_ecs | w_novelty | w_tests | Hypothesis            |
|--------|-------|-----------|---------|-----------------------|
| A1     | 0.50  | 0.35      | 0.15    | Balanced (baseline)   |
| A2     | 0.70  | 0.20      | 0.10    | Coverage-heavy        |
| A3     | 0.30  | 0.60      | 0.10    | Novelty-heavy         |
| A4     | 0.85  | 0.00      | 0.15    | No novelty (ablation) |

A4 is the critical ablation: does novelty weighting reduce distribution bias, or is ECS alone sufficient?

Candidate E hyperparameter grid

| Config | Cell size     | Selection strategy | α (bonus constructs) |
|--------|---------------|--------------------|----------------------|
| E1     | Pairs only    | Empty-first        | 0.2                  |
| E2     | Pairs + Trios | Empty-first        | 0.2                  |
| E3     | Pairs + Trios | UCB-weighted       | 0.2                  |
| E4     | Pairs + Trios | Empty-first        | 0.5                  |

E2 is the baseline. E3 tests whether UCB cell selection improves quality over simple empty-first ordering. E4 tests whether a higher bonus for extra constructs produces richer examples.

Candidate F hyperparameter grid

| Config | Prior sources               | Phase 3 threshold | ε (tail minimum) | Mapping strictness       |
|--------|-----------------------------|-------------------|------------------|--------------------------|
| F1     | All 4 sources (50/30/10/10) | q > 0.7           | 0.05             | Lenient (keyword match)  |
| F2     | All 4 sources (50/30/10/10) | q > 0.7           | 0.05             | Strict (AST-level match) |
| F3     | Stack only (100%)           | q > 0.7           | 0.05             | Lenient                  |
| F4     | All 4 sources (50/30/10/10) | q > 0.5           | 0.10             | Lenient                  |

F1 is the baseline. F2 tests whether strict construct mapping (requiring AST-level evidence vs keyword presence) improves prior quality. F3 is the ablation: does the multi-source mixture add value over The Stack alone? F4 tests earlier phase transition and higher minimum tail weight.


Open Questions for the Scientific Team

  1. Cell selection with difficulty weighting: Some cells may be intrinsically hard to fill (e.g., combining go + avapConnector + ormAccessSelect in a single coherent example). Should the cell selection strategy account for historical fill difficulty, or treat all cells equally?

  2. Cross-cell quality: An example generated for cell {A, B} may also be a high-quality example for cell {A, C} if it happens to use C as well. Should examples be indexed against all cells they satisfy, or only the cell they were generated for?

  3. Minimum example length per cell: Short examples (3–5 lines) can technically satisfy a cell specification with minimal semantic content. Should a minimum code complexity threshold (e.g., minimum AST depth, minimum number of statements) be required for cell admission?

  4. Cell retirement: Once a cell reaches quality score > 0.90, should it be retired from the selection pool to focus generation effort on harder cells?

  5. Generalisation to KCL: The KCL grammar has different node types. Does the MAP-Elites cell space need to be redefined per language, or can a universal cell structure be derived from shared construct categories (type_definition, validation, control_flow, io)?

  6. ConstructPrior mapping quality: The construct mapping (e.g., Python session.query() → AVAP ormAccessSelect) is heuristic. Should mapping quality be validated against a small manually annotated equivalence set before running the full generation pipeline? If the mapping is noisy, the prior weights may be misleading — a high-frequency Python pattern that maps incorrectly to a rare AVAP pattern would over-weight a non-representative cell.

  7. Prior refresh cadence: The Stack and CodeSearchNet are static snapshots. If AVAP adoption grows and native AVAP code becomes available, should the ConstructPrior be retrained on AVAP-native data, effectively transitioning from transfer learning to self-supervised learning? Define the minimum corpus size threshold at which native data supersedes the cross-language prior.


Consequences

  • generate_mbap_v2.py is rewritten to implement Candidate F (MAP-Elites + ConstructPrior) as the primary algorithm. Candidate E (MAP-Elites without prior) is available via --mode map-elites. Candidate A (CW-Reward) is available via --mode reward. All three modes use identical parser, stage filters, and cell definitions to ensure fair comparison.
  • A ConstructPrior module (construct_prior.py) handles multi-source data download, construct extraction, language-to-AVAP mapping, and co-occurrence matrix construction. This module is isolated from the core MAP-Elites loop and can be updated independently.
  • The construct mapping table (language construct → AVAP equivalent) is maintained as a versioned configuration file (construct_map.yaml) and must not be modified after generation begins for a given dataset version.
  • Results must be documented in research/reward/ before this ADR is closed. Required artefacts: entropy histograms for A/E/F, KL divergence plots, RAGAS Composite comparison table, cell fill rate heatmaps.
  • Any change to cell definitions, quality metrics, or the construct mapping table requires full dataset regeneration.
  • Per-iteration fine-tuning of the generator is rejected and will not be re-evaluated without new evidence addressing the distribution narrowing risk.