feat(dataset): add ADR-0006 and scaffold reward algorithm pipeline
This commit is contained in:
parent f9b2b014bb
commit ccd9073a52

@ -0,0 +1,363 @@
# ADR-0006: Reward Algorithm for Self-Improving Dataset Synthesis

**Date:** 2026-03-25

**Status:** Under Evaluation — Primary comparison: Candidate A vs Candidate E vs Candidate F

**Deciders:** Rafael Ruiz (CTO), MrHouston Engineering (AI Team)

**Research lead:** Ivar Zapata

---

## Context

The AVAP dataset synthesis pipeline (Track A) generates AVAP code examples using a large language model, filtered by a three-stage quality pipeline: parser validation (Stage 1), Execution Coverage Score (Stage 2), and semantic novelty (Stage 3). The current pipeline has two structural limitations that the reward mechanism must address.

### Limitation 1 — Static generation

Each batch is generated from the same static prompt (LRM + category description). The generator has no memory of what it has already produced and no model of what "good" looks like for the constructs it has not explored yet.

### Limitation 2 — Distribution bias (the fundamental problem)

The generator (Claude Sonnet) has its own internal distribution over what AVAP code "looks like", derived from its training on mainstream languages. It naturally gravitates toward the simplest patterns — linear code, basic conditionals, single-construct examples — because those are closest to what it knows. Any reward mechanism that selects the best of what the model spontaneously produces and feeds it back as few-shots **amplifies this bias**: the pool fills with what the model does easily, and the model never explores what it does poorly.

This is not model collapse in the classical sense (weights are not updated), but it is **cumulative distribution bias** — the effective generation distribution narrows toward the model's comfort zone with each iteration.

### The correct framing

The solution is not to reward what the model produces spontaneously. It is to **specify externally what must be produced** and evaluate quality relative to that specification. Coverage of the DSL's grammar space must be guaranteed by construction, not hoped for through probabilistic exploration.

---

## Decision

**Conduct a primary comparative evaluation of Candidate A (CW-Reward, reward-driven pool), Candidate E (MAP-Elites, externally-specified coverage cells), and Candidate F (MAP-Elites with ConstructPrior transfer from real production code)** before selecting the production algorithm. Candidates B, C, D are secondary alternatives evaluated only if none of A, E, or F meets quality thresholds.

The fundamental research question has two layers:

1. **Does forced external specification of construct combinations produce a less biased, higher-quality dataset than reward-driven spontaneous exploration?** (A vs E)

2. **Does seeding cell selection with real production code co-occurrence distributions further improve coverage quality and downstream RAG performance over blind MAP-Elites?** (E vs F)

---

## Candidate Analysis

### Candidate A — CW-Reward (Composite Weighted Reward)

**Algorithm class:** In-context reward — no parameter updates.

**Mechanism:** A composite reward is computed for each parser-valid example:

```
reward(e) = w_ecs · ECS(e) + w_novelty · Jaccard_novelty(e, Pool) + w_tests · test_quality(e)
```

High-reward examples enter a GoldPool (top-K). The pool is injected as few-shot context in subsequent generation calls. A coverage summary steers the prompt toward underrepresented constructs.
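The mechanism above can be sketched in a few lines of Python. The weight values mirror config A1 from the weight grid later in this ADR; the pool size is hypothetical, and the ECS and test-quality scores are assumed to be precomputed per example:

```python
# Minimal sketch of the CW-Reward pool update. Names, POOL_K, and the
# example dict shape are illustrative; only the formula follows the ADR.
W_ECS, W_NOVELTY, W_TESTS = 0.50, 0.35, 0.15
POOL_K = 8  # hypothetical GoldPool size

def jaccard_novelty(node_types: set, pool: list) -> float:
    """1 minus the max Jaccard similarity against any pool member's node type set."""
    if not pool:
        return 1.0
    sims = [len(node_types & p["node_types"]) / max(len(node_types | p["node_types"]), 1)
            for p in pool]
    return 1.0 - max(sims)

def reward(example: dict, pool: list) -> float:
    return (W_ECS * example["ecs"]
            + W_NOVELTY * jaccard_novelty(example["node_types"], pool)
            + W_TESTS * example["test_quality"])

def update_pool(pool: list, example: dict) -> list:
    """Score against the current pool, then keep the top-K by reward."""
    scored = {**example, "reward": reward(example, pool)}
    return sorted(pool + [scored], key=lambda e: e["reward"], reverse=True)[:POOL_K]
```

Note that the very property that makes this loop attractive (high-reward examples persist as few-shots) is what amplifies the model's comfort-zone distribution, as discussed below.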
**Known bias risk:** The pool amplifies the model's natural generation distribution. Examples that are easy for the model (simple patterns, single constructs) tend to enter the pool first and persist. The Jaccard novelty metric penalises structural similarity but cannot detect semantic simplicity — two examples with different node type sets can both be trivially shallow.

**Appropriate when:** The base LLM has strong prior knowledge of the target language (mainstream languages). For AVAP, where the model has zero prior knowledge, the bias risk is materially higher.

---

### Candidate E — MAP-Elites with Externally-Defined Coverage Cells (Proposed Primary)

**Algorithm class:** Quality-Diversity algorithm — no parameter updates, coverage guaranteed by construction.

**Core insight:** Instead of rewarding the best examples from spontaneous generation, define the coverage space externally from the grammar and direct the generator to fill specific cells. The model's distribution bias is neutralised because it is never asked to "explore freely" — it is always given a precise specification.

**Coverage space definition:**

The behavior space is defined over **pairs and trios of AVAP node types** drawn from the full grammar vocabulary. Each cell represents a construct combination that must be represented in the dataset:

```
Cell key   = frozenset of 2 or 3 AVAP node types
Cell value = (best_example_so_far, quality_score)

Example cells:
  {"startLoop", "ormAccessSelect"}          → best example using both
  {"try", "go", "RequestPost"}              → best example using all three
  {"function", "if_mode2", "encodeSHA256"}  → best example using all three
```

**Space size:**

- Pairs: C(38, 2) = 703 cells
- Trios: C(38, 3) = 8,436 cells
- Total: 9,139 cells

With 5,000 examples targeted, average coverage is ~0.55 examples per cell — statistical coverage of pairwise and triadic construct combinations is achievable with a focused cell selection strategy. Full coverage of high-prior cells is expected within budget; tail cells are addressed in Phase 3.
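The cell counts follow directly from binomial coefficients over the 38-type vocabulary and can be checked with `math.comb`:

```python
from math import comb

VOCAB = 38  # node types in the AVAP grammar vocabulary

pairs = comb(VOCAB, 2)  # 703 pairwise cells
trios = comb(VOCAB, 3)  # 8,436 triadic cells
total = pairs + trios   # 9,139 cells in total

# 5,000 targeted examples over 9,139 cells ≈ 0.55 examples per cell on average
print(total, round(5_000 / total, 2))
```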
**Generation protocol:**

```
1. SELECT target cell:
   - Empty cells first (exploration phase)
   - Then lowest-quality cells (exploitation phase)
   - Interleave: every 10 calls, select a cell adjacent to a
     recently improved cell (local neighborhood search)

2. SPECIFY in the prompt:
   "Generate an AVAP example that MUST use ALL of these constructs:
   {cell_constructs}. Use additional constructs where natural."

3. VALIDATE:
   a. Parser: syntactically valid? (Stage 1)
   b. Construct presence: all cell constructs in AST? (cell gate)
   c. If both pass → compute cell quality score

4. UPDATE cell:
   If quality > current cell quality → replace cell entry
```

**Cell quality score:**

```
cell_quality(e, cell) =
    construct_fidelity(e, cell)     # fraction of cell constructs actually present
  + α · bonus_constructs(e, cell)   # extra constructs beyond cell specification
  + β · test_quality(e)             # quality of test assertions
  + γ · code_length_norm(e)         # normalised code length (longer = richer)
```

`construct_fidelity` is the primary gate: an example that does not contain all cell constructs scores 0 regardless of other criteria.
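A standalone sketch of the scoring rule with the fidelity gate. The α/β/γ values mirror the defaults used in the generator code later in this commit (0.3/0.2/0.1), and the vocabulary size of 38 comes from the space-size calculation above; the function signature itself is illustrative:

```python
VOCAB_SIZE = 38  # total AVAP node types

def cell_quality(detected: set, cell: frozenset, test_quality: float,
                 richness: float, alpha=0.3, beta=0.2, gamma=0.1) -> float:
    """Gated composite score: missing any required construct zeroes the score."""
    fidelity = len(cell & detected) / max(len(cell), 1)
    if fidelity < 1.0:
        return 0.0  # primary gate from the ADR: all cell constructs must be present
    # Extra constructs beyond the cell spec, normalised by the remaining vocabulary
    bonus_ratio = len(detected - cell) / max(VOCAB_SIZE - len(cell), 1)
    return fidelity + alpha * bonus_ratio + beta * test_quality + gamma * richness
```

For example, an example that satisfies `{"try", "go", "RequestPost"}` and additionally uses `return` scores slightly above 1.0, while an example missing even one required construct scores exactly 0.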
**Why this eliminates distribution bias:**

The model is never asked what it "wants" to generate. It receives a precise specification: "you must use these three constructs." If it produces something that satisfies the specification, it enters the map. If not, it is discarded and the cell remains available for the next attempt. The coverage trajectory is determined by the cell selection strategy, not by the model's natural distribution.

The only residual bias is the model's ability to satisfy arbitrary construct specifications — some cells may be harder to fill than others. This is empirically measurable (fill rate per cell) and is itself a research finding about the generator's capabilities.

**Appropriate when:** The target language is novel or partially unknown to the generator. The external specification mechanism compensates for the model's lack of prior knowledge.

---

### Candidate F — MAP-Elites with ConstructPrior Transfer (Proposed Disruptive Extension)

**Algorithm class:** Quality-Diversity algorithm with informed cell selection — no parameter updates, coverage guaranteed by construction.

**Core insight:** Candidate E specifies *which* constructs must appear but treats all cells as equally valuable. Real production code does not use constructs uniformly: some combinations (e.g., `ormAccessSelect` + `try`) appear in virtually every real API endpoint; others (e.g., `encodeSHA256` + `startLoop`) appear rarely. A golden dataset that mirrors production code distributions will retrieve more relevant examples for real developer queries. The ConstructPrior module transfers this knowledge from large public codebases to weight MAP-Elites cell selection.

**ConstructPrior design:**

```
ConstructPrior = weighted combination of 4 domain sources:

Source 1 — The Stack (BigCode, 50% weight)
  Filter: paths matching /api/, /routes/, /handlers/, /endpoints/
  Languages: Python, Go, JavaScript/TypeScript, Java
  Process: extract function-level code blocks → map language constructs
           to AVAP semantic equivalents → compute co-occurrence frequency
           per (construct_a, construct_b) and (construct_a, construct_b, construct_c)
  Rationale: real microservice API code; largest and most representative source

Source 2 — CodeSearchNet (30% weight)
  Filter: semantic search for "api endpoint", "http handler", "database query"
  Languages: Python, Go, Java, JavaScript
  Process: same mapping pipeline as Source 1
  Rationale: function-docstring pairs provide semantic context for mapping quality

Source 3 — HumanEval-X Go (10% weight)
  Filter: problems using goroutines, channels, wait groups
  Process: map Go concurrency primitives → AVAP {go, gather, startLoop}
  Rationale: AVAP's concurrency model mirrors Go's; coverage of concurrent patterns

Source 4 — Spider SQL Dataset (10% weight)
  Filter: multi-table joins, aggregations, nested queries
  Process: map SQL operations → AVAP {ormAccessSelect, ormAccessInsert, ormAccessUpdate}
  Rationale: AVAP ORM constructs semantically equivalent to SQL clauses
```
**Construct mapping table (AVAP ← source constructs):**

| AVAP construct | Python equivalent | Go equivalent | SQL equivalent |
|---|---|---|---|
| `ormAccessSelect` | `cursor.fetchall()`, `session.query()` | `db.Query()`, `rows.Scan()` | `SELECT` |
| `ormAccessInsert` | `session.add()`, `cursor.execute(INSERT)` | `db.Exec(INSERT)` | `INSERT INTO` |
| `ormAccessUpdate` | `session.merge()`, `cursor.execute(UPDATE)` | `db.Exec(UPDATE)` | `UPDATE` |
| `RequestGet` | `requests.get()`, `httpx.get()` | `http.Get()`, `client.Get()` | — |
| `RequestPost` | `requests.post()`, `httpx.post()` | `http.Post()`, `client.Post()` | — |
| `startLoop` | `for item in list:` | `for _, v := range` | `CURSOR LOOP` |
| `go` + `gather` | `asyncio.gather()`, `ThreadPoolExecutor` | `go func()`, `sync.WaitGroup` | — |
| `try` + `exception` | `try: except:` | `if err != nil` | — |
| `encodeSHA256` | `hashlib.sha256()` | `sha256.New()` | — |
| `function` | `def func():` | `func name()` | `CREATE FUNCTION` |

**Cell weighting formula:**

```
cell_prior_weight(cell) =
    Σ_{s ∈ Sources} weight_s · freq_s(cell_constructs)

where freq_s(cell) = co-occurrence frequency of the construct set in source s,
normalized to [0, 1] within each source.

Cells with prior_weight = 0 (no source coverage) receive a minimum weight ε = 0.05
to ensure all cells remain reachable.
```
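The weighting formula translates directly into code. The frequency tables below are illustrative stand-ins for the real per-source co-occurrence matrices, and the dictionary keys are hypothetical:

```python
# Source weights from the ConstructPrior design above; keys are illustrative labels.
SOURCE_WEIGHTS = {"the_stack": 0.50, "codesearchnet": 0.30,
                  "humaneval_x_go": 0.10, "spider": 0.10}
EPSILON = 0.05  # floor so cells with no source coverage stay reachable

def cell_prior_weight(cell: frozenset, freqs: dict) -> float:
    """freqs[source][cell] is the co-occurrence frequency, normalised to [0, 1]."""
    w = sum(sw * freqs.get(source, {}).get(cell, 0.0)
            for source, sw in SOURCE_WEIGHTS.items())
    return max(w, EPSILON)
```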
**Modified cell selection with ConstructPrior:**

```
PHASE 1 (exploration):
  Select empty cells, weighted by cell_prior_weight.
  High-prior cells filled first — these are patterns real developers use.

PHASE 2 (exploitation):
  Select lowest-quality filled cells, UCB-weighted,
  also weighted by cell_prior_weight.
  High-prior, low-quality cells are prioritized for improvement.

PHASE 3 (tail coverage):
  Cells with prior_weight = ε are visited last, after all
  production-relevant cells reach quality > 0.7.
  Ensures complete mathematical coverage without wasting
  early generation budget on rare combinations.
```
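Phase 1 can be sketched as prior-weighted sampling over the empty cells; `random.choices` performs the weighted draw (function and variable names here are illustrative):

```python
import random

def select_empty_cell(empty_cells: list, prior_weight: dict,
                      epsilon: float = 0.05) -> frozenset:
    """Draw one empty cell, biased toward high-prior (production-relevant) cells."""
    weights = [prior_weight.get(cell, epsilon) for cell in empty_cells]
    return random.choices(empty_cells, weights=weights, k=1)[0]
```

In expectation this fills high-prior cells first while still giving ε-weight tail cells a nonzero chance before Phase 3 sweeps them explicitly.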
**Why this is disruptive:**

1. **First formal connection between DSL dataset synthesis and production code distributions.** Prior dataset synthesis work (MBPP, HumanEval, APPS) uses human-authored problems or scrapes competitive programming sites. For novel DSLs with no prior human authors, this approach provides the first principled method to bootstrap coverage from semantically equivalent languages.

2. **Eliminates the uniform sampling assumption.** Standard Quality-Diversity algorithms treat all niches as equally valuable. The ConstructPrior breaks this assumption: cells that correspond to real production patterns are assigned higher value, producing a dataset whose distribution mirrors real developer usage rather than mathematical combinatorial completeness.

3. **Zero human annotation required.** The prior is derived automatically from public datasets under permissive licenses (The Stack: Apache 2.0; CodeSearchNet: MIT; HumanEval-X: MIT; Spider: CC BY-SA 4.0).

4. **Residual bias is semantic, not structural.** Candidate E's residual bias is the model's ability to satisfy arbitrary construct specifications (some cells may be hard to fill). Candidate F's residual bias is the construct mapping quality (how faithfully Python/Go/SQL constructs map to AVAP equivalents). The latter is measurable, improvable, and fully transparent.

**Expected improvement over Candidate E:**

- RAGAS Composite: +0.03–0.08 (hypothesis: production-weighted cells retrieve more relevant examples for real queries)
- Distribution entropy: similar to or slightly lower than E (intentionally non-uniform — mirrors the production distribution)
- Downstream task success: +5–15% on held-out real developer queries (hypothesis: high-prior cells produce examples that match actual query patterns)

**Appropriate when:** The target DSL has identifiable semantic equivalents in mainstream languages, and a production-weighted dataset is preferred over a mathematically uniform one.

---

### Out of Scope — Fine-tuning Approaches (GRPO, DPO)

Gradient-based approaches (GRPO, DPO) address a **different problem**: fine-tuning the inference model after the dataset is built. This ADR concerns dataset synthesis algorithm design. Fine-tuning the inference model is a separate architectural decision, tracked separately, and is not evaluated here.

Per-iteration fine-tuning of the generator (training the generator on its own outputs between batches) is explicitly rejected as a design choice. Iteratively training a model on its own outputs produces cumulative distribution narrowing. The generator (Claude API) and any future inference model must be trained on separate, independently validated datasets.

---

### Candidate D — UCB Bandit over Coverage Regions

**Algorithm class:** Multi-armed bandit.

Coverage regions are arms. UCB selects which region to target via an exploration-exploitation tradeoff. Its convergence guarantees are theoretically well understood, but it does not provide construct-level specification — it targets regions, not specific combinations, and is therefore less precise than Candidate E.

**Superseded by Candidate E** for the same computational cost with stronger guarantees.

---

## Comparative Summary

| Property | A: CW-Reward | E: MAP-Elites | F: MAP-Elites+Prior |
|---|---|---|---|
| Distribution bias risk | **High** | **None** | **None** |
| Coverage guarantee | Probabilistic | **By construction** | **By construction** |
| Production code alignment | None | None | **Yes (weighted)** |
| LLM parameter updates | No | No | No |
| GPU requirement | None | None | None |
| Works with API-only LLM | Yes | Yes | Yes |
| Interpretability | High | **Very high** | **Very high** |
| Implementation complexity | Low | Medium | **Medium-High** |
| Convergence guarantee | No | **Yes (fill rate)** | **Yes (fill rate)** |
| Residual bias | Model distribution | Cell fill difficulty | Mapping quality |
| External data required | No | No | Yes (public, free) |
| Novel contribution | Low | Medium | **High** |

---
## Evaluation Protocol

### Phase 1 — Candidate A vs Candidate E vs Candidate F

Run all three candidates for 500 generated examples each, same LRM, same parser, same Stage 1 filter. Fixed random seed for reproducibility.

**Primary metrics:**

| Metric | Definition | Expected winner |
|---|---|---|
| Cell fill rate | Fraction of 9,139 cells with ≥1 example (E/F only) | E≈F by construction |
| Coverage breadth | Distinct node types covered / total | E≈F |
| Distribution uniformity | Entropy of node type frequency distribution | E (flatter = better) |
| Production alignment | KL divergence between dataset and ConstructPrior distribution | **F** (by design) |
| Mean cell quality | Average quality score across filled cells | TBD empirically |
| Parser pass rate trend | Pass rate across iterations | A (if few-shots help) |
| Downstream RAGAS | RAGAS Composite on 50 held-out AVAP queries | **Primary decision signal** |

**Distribution uniformity** is the key metric for bias detection (A vs E). Plot node type frequency as a histogram. Candidate A will show a long-tail distribution. Candidate E should show a near-uniform distribution. Candidate F will show a production-weighted distribution (intentionally non-uniform — this is a feature, not a bug).

**Production alignment** is the key metric for F vs E. A dataset with low KL divergence from ConstructPrior produces examples that match real developer usage patterns. If RAGAS(F) > RAGAS(E), this validates the transfer prior hypothesis.
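Production alignment can be computed as a smoothed KL divergence between the dataset's cell-frequency distribution and the ConstructPrior distribution. A sketch; the count dictionaries are illustrative:

```python
from math import log2

def kl_divergence_bits(p_counts: dict, q_counts: dict, eps: float = 1e-9) -> float:
    """D_KL(P || Q) in bits over the union of keys; epsilon smoothing keeps it finite."""
    keys = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values()) + eps * len(keys)
    q_total = sum(q_counts.values()) + eps * len(keys)
    div = 0.0
    for k in keys:
        p = (p_counts.get(k, 0) + eps) / p_total
        q = (q_counts.get(k, 0) + eps) / q_total
        div += p * log2(p / q)
    return div
```

A dataset whose cell counts track the prior exactly scores 0 bits; the further it drifts from production usage, the larger the divergence.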
**Selection criterion:**

- A vs E: Candidate E wins if entropy > 3.0 bits AND RAGAS(E) ≥ RAGAS(A).
- E vs F: Candidate F wins if RAGAS(F) > RAGAS(E) by a margin ≥ 0.02.
- If F wins both comparisons, F is the production algorithm.
- Fallback: if the RAGAS margin of F over E is < 0.02, use E (simpler, no external data dependency).
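The 3.0-bit threshold refers to Shannon entropy over the node type frequency distribution: a perfectly uniform distribution over all 38 node types would reach log2(38) ≈ 5.25 bits, so 3.0 bits marks a heavily long-tailed outcome. A sketch of the computation:

```python
from math import log2

def entropy_bits(freq: dict) -> float:
    """Shannon entropy (bits) of a node-type frequency table."""
    total = sum(freq.values())
    return -sum((n / total) * log2(n / total) for n in freq.values() if n > 0)
```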
---

## Weight and Hyperparameter Grids

### Candidate A weight grid

| Config | w_ecs | w_novelty | w_tests | Hypothesis |
|---|---|---|---|---|
| A1 | 0.50 | 0.35 | 0.15 | Balanced (baseline) |
| A2 | 0.70 | 0.20 | 0.10 | Coverage-heavy |
| A3 | 0.30 | 0.60 | 0.10 | Novelty-heavy |
| A4 | 0.85 | 0.00 | 0.15 | No novelty (ablation) |

A4 is the critical ablation: does novelty weighting reduce distribution bias, or is ECS alone sufficient?

### Candidate E hyperparameter grid

| Config | Cell size | Selection strategy | α (bonus constructs) |
|---|---|---|---|
| E1 | Pairs only | Empty-first | 0.2 |
| E2 | Pairs + Trios | Empty-first | 0.2 |
| E3 | Pairs + Trios | UCB-weighted | 0.2 |
| E4 | Pairs + Trios | Empty-first | 0.5 |

E2 is the baseline. E3 tests whether UCB cell selection improves quality over simple empty-first ordering. E4 tests whether a higher bonus for extra constructs produces richer examples.

### Candidate F hyperparameter grid

| Config | Prior sources | Phase 3 threshold | ε (tail minimum) | Mapping strictness |
|---|---|---|---|---|
| F1 | All 4 sources (50/30/10/10) | q > 0.7 | 0.05 | Lenient (keyword match) |
| F2 | All 4 sources (50/30/10/10) | q > 0.7 | 0.05 | Strict (AST-level match) |
| F3 | Stack only (100%) | q > 0.7 | 0.05 | Lenient |
| F4 | All 4 sources (50/30/10/10) | q > 0.5 | 0.10 | Lenient |

F1 is the baseline. F2 tests whether strict construct mapping (requiring AST-level evidence vs keyword presence) improves prior quality. F3 is the ablation: does the multi-source mixture add value over The Stack alone? F4 tests an earlier phase transition and a higher minimum tail weight.

---
## Open Questions for the Scientific Team

1. **Cell selection with difficulty weighting:** Some cells may be intrinsically hard to fill (e.g., combining `go` + `avapConnector` + `ormAccessSelect` in a single coherent example). Should the cell selection strategy account for historical fill difficulty, or treat all cells equally?

2. **Cross-cell quality:** An example generated for cell {A, B} may also be a high-quality example for cell {A, C} if it happens to use C as well. Should examples be indexed against all cells they satisfy, or only the cell they were generated for?

3. **Minimum example length per cell:** Short examples (3–5 lines) can technically satisfy a cell specification with minimal semantic content. Should a minimum code complexity threshold (e.g., minimum AST depth, minimum number of statements) be required for cell admission?

4. **Cell retirement:** Once a cell reaches quality score > 0.90, should it be retired from the selection pool to focus generation effort on harder cells?

5. **Generalisation to KCL:** The KCL grammar has different node types. Does the MAP-Elites cell space need to be redefined per language, or can a universal cell structure be derived from shared construct categories (type_definition, validation, control_flow, io)?

6. **ConstructPrior mapping quality:** The construct mapping (e.g., Python `session.query()` → AVAP `ormAccessSelect`) is heuristic. Should mapping quality be validated against a small manually annotated equivalence set before running the full generation pipeline? If the mapping is noisy, the prior weights may be misleading — a high-frequency Python pattern that maps incorrectly to a rare AVAP pattern would over-weight a non-representative cell.

7. **Prior refresh cadence:** The Stack and CodeSearchNet are static snapshots. If AVAP adoption grows and native AVAP code becomes available, should the ConstructPrior be retrained on AVAP-native data, effectively transitioning from transfer learning to self-supervised learning? Define the minimum corpus size threshold at which native data supersedes the cross-language prior.

---

## Consequences

- `generate_mbap_v2.py` is rewritten to implement Candidate F (MAP-Elites + ConstructPrior) as the primary algorithm. Candidate E (MAP-Elites without prior) is available via `--mode map-elites`. Candidate A (CW-Reward) is available via `--mode reward`. All three modes use identical parser, stage filters, and cell definitions to ensure fair comparison.

- A `ConstructPrior` module (`construct_prior.py`) handles multi-source data download, construct extraction, language-to-AVAP mapping, and co-occurrence matrix construction. This module is isolated from the core MAP-Elites loop and can be updated independently.

- The construct mapping table (language construct → AVAP equivalent) is maintained as a versioned configuration file (`construct_map.yaml`) and must not be modified after generation begins for a given dataset version.

- Results must be documented in `research/reward/` before this ADR is closed. Required artefacts: entropy histograms for A/E/F, KL divergence plots, RAGAS Composite comparison table, cell fill rate heatmaps.

- Any change to cell definitions, quality metrics, or the construct mapping table requires full dataset regeneration.

- Per-iteration fine-tuning of the generator is rejected and will not be re-evaluated without new evidence addressing the distribution narrowing risk.
File diff suppressed because it is too large

@ -0,0 +1,884 @@
#!/usr/bin/env python3
"""
AVAP Dataset Generator v2 — MAP-Elites Quality-Diversity Pipeline
==================================================================
"""

import argparse
import json
import math
import os
import sys
import time
from collections import defaultdict
from itertools import combinations
from pathlib import Path

import anthropic
import requests

from construct_prior import ConstructPrior, AVAP_NODE_NAMES

AVAP_NODE_TYPES = {
    "addParam": ["addParam("],
    "addResult": ["addResult("],
    "_status": ["_status"],
    "addVar": ["addVar("],
    "getListLen": ["getListLen("],
    "getQueryParamList": ["getQueryParamList("],
    "itemFromList": ["itemFromList("],
    "replace": ["replace("],
    "randomString": ["randomString("],
    "if_mode1": ["if("],
    "if_mode2": ["if(None, None,"],
    "else": ["else()"],
    "end": ["end()"],
    "startLoop": ["startLoop("],
    "endLoop": ["endLoop()"],
    "try": ["try()"],
    "exception": ["exception()"],
    "return": ["return("],
    "go": ["go("],
    "gather": ["gather("],
    "avapConnector": ["avapConnector("],
    "ormCheckTable": ["ormCheckTable("],
    "ormDirect": ["ormDirect("],
    "ormAccessSelect": ["ormAccessSelect("],
    "ormAccessInsert": ["ormAccessInsert("],
    "ormAccessUpdate": ["ormAccessUpdate("],
    "variableFromJSON": ["variableFromJSON("],
    "AddVariableToJSON": ["AddVariableToJSON("],
    "encodeSHA256": ["encodeSHA256("],
    "encodeMD5": ["encodeMD5("],
    "getTimeStamp": ["getTimeStamp("],
    "getDateTime": ["getDateTime("],
    "stampToDatetime": ["stampToDatetime("],
    "RequestGet": ["RequestGet("],
    "RequestPost": ["RequestPost("],
    "function": ["function "],
    "import": ["import "],
    "include": ["include("],
}

NODE_TYPE_NAMES = AVAP_NODE_NAMES

_PRIOR_EPSILON = 0.05

class CellValidator:

    def __init__(self, parser_url: str, parser_timeout: int = 5):
        self.parser_url = parser_url.rstrip("/")
        self.parser_timeout = parser_timeout
        self._parser_available = True

    def parse(self, code: str) -> tuple[bool | None, dict, str]:
        # Returns (valid, ast, error); valid is None when the parser is unreachable.
        if not self._parser_available:
            return None, {}, "parser_unavailable"
        try:
            resp = requests.post(
                f"{self.parser_url}/parse",
                json={"code": code},
                timeout=self.parser_timeout,
            )
            data = resp.json()
            if data.get("valid", False):
                return True, data.get("ast", {}), ""
            return False, {}, data.get("error", "parse error")
        except requests.exceptions.ConnectionError:
            # Parser went away mid-run: remember it and degrade gracefully.
            self._parser_available = False
            return None, {}, "parser_unavailable"
        except Exception as e:
            return False, {}, str(e)

    def detect_constructs(self, code: str, ast: dict) -> set:
        if ast:
            return self._from_ast(ast)
        return self._from_source(code)

    def _from_ast(self, ast: dict) -> set:
        found = set()
        if isinstance(ast, dict):
            if "type" in ast:
                found.add(ast["type"])
            for v in ast.values():
                found |= self._from_ast(v)
        elif isinstance(ast, list):
            for item in ast:
                found |= self._from_ast(item)
        return found

    def _from_source(self, code: str) -> set:
        found = set()
        if "if(None, None," in code:
            found.add("if_mode2")
        elif "if(" in code:
            found.add("if_mode1")
        for name, patterns in AVAP_NODE_TYPES.items():
            if name in ("if_mode1", "if_mode2"):
                continue  # already handled above
            for pat in patterns:
                if pat in code:
                    found.add(name)
                    break
        return found

    def cell_quality(
        self,
        code: str,
        ast: dict,
        test_list: list,
        cell: frozenset,
        alpha: float = 0.3,
        beta: float = 0.2,
        gamma: float = 0.1,
    ) -> tuple[float, dict]:
        """Score an example against its target cell:
        quality = fidelity + alpha*bonus + beta*test_quality + gamma*richness.
        """
        detected = self.detect_constructs(code, ast)
        all_types = set(NODE_TYPE_NAMES)

        # Fidelity: fraction of the cell's required constructs actually present.
        cell_constructs = set(cell)
        present_required = cell_constructs & detected
        fidelity = len(present_required) / max(len(cell_constructs), 1)

        # Bonus: extra constructs beyond the cell, normalized by what remains.
        extra = detected - cell_constructs
        bonus_ratio = len(extra) / max(len(all_types) - len(cell_constructs), 1)

        # Test quality: share of assertions that are non-trivial re.match() checks.
        tq = sum(
            1 for t in test_list
            if isinstance(t, str) and "re.match(" in t and len(t.strip()) > 10
        ) / max(len(test_list), 1)

        # Richness: non-empty line count, capped at 30 lines = 1.0.
        lines = [l.strip() for l in code.split("\n") if l.strip()]
        richness = min(len(lines) / 30.0, 1.0)

        quality = fidelity + alpha * bonus_ratio + beta * tq + gamma * richness

        return quality, {
            "fidelity": round(fidelity, 3),
            "bonus_ratio": round(bonus_ratio, 3),
            "test_quality": round(tq, 3),
            "richness": round(richness, 3),
            "quality": round(quality, 3),
            "detected": sorted(detected),
            "cell": sorted(cell),
            "extra": sorted(extra),
        }

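The quality formula above can be exercised standalone as a sanity check. This is a minimal sketch using the default weights; the component values are illustrative, not taken from a real run:

```python
# Standalone sketch of the cell-quality formula (default weights
# alpha=0.3, beta=0.2, gamma=0.1; component values are illustrative).
def toy_quality(fidelity, bonus_ratio, tq, richness,
                alpha=0.3, beta=0.2, gamma=0.1):
    return fidelity + alpha * bonus_ratio + beta * tq + gamma * richness

# Perfect fidelity, some bonus constructs, all-re.match tests, 15 of 30 lines:
q = toy_quality(fidelity=1.0, bonus_ratio=0.2, tq=1.0, richness=15 / 30.0)
print(round(q, 3))  # 1.0 + 0.3*0.2 + 0.2*1.0 + 0.1*0.5 = 1.31
```

Note that fidelity dominates: the other three terms together add at most 0.6 under the default weights, so an example missing a required construct can never outrank one that covers the cell.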
class CoverageMap:
    """MAP-Elites archive: the best example seen per construct-combination cell."""

    def __init__(self, cell_size: int = 3):
        self.cell_size = cell_size
        # cell -> (example, quality, quality components)
        self._map: dict[frozenset, tuple[dict, float, dict]] = {}
        self._attempts: dict[frozenset, int] = defaultdict(int)
        self._all_cells = self._build_cells()

    def _build_cells(self) -> list[frozenset]:
        cells = []
        for size in range(2, self.cell_size + 1):
            for combo in combinations(NODE_TYPE_NAMES, size):
                cells.append(frozenset(combo))
        return cells

    @property
    def total_cells(self) -> int:
        return len(self._all_cells)

    @property
    def filled_cells(self) -> int:
        return len(self._map)

    @property
    def fill_rate(self) -> float:
        return self.filled_cells / max(self.total_cells, 1)

    def update(
        self,
        cell: frozenset,
        example: dict,
        quality: float,
        components: dict,
    ) -> bool:
        """Keep the example only if it beats the cell's current elite."""
        self._attempts[cell] += 1
        current = self._map.get(cell)
        if current is None or quality > current[1]:
            self._map[cell] = (example, quality, components)
            return True
        return False

    def get_empty_cells(self) -> list[frozenset]:
        return [c for c in self._all_cells if c not in self._map]

    def get_low_quality_cells(self, threshold: float = 0.7) -> list[frozenset]:
        return [
            c for c, (_, q, _) in self._map.items()
            if q < threshold
        ]

    def get_example(self, cell: frozenset) -> dict | None:
        entry = self._map.get(cell)
        return entry[0] if entry else None

    def all_examples(self) -> list[dict]:
        return [ex for ex, _, _ in self._map.values()]

    def node_type_frequency(self) -> dict[str, int]:
        """How often each construct appears across filled cells."""
        freq = defaultdict(int)
        for cell in self._map:
            for nt in cell:
                freq[nt] += 1
        return dict(freq)

    def distribution_entropy(self) -> float:
        """Shannon entropy (bits) of the construct distribution over filled cells."""
        freq = self.node_type_frequency()
        total = sum(freq.values())
        if total == 0:
            return 0.0
        entropy = 0.0
        for count in freq.values():
            p = count / total
            if p > 0:
                entropy -= p * math.log2(p)
        return round(entropy, 3)

    def fill_summary(self) -> str:
        empty = len(self.get_empty_cells())
        low = len(self.get_low_quality_cells())
        entropy = self.distribution_entropy()
        return (
            f"Cells: {self.filled_cells}/{self.total_cells} filled "
            f"({100*self.fill_rate:.1f}%) | "
            f"Low quality: {low} | "
            f"Empty: {empty} | "
            f"Entropy: {entropy:.2f} bits"
        )

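The entropy metric used in `distribution_entropy` can be checked in isolation. A minimal sketch with toy frequency dictionaries (the construct names below are illustrative):

```python
import math

# Shannon entropy (bits) over construct frequencies, mirroring
# CoverageMap.distribution_entropy above. Toy inputs only.
def entropy_bits(freq: dict) -> float:
    total = sum(freq.values())
    if total == 0:
        return 0.0
    h = 0.0
    for count in freq.values():
        p = count / total
        if p > 0:
            h -= p * math.log2(p)
    return round(h, 3)

print(entropy_bits({"if_mode1": 4, "startLoop": 4}))  # uniform over 2 types -> 1.0
print(entropy_bits({"if_mode1": 8}))                  # single type -> 0.0
```

Higher entropy means the archive spreads its coverage more evenly across constructs, which is exactly the anti-bias signal the checkpoint reports track.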
class CellSelector:
    """Pick the next cell to target: empty cells first, then low-quality cells,
    then a UCB trade-off between quality and under-explored cells."""

    def __init__(
        self,
        coverage_map: CoverageMap,
        quality_threshold: float = 0.80,
        ucb_c: float = 1.0,
    ):
        self.map = coverage_map
        self.quality_threshold = quality_threshold
        self.ucb_c = ucb_c
        self._total_calls = 0
        import random
        self._rng = random.Random(42)  # seeded for reproducible runs

    def select(self) -> frozenset:
        self._total_calls += 1
        empty = self.map.get_empty_cells()
        if empty:
            return self._rng.choice(empty)

        low = self.map.get_low_quality_cells(self.quality_threshold)
        if low:
            return self._rng.choice(low)

        return self._ucb_select()

    def _ucb_select(self) -> frozenset:
        best_cell = None
        best_score = -float("inf")
        total = max(self._total_calls, 1)

        for cell in self.map._all_cells:
            attempts = max(self.map._attempts.get(cell, 0), 1)
            entry = self.map._map.get(cell)
            quality = entry[1] if entry else 0.0
            score = quality + self.ucb_c * math.sqrt(math.log(total) / attempts)
            if score > best_score:
                best_score = score
                best_cell = cell

        return best_cell

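The UCB score in `_ucb_select` balances exploiting cells that already hold good elites against exploring cells with few attempts. A minimal standalone sketch of that score (numbers are illustrative):

```python
import math

# UCB-style score as in _ucb_select: best quality seen for the cell plus an
# exploration bonus that shrinks as the cell accumulates attempts.
def ucb_score(quality: float, attempts: int, total_calls: int, c: float = 1.0) -> float:
    return quality + c * math.sqrt(math.log(max(total_calls, 1)) / max(attempts, 1))

# A barely-tried cell (quality 0) can outrank a well-explored good cell:
fresh = ucb_score(0.0, attempts=1, total_calls=100)      # ~2.15
explored = ucb_score(0.85, attempts=50, total_calls=100)  # ~1.15
print(fresh > explored)  # True
```

This is the mechanism that keeps the generator from settling into its comfort zone: hard cells keep a large exploration bonus until they have been attempted enough times.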
class CellSelectorPrior(CellSelector):
    """CellSelector variant that weights cells by a production-code prior."""

    def __init__(
        self,
        coverage_map: CoverageMap,
        prior: ConstructPrior,
        quality_threshold: float = 0.80,
        ucb_c: float = 1.0,
        phase3_threshold: float = 0.70,
    ):
        super().__init__(coverage_map, quality_threshold, ucb_c)
        self.prior = prior
        self.phase3_threshold = phase3_threshold
        self._tail_cells: set[frozenset] = set()
        self._phase3_active = False

    def select(self) -> frozenset:
        self._total_calls += 1
        empty = self.map.get_empty_cells()

        if empty:
            # Prefer empty cells the prior rates clearly above the tail floor.
            high_prior_empty = [
                c for c in empty
                if self.prior.cell_weight(c) > self.prior.epsilon * 1.5
            ]
            if high_prior_empty:
                return self._weighted_sample(high_prior_empty)
            return self._weighted_sample(empty)

        low = self.map.get_low_quality_cells(self.quality_threshold)
        if low:
            return self._ucb_prior_select(low)

        return self._ucb_prior_select(self.map._all_cells)

    def _weighted_sample(self, cells: list[frozenset]) -> frozenset:
        """Roulette-wheel sample proportional to prior weight."""
        weights = [self.prior.cell_weight(c) for c in cells]
        total = sum(weights)
        if total == 0:
            return self._rng.choice(cells)
        r = self._rng.random() * total
        cumsum = 0.0
        for cell, w in zip(cells, weights):
            cumsum += w
            if r <= cumsum:
                return cell
        return cells[-1]

    def _ucb_prior_select(self, cells) -> frozenset:
        """UCB selection with scores scaled by the cell's prior weight."""
        best_cell = None
        best_score = -float("inf")
        total = max(self._total_calls, 1)

        for cell in cells:
            attempts = max(self.map._attempts.get(cell, 0), 1)
            entry = self.map._map.get(cell)
            quality = entry[1] if entry else 0.0
            prior_w = self.prior.cell_weight(cell)
            ucb_term = self.ucb_c * math.sqrt(math.log(total) / attempts)
            score = prior_w * (quality + ucb_term)
            if score > best_score:
                best_score = score
                best_cell = cell

        return best_cell

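The roulette-wheel sampling used in `_weighted_sample` can be demonstrated on its own. A minimal sketch with hypothetical item names and weights, not tied to `ConstructPrior`:

```python
import random

# Roulette-wheel sampling as in _weighted_sample: pick an item with
# probability proportional to its weight (names and weights are illustrative).
def weighted_choice(items, weights, rng):
    total = sum(weights)
    if total == 0:
        return rng.choice(items)
    r = rng.random() * total
    cumsum = 0.0
    for item, w in zip(items, weights):
        cumsum += w
        if r <= cumsum:
            return item
    return items[-1]

rng = random.Random(42)
counts = {"common": 0, "rare": 0}
for _ in range(10_000):
    counts[weighted_choice(["common", "rare"], [0.9, 0.1], rng)] += 1
print(counts)  # "common" drawn roughly 9x as often as "rare"
```

Note the epsilon floor in `ConstructPrior` matters here: without it, constructs absent from production code would get weight 0 and never be sampled at all.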
SYSTEM_PROMPT = """Eres un experto en el lenguaje AVAP.
Se te proporciona el Language Reference Manual (LRM) completo de AVAP.
Tu tarea es generar UN problema de benchmark estilo MBPP para evaluar
modelos de lenguaje en su capacidad de generar código AVAP correcto.

REGLAS ESTRICTAS para el código AVAP generado:
1. Una instrucción por línea. EOL es el terminador absoluto.
2. Sin indentación significativa (es solo decorativa).
3. Bloques: if()...else()...end(), startLoop()...endLoop(), try()...exception()...end()
4. Funciones: function name(args) { ... return(val) }
5. if() Modo 1: if(var_o_literal, var_o_literal, "operador")
6. if() Modo 2: if(None, None, `expresion_completa_como_string`)
7. _status se asigna con: addVar(_status, 404)
8. ormAccessSelect firma: ormAccessSelect(campos, "tabla", selector, varTarget)
9. ormCheckTable firma: ormCheckTable(nombre_tabla, varTarget)
10. ormDirect firma: ormDirect("SELECT ... %s" % var, varTarget)
11. getQueryParamList firma: getQueryParamList(param_name, varTarget)
12. NUNCA uses registerEndpoint(), NUNCA uses mainHandler().
13. El código se ejecuta DIRECTAMENTE, línea a línea.

FORMATO DE SALIDA: responde ÚNICAMENTE con UN objeto JSON válido (no array).
Sin texto adicional, sin bloques de código markdown.
{
"task_id": 1,
"text": "<enunciado del problema en español>",
"code": "<código AVAP con saltos de línea como \\n>",
"test_inputs": { "<param1>": <valor1> },
"test_list": ["re.match(r'<patrón>', <variable>)", ...]
}

test_list: USA ÚNICAMENTE re.match(). NUNCA comparaciones directas (==, !=).
"""

def build_cell_prompt(
    lrm: str,
    cell: frozenset,
    existing_example: dict | None,
    map_summary: str,
) -> str:
    constructs_list = ", ".join(f"`{c}`" for c in sorted(cell))

    improvement_note = ""
    if existing_example:
        improvement_note = f"""
El siguiente ejemplo YA existe para esta combinación con calidad mejorable.
Genera algo DISTINTO y MÁS COMPLEJO que lo supere:

```
{existing_example.get('code', '')}
```
"""

    return f"""# LRM AVAP — Language Reference Manual

{lrm}

---

# ESTADO DEL MAPA DE COBERTURA

{map_summary}

---

# TAREA — ESPECIFICACIÓN OBLIGATORIA

Genera UN ejemplo AVAP que use OBLIGATORIAMENTE TODOS estos constructs:

**{constructs_list}**

El ejemplo DEBE contener todos los constructs listados arriba.
Si tu código no los usa todos, la tarea fracasa.

Adicionalmente:
- Combina los constructs requeridos en un escenario realista de microservicio HTTP
- Añade constructs adicionales donde sea natural (aumenta la puntuación)
- Código complejo y rico — no ejemplos triviales de 3 líneas
- 2-3 aserciones re.match() en test_list
{improvement_note}
Responde ÚNICAMENTE con el objeto JSON. Sin texto antes ni después.
"""

def call_api(
    client: anthropic.Anthropic,
    lrm: str,
    cell: frozenset,
    task_id: int,
    existing_example: dict | None,
    map_summary: str,
    retries: int = 3,
) -> dict | None:
    for attempt in range(1, retries + 1):
        try:
            message = client.messages.create(
                model="claude-sonnet-4-20250514",
                max_tokens=4000,
                system=SYSTEM_PROMPT,
                messages=[{
                    "role": "user",
                    "content": build_cell_prompt(lrm, cell, existing_example, map_summary),
                }],
            )
            raw = message.content[0].text.strip()

            # Strip markdown code fences if the model wrapped its JSON anyway.
            if raw.startswith("```"):
                lines = raw.splitlines()
                raw = "\n".join(lines[1:-1] if lines[-1].strip() == "```" else lines[1:])

            problem = json.loads(raw)
            if not isinstance(problem, dict):
                raise ValueError("Response is not a JSON object")
            for field in ("text", "code", "test_list"):
                if field not in problem:
                    raise ValueError(f"Missing field '{field}'")
            if "test_inputs" not in problem:
                problem["test_inputs"] = {}
            problem["task_id"] = task_id
            return problem

        except (json.JSONDecodeError, ValueError) as e:
            print(f"\n Attempt {attempt}/{retries} — parse error: {e}")
            if attempt < retries:
                time.sleep(2 ** attempt)
        except anthropic.RateLimitError:
            wait = 30 * attempt
            print(f"\n Rate limit — waiting {wait}s...")
            time.sleep(wait)
        except anthropic.APIError as e:
            print(f"\n API error at attempt {attempt}: {e}")
            if attempt < retries:
                time.sleep(5)

    return None

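The fence-stripping step in `call_api` is worth checking on its own, since it has to handle both a fenced and an unfenced response. A minimal standalone sketch of that logic:

```python
import json

# Markdown-fence stripping as done in call_api before json.loads:
# drop a leading ``` line and, if present, the trailing ``` line.
def strip_fences(raw: str) -> str:
    raw = raw.strip()
    if raw.startswith("```"):
        lines = raw.splitlines()
        raw = "\n".join(lines[1:-1] if lines[-1].strip() == "```" else lines[1:])
    return raw

wrapped = '```json\n{"task_id": 1}\n```'
print(json.loads(strip_fences(wrapped)))  # {'task_id': 1}
print(json.loads(strip_fences('{"task_id": 2}')))  # {'task_id': 2}
```

The `lines[1:]` branch covers the case where the model opens a fence but never closes it, which would otherwise leave a stray ``` line and break `json.loads`.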
def run_map_elites(args, client, lrm, output_path):
    """Candidate E: MAP-Elites loop with uniform cell weighting."""
    validator = CellValidator(parser_url=args.parser)
    cmap = CoverageMap(cell_size=args.cell_size)
    selector = CellSelector(cmap, quality_threshold=args.quality_threshold)
    dataset = []
    task_id = 1
    call_count = 0
    valid_count = 0
    cell_updates = 0

    print(f"\n MAP-Elites mode | cells: {cmap.total_cells} | target: {args.problems} examples")
    print(f" Cell size: {args.cell_size} | Quality threshold: {args.quality_threshold}")
    print("─" * 65)

    # Budget: allow up to 4 API calls per target example before giving up.
    max_calls = args.problems * 4

    while len(dataset) < args.problems and call_count < max_calls:

        cell = selector.select()
        existing = cmap.get_example(cell)
        call_count += 1

        print(
            f" [{call_count:04d}] Cell {sorted(cell)} "
            f"| filled={cmap.filled_cells}/{cmap.total_cells} "
            f"| dataset={len(dataset)} ... ",
            end="", flush=True,
        )

        problem = call_api(
            client, lrm, cell, task_id,
            existing_example=existing,
            map_summary=cmap.fill_summary(),
        )

        if problem is None:
            print("SKIP (generation failed)")
            continue

        code = problem["code"]
        test_list = problem.get("test_list", [])

        is_valid, ast, error_msg = validator.parse(code)

        # parse() returns None when the parser service is unreachable;
        # fall back to keyword-based construct detection.
        if is_valid is None:
            is_valid, ast = True, {}
            if call_count == 1:
                print("\n Parser unavailable — using keyword fallback", flush=True)

        if is_valid is False:
            print(f"INVALID ({error_msg[:40]})")
            problem["_validation"] = {"valid": False, "error": error_msg}
            continue

        valid_count += 1

        # Compute cell quality
        quality, components = validator.cell_quality(
            code, ast, test_list, cell,
            alpha=args.alpha, beta=args.beta, gamma=args.gamma,
        )
        problem["_cell"] = sorted(cell)
        problem["_quality"] = components

        # Reject examples that do not cover every required construct.
        if components["fidelity"] < 1.0:
            missing = set(cell) - set(components["detected"])
            print(f"MISSING constructs: {sorted(missing)}")
            continue

        updated = cmap.update(cell, problem, quality, components)
        if updated:
            cell_updates += 1

        dataset.append(problem)
        task_id += 1

        print(
            f"OK quality={quality:.3f} "
            f"fidelity={components['fidelity']:.2f} "
            f"extra={len(components['extra'])}"
        )

        if len(dataset) % 50 == 0:
            _save(dataset, output_path, cmap)
            freq = cmap.node_type_frequency()
            print("\n ── Checkpoint ──────────────────────────────────")
            print(f" Dataset: {len(dataset)} | Valid: {valid_count}/{call_count}")
            print(f" {cmap.fill_summary()}")
            print(f" Top-5 most frequent: {sorted(freq, key=freq.get, reverse=True)[:5]}")
            print(f" Top-5 least frequent: {sorted(freq, key=freq.get)[:5]}")
            print(" ────────────────────────────────────────────────\n")

        time.sleep(0.5)

    _save(dataset, output_path, cmap)
    return dataset, cmap, valid_count, call_count

def run_map_elites_prior(args, client, lrm, output_path):
    """Candidate F: MAP-Elites loop guided by a ConstructPrior."""
    print("\n Loading ConstructPrior...", flush=True)
    prior_map = getattr(args, "prior_map", "construct_map.yaml")
    epsilon = getattr(args, "prior_epsilon", _PRIOR_EPSILON)
    yaml_path = Path(prior_map)

    if yaml_path.exists():
        prior = ConstructPrior.from_yaml(yaml_path, epsilon=epsilon)
    else:
        # Fallback: yaml not found — use static prior and warn
        print(f" [WARN] construct_map.yaml not found at '{yaml_path}'.")
        print(" [WARN] Using static fallback prior. Generate the real prior with:")
        print(" [WARN]   python construct_prior.py --generate-map --github-token TOKEN")
        prior = ConstructPrior.from_static_fallback(epsilon=epsilon)

    print(f" {prior.coverage_summary()}")

    validator = CellValidator(parser_url=args.parser)
    cmap = CoverageMap(cell_size=args.cell_size)
    selector = CellSelectorPrior(
        cmap, prior,
        quality_threshold=args.quality_threshold,
        phase3_threshold=getattr(args, "prior_phase3_threshold", 0.70),
    )
    dataset = []
    task_id = 1
    call_count = 0
    valid_count = 0
    cell_updates = 0

    print(f"\n MAP-Elites+Prior mode | cells: {cmap.total_cells} | target: {args.problems} examples")
    print(f" Cell size: {args.cell_size} | Quality threshold: {args.quality_threshold}")
    print("─" * 65)

    # Budget: allow up to 4 API calls per target example before giving up.
    max_calls = args.problems * 4

    while len(dataset) < args.problems and call_count < max_calls:

        cell = selector.select()
        existing = cmap.get_example(cell)
        prior_w = prior.cell_weight(cell)
        call_count += 1

        print(
            f" [{call_count:04d}] Cell {sorted(cell)} "
            f"| prior={prior_w:.3f} "
            f"| filled={cmap.filled_cells}/{cmap.total_cells} "
            f"| dataset={len(dataset)} ... ",
            end="", flush=True,
        )

        problem = call_api(
            client, lrm, cell, task_id,
            existing_example=existing,
            map_summary=cmap.fill_summary(),
        )

        if problem is None:
            print("SKIP (generation failed)")
            continue

        code = problem["code"]
        test_list = problem.get("test_list", [])

        is_valid, ast, error_msg = validator.parse(code)

        # parse() returns None when the parser service is unreachable;
        # fall back to keyword-based construct detection.
        if is_valid is None:
            is_valid, ast = True, {}
            if call_count == 1:
                print("\n Parser unavailable — using keyword fallback", flush=True)

        if is_valid is False:
            print(f"INVALID ({error_msg[:40]})")
            problem["_validation"] = {"valid": False, "error": error_msg}
            continue

        valid_count += 1

        quality, components = validator.cell_quality(
            code, ast, test_list, cell,
            alpha=args.alpha, beta=args.beta, gamma=args.gamma,
        )
        problem["_cell"] = sorted(cell)
        problem["_prior_weight"] = round(prior_w, 4)
        problem["_quality"] = components

        # Reject examples that do not cover every required construct.
        if components["fidelity"] < 1.0:
            missing = set(cell) - set(components["detected"])
            print(f"MISSING constructs: {sorted(missing)}")
            continue

        updated = cmap.update(cell, problem, quality, components)
        if updated:
            cell_updates += 1

        dataset.append(problem)
        task_id += 1

        print(
            f"OK quality={quality:.3f} "
            f"fidelity={components['fidelity']:.2f} "
            f"prior={prior_w:.3f} "
            f"extra={len(components['extra'])}"
        )

        if len(dataset) % 50 == 0:
            _save(dataset, output_path, cmap, prior=prior)
            freq = cmap.node_type_frequency()
            kl = prior.kl_divergence(freq)
            print("\n ── Checkpoint ──────────────────────────────────")
            print(f" Dataset: {len(dataset)} | Valid: {valid_count}/{call_count}")
            print(f" {cmap.fill_summary()}")
            print(f" KL(dataset ‖ prior): {kl:.4f} (lower = closer to production patterns)")
            print(f" Top-5 most frequent: {sorted(freq, key=freq.get, reverse=True)[:5]}")
            print(f" Top-5 least frequent: {sorted(freq, key=freq.get)[:5]}")
            print(" ────────────────────────────────────────────────\n")

        time.sleep(0.5)

    _save(dataset, output_path, cmap, prior=prior)
    return dataset, cmap, valid_count, call_count, prior

def _save(dataset: list, path: Path, cmap: CoverageMap, prior: ConstructPrior | None = None):
    with open(path, "w", encoding="utf-8") as f:
        json.dump(dataset, f, ensure_ascii=False, indent=2)

    # Save coverage map statistics alongside dataset
    stats_path = path.with_name(path.stem + "_coverage_stats.json")
    freq = cmap.node_type_frequency()
    stats = {
        "total_cells": cmap.total_cells,
        "filled_cells": cmap.filled_cells,
        "fill_rate": round(cmap.fill_rate, 4),
        "distribution_entropy": cmap.distribution_entropy(),
        "node_type_frequency": freq,
        "low_quality_cells": len(cmap.get_low_quality_cells()),
        "empty_cells": len(cmap.get_empty_cells()),
    }
    if prior is not None:
        stats["kl_divergence_dataset_vs_prior"] = prior.kl_divergence(freq)
        stats["prior_summary"] = prior.coverage_summary()
    with open(stats_path, "w", encoding="utf-8") as f:
        json.dump(stats, f, ensure_ascii=False, indent=2)

def main():
    parser = argparse.ArgumentParser(
        description="AVAP Dataset Generator v2 — MAP-Elites Quality-Diversity Pipeline"
    )
    parser.add_argument("--lrm", default="avap.md")
    parser.add_argument("--output", default="output/mbpp_avap_v2.json")
    parser.add_argument("--problems", type=int, default=5000)
    parser.add_argument("--parser", default="http://localhost:8080",
                        help="AVAP parser URL")
    parser.add_argument("--cell-size", type=int, default=3,
                        help="Max constructs per cell: 2=pairs, 3=pairs+trios (default: 3)")
    parser.add_argument("--quality-threshold", type=float, default=0.80,
                        help="Min quality to consider a cell 'good' (default: 0.80)")
    parser.add_argument("--alpha", type=float, default=0.30,
                        help="Weight for bonus constructs in cell quality (default: 0.30)")
    parser.add_argument("--beta", type=float, default=0.20,
                        help="Weight for test quality in cell quality (default: 0.20)")
    parser.add_argument("--gamma", type=float, default=0.10,
                        help="Weight for code richness in cell quality (default: 0.10)")
    parser.add_argument(
        "--mode",
        choices=["map-elites-prior", "map-elites", "reward"],
        default="map-elites-prior",
        help=(
            "map-elites-prior: Candidate F — MAP-Elites + ConstructPrior (default)\n"
            "map-elites: Candidate E — MAP-Elites, uniform cell weighting\n"
            "reward: Candidate A — CW-Reward pool (comparison baseline)"
        ),
    )
    parser.add_argument(
        "--prior-map",
        default="construct_map.yaml",
        metavar="FILE",
        help=(
            "Path to construct_map.yaml generated by construct_prior.py.\n"
            "Generate it first: python construct_prior.py --generate-map\n"
            "Default: construct_map.yaml (in current directory)"
        ),
    )
    parser.add_argument(
        "--prior-epsilon",
        type=float,
        default=_PRIOR_EPSILON,
        help=f"Minimum prior weight for tail cells (default: {_PRIOR_EPSILON})",
    )
    parser.add_argument(
        "--prior-phase3-threshold",
        type=float,
        default=0.70,
        help=(
            "Quality threshold above which Phase 2 ends and tail (low-prior) "
            "cells become the focus. Default: 0.70"
        ),
    )
    parser.add_argument("--api-key", default=None)
    args = parser.parse_args()

    api_key = args.api_key or os.environ.get("ANTHROPIC_API_KEY")
    if not api_key:
        sys.exit("ERROR: ANTHROPIC_API_KEY not set.")

    lrm_path = Path(args.lrm)
    if not lrm_path.exists():
        sys.exit(f"ERROR: LRM '{lrm_path}' not found.")
    lrm = lrm_path.read_text(encoding="utf-8")

    output_path = Path(args.output)
    output_path.parent.mkdir(parents=True, exist_ok=True)

    client = anthropic.Anthropic(api_key=api_key)

    mode_label = {
        "map-elites-prior": "Candidate F — MAP-Elites + ConstructPrior",
        "map-elites": "Candidate E — MAP-Elites (uniform)",
        "reward": "Candidate A — CW-Reward pool",
    }[args.mode]

    print("=" * 65)
    print(" AVAP Dataset Generator v2 — MAP-Elites Pipeline")
    print("=" * 65)
    print(f" Mode           : {mode_label}")
    print(f" LRM            : {lrm_path}")
    print(f" Output         : {output_path}")
    print(f" Target examples: {args.problems}")
    print(f" Parser URL     : {args.parser}")
    print(f" Cell size      : {args.cell_size}")
    print(f" Quality thresh : {args.quality_threshold}")
    if args.mode == "map-elites-prior":
        yaml_exists = Path(args.prior_map).exists()
        print(f" Prior map      : {args.prior_map} ({'✓ found' if yaml_exists else '✗ not found — will use static fallback'})")
        print(f" Prior epsilon  : {args.prior_epsilon}")
    print("=" * 65)

    prior = None

    if args.mode == "map-elites-prior":
        result = run_map_elites_prior(args, client, lrm, output_path)
        dataset, cmap, valid_count, call_count, prior = result
    elif args.mode == "map-elites":
        dataset, cmap, valid_count, call_count = run_map_elites(args, client, lrm, output_path)
    else:
        sys.exit("ERROR: --mode reward (Candidate A) is not yet implemented in v2. "
                 "Use generate_mbap.py for the v1 reward baseline.")

    # Final report
    freq = cmap.node_type_frequency()
    entropy = cmap.distribution_entropy()

    print("\n" + "=" * 65)
    print(" Pipeline complete")
    print(f" Mode                 : {mode_label}")
    print(f" Total API calls      : {call_count}")
    print(f" Valid examples       : {valid_count} ({100*valid_count/max(call_count,1):.1f}%)")
    print(f" Dataset size         : {len(dataset)}")
    print(f" {cmap.fill_summary()}")
    print(f" Distribution entropy : {entropy:.3f} bits (max={math.log2(len(NODE_TYPE_NAMES)):.2f})")
    if prior is not None:
        kl = prior.kl_divergence(freq)
        print(f" KL(dataset ‖ prior)  : {kl:.4f} (0 = perfect alignment with production code)")
    print(f" Most covered         : {sorted(freq, key=freq.get, reverse=True)[:5]}")
    print(f" Least covered        : {sorted(freq, key=freq.get)[:5]}")
    print(f" Output               : {output_path}")
    print("=" * 65)

if __name__ == "__main__":
    main()