Continued ADR-0005 and created ADR-0006

Benchmark confirmation (BEIR evaluation, three datasets):
Qwen2.5-1.5B is eliminated. **Qwen3-Embedding-0.6B is the validated baseline.**

### Why a comparative evaluation was required before adopting Qwen3

Qwen3-Embedding-0.6B's benchmark results were obtained on English-only datasets. They eliminated Qwen2.5-1.5B decisively but did not characterise Qwen3's behaviour on the multilingual mixed corpus that AVAP represents. A second candidate — **BGE-M3** — presented theoretical advantages for this specific corpus that could not be assessed without empirical comparison.

The index rebuild required to adopt any model is destructive and must be done once. Given that the embedding model directly determines the quality of all RAG retrieval in production, adopting a model without a direct comparison between the two viable candidates would not have met the due diligence required for a decision of this impact.

---

## Decision

A **head-to-head comparative evaluation** of BGE-M3 and Qwen3-Embedding-0.6B is being conducted under identical conditions before either is adopted as the production embedding model.

The model that demonstrates superior performance under the evaluation criteria defined below is adopted. This ADR moves to Accepted upon completion of that evaluation, with the selected model documented as the outcome.

---

- Higher MTEB retrieval score than Qwen3-Embedding-0.6B in the programming domain

**Limitations:**

- Not yet benchmarked on CodeXGLUE, CoSQA or SciFact at the time of candidate selection — no prior empirical results for this corpus
- 8,192 token context window — sufficient for current corpus (max chunk: 833 tokens, 10.2% utilization) but lower headroom for future corpus growth
- Requires tokenizer alignment: `HF_EMB_MODEL_NAME` must be updated to `BAAI/bge-m3` alongside `OLLAMA_EMB_MODEL_NAME` to keep chunk token counting consistent
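A minimal sketch of the aligned configuration, assuming a dotenv-style settings file (the variable names come from this ADR; the Ollama model tag `bge-m3` is an assumption):

```env
# Both variables must move together; updating only the Ollama name leaves the
# chunker counting tokens with the old vocabulary (silent chunk-size drift).
OLLAMA_EMB_MODEL_NAME=bge-m3
HF_EMB_MODEL_NAME=BAAI/bge-m3
```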

**Corpus fit assessment:** The intra-chunk multilingual mixing (18.9% of chunks) and the Spanish prose component (79 narrative chunks) are the corpus characteristics most likely to differentiate BGE-M3 from Qwen3. The BEIR and EvaluateRAG evaluations determine whether this theoretical advantage translates to measurable retrieval improvement.

### VRAM

Both candidates output 1024-dimensional vectors.

## Evaluation Protocol

Both models are evaluated under identical conditions. All results are documented in `research/embeddings/`.

**Step 1 — BEIR benchmarks**

CodeXGLUE, CoSQA and SciFact were run with **BGE-M3** using the same BEIR evaluation scripts and configuration used for Qwen3-Embedding-0.6B. Qwen3-Embedding-0.6B results already existed in `research/embeddings/` and served as the baseline. Reported metrics: NDCG@k, MAP@k, Recall@k and Precision@k at k = 1, 3, 5, 10, 100.
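For reference, the headline ranking metric can be sketched with a minimal binary-relevance NDCG@k (a simplification; the BEIR tooling computes graded-relevance variants of these metrics):

```python
import math

def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """Binary-relevance NDCG@k: DCG of the ranking divided by the ideal DCG."""
    gains = [1.0 if doc in relevant_ids else 0.0 for doc in ranked_ids[:k]]
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    # Ideal ranking places every relevant document at the top
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant_ids), k)))
    return dcg / ideal if ideal > 0 else 0.0

# Example: the single relevant document retrieved at rank 2
print(round(ndcg_at_k(["d7", "d3", "d9"], {"d3"}, k=10), 4))  # → 0.6309
```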

**Step 2 — EvaluateRAG on AVAP corpus**

The Elasticsearch index is rebuilt twice — once with each model — and `EvaluateRAG` is run against the production AVAP golden dataset for both. Reported RAGAS scores: faithfulness, answer_relevancy, context_recall, context_precision, and global score with verdict.
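How the global score and verdict are derived is not specified here; as an illustrative assumption, treating the global score as the plain mean of the four RAGAS metrics with a pass threshold:

```python
def global_score(scores, threshold=0.7):
    """Aggregate RAGAS metrics into a global score and verdict.

    The plain-mean aggregation and the 0.7 threshold are assumptions for
    illustration; EvaluateRAG's actual formula may differ.
    """
    metrics = ("faithfulness", "answer_relevancy", "context_recall", "context_precision")
    score = sum(scores[m] for m in metrics) / len(metrics)
    return score, ("PASS" if score >= threshold else "FAIL")

score, verdict = global_score({
    "faithfulness": 0.91, "answer_relevancy": 0.84,
    "context_recall": 0.78, "context_precision": 0.72,
})
print(round(score, 3), verdict)
```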

**Selection criterion**

All margin comparisons use **absolute percentage points** in NDCG@10.

If the EvaluateRAG global scores are within 5 absolute percentage points of each other, the BEIR results determine the outcome under the following conditions:

- BGE-M3 exceeds Qwen3-Embedding-0.6B by more than 2 absolute percentage points on mean NDCG@10 across all three BEIR datasets, AND
- BGE-M3 does not underperform Qwen3-Embedding-0.6B by more than 2 absolute percentage points on CoSQA NDCG@10 specifically.

If neither condition is met — that is, if EvaluateRAG scores are within 5 points and BGE-M3 does not clear both BEIR thresholds — Qwen3-Embedding-0.6B is adopted. It carries lower integration risk, its benchmarks are already documented, and it is the validated baseline for the system.
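The selection logic can be sketched as a small decision function. The function name and the reading that EvaluateRAG decides outright outside the 5-point tie band are ours; the thresholds come from the protocol above:

```python
def select_model(rag_gap_pp, mean_ndcg10_gap_pp, cosqa_ndcg10_gap_pp):
    """Apply the selection criterion.

    All gaps are BGE-M3 minus Qwen3, in absolute percentage points.
    rag_gap_pp is the EvaluateRAG global-score gap; the other two are
    BEIR NDCG@10 gaps (mean across datasets, and CoSQA specifically).
    """
    if abs(rag_gap_pp) > 5:
        # Outside the tie band: EvaluateRAG decides directly (our reading)
        return "BGE-M3" if rag_gap_pp > 0 else "Qwen3-Embedding-0.6B"
    # Tie band: BGE-M3 must clear BOTH BEIR thresholds, else the baseline wins
    if mean_ndcg10_gap_pp > 2 and cosqa_ndcg10_gap_pp >= -2:
        return "BGE-M3"
    return "Qwen3-Embedding-0.6B"

# With the measured BEIR gaps (mean −4.56 pp, CoSQA −10.31 pp),
# any EvaluateRAG tie resolves to the baseline:
print(select_model(rag_gap_pp=0.0, mean_ndcg10_gap_pp=-4.56,
                   cosqa_ndcg10_gap_pp=-10.31))  # → Qwen3-Embedding-0.6B
```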

---

## Rationale

### Step 1 results — BEIR head-to-head comparison

BGE-M3 benchmarks were completed on the same three BEIR datasets using identical evaluation scripts and configuration. Full results are stored in `research/embeddings/embedding_eval_results/emb_models_result.json`. The following tables compare both candidates side by side.

**CodeXGLUE** (code retrieval from GitHub repositories):

| Metric | k | BGE-M3 | Qwen3-Emb-0.6B | Delta (BGE-M3 − Qwen3) |
|---|---|---|---|---|
| NDCG | 1 | **0.9520** | 0.9497 | +0.23 pp |
| NDCG | 5 | **0.9738** | 0.9717 | +0.21 pp |
| NDCG | 10 | **0.9749** | 0.9734 | +0.15 pp |
| NDCG | 100 | **0.9763** | 0.9745 | +0.18 pp |
| Recall | 1 | **0.9520** | 0.9497 | +0.23 pp |
| Recall | 5 | **0.9892** | 0.9876 | +0.16 pp |
| Recall | 10 | 0.9928 | **0.9930** | −0.02 pp |
| Recall | 100 | **0.9989** | 0.9981 | +0.08 pp |

Both models perform near-identically on CodeXGLUE. All deltas are below 0.25 absolute percentage points. This dataset does not differentiate the candidates.

**CoSQA** (natural language queries over code — most representative proxy for AVAP retrieval):

| Metric | k | BGE-M3 | Qwen3-Emb-0.6B | Delta (BGE-M3 − Qwen3) |
|---|---|---|---|---|
| NDCG | 1 | 0.1160 | **0.1740** | −5.80 pp |
| NDCG | 5 | 0.2383 | **0.3351** | −9.68 pp |
| NDCG | 10 | 0.2878 | **0.3909** | −10.31 pp |
| NDCG | 100 | 0.3631 | **0.4510** | −8.79 pp |
| Recall | 1 | 0.1160 | **0.1740** | −5.80 pp |
| Recall | 5 | 0.3660 | **0.5020** | −13.60 pp |
| Recall | 10 | 0.5160 | **0.6700** | −15.40 pp |
| Recall | 100 | 0.8740 | **0.9520** | −7.80 pp |

Qwen3-Embedding-0.6B outperforms BGE-M3 on CoSQA by a wide margin at every k. The NDCG@10 gap is 10.31 absolute percentage points. CoSQA is the most representative proxy for the AVAP retrieval use case — it pairs natural language queries with code snippets — making this the most significant BEIR result.

**SciFact** (scientific prose — out-of-domain control):

| Metric | k | BGE-M3 | Qwen3-Emb-0.6B | Delta (BGE-M3 − Qwen3) |
|---|---|---|---|---|
| NDCG | 1 | 0.5100 | **0.5533** | −4.33 pp |
| NDCG | 5 | 0.6190 | **0.6593** | −4.03 pp |
| NDCG | 10 | 0.6431 | **0.6785** | −3.54 pp |
| NDCG | 100 | 0.6705 | **0.7056** | −3.51 pp |
| Recall | 1 | 0.4818 | **0.5243** | −4.25 pp |
| Recall | 5 | 0.7149 | **0.7587** | −4.38 pp |
| Recall | 10 | 0.7834 | **0.8144** | −3.10 pp |
| Recall | 100 | 0.9037 | **0.9367** | −3.30 pp |

Qwen3-Embedding-0.6B leads BGE-M3 on SciFact by 3–4 absolute percentage points across all metrics. The gap is consistent but narrower than on CoSQA.

### BEIR summary — NDCG@10 comparison

| Dataset | BGE-M3 | Qwen3-Emb-0.6B | Delta | Leader |
|---|---|---|---|---|
| CodeXGLUE | 0.9749 | 0.9734 | +0.15 pp | BGE-M3 (marginal) |
| CoSQA | 0.2878 | **0.3909** | −10.31 pp | **Qwen3** |
| SciFact | 0.6431 | **0.6785** | −3.54 pp | **Qwen3** |
| **Mean** | 0.6353 | **0.6809** | −4.56 pp | **Qwen3** |

Qwen3-Embedding-0.6B leads on mean NDCG@10 by 4.56 absolute percentage points, driven primarily by a 10.31 pp advantage on CoSQA.
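The summary means can be reproduced directly from the per-dataset NDCG@10 values above (a quick arithmetic check):

```python
# Per-dataset NDCG@10 from the BEIR summary table
ndcg10 = {
    "CodeXGLUE": {"bge_m3": 0.9749, "qwen3": 0.9734},
    "CoSQA":     {"bge_m3": 0.2878, "qwen3": 0.3909},
    "SciFact":   {"bge_m3": 0.6431, "qwen3": 0.6785},
}

mean_bge = sum(d["bge_m3"] for d in ndcg10.values()) / len(ndcg10)
mean_qwen = sum(d["qwen3"] for d in ndcg10.values()) / len(ndcg10)
# Delta computed from the rounded means, as reported in the table
delta_pp = (round(mean_bge, 4) - round(mean_qwen, 4)) * 100

print(round(mean_bge, 4), round(mean_qwen, 4))  # → 0.6353 0.6809
print(f"{delta_pp:+.2f} pp")                    # → -4.56 pp
```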

### Application of tiebreaker criteria to BEIR results

Per the evaluation protocol, if EvaluateRAG global scores are within 5 absolute percentage points, the BEIR tiebreaker applies. The tiebreaker requires BGE-M3 to meet **both** conditions:

1. **BGE-M3 must exceed Qwen3 by more than 2 pp on mean NDCG@10.** Result: BGE-M3 trails by 4.56 pp. **Condition not met.**
2. **BGE-M3 must not underperform Qwen3 by more than 2 pp on CoSQA NDCG@10.** Result: BGE-M3 trails by 10.31 pp. **Condition not met.**

Neither tiebreaker condition is satisfied. Under the defined protocol, if the EvaluateRAG evaluation results in a tie (within 5 pp), the BEIR tiebreaker resolves in favour of Qwen3-Embedding-0.6B.

### Step 2 results — EvaluateRAG on AVAP corpus

The golden dataset is not yet available, so Step 2 cannot proceed at this time.

_Pending. Results will be documented here upon completion of the EvaluateRAG evaluation for both models._

### Preliminary assessment

The BEIR benchmarks — the secondary decision signal — favour Qwen3-Embedding-0.6B across both the most representative dataset (CoSQA, −10.31 pp) and the out-of-domain control (SciFact, −3.54 pp), with CodeXGLUE effectively tied. BGE-M3's theoretical advantage from multilingual contrastive training does not translate to superior performance on these English-only benchmarks.

The EvaluateRAG evaluation — the primary decision signal — remains pending. It is the only evaluation that directly measures retrieval quality on the actual AVAP corpus with its intra-chunk multilingual mixing. BGE-M3's architectural fit for multilingual content could still produce a measurable advantage on the production corpus that the English-only BEIR benchmarks cannot capture. No final model selection will be made until EvaluateRAG results are available for both candidates.

---

## Consequences

- **Index rebuild required** regardless of which model is adopted. Vectors from Qwen2.5-1.5B are incompatible with either candidate. The existing index is deleted before re-ingestion.
- **Two index rebuilds required for the evaluation.** One per candidate for the EvaluateRAG step. Given the current corpus size (190 chunks, 11,498 tokens), rebuild time is not a meaningful constraint.
- **Tokenizer alignment for BGE-M3.** If BGE-M3 is selected, both `OLLAMA_EMB_MODEL_NAME` and `HF_EMB_MODEL_NAME` are updated. Updating only `OLLAMA_EMB_MODEL_NAME` causes the chunker to estimate token counts using the wrong vocabulary — a silent bug that produces inconsistent chunk sizes without raising any error.
- **Future model changes.** Any future replacement of the embedding model follows the same evaluation protocol — BEIR benchmarks on the same three datasets plus EvaluateRAG — before an ADR update is accepted. Results are documented in `research/embeddings/`.

---

# ADR-0006: Code Indexing Improvements — Comparative Evaluation of Code Chunking Strategies

**Date:** 2026-03-24
**Status:** Proposed
**Deciders:** Rafael Ruiz (CTO), MrHouston Engineering

---

## Context

Efficient code indexing is a critical component for enabling high-quality code search, retrieval-augmented generation (RAG), and semantic understanding in developer tooling. The main challenge lies in representing source code in a way that preserves its syntactic and semantic structure while remaining suitable for embedding-based retrieval systems.

In this context, we explored different strategies to improve the indexing of .avap code files, starting from a naïve approach and progressively moving toward more structured representations based on parsing techniques.

### Alternatives
- File-level chunking (baseline):

  Each .avap file is treated as a single chunk and indexed directly. This approach is simple and fast but ignores internal structure (functions, classes, blocks).

- EBNF chunking as metadata:

  Each .avap file is still treated as a single chunk and indexed directly. However, using the AVAP EBNF syntax, we extract the AST structure and inject it into the chunk metadata.

- Full EBNF chunking:

  Each .avap file is still treated as a single chunk. The difference from the previous two approaches is that the AST is indexed instead of the code.

- Grammar definition chunking:

  Code is segmented using a language-specific configuration (`avap_config.json`) instead of one-file chunks. The chunker applies a lexer (comments/strings), identifies multi-line blocks (`function`, `if`, `startLoop`, `try`), classifies single-line statements (`registerEndpoint`, `orm_command`, `http_command`, etc.), and enriches every chunk with semantic tags (`uses_orm`, `uses_http`, `uses_async`, `returns_result`, among others).

  This strategy also extracts function signatures as dedicated lightweight chunks and propagates local context between nearby chunks (semantic overlap), improving retrieval precision for both API-level and implementation-level queries.
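The classification and tagging steps can be illustrated with a minimal sketch. The keyword-to-tag rules below are simplified assumptions; the real mapping lives in `avap_config.json`:

```python
import re

# Simplified keyword → tag rules; the real mapping lives in avap_config.json
TAG_RULES = {
    "uses_orm": re.compile(r"\borm_command\b"),
    "uses_http": re.compile(r"\bhttp_command\b"),
    "uses_async": re.compile(r"\basync\b"),
    "returns_result": re.compile(r"\breturn\b"),
}
BLOCK_OPENERS = ("function", "if", "startLoop", "try")

def enrich_chunk(chunk_text):
    """Classify a chunk and attach semantic tags, as in the grammar-definition strategy."""
    first = chunk_text.lstrip().split(None, 1)[0] if chunk_text.strip() else ""
    kind = "block" if first in BLOCK_OPENERS else "statement"
    tags = [tag for tag, pat in TAG_RULES.items() if pat.search(chunk_text)]
    return {"kind": kind, "tags": tags, "text": chunk_text}

chunk = enrich_chunk("function getUser(id)\n  orm_command('select')\n  return user")
print(chunk["kind"], chunk["tags"])  # → block ['uses_orm', 'returns_result']
```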

### Indexed docs

For each strategy, we created a separate Elasticsearch index with its own characteristics. The first three approaches produce 33 chunks (1 chunk per file), whereas the grammar definition approach produces 89 chunks.

### How can we evaluate each strategy?

**Evaluation Protocol:**

1. **Golden Dataset**
   - Generate a set of natural language queries paired with their ground-truth context (filename).
   - Each query should be answerable by examining one or more code samples.
   - Example: Query="How do you handle errors in AVAP?" → Context="try_catch_request.avap"

2. **Test Each Strategy**
   - For each of the 4 chunking strategies, run the same set of queries against the respective Elasticsearch index.
   - Record the top-10 retrieved chunks for each query.

3. **Metrics**
   - `NDCG@10`: Normalized discounted cumulative gain at rank 10 (measures ranking quality).
   - `Recall@10`: Fraction of relevant chunks retrieved in top 10.
   - `MRR@10`: Mean reciprocal rank (position of first relevant result).

4. **Relevance Judgment**
   - A chunk is considered relevant if it contains code directly answering the query.
   - For file-level strategies: entire file is relevant or irrelevant.
   - For grammar-definition: specific block/statement chunks are relevant even if the full file is not.

5. **Acceptance Criteria**
   - **Grammar definition must achieve at least a 10% improvement in NDCG@10 over file-level baseline.**
   - **Recall@10 must not drop by more than 5 absolute percentage points vs file-level.**
   - **Index size increase must remain below 50% of baseline.**
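Steps 3 and 5 can be sketched together as a per-query rank metric plus an acceptance check. Reading the 10% NDCG@10 criterion as a relative improvement and expressing index growth as a fraction of baseline are our assumptions:

```python
def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant result for a single query."""
    for i, doc in enumerate(ranked_ids[:k], start=1):
        if doc in relevant_ids:
            return 1.0 / i
    return 0.0

def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of relevant chunks retrieved in the top k."""
    hits = sum(1 for doc in ranked_ids[:k] if doc in relevant_ids)
    return hits / len(relevant_ids) if relevant_ids else 0.0

def accepts(grammar, baseline, index_growth):
    """Acceptance criteria: ≥10% relative NDCG@10 gain, ≤5 pp Recall@10 drop,
    index growth below 50% of baseline (growth given as a fraction)."""
    return (grammar["ndcg10"] >= 1.10 * baseline["ndcg10"]
            and grammar["recall10"] >= baseline["recall10"] - 0.05
            and index_growth < 0.50)

print(mrr_at_k(["a", "b", "c"], {"b"}))  # → 0.5
# Hypothetical scores, for illustration only:
print(accepts({"ndcg10": 0.72, "recall10": 0.88},
              {"ndcg10": 0.61, "recall10": 0.90}, 0.35))  # → True
```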

## Decision

## Rationale

## Consequences