fix(docs): improve formatting and readability in ADR-0005 for embedding model selection

This commit is contained in:
izapata 2026-03-26 15:32:23 +01:00
parent 1f0d31b7b3
commit 08c5aded35
1 changed file with 77 additions and 71 deletions


@@ -15,7 +15,7 @@ The AVAP RAG pipeline requires an embedding model capable of mapping a hybrid co
A chunk-level audit was performed on the full indexable corpus: the AVAP Language Reference Manual (`avap.md`) and 40 representative `.avap` code samples. Results (`test_chunks.jsonl`, 190 chunks):
| Metric               | Value       |
| -------------------- | ----------- |
| Total chunks         | 190         |
| Total tokens indexed | 11,498      |
| Minimum chunk size   | 1 token     |
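Audit figures like these can be recomputed directly from the JSONL dump. A minimal sketch, assuming each record carries a `tokens` count (the field name is hypothetical; adjust it to the actual `test_chunks.jsonl` schema):

```python
import json

def chunk_stats(jsonl_text):
    """Summarize a chunk audit from JSONL records.
    Assumes each record has a 'tokens' field -- a hypothetical name,
    to be checked against the real test_chunks.jsonl layout."""
    tokens = [json.loads(line)["tokens"]
              for line in jsonl_text.splitlines() if line.strip()]
    return {
        "total_chunks": len(tokens),
        "total_tokens": sum(tokens),
        "min_chunk_tokens": min(tokens),
        "max_chunk_tokens": max(tokens),
    }

# Tiny inline sample standing in for test_chunks.jsonl
sample = '{"tokens": 1}\n{"tokens": 833}\n{"tokens": 120}'
print(chunk_stats(sample))
```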
@@ -29,7 +29,7 @@ A chunk-level audit was performed on the full indexable corpus: the AVAP Languag
**Corpus composition by type:**
| Type                      | Count | Description                               |
| ------------------------- | ----- | ----------------------------------------- |
| Narrative (Spanish prose) | 79    | LRM explanations, concept descriptions    |
| Code chunks               | 83    | AVAP `.avap` sample files                 |
| BNF formal grammar        | 9     | Formal language specification in English  |
@@ -66,7 +66,7 @@ Benchmark confirmation (BEIR evaluation, three datasets):
**CodeXGLUE** (code retrieval from GitHub repositories):
| k  | Qwen2.5-1.5B NDCG | Qwen2.5-1.5B Recall | Qwen3-Emb-0.6B NDCG | Qwen3-Emb-0.6B Recall |
| -- | ----------------- | ------------------- | ------------------- | --------------------- |
| 1  | 0.00031           | 0.00031             | **0.9497**          | **0.9497**            |
| 5  | 0.00086           | 0.00151             | **0.9716**          | **0.9876**            |
| 10 | 0.00118           | 0.00250             | **0.9734**          | **0.9929**            |
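The NDCG and Recall figures in these tables follow the standard BEIR definitions with binary relevance. A minimal sketch; note that with a single relevant document per query, NDCG@1 and Recall@1 coincide, which is why those two columns match at k=1:

```python
import math

def ndcg_at_k(ranked_rels, k):
    """NDCG@k with binary relevance: ranked_rels lists, in rank order,
    whether each retrieved document is relevant (1) or not (0)."""
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ranked_rels[:k]))
    ideal = sorted(ranked_rels, reverse=True)[:k]
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

def recall_at_k(ranked_rels, k, n_relevant):
    """Fraction of all relevant documents retrieved within the top k."""
    return sum(ranked_rels[:k]) / n_relevant if n_relevant else 0.0

# One relevant document in total, ranked second by the model:
print(round(ndcg_at_k([0, 1, 0], 2), 4))   # 0.6309
print(recall_at_k([0, 1, 0], 2, 1))        # 1.0
```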
@@ -74,7 +74,7 @@ Benchmark confirmation (BEIR evaluation, three datasets):
**CoSQA** (natural language queries over code — closest proxy to AVAP retrieval):
| k   | Qwen2.5-1.5B NDCG | Qwen2.5-1.5B Recall | Qwen3-Emb-0.6B NDCG | Qwen3-Emb-0.6B Recall |
| --- | ----------------- | ------------------- | ------------------- | --------------------- |
| 1   | 0.00000           | 0.00000             | **0.1740**          | **0.1740**            |
| 10  | 0.00000           | 0.00000             | **0.3909**          | **0.6700**            |
| 100 | 0.00210           | 0.01000             | **0.4510**          | **0.9520**            |
@@ -82,7 +82,7 @@ Benchmark confirmation (BEIR evaluation, three datasets):
**SciFact** (scientific prose — out-of-domain control):
| k   | Qwen2.5-1.5B NDCG | Qwen2.5-1.5B Recall | Qwen3-Emb-0.6B NDCG | Qwen3-Emb-0.6B Recall |
| --- | ----------------- | ------------------- | ------------------- | --------------------- |
| 1   | 0.02333           | 0.02083             | **0.5633**          | **0.5299**            |
| 10  | 0.04619           | 0.07417             | **0.6855**          | **0.8161**            |
| 100 | 0.07768           | 0.23144             | **0.7129**          | **0.9400**            |
@@ -110,12 +110,14 @@ The model that demonstrates superior performance under the evaluation criteria d
### Qwen3-Embedding-0.6B
**Strengths:**
- Already benchmarked on CodeXGLUE, CoSQA and SciFact — strong results documented
- 32,768 token context window — exceeds corpus requirements by a wide margin
- Same model family as the generation model (Qwen) — shared tokenizer vocabulary
- Lowest integration risk — already validated in the pipeline
**Limitations:**
- Benchmarks are English-only — multilingual performance on the AVAP corpus unvalidated
- Not a dedicated multilingual model — training distribution weighted towards English and Chinese
- No native sparse retrieval support
@@ -125,11 +127,13 @@ The model that demonstrates superior performance under the evaluation criteria d
### BGE-M3
**Strengths:**
- Explicit multilingual contrastive training across 100+ languages including programming languages — direct architectural fit for the intra-chunk Spanish/English/DSL mixing observed in the corpus
- Supports dense, sparse and multi-vector (ColBERT) retrieval from a single model inference — future path to consolidating the current BM25+kNN dual-system architecture (ADR-0003)
- Higher MTEB retrieval score than Qwen3-Embedding-0.6B in the programming domain
**Limitations:**
- Not yet benchmarked on CodeXGLUE, CoSQA or SciFact at the time of candidate selection — no prior empirical results for this corpus
- 8,192 token context window — sufficient for the current corpus (max chunk: 833 tokens, 10.2% utilization) but lower headroom for future corpus growth
- Requires tokenizer alignment: `HF_EMB_MODEL_NAME` must be updated to `BAAI/bge-m3` alongside `OLLAMA_EMB_MODEL_NAME` to keep chunk token counting consistent
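The tokenizer-alignment requirement amounts to keeping two settings in lockstep. A minimal dotenv-style sketch; the exact Ollama model tag is an assumption to be verified against the local registry:

```shell
# Keep both settings pointing at the same model so that the HF tokenizer
# used for chunk token counting matches the model served by Ollama.
HF_EMB_MODEL_NAME=BAAI/bge-m3
OLLAMA_EMB_MODEL_NAME=bge-m3    # hypothetical tag -- check `ollama list`
```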
@@ -186,7 +190,7 @@ BGE-M3 benchmarks were completed on the same three BEIR datasets using identical
**CodeXGLUE** (code retrieval from GitHub repositories):
| Metric | k  | BGE-M3     | Qwen3-Emb-0.6B | Delta (BGE-M3 − Qwen3) |
| ------ | -- | ---------- | -------------- | ---------------------- |
| NDCG   | 1  | **0.9520** | 0.9497         | +0.23 pp               |
| NDCG   | 5  | **0.9738** | 0.9717         | +0.21 pp               |
| NDCG   | 10 | **0.9749** | 0.9734         | +0.15 pp               |
@@ -201,7 +205,7 @@ Both models perform near-identically on CodeXGLUE. All deltas are below 0.25 abs
**CoSQA** (natural language queries over code — most representative proxy for AVAP retrieval):
| Metric | k  | BGE-M3 | Qwen3-Emb-0.6B | Delta (BGE-M3 − Qwen3) |
| ------ | -- | ------ | -------------- | ---------------------- |
| NDCG   | 1  | 0.1160 | **0.1740**     | −5.80 pp               |
| NDCG   | 5  | 0.2383 | **0.3351**     | −9.68 pp               |
| NDCG   | 10 | 0.2878 | **0.3909**     | −10.31 pp              |
@@ -216,7 +220,7 @@ Qwen3-Embedding-0.6B outperforms BGE-M3 on CoSQA by a wide margin at every k. Th
**SciFact** (scientific prose — out-of-domain control):
| Metric | k  | BGE-M3 | Qwen3-Emb-0.6B | Delta (BGE-M3 − Qwen3) |
| ------ | -- | ------ | -------------- | ---------------------- |
| NDCG   | 1  | 0.5100 | **0.5533**     | −4.33 pp               |
| NDCG   | 5  | 0.6190 | **0.6593**     | −4.03 pp               |
| NDCG   | 10 | 0.6431 | **0.6785**     | −3.54 pp               |
@@ -231,7 +235,7 @@ Qwen3-Embedding-0.6B leads BGE-M3 on SciFact by 3–4 absolute percentage points
### BEIR summary — NDCG@10 comparison
| Dataset   | BGE-M3 | Qwen3-Emb-0.6B | Delta     | Leader            |
| --------- | ------ | -------------- | --------- | ----------------- |
| CodeXGLUE | 0.9749 | 0.9734         | +0.15 pp  | BGE-M3 (marginal) |
| CoSQA     | 0.2878 | **0.3909**     | −10.31 pp | **Qwen3**         |
| SciFact   | 0.6431 | **0.6785**     | −3.54 pp  | **Qwen3**         |
@@ -260,6 +264,8 @@ The BEIR benchmarks — the secondary decision signal — favour Qwen3-Embedding
The EvaluateRAG evaluation — the primary decision signal — remains pending. It is the only evaluation that directly measures retrieval quality on the actual AVAP corpus with its intra-chunk multilingual mixing. BGE-M3's architectural fit for multilingual content could still produce a measurable advantage on the production corpus that the English-only BEIR benchmarks cannot capture. No final model selection will be made until EvaluateRAG results are available for both candidates.
Qwen3-Embedding is documented as multilingual and reports strong scores on multilingual benchmarks; the definitive answer, however, will come from the evaluation scores on the AVAP corpus.
---
## Consequences