fix(docs): improve formatting and readability in ADR-0005 for embedding model selection

This commit is contained in:
izapata 2026-03-26 15:32:23 +01:00
parent 1f0d31b7b3
commit 08c5aded35
1 changed file with 77 additions and 71 deletions


@@ -15,7 +15,7 @@ The AVAP RAG pipeline requires an embedding model capable of mapping a hybrid co
A chunk-level audit was performed on the full indexable corpus: the AVAP Language Reference Manual (`avap.md`) and 40 representative `.avap` code samples. Results (`test_chunks.jsonl`, 190 chunks):
| Metric               | Value       |
| -------------------- | ----------- |
| Total chunks         | 190         |
| Total tokens indexed | 11,498      |
| Minimum chunk size   | 1 token     |
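Audit figures like these can be recomputed directly from the JSONL dump. A minimal sketch, assuming each record carries a `tokens` count (the field name is hypothetical; adjust it to the actual `test_chunks.jsonl` schema):

```python
import json

def chunk_stats(jsonl_text):
    """Summarize a chunk audit from JSONL records.
    Assumes each record has a 'tokens' field -- a hypothetical name,
    to be checked against the real test_chunks.jsonl layout."""
    tokens = [json.loads(line)["tokens"]
              for line in jsonl_text.splitlines() if line.strip()]
    return {
        "total_chunks": len(tokens),
        "total_tokens": sum(tokens),
        "min_chunk_tokens": min(tokens),
        "max_chunk_tokens": max(tokens),
    }

# Tiny inline sample standing in for test_chunks.jsonl
sample = '{"tokens": 1}\n{"tokens": 833}\n{"tokens": 120}'
print(chunk_stats(sample))
```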
@@ -29,7 +29,7 @@ A chunk-level audit was performed on the full indexable corpus: the AVAP Languag
**Corpus composition by type:**
| Type                      | Count | Description                               |
| ------------------------- | ----- | ----------------------------------------- |
| Narrative (Spanish prose) | 79    | LRM explanations, concept descriptions    |
| Code chunks               | 83    | AVAP `.avap` sample files                 |
| BNF formal grammar        | 9     | Formal language specification in English  |
@@ -66,7 +66,7 @@ Benchmark confirmation (BEIR evaluation, three datasets):
**CodeXGLUE** (code retrieval from GitHub repositories):
| k  | Qwen2.5-1.5B NDCG | Qwen2.5-1.5B Recall | Qwen3-Emb-0.6B NDCG | Qwen3-Emb-0.6B Recall |
| -- | ----------------- | ------------------- | ------------------- | --------------------- |
| 1  | 0.00031           | 0.00031             | **0.9497**          | **0.9497**            |
| 5  | 0.00086           | 0.00151             | **0.9716**          | **0.9876**            |
| 10 | 0.00118           | 0.00250             | **0.9734**          | **0.9929**            |
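The NDCG and Recall figures in these tables follow the standard BEIR definitions with binary relevance. A minimal sketch; note that with a single relevant document per query, NDCG@1 and Recall@1 coincide, which is why those two columns match at k=1:

```python
import math

def ndcg_at_k(ranked_rels, k):
    """NDCG@k with binary relevance: ranked_rels lists, in rank order,
    whether each retrieved document is relevant (1) or not (0)."""
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ranked_rels[:k]))
    ideal = sorted(ranked_rels, reverse=True)[:k]
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

def recall_at_k(ranked_rels, k, n_relevant):
    """Fraction of all relevant documents retrieved within the top k."""
    return sum(ranked_rels[:k]) / n_relevant if n_relevant else 0.0

# One relevant document in total, ranked second by the model:
print(round(ndcg_at_k([0, 1, 0], 2), 4))   # 0.6309
print(recall_at_k([0, 1, 0], 2, 1))        # 1.0
```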
@@ -74,7 +74,7 @@ Benchmark confirmation (BEIR evaluation, three datasets):
**CoSQA** (natural language queries over code — closest proxy to AVAP retrieval):
| k   | Qwen2.5-1.5B NDCG | Qwen2.5-1.5B Recall | Qwen3-Emb-0.6B NDCG | Qwen3-Emb-0.6B Recall |
| --- | ----------------- | ------------------- | ------------------- | --------------------- |
| 1   | 0.00000           | 0.00000             | **0.1740**          | **0.1740**            |
| 10  | 0.00000           | 0.00000             | **0.3909**          | **0.6700**            |
| 100 | 0.00210           | 0.01000             | **0.4510**          | **0.9520**            |
@@ -82,7 +82,7 @@ Benchmark confirmation (BEIR evaluation, three datasets):
**SciFact** (scientific prose — out-of-domain control):
| k   | Qwen2.5-1.5B NDCG | Qwen2.5-1.5B Recall | Qwen3-Emb-0.6B NDCG | Qwen3-Emb-0.6B Recall |
| --- | ----------------- | ------------------- | ------------------- | --------------------- |
| 1   | 0.02333           | 0.02083             | **0.5633**          | **0.5299**            |
| 10  | 0.04619           | 0.07417             | **0.6855**          | **0.8161**            |
| 100 | 0.07768           | 0.23144             | **0.7129**          | **0.9400**            |
@@ -110,12 +110,14 @@ The model that demonstrates superior performance under the evaluation criteria d
### Qwen3-Embedding-0.6B
**Strengths:**
- Already benchmarked on CodeXGLUE, CoSQA and SciFact — strong results documented
- 32,768 token context window — exceeds corpus requirements by a wide margin
- Same model family as the generation model (Qwen) — shared tokenizer vocabulary
- Lowest integration risk — already validated in the pipeline
**Limitations:**
- Benchmarks are English-only — multilingual performance on the AVAP corpus unvalidated
- Not a dedicated multilingual model — training distribution weighted towards English and Chinese
- No native sparse retrieval support
@@ -125,11 +127,13 @@ The model that demonstrates superior performance under the evaluation criteria d
### BGE-M3
**Strengths:**
- Explicit multilingual contrastive training across 100+ languages including programming languages — direct architectural fit for the intra-chunk Spanish/English/DSL mixing observed in the corpus
- Supports dense, sparse and multi-vector (ColBERT) retrieval from a single model inference — future path to consolidating the current BM25+kNN dual-system architecture (ADR-0003)
- Higher MTEB retrieval score than Qwen3-Embedding-0.6B in the programming domain
**Limitations:**
- Not yet benchmarked on CodeXGLUE, CoSQA or SciFact at the time of candidate selection — no prior empirical results for this corpus
- 8,192 token context window — sufficient for the current corpus (max chunk: 833 tokens, 10.2% utilization) but lower headroom for future corpus growth
- Requires tokenizer alignment: `HF_EMB_MODEL_NAME` must be updated to `BAAI/bge-m3` alongside `OLLAMA_EMB_MODEL_NAME` to keep chunk token counting consistent
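The tokenizer-alignment requirement amounts to keeping two settings in lockstep. A minimal dotenv-style sketch; the exact Ollama model tag is an assumption to be verified against the local registry:

```shell
# Keep both settings pointing at the same model so that the HF tokenizer
# used for chunk token counting matches the model served by Ollama.
HF_EMB_MODEL_NAME=BAAI/bge-m3
OLLAMA_EMB_MODEL_NAME=bge-m3    # hypothetical tag -- check `ollama list`
```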
@@ -186,7 +190,7 @@ BGE-M3 benchmarks were completed on the same three BEIR datasets using identical
**CodeXGLUE** (code retrieval from GitHub repositories):
| Metric | k  | BGE-M3     | Qwen3-Emb-0.6B | Delta (BGE-M3 − Qwen3) |
| ------ | -- | ---------- | -------------- | ---------------------- |
| NDCG   | 1  | **0.9520** | 0.9497         | +0.23 pp               |
| NDCG   | 5  | **0.9738** | 0.9717         | +0.21 pp               |
| NDCG   | 10 | **0.9749** | 0.9734         | +0.15 pp               |
@@ -201,7 +205,7 @@ Both models perform near-identically on CodeXGLUE. All deltas are below 0.25 abs
**CoSQA** (natural language queries over code — most representative proxy for AVAP retrieval):
| Metric | k  | BGE-M3 | Qwen3-Emb-0.6B | Delta (BGE-M3 − Qwen3) |
| ------ | -- | ------ | -------------- | ---------------------- |
| NDCG   | 1  | 0.1160 | **0.1740**     | −5.80 pp               |
| NDCG   | 5  | 0.2383 | **0.3351**     | −9.68 pp               |
| NDCG   | 10 | 0.2878 | **0.3909**     | −10.31 pp              |
@@ -216,7 +220,7 @@ Qwen3-Embedding-0.6B outperforms BGE-M3 on CoSQA by a wide margin at every k. Th
**SciFact** (scientific prose — out-of-domain control):
| Metric | k  | BGE-M3 | Qwen3-Emb-0.6B | Delta (BGE-M3 − Qwen3) |
| ------ | -- | ------ | -------------- | ---------------------- |
| NDCG   | 1  | 0.5100 | **0.5533**     | −4.33 pp               |
| NDCG   | 5  | 0.6190 | **0.6593**     | −4.03 pp               |
| NDCG   | 10 | 0.6431 | **0.6785**     | −3.54 pp               |
@@ -231,7 +235,7 @@ Qwen3-Embedding-0.6B leads BGE-M3 on SciFact by 3–4 absolute percentage points
### BEIR summary — NDCG@10 comparison
| Dataset   | BGE-M3 | Qwen3-Emb-0.6B | Delta     | Leader            |
| --------- | ------ | -------------- | --------- | ----------------- |
| CodeXGLUE | 0.9749 | 0.9734         | +0.15 pp  | BGE-M3 (marginal) |
| CoSQA     | 0.2878 | **0.3909**     | −10.31 pp | **Qwen3**         |
| SciFact   | 0.6431 | **0.6785**     | −3.54 pp  | **Qwen3**         |
@@ -260,6 +264,8 @@ The BEIR benchmarks — the secondary decision signal — favour Qwen3-Embedding
The EvaluateRAG evaluation — the primary decision signal — remains pending. It is the only evaluation that directly measures retrieval quality on the actual AVAP corpus with its intra-chunk multilingual mixing. BGE-M3's architectural fit for multilingual content could still produce a measurable advantage on the production corpus that the English-only BEIR benchmarks cannot capture. No final model selection will be made until EvaluateRAG results are available for both candidates.
Qwen3-Embedding is documented as multilingual and reports strong scores on multilingual benchmarks; the definitive answer, however, will come from the evaluation scores on the AVAP corpus.
---
## Consequences