# ADR-0005: Embedding Model Selection — Comparative Evaluation of BGE-M3 vs Qwen3-Embedding-0.6B
**Date:** 2026-03-19
**Status:** Under Evaluation
**Deciders:** Rafael Ruiz (CTO), MrHouston Engineering
---
## Context
The AVAP RAG pipeline requires an embedding model capable of mapping a hybrid corpus into a vector space suitable for semantic retrieval. Understanding the exact composition of this corpus is a prerequisite for model selection.
### Corpus characterisation (empirically measured)
A chunk-level audit was performed on the full indexable corpus: the AVAP Language Reference Manual (`avap.md`) and 40 representative `.avap` code samples. Results (`test_chunks.jsonl`, 190 chunks):
| Metric | Value |
| -------------------- | ----------- |
| Total chunks | 190 |
| Total tokens indexed | 11,498 |
| Minimum chunk size | 1 token |
| Maximum chunk size | 833 tokens |
| Mean chunk size | 60.5 tokens |
| Median chunk size | 29 tokens |
| p90 | 117 tokens |
| p95 | 204 tokens |
| p99 | 511 tokens |
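
The summary statistics above can be recomputed from the chunk file with a few lines of Python. The following is a minimal sketch, not the audit script itself: the JSONL key `n_tokens` is a hypothetical field name, and the nearest-rank percentile method is an assumption, so percentile values may differ slightly from the table if the audit used interpolation. As a sanity check, 11,498 / 190 ≈ 60.5, which matches the reported mean.

```python
import json
import math
import statistics


def chunk_stats(token_counts):
    """Summarise a list of per-chunk token counts."""
    ordered = sorted(token_counts)

    def percentile(p):
        # Nearest-rank percentile: smallest value with at least p% of the data at or below it.
        rank = max(1, math.ceil(p / 100 * len(ordered)))
        return ordered[rank - 1]

    return {
        "chunks": len(ordered),
        "total_tokens": sum(ordered),
        "min": ordered[0],
        "max": ordered[-1],
        "mean": round(statistics.mean(ordered), 1),
        "median": statistics.median(ordered),
        "p90": percentile(90),
        "p95": percentile(95),
        "p99": percentile(99),
    }


def load_token_counts(path, field="n_tokens"):
    # `field` is a hypothetical key name; adjust it to the actual JSONL schema.
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line)[field] for line in handle if line.strip()]
```

Under those assumptions, `chunk_stats(load_token_counts("test_chunks.jsonl"))` would return the same set of metrics as the table.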
**Corpus composition by type:**
| Type | Count | Description |
| ------------------------- | ----- | ---------------------------------------- |
| Narrative (Spanish prose) | 79 | LRM explanations, concept descriptions |
| Code chunks | 83 | AVAP `.avap` sample files |
| BNF formal grammar | 9 | Formal language specification in English |
| Code examples | 14 | Inline examples within LRM |
| Function signatures | 2 | Extracted function headers |
**Linguistic composition:** 55% of chunks originate from the LRM (`avap.md`), written in Spanish with embedded English DSL identifiers. 45% are `.avap` code files containing English command names (`addVar`, `addResult`, `registerEndpoint`, `ormDirect`) with Spanish-language string literals and variable names (`"Hola"`, `datos_cliente`, `mi_json_final`, `contraseña`, `fecha`). 18.9% of chunks (36 out of 190) contain both Spanish content and English DSL commands within the same chunk — intra-chunk multilingual mixing.
Representative examples of intra-chunk multilingual mixing:
```
// Narrative chunk (Spanish prose + English DSL terms):
"AVAP (Advanced Virtual API Programming) es un DSL (Domain-Specific Language)
Turing Completo, diseñado para la orquestación segura de microservicios e I/O."
// Code chunk (English commands + Spanish identifiers and literals):
addParam("lang", l)
if(l, "es", "=")
addVar(msg, "Hola")
end()
addResult(msg)
// BNF chunk (formal English grammar):
<program> ::= ( <line> | <block_comment> )*
<statement> ::= <assignment> | <method_call_stmt> | <io_command> | ...
```
### Why the initial model was eliminated
The initial model provided was **Qwen2.5-1.5B**. Empirical evaluation by MrHouston Engineering (full results in `research/embeddings/`) demonstrated it is unsuitable for dense retrieval. Qwen2.5-1.5B generates embeddings via the **Last Token** method: the final token of the sequence is assumed to encode all preceding context. For AVAP code chunks, the last token is always a syntactic closer — `end()`, `}`, `endLoop()` — with zero semantic content. The resulting embeddings are effectively identical across functionally distinct chunks.
Benchmark confirmation (BEIR evaluation, three datasets):
**CodeXGLUE** (code retrieval from GitHub repositories):
| k | Qwen2.5-1.5B NDCG | Qwen2.5-1.5B Recall | Qwen3-Emb-0.6B NDCG | Qwen3-Emb-0.6B Recall |
| -- | ----------------- | ------------------- | ------------------- | --------------------- |
| 1 | 0.00031 | 0.00031 | **0.9497** | **0.9497** |
| 5 | 0.00086 | 0.00151 | **0.9716** | **0.9876** |
| 10 | 0.00118 | 0.00250 | **0.9734** | **0.9929** |
**CoSQA** (natural language queries over code — closest proxy to AVAP retrieval):
| k | Qwen2.5-1.5B NDCG | Qwen2.5-1.5B Recall | Qwen3-Emb-0.6B NDCG | Qwen3-Emb-0.6B Recall |
| --- | ----------------- | ------------------- | ------------------- | --------------------- |
| 1 | 0.00000 | 0.00000 | **0.1740** | **0.1740** |
| 10 | 0.00000 | 0.00000 | **0.3909** | **0.6700** |
| 100 | 0.00210 | 0.01000 | **0.4510** | **0.9520** |
**SciFact** (scientific prose — out-of-domain control):
| k | Qwen2.5-1.5B NDCG | Qwen2.5-1.5B Recall | Qwen3-Emb-0.6B NDCG | Qwen3-Emb-0.6B Recall |
| --- | ----------------- | ------------------- | ------------------- | --------------------- |
| 1 | 0.02333 | 0.02083 | **0.5633** | **0.5299** |
| 10 | 0.04619 | 0.07417 | **0.6855** | **0.8161** |
| 100 | 0.07768 | 0.23144 | **0.7129** | **0.9400** |
Qwen2.5-1.5B is eliminated. **Qwen3-Embedding-0.6B is the validated baseline.**
### Why a comparative evaluation was required before adopting Qwen3
Qwen3-Embedding-0.6B's benchmark results were obtained on English-only datasets. They eliminated Qwen2.5-1.5B decisively but did not characterise Qwen3's behaviour on the multilingual mixed corpus that AVAP represents. A second candidate — **BGE-M3** — presented theoretical advantages for this specific corpus that could not be assessed without empirical comparison.
Adopting a model requires a destructive index rebuild, so the production switch should be made only once. Because the embedding model directly determines the quality of all RAG retrieval in production, adopting a model without a direct comparison between the two viable candidates would not meet the due diligence required for a decision of this impact.
---
## Decision
A **head-to-head comparative evaluation** of BGE-M3 and Qwen3-Embedding-0.6B is being conducted under identical conditions before either is adopted as the production embedding model.
The model that demonstrates superior performance under the evaluation criteria defined below is adopted. This ADR moves to Accepted upon completion of that evaluation, with the selected model documented as the outcome.
---
## Candidate Analysis
### Qwen3-Embedding-0.6B
**Strengths:**
- Already benchmarked on CodeXGLUE, CoSQA and SciFact — strong results documented
- 32,768 token context window: exceeds corpus requirements by a large margin
- Same model family as the generation model (Qwen) — shared tokenizer vocabulary
- Lowest integration risk — already validated in the pipeline
**Limitations:**
- Benchmarks are English-only — multilingual performance on AVAP corpus unvalidated
- Not a dedicated multilingual model — training distribution weighted towards English and Chinese
- No native sparse retrieval support
**Corpus fit assessment:** The maximum chunk in the AVAP corpus is 833 tokens — well within both candidates' limits. Qwen3's 32,768 token context window provides no practical advantage over BGE-M3's 8,192 tokens for this corpus. Context window is not a differentiating criterion.
### BGE-M3
**Strengths:**
- Explicit multilingual contrastive training across 100+ languages including programming languages — direct architectural fit for the intra-chunk Spanish/English/DSL mixing observed in the corpus
- Supports dense, sparse and multi-vector ColBERT retrieval from a single model inference — future path to consolidating the current BM25+kNN dual-system architecture (ADR-0003)
- Higher MTEB retrieval score than Qwen3-Embedding-0.6B in the programming domain
**Limitations:**
- Not yet benchmarked on CodeXGLUE, CoSQA or SciFact at the time of candidate selection — no prior empirical results for this corpus
- 8,192 token context window — sufficient for current corpus (max chunk: 833 tokens, 10.2% utilization) but lower headroom for future corpus growth
- Requires tokenizer alignment: `HF_EMB_MODEL_NAME` must be updated to `BAAI/bge-m3` alongside `OLLAMA_EMB_MODEL_NAME` to keep chunk token counting consistent
**Corpus fit assessment:** The intra-chunk multilingual mixing (18.9% of chunks) and the Spanish prose component (79 narrative chunks) are the corpus characteristics most likely to differentiate BGE-M3 from Qwen3. The BEIR and EvaluateRAG evaluations determine whether this theoretical advantage translates to measurable retrieval improvement.
### VRAM
Both candidates require approximately 1.13 GiB at FP16 (BGE-M3: 567M parameters; Qwen3: 596M parameters). Combined with a quantized generation model and KV cache, total VRAM remains within the 4 GiB hardware constraint for both. VRAM is not a selection criterion.
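
The weights-only part of this estimate is simple arithmetic: parameter count times 2 bytes at FP16. The sketch below uses the parameter counts quoted above and deliberately ignores activations, runtime overhead and KV cache, so it is a lower bound rather than a measured footprint.

```python
def fp16_weight_gib(n_params, bytes_per_param=2):
    """Weights-only memory footprint in GiB (activations and KV cache excluded)."""
    return n_params * bytes_per_param / 2**30


bge_m3_gib = fp16_weight_gib(567_000_000)  # parameter count quoted in this ADR
qwen3_gib = fp16_weight_gib(596_000_000)   # parameter count quoted in this ADR
print(f"BGE-M3: {bge_m3_gib:.2f} GiB, Qwen3-Embedding-0.6B: {qwen3_gib:.2f} GiB")
```

Both results land slightly above 1 GiB and within about 0.05 GiB of each other, which is consistent with treating VRAM as a non-differentiating factor.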
### Embedding dimension
Both candidates output 1024-dimensional vectors. The Elasticsearch index mapping (`int8_hnsw`, `dims: 1024`, cosine similarity) is identical for both candidates. No mapping changes are required between them.
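
For illustration, a mapping with these settings could look like the following request body, written here as a Python dict. Only the `dense_vector` settings (`dims`, `similarity`, `int8_hnsw`) come from this ADR; the field names `text` and `embedding` are hypothetical placeholders, not the project's actual schema.

```python
# Hypothetical field names; only the dense_vector settings come from the ADR.
EMBEDDING_DIMS = 1024

index_mapping = {
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "embedding": {
                "type": "dense_vector",
                "dims": EMBEDDING_DIMS,
                "index": True,
                "similarity": "cosine",
                "index_options": {"type": "int8_hnsw"},
            },
        }
    }
}
```

Because both candidates emit 1024-dimensional vectors, the same body could be reused unchanged when the index is rebuilt for either model.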
---
## Evaluation Protocol
Both models are evaluated under identical conditions. All results are documented in `research/embeddings/`.
**Step 1 — BEIR benchmarks**
CodeXGLUE, CoSQA and SciFact were run with **BGE-M3** using the same BEIR evaluation scripts and configuration used for Qwen3-Embedding-0.6B. Qwen3-Embedding-0.6B results already existed in `research/embeddings/` and served as the baseline. Reported metrics: NDCG@k, MAP@k, Recall@k and Precision@k at k = 1, 3, 5, 10, 100.
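
As a reference for how two of the reported metrics behave, here is a minimal sketch of Recall@k and NDCG@k for a single query with binary relevance. It is not the BEIR evaluation code used for the results in this ADR; the document identifiers are illustrative.

```python
import math


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """NDCG@k with binary relevance (gain 1 for relevant, 0 otherwise)."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    ideal_hits = min(k, len(relevant_ids))
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


ranked = ["d3", "d1", "d7"]   # illustrative retrieval order
relevant = {"d1"}             # illustrative ground truth
print(recall_at_k(ranked, relevant, 3), round(ndcg_at_k(ranked, relevant, 3), 4))
```

When every query has exactly one relevant document, NDCG@1 and Recall@1 coincide, which is consistent with the identical k = 1 values reported for CodeXGLUE and CoSQA.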
**Step 2 — EvaluateRAG on AVAP corpus**
The Elasticsearch index is rebuilt twice — once with each model — and `EvaluateRAG` is run against the production AVAP golden dataset for both. Reported RAGAS scores: faithfulness, answer_relevancy, context_recall, context_precision, and global score with verdict.
**Selection criterion**
EvaluateRAG is the primary decision signal. It directly measures retrieval quality on the actual AVAP production corpus — including its intra-chunk multilingual mixing (18.9% of chunks) and domain-specific DSL syntax — and is therefore more representative than any external benchmark. The model with the higher global EvaluateRAG score is adopted.
BEIR results are the secondary signal. The primary BEIR metric is NDCG@10. Among the three datasets, **CoSQA is the most representative proxy** for the AVAP retrieval use case — it pairs natural language queries with code snippets, mirroring the Spanish prose query / AVAP DSL code retrieval pattern. CoSQA results are weighted accordingly in the comparison.
All margin comparisons use **absolute percentage points** in NDCG@10 (e.g., 0.39 vs 0.41 is a 2 absolute percentage point difference, not a 5.1% relative difference).
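
The distinction can be made concrete with the example figures from the sentence above. This is a small sketch assuming scores are expressed on a 0 to 1 scale.

```python
def absolute_pp(score_a, score_b):
    """Difference in absolute percentage points between two scores on a 0-1 scale."""
    return (score_b - score_a) * 100


def relative_pct(score_a, score_b):
    """Relative change from score_a to score_b, in percent."""
    return (score_b - score_a) / score_a * 100


print(f"{absolute_pp(0.39, 0.41):.1f} pp")  # absolute difference: 2.0 pp
print(f"{relative_pct(0.39, 0.41):.1f} %")  # relative difference: 5.1 %
```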
**Tiebreaker**
If the EvaluateRAG global scores are within 5 absolute percentage points of each other, the BEIR results determine the outcome. BGE-M3 is adopted only if both of the following conditions hold:
- BGE-M3 exceeds Qwen3-Embedding-0.6B by more than 2 absolute percentage points on mean NDCG@10 across all three BEIR datasets, AND
- BGE-M3 does not underperform Qwen3-Embedding-0.6B by more than 2 absolute percentage points on CoSQA NDCG@10 specifically.
If either condition fails (that is, EvaluateRAG scores are within 5 points and BGE-M3 does not clear both BEIR thresholds), Qwen3-Embedding-0.6B is adopted. It carries lower integration risk, its benchmarks are already documented, and it is the validated baseline for the system.
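
The selection criterion and tiebreaker can be written as a single decision function. The sketch below is one possible reading of the protocol, under two assumptions: all scores are on a 0 to 1 scale, and a gap of exactly 5 points counts as "within 5". The function name and argument layout are hypothetical.

```python
def select_model(rag_bge, rag_qwen, ndcg10_bge, ndcg10_qwen):
    """Apply the selection criterion and tiebreaker; scores are on a 0-1 scale.

    ndcg10_* map dataset name -> NDCG@10 and must include "CoSQA".
    """
    pp = 100  # converts 0-1 score differences to absolute percentage points
    rag_gap = (rag_bge - rag_qwen) * pp
    if abs(rag_gap) > 5:
        # Clear EvaluateRAG winner: the primary signal decides.
        return "BGE-M3" if rag_gap > 0 else "Qwen3-Embedding-0.6B"

    # Tie on EvaluateRAG: BGE-M3 must clear both BEIR thresholds.
    mean_bge = sum(ndcg10_bge.values()) / len(ndcg10_bge)
    mean_qwen = sum(ndcg10_qwen.values()) / len(ndcg10_qwen)
    beats_mean = (mean_bge - mean_qwen) * pp > 2
    holds_cosqa = (ndcg10_qwen["CoSQA"] - ndcg10_bge["CoSQA"]) * pp <= 2
    return "BGE-M3" if beats_mean and holds_cosqa else "Qwen3-Embedding-0.6B"
```

Encoding the rule this way makes the default explicit: in a tie, Qwen3-Embedding-0.6B is returned unless BGE-M3 satisfies both BEIR conditions.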
---
## Rationale
### Step 1 results — BEIR head-to-head comparison
BGE-M3 benchmarks were completed on the same three BEIR datasets using identical evaluation scripts and configuration. Full results are stored in `research/embeddings/embedding_eval_results/emb_models_result.json`. The following tables compare both candidates side by side.
**CodeXGLUE** (code retrieval from GitHub repositories):
| Metric | k   | BGE-M3     | Qwen3-Emb-0.6B | Delta (BGE-M3 - Qwen3) |
| ------ | --- | ---------- | -------------- | ---------------------- |
| NDCG   | 1   | **0.9520** | 0.9497         | +0.23 pp               |
| NDCG   | 5   | **0.9738** | 0.9717         | +0.21 pp               |
| NDCG   | 10  | **0.9749** | 0.9734         | +0.15 pp               |
| NDCG   | 100 | **0.9763** | 0.9745         | +0.18 pp               |
| Recall | 1   | **0.9520** | 0.9497         | +0.23 pp               |
| Recall | 5   | **0.9892** | 0.9876         | +0.16 pp               |
| Recall | 10  | 0.9928     | **0.9930**     | -0.02 pp               |
| Recall | 100 | **0.9989** | 0.9981         | +0.08 pp               |
Both models perform near-identically on CodeXGLUE. All deltas are below 0.25 absolute percentage points. This dataset does not differentiate the candidates.
**CoSQA** (natural language queries over code — most representative proxy for AVAP retrieval):
| Metric | k   | BGE-M3 | Qwen3-Emb-0.6B | Delta (BGE-M3 - Qwen3) |
| ------ | --- | ------ | -------------- | ---------------------- |
| NDCG   | 1   | 0.1160 | **0.1740**     | -5.80 pp               |
| NDCG   | 5   | 0.2383 | **0.3351**     | -9.68 pp               |
| NDCG   | 10  | 0.2878 | **0.3909**     | -10.31 pp              |
| NDCG   | 100 | 0.3631 | **0.4510**     | -8.79 pp               |
| Recall | 1   | 0.1160 | **0.1740**     | -5.80 pp               |
| Recall | 5   | 0.3660 | **0.5020**     | -13.60 pp              |
| Recall | 10  | 0.5160 | **0.6700**     | -15.40 pp              |
| Recall | 100 | 0.8740 | **0.9520**     | -7.80 pp               |
Qwen3-Embedding-0.6B outperforms BGE-M3 on CoSQA by a wide margin at every k. The NDCG@10 gap is 10.31 absolute percentage points. CoSQA is the most representative proxy for the AVAP retrieval use case — it pairs natural language queries with code snippets — making this the most significant BEIR result.
**SciFact** (scientific prose — out-of-domain control):
| Metric | k   | BGE-M3 | Qwen3-Emb-0.6B | Delta (BGE-M3 - Qwen3) |
| ------ | --- | ------ | -------------- | ---------------------- |
| NDCG   | 1   | 0.5100 | **0.5533**     | -4.33 pp               |
| NDCG   | 5   | 0.6190 | **0.6593**     | -4.03 pp               |
| NDCG   | 10  | 0.6431 | **0.6785**     | -3.54 pp               |
| NDCG   | 100 | 0.6705 | **0.7056**     | -3.51 pp               |
| Recall | 1   | 0.4818 | **0.5243**     | -4.25 pp               |
| Recall | 5   | 0.7149 | **0.7587**     | -4.38 pp               |
| Recall | 10  | 0.7834 | **0.8144**     | -3.10 pp               |
| Recall | 100 | 0.9037 | **0.9367**     | -3.30 pp               |
Qwen3-Embedding-0.6B leads BGE-M3 on SciFact by roughly 3 to 4 absolute percentage points across all metrics (3.10 to 4.38 pp). The gap is consistent but narrower than on CoSQA.
### BEIR summary — NDCG@10 comparison
| Dataset   | BGE-M3     | Qwen3-Emb-0.6B | Delta (BGE-M3 - Qwen3) | Leader            |
| --------- | ---------- | -------------- | ---------------------- | ----------------- |
| CodeXGLUE | 0.9749     | 0.9734         | +0.15 pp               | BGE-M3 (marginal) |
| CoSQA     | 0.2878     | **0.3909**     | -10.31 pp              | **Qwen3**         |
| SciFact   | 0.6431     | **0.6785**     | -3.54 pp               | **Qwen3**         |
| **Mean**  | **0.6353** | **0.6809**     | **-4.56 pp**           | **Qwen3**         |
Qwen3-Embedding-0.6B leads on mean NDCG@10 by 4.56 absolute percentage points, driven primarily by a 10.31 pp advantage on CoSQA.
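
The mean row can be reproduced directly from the three per-dataset NDCG@10 values. The sketch below assumes the delta was taken between the means after rounding them to four decimals, which is what matches the reported figure.

```python
ndcg10 = {
    # dataset: (BGE-M3, Qwen3-Embedding-0.6B), values from the summary table
    "CodeXGLUE": (0.9749, 0.9734),
    "CoSQA": (0.2878, 0.3909),
    "SciFact": (0.6431, 0.6785),
}

mean_bge = round(sum(b for b, _ in ndcg10.values()) / len(ndcg10), 4)
mean_qwen = round(sum(q for _, q in ndcg10.values()) / len(ndcg10), 4)
delta_pp = (mean_bge - mean_qwen) * 100  # negative means Qwen3 leads

print(mean_bge, mean_qwen, f"{delta_pp:+.2f} pp")
```

This yields means of 0.6353 and 0.6809 and a gap of 4.56 pp in favour of Qwen3. Taking the difference before rounding the means gives approximately 4.57 pp; the discrepancy is a rounding effect and does not affect the tiebreaker outcome.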
### Application of tiebreaker criteria to BEIR results
Per the evaluation protocol, if EvaluateRAG global scores are within 5 absolute percentage points, the BEIR tiebreaker applies. The tiebreaker requires BGE-M3 to meet **both** conditions:
1. **BGE-M3 must exceed Qwen3 by more than 2 pp on mean NDCG@10.** Result: BGE-M3 trails by 4.56 pp. **Condition not met.**
2. **BGE-M3 must not underperform Qwen3 by more than 2 pp on CoSQA NDCG@10.** Result: BGE-M3 trails by 10.31 pp. **Condition not met.**
Neither tiebreaker condition is satisfied. Under the defined protocol, if the EvaluateRAG evaluation results in a tie (within 5 pp), the BEIR tiebreaker defaults to Qwen3-Embedding-0.6B.
### Step 2 results — EvaluateRAG on AVAP corpus
The golden dataset is not yet available, so Step 2 cannot proceed at this time.
_Pending. Results will be documented here upon completion of the EvaluateRAG evaluation for both models._
### Preliminary assessment
The BEIR benchmarks — the secondary decision signal — favour Qwen3-Embedding-0.6B across both the most representative dataset (CoSQA, 10.31 pp) and the out-of-domain control (SciFact, 3.54 pp), with CodeXGLUE effectively tied. BGE-M3's theoretical advantage from multilingual contrastive training does not translate to superior performance on these English-only benchmarks.
The EvaluateRAG evaluation — the primary decision signal — remains pending. It is the only evaluation that directly measures retrieval quality on the actual AVAP corpus with its intra-chunk multilingual mixing. BGE-M3's architectural fit for multilingual content could still produce a measurable advantage on the production corpus that the English-only BEIR benchmarks cannot capture. No final model selection will be made until EvaluateRAG results are available for both candidates.
According to its documentation, Qwen3-Embedding is itself multilingual and scores well on multilingual benchmarks. The definitive answer, however, will come from the evaluation scores on the AVAP corpus.
---
## Consequences
- **Index rebuild required** regardless of which model is adopted. Vectors from Qwen2.5-1.5B are incompatible with either candidate. The existing index is deleted before re-ingestion.
- **Two index rebuilds required for the evaluation.** One per candidate for the EvaluateRAG step. Given the current corpus size (190 chunks, 11,498 tokens), rebuild time is not a meaningful constraint.
- **Tokenizer alignment for BGE-M3.** If BGE-M3 is selected, both `OLLAMA_EMB_MODEL_NAME` and `HF_EMB_MODEL_NAME` are updated. Updating only `OLLAMA_EMB_MODEL_NAME` causes the chunker to estimate token counts using the wrong vocabulary — a silent bug that produces inconsistent chunk sizes without raising any error.
- **Future model changes.** Any future replacement of the embedding model follows the same evaluation protocol — BEIR benchmarks on the same three datasets plus EvaluateRAG — before an ADR update is accepted. Results are documented in `research/embeddings/`.
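
The silent tokenizer-mismatch bug described above could be caught at startup with a small consistency check. The following is a sketch only: the lookup table is hypothetical, the Ollama tags in it are assumptions to be verified against the deployed models, and `check_embedding_alignment` is not an existing function in the pipeline.

```python
import os

# Hypothetical lookup from Ollama model tag to Hugging Face tokenizer id.
# The Ollama tags are assumptions; verify them against the deployed models.
OLLAMA_TO_HF = {
    "bge-m3": "BAAI/bge-m3",
    "qwen3-embedding:0.6b": "Qwen/Qwen3-Embedding-0.6B",
}


def check_embedding_alignment(env=None):
    """Raise if the chunker tokenizer does not match the embedding model."""
    env = os.environ if env is None else env
    ollama_name = env["OLLAMA_EMB_MODEL_NAME"]
    hf_name = env["HF_EMB_MODEL_NAME"]
    expected = OLLAMA_TO_HF.get(ollama_name)
    if expected is None:
        raise ValueError(f"Unknown embedding model tag: {ollama_name!r}")
    if hf_name != expected:
        raise ValueError(
            f"HF_EMB_MODEL_NAME={hf_name!r} does not match "
            f"OLLAMA_EMB_MODEL_NAME={ollama_name!r} (expected {expected!r})"
        )
```

Calling such a check before ingestion would turn an inconsistent pair of settings into an immediate error instead of a quiet drift in chunk sizes.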