# ADR-0005: Embedding Model Selection — Comparative Evaluation of BGE-M3 vs Qwen3-Embedding-0.6B
**Date:** 2026-03-19
**Status:** Under Evaluation
**Deciders:** Rafael Ruiz (CTO), MrHouston Engineering
---
## Context
The AVAP RAG pipeline requires an embedding model capable of mapping a hybrid corpus into a vector space suitable for semantic retrieval. Understanding the exact composition of this corpus is a prerequisite for model selection.
### Corpus characterisation (empirically measured)
A chunk-level audit was performed on the full indexable corpus: the AVAP Language Reference Manual (`avap.md`) and 40 representative `.avap` code samples. Results (`test_chunks.jsonl`, 190 chunks):
| Metric | Value |
|---|---|
| Total chunks | 190 |
| Total tokens indexed | 11,498 |
| Minimum chunk size | 1 token |
| Maximum chunk size | 833 tokens |
| Mean chunk size | 60.5 tokens |
| Median chunk size | 29 tokens |
| p90 | 117 tokens |
| p95 | 204 tokens |
| p99 | 511 tokens |
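The statistics above can be reproduced from the audit file with a short script. A minimal sketch, assuming each JSONL line carries a `token_count` field (the actual schema of `test_chunks.jsonl` may differ):

```python
import json
import math

def chunk_stats(path):
    """Summarise chunk sizes from a JSONL audit file.

    Assumes each line is a JSON object with a `token_count` field;
    adjust the field name to match the real audit schema.
    """
    with open(path, encoding="utf-8") as f:
        sizes = sorted(json.loads(line)["token_count"] for line in f if line.strip())

    def percentile(p):
        # Nearest-rank percentile over the sorted sizes.
        rank = max(1, math.ceil(p / 100 * len(sizes)))
        return sizes[rank - 1]

    return {
        "chunks": len(sizes),
        "total_tokens": sum(sizes),
        "min": sizes[0],
        "max": sizes[-1],
        "mean": round(sum(sizes) / len(sizes), 1),
        "median": percentile(50),
        "p90": percentile(90),
        "p95": percentile(95),
        "p99": percentile(99),
    }
```

Nearest-rank percentiles are used for simplicity; an interpolating method would give slightly different p90/p95/p99 values on a 190-chunk sample.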
**Corpus composition by type:**
| Type | Count | Description |
|---|---|---|
| Narrative (Spanish prose) | 79 | LRM explanations, concept descriptions |
| Code chunks | 83 | AVAP `.avap` sample files |
| BNF formal grammar | 9 | Formal language specification in English |
| Code examples | 14 | Inline examples within LRM |
| Function signatures | 2 | Extracted function headers |
**Linguistic composition:** 55% of chunks originate from the LRM (`avap.md`), written in Spanish with embedded English DSL identifiers. 45% are `.avap` code files containing English command names (`addVar`, `addResult`, `registerEndpoint`, `ormDirect`) with Spanish-language string literals and variable names (`"Hola"`, `datos_cliente`, `mi_json_final`, `contraseña`, `fecha`). 18.9% of chunks (36 out of 190) contain both Spanish content and English DSL commands within the same chunk — intra-chunk multilingual mixing.
Representative examples of intra-chunk multilingual mixing:
```
// Narrative chunk (Spanish prose + English DSL terms):
"AVAP (Advanced Virtual API Programming) es un DSL (Domain-Specific Language)
Turing Completo, diseñado para la orquestación segura de microservicios e I/O."
// Code chunk (English commands + Spanish identifiers and literals):
addParam("lang", l)
if(l, "es", "=")
addVar(msg, "Hola")
end()
addResult(msg)
// BNF chunk (formal English grammar):
<program> ::= ( <line> | <block_comment> )*
<statement> ::= <assignment> | <method_call_stmt> | <io_command> | ...
```
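The 18.9% mixed-chunk figure implies a classifier over chunks. A simplified heuristic sketch of how such a flag could be computed; the command list and Spanish markers below are illustrative, not the audit's actual rules:

```python
import re

# A chunk counts as "mixed" when it contains both an English DSL command
# call and Spanish-language text. Both pattern lists are hypothetical
# examples drawn from this ADR, not the audit's real detection logic.
DSL_COMMANDS = re.compile(
    r"\b(addVar|addResult|addParam|registerEndpoint|ormDirect)\s*\("
)
SPANISH_MARKERS = re.compile(
    r"[ñáéíóúü¿¡]|\b(hola|para|datos|cliente|fecha|contraseña)\b",
    re.IGNORECASE,
)

def is_mixed_chunk(text: str) -> bool:
    return bool(DSL_COMMANDS.search(text)) and bool(SPANISH_MARKERS.search(text))
```

Under this heuristic, `addVar(msg, "Hola")` is flagged as mixed, while a pure BNF chunk or a command call with no Spanish content is not.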
### Why the initial model was eliminated
The initial model provided was **Qwen2.5-1.5B**. Empirical evaluation by MrHouston Engineering (full results in `research/embeddings/`) demonstrated it is unsuitable for dense retrieval. Qwen2.5-1.5B generates embeddings via the **Last Token** method: the final token of the sequence is assumed to encode all preceding context. For AVAP code chunks, the last token is always a syntactic closer — `end()`, `}`, `endLoop()` — with zero semantic content. The resulting embeddings are effectively identical across functionally distinct chunks.
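The failure mode can be shown with a toy example. The vectors below are made up, and in a real transformer the final hidden state does attend to earlier context; the point is that a shared syntactic closer carries little discriminative signal, so last-token pooling collapses functionally distinct chunks where mean pooling keeps them apart:

```python
def last_token_pool(hidden_states):
    # Last-token method: the embedding is the final token's hidden state.
    return hidden_states[-1]

def mean_pool(hidden_states):
    # Alternative: average every token's hidden state.
    dim = len(hidden_states[0])
    return [sum(h[i] for h in hidden_states) / len(hidden_states) for i in range(dim)]

END_TOKEN = [0.0, 0.0, 1.0]                # shared closer, e.g. end()
chunk_a = [[0.9, 0.1, 0.0], END_TOKEN]     # e.g. addVar(msg, "Hola") ... end()
chunk_b = [[0.1, 0.8, 0.2], END_TOKEN]     # e.g. addResult(msg) ... end()

assert last_token_pool(chunk_a) == last_token_pool(chunk_b)  # indistinguishable
assert mean_pool(chunk_a) != mean_pool(chunk_b)              # still separable
```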
Benchmark confirmation (BEIR evaluation, three datasets):
**CodeXGLUE** (code retrieval from GitHub repositories):
| k | Qwen2.5-1.5B NDCG | Qwen2.5-1.5B Recall | Qwen3-Emb-0.6B NDCG | Qwen3-Emb-0.6B Recall |
|---|---|---|---|---|
| 1 | 0.00031 | 0.00031 | **0.9497** | **0.9497** |
| 5 | 0.00086 | 0.00151 | **0.9716** | **0.9876** |
| 10 | 0.00118 | 0.00250 | **0.9734** | **0.9929** |
**CoSQA** (natural language queries over code — closest proxy to AVAP retrieval):
| k | Qwen2.5-1.5B NDCG | Qwen2.5-1.5B Recall | Qwen3-Emb-0.6B NDCG | Qwen3-Emb-0.6B Recall |
|---|---|---|---|---|
| 1 | 0.00000 | 0.00000 | **0.1740** | **0.1740** |
| 10 | 0.00000 | 0.00000 | **0.3909** | **0.6700** |
| 100 | 0.00210 | 0.01000 | **0.4510** | **0.9520** |
**SciFact** (scientific prose — out-of-domain control):
| k | Qwen2.5-1.5B NDCG | Qwen2.5-1.5B Recall | Qwen3-Emb-0.6B NDCG | Qwen3-Emb-0.6B Recall |
|---|---|---|---|---|
| 1 | 0.02333 | 0.02083 | **0.5633** | **0.5299** |
| 10 | 0.04619 | 0.07417 | **0.6855** | **0.8161** |
| 100 | 0.07768 | 0.23144 | **0.7129** | **0.9400** |
Qwen2.5-1.5B is eliminated. **Qwen3-Embedding-0.6B is the validated baseline.**
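For reference, the NDCG@k values reported in the tables above can be computed as follows for binary relevance. This is a self-contained sketch of the metric's definition, not the BEIR implementation:

```python
import math

def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance NDCG@k: discounted gain of the retrieved ranking,
    normalised by the gain of an ideal ranking.

    ranked_ids: document ids in retrieval order; relevant_ids: gold set.
    """
    dcg = sum(
        1.0 / math.log2(i + 2)            # rank i (0-based) discounts by log2(i+2)
        for i, doc in enumerate(ranked_ids[:k])
        if doc in relevant_ids
    )
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(relevant_ids))))
    return dcg / ideal if ideal else 0.0
```

A perfect top-1 hit scores 1.0; placing the only relevant document at rank 2 scores 1/log2(3) ≈ 0.63, which is why near-zero NDCG@1 for Qwen2.5-1.5B indicates wholesale retrieval failure rather than mere re-ordering.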
### Why a comparative evaluation is required before adopting Qwen3
Qwen3-Embedding-0.6B's benchmark results were obtained on English-only datasets. They eliminate Qwen2.5-1.5B decisively but do not characterise Qwen3's behaviour on the multilingual mixed corpus that AVAP represents. A second candidate — **BGE-M3** — presents theoretical advantages for this specific corpus that cannot be assessed without empirical comparison.
The index rebuild required to adopt any model is destructive and must be done once. Given that the embedding model directly determines the quality of all RAG retrieval in production, adopting a model without a direct comparison between the two viable candidates would not meet the due diligence required for a decision of this impact.
---
## Decision
Conduct a **head-to-head comparative evaluation** of BGE-M3 and Qwen3-Embedding-0.6B under identical conditions before adopting either as the production embedding model.
The model that demonstrates superior performance under the evaluation criteria defined below will be adopted. This ADR moves to Accepted upon completion of that evaluation, with the selected model documented as the outcome.
---
## Candidate Analysis
### Qwen3-Embedding-0.6B
**Strengths:**
- Already benchmarked on CodeXGLUE, CoSQA and SciFact — strong results documented
- 32,768 token context window — exceeds corpus requirements with large margin
- Same model family as the generation model (Qwen) — shared tokenizer vocabulary
- Lowest integration risk — already validated in the pipeline
**Limitations:**
- Benchmarks are English-only — multilingual performance on AVAP corpus unvalidated
- Not a dedicated multilingual model — training distribution weighted towards English and Chinese
- No native sparse retrieval support
**Corpus fit assessment:** The maximum chunk in the AVAP corpus is 833 tokens — well within both candidates' limits. Qwen3's 32,768 token context window provides no practical advantage over BGE-M3's 8,192 tokens for this corpus. Context window is not a differentiating criterion.
### BGE-M3
**Strengths:**
- Explicit multilingual contrastive training across 100+ languages including programming languages — direct architectural fit for the intra-chunk Spanish/English/DSL mixing observed in the corpus
- Supports dense, sparse and multi-vector ColBERT retrieval from a single model inference — future path to consolidating the current BM25+kNN dual-system architecture (ADR-0003)
- Higher MTEB retrieval score than Qwen3-Embedding-0.6B in the programming domain
**Limitations:**
- Not yet benchmarked on CodeXGLUE, CoSQA or SciFact — no empirical results for this corpus
- 8,192 token context window — sufficient for current corpus (max chunk: 833 tokens, 10.2% utilisation) but lower headroom for future corpus growth
- Requires tokenizer alignment: `HF_EMB_MODEL_NAME` must be updated to `BAAI/bge-m3` alongside `OLLAMA_EMB_MODEL_NAME` to keep chunk token counting consistent
**Corpus fit assessment:** The intra-chunk multilingual mixing (18.9% of chunks) and the Spanish prose component (79 narrative chunks) are the corpus characteristics most likely to differentiate BGE-M3 from Qwen3. The BEIR and EvaluateRAG evaluations will determine whether this theoretical advantage translates to measurable retrieval improvement.
### VRAM
Each candidate requires approximately 1.1 GiB of weights at FP16 (BGE-M3: 567M parameters; Qwen3: 596M parameters; roughly 2 bytes per parameter). Combined with a quantized generation model and KV cache, total VRAM remains within the 4 GiB hardware constraint for either choice. VRAM is not a selection criterion.
### Embedding dimension
Both candidates output 1024-dimensional vectors. The Elasticsearch index mapping (`int8_hnsw`, `dims: 1024`, cosine similarity) is identical for both candidates. No mapping changes are required between them.
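The shared index configuration can be sketched as follows. The field names (`content`, `embedding`) are illustrative placeholders, not the actual schema; the vector settings match the mapping described above:

```python
# Sketch of the Elasticsearch mapping common to both candidates.
# Field names are hypothetical; dims/similarity/index_options come from
# the ADR. "int8_hnsw" is Elasticsearch's quantized HNSW index type.
INDEX_MAPPING = {
    "mappings": {
        "properties": {
            "content": {"type": "text"},
            "embedding": {
                "type": "dense_vector",
                "dims": 1024,            # both BGE-M3 and Qwen3-Emb-0.6B
                "index": True,
                "similarity": "cosine",
                "index_options": {"type": "int8_hnsw"},
            },
        }
    }
}
```

Because this mapping is identical for both candidates, switching models during the evaluation only requires re-ingesting vectors, not changing the index schema.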
---
## Evaluation Protocol
Both models are evaluated under identical conditions. Results must be documented in `research/embeddings/` before this ADR is closed.
**Step 1 — BEIR benchmarks**
Run CodeXGLUE, CoSQA and SciFact with **BGE-M3** using the same BEIR evaluation scripts and configuration used for Qwen3-Embedding-0.6B. Qwen3-Embedding-0.6B results already exist in `research/embeddings/` and serve as the baseline. Report NDCG@k, MAP@k, Recall@k and Precision@k at k = 1, 3, 5, 10, 100.
**Step 2 — EvaluateRAG on AVAP corpus**
Rebuild the Elasticsearch index twice — once with each model — and run `EvaluateRAG` against the production AVAP golden dataset for both. Report RAGAS scores: faithfulness, answer_relevancy, context_recall, context_precision, and global score with verdict.
**Selection criterion**
EvaluateRAG is the primary decision signal. It directly measures retrieval quality on the actual AVAP production corpus — including its intra-chunk multilingual mixing (18.9% of chunks) and domain-specific DSL syntax — and is therefore more representative than any external benchmark. The model with the higher global EvaluateRAG score is adopted.
BEIR results are the secondary signal. The primary BEIR metric is NDCG@10. Among the three datasets, **CoSQA is the most representative proxy** for the AVAP retrieval use case — it pairs natural language queries with code snippets, mirroring the Spanish prose query / AVAP DSL code retrieval pattern. CoSQA results are weighted accordingly in the comparison.
All margin comparisons use **absolute percentage points** in NDCG@10 (e.g., 0.39 vs 0.41 is a 2 absolute percentage point difference, not a 5.1% relative difference).
**Tiebreaker**
If the EvaluateRAG global scores are within 5 absolute percentage points of each other, the BEIR results determine the outcome under the following conditions:
- BGE-M3 must exceed Qwen3-Embedding-0.6B by more than 2 absolute percentage points on mean NDCG@10 across all three BEIR datasets, AND
- BGE-M3 must not underperform Qwen3-Embedding-0.6B by more than 2 absolute percentage points on CoSQA NDCG@10 specifically.
If either condition fails (that is, if the EvaluateRAG scores are within 5 points and BGE-M3 does not clear both BEIR thresholds), Qwen3-Embedding-0.6B is adopted. It carries lower integration risk, its benchmarks are already documented, and it is the validated baseline for the system.
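The full selection rule is mechanical and can be encoded directly. A sketch, with scores as fractions in [0, 1] so that 0.01 equals 1 absolute percentage point; the function name and signature are illustrative:

```python
def select_model(rag_bge, rag_qwen,
                 beir_mean_bge, beir_mean_qwen,
                 cosqa_bge, cosqa_qwen):
    """Encode the ADR's selection rule.

    rag_*:      EvaluateRAG global scores (primary signal)
    beir_mean_*: mean NDCG@10 across the three BEIR datasets
    cosqa_*:    CoSQA NDCG@10 (the guarded dataset in the tiebreaker)
    """
    # Primary signal: EvaluateRAG global score, unless within 5 pp.
    if abs(rag_bge - rag_qwen) > 0.05:
        return "BAAI/bge-m3" if rag_bge > rag_qwen else "Qwen3-Embedding-0.6B"
    # Tiebreaker: BGE-M3 must win mean BEIR NDCG@10 by > 2 pp AND
    # must not lose CoSQA NDCG@10 by more than 2 pp.
    wins_beir = (beir_mean_bge - beir_mean_qwen) > 0.02
    holds_cosqa = (cosqa_qwen - cosqa_bge) <= 0.02
    return "BAAI/bge-m3" if (wins_beir and holds_cosqa) else "Qwen3-Embedding-0.6B"
```

Note the asymmetric default: inside the 5-point band, every path that does not clear both thresholds resolves to the validated baseline.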
---
## Consequences
- **Index rebuild required** regardless of which model is adopted. Vectors from Qwen2.5-1.5B are incompatible with either candidate. The existing index must be deleted before re-ingestion.
- **Two index rebuilds required for the evaluation.** One per candidate for the EvaluateRAG step. Given the current corpus size (190 chunks, 11,498 tokens), rebuild time is not a meaningful constraint.
- **Tokenizer alignment for BGE-M3.** If BGE-M3 is selected, both `OLLAMA_EMB_MODEL_NAME` and `HF_EMB_MODEL_NAME` must be updated. Updating only `OLLAMA_EMB_MODEL_NAME` causes the chunker to estimate token counts using the wrong vocabulary — a silent bug that produces inconsistent chunk sizes without raising any error.
- **Future model changes.** Any future replacement of the embedding model must follow the same evaluation protocol — BEIR benchmarks on the same three datasets plus EvaluateRAG — before an ADR update is accepted. Results must be documented in `research/embeddings/`.
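The tokenizer-alignment pitfall lends itself to a startup guard that fails fast instead of silently miscounting tokens. A hypothetical sketch: the environment variable names come from this ADR, but the family-to-HF-name table and the check itself are illustrative, not existing pipeline code:

```python
import os

# Maps a substring of the Ollama model name to the HF tokenizer repo the
# chunker must use for that family. Illustrative table, not pipeline config.
ALIGNED = {
    "bge-m3": "BAAI/bge-m3",
    "qwen3-embedding-0.6b": "Qwen/Qwen3-Embedding-0.6B",
}

def check_tokenizer_alignment(env=None):
    """Raise at startup if the serving model and the chunker's HF
    tokenizer point at different model families."""
    env = os.environ if env is None else env
    ollama = env.get("OLLAMA_EMB_MODEL_NAME", "").lower()
    hf = env.get("HF_EMB_MODEL_NAME", "")
    for family, hf_name in ALIGNED.items():
        if family in ollama and hf != hf_name:
            raise RuntimeError(
                f"OLLAMA_EMB_MODEL_NAME={ollama!r} but HF_EMB_MODEL_NAME={hf!r}; "
                f"expected {hf_name!r} to keep chunk token counts consistent."
            )
```

Wired into pipeline startup, this turns the silent chunk-size bug described above into an immediate, explicit failure.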