
ADR-0005: Embedding Model Selection — Comparative Evaluation of BGE-M3 vs Qwen3-Embedding-0.6B

Date: 2026-03-19
Status: Under Evaluation
Deciders: Rafael Ruiz (CTO), MrHouston Engineering


Context

The AVAP RAG pipeline requires an embedding model capable of mapping a hybrid corpus into a vector space suitable for semantic retrieval. Understanding the exact composition of this corpus is a prerequisite for model selection.

Corpus characterisation (empirically measured)

A chunk-level audit was performed on the full indexable corpus: the AVAP Language Reference Manual (avap.md) and 40 representative .avap code samples. Results (test_chunks.jsonl, 190 chunks):

Metric                  Value
Total chunks            190
Total tokens indexed    11,498
Minimum chunk size      1 token
Maximum chunk size      833 tokens
Mean chunk size         60.5 tokens
Median chunk size       29 tokens
p90                     117 tokens
p95                     204 tokens
p99                     511 tokens
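
The audit figures above can be reproduced from test_chunks.jsonl with a short script. A minimal sketch, assuming each JSONL record carries a token_count field (the actual field name used in test_chunks.jsonl may differ):

```python
import json
import statistics


def chunk_stats(path):
    """Compute the chunk-size distribution reported in the audit table."""
    with open(path, encoding="utf-8") as fh:
        sizes = [json.loads(line)["token_count"] for line in fh if line.strip()]
    # quantiles(n=100) yields the 1st..99th percentile cut points
    pct = statistics.quantiles(sizes, n=100)
    return {
        "chunks": len(sizes),
        "total_tokens": sum(sizes),
        "min": min(sizes),
        "max": max(sizes),
        "mean": round(statistics.mean(sizes), 1),
        "median": statistics.median(sizes),
        "p90": pct[89],
        "p95": pct[94],
        "p99": pct[98],
    }
```

Note that statistics.quantiles uses the exclusive method by default, so percentile cut points may differ slightly from other tooling on a corpus this small.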

Corpus composition by type:

Type                       Count   Description
Narrative (Spanish prose)  79      LRM explanations, concept descriptions
Code chunks                83      AVAP .avap sample files
BNF formal grammar         9       Formal language specification in English
Code examples              14      Inline examples within LRM
Function signatures        2       Extracted function headers

Linguistic composition: 55% of chunks originate from the LRM (avap.md), written in Spanish with embedded English DSL identifiers. 45% are .avap code files containing English command names (addVar, addResult, registerEndpoint, ormDirect) with Spanish-language string literals and variable names ("Hola", datos_cliente, mi_json_final, contraseña, fecha). 18.9% of chunks (36 out of 190) contain both Spanish content and English DSL commands within the same chunk — intra-chunk multilingual mixing.

Representative examples of intra-chunk multilingual mixing:

// Narrative chunk (Spanish prose + English DSL terms):
"AVAP (Advanced Virtual API Programming) es un DSL (Domain-Specific Language)
Turing Completo, diseñado para la orquestación segura de microservicios e I/O."

// Code chunk (English commands + Spanish identifiers and literals):
addParam("lang", l)
if(l, "es", "=")
    addVar(msg, "Hola")
end()
addResult(msg)

// BNF chunk (formal English grammar):
<program> ::= ( <line> | <block_comment> )*
<statement> ::= <assignment> | <method_call_stmt> | <io_command> | ...

Why the initial model was eliminated

The initial model provided was Qwen2.5-1.5B. Empirical evaluation by MrHouston Engineering (full results in research/embeddings/) demonstrated it is unsuitable for dense retrieval. Qwen2.5-1.5B generates embeddings via the Last Token method: the final token of the sequence is assumed to encode all preceding context. For AVAP code chunks, the last token is always a syntactic closer — end(), }, endLoop() — with zero semantic content. The resulting embeddings are effectively identical across functionally distinct chunks.
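
The failure mode can be demonstrated without the model itself. In the toy sketch below (illustrative only — the 3-dimensional "token embeddings" are made up, not real hidden states), two functionally distinct AVAP chunks both end in end(); last-token pooling collapses them onto the same vector, while mean pooling keeps them apart:

```python
import math

# Toy per-token vectors standing in for transformer hidden states (illustrative).
TOKEN_VECS = {
    "addVar":    [0.9, 0.1, 0.0],
    "addResult": [0.1, 0.9, 0.0],
    "ormDirect": [0.2, 0.1, 0.9],
    "end()":     [0.0, 0.0, 1.0],  # syntactic closer, no semantic content
}

def last_token_pool(tokens):
    """Qwen2.5-style pooling: the embedding is the final token's vector."""
    return TOKEN_VECS[tokens[-1]]

def mean_pool(tokens):
    """Mean pooling: average every token's vector, so the body contributes."""
    vecs = [TOKEN_VECS[t] for t in tokens]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

chunk_a = ["addVar", "addResult", "end()"]  # sets a variable and returns it
chunk_b = ["ormDirect", "end()"]            # database access

# Last-token embeddings are identical (both are the end() vector): similarity 1.0.
sim_last = cosine(last_token_pool(chunk_a), last_token_pool(chunk_b))
# Mean-pooled embeddings still reflect the differing chunk bodies.
sim_mean = cosine(mean_pool(chunk_a), mean_pool(chunk_b))
```

With these toy vectors, sim_last is exactly 1.0 for any two chunks ending in end(), which is the degenerate behaviour the benchmarks below confirm at scale.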

Benchmark confirmation (BEIR evaluation, three datasets):

CodeXGLUE (code retrieval from GitHub repositories):

k     Qwen2.5-1.5B NDCG   Qwen2.5-1.5B Recall   Qwen3-Emb-0.6B NDCG   Qwen3-Emb-0.6B Recall
1     0.00031             0.00031               0.9497                0.9497
5     0.00086             0.00151               0.9716                0.9876
10    0.00118             0.00250               0.9734                0.9929

CoSQA (natural language queries over code — closest proxy to AVAP retrieval):

k     Qwen2.5-1.5B NDCG   Qwen2.5-1.5B Recall   Qwen3-Emb-0.6B NDCG   Qwen3-Emb-0.6B Recall
1     0.00000             0.00000               0.1740                0.1740
10    0.00000             0.00000               0.3909                0.6700
100   0.00210             0.01000               0.4510                0.9520

SciFact (scientific prose — out-of-domain control):

k     Qwen2.5-1.5B NDCG   Qwen2.5-1.5B Recall   Qwen3-Emb-0.6B NDCG   Qwen3-Emb-0.6B Recall
1     0.02333             0.02083               0.5633                0.5299
10    0.04619             0.07417               0.6855                0.8161
100   0.07768             0.23144               0.7129                0.9400

Qwen2.5-1.5B is eliminated. Qwen3-Embedding-0.6B is the validated baseline.

Why a comparative evaluation was required before adopting Qwen3

Qwen3-Embedding-0.6B's benchmark results were obtained on English-only datasets. They eliminated Qwen2.5-1.5B decisively but did not characterise Qwen3's behaviour on the multilingual mixed corpus that AVAP represents. A second candidate — BGE-M3 — presented theoretical advantages for this specific corpus that could not be assessed without empirical comparison.

The index rebuild required to adopt any model is destructive and must be done once. Given that the embedding model directly determines the quality of all RAG retrieval in production, adopting a model without a direct comparison between the two viable candidates would not have met the due diligence required for a decision of this impact.


Decision

A head-to-head comparative evaluation of BGE-M3 and Qwen3-Embedding-0.6B is being conducted under identical conditions before either is adopted as the production embedding model.

The model that demonstrates superior performance under the evaluation criteria defined below is adopted. This ADR moves to Accepted upon completion of that evaluation, with the selected model documented as the outcome.


Candidate Analysis

Qwen3-Embedding-0.6B

Strengths:

  • Already benchmarked on CodeXGLUE, CoSQA and SciFact — strong results documented
  • 32,768 token context window — exceeds corpus requirements by a large margin
  • Same model family as the generation model (Qwen) — shared tokenizer vocabulary
  • Lowest integration risk — already validated in the pipeline

Limitations:

  • Benchmarks are English-only — multilingual performance on AVAP corpus unvalidated
  • Not a dedicated multilingual model — training distribution weighted towards English and Chinese
  • No native sparse retrieval support

Corpus fit assessment: The maximum chunk in the AVAP corpus is 833 tokens — well within both candidates' limits. Qwen3's 32,768 token context window provides no practical advantage over BGE-M3's 8,192 tokens for this corpus. Context window is not a differentiating criterion.

BGE-M3

Strengths:

  • Explicit multilingual contrastive training across 100+ languages including programming languages — direct architectural fit for the intra-chunk Spanish/English/DSL mixing observed in the corpus
  • Supports dense, sparse and multi-vector ColBERT retrieval from a single model inference — future path to consolidating the current BM25+kNN dual-system architecture (ADR-0003)
  • Higher MTEB retrieval score than Qwen3-Embedding-0.6B in the programming domain

Limitations:

  • Not yet benchmarked on CodeXGLUE, CoSQA or SciFact at the time of candidate selection — no prior empirical results for this corpus
  • 8,192 token context window — sufficient for current corpus (max chunk: 833 tokens, 10.2% utilization) but lower headroom for future corpus growth
  • Requires tokenizer alignment: HF_EMB_MODEL_NAME must be updated to BAAI/bge-m3 alongside OLLAMA_EMB_MODEL_NAME to keep chunk token counting consistent

Corpus fit assessment: The intra-chunk multilingual mixing (18.9% of chunks) and the Spanish prose component (79 narrative chunks) are the corpus characteristics most likely to differentiate BGE-M3 from Qwen3. The BEIR and EvaluateRAG evaluations determine whether this theoretical advantage translates to measurable retrieval improvement.

VRAM

Both candidates require approximately 1.13 GiB at FP16 (BGE-M3: 567M parameters; Qwen3: 596M parameters). Combined with a quantized generation model and KV cache, total VRAM remains within the 4 GiB hardware constraint for both. VRAM is not a selection criterion.
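
The parameter-memory component of that footprint is straightforward to estimate (parameters × 2 bytes at FP16); runtime overhead from activations and buffers comes on top, which is why measured usage can sit slightly above the raw figure. A back-of-envelope check:

```python
def fp16_weight_gib(params: int) -> float:
    """Raw weight memory in GiB for an FP16 checkpoint (2 bytes per parameter)."""
    return params * 2 / 2**30

bge_m3 = fp16_weight_gib(567_000_000)  # ≈ 1.06 GiB raw weights
qwen3  = fp16_weight_gib(596_000_000)  # ≈ 1.11 GiB raw weights
```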

Embedding dimension

Both candidates output 1024-dimensional vectors. The Elasticsearch index mapping (int8_hnsw, dims: 1024, cosine similarity) is identical for both candidates. No mapping changes are required between them.
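
For reference, the shared mapping can be sketched as follows. This is a minimal sketch of the dense_vector field only, in Elasticsearch 8.x syntax; the field name "embedding" is illustrative, not taken from the production index:

```python
# Minimal dense_vector mapping shared by both candidates (Elasticsearch 8.x syntax).
# dims, similarity, and index_options follow the ADR; the field name is illustrative.
EMBEDDING_MAPPING = {
    "mappings": {
        "properties": {
            "embedding": {
                "type": "dense_vector",
                "dims": 1024,
                "index": True,
                "similarity": "cosine",
                "index_options": {"type": "int8_hnsw"},
            }
        }
    }
}
```

Because both models emit 1024-dimensional vectors, switching candidates during the evaluation touches only the ingested vectors, never this mapping.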


Evaluation Protocol

Both models are evaluated under identical conditions. All results are documented in research/embeddings/.

Step 1 — BEIR benchmarks

CodeXGLUE, CoSQA and SciFact were run with BGE-M3 using the same BEIR evaluation scripts and configuration used for Qwen3-Embedding-0.6B. Qwen3-Embedding-0.6B results already existed in research/embeddings/ and served as the baseline. Reported metrics: NDCG@k, MAP@k, Recall@k and Precision@k at k = 1, 3, 5, 10, 100.

Step 2 — EvaluateRAG on AVAP corpus

The Elasticsearch index is rebuilt twice — once with each model — and EvaluateRAG is run against the production AVAP golden dataset for both. Reported RAGAS scores: faithfulness, answer_relevancy, context_recall, context_precision, and global score with verdict.

Selection criterion

EvaluateRAG is the primary decision signal. It directly measures retrieval quality on the actual AVAP production corpus — including its intra-chunk multilingual mixing (18.9% of chunks) and domain-specific DSL syntax — and is therefore more representative than any external benchmark. The model with the higher global EvaluateRAG score is adopted.

BEIR results are the secondary signal. The primary BEIR metric is NDCG@10. Among the three datasets, CoSQA is the most representative proxy for the AVAP retrieval use case — it pairs natural language queries with code snippets, mirroring the Spanish prose query / AVAP DSL code retrieval pattern. CoSQA results are weighted accordingly in the comparison.
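
NDCG@k itself reduces to a short computation. A minimal sketch with binary relevance (the setting used by these retrieval benchmarks):

```python
import math

def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """NDCG@k with binary relevance: DCG of the ranking over DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(rank + 2)  # ranks are 0-based, hence the +2
        for rank, doc in enumerate(ranked_ids[:k])
        if doc in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(r + 2) for r in range(ideal_hits))
    return dcg / idcg if idcg else 0.0
```

A relevant document at rank 1 scores 1.0; the same document at rank 2 scores 1/log2(3) ≈ 0.63, which is why NDCG rewards ranking quality rather than mere presence in the top k.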

All margin comparisons use absolute percentage points in NDCG@10 (e.g., 0.39 vs 0.41 is a 2 absolute percentage point difference, not a 5.1% relative difference).
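
The convention can be pinned down in two lines, using the example from the text (0.39 vs 0.41):

```python
def pp_delta(a: float, b: float) -> float:
    """Absolute difference between two scores, in percentage points."""
    return (b - a) * 100

def relative_delta(a: float, b: float) -> float:
    """Relative difference in percent, shown only for contrast with pp_delta."""
    return (b - a) / a * 100

# 0.39 vs 0.41: 2.0 absolute percentage points, ~5.1 % relative.
```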

Tiebreaker

If the EvaluateRAG global scores are within 5 absolute percentage points of each other, the BEIR results determine the outcome under the following conditions:

  • BGE-M3 exceeds Qwen3-Embedding-0.6B by more than 2 absolute percentage points on mean NDCG@10 across all three BEIR datasets, AND
  • BGE-M3 does not underperform Qwen3-Embedding-0.6B by more than 2 absolute percentage points on CoSQA NDCG@10 specifically.

If neither condition is met — that is, if EvaluateRAG scores are within 5 points and BGE-M3 does not clear both BEIR thresholds — Qwen3-Embedding-0.6B is adopted. It carries lower integration risk, its benchmarks are already documented, and it is the validated baseline for the system.
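
The selection logic above is mechanical enough to encode directly. A sketch (function and parameter names are ours, not taken from the evaluation scripts; all scores are fractions in [0, 1]):

```python
def select_model(evalrag_bge: float, evalrag_qwen: float,
                 mean_ndcg10_bge: float, mean_ndcg10_qwen: float,
                 cosqa_ndcg10_bge: float, cosqa_ndcg10_qwen: float) -> str:
    """Apply the ADR-0005 selection criterion and tiebreaker."""
    # Primary signal: EvaluateRAG global score, unless within 5 pp.
    if abs(evalrag_bge - evalrag_qwen) * 100 > 5:
        return "BGE-M3" if evalrag_bge > evalrag_qwen else "Qwen3-Embedding-0.6B"
    # Tiebreaker: BGE-M3 must clear BOTH BEIR thresholds, else Qwen3 is adopted.
    leads_mean = (mean_ndcg10_bge - mean_ndcg10_qwen) * 100 > 2
    holds_cosqa = (cosqa_ndcg10_qwen - cosqa_ndcg10_bge) * 100 <= 2
    return "BGE-M3" if (leads_mean and holds_cosqa) else "Qwen3-Embedding-0.6B"
```

Encoding the criterion before results exist keeps the decision auditable: the function can be replayed against the recorded scores once EvaluateRAG completes.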


Rationale

Step 1 results — BEIR head-to-head comparison

BGE-M3 benchmarks were completed on the same three BEIR datasets using identical evaluation scripts and configuration. Full results are stored in research/embeddings/embedding_eval_results/emb_models_result.json. The following tables compare both candidates side by side.

CodeXGLUE (code retrieval from GitHub repositories):

Metric   k     BGE-M3   Qwen3-Emb-0.6B   Delta (BGE-M3 − Qwen3)
NDCG     1     0.9520   0.9497           +0.23 pp
NDCG     5     0.9738   0.9717           +0.21 pp
NDCG     10    0.9749   0.9734           +0.15 pp
NDCG     100   0.9763   0.9745           +0.18 pp
Recall   1     0.9520   0.9497           +0.23 pp
Recall   5     0.9892   0.9876           +0.16 pp
Recall   10    0.9928   0.9930           −0.02 pp
Recall   100   0.9989   0.9981           +0.08 pp

Both models perform near-identically on CodeXGLUE. All deltas are below 0.25 absolute percentage points. This dataset does not differentiate the candidates.

CoSQA (natural language queries over code — most representative proxy for AVAP retrieval):

Metric   k     BGE-M3   Qwen3-Emb-0.6B   Delta (BGE-M3 − Qwen3)
NDCG     1     0.1160   0.1740           −5.80 pp
NDCG     5     0.2383   0.3351           −9.68 pp
NDCG     10    0.2878   0.3909           −10.31 pp
NDCG     100   0.3631   0.4510           −8.79 pp
Recall   1     0.1160   0.1740           −5.80 pp
Recall   5     0.3660   0.5020           −13.60 pp
Recall   10    0.5160   0.6700           −15.40 pp
Recall   100   0.8740   0.9520           −7.80 pp

Qwen3-Embedding-0.6B outperforms BGE-M3 on CoSQA by a wide margin at every k. The NDCG@10 gap is 10.31 absolute percentage points. CoSQA is the most representative proxy for the AVAP retrieval use case — it pairs natural language queries with code snippets — making this the most significant BEIR result.

SciFact (scientific prose — out-of-domain control):

Metric   k     BGE-M3   Qwen3-Emb-0.6B   Delta (BGE-M3 − Qwen3)
NDCG     1     0.5100   0.5533           −4.33 pp
NDCG     5     0.6190   0.6593           −4.03 pp
NDCG     10    0.6431   0.6785           −3.54 pp
NDCG     100   0.6705   0.7056           −3.51 pp
Recall   1     0.4818   0.5243           −4.25 pp
Recall   5     0.7149   0.7587           −4.38 pp
Recall   10    0.7834   0.8144           −3.10 pp
Recall   100   0.9037   0.9367           −3.30 pp

Qwen3-Embedding-0.6B leads BGE-M3 on SciFact by roughly 3–4 absolute percentage points across all metrics. The gap is consistent but narrower than on CoSQA.

BEIR summary — NDCG@10 comparison

Dataset     BGE-M3   Qwen3-Emb-0.6B   Delta (BGE-M3 − Qwen3)   Leader
CodeXGLUE   0.9749   0.9734           +0.15 pp                 BGE-M3 (marginal)
CoSQA       0.2878   0.3909           −10.31 pp                Qwen3
SciFact     0.6431   0.6785           −3.54 pp                 Qwen3
Mean        0.6353   0.6809           −4.56 pp                 Qwen3

Qwen3-Embedding-0.6B leads on mean NDCG@10 by 4.56 absolute percentage points, driven primarily by a 10.31 pp advantage on CoSQA.

Application of tiebreaker criteria to BEIR results

Per the evaluation protocol, if EvaluateRAG global scores are within 5 absolute percentage points, the BEIR tiebreaker applies. The tiebreaker requires BGE-M3 to meet both conditions:

  1. BGE-M3 must exceed Qwen3 by more than 2 pp on mean NDCG@10. Result: BGE-M3 trails by 4.56 pp. Condition not met.
  2. BGE-M3 must not underperform Qwen3 by more than 2 pp on CoSQA NDCG@10. Result: BGE-M3 trails by 10.31 pp. Condition not met.

Neither tiebreaker condition is satisfied. Under the defined protocol, if the EvaluateRAG evaluation results in a tie (within 5 pp), the BEIR tiebreaker defaults to Qwen3-Embedding-0.6B.

Step 2 results — EvaluateRAG on AVAP corpus

Pending. The golden dataset is not yet in our possession, so Step 2 cannot proceed. Results will be documented here upon completion of the EvaluateRAG evaluation for both models.

Preliminary assessment

The BEIR benchmarks — the secondary decision signal — favour Qwen3-Embedding-0.6B across both the most representative dataset (CoSQA, 10.31 pp) and the out-of-domain control (SciFact, 3.54 pp), with CodeXGLUE effectively tied. BGE-M3's theoretical advantage from multilingual contrastive training does not translate to superior performance on these English-only benchmarks.

The EvaluateRAG evaluation — the primary decision signal — remains pending. It is the only evaluation that directly measures retrieval quality on the actual AVAP corpus with its intra-chunk multilingual mixing. BGE-M3's architectural fit for multilingual content could still produce a measurable advantage on the production corpus that the English-only BEIR benchmarks cannot capture. No final model selection will be made until EvaluateRAG results are available for both candidates.


Consequences

  • Index rebuild required regardless of which model is adopted. Vectors from Qwen2.5-1.5B are incompatible with either candidate. The existing index is deleted before re-ingestion.
  • Two index rebuilds required for the evaluation. One per candidate for the EvaluateRAG step. Given the current corpus size (190 chunks, 11,498 tokens), rebuild time is not a meaningful constraint.
  • Tokenizer alignment for BGE-M3. If BGE-M3 is selected, both OLLAMA_EMB_MODEL_NAME and HF_EMB_MODEL_NAME are updated. Updating only OLLAMA_EMB_MODEL_NAME causes the chunker to estimate token counts using the wrong vocabulary — a silent bug that produces inconsistent chunk sizes without raising any error.
  • Future model changes. Any future replacement of the embedding model follows the same evaluation protocol — BEIR benchmarks on the same three datasets plus EvaluateRAG — before an ADR update is accepted. Results are documented in research/embeddings/.
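
The tokenizer-alignment pitfall in the consequences list lends itself to a startup guard. A sketch, assuming the two environment variables named in this ADR and a hand-maintained pairing table; the Ollama-side tags shown here are illustrative:

```python
import os

# Known-consistent (Ollama model name, Hugging Face model name) pairs.
# Extend this table whenever the embedding model changes; Ollama tags are illustrative.
ALIGNED_PAIRS = {
    ("bge-m3", "BAAI/bge-m3"),
    ("qwen3-embedding-0.6b", "Qwen/Qwen3-Embedding-0.6B"),
}

def check_embedding_config(env=os.environ) -> None:
    """Fail fast if the serving model and the chunker's tokenizer disagree."""
    pair = (env.get("OLLAMA_EMB_MODEL_NAME", "").lower(),
            env.get("HF_EMB_MODEL_NAME", ""))
    if pair not in ALIGNED_PAIRS:
        raise RuntimeError(
            f"Embedding config mismatch {pair}: chunk token counts would be "
            "computed with the wrong vocabulary (silent chunk-size drift)."
        )
```

Running such a check at service startup turns the silent chunk-size bug into a loud configuration error.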