Merge pull request #63 from BRUNIX-AI/mrh-online-dev-partial

Add BEIR analysis notebooks and evaluation pipeline for embedding models
This commit is contained in:
Rafael Ruiz 2026-03-26 09:33:54 -07:00 committed by GitHub
commit 3e47c15966
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194
18 changed files with 58365 additions and 53 deletions


@@ -1,7 +1,7 @@
# ADR-0005: Embedding Model Selection — Comparative Evaluation of BGE-M3 vs Qwen3-Embedding-0.6B
**Date:** 2026-03-19
**Status:** Under Evaluation
**Deciders:** Rafael Ruiz (CTO), MrHouston Engineering
---
@@ -14,27 +14,27 @@ The AVAP RAG pipeline requires an embedding model capable of mapping a hybrid co
A chunk-level audit was performed on the full indexable corpus: the AVAP Language Reference Manual (`avap.md`) and 40 representative `.avap` code samples. Results (`test_chunks.jsonl`, 190 chunks):
| Metric               | Value       |
| -------------------- | ----------- |
| Total chunks         | 190         |
| Total tokens indexed | 11,498      |
| Minimum chunk size   | 1 token     |
| Maximum chunk size   | 833 tokens  |
| Mean chunk size      | 60.5 tokens |
| Median chunk size    | 29 tokens   |
| p90                  | 117 tokens  |
| p95                  | 204 tokens  |
| p99                  | 511 tokens  |
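The audit statistics above can be reproduced with a short sketch over the per-chunk token counts. The `n_tokens` field name in the commented read is a hypothetical assumption; the real `test_chunks.jsonl` schema may store the count under a different key.

```python
import json
from pathlib import Path

import numpy as np

def chunk_stats(token_counts: list[int]) -> dict:
    """Summarise chunk sizes the way the audit table reports them."""
    arr = np.asarray(token_counts)
    return {
        "total_chunks": int(arr.size),
        "total_tokens": int(arr.sum()),
        "min": int(arr.min()),
        "max": int(arr.max()),
        "mean": round(float(arr.mean()), 1),
        "median": float(np.median(arr)),
        "p90": float(np.percentile(arr, 90)),
        "p95": float(np.percentile(arr, 95)),
        "p99": float(np.percentile(arr, 99)),
    }

# Hypothetical read of the audit file (the "n_tokens" key is an assumption):
# counts = [json.loads(l)["n_tokens"] for l in Path("test_chunks.jsonl").read_text().splitlines()]
```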
**Corpus composition by type:**
| Type                      | Count | Description                              |
| ------------------------- | ----- | ---------------------------------------- |
| Narrative (Spanish prose) | 79    | LRM explanations, concept descriptions   |
| Code chunks               | 83    | AVAP `.avap` sample files                |
| BNF formal grammar        | 9     | Formal language specification in English |
| Code examples             | 14    | Inline examples within LRM               |
| Function signatures       | 2     | Extracted function headers               |
**Linguistic composition:** 55% of chunks originate from the LRM (`avap.md`), written in Spanish with embedded English DSL identifiers. 45% are `.avap` code files containing English command names (`addVar`, `addResult`, `registerEndpoint`, `ormDirect`) with Spanish-language string literals and variable names (`"Hola"`, `datos_cliente`, `mi_json_final`, `contraseña`, `fecha`). 18.9% of chunks (36 out of 190) contain both Spanish content and English DSL commands within the same chunk — intra-chunk multilingual mixing.
@@ -65,43 +65,43 @@ Benchmark confirmation (BEIR evaluation, three datasets):
**CodeXGLUE** (code retrieval from GitHub repositories):
| k  | Qwen2.5-1.5B NDCG | Qwen2.5-1.5B Recall | Qwen3-Emb-0.6B NDCG | Qwen3-Emb-0.6B Recall |
| -- | ----------------- | ------------------- | ------------------- | --------------------- |
| 1  | 0.00031           | 0.00031             | **0.9497**          | **0.9497**            |
| 5  | 0.00086           | 0.00151             | **0.9716**          | **0.9876**            |
| 10 | 0.00118           | 0.00250             | **0.9734**          | **0.9929**            |
**CoSQA** (natural language queries over code — closest proxy to AVAP retrieval):
| k   | Qwen2.5-1.5B NDCG | Qwen2.5-1.5B Recall | Qwen3-Emb-0.6B NDCG | Qwen3-Emb-0.6B Recall |
| --- | ----------------- | ------------------- | ------------------- | --------------------- |
| 1   | 0.00000           | 0.00000             | **0.1740**          | **0.1740**            |
| 10  | 0.00000           | 0.00000             | **0.3909**          | **0.6700**            |
| 100 | 0.00210           | 0.01000             | **0.4510**          | **0.9520**            |
**SciFact** (scientific prose — out-of-domain control):
| k   | Qwen2.5-1.5B NDCG | Qwen2.5-1.5B Recall | Qwen3-Emb-0.6B NDCG | Qwen3-Emb-0.6B Recall |
| --- | ----------------- | ------------------- | ------------------- | --------------------- |
| 1   | 0.02333           | 0.02083             | **0.5633**          | **0.5299**            |
| 10  | 0.04619           | 0.07417             | **0.6855**          | **0.8161**            |
| 100 | 0.07768           | 0.23144             | **0.7129**          | **0.9400**            |
Qwen2.5-1.5B is eliminated. **Qwen3-Embedding-0.6B is the validated baseline.**
### Why a comparative evaluation was required before adopting Qwen3
Qwen3-Embedding-0.6B's benchmark results were obtained on English-only datasets. They eliminated Qwen2.5-1.5B decisively but did not characterise Qwen3's behaviour on the multilingual mixed corpus that AVAP represents. A second candidate — **BGE-M3** — presented theoretical advantages for this specific corpus that could not be assessed without empirical comparison.
The index rebuild required to adopt any model is destructive and must be done once. Given that the embedding model directly determines the quality of all RAG retrieval in production, adopting a model without a direct comparison between the two viable candidates would not have met the due diligence required for a decision of this impact.
---
## Decision
A **head-to-head comparative evaluation** of BGE-M3 and Qwen3-Embedding-0.6B is being conducted under identical conditions before either is adopted as the production embedding model.
The model that demonstrates superior performance under the evaluation criteria defined below is adopted. This ADR moves to Accepted upon completion of that evaluation, with the selected model documented as the outcome.
---
@@ -110,12 +110,14 @@ The model that demonstrates superior performance under the evaluation criteria d
### Qwen3-Embedding-0.6B
**Strengths:**
- Already benchmarked on CodeXGLUE, CoSQA and SciFact — strong results documented
- 32,768 token context window — exceeds corpus requirements with large margin
- Same model family as the generation model (Qwen) — shared tokenizer vocabulary
- Lowest integration risk — already validated in the pipeline
**Limitations:**
- Benchmarks are English-only — multilingual performance on AVAP corpus unvalidated
- Not a dedicated multilingual model — training distribution weighted towards English and Chinese
- No native sparse retrieval support
@@ -125,16 +127,18 @@ The model that demonstrates superior performance under the evaluation criteria d
### BGE-M3
**Strengths:**
- Explicit multilingual contrastive training across 100+ languages including programming languages — direct architectural fit for the intra-chunk Spanish/English/DSL mixing observed in the corpus
- Supports dense, sparse and multi-vector ColBERT retrieval from a single model inference — future path to consolidating the current BM25+kNN dual-system architecture (ADR-0003)
- Higher MTEB retrieval score than Qwen3-Embedding-0.6B in the programming domain
**Limitations:**
- Not yet benchmarked on CodeXGLUE, CoSQA or SciFact at the time of candidate selection — no prior empirical results for this corpus
- 8,192 token context window — sufficient for current corpus (max chunk: 833 tokens, 10.2% utilization) but lower headroom for future corpus growth
- Requires tokenizer alignment: `HF_EMB_MODEL_NAME` must be updated to `BAAI/bge-m3` alongside `OLLAMA_EMB_MODEL_NAME` to keep chunk token counting consistent
**Corpus fit assessment:** The intra-chunk multilingual mixing (18.9% of chunks) and the Spanish prose component (79 narrative chunks) are the corpus characteristics most likely to differentiate BGE-M3 from Qwen3. The BEIR and EvaluateRAG evaluations determine whether this theoretical advantage translates to measurable retrieval improvement.
### VRAM
@@ -148,15 +152,15 @@ Both candidates output 1024-dimensional vectors. The Elasticsearch index mapping
## Evaluation Protocol
Both models are evaluated under identical conditions. All results are documented in `research/embeddings/`.
**Step 1 — BEIR benchmarks**
CodeXGLUE, CoSQA and SciFact were run with **BGE-M3** using the same BEIR evaluation scripts and configuration used for Qwen3-Embedding-0.6B. Qwen3-Embedding-0.6B results already existed in `research/embeddings/` and served as the baseline. Reported metrics: NDCG@k, MAP@k, Recall@k and Precision@k at k = 1, 3, 5, 10, 100.
**Step 2 — EvaluateRAG on AVAP corpus**
The Elasticsearch index is rebuilt twice — once with each model — and `EvaluateRAG` is run against the production AVAP golden dataset for both. Reported RAGAS scores: faithfulness, answer_relevancy, context_recall, context_precision, and global score with verdict.
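The "global score with verdict" can be illustrated with a minimal aggregation sketch. The equal-weight mean and the 0.8 pass threshold are assumptions for illustration; `EvaluateRAG` may combine the RAGAS metrics differently.

```python
def global_rag_score(scores: dict[str, float], threshold: float = 0.8) -> tuple[float, str]:
    """Collapse the four RAGAS metrics into one score plus a verdict.

    Equal weighting and the 0.8 threshold are illustrative assumptions.
    """
    metrics = ("faithfulness", "answer_relevancy", "context_recall", "context_precision")
    score = sum(scores[m] for m in metrics) / len(metrics)
    return score, ("PASS" if score >= threshold else "FAIL")
```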
**Selection criterion**
@@ -170,16 +174,103 @@ All margin comparisons use **absolute percentage points** in NDCG@10 (e.g., 0.39
If the EvaluateRAG global scores are within 5 absolute percentage points of each other, the BEIR results determine the outcome under the following conditions:
- BGE-M3 exceeds Qwen3-Embedding-0.6B by more than 2 absolute percentage points on mean NDCG@10 across all three BEIR datasets, AND
- BGE-M3 does not underperform Qwen3-Embedding-0.6B by more than 2 absolute percentage points on CoSQA NDCG@10 specifically.
If neither condition is met — that is, if EvaluateRAG scores are within 5 points and BGE-M3 does not clear both BEIR thresholds — Qwen3-Embedding-0.6B is adopted. It carries lower integration risk, its benchmarks are already documented, and it is the validated baseline for the system.
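Expressed as code, the selection criterion reads roughly as follows. This is a hypothetical helper, not project code; all scores are in absolute percentage points, and it assumes that outside the 5 pp band the higher EvaluateRAG score wins directly.

```python
def select_model(
    rag_bge: float, rag_qwen: float,
    mean_ndcg10_bge: float, mean_ndcg10_qwen: float,
    cosqa_ndcg10_bge: float, cosqa_ndcg10_qwen: float,
) -> str:
    """Apply the ADR-0005 selection criterion (scores in percentage points)."""
    if abs(rag_bge - rag_qwen) > 5.0:
        # Primary signal is decisive outside the 5 pp band (assumption).
        return "BGE-M3" if rag_bge > rag_qwen else "Qwen3-Embedding-0.6B"
    # BEIR tiebreaker: BGE-M3 must clear both thresholds to be selected.
    wins_mean = (mean_ndcg10_bge - mean_ndcg10_qwen) > 2.0
    holds_cosqa = (cosqa_ndcg10_qwen - cosqa_ndcg10_bge) <= 2.0
    return "BGE-M3" if (wins_mean and holds_cosqa) else "Qwen3-Embedding-0.6B"
```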
---
## Rationale
### Step 1 results — BEIR head-to-head comparison
BGE-M3 benchmarks were completed on the same three BEIR datasets using identical evaluation scripts and configuration. Full results are stored in `research/embeddings/embedding_eval_results/emb_models_result.json`. The following tables compare both candidates side by side.
**CodeXGLUE** (code retrieval from GitHub repositories):
| Metric | k   | BGE-M3     | Qwen3-Emb-0.6B | Delta (BGE-M3 − Qwen3) |
| ------ | --- | ---------- | -------------- | ---------------------- |
| NDCG   | 1   | **0.9520** | 0.9497         | +0.23 pp               |
| NDCG   | 5   | **0.9738** | 0.9717         | +0.21 pp               |
| NDCG   | 10  | **0.9749** | 0.9734         | +0.15 pp               |
| NDCG   | 100 | **0.9763** | 0.9745         | +0.18 pp               |
| Recall | 1   | **0.9520** | 0.9497         | +0.23 pp               |
| Recall | 5   | **0.9892** | 0.9876         | +0.16 pp               |
| Recall | 10  | 0.9928     | **0.9930**     | −0.02 pp               |
| Recall | 100 | **0.9989** | 0.9981         | +0.08 pp               |
Both models perform near-identically on CodeXGLUE. All deltas are below 0.25 absolute percentage points. This dataset does not differentiate the candidates.
**CoSQA** (natural language queries over code — most representative proxy for AVAP retrieval):
| Metric | k   | BGE-M3 | Qwen3-Emb-0.6B | Delta (BGE-M3 − Qwen3) |
| ------ | --- | ------ | -------------- | ---------------------- |
| NDCG   | 1   | 0.1160 | **0.1740**     | −5.80 pp               |
| NDCG   | 5   | 0.2383 | **0.3351**     | −9.68 pp               |
| NDCG   | 10  | 0.2878 | **0.3909**     | −10.31 pp              |
| NDCG   | 100 | 0.3631 | **0.4510**     | −8.79 pp               |
| Recall | 1   | 0.1160 | **0.1740**     | −5.80 pp               |
| Recall | 5   | 0.3660 | **0.5020**     | −13.60 pp              |
| Recall | 10  | 0.5160 | **0.6700**     | −15.40 pp              |
| Recall | 100 | 0.8740 | **0.9520**     | −7.80 pp               |
Qwen3-Embedding-0.6B outperforms BGE-M3 on CoSQA by a wide margin at every k. The NDCG@10 gap is 10.31 absolute percentage points. CoSQA is the most representative proxy for the AVAP retrieval use case — it pairs natural language queries with code snippets — making this the most significant BEIR result.
**SciFact** (scientific prose — out-of-domain control):
| Metric | k   | BGE-M3 | Qwen3-Emb-0.6B | Delta (BGE-M3 − Qwen3) |
| ------ | --- | ------ | -------------- | ---------------------- |
| NDCG   | 1   | 0.5100 | **0.5533**     | −4.33 pp               |
| NDCG   | 5   | 0.6190 | **0.6593**     | −4.03 pp               |
| NDCG   | 10  | 0.6431 | **0.6785**     | −3.54 pp               |
| NDCG   | 100 | 0.6705 | **0.7056**     | −3.51 pp               |
| Recall | 1   | 0.4818 | **0.5243**     | −4.25 pp               |
| Recall | 5   | 0.7149 | **0.7587**     | −4.38 pp               |
| Recall | 10  | 0.7834 | **0.8144**     | −3.10 pp               |
| Recall | 100 | 0.9037 | **0.9367**     | −3.30 pp               |
Qwen3-Embedding-0.6B leads BGE-M3 on SciFact by 3–4 absolute percentage points across all metrics. The gap is consistent but narrower than on CoSQA.
### BEIR summary — NDCG@10 comparison
| Dataset   | BGE-M3     | Qwen3-Emb-0.6B | Delta        | Leader            |
| --------- | ---------- | -------------- | ------------ | ----------------- |
| CodeXGLUE | **0.9749** | 0.9734         | +0.15 pp     | BGE-M3 (marginal) |
| CoSQA     | 0.2878     | **0.3909**     | −10.31 pp    | **Qwen3**         |
| SciFact   | 0.6431     | **0.6785**     | −3.54 pp     | **Qwen3**         |
| **Mean**  | 0.6353     | **0.6809**     | **−4.56 pp** | **Qwen3**         |
Qwen3-Embedding-0.6B leads on mean NDCG@10 by 4.56 absolute percentage points, driven primarily by a 10.31 pp advantage on CoSQA.
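The mean delta can be recomputed directly from `research/embeddings/embedding_eval_results/emb_models_result.json` (model keys as stored in that file). A sketch, not project code:

```python
import json

def mean_ndcg10_delta(results: dict, model_a: str, model_b: str) -> float:
    """Mean NDCG@10 gap (model_a minus model_b) in absolute percentage points."""
    datasets = results[model_a].keys()
    deltas = [
        results[model_a][d]["NDCG"]["NDCG@10"] - results[model_b][d]["NDCG"]["NDCG@10"]
        for d in datasets
    ]
    return 100.0 * sum(deltas) / len(deltas)

# with open("research/embeddings/embedding_eval_results/emb_models_result.json") as f:
#     results = json.load(f)
# mean_ndcg10_delta(results, "bge-m3:latest", "qwen3-0.6B-emb:latest")  # about -4.56
```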
### Application of tiebreaker criteria to BEIR results
Per the evaluation protocol, if EvaluateRAG global scores are within 5 absolute percentage points, the BEIR tiebreaker applies. The tiebreaker requires BGE-M3 to meet **both** conditions:
1. **BGE-M3 must exceed Qwen3 by more than 2 pp on mean NDCG@10.** Result: BGE-M3 trails by 4.56 pp. **Condition not met.**
2. **BGE-M3 must not underperform Qwen3 by more than 2 pp on CoSQA NDCG@10.** Result: BGE-M3 trails by 10.31 pp. **Condition not met.**
Neither tiebreaker condition is satisfied. Under the defined protocol, if the EvaluateRAG evaluation results in a tie (within 5 pp), the BEIR tiebreaker therefore resolves in favour of Qwen3-Embedding-0.6B.
### Step 2 results — EvaluateRAG on AVAP corpus
The golden dataset is not yet available, so Step 2 cannot proceed at this time.
_Pending. Results will be documented here upon completion of the EvaluateRAG evaluation for both models._
### Preliminary assessment
The BEIR benchmarks — the secondary decision signal — favour Qwen3-Embedding-0.6B across both the most representative dataset (CoSQA, 10.31 pp) and the out-of-domain control (SciFact, 3.54 pp), with CodeXGLUE effectively tied. BGE-M3's theoretical advantage from multilingual contrastive training does not translate to superior performance on these English-only benchmarks.
The EvaluateRAG evaluation — the primary decision signal — remains pending. It is the only evaluation that directly measures retrieval quality on the actual AVAP corpus with its intra-chunk multilingual mixing. BGE-M3's architectural fit for multilingual content could still produce a measurable advantage on the production corpus that the English-only BEIR benchmarks cannot capture. No final model selection will be made until EvaluateRAG results are available for both candidates.
Qwen3-Embedding is documented as multilingual, with strong scores on multilingual benchmarks. The definitive answer, however, will come from the evaluation scores on the AVAP corpus itself.
---
## Consequences
- **Index rebuild required** regardless of which model is adopted. Vectors from Qwen2.5-1.5B are incompatible with either candidate. The existing index is deleted before re-ingestion.
- **Two index rebuilds required for the evaluation.** One per candidate for the EvaluateRAG step. Given the current corpus size (190 chunks, 11,498 tokens), rebuild time is not a meaningful constraint.
- **Tokenizer alignment for BGE-M3.** If BGE-M3 is selected, both `OLLAMA_EMB_MODEL_NAME` and `HF_EMB_MODEL_NAME` are updated. Updating only `OLLAMA_EMB_MODEL_NAME` causes the chunker to estimate token counts using the wrong vocabulary — a silent bug that produces inconsistent chunk sizes without raising any error.
- **Future model changes.** Any future replacement of the embedding model follows the same evaluation protocol — BEIR benchmarks on the same three datasets plus EvaluateRAG — before an ADR update is accepted. Results are documented in `research/embeddings/`.
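The tokenizer-alignment bullet can be guarded mechanically. A hypothetical startup check follows; only the two env var names come from this ADR, and the Ollama-name-to-HF-repo mapping (including the Qwen HF repo name) is an assumption for illustration.

```python
# Hypothetical guard against the silent token-counting mismatch described above.
OLLAMA_TO_HF = {
    "bge-m3:latest": "BAAI/bge-m3",
    "qwen3-0.6B-emb:latest": "Qwen/Qwen3-Embedding-0.6B",  # HF repo name assumed
}

def check_tokenizer_alignment(ollama_emb_model: str, hf_emb_model: str) -> None:
    """Fail fast if chunk token counting would use the wrong vocabulary."""
    expected = OLLAMA_TO_HF.get(ollama_emb_model)
    if expected is not None and expected != hf_emb_model:
        raise ValueError(
            f"OLLAMA_EMB_MODEL_NAME={ollama_emb_model!r} expects "
            f"HF_EMB_MODEL_NAME={expected!r}, got {hf_emb_model!r}"
        )
```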

File diff suppressed because one or more lines are too long

File diff suppressed because it is too large

File diff suppressed because one or more lines are too long

Binary file not shown.


@@ -0,0 +1,162 @@
{
"bge-m3:latest": {
"scifact": {
"NDCG": {
"NDCG@1": 0.51,
"NDCG@5": 0.61904,
"NDCG@10": 0.64312,
"NDCG@100": 0.6705
},
"MAP": {
"MAP@1": 0.48178,
"MAP@5": 0.58023,
"MAP@10": 0.59181,
"MAP@100": 0.59849
},
"Recall": {
"Recall@1": 0.48178,
"Recall@5": 0.71489,
"Recall@10": 0.78344,
"Recall@100": 0.90367
},
"Precision": {
"P@1": 0.51,
"P@5": 0.15667,
"P@10": 0.088,
"P@100": 0.01027
}
},
"cosqa": {
"NDCG": {
"NDCG@1": 0.116,
"NDCG@5": 0.23831,
"NDCG@10": 0.28783,
"NDCG@100": 0.36311
},
"MAP": {
"MAP@1": 0.116,
"MAP@5": 0.19687,
"MAP@10": 0.21791,
"MAP@100": 0.23272
},
"Recall": {
"Recall@1": 0.116,
"Recall@5": 0.366,
"Recall@10": 0.516,
"Recall@100": 0.874
},
"Precision": {
"P@1": 0.116,
"P@5": 0.0732,
"P@10": 0.0516,
"P@100": 0.00874
}
},
"codexglue": {
"NDCG": {
"NDCG@1": 0.952,
"NDCG@5": 0.97379,
"NDCG@10": 0.97494,
"NDCG@100": 0.97629
},
"MAP": {
"MAP@1": 0.952,
"MAP@5": 0.96849,
"MAP@10": 0.96897,
"MAP@100": 0.96926
},
"Recall": {
"Recall@1": 0.952,
"Recall@5": 0.98922,
"Recall@10": 0.99276,
"Recall@100": 0.99885
},
"Precision": {
"P@1": 0.952,
"P@5": 0.19784,
"P@10": 0.09928,
"P@100": 0.00999
}
}
},
"qwen3-0.6B-emb:latest": {
"scifact": {
"NDCG": {
"NDCG@1": 0.55333,
"NDCG@5": 0.65926,
"NDCG@10": 0.67848,
"NDCG@100": 0.70557
},
"MAP": {
"MAP@1": 0.52428,
"MAP@5": 0.62128,
"MAP@10": 0.63094,
"MAP@100": 0.63723
},
"Recall": {
"Recall@1": 0.52428,
"Recall@5": 0.75867,
"Recall@10": 0.81444,
"Recall@100": 0.93667
},
"Precision": {
"P@1": 0.55333,
"P@5": 0.17067,
"P@10": 0.093,
"P@100": 0.01067
}
},
"cosqa": {
"NDCG": {
"NDCG@1": 0.174,
"NDCG@5": 0.33509,
"NDCG@10": 0.39086,
"NDCG@100": 0.45099
},
"MAP": {
"MAP@1": 0.174,
"MAP@5": 0.2808,
"MAP@10": 0.30466,
"MAP@100": 0.31702
},
"Recall": {
"Recall@1": 0.174,
"Recall@5": 0.502,
"Recall@10": 0.67,
"Recall@100": 0.952
},
"Precision": {
"P@1": 0.174,
"P@5": 0.1004,
"P@10": 0.067,
"P@100": 0.00952
}
},
"codexglue": {
"NDCG": {
"NDCG@1": 0.94971,
"NDCG@5": 0.97166,
"NDCG@10": 0.97342,
"NDCG@100": 0.97453
},
"MAP": {
"MAP@1": 0.94971,
"MAP@5": 0.9662,
"MAP@10": 0.96694,
"MAP@100": 0.96718
},
"Recall": {
"Recall@1": 0.94971,
"Recall@5": 0.98761,
"Recall@10": 0.99297,
"Recall@100": 0.99807
},
"Precision": {
"P@1": 0.94971,
"P@5": 0.19752,
"P@10": 0.0993,
"P@100": 0.00998
}
}
}
}


@@ -261,8 +261,6 @@
"print(\"Recall:\", recall_qwen_2)\n",
"print(\"Precision:\", precision_qwen_2)"
]
},
{
"cell_type": "code",
@@ -308,7 +306,6 @@
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {


@@ -0,0 +1,431 @@
"""
Embedding Evaluation Pipeline
Evaluate embedding models across CodexGlue, CoSQA, and SciFact benchmarks.
Supports multiple embedding providers via factory methods.
"""
import json
from pathlib import Path
from typing import Any, Dict, List, Union
import numpy as np
import typer
from langchain_ollama import OllamaEmbeddings
from langchain_huggingface import HuggingFaceEmbeddings
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch
from beir import util
from datasets import load_dataset
from src.config import settings
# Import embedding factory
project_root = settings.proj_root
DATASETS_ROOT = project_root / "research" / "embeddings" / "datasets"
app = typer.Typer()
def _has_local_beir_files(data_path: Path) -> bool:
"""Return True when a dataset folder already has the required BEIR files."""
required_files = [
data_path / "corpus.jsonl",
data_path / "queries.jsonl",
data_path / "qrels" / "test.tsv",
]
return all(path.exists() and path.stat().st_size > 0 for path in required_files)
def _load_local_beir_dataset(data_path: Path) -> tuple[Dict, Dict, Dict]:
"""Load a BEIR-formatted dataset from local disk."""
return GenericDataLoader(str(data_path)).load(split="test")
class BEIROllamaEmbeddings:
"""
Adapter that makes LangChain's OllamaEmbeddings compatible with BEIR.
"""
def __init__(
self,
base_url: str,
model: str,
batch_size: int = 64,
) -> None:
self.batch_size = batch_size
self.embeddings = OllamaEmbeddings(
base_url=base_url,
model=model,
)
def _batch_embed(self, texts: List[str]) -> np.ndarray:
vectors = []
for i in range(0, len(texts), self.batch_size):
batch = texts[i : i + self.batch_size]
batch_vectors = self.embeddings.embed_documents(batch)
# Handle NaN values by replacing with zeros
for vec in batch_vectors:
if isinstance(vec, (list, np.ndarray)):
vec_array = np.asarray(vec, dtype=np.float32)
# Replace NaN with zeros
vec_array = np.nan_to_num(vec_array, nan=0.0, posinf=0.0, neginf=0.0)
vectors.append(vec_array)
else:
vectors.append(vec)
return np.asarray(vectors, dtype=np.float32)
def encode_queries(self, queries: List[str], **kwargs) -> np.ndarray:
"""
BEIR query encoder
"""
# Filter and clean queries - replace empty ones with placeholder
cleaned_queries = []
for q in queries:
if isinstance(q, str):
cleaned = q.strip()
if not cleaned:
cleaned = "[EMPTY]"
else:
cleaned = "[INVALID]"
cleaned_queries.append(cleaned)
return self._batch_embed(cleaned_queries)
def encode_corpus(
self,
corpus: Union[List[Dict[str, str]], Dict[str, Dict[str, str]]],
**kwargs,
) -> np.ndarray:
"""
BEIR corpus encoder
"""
if isinstance(corpus, dict):
corpus = list(corpus.values())
texts = []
for doc in corpus:
title = (doc.get("title") or "").strip()
text = (doc.get("text") or "").strip()
# Combine title and text, filtering out empty strings
combined = " ".join(filter(None, [title, text]))
# Use placeholder if both are empty to avoid NaN embeddings
if not combined:
combined = "[EMPTY]"
texts.append(combined)
return self._batch_embed(texts)
class BEIRHuggingFaceEmbeddings:
"""
Adapter that makes LangChain's HuggingFaceEmbeddings compatible with BEIR.
"""
def __init__(self, model: str, batch_size: int = 64) -> None:
self.batch_size = batch_size
self.embeddings = HuggingFaceEmbeddings(model_name=model)
def _batch_embed(self, texts: List[str]) -> np.ndarray:
vectors = []
for i in range(0, len(texts), self.batch_size):
batch = texts[i : i + self.batch_size]
batch_vectors = self.embeddings.embed_documents(batch)
# Handle NaN values
for vec in batch_vectors:
if isinstance(vec, (list, np.ndarray)):
vec_array = np.asarray(vec, dtype=np.float32)
vec_array = np.nan_to_num(vec_array, nan=0.0, posinf=0.0, neginf=0.0)
vectors.append(vec_array)
else:
vectors.append(vec)
return np.asarray(vectors, dtype=np.float32)
def encode_queries(self, queries: List[str], **kwargs) -> np.ndarray:
"""BEIR query encoder"""
cleaned_queries = []
for q in queries:
if isinstance(q, str):
cleaned = q.strip()
if not cleaned:
cleaned = "[EMPTY]"
else:
cleaned = "[INVALID]"
cleaned_queries.append(cleaned)
return self._batch_embed(cleaned_queries)
def encode_corpus(
self,
corpus: Union[List[Dict[str, str]], Dict[str, Dict[str, str]]],
**kwargs,
) -> np.ndarray:
"""BEIR corpus encoder"""
if isinstance(corpus, dict):
corpus = list(corpus.values())
texts = []
for doc in corpus:
title = (doc.get("title") or "").strip()
text = (doc.get("text") or "").strip()
combined = " ".join(filter(None, [title, text]))
if not combined:
combined = "[EMPTY]"
texts.append(combined)
return self._batch_embed(texts)
def load_scifact_dataset() -> tuple[Dict, Dict, Dict]:
"""Load SciFact benchmark."""
DATASETS_ROOT.mkdir(parents=True, exist_ok=True)
scifact_path = DATASETS_ROOT / "scifact"
if _has_local_beir_files(scifact_path):
print(" Using local SciFact dataset cache")
return _load_local_beir_dataset(scifact_path)
print(" SciFact dataset not found locally. Downloading...")
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/scifact.zip"
data_path = util.download_and_unzip(url, out_dir=str(DATASETS_ROOT))
downloaded_path = Path(data_path)
if downloaded_path.name == "scifact" and _has_local_beir_files(downloaded_path):
return _load_local_beir_dataset(downloaded_path)
return _load_local_beir_dataset(scifact_path)
def load_cosqa_dataset() -> tuple[Dict, Dict, Dict]:
"""Load CoSQA benchmark."""
data_path = DATASETS_ROOT / "cosqa"
if _has_local_beir_files(data_path):
print(" Using local CoSQA dataset cache")
return _load_local_beir_dataset(data_path)
print(" CoSQA dataset not found locally. Downloading and preparing...")
(data_path / "qrels").mkdir(parents=True, exist_ok=True)
# Load from HuggingFace
hf_corpus = load_dataset("CoIR-Retrieval/cosqa", "corpus", split="corpus")
hf_queries = load_dataset("CoIR-Retrieval/cosqa", "queries", split="queries")
hf_qrels = load_dataset("CoIR-Retrieval/cosqa", "default", split="test")
# Save in BEIR format
with open(data_path / "corpus.jsonl", "w") as f:
for item in hf_corpus:
f.write(
json.dumps(
{"_id": str(item["_id"]), "text": item["text"], "title": ""}
)
+ "\n"
)
with open(data_path / "queries.jsonl", "w") as f:
for item in hf_queries:
f.write(json.dumps({"_id": str(item["_id"]), "text": item["text"]}) + "\n")
with open(data_path / "qrels" / "test.tsv", "w") as f:
f.write("query-id\tcorpus-id\tscore\n")
for item in hf_qrels:
f.write(f"{item['query-id']}\t{item['corpus-id']}\t{item['score']}\n")
return _load_local_beir_dataset(data_path)
def load_codexglue_dataset() -> tuple[Dict, Dict, Dict]:
"""Load CodexGlue benchmark."""
data_path = DATASETS_ROOT / "codexglue"
if _has_local_beir_files(data_path):
print(" Using local CodexGlue dataset cache")
return _load_local_beir_dataset(data_path)
print(" CodexGlue dataset not found locally. Downloading and preparing...")
(data_path / "qrels").mkdir(parents=True, exist_ok=True)
raw_dataset = load_dataset("google/code_x_glue_tc_nl_code_search_adv", split="test")
with open(data_path / "corpus.jsonl", "w") as corpus_file:
for i, data in enumerate(raw_dataset):
docid = f"doc_{i}"
corpus_file.write(
json.dumps(
{
"_id": docid,
"title": data.get("func_name", ""),
"text": data["code"],
}
)
+ "\n"
)
with open(data_path / "queries.jsonl", "w") as query_file:
for i, data in enumerate(raw_dataset):
queryid = f"q_{i}"
query_file.write(
json.dumps({"_id": queryid, "text": data["docstring"]}) + "\n"
)
with open(data_path / "qrels" / "test.tsv", "w") as qrels_file:
qrels_file.write("query-id\tcorpus-id\tscore\n")
for i, _ in enumerate(raw_dataset):
qrels_file.write(f"q_{i}\tdoc_{i}\t1\n")
return _load_local_beir_dataset(data_path)
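
# The three loaders above all converge on the same on-disk BEIR layout:
# corpus.jsonl, queries.jsonl, and qrels/test.tsv under the dataset folder.
# Minimal illustrative sketch of that layout (not called by the pipeline; the
# real files are written by the loaders above, this only documents the shape):
def _demo_beir_layout(root) -> dict:
    """Write a one-document BEIR dataset under `root` and read the corpus back."""
    import json
    from pathlib import Path

    root = Path(root)
    (root / "qrels").mkdir(parents=True, exist_ok=True)
    (root / "corpus.jsonl").write_text(
        json.dumps({"_id": "doc_0", "title": "add", "text": "def add(a, b): return a + b"}) + "\n"
    )
    (root / "queries.jsonl").write_text(
        json.dumps({"_id": "q_0", "text": "add two numbers"}) + "\n"
    )
    (root / "qrels" / "test.tsv").write_text(
        "query-id\tcorpus-id\tscore\nq_0\tdoc_0\t1\n"
    )
    with open(root / "corpus.jsonl") as f:
        return {doc["_id"]: doc for doc in map(json.loads, f)}
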
BENCHMARK_LOADERS = {
"scifact": load_scifact_dataset,
"cosqa": load_cosqa_dataset,
"codexglue": load_codexglue_dataset,
}
def evaluate_model_on_benchmark(
    benchmark: str, provider: str, model: str, k_values: Union[List[int], None] = None
) -> Dict[str, Any]:
"""Evaluate a model on a benchmark."""
if k_values is None:
k_values = [1, 5, 10, 100]
print(f" Loading {benchmark.upper()} dataset...")
corpus, queries, qrels = BENCHMARK_LOADERS[benchmark]()
print(f" Corpus: {len(corpus)}, Queries: {len(queries)}")
# Select adapter based on provider
if provider == "ollama":
adapter = BEIROllamaEmbeddings(
base_url=settings.ollama_local_url,
model=model,
batch_size=64
)
elif provider == "huggingface":
adapter = BEIRHuggingFaceEmbeddings(model=model, batch_size=64)
else:
raise ValueError(f"Unknown provider: {provider}")
retriever = DenseRetrievalExactSearch(adapter, batch_size=64)
evaluator = EvaluateRetrieval(retriever, score_function="cos_sim")
print(" Running retrieval...")
results = evaluator.retrieve(corpus, queries)
print(" Computing metrics...")
ndcg, _map, recall, precision = evaluator.evaluate(qrels, results, k_values)
return {"NDCG": ndcg, "MAP": _map, "Recall": recall, "Precision": precision}
def parse_model_spec(model_spec: str) -> tuple[str, str]:
"""
Parse model spec. Format: "provider:model_name" (default provider: ollama).
Examples: "ollama:qwen3", "openai:text-embedding-3-small", "bge-me3:latest"
"""
if ":" in model_spec:
parts = model_spec.split(":", 1)
if parts[0].lower() in ["ollama", "openai", "huggingface", "bedrock"]:
return parts[0].lower(), parts[1]
return "ollama", model_spec
def evaluate_models(
models: List[str], benchmarks: List[str], output_folder: Path, k_values: List[int]
) -> None:
"""Evaluate multiple models on multiple benchmarks."""
output_folder.mkdir(parents=True, exist_ok=True)
all_results = {}
for model_spec in models:
provider, model_name = parse_model_spec(model_spec)
print(f"\n{'='*60}\nModel: {model_name} ({provider})\n{'='*60}")
model_results = {}
for benchmark in benchmarks:
if benchmark not in BENCHMARK_LOADERS:
print(f"✗ Unknown benchmark: {benchmark}")
continue
print(f"\nEvaluating on {benchmark}...")
try:
metrics = evaluate_model_on_benchmark(
benchmark, provider, model_name, k_values=k_values
)
model_results[benchmark] = metrics
print("✓ Complete")
except Exception as e:
print(f"✗ Error: {e}")
import traceback
traceback.print_exc()
all_results[model_spec] = model_results
    # Model specs may contain ":" and "/", which are not safe in file names.
    safe_models = [m.replace("/", "-").replace(":", "-") for m in models]
    output_file = output_folder / f"results_{'_'.join(safe_models)}_{'_'.join(benchmarks)}.json"
print(f"\n{'='*60}\nSaving to {output_file}")
with open(output_file, "w") as f:
json.dump(all_results, f, indent=2)
print("✓ Done")
@app.command()
def main(
models: List[str] = typer.Option(
None,
"--model",
"-m",
help="Model spec (format: 'provider:model' or just 'model' for Ollama). "
"Providers: ollama, huggingface. Can specify multiple times. "
"Default: huggingface:sentence-transformers/all-MiniLM-L6-v2",
),
benchmarks: List[str] = typer.Option(
None,
"--benchmark",
"-b",
help="Benchmark name (scifact, cosqa, codexglue). Default: all three",
),
output_folder: Path = typer.Option(
Path("research/embedding_eval_results"),
"--output",
"-o",
help="Output folder for results.",
),
k_values: str = typer.Option(
"1,5,10,100",
"--k-values",
"-k",
help="Comma-separated k values for metrics.",
),
) -> None:
"""
Evaluate embedding models on CodexGlue, CoSQA, and SciFact benchmarks.
Examples:
    # Default Ollama models (requires a local Ollama server)
    python evaluate_embeddings_pipeline.py
# Different HuggingFace model
python evaluate_embeddings_pipeline.py -m huggingface:sentence-transformers/bge-small-en-v1.5
# Ollama model
    python evaluate_embeddings_pipeline.py -m ollama:qwen3-0.6B-emb:latest
# Multiple models and single benchmark
python evaluate_embeddings_pipeline.py -m huggingface:all-MiniLM-L6-v2 -m ollama:bge-m3 -b scifact -o ./results
"""
if not models:
models = ["bge-m3:latest", "qwen3-0.6B-emb:latest"]
if not benchmarks:
benchmarks = ["scifact", "cosqa", "codexglue"]
k_list = [int(k.strip()) for k in k_values.split(",")]
evaluate_models(models=models, benchmarks=benchmarks, output_folder=output_folder, k_values=k_list)
if __name__ == "__main__":
app()