A chunk-level audit was performed on the full indexable corpus: the AVAP Language Reference Manual (`avap.md`) and 40 representative `.avap` code samples. Results (`test_chunks.jsonl`, 190 chunks):

| Metric               | Value       |
| -------------------- | ----------- |
| Total chunks         | 190         |
| Total tokens indexed | 11,498      |
| Minimum chunk size   | 1 token     |
| Maximum chunk size   | 833 tokens  |
| Mean chunk size      | 60.5 tokens |
| Median chunk size    | 29 tokens   |
| p90                  | 117 tokens  |
| p95                  | 204 tokens  |
| p99                  | 511 tokens  |

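These summary statistics are straightforward to recompute from the audit file. The sketch below assumes each JSONL line carries a precomputed `n_tokens` field; that schema is an assumption, and the real `test_chunks.jsonl` layout may differ.

```python
import json
import math

def chunk_stats(path: str) -> dict:
    """Summarise chunk sizes from a JSONL audit file.

    Assumes each line is a JSON object with a precomputed `n_tokens`
    field (a hypothetical schema; the real file layout may differ).
    """
    sizes = sorted(
        json.loads(line)["n_tokens"]
        for line in open(path, encoding="utf-8")
    )

    def pct(p: float) -> int:
        # Nearest-rank percentile over the sorted sizes.
        return sizes[max(0, math.ceil(p / 100 * len(sizes)) - 1)]

    return {
        "total_chunks": len(sizes),
        "total_tokens": sum(sizes),
        "min": sizes[0],
        "max": sizes[-1],
        "mean": round(sum(sizes) / len(sizes), 1),
        "median": pct(50),
        "p90": pct(90),
        "p95": pct(95),
        "p99": pct(99),
    }
```

Nearest-rank percentiles are used here for simplicity; an interpolating method would give slightly different p90/p95/p99 values on small corpora.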
**Corpus composition by type:**

| Type                      | Count | Description                              |
| ------------------------- | ----- | ---------------------------------------- |
| Narrative (Spanish prose) | 79    | LRM explanations, concept descriptions   |
| Code chunks               | 83    | AVAP `.avap` sample files                |
| BNF formal grammar        | 9     | Formal language specification in English |
| Code examples             | 14    | Inline examples within LRM               |
| Function signatures       | 2     | Extracted function headers               |

**Linguistic composition:** 55% of chunks originate from the LRM (`avap.md`), written in Spanish with embedded English DSL identifiers. 45% are `.avap` code files containing English command names (`addVar`, `addResult`, `registerEndpoint`, `ormDirect`) with Spanish-language string literals and variable names (`"Hola"`, `datos_cliente`, `mi_json_final`, `contraseña`, `fecha`). 18.9% of chunks (36 out of 190) contain both Spanish content and English DSL commands within the same chunk — intra-chunk multilingual mixing.
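The intra-chunk mixing figure can be estimated with a heuristic of the following shape. The DSL command names come from the audit above; the Spanish-marker regex is purely illustrative and is not the audit's actual detection method.

```python
import re

# DSL command names observed in the corpus; the Spanish-marker
# heuristic below is illustrative, not the audit's actual method.
DSL_COMMANDS = ("addVar", "addResult", "registerEndpoint", "ormDirect")
SPANISH_MARKERS = re.compile(
    r"[áéíóúñ¿¡]|\b(?:el|la|los|las|para|datos|función)\b",
    re.IGNORECASE,
)

def is_mixed(chunk_text: str) -> bool:
    """True when a chunk combines English DSL commands with Spanish content."""
    has_dsl = any(cmd in chunk_text for cmd in DSL_COMMANDS)
    return has_dsl and bool(SPANISH_MARKERS.search(chunk_text))
```

Applied over all 190 chunks, a counter of `is_mixed` hits would reproduce the 36-chunk (18.9%) figure, modulo the accuracy of the marker list.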

Benchmark confirmation (BEIR evaluation, three datasets):

**CodeXGLUE** (code retrieval from GitHub repositories):

| k  | Qwen2.5-1.5B NDCG | Qwen2.5-1.5B Recall | Qwen3-Emb-0.6B NDCG | Qwen3-Emb-0.6B Recall |
| -- | ----------------- | ------------------- | ------------------- | --------------------- |
| 1  | 0.00031           | 0.00031             | **0.9497**          | **0.9497**            |
| 5  | 0.00086           | 0.00151             | **0.9716**          | **0.9876**            |
| 10 | 0.00118           | 0.00250             | **0.9734**          | **0.9929**            |

**CoSQA** (natural language queries over code — closest proxy to AVAP retrieval):

| k   | Qwen2.5-1.5B NDCG | Qwen2.5-1.5B Recall | Qwen3-Emb-0.6B NDCG | Qwen3-Emb-0.6B Recall |
| --- | ----------------- | ------------------- | ------------------- | --------------------- |
| 1   | 0.00000           | 0.00000             | **0.1740**          | **0.1740**            |
| 10  | 0.00000           | 0.00000             | **0.3909**          | **0.6700**            |
| 100 | 0.00210           | 0.01000             | **0.4510**          | **0.9520**            |

**SciFact** (scientific prose — out-of-domain control):

| k   | Qwen2.5-1.5B NDCG | Qwen2.5-1.5B Recall | Qwen3-Emb-0.6B NDCG | Qwen3-Emb-0.6B Recall |
| --- | ----------------- | ------------------- | ------------------- | --------------------- |
| 1   | 0.02333           | 0.02083             | **0.5633**          | **0.5299**            |
| 10  | 0.04619           | 0.07417             | **0.6855**          | **0.8161**            |
| 100 | 0.07768           | 0.23144             | **0.7129**          | **0.9400**            |

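For reference, the NDCG@k and Recall@k columns in these tables follow the standard binary-relevance definitions used in BEIR-style evaluation (the BEIR toolkit itself delegates to `pytrec_eval`). A minimal sketch:

```python
import math

def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance NDCG@k over a ranked list of document ids."""
    dcg = sum(
        1.0 / math.log2(rank + 2)          # rank 0 -> log2(2), rank 1 -> log2(3), ...
        for rank, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant_ids
    )
    # Ideal DCG: all relevant documents packed at the top of the ranking.
    ideal = sum(1.0 / math.log2(rank + 2) for rank in range(min(k, len(relevant_ids))))
    return dcg / ideal if ideal else 0.0

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents found in the top k."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids) if relevant_ids else 0.0
```

With a single relevant document per query (as in CoSQA), NDCG@1 and Recall@1 coincide, which is why the k=1 columns above are identical.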
Qwen2.5-1.5B is eliminated. **Qwen3-Embedding-0.6B is the validated baseline.**
### Qwen3-Embedding-0.6B

**Strengths:**

- Already benchmarked on CodeXGLUE, CoSQA and SciFact — strong results documented
- 32,768 token context window — exceeds corpus requirements by a large margin
- Same model family as the generation model (Qwen) — shared tokenizer vocabulary
- Lowest integration risk — already validated in the pipeline

**Limitations:**

- Benchmarks are English-only — multilingual performance on the AVAP corpus is unvalidated
- Not a dedicated multilingual model — training distribution weighted towards English and Chinese
- No native sparse retrieval support

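The context-window headroom claims for both candidates reduce to simple arithmetic over figures already stated in this document (833-token maximum chunk, 32,768- and 8,192-token windows):

```python
MAX_CHUNK_TOKENS = 833        # largest chunk in the audited corpus
CONTEXT_WINDOWS = {
    "Qwen3-Embedding-0.6B": 32_768,
    "BGE-M3": 8_192,
}

for model, window in CONTEXT_WINDOWS.items():
    utilization = 100 * MAX_CHUNK_TOKENS / window
    print(f"{model}: {utilization:.1f}% of context window used by largest chunk")
```

This yields 2.5% utilization for Qwen3-Embedding-0.6B and 10.2% for BGE-M3, matching the figures quoted below.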
### BGE-M3

**Strengths:**

- Explicit multilingual contrastive training across 100+ languages, including programming languages — a direct architectural fit for the intra-chunk Spanish/English/DSL mixing observed in the corpus
- Supports dense, sparse and multi-vector (ColBERT) retrieval from a single model inference — a future path to consolidating the current BM25+kNN dual-system architecture (ADR-0003)
- Higher MTEB retrieval score than Qwen3-Embedding-0.6B in the programming domain

**Limitations:**

- Not yet benchmarked on CodeXGLUE, CoSQA or SciFact at the time of candidate selection — no prior empirical results for this corpus
- 8,192 token context window — sufficient for the current corpus (max chunk: 833 tokens, 10.2% utilization) but less headroom for future corpus growth
- Requires tokenizer alignment: `HF_EMB_MODEL_NAME` must be updated to `BAAI/bge-m3` alongside `OLLAMA_EMB_MODEL_NAME` to keep chunk token counting consistent

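The tokenizer-alignment requirement can be sketched as follows. `make_token_counter` is a hypothetical helper, not the pipeline's actual wiring; in production the tokenizer would typically be loaded with `transformers`' `AutoTokenizer.from_pretrained(os.environ["HF_EMB_MODEL_NAME"])`, and any object exposing an `encode(text) -> list[int]` method fits the same shape.

```python
# Keep both variables pointing at the same model family so the chunker's
# token counts match what the embedding server actually sees.
# (Values illustrate the BGE-M3 candidate configuration.)
EMBEDDING_ENV = {
    "HF_EMB_MODEL_NAME": "BAAI/bge-m3",
    "OLLAMA_EMB_MODEL_NAME": "bge-m3",
}

def make_token_counter(tokenizer):
    """Bind chunk-size counting to the embedding model's own tokenizer.

    `tokenizer` is any object with an `encode(text) -> list[int]` method,
    e.g. a Hugging Face tokenizer loaded from HF_EMB_MODEL_NAME.
    """
    def count_tokens(chunk_text: str) -> int:
        return len(tokenizer.encode(chunk_text))
    return count_tokens
```

Counting with a mismatched tokenizer would silently skew the chunk-size statistics and the context-window utilization figures above.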

**CodeXGLUE** (code retrieval from GitHub repositories):

| Metric | k   | BGE-M3     | Qwen3-Emb-0.6B | Delta (BGE-M3 − Qwen3) |
| ------ | --- | ---------- | -------------- | ---------------------- |
| NDCG   | 1   | **0.9520** | 0.9497         | +0.23 pp               |
| NDCG   | 5   | **0.9738** | 0.9717         | +0.21 pp               |
| NDCG   | 10  | **0.9749** | 0.9734         | +0.15 pp               |
| NDCG   | 100 | **0.9763** | 0.9745         | +0.18 pp               |
| Recall | 1   | **0.9520** | 0.9497         | +0.23 pp               |
| Recall | 5   | **0.9892** | 0.9876         | +0.16 pp               |
| Recall | 10  | 0.9928     | **0.9930**     | −0.02 pp               |
| Recall | 100 | **0.9989** | 0.9981         | +0.08 pp               |

Both models perform near-identically on CodeXGLUE. All deltas are below 0.25 absolute percentage points. This dataset does not differentiate the candidates.

**CoSQA** (natural language queries over code — most representative proxy for AVAP retrieval):

| Metric | k   | BGE-M3 | Qwen3-Emb-0.6B | Delta (BGE-M3 − Qwen3) |
| ------ | --- | ------ | -------------- | ---------------------- |
| NDCG   | 1   | 0.1160 | **0.1740**     | −5.80 pp               |
| NDCG   | 5   | 0.2383 | **0.3351**     | −9.68 pp               |
| NDCG   | 10  | 0.2878 | **0.3909**     | −10.31 pp              |
| NDCG   | 100 | 0.3631 | **0.4510**     | −8.79 pp               |
| Recall | 1   | 0.1160 | **0.1740**     | −5.80 pp               |
| Recall | 5   | 0.3660 | **0.5020**     | −13.60 pp              |
| Recall | 10  | 0.5160 | **0.6700**     | −15.40 pp              |
| Recall | 100 | 0.8740 | **0.9520**     | −7.80 pp               |

Qwen3-Embedding-0.6B outperforms BGE-M3 on CoSQA by a wide margin at every k. The NDCG@10 gap is 10.31 absolute percentage points. CoSQA is the most representative proxy for the AVAP retrieval use case — it pairs natural language queries with code snippets — making this the most significant BEIR result.

**SciFact** (scientific prose — out-of-domain control):

| Metric | k   | BGE-M3 | Qwen3-Emb-0.6B | Delta (BGE-M3 − Qwen3) |
| ------ | --- | ------ | -------------- | ---------------------- |
| NDCG   | 1   | 0.5100 | **0.5533**     | −4.33 pp               |
| NDCG   | 5   | 0.6190 | **0.6593**     | −4.03 pp               |
| NDCG   | 10  | 0.6431 | **0.6785**     | −3.54 pp               |
| NDCG   | 100 | 0.6705 | **0.7056**     | −3.51 pp               |
| Recall | 1   | 0.4818 | **0.5243**     | −4.25 pp               |
| Recall | 5   | 0.7149 | **0.7587**     | −4.38 pp               |
| Recall | 10  | 0.7834 | **0.8144**     | −3.10 pp               |
| Recall | 100 | 0.9037 | **0.9367**     | −3.30 pp               |

Qwen3-Embedding-0.6B leads BGE-M3 on SciFact by 3–4 absolute percentage points across all metrics. The gap is consistent but narrower than on CoSQA.

### BEIR summary — NDCG@10 comparison

| Dataset   | BGE-M3     | Qwen3-Emb-0.6B | Delta        | Leader            |
| --------- | ---------- | -------------- | ------------ | ----------------- |
| CodeXGLUE | 0.9749     | 0.9734         | +0.15 pp     | BGE-M3 (marginal) |
| CoSQA     | 0.2878     | **0.3909**     | −10.31 pp    | **Qwen3**         |
| SciFact   | 0.6431     | **0.6785**     | −3.54 pp     | **Qwen3**         |
| **Mean**  | **0.6353** | **0.6809**     | **−4.56 pp** | **Qwen3**         |

Qwen3-Embedding-0.6B leads on mean NDCG@10 by 4.56 absolute percentage points, driven primarily by a 10.31 pp advantage on CoSQA.
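The mean row is an unweighted average of the three per-dataset NDCG@10 scores, with the delta expressed in absolute percentage points:

```python
ndcg_at_10 = {
    "BGE-M3":         {"CodeXGLUE": 0.9749, "CoSQA": 0.2878, "SciFact": 0.6431},
    "Qwen3-Emb-0.6B": {"CodeXGLUE": 0.9734, "CoSQA": 0.3909, "SciFact": 0.6785},
}

# Unweighted mean per model, then the gap in percentage points.
means = {
    model: round(sum(scores.values()) / len(scores), 4)
    for model, scores in ndcg_at_10.items()
}
delta_pp = round(100 * (means["BGE-M3"] - means["Qwen3-Emb-0.6B"]), 2)
```

An unweighted mean treats the three datasets as equally informative; weighting CoSQA more heavily, as the closest proxy, would widen the gap further.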
The EvaluateRAG evaluation — the primary decision signal — remains pending. It is the only evaluation that directly measures retrieval quality on the actual AVAP corpus with its intra-chunk multilingual mixing. BGE-M3's architectural fit for multilingual content could still produce a measurable advantage on the production corpus that the English-only BEIR benchmarks cannot capture. No final model selection will be made until EvaluateRAG results are available for both candidates.

Qwen3-Embedding is documented as multilingual, with good scores on multilingual benchmarks; the definitive answer, however, will come from the evaluation scores on the AVAP corpus.

---
## Consequences