# ADR-0003: Hybrid Retrieval (BM25 + kNN) with RRF Fusion

**Date:** 2026-03-05

**Status:** Accepted

**Deciders:** Rafael Ruiz (CTO), MrHouston Engineering

---

## Context
The RAG pipeline needs a retrieval strategy for finding relevant AVAP documentation chunks from Elasticsearch. The knowledge base contains a mix of:

- **Prose documentation** (explanations of AVAP concepts, commands, parameters) — benefits from semantic (dense) retrieval
- **Code examples and BNF grammar** (exact syntax patterns, function signatures) — benefits from lexical (sparse) retrieval, where exact token matches are critical

A single retrieval strategy will underperform for one of these document types.
---

## Decision

Implement **hybrid retrieval** combining:

- **BM25** (Elasticsearch `multi_match` on `content^2` and `text^2` fields) for lexical relevance
- **kNN** (Elasticsearch `knn` on the `embedding` field) for semantic relevance
- **RRF (Reciprocal Rank Fusion)** with constant `k=60` to fuse rankings from both systems

The fused top-8 documents are passed to the generation node as context.
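The fusion step is small enough to sketch directly. The following is an illustrative implementation, not the project's actual code; the document IDs and variable names are made up. Each input ranking is a list of document IDs ordered best-first, and each document scores `1 / (k + rank)` per system, summed across systems:

```python
# Illustrative RRF fusion sketch (not the project's actual code).
# k = 60 is the fusion constant; the top 8 fused documents are kept.
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60, top_n: int = 8) -> list[str]:
    """Fuse best-first ranked lists: score(d) = sum over systems of 1/(k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; truncate to the top_n passed to generation.
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


bm25_ranking = ["doc-a", "doc-b", "doc-c"]  # lexical results, best-first
knn_ranking = ["doc-c", "doc-a", "doc-d"]   # semantic results, best-first
fused = rrf_fuse([bm25_ranking, knn_ranking])
```

A document ranked moderately well by *both* systems (here `doc-a`, `doc-c`) outscores one ranked highly by only one, which is the behavior hybrid retrieval is after.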
Query reformulation (`reformulate` node) runs before retrieval and rewrites the user query into keyword-optimized form to improve BM25 recall for AVAP-specific terminology.

---

## Rationale

### Why hybrid over pure semantic?
AVAP is a domain-specific language with precise, non-negotiable syntax. For queries like "how does `addVar` work", exact lexical matching on the function name `addVar` is more reliable than semantic similarity, which may confuse similar-sounding functions or return contextually related but syntactically different commands.

### Why hybrid over pure BM25?

Conversational queries ("explain how loops work in AVAP", "what's the difference between addVar and setVar") benefit from semantic search that captures meaning beyond exact keyword overlap.

### Why RRF over score normalization?

BM25 and kNN scores are on different scales and distributions. Normalizing them requires careful calibration per index. RRF operates on ranks — not scores — making it robust to distribution differences and requiring no per-deployment tuning. The `k=60` constant is the standard literature value.
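The rank-only property can be checked with a toy example (illustrative code, not project code; document IDs and scores are made up): any monotone rescaling of one system's raw scores — as would happen with different index statistics or a different embedding model — leaves the fused order untouched, because only the ranks enter the formula.

```python
# Toy check that RRF is invariant to score scale: rescaling BM25 scores by
# any positive factor changes nothing, since fusion sees only the ranks.
def rank_order(scored: dict[str, float]) -> list[str]:
    """Doc IDs best-first by raw score."""
    return sorted(scored, key=scored.get, reverse=True)


def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


bm25 = {"d1": 14.2, "d2": 9.7, "d3": 6.1}   # unbounded BM25 scores
knn = {"d3": 0.91, "d1": 0.88, "d4": 0.73}  # bounded cosine similarities

fused = rrf([rank_order(bm25), rank_order(knn)])
rescaled = rrf([rank_order({d: s * 100 for d, s in bm25.items()}),
                rank_order(knn)])
assert fused == rescaled  # identical order: only ranks matter
```

A score-normalization scheme (e.g. min–max) would shift under the same rescaling unless recalibrated, which is exactly the per-deployment tuning this decision avoids.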
### Retrieval parameters

| Parameter | Value | Rationale |
|---|---|---|
| `k` (top documents) | 8 | Balances context richness vs. context window length |
| `num_candidates` (kNN) | `k × 5 = 40` | Standard ES kNN oversampling ratio |
| BM25 fields | `content^2, text^2` | Boost content/text fields; `^2` emphasizes them over metadata |
| Fuzziness (BM25) | `AUTO` | Handles minor typos in AVAP function names |
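For reference, the two query bodies implied by this table might look like the following. This is a hypothetical sketch: the placeholder query text, vector, and embedding dimensionality are illustrative and not taken from this ADR; real values come from the `reformulate` node and the embedding model.

```python
# Hypothetical Elasticsearch request bodies matching the parameters above.
K = 8  # top documents per system

reformulated_query = "addVar variable declaration"  # placeholder query text
query_embedding = [0.0] * 384                       # placeholder vector

bm25_body = {
    "size": K,
    "query": {
        "multi_match": {
            "query": reformulated_query,
            "fields": ["content^2", "text^2"],  # boost content/text over metadata
            "fuzziness": "AUTO",                # tolerate minor typos
        }
    },
}

knn_body = {
    "size": K,
    "knn": {
        "field": "embedding",
        "query_vector": query_embedding,
        "k": K,
        "num_candidates": K * 5,  # = 40, standard oversampling ratio
    },
}
```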
---

## Consequences

- Retrieval requires two ES queries per request (BM25 + kNN). This is acceptable given the tunnel latency baseline already incurred.
- If either BM25 or kNN fails (e.g., the embedding model is unavailable), the system degrades gracefully: the failing component logs a warning and returns an empty list, and RRF fusion proceeds with the available rankings.
- Context length grows with `k`. At `k=8` with typical chunk sizes (~300 tokens each), context is ~2,400 tokens — within the `qwen2.5:1.5b` context window.
- Changing `k` directly affects both retrieval quality and generation latency. Any change must be evaluated with `EvaluateRAG` before merging.
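The graceful-degradation behavior in the second bullet can be sketched as follows (an illustrative wrapper; the function and logger names are assumptions, not the project's actual code):

```python
# Illustrative sketch: wrap each retriever so a failure yields an empty
# ranking and a warning instead of aborting the request; fusion then runs
# over whichever rankings remain.
import logging

logger = logging.getLogger("retrieval")


def safe_retrieve(name: str, retrieve_fn) -> list[str]:
    """Run one retriever; on any error, log a warning and return []."""
    try:
        return retrieve_fn()
    except Exception as exc:
        logger.warning("%s retrieval failed, continuing without it: %s", name, exc)
        return []


def knn_unavailable():
    raise RuntimeError("embedding model unavailable")  # simulated outage


rankings = [
    safe_retrieve("bm25", lambda: ["doc-a", "doc-b"]),
    safe_retrieve("knn", knn_unavailable),
]
available = [r for r in rankings if r]  # RRF proceeds with what remains
```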