# ADR-0008: Adaptive Query Routing — Taxonomy, Contract, and Classifier Strategy

**Date:** 2026-04-09

**Status:** Accepted

**Deciders:** Rafael Ruiz (CTO)

**Related ADRs:** ADR-0002 (Two-Phase Streaming), ADR-0003 (Hybrid Retrieval RRF)

---

## Context

The assistance engine previously used a single Ollama model (`qwen3:1.7b`) for all query types, with no differentiation in routing, retrieval, or model selection. Two problems emerged in production:

### Problem 1 — No query taxonomy

All queries were treated identically. Platform queries (account status, usage metrics, billing) were sent through the same RAG pipeline as AVAP language questions, wasting retrieval resources and producing irrelevant context.

### Problem 2 — Classifier anchoring bias

The LLM-based classifier received raw conversation messages as history. A 1.7B model exhibited **anchoring bias**: it computed `P(type | history)` instead of `P(type | message_content)`, misclassifying new queries as the same type as recent turns regardless of actual content.

---

## Decision

This ADR makes three decisions with different time horizons:

1. **Permanent** — query taxonomy and routing contract
2. **Permanent** — model assignment per type
3. **Tactical / bootstrap** — LLM classifier as interim implementation

### Decision 1 — Query taxonomy (permanent)

Four query types with fixed routing semantics:

| Type | Purpose | RAG | Model slot |
|---|---|---|---|
| `RETRIEVAL` | AVAP language documentation and concepts | Yes | `main` |
| `CODE_GENERATION` | Produce working AVAP code | Yes | `main` |
| `CONVERSATIONAL` | Rephrase or continue prior answer | No | `conversational` |
| `PLATFORM` | Account, metrics, usage, quota, billing | No | `conversational` |

These types and their RAG/model assignments are stable. Any future classifier implementation must preserve this taxonomy.

### Decision 2 — Model specialization (permanent)

Two model slots configured via environment variables:

```
OLLAMA_MODEL_NAME=qwen3:1.7b                  # main slot: RETRIEVAL + CODE_GENERATION
OLLAMA_MODEL_NAME_CONVERSATIONAL=qwen3:0.6b   # conversational slot: CONVERSATIONAL + PLATFORM
```

If `OLLAMA_MODEL_NAME_CONVERSATIONAL` is unset, both slots fall back to `OLLAMA_MODEL_NAME`.
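
The fallback rule can be sketched as follows. This is an illustrative helper, not the actual resolution code; the function name is hypothetical, only the environment variable names come from this ADR.

```python
import os

def resolve_model(slot: str) -> str:
    """Return the Ollama model name for a slot, applying the fallback rule.

    Hypothetical helper: the conversational slot falls back to the main
    model when OLLAMA_MODEL_NAME_CONVERSATIONAL is unset.
    """
    main = os.environ.get("OLLAMA_MODEL_NAME", "qwen3:1.7b")
    if slot == "conversational":
        # Unset conversational slot -> use the main model name.
        return os.environ.get("OLLAMA_MODEL_NAME_CONVERSATIONAL", main)
    return main
```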

### Decision 3 — LLM classifier as bootstrap `[TACTICAL DEBT]`

> **This is an acknowledged interim implementation, not the target architecture.**
> See [Future Path](#future-path-discriminative-classifier-pipeline) for the correct steady-state design.

A generative LLM is used for classification because no labeled training data exists yet. The design includes two mitigations for its known weaknesses:

**a) Compact intent history instead of raw messages**

`classify_history` replaces raw message history in the classifier context. Each entry stores only the `type` plus a 60-char topic snippet:

```
[RETRIEVAL] "What is addVar in AVAP?"
[CODE_GENERATION] "Write an API endpoint that retur"
[PLATFORM] "You have a project usage percentag"
```

This preserves reference resolution (`"this"`, `"esto"`, `"lo anterior"`) without the topical noise that causes anchoring. `classify_history` is persisted in `classify_history_store` per session.
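
Building one of these compact entries is a one-liner. The sketch below is illustrative; the function name is hypothetical, but the `[TYPE] "snippet"` shape and the 60-character cap match the format above.

```python
SNIPPET_LEN = 60  # per this ADR: type + 60-char topic snippet

def history_entry(query_type: str, message: str) -> str:
    """Collapse a turn into '[TYPE] "snippet"' so the classifier sees
    intent structure, not full topical content (illustrative helper)."""
    snippet = message[:SNIPPET_LEN]
    return f'[{query_type}] "{snippet}"'
```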

**b) Prompt constraints to counteract generative bias**

- `<history_rule>` — explicit instruction that the intent distribution of prior turns must not influence the prior probability of the current classification
- `<platform_priority_rule>` — hard semantic override: usage percentages, account metrics, quota or billing data → always `PLATFORM`

These prompt rules compensate for the architectural mismatch between a generative model and a discriminative task. They become unnecessary once the LLM classifier is replaced.
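
For illustration, the two constraints might appear in the classifier prompt roughly as follows. The rule names are from this ADR; the wording is a sketch, not the production template.

```
<history_rule>
Prior turns are provided ONLY to resolve references such as "this",
"esto", or "lo anterior". The distribution of previous types MUST NOT
raise or lower the probability of any type for the current message.
</history_rule>
<platform_priority_rule>
If the message contains usage percentages, account metrics, quota data,
or billing information, output PLATFORM. This overrides all other signals.
</platform_priority_rule>
```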

---

## Routing Contract

This section is normative and **implementation-independent**. Any reimplementation — including the discriminative classifier described in Future Path — must satisfy all rules below. Rules are ordered by priority.

### RC-01 — Fast-path override (priority: highest)

If the query contains a known platform-injected prefix, classify as `PLATFORM` without invoking any classifier.

```
∀ q : query
contains(q, known_platform_prefix) → route(q) = PLATFORM
```

Current registered prefixes (`_PLATFORM_PATTERNS` in `graph.py`):

- `"you are a direct and concise assistant"`

Adding a prefix requires updating `_PLATFORM_PATTERNS` and this list.

### RC-02 — Platform data signal (priority: high)

If the query contains usage percentages, account metrics, consumption figures, quota data, or billing information, the output **MUST** be `PLATFORM`, regardless of history or classifier confidence.

In the current bootstrap implementation this is enforced via `<platform_priority_rule>`. In the future discriminative classifier it should be a hard pre-filter in Layer 1.

### RC-03 — Intent history scoping (priority: medium)

The classifier **MUST** use `classify_history` only to resolve ambiguous deictic references. It **MUST NOT** use history to predict or bias the type of the current message.

```
classify(q, history) ≠ f(dominant_type(history))
classify(q, history) = f(intent(q), resolve_references(q, history))
```

**Rationale:** Small LLMs implicitly compute `P(type | history)` instead of `P(type | message_content)`. The distribution of previous intents must not influence the prior probability of the current classification. Each message is an independent classification event — a session with 10 `RETRIEVAL` turns does not make the next message more likely to be `RETRIEVAL`.

### RC-04 — RAG bypass (priority: medium)

| Type | RAG | Justification |
|---|---|---|
| `RETRIEVAL` | Yes | Requires documentation context |
| `CODE_GENERATION` | Yes | Requires syntax examples |
| `CONVERSATIONAL` | No | Prior answer already in context |
| `PLATFORM` | No | Data injected via `extra_context` |

A `PLATFORM` or `CONVERSATIONAL` query that triggers Elasticsearch retrieval is a contract violation.
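
A runtime guard for this rule could look like the following. This is a hedged sketch; the helper and the `NEEDS_RAG` table are hypothetical, derived from the table above rather than taken from the codebase.

```python
# Illustrative RC-04 guard; names are hypothetical.
NEEDS_RAG = {
    "RETRIEVAL": True,
    "CODE_GENERATION": True,
    "CONVERSATIONAL": False,
    "PLATFORM": False,
}

def assert_rag_contract(query_type: str, retrieval_invoked: bool) -> None:
    """Raise if a type that must bypass RAG reached Elasticsearch."""
    if retrieval_invoked and not NEEDS_RAG[query_type]:
        raise RuntimeError(f"RC-04 violation: {query_type} query triggered retrieval")
```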

### RC-05 — Model assignment (priority: medium)

```
route(q) ∈ {RETRIEVAL, CODE_GENERATION}  → model = OLLAMA_MODEL_NAME
route(q) ∈ {CONVERSATIONAL, PLATFORM}    → model = OLLAMA_MODEL_NAME_CONVERSATIONAL
                                                   ?? OLLAMA_MODEL_NAME   # fallback
```

### RC-06 — History growth bound (priority: low)

The `classify_history` input to the classifier **MUST** be capped at 6 entries per session.
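
A bounded deque is one natural way to enforce the cap; this is an illustrative choice, not necessarily how `classify_history_store` implements it.

```python
from collections import deque

# Sketch of the RC-06 bound: a deque with maxlen=6 evicts the oldest
# entry automatically once the cap is reached.
classify_history = deque(maxlen=6)

for i in range(10):
    classify_history.append(f'[RETRIEVAL] "turn {i}"')
# Only the 6 most recent entries remain (turns 4..9).
```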

### Contract violations to monitor

| Symptom | Violated rule |
|---|---|
| Platform query hits Elasticsearch | RC-04 |
| `qwen3:1.7b` used for a `PLATFORM` response | RC-05 |
| Platform prefix triggers LLM classifier | RC-01 |
| Classifier output mirrors dominant history type | RC-03 |

---

## Consequences

### Positive

- Query taxonomy is formalized and stable — downstream graph, model assignment, and RAG decisions are decoupled from the classifier implementation
- `classify_history_store` acts as a data flywheel for future classifier training
- Platform-injected prompts are classified in O(1) via RC-01
- `PLATFORM` queries never hit Elasticsearch

### Negative / Trade-offs

- The LLM classifier is a generative model doing discriminative work — this is the accepted tactical debt
- Prompt engineering (`<history_rule>`, `<platform_priority_rule>`) is a symptom of this mismatch, not a solution
- `qwen3:1.7b` can still misclassify edge cases without platform signals — inherent to the bootstrap design

---

## Future Path: Discriminative Classifier Pipeline

### The fundamental problem with the bootstrap design

The LLM classifier is a generative model doing discriminative work. Generating tokens to produce a 4-class label wastes orders of magnitude more compute than the task requires, introduces non-determinism, and forces prompt engineering to compensate for what should be model properties. RC-01 through RC-06 exist precisely because of this mismatch.

The bootstrap design is justified while no labeled data exists. It should not be the steady-state architecture.

### Target architecture

A layered pipeline in which each layer is invoked only if the previous layer cannot produce a confident answer:

```
Query
  │
  ▼
[Layer 1] Hard rules (RC-01, RC-02)        ← O(1), deterministic
  │ no match
  ▼
[Layer 2] Embedding similarity classifier  ← ~1 ms, CPU, no LLM
  │ confidence < threshold
  ▼
[Layer 3] LLM classifier (current design)  ← fallback for ambiguous queries only
  │
  ▼
Classification result
```

In steady state, Layer 3 handles fewer than 5% of requests.

### Layer 2: embedding classifier on `bge-m3`

`bge-m3` is already running in the stack. The implementation:

1. Embed each query via `bge-m3` → fixed-size vector
2. Train logistic regression (or an SVM with an RBF kernel) on labeled `(query, type)` pairs
3. At inference: embed → class scores → argmax with a confidence score
4. If `max(softmax(logits)) < 0.85` → fall through to Layer 3

The classification head itself is microseconds of CPU inference: no GPU, no Ollama call, no prompt templating. RC-02 becomes a hard pre-filter in Layer 1, making it implementation-independent rather than prompt-dependent.
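
The four steps above can be sketched as follows. This assumes scikit-learn; `embed()` is a stand-in for a real `bge-m3` call (it just returns a deterministic pseudo-vector), and the tiny training set stands in for pairs exported from `classify_history_store`. Names and the 0.85 threshold follow this ADR; everything else is illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

TYPES = ["RETRIEVAL", "CODE_GENERATION", "CONVERSATIONAL", "PLATFORM"]
THRESHOLD = 0.85  # confidence gate: below this, fall through to Layer 3

def embed(query: str) -> np.ndarray:
    # Placeholder for a bge-m3 call: deterministic pseudo-embedding
    # keyed on the query text, fixed-size as step 1 requires.
    seed = abs(hash(query)) % (2**32)
    return np.random.default_rng(seed).standard_normal(16)

# Step 2: train on labeled (query, type) pairs.
queries = ["what is addVar", "write an endpoint", "rephrase that", "my quota usage"]
labels = ["RETRIEVAL", "CODE_GENERATION", "CONVERSATIONAL", "PLATFORM"]
clf = LogisticRegression(max_iter=1000).fit(
    np.stack([embed(q) for q in queries]), labels
)

def layer2(query: str) -> tuple[str, bool]:
    """Steps 3-4: embed, score, argmax; confident=False means Layer 3."""
    probs = clf.predict_proba(embed(query)[None, :])[0]
    i = int(np.argmax(probs))
    return str(clf.classes_[i]), bool(probs[i] >= THRESHOLD)
```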

### The data flywheel

`classify_history_store` already generates labeled training data. Every session produces `(topic_snippet, type)` pairs implicitly validated by user continuation.

```
classify_history_store → periodic export → labeled dataset → retrain Layer 2
```

The LLM classifier is the **teacher**. The embedding classifier is the **student**. This is knowledge distillation over production traffic, without manual labeling.

**Trigger:** retrain when `classify_history_store` accumulates 500 sessions.

### Caller-declared type

The platform generates `PLATFORM` prompts and knows the type at generation time. Adding `query_type` to `AgentRequest` (proto field 7) lets the caller declare the type explicitly, bypassing all three layers. This makes RC-01 and RC-02 redundant for platform-generated traffic.

### Convergence path

| Phase | What changes | Layer 3 traffic |
|---|---|---|
| Now — bootstrap | LLM classifier for all unmatched queries | ~95% |
| Phase 1 | Collect labels via `classify_history_store` | ~95% |
| Phase 2 | Deploy embedding classifier (Layer 2) | ~10–20% |
| Phase 3 | Caller-declared type for platform prompts | <5% |
| Phase 4 | LLM classifier as anomaly handler only | <2% |

Phase 2 is the highest-leverage step: it replaces the dominant code path (LLM inference per request) with CPU-only inference, with no change to the routing contract or the downstream graph.