# ADR-0008: Adaptive Query Routing — Taxonomy, Contract, and Classifier Strategy

**Date:** 2026-04-09

**Status:** Accepted

**Deciders:** Rafael Ruiz (CTO)

**Related ADRs:** ADR-0002 (Two-Phase Streaming), ADR-0003 (Hybrid Retrieval RRF)

---

## Context

The assistance engine previously used a single Ollama model (`qwen3:1.7b`) for all query types, with no differentiation in routing, retrieval, or model selection. Two problems emerged in production:

### Problem 1 — No query taxonomy

All queries were treated identically. Platform queries (account status, usage metrics, billing) were sent through the same RAG pipeline as AVAP language questions, wasting retrieval resources and producing irrelevant context.

### Problem 2 — Classifier anchoring bias

The LLM-based classifier received raw conversation messages as history. A 1.7B model exhibited **anchoring bias**: it computed `P(type | history)` instead of `P(type | message_content)`, misclassifying new queries as the same type as recent turns regardless of actual content.

---

## Decision

This ADR makes three decisions with different time horizons:

1. **Permanent** — query taxonomy and routing contract
2. **Permanent** — model assignment per type
3. **Tactical / bootstrap** — LLM classifier as interim implementation

### Decision 1 — Query taxonomy (permanent)

Four query types with fixed routing semantics:

| Type | Purpose | RAG | Model slot |
|---|---|---|---|
| `RETRIEVAL` | AVAP language documentation and concepts | Yes | `main` |
| `CODE_GENERATION` | Produce working AVAP code | Yes | `main` |
| `CONVERSATIONAL` | Rephrase or continue prior answer | No | `conversational` |
| `PLATFORM` | Account, metrics, usage, quota, billing | No | `conversational` |

These types and their RAG/model assignments are stable. Any future classifier implementation must preserve this taxonomy.

### Decision 2 — Model specialization (permanent)

Two model slots configured via environment variables:

```
OLLAMA_MODEL_NAME=qwen3:1.7b                  # main slot: RETRIEVAL + CODE_GENERATION
OLLAMA_MODEL_NAME_CONVERSATIONAL=qwen3:0.6b   # conversational slot: CONVERSATIONAL + PLATFORM
```

If `OLLAMA_MODEL_NAME_CONVERSATIONAL` is unset, both slots fall back to `OLLAMA_MODEL_NAME`.
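
The fallback rule can be sketched as follows. This is an illustrative helper, not the actual resolution code; the function name is hypothetical, only the environment variable names come from this ADR.

```python
import os

def resolve_model(slot: str) -> str:
    """Return the Ollama model name for a slot, applying the fallback rule.

    Hypothetical helper: the conversational slot falls back to the main
    model when OLLAMA_MODEL_NAME_CONVERSATIONAL is unset.
    """
    main = os.environ.get("OLLAMA_MODEL_NAME", "qwen3:1.7b")
    if slot == "conversational":
        # Unset conversational slot -> use the main model name.
        return os.environ.get("OLLAMA_MODEL_NAME_CONVERSATIONAL", main)
    return main
```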

### Decision 3 — LLM classifier as bootstrap `[TACTICAL DEBT]`

> **This is an acknowledged interim implementation, not the target architecture.**
> See [Future Path](#future-path-discriminative-classifier-pipeline) for the correct steady-state design.

A generative LLM is used for classification because no labeled training data exists yet. The design includes two mitigations for its known weaknesses:

**a) Compact intent history instead of raw messages**

`classify_history` replaces raw message history in the classifier context. Each entry stores only the `type` plus a 60-char topic snippet:

```
[RETRIEVAL] "What is addVar in AVAP?"
[CODE_GENERATION] "Write an API endpoint that retur"
[PLATFORM] "You have a project usage percentag"
```

This preserves reference resolution (`"this"`, `"esto"`, `"lo anterior"`) without the topical noise that causes anchoring. `classify_history` is persisted in `classify_history_store` per session.
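
Building one of these compact entries is a one-liner. The sketch below is illustrative; the function name is hypothetical, but the `[TYPE] "snippet"` shape and the 60-character cap match the format above.

```python
SNIPPET_LEN = 60  # per this ADR: type + 60-char topic snippet

def history_entry(query_type: str, message: str) -> str:
    """Collapse a turn into '[TYPE] "snippet"' so the classifier sees
    intent structure, not full topical content (illustrative helper)."""
    snippet = message[:SNIPPET_LEN]
    return f'[{query_type}] "{snippet}"'
```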

**b) Prompt constraints to counteract generative bias**

- `<history_rule>` — explicit instruction that the intent distribution of prior turns must not influence the prior probability of the current classification
- `<platform_priority_rule>` — hard semantic override: usage percentages, account metrics, quota or billing data → always `PLATFORM`

These prompt rules compensate for the architectural mismatch between a generative model and a discriminative task. They become unnecessary once the LLM classifier is replaced.
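
For illustration, the two constraints might appear in the classifier prompt roughly as follows. The rule names are from this ADR; the wording is a sketch, not the production template.

```
<history_rule>
Prior turns are provided ONLY to resolve references such as "this",
"esto", or "lo anterior". The distribution of previous types MUST NOT
raise or lower the probability of any type for the current message.
</history_rule>
<platform_priority_rule>
If the message contains usage percentages, account metrics, quota data,
or billing information, output PLATFORM. This overrides all other signals.
</platform_priority_rule>
```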

---

## Routing Contract

This section is normative and **implementation-independent**. Any reimplementation — including the discriminative classifier described in Future Path — must satisfy all rules below. Rules are ordered by priority.

### RC-01 — Fast-path override (priority: highest)

If the query contains a known platform-injected prefix, classify as `PLATFORM` without invoking any classifier.

```
∀ q : query
contains(q, known_platform_prefix) → route(q) = PLATFORM
```

Current registered prefixes (`_PLATFORM_PATTERNS` in `graph.py`):

- `"you are a direct and concise assistant"`

Adding a prefix requires updating `_PLATFORM_PATTERNS` and this list.

### RC-02 — Platform data signal (priority: high)

If the query contains usage percentages, account metrics, consumption figures, quota data, or billing information, the output **MUST** be `PLATFORM`, regardless of history or classifier confidence.

In the current bootstrap implementation this is enforced via `<platform_priority_rule>`. In the future discriminative classifier it should be a hard pre-filter in Layer 1.

### RC-03 — Intent history scoping (priority: medium)

The classifier **MUST** use `classify_history` only to resolve ambiguous deictic references. It **MUST NOT** use history to predict or bias the type of the current message.

```
classify(q, history) ≠ f(dominant_type(history))
classify(q, history) = f(intent(q), resolve_references(q, history))
```

**Rationale:** Small LLMs implicitly compute `P(type | history)` instead of `P(type | message_content)`. The distribution of previous intents must not influence the prior probability of the current classification. Each message is an independent classification event — a session with 10 `RETRIEVAL` turns does not make the next message more likely to be `RETRIEVAL`.

### RC-04 — RAG bypass (priority: medium)

| Type | RAG | Justification |
|---|---|---|
| `RETRIEVAL` | Yes | Requires documentation context |
| `CODE_GENERATION` | Yes | Requires syntax examples |
| `CONVERSATIONAL` | No | Prior answer already in context |
| `PLATFORM` | No | Data injected via `extra_context` |

A `PLATFORM` or `CONVERSATIONAL` query that triggers Elasticsearch retrieval is a contract violation.
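
A runtime guard for this rule could look like the following. This is a hedged sketch; the helper and the `NEEDS_RAG` table are hypothetical, derived from the table above rather than taken from the codebase.

```python
# Illustrative RC-04 guard; names are hypothetical.
NEEDS_RAG = {
    "RETRIEVAL": True,
    "CODE_GENERATION": True,
    "CONVERSATIONAL": False,
    "PLATFORM": False,
}

def assert_rag_contract(query_type: str, retrieval_invoked: bool) -> None:
    """Raise if a type that must bypass RAG reached Elasticsearch."""
    if retrieval_invoked and not NEEDS_RAG[query_type]:
        raise RuntimeError(f"RC-04 violation: {query_type} query triggered retrieval")
```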

### RC-05 — Model assignment (priority: medium)

```
route(q) ∈ {RETRIEVAL, CODE_GENERATION}  → model = OLLAMA_MODEL_NAME
route(q) ∈ {CONVERSATIONAL, PLATFORM}    → model = OLLAMA_MODEL_NAME_CONVERSATIONAL
                                                   ?? OLLAMA_MODEL_NAME   # fallback
```

### RC-06 — History growth bound (priority: low)

The `classify_history` input to the classifier **MUST** be capped at 6 entries per session.
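
A bounded deque is one natural way to enforce the cap; this is an illustrative choice, not necessarily how `classify_history_store` implements it.

```python
from collections import deque

# Sketch of the RC-06 bound: a deque with maxlen=6 evicts the oldest
# entry automatically once the cap is reached.
classify_history = deque(maxlen=6)

for i in range(10):
    classify_history.append(f'[RETRIEVAL] "turn {i}"')
# Only the 6 most recent entries remain (turns 4..9).
```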

### Contract violations to monitor

| Symptom | Violated rule |
|---|---|
| Platform query hits Elasticsearch | RC-04 |
| `qwen3:1.7b` used for a `PLATFORM` response | RC-05 |
| Platform prefix triggers LLM classifier | RC-01 |
| Classifier output mirrors dominant history type | RC-03 |

---

## Consequences

### Positive

- Query taxonomy is formalized and stable — downstream graph, model assignment, and RAG decisions are decoupled from the classifier implementation
- `classify_history_store` acts as a data flywheel for future classifier training
- Platform-injected prompts are classified in O(1) via RC-01
- `PLATFORM` queries never hit Elasticsearch

### Negative / Trade-offs

- The LLM classifier is a generative model doing discriminative work — this is the accepted tactical debt
- Prompt engineering (`<history_rule>`, `<platform_priority_rule>`) is a symptom of this mismatch, not a solution
- `qwen3:1.7b` can still misclassify edge cases without platform signals — inherent to the bootstrap design

---

## Future Path: Discriminative Classifier Pipeline

### The fundamental problem with the bootstrap design

The LLM classifier is a generative model doing discriminative work. Generating tokens to produce a 4-class label wastes orders of magnitude more compute than the task requires, introduces non-determinism, and forces prompt engineering to compensate for what should be model properties. RC-01 through RC-06 exist precisely because of this mismatch.

The bootstrap design is justified while no labeled data exists. It should not be the steady-state architecture.

### Target architecture

A layered pipeline in which each layer is invoked only if the previous layer cannot produce a confident answer:

```
Query
  │
  ▼
[Layer 1] Hard rules (RC-01, RC-02)        ← O(1), deterministic
  │ no match
  ▼
[Layer 2] Embedding similarity classifier  ← ~1 ms, CPU, no LLM
  │ confidence < threshold
  ▼
[Layer 3] LLM classifier (current design)  ← fallback for ambiguous queries only
  │
  ▼
Classification result
```

In steady state, Layer 3 handles fewer than 5% of requests.

### Layer 2: embedding classifier on `bge-m3`

`bge-m3` is already running in the stack. The implementation:

1. Embed each query via `bge-m3` → fixed-size vector
2. Train logistic regression (or an SVM with an RBF kernel) on labeled `(query, type)` pairs
3. At inference: embed → class scores → argmax with a confidence score
4. If `max(softmax(logits)) < 0.85` → fall through to Layer 3

The classification head itself is microseconds of CPU inference: no GPU, no Ollama call, no prompt templating. RC-02 becomes a hard pre-filter in Layer 1, making it implementation-independent rather than prompt-dependent.
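
The four steps above can be sketched as follows. This assumes scikit-learn; `embed()` is a stand-in for a real `bge-m3` call (it just returns a deterministic pseudo-vector), and the tiny training set stands in for pairs exported from `classify_history_store`. Names and the 0.85 threshold follow this ADR; everything else is illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

TYPES = ["RETRIEVAL", "CODE_GENERATION", "CONVERSATIONAL", "PLATFORM"]
THRESHOLD = 0.85  # confidence gate: below this, fall through to Layer 3

def embed(query: str) -> np.ndarray:
    # Placeholder for a bge-m3 call: deterministic pseudo-embedding
    # keyed on the query text, fixed-size as step 1 requires.
    seed = abs(hash(query)) % (2**32)
    return np.random.default_rng(seed).standard_normal(16)

# Step 2: train on labeled (query, type) pairs.
queries = ["what is addVar", "write an endpoint", "rephrase that", "my quota usage"]
labels = ["RETRIEVAL", "CODE_GENERATION", "CONVERSATIONAL", "PLATFORM"]
clf = LogisticRegression(max_iter=1000).fit(
    np.stack([embed(q) for q in queries]), labels
)

def layer2(query: str) -> tuple[str, bool]:
    """Steps 3-4: embed, score, argmax; confident=False means Layer 3."""
    probs = clf.predict_proba(embed(query)[None, :])[0]
    i = int(np.argmax(probs))
    return str(clf.classes_[i]), bool(probs[i] >= THRESHOLD)
```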

### The data flywheel

`classify_history_store` already generates labeled training data. Every session produces `(topic_snippet, type)` pairs implicitly validated by user continuation.

```
classify_history_store → periodic export → labeled dataset → retrain Layer 2
```

The LLM classifier is the **teacher**. The embedding classifier is the **student**. This is knowledge distillation over production traffic, without manual labeling.

**Trigger:** retrain when `classify_history_store` accumulates 500 sessions.

### Caller-declared type

The platform generates `PLATFORM` prompts and knows the type at generation time. Adding `query_type` to `AgentRequest` (proto field 7) lets the caller declare the type explicitly, bypassing all three layers. This makes RC-01 and RC-02 redundant for platform-generated traffic.

### Convergence path

| Phase | What changes | Layer 3 traffic |
|---|---|---|
| Now — bootstrap | LLM classifier for all unmatched queries | ~95% |
| Phase 1 | Collect labels via `classify_history_store` | ~95% |
| Phase 2 | Deploy embedding classifier (Layer 2) | ~10–20% |
| Phase 3 | Caller-declared type for platform prompts | <5% |
| Phase 4 | LLM classifier as anomaly handler only | <2% |

Phase 2 is the highest-leverage step: it replaces the dominant code path (LLM inference per request) with CPU-only inference, with no change to the routing contract or the downstream graph.