# ADR-0008: Adaptive Query Routing with Intent History and Model Specialization

**Date:** 2026-04-09
**Status:** Accepted
**Deciders:** Rafael Ruiz (CTO)
**Related ADRs:** ADR-0002 (Two-Phase Streaming), ADR-0003 (Hybrid Retrieval RRF)

---

## Context

The assistance engine previously used a single Ollama model (`qwen3:1.7b`) for all query types and a single LLM-based classifier that received raw conversation history. Two problems emerged in production:

### Problem 1 — Model oversizing for lightweight queries

Platform queries (account status, usage metrics, subscription data) and conversational follow-ups do not require retrieval or a large model. Running `qwen3:1.7b` for a one-sentence platform insight wastes resources and adds latency.

### Problem 2 — Classifier bias from raw message history

When the classifier received raw conversation messages as history, a small model (1.7B parameters) exhibited **anchoring bias**: it would classify new messages as the same type as recent messages, regardless of the actual content of the new query. This caused platform queries (`"You have a project usage percentage of 20%, provide a recommendation"`) to be misclassified as `RETRIEVAL` or `CODE_GENERATION` during sessions that had previously handled AVAP language questions.

Root cause: passing full message content to a small classifier is too noisy. The model uses conversation topic as a proxy for intent type.

---

## Decision

### 1. New query type: `PLATFORM`

A fourth classification category is introduced alongside `RETRIEVAL`, `CODE_GENERATION`, and `CONVERSATIONAL`:

| Type | Purpose | RAG | Model |
|---|---|---|---|
| `RETRIEVAL` | AVAP language documentation | Yes | `OLLAMA_MODEL_NAME` |
| `CODE_GENERATION` | Produce working AVAP code | Yes | `OLLAMA_MODEL_NAME` |
| `CONVERSATIONAL` | Rephrase / continue prior answer | No | `OLLAMA_MODEL_NAME_CONVERSATIONAL` |
| `PLATFORM` | Account, metrics, usage, billing | No | `OLLAMA_MODEL_NAME_CONVERSATIONAL` |

`PLATFORM` queries skip RAG entirely and are served with a dedicated `PLATFORM_PROMPT` that instructs the model to use `extra_context` (where user account data is injected) as the primary source.
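
The decision table above can be sketched as a routing map. This is a minimal sketch: `RouteConfig` and `ROUTES` are illustrative names, not the actual structures in `graph.py`.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RouteConfig:
    use_rag: bool    # whether the query goes through Elasticsearch retrieval
    model_slot: str  # which environment variable supplies the model name

# Hypothetical routing map mirroring the decision table.
ROUTES = {
    "RETRIEVAL":       RouteConfig(use_rag=True,  model_slot="OLLAMA_MODEL_NAME"),
    "CODE_GENERATION": RouteConfig(use_rag=True,  model_slot="OLLAMA_MODEL_NAME"),
    "CONVERSATIONAL":  RouteConfig(use_rag=False, model_slot="OLLAMA_MODEL_NAME_CONVERSATIONAL"),
    "PLATFORM":        RouteConfig(use_rag=False, model_slot="OLLAMA_MODEL_NAME_CONVERSATIONAL"),
}
```

Keeping the table in one frozen structure makes RC-04 and RC-05 checkable in a single place.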

### 2. Model specialization via environment variables

Two model slots are configured independently:

```
OLLAMA_MODEL_NAME=qwen3:1.7b                  # RETRIEVAL + CODE_GENERATION
OLLAMA_MODEL_NAME_CONVERSATIONAL=qwen3:0.6b   # CONVERSATIONAL + PLATFORM
```

If `OLLAMA_MODEL_NAME_CONVERSATIONAL` is not set, both slots fall back to `OLLAMA_MODEL_NAME` (backward compatible).
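
The fallback resolution can be sketched in a few lines. The helper name `model_for` is illustrative; only the two environment variable names come from this ADR.

```python
import os

def model_for(query_type: str) -> str:
    """Resolve the model for a query type, falling back to the main slot (backward compatible)."""
    primary = os.getenv("OLLAMA_MODEL_NAME", "qwen3:1.7b")
    if query_type in ("CONVERSATIONAL", "PLATFORM"):
        # Unset or empty conversational slot falls back to the main model.
        return os.getenv("OLLAMA_MODEL_NAME_CONVERSATIONAL") or primary
    return primary
```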

### 3. Intent history instead of raw message history for classification

The classifier no longer receives raw conversation messages. Instead, a compact **intent history** (`classify_history`) is maintained per session:

```
[RETRIEVAL] "What is addVar in AVAP?"
[CODE_GENERATION] "Write an API endpoint that retur
[PLATFORM] "You have a project usage percentag
```

Each entry stores only the `type` and a 60-character topic snippet. This gives the classifier the conversational thread (useful for resolving ambiguous references like "this", "esto", "lo anterior") without the topical noise that causes anchoring bias.

`classify_history` is persisted in `classify_history_store` (parallel to `session_store`) and passed in `AgentState` across turns.
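
The maintenance of that history can be sketched as follows. The helper names are illustrative; the 60-character snippet and the 6-entry cap (RC-06) come from this ADR.

```python
SNIPPET_LEN = 60   # topic snippet length per the decision above
MAX_ENTRIES = 6    # RC-06: the classifier reads at most the last 6 entries

def append_intent(history: list[str], query_type: str, message: str) -> list[str]:
    """Record only the type and a truncated topic snippet, never the full message."""
    history.append(f'[{query_type}] "{message[:SNIPPET_LEN]}"')
    return history

def classifier_view(history: list[str]) -> list[str]:
    """The bounded slice actually passed to the classifier (RC-06)."""
    return history[-MAX_ENTRIES:]
```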

### 4. Classifier prompt redesign

The prompt now includes:

- **`<history_rule>`** — explicit instruction: use history only to resolve ambiguous references, not to predict the category of the new message
- **`<platform_priority_rule>`** — hard override: if the message contains usage percentages, account metrics, quota data, or billing information, classify as `PLATFORM` regardless of history
- **`<step1_purpose>`** — replaced by an inline role instruction stating that each message must be evaluated independently

### 5. Fast-path for known platform prefixes

Queries containing `"you are a direct and concise assistant"` (a system-injected prefix used by the platform) are classified as `PLATFORM` deterministically, without invoking the LLM classifier. This is justified because the prefix is controlled by the platform itself, not by user input, so deterministic detection is both correct and cheaper.

---

## Routing Contract

This section is normative. Any reimplementation of the classifier or the graph must satisfy all rules below. Rules are ordered by priority — a higher-priority rule always wins.

### RC-01 — Fast-path override (priority: highest)

If the query contains a known platform-injected prefix, the system **MUST** classify it as `PLATFORM` without invoking any LLM.

```
∀ q : query
  contains(q, known_platform_prefix) → route(q) = PLATFORM
```

Current registered prefixes (see `_PLATFORM_PATTERNS` in `graph.py`):

- `"you are a direct and concise assistant"`

Adding a new prefix requires a code change to `_PLATFORM_PATTERNS` and a corresponding update to this list.

### RC-02 — Platform data signal (priority: high)

If the query contains any of the following signals, the classifier **MUST** output `PLATFORM` regardless of conversation history:

- Usage percentages (e.g. `"20%"` in the context of project/account usage)
- Account metrics or consumption figures
- Quota, limit, or billing data

This rule is enforced via `<platform_priority_rule>` in the classifier prompt. It cannot be overridden by history.

### RC-03 — Intent history scoping (priority: medium)

The classifier **MUST** use `classify_history` only to resolve ambiguous pronoun or deictic references (`"this"`, `"esto"`, `"lo anterior"`, `"that function"`). It **MUST NOT** use history to predict or bias the type of the current message.

```
classify(q, history) ≠ f(dominant_type(history))
classify(q, history) = f(intent(q), resolve_references(q, history))
```

**Rationale:** Small LLMs implicitly compute `P(type | history)` instead of `P(type | message_content)`. The distribution of previous intents must not influence the prior probability of the current classification. Each message is an independent classification event — a session with 10 `RETRIEVAL` turns does not make the next message more likely to be `RETRIEVAL`. The `<history_rule>` in the classifier prompt enforces this explicitly.

### RC-04 — RAG bypass (priority: medium)

Query types that bypass Elasticsearch retrieval:

| Type | RAG | Justification |
|---|---|---|
| `RETRIEVAL` | Yes | Requires documentation context |
| `CODE_GENERATION` | Yes | Requires syntax examples |
| `CONVERSATIONAL` | No | Reformulates prior answer already in context |
| `PLATFORM` | No | Data is injected via `extra_context`, not retrieved |

A `PLATFORM` or `CONVERSATIONAL` query that triggers a retrieval step is a contract violation.

### RC-05 — Model assignment (priority: medium)

```
route(q) ∈ {RETRIEVAL, CODE_GENERATION} → model = OLLAMA_MODEL_NAME
route(q) ∈ {CONVERSATIONAL, PLATFORM}   → model = OLLAMA_MODEL_NAME_CONVERSATIONAL
                                                  ?? OLLAMA_MODEL_NAME  # fallback if unset
```

Changing which types map to which model slot requires updating this contract.

### RC-06 — History growth bound (priority: low)

The classifier input **MUST** be bounded: the classifier reads at most the last 6 entries of `classify_history`. The per-session store itself may grow in memory, but the slice passed to the classifier is always capped.

### Contract violations to monitor

| Symptom | Violated rule |
|---|---|
| Platform query hits Elasticsearch | RC-04 |
| `qwen3:1.7b` used for a `PLATFORM` response | RC-05 |
| Platform prefix query triggers LLM classifier | RC-01 |
| Classifier output mirrors dominant history type | RC-03 |

---

## Consequences

### Positive

- Platform and conversational queries are served by a smaller, faster model
- Classifier bias from conversation history is eliminated while preserving the ability to resolve ambiguous references
- `PLATFORM` queries never hit Elasticsearch, reducing unnecessary retrieval load
- The system is more predictable: platform-injected prompts are classified in O(1) without an LLM call

### Negative / Trade-offs

- `classify_history` adds a small amount of state per session (bounded to the last 6 entries for classifier input)
- Two model slots mean two warm-up calls at startup if the models differ
- The `qwen3:1.7b` classifier can still misclassify edge cases where no platform signals are present in the text — this is inherent to using a 1.7B model for semantic classification

### Open questions

- Whether `PLATFORM` should eventually split into sub-types (e.g. `PLATFORM_METRICS` vs `PLATFORM_BILLING`) as the platform data schema grows

---

## Future Path: Discriminative Classifier Pipeline

### The fundamental problem with the current design

The LLM classifier is a **generative model doing discriminative work**. Generating tokens to produce a 4-class label wastes orders of magnitude more compute than the task requires, introduces non-determinism, and forces prompt engineering as a substitute for proper model design. The rules in RC-01–RC-06 exist precisely to compensate for this architectural mismatch.

The current design is correct as a **bootstrap mechanism** — it lets the system operate before labeled training data exists. But it should not be the steady-state architecture.

### Target architecture

A layered pipeline where each layer is only invoked if the previous layer cannot produce a confident answer:

```
Query
  │
  ▼
[Layer 1] Hard rules (RC-01, RC-02)         ← O(1), deterministic
  │ no match
  ▼
[Layer 2] Embedding similarity classifier   ← ~1 ms, CPU, no LLM
  │ confidence < threshold
  ▼
[Layer 3] LLM classifier (current design)   ← fallback only
  │
  ▼
Classification result
```

In steady state, Layer 3 should handle fewer than 5% of requests — only genuinely ambiguous queries that neither rules nor the trained classifier can resolve with confidence.
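
The cascade can be sketched as a chain of optional classifiers. This is a minimal sketch: the layer stand-ins (the lambdas) are illustrative placeholders, not the real implementations.

```python
from typing import Callable, Optional

# Each layer returns a type string, or None to fall through to the next layer.
Layer = Callable[[str], Optional[str]]

def classify(query: str, layers: list[Layer], default: str = "RETRIEVAL") -> str:
    """Invoke layers in order; the first confident answer wins."""
    for layer in layers:
        result = layer(query)
        if result is not None:
            return result
    return default

# Illustrative wiring: hard rules, then embeddings, then the LLM fallback.
layers = [
    lambda q: "PLATFORM" if "you are a direct and concise assistant" in q.lower() else None,
    lambda q: None,                # stand-in for Layer 2 (returns None below threshold)
    lambda q: "CONVERSATIONAL",    # stand-in for Layer 3, which always answers
]
```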

### Layer 2: Embedding similarity classifier

`bge-m3` is already running in the stack. The right move is to reuse it as the backbone of a lightweight discriminative classifier rather than adding a second LLM.

**Implementation:**

1. Embed each query using `bge-m3` → fixed-size vector representation
2. Train a logistic regression (or SVM with RBF kernel) over those embeddings on a labeled dataset of `(query, type)` pairs
3. At inference: embed the query → score it with the trained classifier (or, in the simplest variant, dot product against class centroids) → argmax with a confidence score
4. If `max(softmax(logits)) < threshold` (e.g. 0.85), fall through to Layer 3

This is CPU-only inference — no GPU, no Ollama call, no prompt templating. The classification head itself runs in microseconds; the only meaningful cost is the embedding step.

### The data flywheel: `classify_history_store` as training set

Every session already generates labeled examples. `classify_history_store` stores `(topic_snippet, type)` pairs that are implicitly validated by the system — if the user continued the conversation without correcting the assistant, the classification was likely correct.

```
classify_history_store → periodic export → labeled dataset → retrain classifier
```

The LLM classifier is the **teacher**. The embedding classifier is the **student**. This is knowledge distillation without the overhead of explicit distillation training — the teacher labels production traffic automatically.

**Data collection trigger:** When `classify_history_store` accumulates N sessions (suggested: 500), export and retrain. The classifier improves continuously without human labeling.
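
A sketch of the export step, assuming the store maps session IDs to lists of `(snippet, type)` pairs; the function name and JSONL format are illustrative choices, not an existing interface.

```python
import json
from pathlib import Path

EXPORT_THRESHOLD = 500  # suggested N from the trigger above

def maybe_export(store: dict[str, list[tuple[str, str]]], path: Path) -> bool:
    """Export (snippet, type) pairs as JSONL once enough sessions have accumulated."""
    if len(store) < EXPORT_THRESHOLD:
        return False
    with path.open("w", encoding="utf-8") as f:
        for session_entries in store.values():
            for snippet, query_type in session_entries:
                f.write(json.dumps({"text": snippet, "label": query_type}) + "\n")
    return True
```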

### Caller-declared type (for platform-injected prompts)

The platform generates `PLATFORM` prompts — it always knows the type at generation time. Adding a `query_type` field to `AgentRequest` (proto field 7) allows the caller to declare the type explicitly. When set, all three classifier layers are bypassed entirely.

This makes RC-01 and RC-02 redundant for platform-generated traffic and eliminates the only remaining case where a generative model is used to classify structured platform data.
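
The bypass logic can be sketched as follows. The dataclass is an illustrative stand-in for the generated proto message; only the field name `query_type` comes from this ADR.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class AgentRequest:
    """Stand-in for the proto message; query_type models the new field 7."""
    query: str
    query_type: str = ""  # empty = not declared (proto3 string default)

def resolve_type(request: AgentRequest, classify: Callable[[str], str]) -> str:
    """Caller-declared type wins: skip all three classifier layers entirely."""
    if request.query_type:
        return request.query_type
    return classify(request.query)
```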

### Convergence path

| Phase | What changes | Expected Layer 3 traffic |
|---|---|---|
| Now (bootstrap) | LLM classifier for all unmatched queries | ~95% |
| Phase 1 | Collect labels via `classify_history_store` | ~95% |
| Phase 2 | Deploy embedding classifier (Layer 2) | ~10–20% |
| Phase 3 | Caller-declared type for platform prompts | <5% |
| Phase 4 | LLM classifier becomes anomaly handler only | <2% |

Phase 2 is the highest-leverage step: it replaces the most frequent code path (LLM inference per request) with a CPU-only operation, with no change to the routing contract or the downstream graph.
|