
# ADR-0008: Adaptive Query Routing with Intent History and Model Specialization
**Date:** 2026-04-09
**Status:** Accepted
**Deciders:** Rafael Ruiz (CTO)
**Related ADRs:** ADR-0002 (Two-Phase Streaming), ADR-0003 (Hybrid Retrieval RRF)
---
## Context
The assistance engine previously used a single Ollama model (`qwen3:1.7b`) for all query types and a single LLM-based classifier that received raw conversation history. Two problems emerged in production:
### Problem 1 — Model oversizing for lightweight queries
Platform queries (account status, usage metrics, subscription data) and conversational follow-ups do not require retrieval or a large model. Running `qwen3:1.7b` for a one-sentence platform insight wastes resources and adds latency.
### Problem 2 — Classifier bias from raw message history
When the classifier received raw conversation messages as history, a small model (1.7B parameters) exhibited **anchoring bias**: it would classify new messages as the same type as recent messages, regardless of the actual content of the new query. This caused platform queries (`"You have a project usage percentage of 20%, provide a recommendation"`) to be misclassified as `RETRIEVAL` or `CODE_GENERATION` during sessions that had previously handled AVAP language questions.
Root cause: passing full message content to a small classifier is too noisy. The model uses conversation topic as a proxy for intent type.
---
## Decision
### 1. New query type: `PLATFORM`
A fourth classification category is introduced alongside `RETRIEVAL`, `CODE_GENERATION`, and `CONVERSATIONAL`:
| Type | Purpose | RAG | Model |
|---|---|---|---|
| `RETRIEVAL` | AVAP language documentation | Yes | `OLLAMA_MODEL_NAME` |
| `CODE_GENERATION` | Produce working AVAP code | Yes | `OLLAMA_MODEL_NAME` |
| `CONVERSATIONAL` | Rephrase / continue prior answer | No | `OLLAMA_MODEL_NAME_CONVERSATIONAL` |
| `PLATFORM` | Account, metrics, usage, billing | No | `OLLAMA_MODEL_NAME_CONVERSATIONAL` |
`PLATFORM` queries skip RAG entirely and are served with a dedicated `PLATFORM_PROMPT` that instructs the model to use `extra_context` (where user account data is injected) as its primary source.
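The prompt assembly for a `PLATFORM` turn can be sketched as follows. `PLATFORM_PROMPT` and `extra_context` are the names used in this ADR; the template wording itself is illustrative, not the production prompt.

```python
# Hypothetical sketch of PLATFORM prompt assembly. The actual PLATFORM_PROMPT
# text lives in the engine; only the extra_context injection pattern is shown.
PLATFORM_PROMPT = (
    "You are a platform assistant. Answer using ONLY the account data below.\n"
    "Account data:\n{extra_context}\n\n"
    "User question: {query}"
)

def build_platform_prompt(query: str, extra_context: str) -> str:
    """Render the PLATFORM prompt; no retrieval step is involved."""
    return PLATFORM_PROMPT.format(extra_context=extra_context, query=query)
```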
### 2. Model specialization via environment variables
Two model slots are configured independently:
```
OLLAMA_MODEL_NAME=qwen3:1.7b # RETRIEVAL + CODE_GENERATION
OLLAMA_MODEL_NAME_CONVERSATIONAL=qwen3:0.6b # CONVERSATIONAL + PLATFORM
```
If `OLLAMA_MODEL_NAME_CONVERSATIONAL` is not set, both slots fall back to `OLLAMA_MODEL_NAME` (backward compatible).
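The fallback behavior can be sketched as a small config resolver. The function name is illustrative; only the two environment variables come from this ADR.

```python
import os

def resolve_model_slots() -> tuple[str, str]:
    """Resolve the two model slots. The conversational slot falls back to the
    main slot when OLLAMA_MODEL_NAME_CONVERSATIONAL is unset (backward
    compatible with single-model deployments)."""
    main = os.environ.get("OLLAMA_MODEL_NAME", "qwen3:1.7b")
    conversational = os.environ.get("OLLAMA_MODEL_NAME_CONVERSATIONAL", main)
    return main, conversational
```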
### 3. Intent history instead of raw message history for classification
The classifier no longer receives raw conversation messages. Instead, a compact **intent history** (`classify_history`) is maintained per session:
```
[RETRIEVAL] "What is addVar in AVAP?"
[CODE_GENERATION] "Write an API endpoint that retur"
[PLATFORM] "You have a project usage percentag"
```
Each entry stores only the `type` and a 60-character topic snippet. This gives the classifier the conversational thread (useful for resolving ambiguous references like "this", "esto", "lo anterior") without the topical noise that causes anchoring bias.
`classify_history` is persisted in `classify_history_store` (parallel to `session_store`) and passed in `AgentState` across turns.
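Building one entry of this compact history can be sketched as below. The helper name is illustrative; the entry format and the 60-character snippet bound are from this ADR.

```python
def append_intent(history: list[str], query_type: str, query: str) -> None:
    """Append one compact intent-history entry: the classified type plus a
    60-character topic snippet, as shown in the example above."""
    snippet = query[:60]  # topical detail beyond 60 chars is deliberately dropped
    history.append(f'[{query_type}] "{snippet}"')
```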
### 4. Classifier prompt redesign
The prompt now includes:
- **`<history_rule>`** — explicit instruction: use history only to resolve ambiguous references, not to predict the category of the new message
- **`<platform_priority_rule>`** — hard override: if the message contains usage percentages, account metrics, quota data, or billing information, classify as `PLATFORM` regardless of history
- **`<step1_purpose>`** replaced by inline role instruction that each message must be evaluated independently
### 5. Fast-path for known platform prefixes
Queries containing `"you are a direct and concise assistant"` (a system-injected prefix used by the platform) are classified as `PLATFORM` deterministically without invoking the LLM classifier. This is justified because this prefix is controlled by the platform itself, not by user input, so deterministic detection is both correct and cheaper.
---
## Routing Contract
This section is normative. Any reimplementation of the classifier or the graph must satisfy all rules below. Rules are ordered by priority — a higher-priority rule always wins.
### RC-01 — Fast-path override (priority: highest)
If the query contains a known platform-injected prefix, the system **MUST** classify it as `PLATFORM` without invoking any LLM.
```
∀ q : query
contains(q, known_platform_prefix) → route(q) = PLATFORM
```
Current registered prefixes (see `_PLATFORM_PATTERNS` in `graph.py`):
- `"you are a direct and concise assistant"`
Adding a new prefix requires a code change to `_PLATFORM_PATTERNS` and a corresponding update to this list.
### RC-02 — Platform data signal (priority: high)
If the query contains any of the following signals, the classifier **MUST** output `PLATFORM` regardless of conversation history:
- Usage percentages (e.g. `"20%"` in the context of project/account usage)
- Account metrics or consumption figures
- Quota, limit, or billing data
This rule is enforced via `<platform_priority_rule>` in the classifier prompt. It cannot be overridden by history.
### RC-03 — Intent history scoping (priority: medium)
The classifier **MUST** use `classify_history` only to resolve ambiguous pronoun or deictic references (`"this"`, `"esto"`, `"lo anterior"`, `"that function"`). It **MUST NOT** use history to predict or bias the type of the current message.
```
classify(q, history) ≠ f(dominant_type(history))
classify(q, history) = f(intent(q), resolve_references(q, history))
```
**Rationale:** Small LLMs implicitly compute `P(type | history)` instead of `P(type | message_content)`. The distribution of previous intents must not influence the prior probability of the current classification. Each message is an independent classification event — a session with 10 `RETRIEVAL` turns does not make the next message more likely to be `RETRIEVAL`. The `<history_rule>` in the classifier prompt enforces this explicitly.
### RC-04 — RAG bypass (priority: medium)
Query types that bypass Elasticsearch retrieval:
| Type | RAG | Justification |
|---|---|---|
| `RETRIEVAL` | Yes | Requires documentation context |
| `CODE_GENERATION` | Yes | Requires syntax examples |
| `CONVERSATIONAL` | No | Reformulates prior answer already in context |
| `PLATFORM` | No | Data is injected via `extra_context`, not retrieved |
A `PLATFORM` or `CONVERSATIONAL` query that triggers a retrieval step is a contract violation.
### RC-05 — Model assignment (priority: medium)
```
route(q) ∈ {RETRIEVAL, CODE_GENERATION}  → model = OLLAMA_MODEL_NAME
route(q) ∈ {CONVERSATIONAL, PLATFORM}    → model = OLLAMA_MODEL_NAME_CONVERSATIONAL
                                                   ?? OLLAMA_MODEL_NAME  # fallback if unset
```
Changing which types map to which model slot requires updating this contract.
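RC-04 and RC-05 together form a small static routing table, which can be sketched as below. The slot names `"main"` and `"conversational"` are illustrative stand-ins for the two env-configured models; the conversational-slot fallback is resolved at config time, not here.

```python
# Combined routing table for RC-04 (RAG bypass) and RC-05 (model assignment).
ROUTING = {
    "RETRIEVAL":       {"rag": True,  "model_slot": "main"},
    "CODE_GENERATION": {"rag": True,  "model_slot": "main"},
    "CONVERSATIONAL":  {"rag": False, "model_slot": "conversational"},
    "PLATFORM":        {"rag": False, "model_slot": "conversational"},
}

def route(query_type: str) -> dict:
    """Look up the routing contract entry. An unknown type raises KeyError
    rather than silently defaulting, so contract violations surface early."""
    return ROUTING[query_type]
```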
### RC-06 — History growth bound (priority: low)
`classify_history` per session **MUST** be bounded. The classifier reads at most the last 6 entries. The store may grow unbounded in memory but the classifier input is always capped.
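The read-time cap can be sketched in one slice. The function name is illustrative; the window size of 6 is from this rule.

```python
CLASSIFIER_HISTORY_WINDOW = 6  # RC-06: classifier reads at most the last 6 entries

def classifier_view(history: list[str]) -> list[str]:
    """Return the capped slice of classify_history passed to the classifier.
    The underlying store is left untouched; only the classifier input is bounded."""
    return history[-CLASSIFIER_HISTORY_WINDOW:]
```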
### Contract violations to monitor
| Symptom | Violated rule |
|---|---|
| Platform query hits Elasticsearch | RC-04 |
| `qwen3:1.7b` used for a `PLATFORM` response | RC-05 |
| Platform prefix query triggers LLM classifier | RC-01 |
| Classifier output mirrors dominant history type | RC-03 |
---
## Consequences
### Positive
- Platform and conversational queries are served by a smaller, faster model
- Classifier bias from conversation history is eliminated while preserving the ability to resolve ambiguous references
- `PLATFORM` queries never hit Elasticsearch, reducing unnecessary retrieval load
- The system is more predictable: platform-injected prompts are classified in O(1) without an LLM call
### Negative / Trade-offs
- `classify_history` adds a small amount of state per session (bounded to last 6 entries)
- Two model slots mean two warm-up calls at startup if models differ
- The `qwen3:1.7b` classifier can still misclassify edge cases where no platform signals are present in the text — this is inherent to using a 1.7B model for semantic classification
### Open questions
- Whether `PLATFORM` should eventually split into sub-types (e.g. `PLATFORM_METRICS` vs `PLATFORM_BILLING`) as the platform data schema grows
---
## Future Path: Discriminative Classifier Pipeline
### The fundamental problem with the current design
The LLM classifier is a **generative model doing discriminative work**. Generating tokens to produce a 4-class label wastes orders of magnitude more compute than the task requires, introduces non-determinism, and forces prompt engineering as a substitute for proper model design. The rules in RC-01 through RC-06 exist precisely to compensate for this architectural mismatch.
The current design is correct as a **bootstrap mechanism** — it lets the system operate before labeled training data exists. But it should not be the steady-state architecture.
### Target architecture
A layered pipeline where each layer is only invoked if the previous layer cannot produce a confident answer:
```
Query
  │
  ▼
[Layer 1] Hard rules (RC-01, RC-02)        ← O(1), deterministic
  │ no match
  ▼
[Layer 2] Embedding similarity classifier  ← ~1 ms, CPU, no LLM
  │ confidence < threshold
  ▼
[Layer 3] LLM classifier (current design)  ← fallback only
  │
  ▼
Classification result
```
In steady state, Layer 3 should handle fewer than 5% of requests — only genuinely ambiguous queries that neither rules nor the trained classifier can resolve with confidence.
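The layered dispatch can be sketched as a function that only invokes the next layer when the previous one declines. All layer implementations here are placeholders passed in as callables; only the 0.85 threshold and the layer ordering come from this document.

```python
from typing import Callable, Optional

def classify(
    query: str,
    hard_rules: Callable[[str], Optional[str]],          # Layer 1: O(1), deterministic
    embedding_clf: Callable[[str], tuple[str, float]],   # Layer 2: (label, confidence)
    llm_clf: Callable[[str], str],                       # Layer 3: LLM fallback
    threshold: float = 0.85,
) -> str:
    """Invoke each layer only when the previous one cannot answer confidently."""
    ruled = hard_rules(query)
    if ruled is not None:
        return ruled
    label, confidence = embedding_clf(query)
    if confidence >= threshold:
        return label
    return llm_clf(query)
```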
### Layer 2: Embedding similarity classifier
`bge-m3` is already running in the stack. The right move is to reuse it as the backbone of a lightweight discriminative classifier rather than adding a second LLM.
**Implementation:**
1. Embed each query using `bge-m3` → fixed-size vector representation
2. Train a logistic regression (or SVM with RBF kernel) over those embeddings on a labeled dataset of `(query, type)` pairs
3. At inference: embed query → dot product against class centroids → argmax with confidence score
4. If `max(softmax(logits)) < threshold` (e.g. 0.85), fall through to Layer 3
The classification head itself runs in microseconds on CPU; the only per-query cost beyond it is the `bge-m3` embedding call the stack already makes. No Ollama call, no prompt templating, no token generation.
### The data flywheel: `classify_history_store` as training set
Every session already generates labeled examples. `classify_history_store` stores `(topic_snippet, type)` pairs that are implicitly validated by the system — if the user continued the conversation without correcting the assistant, the classification was likely correct.
```
classify_history_store → periodic export → labeled dataset → retrain classifier
```
The LLM classifier is the **teacher**. The embedding classifier is the **student**. This is knowledge distillation without the overhead of explicit distillation training — the teacher labels production traffic automatically.
**Data collection trigger:** When `classify_history_store` accumulates N sessions (suggested: 500), export and retrain. The classifier improves continuously without human labeling.
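The export step can be sketched as below, assuming the store maps session IDs to lists of entries in the `[TYPE] "snippet"` format shown earlier. The function names and the exact parsing are illustrative; the N=500 trigger is from this section.

```python
RETRAIN_EVERY = 500  # suggested N: sessions accumulated before each retrain

def export_labeled_pairs(store: dict[str, list[str]]) -> list[tuple[str, str]]:
    """Flatten classify_history_store into (snippet, type) training pairs.
    Entries look like: [RETRIEVAL] "What is addVar in AVAP?" """
    pairs = []
    for entries in store.values():
        for entry in entries:
            query_type, _, snippet = entry.partition("] ")
            pairs.append((snippet.strip('"'), query_type.lstrip("[")))
    return pairs

def should_retrain(store: dict[str, list[str]]) -> bool:
    """Trigger a retrain once enough sessions have accumulated."""
    return len(store) >= RETRAIN_EVERY
```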
### Caller-declared type (for platform-injected prompts)
The platform generates `PLATFORM` prompts — it always knows the type at generation time. Adding a `query_type` field to `AgentRequest` (proto field 7) allows the caller to declare the type explicitly. When set, all three classifier layers are bypassed entirely.
This makes RC-01 and RC-02 redundant for platform-generated traffic and eliminates the only remaining case where a generative model is used to classify structured platform data.
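The bypass logic can be sketched as below. The dataclass is a hypothetical Python mirror of the proposed `AgentRequest` extension, not the actual proto-generated class.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class AgentRequest:
    """Hypothetical mirror of the proposed request with query_type (proto field 7)."""
    query: str
    query_type: Optional[str] = None  # declared by the platform when known

def resolve_type(request: AgentRequest, classify: Callable[[str], str]) -> str:
    """Prefer the caller-declared type; otherwise run the classifier pipeline."""
    if request.query_type is not None:
        return request.query_type  # all three classifier layers are skipped
    return classify(request.query)
```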
### Convergence path
| Phase | What changes | Expected Layer 3 traffic |
|---|---|---|
| Now (bootstrap) | LLM classifier for all unmatched queries | ~95% |
| Phase 1 | Collect labels via `classify_history_store` | ~95% |
| Phase 2 | Deploy embedding classifier (Layer 2) | ~10-20% |
| Phase 3 | Caller-declared type for platform prompts | <5% |
| Phase 4 | LLM classifier becomes anomaly handler only | <2% |
Phase 2 is the highest-leverage step: it replaces the most frequent code path (LLM inference per request) with a CPU-only operation, with no change to the routing contract or the downstream graph.