# ADR-0008: Adaptive Query Routing with Intent History and Model Specialization

**Date:** 2026-04-09
**Status:** Accepted
**Deciders:** Rafael Ruiz (CTO)
**Related ADRs:** ADR-0002 (Two-Phase Streaming), ADR-0003 (Hybrid Retrieval RRF)

---

## Context

The assistance engine previously used a single Ollama model (`qwen3:1.7b`) for all query types and a single LLM-based classifier that received raw conversation history. Two problems emerged in production:

### Problem 1: Model oversizing for lightweight queries

Platform queries (account status, usage metrics, subscription data) and conversational follow-ups do not require retrieval or a large model. Running `qwen3:1.7b` for a one-sentence platform insight wastes resources and adds latency.

### Problem 2: Classifier bias from raw message history

When the classifier received raw conversation messages as history, the small model (1.7B parameters) exhibited **anchoring bias**: it classified new messages as the same type as recent messages, regardless of the actual content of the new query. This caused platform queries (`"You have a project usage percentage of 20%, provide a recommendation"`) to be misclassified as `RETRIEVAL` or `CODE_GENERATION` during sessions that had previously handled AVAP language questions.

Root cause: passing full message content to a small classifier is too noisy. The model uses the conversation topic as a proxy for intent type.

---

## Decision

### 1. New query type: `PLATFORM`

A fourth classification category is introduced alongside `RETRIEVAL`, `CODE_GENERATION`, and `CONVERSATIONAL`:

| Type | Purpose | RAG | Model |
|---|---|---|---|
| `RETRIEVAL` | AVAP language documentation | Yes | `OLLAMA_MODEL_NAME` |
| `CODE_GENERATION` | Produce working AVAP code | Yes | `OLLAMA_MODEL_NAME` |
| `CONVERSATIONAL` | Rephrase / continue prior answer | No | `OLLAMA_MODEL_NAME_CONVERSATIONAL` |
| `PLATFORM` | Account, metrics, usage, billing | No | `OLLAMA_MODEL_NAME_CONVERSATIONAL` |

`PLATFORM` queries skip RAG entirely and are served with a dedicated `PLATFORM_PROMPT` that instructs the model to use `extra_context` (where user account data is injected) as the primary source.

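The routing table above can be read as a simple lookup from query type to retrieval flag and model slot. The following is a minimal Python sketch of that idea, not the engine's actual code: `Route`, `ROUTES`, and `route_for` are hypothetical names, and only the four query types and the two environment variable names come from this ADR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    """Hypothetical routing record: whether RAG runs and which model slot is used."""
    use_rag: bool
    model_env_var: str


# Mirrors the table in this ADR; the data structure itself is an assumption.
ROUTES = {
    "RETRIEVAL": Route(use_rag=True, model_env_var="OLLAMA_MODEL_NAME"),
    "CODE_GENERATION": Route(use_rag=True, model_env_var="OLLAMA_MODEL_NAME"),
    "CONVERSATIONAL": Route(use_rag=False, model_env_var="OLLAMA_MODEL_NAME_CONVERSATIONAL"),
    "PLATFORM": Route(use_rag=False, model_env_var="OLLAMA_MODEL_NAME_CONVERSATIONAL"),
}


def route_for(query_type: str) -> Route:
    """Return the route for a classified query type; reject unknown types."""
    try:
        return ROUTES[query_type]
    except KeyError:
        raise ValueError(f"unknown query type: {query_type!r}") from None
```

Keeping the mapping in one place would make it straightforward to add sub-types later without touching the classifier.
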
### 2. Model specialization via environment variables

Two model slots are configured independently:

```
# RETRIEVAL + CODE_GENERATION
OLLAMA_MODEL_NAME=qwen3:1.7b
# CONVERSATIONAL + PLATFORM
OLLAMA_MODEL_NAME_CONVERSATIONAL=qwen3:0.6b
```

If `OLLAMA_MODEL_NAME_CONVERSATIONAL` is not set, the conversational slot falls back to `OLLAMA_MODEL_NAME`, so both slots use the same model (backward compatible).

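The fallback rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the engine's configuration code: `resolve_model_slots` and the slot keys `primary` and `conversational` are hypothetical, and treating an empty value as "not set" is an assumption the ADR does not state.

```python
import os
from typing import Dict, Mapping, Optional


def resolve_model_slots(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Resolve both model slots, falling back to OLLAMA_MODEL_NAME (hypothetical helper)."""
    env = os.environ if env is None else env
    # The primary slot is required; a missing variable raises KeyError.
    primary = env["OLLAMA_MODEL_NAME"]
    # Assumption: an empty string counts as "not set" and triggers the fallback.
    conversational = env.get("OLLAMA_MODEL_NAME_CONVERSATIONAL") or primary
    return {"primary": primary, "conversational": conversational}
```

Passing a plain dictionary instead of `os.environ` keeps the rule easy to check in isolation, for example `resolve_model_slots({"OLLAMA_MODEL_NAME": "qwen3:1.7b"})`.
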
### 3. Intent history instead of raw message history for classification

The classifier no longer receives raw conversation messages. Instead, a compact **intent history** (`classify_history`) is maintained per session:

```
[RETRIEVAL] "What is addVar in AVAP?"
[CODE_GENERATION] "Write an API endpoint that retur"
[PLATFORM] "You have a project usage percentag"
```

Each entry stores only the `type` and a 60-character topic snippet. This gives the classifier the conversational thread (useful for resolving ambiguous references like "this", "esto", "lo anterior") without the topical noise that causes anchoring bias.

`classify_history` is persisted in `classify_history_store` (parallel to `session_store`) and passed in `AgentState` across turns.

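As a rough illustration of how such a history could be maintained, the sketch below appends a typed snippet and keeps only the most recent entries. The helper names `append_intent` and `format_intent_history` are hypothetical; the 60-character snippet and the six-entry bound come from this ADR, while whitespace normalization is an assumption.

```python
from typing import List, Tuple

SNIPPET_CHARS = 60  # snippet length stated in this ADR
MAX_ENTRIES = 6     # bound stated under Consequences

IntentEntry = Tuple[str, str]  # (query type, topic snippet)


def append_intent(history: List[IntentEntry], query_type: str, message: str) -> List[IntentEntry]:
    """Return a new history with the entry appended and only the last MAX_ENTRIES kept."""
    # Assumption: collapse whitespace so a snippet never spans multiple lines.
    snippet = " ".join(message.split())[:SNIPPET_CHARS]
    return (history + [(query_type, snippet)])[-MAX_ENTRIES:]


def format_intent_history(history: List[IntentEntry]) -> str:
    """Render entries in the bracketed form shown in this ADR."""
    return "\n".join(f'[{query_type}] "{snippet}"' for query_type, snippet in history)
```

Returning a new list instead of mutating the input keeps the helper safe to use with state objects that are copied between turns.
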
### 4. Classifier prompt redesign

The prompt now includes:

- **`<history_rule>`**: explicit instruction to use history only to resolve ambiguous references, not to predict the category of the new message
- **`<platform_priority_rule>`**: hard override; if the message contains usage percentages, account metrics, quota data, or billing information, classify as `PLATFORM` regardless of history
- **`<step1_purpose>`**: replaced by an inline role instruction stating that each message must be evaluated independently

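The prompt structure described above might be assembled along these lines. This is only a sketch: the rule wording is paraphrased from the bullets rather than copied from the real prompt, and `build_classifier_prompt` as well as the `<history>` and `<message>` wrapper tags are hypothetical.

```python
# Rule text is paraphrased from this ADR; the production prompt may differ.
HISTORY_RULE = (
    "<history_rule>\n"
    "Use the intent history only to resolve ambiguous references. "
    "Do not use it to predict the category of the new message.\n"
    "</history_rule>"
)

PLATFORM_PRIORITY_RULE = (
    "<platform_priority_rule>\n"
    "If the message contains usage percentages, account metrics, quota data, "
    "or billing information, classify it as PLATFORM regardless of history.\n"
    "</platform_priority_rule>"
)

ROLE_INSTRUCTION = (
    "Classify the new message as one of RETRIEVAL, CODE_GENERATION, "
    "CONVERSATIONAL, or PLATFORM. Evaluate each message independently."
)


def build_classifier_prompt(intent_history: str, message: str) -> str:
    """Join the role instruction, both rules, the history, and the new message."""
    return "\n\n".join([
        ROLE_INSTRUCTION,
        HISTORY_RULE,
        PLATFORM_PRIORITY_RULE,
        f"<history>\n{intent_history}\n</history>",   # hypothetical wrapper tag
        f"<message>\n{message}\n</message>",          # hypothetical wrapper tag
    ])
```

Placing the rules before the history is a design choice of this sketch, so the model reads the constraints before it sees any prior intent labels.
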
### 5. Fast-path for known platform prefixes

Queries containing `"you are a direct and concise assistant"` (a system-injected prefix used by the platform) are classified as `PLATFORM` deterministically without invoking the LLM classifier. The shortcut is justified because the prefix is controlled by the platform itself, not by user input, so deterministic detection is both correct and cheaper than an LLM call.

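A deterministic check of this kind could look like the sketch below. The function name `fast_path_classify` is hypothetical, and matching case-insensitively is an assumption; the ADR only states that queries containing the prefix are classified as `PLATFORM`.

```python
from typing import Optional

# Prefix quoted in this ADR; it is injected by the platform, not typed by users.
PLATFORM_PREFIX = "you are a direct and concise assistant"


def fast_path_classify(message: str) -> Optional[str]:
    """Return "PLATFORM" if the known prefix is present, else None to defer to the LLM."""
    # Assumption: the comparison ignores case.
    if PLATFORM_PREFIX in message.lower():
        return "PLATFORM"
    return None
```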

---

## Consequences

### Positive

- Platform and conversational queries are served by a smaller, faster model
- Classifier bias from conversation history is eliminated while preserving the ability to resolve ambiguous references
- `PLATFORM` queries never hit Elasticsearch, reducing unnecessary retrieval load
- The system is more predictable: platform-injected prompts are classified deterministically, without an LLM call

### Negative / Trade-offs

- `classify_history` adds a small amount of state per session (bounded to the last 6 entries)
- Two model slots mean two warm-up calls at startup if the models differ
- The `qwen3:1.7b` classifier can still misclassify edge cases where no platform signals are present in the text; this is inherent to using a 1.7B model for semantic classification

### Open questions

- Whether the classifier should be upgraded to a more capable model in the future (at the cost of latency/resources)
- Whether `PLATFORM` should eventually split into sub-types (e.g. `PLATFORM_METRICS` vs `PLATFORM_BILLING`) as the platform data schema grows