# ADR-0008: Adaptive Query Routing with Intent History and Model Specialization

**Date:** 2026-04-09
**Status:** Accepted
**Deciders:** Rafael Ruiz (CTO)
**Related ADRs:** ADR-0002 (Two-Phase Streaming), ADR-0003 (Hybrid Retrieval RRF)

---

## Context

The assistance engine previously used a single Ollama model (`qwen3:1.7b`) for all query types and a single LLM-based classifier that received raw conversation history. Two problems emerged in production:

### Problem 1: Model oversizing for lightweight queries

Platform queries (account status, usage metrics, subscription data) and conversational follow-ups do not require retrieval or a large model. Running `qwen3:1.7b` for a one-sentence platform insight wastes resources and adds latency.

### Problem 2: Classifier bias from raw message history

When the classifier received raw conversation messages as history, a small model (1.7B parameters) exhibited **anchoring bias**: it would classify new messages as the same type as recent messages, regardless of the actual content of the new query. This caused platform queries (`"You have a project usage percentage of 20%, provide a recommendation"`) to be misclassified as `RETRIEVAL` or `CODE_GENERATION` during sessions that had previously handled AVAP language questions.

Root cause: passing full message content to a small classifier is too noisy. The model uses conversation topic as a proxy for intent type.

---

## Decision

### 1. New query type: `PLATFORM`

A fourth classification category is introduced alongside `RETRIEVAL`, `CODE_GENERATION`, and `CONVERSATIONAL`:

| Type | Purpose | RAG | Model |
|---|---|---|---|
| `RETRIEVAL` | AVAP language documentation | Yes | `OLLAMA_MODEL_NAME` |
| `CODE_GENERATION` | Produce working AVAP code | Yes | `OLLAMA_MODEL_NAME` |
| `CONVERSATIONAL` | Rephrase / continue prior answer | No | `OLLAMA_MODEL_NAME_CONVERSATIONAL` |
| `PLATFORM` | Account, metrics, usage, billing | No | `OLLAMA_MODEL_NAME_CONVERSATIONAL` |

`PLATFORM` queries skip RAG entirely and are served with a dedicated `PLATFORM_PROMPT` that instructs the model to use `extra_context` (where user account data is injected) as its primary source.

### 2. Model specialization via environment variables

Two model slots are configured independently:

```
OLLAMA_MODEL_NAME=qwen3:1.7b                  # RETRIEVAL + CODE_GENERATION
OLLAMA_MODEL_NAME_CONVERSATIONAL=qwen3:0.6b   # CONVERSATIONAL + PLATFORM
```

If `OLLAMA_MODEL_NAME_CONVERSATIONAL` is not set, both slots fall back to `OLLAMA_MODEL_NAME` (backward compatible).

### 3. Intent history instead of raw message history for classification

The classifier no longer receives raw conversation messages. Instead, a compact **intent history** (`classify_history`) is maintained per session:

```
[RETRIEVAL] "What is addVar in AVAP?"
[CODE_GENERATION] "Write an API endpoint that retur"
[PLATFORM] "You have a project usage percentag"
```

Each entry stores only the `type` and a 60-character topic snippet. This gives the classifier the conversational thread (useful for resolving ambiguous references like "this", "esto", "lo anterior") without the topical noise that causes anchoring bias.

`classify_history` is persisted in `classify_history_store` (parallel to `session_store`) and passed in `AgentState` across turns.

### 4. Classifier prompt redesign

The prompt now includes:

- **``**: an explicit instruction to use history only to resolve ambiguous references, not to predict the category of the new message
- **``**: a hard override. If the message contains usage percentages, account metrics, quota data, or billing information, classify as `PLATFORM` regardless of history
- **``** replaced by an inline role instruction that each message must be evaluated independently

### 5. Fast-path for known platform prefixes

Queries containing `"you are a direct and concise assistant"` (a system-injected prefix used by the platform) are classified as `PLATFORM` deterministically without invoking the LLM classifier. This is justified because the prefix is controlled by the platform itself, not by user input, so deterministic detection is both correct and cheaper.

---

## Routing Contract

This section is normative. Any reimplementation of the classifier or the graph must satisfy all rules below. Rules are ordered by priority; a higher-priority rule always wins.

### RC-01: Fast-path override (priority: highest)

If the query contains a known platform-injected prefix, the system **MUST** classify it as `PLATFORM` without invoking any LLM.

```
∀ q : query
  contains(q, known_platform_prefix) → route(q) = PLATFORM
```

Current registered prefixes (see `_PLATFORM_PATTERNS` in `graph.py`):

- `"you are a direct and concise assistant"`

Adding a new prefix requires a code change to `_PLATFORM_PATTERNS` and a corresponding update to this list.

### RC-02: Platform data signal (priority: high)

If the query contains any of the following signals, the classifier **MUST** output `PLATFORM` regardless of conversation history:

- Usage percentages (e.g. `"20%"` in the context of project/account usage)
- Account metrics or consumption figures
- Quota, limit, or billing data

This rule is enforced via `` in the classifier prompt. It cannot be overridden by history.
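The fast-path override in RC-01 can be pictured with a minimal sketch. This is not the code in `graph.py`: the names `PLATFORM_PATTERNS`, `fast_path_route`, and `route`, the `classify` callable, and the case-insensitive matching are all assumptions made for illustration.

```python
from typing import Callable, Optional, Tuple

# Hypothetical stand-in for `_PLATFORM_PATTERNS` in graph.py.
PLATFORM_PATTERNS: Tuple[str, ...] = (
    "you are a direct and concise assistant",
)


def fast_path_route(query: str) -> Optional[str]:
    """Return "PLATFORM" if a registered prefix appears in the query, else None.

    Case-insensitive matching is an assumption of this sketch.
    """
    lowered = query.lower()
    if any(pattern in lowered for pattern in PLATFORM_PATTERNS):
        return "PLATFORM"
    return None


def route(query: str, classify: Callable[[str], str]) -> str:
    """Apply the fast path first; call the (hypothetical) LLM classifier only on a miss."""
    return fast_path_route(query) or classify(query)
```

Because `route` short-circuits on a fast-path hit, the classifier callable is never invoked for platform-injected prompts, which is the behaviour RC-01 requires.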
### RC-03: Intent history scoping (priority: medium)

The classifier **MUST** use `classify_history` only to resolve ambiguous pronoun or deictic references (`"this"`, `"esto"`, `"lo anterior"`, `"that function"`). It **MUST NOT** use history to predict or bias the type of the current message.

```
classify(q, history) ≠ f(dominant_type(history))
classify(q, history) = f(intent(q), resolve_references(q, history))
```

### RC-04: RAG bypass (priority: medium)

Whether each query type runs or bypasses Elasticsearch retrieval:

| Type | RAG | Justification |
|---|---|---|
| `RETRIEVAL` | Yes | Requires documentation context |
| `CODE_GENERATION` | Yes | Requires syntax examples |
| `CONVERSATIONAL` | No | Reformulates prior answer already in context |
| `PLATFORM` | No | Data is injected via `extra_context`, not retrieved |

A `PLATFORM` or `CONVERSATIONAL` query that triggers a retrieval step is a contract violation.

### RC-05: Model assignment (priority: medium)

```
route(q) ∈ {RETRIEVAL, CODE_GENERATION} → model = OLLAMA_MODEL_NAME
route(q) ∈ {CONVERSATIONAL, PLATFORM}   → model = OLLAMA_MODEL_NAME_CONVERSATIONAL
                                           ?? OLLAMA_MODEL_NAME   # fallback if unset
```

Changing which types map to which model slot requires updating this contract.

### RC-06: History growth bound (priority: low)

The classifier input derived from `classify_history` **MUST** be bounded per session: the classifier reads at most the last 6 entries. The underlying store may grow without bound in memory, but the classifier input is always capped.
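Rules RC-04 through RC-06 reduce to a few small lookups, sketched below. The function and constant names are hypothetical, the history is modelled as a list of `(type, text)` tuples purely for illustration, and treating an empty `OLLAMA_MODEL_NAME_CONVERSATIONAL` as unset is an assumption rather than documented behaviour.

```python
import os
from typing import List, Mapping, Tuple

RAG_TYPES = frozenset({"RETRIEVAL", "CODE_GENERATION"})  # RC-04: types that use retrieval
MAX_CLASSIFIER_ENTRIES = 6   # RC-06: classifier reads at most the last 6 entries
SNIPPET_CHARS = 60           # topic snippet length stated in Decision 3


def uses_rag(route: str) -> bool:
    """RC-04: only RETRIEVAL and CODE_GENERATION may trigger retrieval."""
    return route in RAG_TYPES


def select_model(route: str, env: Mapping[str, str] = os.environ) -> str:
    """RC-05: pick the model slot, falling back to the primary model if unset."""
    primary = env["OLLAMA_MODEL_NAME"]
    if route in RAG_TYPES:
        return primary
    # Assumption: an empty value is treated the same as an unset variable.
    return env.get("OLLAMA_MODEL_NAME_CONVERSATIONAL") or primary


def render_classify_history(history: List[Tuple[str, str]]) -> str:
    """RC-06: format only the most recent entries, truncating each snippet."""
    recent = history[-MAX_CLASSIFIER_ENTRIES:]
    return "\n".join(f'[{kind}] "{text[:SNIPPET_CHARS]}"' for kind, text in recent)
```

Passing `env` explicitly keeps the lookup testable without mutating the process environment; in the sketch, a session with ten recorded turns still yields only six lines of classifier input.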
### Contract violations to monitor

| Symptom | Violated rule |
|---|---|
| Platform query hits Elasticsearch | RC-04 |
| `qwen3:1.7b` used for a `PLATFORM` response | RC-05 |
| Platform prefix query triggers LLM classifier | RC-01 |
| Classifier output mirrors dominant history type | RC-03 |

---

## Consequences

### Positive

- Platform and conversational queries are served by a smaller, faster model
- Classifier bias from conversation history is eliminated while preserving the ability to resolve ambiguous references
- `PLATFORM` queries never hit Elasticsearch, reducing unnecessary retrieval load
- The system is more predictable: platform-injected prompts are classified by a deterministic string match, without an LLM call

### Negative / Trade-offs

- `classify_history` adds a small amount of state per session (the classifier reads only the last 6 entries)
- Two model slots mean two warm-up calls at startup if the models differ
- The `qwen3:1.7b` classifier can still misclassify edge cases where no platform signals are present in the text; this is inherent to using a 1.7B model for semantic classification

### Open questions

- Whether the classifier should be upgraded to a more capable model in the future (at the cost of latency/resources)
- Whether `PLATFORM` should eventually split into sub-types (e.g. `PLATFORM_METRICS` vs `PLATFORM_BILLING`) as the platform data schema grows