
ADR-0008: Adaptive Query Routing with Intent History and Model Specialization

Date: 2026-04-09
Status: Accepted
Deciders: Rafael Ruiz (CTO)
Related ADRs: ADR-0002 (Two-Phase Streaming), ADR-0003 (Hybrid Retrieval RRF)


Context

The assistance engine previously used a single Ollama model (qwen3:1.7b) for all query types and a single LLM-based classifier that received raw conversation history. Two problems emerged in production:

Problem 1 — Model oversizing for lightweight queries

Platform queries (account status, usage metrics, subscription data) and conversational follow-ups do not require retrieval or a large model. Running qwen3:1.7b for a one-sentence platform insight wastes resources and adds latency.

Problem 2 — Classifier bias from raw message history

When the classifier received raw conversation messages as history, a small model (1.7B parameters) exhibited anchoring bias: it would classify new messages as the same type as recent messages, regardless of the actual content of the new query. This caused platform queries ("You have a project usage percentage of 20%, provide a recommendation") to be misclassified as RETRIEVAL or CODE_GENERATION during sessions that had previously handled AVAP language questions.

Root cause: passing full message content to a small classifier is too noisy. The model uses conversation topic as a proxy for intent type.


Decision

1. New query type: PLATFORM

A fourth classification category is introduced alongside RETRIEVAL, CODE_GENERATION, and CONVERSATIONAL:

Type             Purpose                           RAG  Model
RETRIEVAL        AVAP language documentation       Yes  OLLAMA_MODEL_NAME
CODE_GENERATION  Produce working AVAP code         Yes  OLLAMA_MODEL_NAME
CONVERSATIONAL   Rephrase / continue prior answer  No   OLLAMA_MODEL_NAME_CONVERSATIONAL
PLATFORM         Account, metrics, usage, billing  No   OLLAMA_MODEL_NAME_CONVERSATIONAL

PLATFORM queries skip RAG entirely and are served with a dedicated PLATFORM_PROMPT that instructs the model to use extra_context (where user account data is injected) as the primary source.
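
The bypass logic can be sketched as follows. This is a minimal illustration: build_context and run_retrieval are hypothetical names, not the actual graph nodes.

```python
RAG_TYPES = {"RETRIEVAL", "CODE_GENERATION"}

def run_retrieval(query: str) -> str:
    """Stand-in for the hybrid Elasticsearch retrieval step (ADR-0003)."""
    return f"<docs for: {query}>"

def build_context(query_type: str, query: str, extra_context: str) -> str:
    """Assemble the context string for the answering model.

    PLATFORM and CONVERSATIONAL queries never trigger retrieval."""
    if query_type in RAG_TYPES:
        return run_retrieval(query)
    if query_type == "PLATFORM":
        return extra_context  # account data injected by the platform
    return ""  # CONVERSATIONAL reuses the prior answer already in history
```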

2. Model specialization via environment variables

Two model slots are configured independently:

OLLAMA_MODEL_NAME=qwen3:1.7b               # RETRIEVAL + CODE_GENERATION
OLLAMA_MODEL_NAME_CONVERSATIONAL=qwen3:0.6b # CONVERSATIONAL + PLATFORM

If OLLAMA_MODEL_NAME_CONVERSATIONAL is not set, both slots fall back to OLLAMA_MODEL_NAME (backward compatible).
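
The fallback can be sketched as follows; resolve_models is an illustrative helper, not the actual configuration code.

```python
import os

def resolve_models() -> tuple[str, str]:
    """Resolve the two model slots from the environment.

    The conversational slot falls back to OLLAMA_MODEL_NAME when unset,
    so single-model deployments keep working unchanged."""
    main = os.environ["OLLAMA_MODEL_NAME"]
    conversational = os.getenv("OLLAMA_MODEL_NAME_CONVERSATIONAL") or main
    return main, conversational
```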

3. Intent history instead of raw message history for classification

The classifier no longer receives raw conversation messages. Instead, a compact intent history (classify_history) is maintained per session:

[RETRIEVAL] "What is addVar in AVAP?"
[CODE_GENERATION] "Write an API endpoint that retur"
[PLATFORM] "You have a project usage percentag"

Each entry stores only the type and a 60-character topic snippet. This gives the classifier the conversational thread (useful for resolving ambiguous references like "this", "esto", "lo anterior") without the topical noise that causes anchoring bias.

classify_history is persisted in classify_history_store (parallel to session_store) and passed in AgentState across turns.
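
The entry format above can be sketched as follows; intent_entry is a hypothetical helper, the real logic lives alongside classify_history_store.

```python
SNIPPET_LEN = 60  # only a short topic snippet is stored, never the full message

def intent_entry(query_type: str, message: str) -> str:
    """Format one classify_history entry: the resolved type plus a
    60-character topic snippet of the user message."""
    return f'[{query_type}] "{message[:SNIPPET_LEN]}"'
```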

4. Classifier prompt redesign

The prompt now includes:

  • <history_rule> — explicit instruction: use history only to resolve ambiguous references, not to predict the category of the new message
  • <platform_priority_rule> — hard override: if the message contains usage percentages, account metrics, quota data, or billing information, classify as PLATFORM regardless of history
  • <step1_purpose> replaced by an inline role instruction stating that each message must be evaluated independently

5. Fast-path for known platform prefixes

Queries containing "you are a direct and concise assistant" (a system-injected prefix used by the platform) are classified as PLATFORM deterministically, without invoking the LLM classifier. Because this prefix is controlled by the platform rather than by user input, deterministic detection is both correct and cheaper.


Routing Contract

This section is normative. Any reimplementation of the classifier or the graph must satisfy all rules below. Rules are ordered by priority — a higher-priority rule always wins.

RC-01 — Fast-path override (priority: highest)

If the query contains a known platform-injected prefix, the system MUST classify it as PLATFORM without invoking any LLM.

∀ q : query
  contains(q, known_platform_prefix) → route(q) = PLATFORM

Current registered prefixes (see _PLATFORM_PATTERNS in graph.py):

  • "you are a direct and concise assistant"

Adding a new prefix requires a code change to _PLATFORM_PATTERNS and a corresponding update to this list.

RC-02 — Platform data signal (priority: high)

If the query contains any of the following signals, the classifier MUST output PLATFORM regardless of conversation history:

  • Usage percentages (e.g. "20%" in the context of project/account usage)
  • Account metrics or consumption figures
  • Quota, limit, or billing data

This rule is enforced via <platform_priority_rule> in the classifier prompt. It cannot be overridden by history.

RC-03 — Intent history scoping (priority: medium)

The classifier MUST use classify_history only to resolve ambiguous pronoun or deictic references ("this", "esto", "lo anterior", "that function"). It MUST NOT use history to predict or bias the type of the current message.

classify(q, history) ≠ f(dominant_type(history))
classify(q, history) = f(intent(q), resolve_references(q, history))

Rationale: Small LLMs implicitly compute P(type | history) instead of P(type | message_content). The distribution of previous intents must not influence the prior probability of the current classification. Each message is an independent classification event — a session with 10 RETRIEVAL turns does not make the next message more likely to be RETRIEVAL. The <history_rule> in the classifier prompt enforces this explicitly.

RC-04 — RAG bypass (priority: medium)

Retrieval behaviour by query type:

Type             RAG  Justification
RETRIEVAL        Yes  Requires documentation context
CODE_GENERATION  Yes  Requires syntax examples
CONVERSATIONAL   No   Reformulates a prior answer already in context
PLATFORM         No   Data is injected via extra_context, not retrieved

A PLATFORM or CONVERSATIONAL query that triggers a retrieval step is a contract violation.
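
A runtime guard for this violation might look like the following. This is illustrative; the ADR does not specify how violations are detected in production.

```python
NO_RAG_TYPES = {"PLATFORM", "CONVERSATIONAL"}

def assert_rag_contract(query_type: str, retrieval_invoked: bool) -> None:
    """Raise if a no-RAG query type reached the retrieval step
    (an RC-04 contract violation)."""
    if query_type in NO_RAG_TYPES and retrieval_invoked:
        raise RuntimeError(f"RC-04 violated: {query_type} query hit retrieval")
```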

RC-05 — Model assignment (priority: medium)

route(q) ∈ {RETRIEVAL, CODE_GENERATION} → model = OLLAMA_MODEL_NAME
route(q) ∈ {CONVERSATIONAL, PLATFORM}   → model = OLLAMA_MODEL_NAME_CONVERSATIONAL
                                                    ?? OLLAMA_MODEL_NAME  # fallback if unset

Changing which types map to which model slot requires updating this contract.

RC-06 — History growth bound (priority: low)

The classifier input MUST be bounded: the classifier reads at most the last 6 entries of the session's classify_history. The store itself may grow unbounded in memory, but the classifier input is always capped.
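
The cap can be sketched as follows; classifier_window is an illustrative name.

```python
MAX_CLASSIFIER_ENTRIES = 6  # the classifier never sees more than this

def classifier_window(classify_history: list[str]) -> list[str]:
    """Return the slice of classify_history the classifier may read.

    The store itself may keep growing; only the newest 6 entries are passed on."""
    return classify_history[-MAX_CLASSIFIER_ENTRIES:]
```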

Contract violations to monitor

Symptom                                            Violated rule
Platform query hits Elasticsearch                  RC-04
qwen3:1.7b used for a PLATFORM response            RC-05
Platform prefix query triggers the LLM classifier  RC-01
Classifier output mirrors dominant history type    RC-03

Consequences

Positive

  • Platform and conversational queries are served by a smaller, faster model
  • Classifier bias from conversation history is eliminated while preserving the ability to resolve ambiguous references
  • PLATFORM queries never hit Elasticsearch, reducing unnecessary retrieval load
  • The system is more predictable: platform-injected prompts are classified in O(1) without an LLM call

Negative / Trade-offs

  • classify_history adds a small amount of state per session (bounded to last 6 entries)
  • Two model slots mean two warm-up calls at startup if models differ
  • The qwen3:1.7b classifier can still misclassify edge cases where no platform signals are present in the text — this is inherent to using a 1.7B model for semantic classification

Open questions

  • Whether the classifier should be upgraded to a more capable model in the future (at the cost of latency/resources)
  • Whether PLATFORM should eventually split into sub-types (e.g. PLATFORM_METRICS vs PLATFORM_BILLING) as the platform data schema grows