
ADR-0008: Adaptive Query Routing with Intent History and Model Specialization

Date: 2026-04-09
Status: Accepted
Deciders: Rafael Ruiz (CTO)
Related ADRs: ADR-0002 (Two-Phase Streaming), ADR-0003 (Hybrid Retrieval RRF)


Context

The assistance engine previously used a single Ollama model (qwen3:1.7b) for all query types and a single LLM-based classifier that received raw conversation history. Two problems emerged in production:

Problem 1 — Model oversizing for lightweight queries

Platform queries (account status, usage metrics, subscription data) and conversational follow-ups do not require retrieval or a large model. Running qwen3:1.7b for a one-sentence platform insight wastes resources and adds latency.

Problem 2 — Classifier bias from raw message history

When the classifier received raw conversation messages as history, a small model (1.7B parameters) exhibited anchoring bias: it would classify new messages as the same type as recent messages, regardless of the actual content of the new query. This caused platform queries ("You have a project usage percentage of 20%, provide a recommendation") to be misclassified as RETRIEVAL or CODE_GENERATION during sessions that had previously handled AVAP language questions.

Root cause: passing full message content to a small classifier is too noisy. The model uses conversation topic as a proxy for intent type.


Decision

1. New query type: PLATFORM

A fourth classification category is introduced alongside RETRIEVAL, CODE_GENERATION, and CONVERSATIONAL:

Type             Purpose                           RAG  Model
RETRIEVAL        AVAP language documentation       Yes  OLLAMA_MODEL_NAME
CODE_GENERATION  Produce working AVAP code         Yes  OLLAMA_MODEL_NAME
CONVERSATIONAL   Rephrase / continue prior answer  No   OLLAMA_MODEL_NAME_CONVERSATIONAL
PLATFORM         Account, metrics, usage, billing  No   OLLAMA_MODEL_NAME_CONVERSATIONAL

PLATFORM queries skip RAG entirely and are served with a dedicated PLATFORM_PROMPT that instructs the model to use extra_context (where user account data is injected) as the primary source.
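
The bypass logic can be sketched as follows. This is a minimal illustration: build_context and run_retrieval are hypothetical names, not the actual graph nodes.

```python
RAG_TYPES = {"RETRIEVAL", "CODE_GENERATION"}

def run_retrieval(query: str) -> str:
    """Stand-in for the hybrid Elasticsearch retrieval step (ADR-0003)."""
    return f"<docs for: {query}>"

def build_context(query_type: str, query: str, extra_context: str) -> str:
    """Assemble the context string for the answering model.

    PLATFORM and CONVERSATIONAL queries never trigger retrieval."""
    if query_type in RAG_TYPES:
        return run_retrieval(query)
    if query_type == "PLATFORM":
        return extra_context  # account data injected by the platform
    return ""  # CONVERSATIONAL reuses the prior answer already in history
```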

2. Model specialization via environment variables

Two model slots are configured independently:

OLLAMA_MODEL_NAME=qwen3:1.7b               # RETRIEVAL + CODE_GENERATION
OLLAMA_MODEL_NAME_CONVERSATIONAL=qwen3:0.6b # CONVERSATIONAL + PLATFORM

If OLLAMA_MODEL_NAME_CONVERSATIONAL is not set, both slots fall back to OLLAMA_MODEL_NAME (backward compatible).
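
The fallback can be sketched as follows; resolve_models is an illustrative helper, not the actual configuration code.

```python
import os

def resolve_models() -> tuple[str, str]:
    """Resolve the two model slots from the environment.

    The conversational slot falls back to OLLAMA_MODEL_NAME when unset,
    so single-model deployments keep working unchanged."""
    main = os.environ["OLLAMA_MODEL_NAME"]
    conversational = os.getenv("OLLAMA_MODEL_NAME_CONVERSATIONAL") or main
    return main, conversational
```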

3. Intent history instead of raw message history for classification

The classifier no longer receives raw conversation messages. Instead, a compact intent history (classify_history) is maintained per session:

[RETRIEVAL] "What is addVar in AVAP?"
[CODE_GENERATION] "Write an API endpoint that retur"
[PLATFORM] "You have a project usage percentag"

Each entry stores only the type and a 60-character topic snippet. This gives the classifier the conversational thread (useful for resolving ambiguous references like "this", "esto", "lo anterior") without the topical noise that causes anchoring bias.

classify_history is persisted in classify_history_store (parallel to session_store) and passed in AgentState across turns.
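
The entry format above can be sketched as follows; intent_entry is a hypothetical helper, the real logic lives alongside classify_history_store.

```python
SNIPPET_LEN = 60  # only a short topic snippet is stored, never the full message

def intent_entry(query_type: str, message: str) -> str:
    """Format one classify_history entry: the resolved type plus a
    60-character topic snippet of the user message."""
    return f'[{query_type}] "{message[:SNIPPET_LEN]}"'
```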

4. Classifier prompt redesign

The prompt now includes:

  • <history_rule> — explicit instruction: use history only to resolve ambiguous references, not to predict the category of the new message
  • <platform_priority_rule> — hard override: if the message contains usage percentages, account metrics, quota data, or billing information, classify as PLATFORM regardless of history
  • <step1_purpose> replaced by an inline role instruction stating that each message must be evaluated independently

5. Fast-path for known platform prefixes

Queries containing "you are a direct and concise assistant" (a system-injected prefix used by the platform) are classified as PLATFORM deterministically, without invoking the LLM classifier. Because this prefix is controlled by the platform rather than by user input, deterministic detection is both correct and cheaper.


Routing Contract

This section is normative. Any reimplementation of the classifier or the graph must satisfy all rules below. Rules are ordered by priority — a higher-priority rule always wins.

RC-01 — Fast-path override (priority: highest)

If the query contains a known platform-injected prefix, the system MUST classify it as PLATFORM without invoking any LLM.

∀ q : query
  contains(q, known_platform_prefix) → route(q) = PLATFORM

Current registered prefixes (see _PLATFORM_PATTERNS in graph.py):

  • "you are a direct and concise assistant"

Adding a new prefix requires a code change to _PLATFORM_PATTERNS and a corresponding update to this list.

RC-02 — Platform data signal (priority: high)

If the query contains any of the following signals, the classifier MUST output PLATFORM regardless of conversation history:

  • Usage percentages (e.g. "20%" in the context of project/account usage)
  • Account metrics or consumption figures
  • Quota, limit, or billing data

This rule is enforced via <platform_priority_rule> in the classifier prompt. It cannot be overridden by history.

RC-03 — Intent history scoping (priority: medium)

The classifier MUST use classify_history only to resolve ambiguous pronoun or deictic references ("this", "esto", "lo anterior", "that function"). It MUST NOT use history to predict or bias the type of the current message.

classify(q, history) ≠ f(dominant_type(history))
classify(q, history) = f(intent(q), resolve_references(q, history))

Rationale: Small LLMs implicitly compute P(type | history) instead of P(type | message_content). The distribution of previous intents must not influence the prior probability of the current classification. Each message is an independent classification event — a session with 10 RETRIEVAL turns does not make the next message more likely to be RETRIEVAL. The <history_rule> in the classifier prompt enforces this explicitly.

RC-04 — RAG bypass (priority: medium)

Retrieval behaviour by query type:

Type             RAG  Justification
RETRIEVAL        Yes  Requires documentation context
CODE_GENERATION  Yes  Requires syntax examples
CONVERSATIONAL   No   Reformulates a prior answer already in context
PLATFORM         No   Data is injected via extra_context, not retrieved

A PLATFORM or CONVERSATIONAL query that triggers a retrieval step is a contract violation.
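
A runtime guard for this violation might look like the following. This is illustrative; the ADR does not specify how violations are detected in production.

```python
NO_RAG_TYPES = {"PLATFORM", "CONVERSATIONAL"}

def assert_rag_contract(query_type: str, retrieval_invoked: bool) -> None:
    """Raise if a no-RAG query type reached the retrieval step
    (an RC-04 contract violation)."""
    if query_type in NO_RAG_TYPES and retrieval_invoked:
        raise RuntimeError(f"RC-04 violated: {query_type} query hit retrieval")
```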

RC-05 — Model assignment (priority: medium)

route(q) ∈ {RETRIEVAL, CODE_GENERATION} → model = OLLAMA_MODEL_NAME
route(q) ∈ {CONVERSATIONAL, PLATFORM}   → model = OLLAMA_MODEL_NAME_CONVERSATIONAL
                                                    ?? OLLAMA_MODEL_NAME  # fallback if unset

Changing which types map to which model slot requires updating this contract.

RC-06 — History growth bound (priority: low)

The classifier input MUST be bounded: the classifier reads at most the last 6 entries of the session's classify_history. The store itself may grow unbounded in memory, but the classifier input is always capped.
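
The cap can be sketched as follows; classifier_window is an illustrative name.

```python
MAX_CLASSIFIER_ENTRIES = 6  # the classifier never sees more than this

def classifier_window(classify_history: list[str]) -> list[str]:
    """Return the slice of classify_history the classifier may read.

    The store itself may keep growing; only the newest 6 entries are passed on."""
    return classify_history[-MAX_CLASSIFIER_ENTRIES:]
```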

Contract violations to monitor

Symptom                                            Violated rule
Platform query hits Elasticsearch                  RC-04
qwen3:1.7b used for a PLATFORM response            RC-05
Platform prefix query triggers the LLM classifier  RC-01
Classifier output mirrors dominant history type    RC-03

Consequences

Positive

  • Platform and conversational queries are served by a smaller, faster model
  • Classifier bias from conversation history is eliminated while preserving the ability to resolve ambiguous references
  • PLATFORM queries never hit Elasticsearch, reducing unnecessary retrieval load
  • The system is more predictable: platform-injected prompts are classified in O(1) without an LLM call

Negative / Trade-offs

  • classify_history adds a small amount of state per session (bounded to last 6 entries)
  • Two model slots mean two warm-up calls at startup if models differ
  • The qwen3:1.7b classifier can still misclassify edge cases where no platform signals are present in the text — this is inherent to using a 1.7B model for semantic classification

Open questions

  • Whether the classifier should be upgraded to a more capable model in the future (at the cost of latency/resources)
  • Whether PLATFORM should eventually split into sub-types (e.g. PLATFORM_METRICS vs PLATFORM_BILLING) as the platform data schema grows