ADR-0008: Adaptive Query Routing with Intent History and Model Specialization

Date: 2026-04-09
Status: Accepted
Deciders: Rafael Ruiz (CTO)
Related ADRs: ADR-0002 (Two-Phase Streaming), ADR-0003 (Hybrid Retrieval RRF)


Context

The assistance engine previously used a single Ollama model (qwen3:1.7b) for all query types and a single LLM-based classifier that received raw conversation history. Two problems emerged in production:

Problem 1 — Model oversizing for lightweight queries

Platform queries (account status, usage metrics, subscription data) and conversational follow-ups do not require retrieval or a large model. Running qwen3:1.7b for a one-sentence platform insight wastes resources and adds latency.

Problem 2 — Classifier bias from raw message history

When the classifier received raw conversation messages as history, the small classifier model (1.7B parameters) exhibited anchoring bias: it tended to assign new messages the same type as recent messages, regardless of the actual content of the new query. As a result, platform queries ("You have a project usage percentage of 20%, provide a recommendation") were misclassified as RETRIEVAL or CODE_GENERATION during sessions that had previously handled AVAP language questions.

Root cause: passing full message content to a small classifier is too noisy. The model uses conversation topic as a proxy for intent type.


Decision

1. New query type: PLATFORM

A fourth classification category is introduced alongside RETRIEVAL, CODE_GENERATION, and CONVERSATIONAL:

| Type            | Purpose                          | RAG | Model                            |
|-----------------|----------------------------------|-----|----------------------------------|
| RETRIEVAL       | AVAP language documentation      | Yes | OLLAMA_MODEL_NAME                |
| CODE_GENERATION | Produce working AVAP code        | Yes | OLLAMA_MODEL_NAME                |
| CONVERSATIONAL  | Rephrase / continue prior answer | No  | OLLAMA_MODEL_NAME_CONVERSATIONAL |
| PLATFORM        | Account, metrics, usage, billing | No  | OLLAMA_MODEL_NAME_CONVERSATIONAL |

PLATFORM queries skip RAG entirely and are served with a dedicated PLATFORM_PROMPT that instructs the model to use extra_context (where user account data is injected) as the primary source.
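The routing implied by the four query types can be sketched as a small lookup from type to RAG usage and model slot. This is an illustrative sketch only: the names QueryType, Route, ROUTES, and route_for are hypothetical and are not taken from the engine's code. Only the four type names and the two environment variable names come from this ADR.

```python
from dataclasses import dataclass
from enum import Enum


class QueryType(Enum):
    RETRIEVAL = "RETRIEVAL"
    CODE_GENERATION = "CODE_GENERATION"
    CONVERSATIONAL = "CONVERSATIONAL"
    PLATFORM = "PLATFORM"


@dataclass(frozen=True)
class Route:
    use_rag: bool        # whether the retrieval step runs for this type
    model_env_var: str   # environment variable naming the model slot


# Hypothetical routing table mirroring the Type / RAG / Model columns.
ROUTES = {
    QueryType.RETRIEVAL: Route(True, "OLLAMA_MODEL_NAME"),
    QueryType.CODE_GENERATION: Route(True, "OLLAMA_MODEL_NAME"),
    QueryType.CONVERSATIONAL: Route(False, "OLLAMA_MODEL_NAME_CONVERSATIONAL"),
    QueryType.PLATFORM: Route(False, "OLLAMA_MODEL_NAME_CONVERSATIONAL"),
}


def route_for(query_type: QueryType) -> Route:
    """Return the route (RAG flag and model slot) for a classified query."""
    return ROUTES[query_type]
```

Keeping the mapping in one table means that adding a new type touches a single place, which may matter if PLATFORM is later split into sub-types.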

2. Model specialization via environment variables

Two model slots are configured independently:

OLLAMA_MODEL_NAME=qwen3:1.7b               # RETRIEVAL + CODE_GENERATION
OLLAMA_MODEL_NAME_CONVERSATIONAL=qwen3:0.6b # CONVERSATIONAL + PLATFORM

If OLLAMA_MODEL_NAME_CONVERSATIONAL is not set, both slots fall back to OLLAMA_MODEL_NAME (backward compatible).
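A minimal sketch of the fallback rule, assuming the slots are read once at startup. The function name resolve_model_slots and the slot keys "main" and "conversational" are hypothetical. Treating an empty value the same as an unset variable is also an assumption, not something this ADR specifies.

```python
import os
from typing import Dict, Mapping, Optional


def resolve_model_slots(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Resolve both model slots from environment variables.

    The conversational slot falls back to OLLAMA_MODEL_NAME when
    OLLAMA_MODEL_NAME_CONVERSATIONAL is unset (or empty, by assumption).
    """
    env = os.environ if env is None else env
    main = env["OLLAMA_MODEL_NAME"]  # required; raises KeyError if missing
    conversational = env.get("OLLAMA_MODEL_NAME_CONVERSATIONAL") or main
    return {"main": main, "conversational": conversational}
```

With only OLLAMA_MODEL_NAME set, both slots resolve to the same model, which matches the pre-ADR single-model behavior.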

3. Intent history instead of raw message history for classification

The classifier no longer receives raw conversation messages. Instead, a compact intent history (classify_history) is maintained per session:

[RETRIEVAL] "What is addVar in AVAP?"
[CODE_GENERATION] "Write an API endpoint that retur"
[PLATFORM] "You have a project usage percentag"

Each entry stores only the type and a 60-character topic snippet. This gives the classifier the conversational thread (useful for resolving ambiguous references like "this", "esto", "lo anterior") without the topical noise that causes anchoring bias.

classify_history is persisted in classify_history_store (parallel to session_store) and passed in AgentState across turns.
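One way to maintain such a history is sketched below. The function name append_intent and the constants are hypothetical. The 60-character snippet length and the six-entry bound come from this ADR, while collapsing whitespace before truncation is an assumption added for illustration.

```python
from typing import List

SNIPPET_CHARS = 60  # topic snippet length stated in this ADR
MAX_ENTRIES = 6     # per-session bound stated under Consequences


def append_intent(history: List[str], query_type: str, message: str) -> List[str]:
    """Return a new intent history with one entry appended.

    Each entry keeps only the classified type and a short topic snippet,
    never the full message content.
    """
    # Assumption: collapse runs of whitespace so the snippet stays on one line.
    snippet = " ".join(message.split())[:SNIPPET_CHARS]
    entry = f'[{query_type}] "{snippet}"'
    # Keep only the most recent MAX_ENTRIES entries.
    return (history + [entry])[-MAX_ENTRIES:]
```

Returning a new list rather than mutating in place keeps the function easy to use with state objects that are copied between turns.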

4. Classifier prompt redesign

The prompt now includes:

  • <history_rule> — explicit instruction: use history only to resolve ambiguous references, not to predict the category of the new message
  • <platform_priority_rule> — hard override: if the message contains usage percentages, account metrics, quota data, or billing information, classify as PLATFORM regardless of history
  • <step1_purpose> replaced by an inline role instruction stating that each message must be evaluated independently
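To make the structure of the redesigned prompt concrete, here is a rough sketch of how it might be assembled. The rule wording paraphrases the bullets above and is not the production prompt text. The function name build_classifier_prompt and the <intent_history> and <message> tags are hypothetical.

```python
from typing import List

# Paraphrased from this ADR; the real prompt wording may differ.
HISTORY_RULE = (
    "Use the intent history only to resolve ambiguous references. "
    "Do not use it to predict the category of the new message."
)
PLATFORM_PRIORITY_RULE = (
    "If the message contains usage percentages, account metrics, quota data, "
    "or billing information, classify it as PLATFORM regardless of history."
)


def build_classifier_prompt(classify_history: List[str], message: str) -> str:
    """Assemble a classifier prompt from the intent history and the new message."""
    history_block = "\n".join(classify_history) or "(no prior turns)"
    return (
        "Classify the new message as RETRIEVAL, CODE_GENERATION, "
        "CONVERSATIONAL, or PLATFORM. Evaluate each message independently.\n"
        f"<history_rule>{HISTORY_RULE}</history_rule>\n"
        f"<platform_priority_rule>{PLATFORM_PRIORITY_RULE}</platform_priority_rule>\n"
        f"<intent_history>\n{history_block}\n</intent_history>\n"
        f"<message>{message}</message>"
    )
```

Only the compact intent entries reach the classifier here, so the full text of earlier turns never enters the prompt.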

5. Fast-path for known platform prefixes

Queries containing "you are a direct and concise assistant" (a system-injected prefix used by the platform) are classified as PLATFORM deterministically without invoking the LLM classifier. This is justified because this prefix is controlled by the platform itself, not by user input, so deterministic detection is both correct and cheaper.
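The fast path can be sketched as a substring check that runs before any model call. The names classify and llm_classify are hypothetical, and the case-insensitive comparison is an assumption; this ADR only states that the query contains the prefix.

```python
from typing import Callable

PLATFORM_PREFIX = "you are a direct and concise assistant"


def classify(message: str, llm_classify: Callable[[str], str]) -> str:
    """Classify a message, skipping the LLM when the platform prefix is present."""
    # Assumption: match case-insensitively so capitalization does not matter.
    if PLATFORM_PREFIX in message.lower():
        return "PLATFORM"
    # Otherwise defer to the LLM-based classifier (passed in as a callable).
    return llm_classify(message)
```

Passing the LLM classifier as a callable keeps the deterministic branch testable without a running model.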


Consequences

Positive

  • Platform and conversational queries are served by a smaller, faster model
  • Classifier bias from conversation history is eliminated while preserving the ability to resolve ambiguous references
  • PLATFORM queries never hit Elasticsearch, reducing unnecessary retrieval load
  • The system is more predictable: platform-injected prompts are classified by a cheap deterministic string check, without an LLM call

Negative / Trade-offs

  • classify_history adds a small amount of state per session (bounded to last 6 entries)
  • Two model slots mean two warm-up calls at startup if models differ
  • The qwen3:1.7b classifier can still misclassify edge cases where no platform signals are present in the text — this is inherent to using a 1.7B model for semantic classification

Open questions

  • Whether the classifier should be upgraded to a more capable model in the future (at the cost of latency/resources)
  • Whether PLATFORM should eventually split into sub-types (e.g. PLATFORM_METRICS vs PLATFORM_BILLING) as the platform data schema grows