ADR-0008: Adaptive Query Routing — Taxonomy, Contract, and Classifier Strategy
Date: 2026-04-09 Status: Accepted Deciders: Rafael Ruiz (CTO) Related ADRs: ADR-0002 (Two-Phase Streaming), ADR-0003 (Hybrid Retrieval RRF)
Context
The assistance engine previously used a single Ollama model (qwen3:1.7b) for all query types with no differentiation in routing, retrieval, or model selection. Two problems emerged in production:
Problem 1 — No query taxonomy
All queries were treated identically. Platform queries (account status, usage metrics, billing) were sent through the same RAG pipeline as AVAP language questions, wasting retrieval resources and producing irrelevant context.
Problem 2 — Classifier anchoring bias
The LLM-based classifier received raw conversation messages as history. A 1.7B model exhibited anchoring bias: it computed P(type | history) instead of P(type | message_content), misclassifying new queries as the same type as recent turns regardless of actual content.
Decision
This ADR makes three decisions with different time horizons:
- Permanent — query taxonomy and routing contract
- Permanent — model assignment per type
- Tactical / bootstrap — LLM classifier as interim implementation
Decision 1 — Query taxonomy (permanent)
Four query types with fixed routing semantics:
| Type | Purpose | RAG | Model slot |
|---|---|---|---|
| RETRIEVAL | AVAP language documentation and concepts | Yes | main |
| CODE_GENERATION | Produce working AVAP code | Yes | main |
| CONVERSATIONAL | Rephrase or continue prior answer | No | conversational |
| PLATFORM | Account, metrics, usage, quota, billing | No | conversational |
These types and their RAG/model assignments are stable. Any future classifier implementation must preserve this taxonomy.
Decision 2 — Model specialization (permanent)
Two model slots configured via environment variables:
OLLAMA_MODEL_NAME=qwen3:1.7b # main slot: RETRIEVAL + CODE_GENERATION
OLLAMA_MODEL_NAME_CONVERSATIONAL=qwen3:0.6b # conversational slot: CONVERSATIONAL + PLATFORM
If OLLAMA_MODEL_NAME_CONVERSATIONAL is unset, both slots fall back to OLLAMA_MODEL_NAME.
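The slot-resolution rule can be sketched as follows. This is an illustrative helper, not code from the actual service; the function name and defaults are assumptions.

```python
import os

# Hypothetical helper mirroring the slot-fallback rule above:
# main slot serves RETRIEVAL + CODE_GENERATION, conversational slot
# serves CONVERSATIONAL + PLATFORM, falling back to the main model.
def resolve_model(query_type: str) -> str:
    main = os.environ.get("OLLAMA_MODEL_NAME", "qwen3:1.7b")
    conversational = os.environ.get("OLLAMA_MODEL_NAME_CONVERSATIONAL") or main
    if query_type in ("RETRIEVAL", "CODE_GENERATION"):
        return main
    return conversational  # CONVERSATIONAL and PLATFORM
```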
Decision 3 — LLM classifier as bootstrap [TACTICAL DEBT]
This is an acknowledged interim implementation, not the target architecture. See Future Path for the correct steady-state design.
A generative LLM is used for classification because no labeled training data exists yet. The design includes two mitigations for its known weaknesses:
a) Compact intent history instead of raw messages
classify_history replaces raw message history in the classifier context. Each entry stores only type + 60-char topic snippet:
[RETRIEVAL] "What is addVar in AVAP?"
[CODE_GENERATION] "Write an API endpoint that retur"
[PLATFORM] "You have a project usage percentag"
This preserves reference resolution ("this", "esto", "lo anterior") without the topical noise that causes anchoring. classify_history is persisted in classify_history_store per session.
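A minimal sketch of the compact-entry format, assuming the real classify_history stores entries as formatted strings (the function name and structure are illustrative):

```python
# Build one compact intent-history entry: type tag plus a
# 60-character topic snippet, as in the examples above.
def to_history_entry(query_type: str, message: str, max_len: int = 60) -> str:
    snippet = message[:max_len]
    return f'[{query_type}] "{snippet}"'
```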
b) Prompt constraints to counteract generative bias
- `<history_rule>` — explicit instruction that the intent distribution of prior turns must not influence the prior probability of the current classification
- `<platform_priority_rule>` — hard semantic override: usage percentages, account metrics, quota or billing data → always PLATFORM
These prompt rules are compensations for the architectural mismatch between a generative model and a discriminative task. They become unnecessary once the LLM classifier is replaced.
Routing Contract
This section is normative and implementation-independent. Any reimplementation — including the discriminative classifier described in Future Path — must satisfy all rules below. Rules are ordered by priority.
RC-01 — Fast-path override (priority: highest)
If the query contains a known platform-injected prefix, classify as PLATFORM without invoking any classifier.
∀ q : query
contains(q, known_platform_prefix) → route(q) = PLATFORM
Current registered prefixes (_PLATFORM_PATTERNS in graph.py):
"you are a direct and concise assistant"
Adding a prefix requires updating _PLATFORM_PATTERNS and this list.
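The fast path can be sketched as a substring check over the registered prefixes. The pattern tuple mirrors the list above; the function name is illustrative, not the actual graph.py implementation.

```python
from typing import Optional

# Registered platform-injected prefixes (mirrors _PLATFORM_PATTERNS).
_PLATFORM_PATTERNS = (
    "you are a direct and concise assistant",
)

def fast_path_type(query: str) -> Optional[str]:
    """RC-01: return PLATFORM without invoking any classifier."""
    normalized = query.lower()
    if any(prefix in normalized for prefix in _PLATFORM_PATTERNS):
        return "PLATFORM"
    return None  # no fast path; continue to the classifier
```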
RC-02 — Platform data signal (priority: high)
If the query contains usage percentages, account metrics, consumption figures, quota data, or billing information, the output MUST be PLATFORM regardless of history or classifier confidence.
In the current bootstrap implementation this is enforced via <platform_priority_rule>. In the future discriminative classifier it should be a hard pre-filter in Layer 1.
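A hard pre-filter for RC-02 could look like the sketch below. The signal patterns are illustrative stand-ins, not the production rule set.

```python
import re

# Hypothetical Layer-1 pre-filter: match platform data signals
# (usage/quota/billing keywords, percentage figures).
_PLATFORM_SIGNALS = re.compile(
    r"\b(usage|quota|billing|account metrics?|consumption)\b"
    r"|\d+(\.\d+)?\s*%",
    re.IGNORECASE,
)

def has_platform_signal(query: str) -> bool:
    """RC-02: presence of a signal forces PLATFORM regardless of history."""
    return _PLATFORM_SIGNALS.search(query) is not None
```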
RC-03 — Intent history scoping (priority: medium)
The classifier MUST use classify_history only to resolve ambiguous deictic references. It MUST NOT use history to predict or bias the type of the current message.
classify(q, history) ≠ f(dominant_type(history))
classify(q, history) = f(intent(q), resolve_references(q, history))
Rationale: Small LLMs implicitly compute P(type | history) instead of P(type | message_content). The distribution of previous intents must not influence the prior probability of the current classification. Each message is an independent classification event — a session with 10 RETRIEVAL turns does not make the next message more likely to be RETRIEVAL.
RC-04 — RAG bypass (priority: medium)
| Type | RAG | Justification |
|---|---|---|
| RETRIEVAL | Yes | Requires documentation context |
| CODE_GENERATION | Yes | Requires syntax examples |
| CONVERSATIONAL | No | Prior answer already in context |
| PLATFORM | No | Data injected via extra_context |
A PLATFORM or CONVERSATIONAL query that triggers Elasticsearch retrieval is a contract violation.
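The RC-04 mapping is small enough to state as an explicit table in code, which makes violations easy to assert against in tests (names here are illustrative):

```python
# RC-04: only RETRIEVAL and CODE_GENERATION may trigger retrieval.
_NEEDS_RAG = {
    "RETRIEVAL": True,
    "CODE_GENERATION": True,
    "CONVERSATIONAL": False,
    "PLATFORM": False,
}

def needs_rag(query_type: str) -> bool:
    return _NEEDS_RAG[query_type]
```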
RC-05 — Model assignment (priority: medium)
route(q) ∈ {RETRIEVAL, CODE_GENERATION} → model = OLLAMA_MODEL_NAME
route(q) ∈ {CONVERSATIONAL, PLATFORM}   → model = OLLAMA_MODEL_NAME_CONVERSATIONAL
                                                  ?? OLLAMA_MODEL_NAME  # fallback when the conversational slot is unset
RC-06 — History growth bound (priority: low)
classify_history input to the classifier MUST be capped at 6 entries per session.
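One way to satisfy RC-06 is a fixed-size deque per session, which evicts the oldest entries automatically; the real classify_history_store may enforce the bound differently.

```python
from collections import deque

# RC-06 sketch: a per-session deque keeps only the 6 most recent
# intent entries, so classifier input cannot grow unbounded.
history = deque(maxlen=6)
for i in range(10):
    history.append(f'[RETRIEVAL] "question {i}"')
# the oldest 4 entries were evicted automatically
```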
Contract violations to monitor
| Symptom | Violated rule |
|---|---|
| Platform query hits Elasticsearch | RC-04 |
| qwen3:1.7b used for a PLATFORM response | RC-05 |
| Platform prefix triggers LLM classifier | RC-01 |
| Classifier output mirrors dominant history type | RC-03 |
Consequences
Positive
- Query taxonomy is formalized and stable — downstream graph, model assignment, and RAG decisions are decoupled from classifier implementation
- `classify_history_store` acts as a data flywheel for future classifier training
- Platform-injected prompts are classified in O(1) via RC-01
- PLATFORM queries never hit Elasticsearch
Negative / Trade-offs
- The LLM classifier is a generative model doing discriminative work — this is the accepted tactical debt
- Prompt engineering (`<history_rule>`, `<platform_priority_rule>`) is a symptom of this mismatch, not a solution
- qwen3:1.7b can still misclassify edge cases without platform signals — inherent to the bootstrap design
Future Path: Discriminative Classifier Pipeline
The fundamental problem with the bootstrap design
The LLM classifier is a generative model doing discriminative work. Generating tokens to produce a 4-class label wastes orders of magnitude more compute than the task requires, introduces non-determinism, and forces prompt engineering to compensate for what should be model properties. RC-01 through RC-06 exist precisely because of this mismatch.
The bootstrap design is justified while no labeled data exists. It should not be the steady-state architecture.
Target architecture
A layered pipeline where each layer is only invoked if the previous layer cannot produce a confident answer:
Query
│
▼
[Layer 1] Hard rules (RC-01, RC-02) ← O(1), deterministic
│ no match
▼
[Layer 2] Embedding similarity classifier ← ~1ms, CPU, no LLM
│ confidence < threshold
▼
[Layer 3] LLM classifier (current design) ← fallback for ambiguous queries only
│
▼
Classification result
In steady state, Layer 3 handles fewer than 5% of requests.
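The fall-through logic of the diagram can be sketched as a simple dispatcher: each layer either returns a type or declines, and only an undecided query reaches the LLM. Layer implementations are passed in as callables here; this is a shape sketch, not production code.

```python
from typing import Callable, Optional

Layer = Callable[[str], Optional[str]]

# Layered dispatch: Layer 1 (hard rules) and Layer 2 (embedding
# classifier) may decline by returning None; Layer 3 (LLM) always answers.
def classify(query: str, layer1: Layer, layer2: Layer,
             layer3: Callable[[str], str]) -> str:
    for layer in (layer1, layer2):
        result = layer(query)
        if result is not None:
            return result
    return layer3(query)  # LLM fallback, ambiguous queries only
```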
Layer 2: embedding classifier on bge-m3
bge-m3 is already running in the stack. The implementation:
- Embed each query via bge-m3 → fixed-size vector
- Train logistic regression (or SVM with RBF kernel) on labeled (query, type) pairs
- At inference: embed → class centroids → argmax with confidence score
- If max(softmax(logits)) < 0.85 → fall through to Layer 3
This is microseconds of CPU inference. No GPU, no Ollama call, no prompt templating. RC-02 becomes a hard pre-filter in Layer 1, making it implementation-independent rather than prompt-dependent.
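The inference step can be sketched with the centroid variant: cosine similarity against per-class centroids, softmax for a confidence score, and a threshold-based fall-through. The embeddings below are stand-in vectors, not real bge-m3 output, and the function is illustrative.

```python
from typing import Dict, Optional
import numpy as np

def classify_by_centroid(embedding: np.ndarray,
                         centroids: Dict[str, np.ndarray],
                         threshold: float = 0.85) -> Optional[str]:
    """Layer-2 sketch: cosine similarity to class centroids, softmax
    confidence, None when below threshold (fall through to Layer 3)."""
    labels = list(centroids)
    sims = np.array([
        embedding @ centroids[label]
        / (np.linalg.norm(embedding) * np.linalg.norm(centroids[label]))
        for label in labels
    ])
    probs = np.exp(sims) / np.exp(sims).sum()  # softmax over similarities
    best = int(np.argmax(probs))
    if probs[best] < threshold:
        return None
    return labels[best]
```

Note that with raw cosine similarities the softmax is relatively flat, so the confidence threshold would be calibrated (or a temperature applied) against validation data rather than fixed a priori.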
The data flywheel
classify_history_store already generates labeled training data. Every session produces (topic_snippet, type) pairs implicitly validated by user continuation.
classify_history_store → periodic export → labeled dataset → retrain Layer 2
The LLM classifier is the teacher. The embedding classifier is the student. This is knowledge distillation over production traffic without manual labeling.
Trigger: retrain when classify_history_store accumulates 500 sessions.
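The export step might look like the sketch below, assuming the store maps session IDs to (type, snippet) entries; the store shape and function name are assumptions, not the actual persistence layer.

```python
from typing import Dict, List, Tuple

RETRAIN_SESSION_THRESHOLD = 500  # retrain trigger from the ADR

def export_training_pairs(
    store: Dict[str, List[Tuple[str, str]]],
) -> Tuple[List[Tuple[str, str]], bool]:
    """Flatten classify_history_store into (snippet, type) training
    pairs and report whether the retrain threshold is reached."""
    pairs = [(snippet, qtype)
             for entries in store.values()
             for qtype, snippet in entries]
    retrain = len(store) >= RETRAIN_SESSION_THRESHOLD
    return pairs, retrain
```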
Caller-declared type
The platform generates PLATFORM prompts and knows the type at generation time. Adding query_type to AgentRequest (proto field 7) lets the caller declare the type explicitly, bypassing all three layers. This makes RC-01 and RC-02 redundant for platform-generated traffic.
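The bypass semantics can be sketched as follows. The real AgentRequest is a protobuf message and query_type the proposed field 7; the dataclass below is only an illustrative stand-in.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class AgentRequest:
    message: str
    query_type: Optional[str] = None  # proposed proto field 7

def effective_type(req: AgentRequest,
                   classify: Callable[[str], str]) -> str:
    """Caller-declared type bypasses all classifier layers."""
    if req.query_type is not None:
        return req.query_type
    return classify(req.message)
```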
Convergence path
| Phase | What changes | Layer 3 traffic |
|---|---|---|
| Now — bootstrap | LLM classifier for all unmatched queries | ~95% |
| Phase 1 | Collect labels via classify_history_store | ~95% |
| Phase 2 | Deploy embedding classifier (Layer 2) | ~10–20% |
| Phase 3 | Caller-declared type for platform prompts | <5% |
| Phase 4 | LLM classifier as anomaly handler only | <2% |
Phase 2 is the highest-leverage step: it replaces the dominant code path (LLM inference per request) with CPU-only inference, with no change to the routing contract or the downstream graph.