ADR-0008: Adaptive Query Routing — Taxonomy, Contract, and Classifier Strategy
Date: 2026-04-09 Status: Accepted Deciders: Rafael Ruiz (CTO) Related ADRs: ADR-0002 (Two-Phase Streaming), ADR-0003 (Hybrid Retrieval RRF)
Context
The assistance engine previously used a single Ollama model (qwen3:1.7b) for all query types with no differentiation in routing, retrieval, or model selection. Two problems emerged in production:
Problem 1 — No query taxonomy
All queries were treated identically. Platform queries (account status, usage metrics, billing) were sent through the same RAG pipeline as AVAP language questions, wasting retrieval resources and producing irrelevant context.
Problem 2 — Classifier anchoring bias
The LLM-based classifier received raw conversation messages as history. A 1.7B model exhibited anchoring bias: it computed P(type | history) instead of P(type | message_content), misclassifying new queries as the same type as recent turns regardless of actual content.
Decision
This ADR makes three decisions with different time horizons:
- Permanent — query taxonomy and routing contract
- Permanent — model assignment per type
- Tactical / bootstrap — LLM classifier as interim implementation
Decision 1 — Query taxonomy (permanent)
Four query types with fixed routing semantics:
| Type | Purpose | RAG | Model slot |
|---|---|---|---|
| RETRIEVAL | AVAP language documentation and concepts | Yes | main |
| CODE_GENERATION | Produce working AVAP code | Yes | main |
| CONVERSATIONAL | Rephrase or continue prior answer | No | conversational |
| PLATFORM | Account, metrics, usage, quota, billing | No | conversational |
These types and their RAG/model assignments are stable. Any future classifier implementation must preserve this taxonomy.
Decision 2 — Model specialization (permanent)
Two model slots configured via environment variables:
OLLAMA_MODEL_NAME=qwen3:1.7b # main slot: RETRIEVAL + CODE_GENERATION
OLLAMA_MODEL_NAME_CONVERSATIONAL=qwen3:0.6b # conversational slot: CONVERSATIONAL + PLATFORM
If OLLAMA_MODEL_NAME_CONVERSATIONAL is unset, both slots fall back to OLLAMA_MODEL_NAME.
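The slot-resolution rule can be sketched as follows. This is an illustrative helper, not code from the actual service; the function name and defaults are assumptions.

```python
import os

# Hypothetical helper mirroring the slot-fallback rule above:
# main slot serves RETRIEVAL + CODE_GENERATION, conversational slot
# serves CONVERSATIONAL + PLATFORM, falling back to the main model.
def resolve_model(query_type: str) -> str:
    main = os.environ.get("OLLAMA_MODEL_NAME", "qwen3:1.7b")
    conversational = os.environ.get("OLLAMA_MODEL_NAME_CONVERSATIONAL") or main
    if query_type in ("RETRIEVAL", "CODE_GENERATION"):
        return main
    return conversational  # CONVERSATIONAL and PLATFORM
```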
Decision 3 — LLM classifier as bootstrap [TACTICAL DEBT]
This is an acknowledged interim implementation, not the target architecture. See Future Path for the correct steady-state design.
A generative LLM is used for classification because no labeled training data exists yet. The design includes two mitigations for its known weaknesses:
a) Compact intent history instead of raw messages
classify_history replaces raw message history in the classifier context. Each entry stores only type + 60-char topic snippet:
[RETRIEVAL] "What is addVar in AVAP?"
[CODE_GENERATION] "Write an API endpoint that retur"
[PLATFORM] "You have a project usage percentag"
This preserves reference resolution ("this", "esto", "lo anterior") without the topical noise that causes anchoring. classify_history is persisted in classify_history_store per session.
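A minimal sketch of the compact-entry format, assuming the real classify_history stores entries as formatted strings (the function name and structure are illustrative):

```python
# Build one compact intent-history entry: type tag plus a
# 60-character topic snippet, as in the examples above.
def to_history_entry(query_type: str, message: str, max_len: int = 60) -> str:
    snippet = message[:max_len]
    return f'[{query_type}] "{snippet}"'
```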
b) Prompt constraints to counteract generative bias
- `<history_rule>` — explicit instruction that the intent distribution of prior turns must not influence the prior probability of the current classification
- `<platform_priority_rule>` — hard semantic override: usage percentages, account metrics, quota or billing data → always PLATFORM
These prompt rules are compensations for the architectural mismatch between a generative model and a discriminative task. They become unnecessary once the LLM classifier is replaced.
Routing Contract
This section is normative and implementation-independent. Any reimplementation — including the discriminative classifier described in Future Path — must satisfy all rules below. Rules are ordered by priority.
RC-01 — Fast-path override (priority: highest)
If the query contains a known platform-injected prefix, classify as PLATFORM without invoking any classifier.
∀ q : query
contains(q, known_platform_prefix) → route(q) = PLATFORM
Current registered prefixes (_PLATFORM_PATTERNS in graph.py):
"you are a direct and concise assistant"
Adding a prefix requires updating _PLATFORM_PATTERNS and this list.
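The fast path can be sketched as a substring check over the registered prefixes. The pattern tuple mirrors the list above; the function name is illustrative, not the actual graph.py implementation.

```python
from typing import Optional

# Registered platform-injected prefixes (mirrors _PLATFORM_PATTERNS).
_PLATFORM_PATTERNS = (
    "you are a direct and concise assistant",
)

def fast_path_type(query: str) -> Optional[str]:
    """RC-01: return PLATFORM without invoking any classifier."""
    normalized = query.lower()
    if any(prefix in normalized for prefix in _PLATFORM_PATTERNS):
        return "PLATFORM"
    return None  # no fast path; continue to the classifier
```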
RC-02 — Platform data signal (priority: high)
If the query contains usage percentages, account metrics, consumption figures, quota data, or billing information, the output MUST be PLATFORM regardless of history or classifier confidence.
In the current bootstrap implementation this is enforced via <platform_priority_rule>. In the future discriminative classifier it should be a hard pre-filter in Layer 1.
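A hard pre-filter for RC-02 could look like the sketch below. The signal patterns are illustrative stand-ins, not the production rule set.

```python
import re

# Hypothetical Layer-1 pre-filter: match platform data signals
# (usage/quota/billing keywords, percentage figures).
_PLATFORM_SIGNALS = re.compile(
    r"\b(usage|quota|billing|account metrics?|consumption)\b"
    r"|\d+(\.\d+)?\s*%",
    re.IGNORECASE,
)

def has_platform_signal(query: str) -> bool:
    """RC-02: presence of a signal forces PLATFORM regardless of history."""
    return _PLATFORM_SIGNALS.search(query) is not None
```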
RC-03 — Intent history scoping (priority: medium)
The classifier MUST use classify_history only to resolve ambiguous deictic references. It MUST NOT use history to predict or bias the type of the current message.
classify(q, history) ≠ f(dominant_type(history))
classify(q, history) = f(intent(q), resolve_references(q, history))
Rationale: Small LLMs implicitly compute P(type | history) instead of P(type | message_content). The distribution of previous intents must not influence the prior probability of the current classification. Each message is an independent classification event — a session with 10 RETRIEVAL turns does not make the next message more likely to be RETRIEVAL.
RC-04 — RAG bypass (priority: medium)
| Type | RAG | Justification |
|---|---|---|
| RETRIEVAL | Yes | Requires documentation context |
| CODE_GENERATION | Yes | Requires syntax examples |
| CONVERSATIONAL | No | Prior answer already in context |
| PLATFORM | No | Data injected via extra_context |
A PLATFORM or CONVERSATIONAL query that triggers Elasticsearch retrieval is a contract violation.
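The RC-04 mapping is small enough to state as an explicit table in code, which makes violations easy to assert against in tests (names here are illustrative):

```python
# RC-04: only RETRIEVAL and CODE_GENERATION may trigger retrieval.
_NEEDS_RAG = {
    "RETRIEVAL": True,
    "CODE_GENERATION": True,
    "CONVERSATIONAL": False,
    "PLATFORM": False,
}

def needs_rag(query_type: str) -> bool:
    return _NEEDS_RAG[query_type]
```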
RC-05 — Model assignment (priority: medium)
route(q) ∈ {RETRIEVAL, CODE_GENERATION} → model = OLLAMA_MODEL_NAME
route(q) ∈ {CONVERSATIONAL, PLATFORM}   → model = OLLAMA_MODEL_NAME_CONVERSATIONAL
                                                  ?? OLLAMA_MODEL_NAME  # fallback when the conversational slot is unset
RC-06 — History growth bound (priority: low)
classify_history input to the classifier MUST be capped at 6 entries per session.
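One way to satisfy RC-06 is a fixed-size deque per session, which evicts the oldest entries automatically; the real classify_history_store may enforce the bound differently.

```python
from collections import deque

# RC-06 sketch: a per-session deque keeps only the 6 most recent
# intent entries, so classifier input cannot grow unbounded.
history = deque(maxlen=6)
for i in range(10):
    history.append(f'[RETRIEVAL] "question {i}"')
# the oldest 4 entries were evicted automatically
```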
Contract violations to monitor
| Symptom | Violated rule |
|---|---|
| Platform query hits Elasticsearch | RC-04 |
| qwen3:1.7b used for a PLATFORM response | RC-05 |
| Platform prefix triggers LLM classifier | RC-01 |
| Classifier output mirrors dominant history type | RC-03 |
Consequences
Positive
- Query taxonomy is formalized and stable — downstream graph, model assignment, and RAG decisions are decoupled from classifier implementation
- `classify_history_store` acts as a data flywheel for future classifier training
- Platform-injected prompts are classified in O(1) via RC-01
- PLATFORM queries never hit Elasticsearch
Negative / Trade-offs
- The LLM classifier is a generative model doing discriminative work — this is the accepted tactical debt
- Prompt engineering (`<history_rule>`, `<platform_priority_rule>`) is a symptom of this mismatch, not a solution
- qwen3:1.7b can still misclassify edge cases without platform signals — inherent to the bootstrap design
Future Path: Discriminative Classifier Pipeline
The fundamental problem with the bootstrap design
The LLM classifier is a generative model doing discriminative work. Generating tokens to produce a 4-class label wastes orders of magnitude more compute than the task requires, introduces non-determinism, and forces prompt engineering to compensate for what should be model properties. RC-01 through RC-06 exist precisely because of this mismatch.
The bootstrap design is justified while no labeled data exists. It should not be the steady-state architecture.
Target architecture
A layered pipeline where each layer is only invoked if the previous layer cannot produce a confident answer:
Query
│
▼
[Layer 1] Hard rules (RC-01, RC-02) ← O(1), deterministic
│ no match
▼
[Layer 2] Embedding similarity classifier ← ~1ms, CPU, no LLM
│ confidence < threshold
▼
[Layer 3] LLM classifier (current design) ← fallback for ambiguous queries only
│
▼
Classification result
In steady state, Layer 3 handles fewer than 5% of requests.
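The fall-through logic of the diagram can be sketched as a simple dispatcher: each layer either returns a type or declines, and only an undecided query reaches the LLM. Layer implementations are passed in as callables here; this is a shape sketch, not production code.

```python
from typing import Callable, Optional

Layer = Callable[[str], Optional[str]]

# Layered dispatch: Layer 1 (hard rules) and Layer 2 (embedding
# classifier) may decline by returning None; Layer 3 (LLM) always answers.
def classify(query: str, layer1: Layer, layer2: Layer,
             layer3: Callable[[str], str]) -> str:
    for layer in (layer1, layer2):
        result = layer(query)
        if result is not None:
            return result
    return layer3(query)  # LLM fallback, ambiguous queries only
```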
Layer 2: embedding classifier on bge-m3
bge-m3 is already running in the stack. The implementation:
- Embed each query via bge-m3 → fixed-size vector
- Train logistic regression (or SVM with RBF kernel) on labeled (query, type) pairs
- At inference: embed → class centroids → argmax with confidence score
- If max(softmax(logits)) < 0.85 → fall through to Layer 3
This is microseconds of CPU inference. No GPU, no Ollama call, no prompt templating. RC-02 becomes a hard pre-filter in Layer 1, making it implementation-independent rather than prompt-dependent.
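The inference step can be sketched with the centroid variant: cosine similarity against per-class centroids, softmax for a confidence score, and a threshold-based fall-through. The embeddings below are stand-in vectors, not real bge-m3 output, and the function is illustrative.

```python
from typing import Dict, Optional
import numpy as np

def classify_by_centroid(embedding: np.ndarray,
                         centroids: Dict[str, np.ndarray],
                         threshold: float = 0.85) -> Optional[str]:
    """Layer-2 sketch: cosine similarity to class centroids, softmax
    confidence, None when below threshold (fall through to Layer 3)."""
    labels = list(centroids)
    sims = np.array([
        embedding @ centroids[label]
        / (np.linalg.norm(embedding) * np.linalg.norm(centroids[label]))
        for label in labels
    ])
    probs = np.exp(sims) / np.exp(sims).sum()  # softmax over similarities
    best = int(np.argmax(probs))
    if probs[best] < threshold:
        return None
    return labels[best]
```

Note that with raw cosine similarities the softmax is relatively flat, so the confidence threshold would be calibrated (or a temperature applied) against validation data rather than fixed a priori.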
The data flywheel
classify_history_store already generates labeled training data. Every session produces (topic_snippet, type) pairs implicitly validated by user continuation.
classify_history_store → periodic export → labeled dataset → retrain Layer 2
The LLM classifier is the teacher. The embedding classifier is the student. This is knowledge distillation over production traffic without manual labeling.
Trigger: retrain when classify_history_store accumulates 500 sessions.
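The export step might look like the sketch below, assuming the store maps session IDs to (type, snippet) entries; the store shape and function name are assumptions, not the actual persistence layer.

```python
from typing import Dict, List, Tuple

RETRAIN_SESSION_THRESHOLD = 500  # retrain trigger from the ADR

def export_training_pairs(
    store: Dict[str, List[Tuple[str, str]]],
) -> Tuple[List[Tuple[str, str]], bool]:
    """Flatten classify_history_store into (snippet, type) training
    pairs and report whether the retrain threshold is reached."""
    pairs = [(snippet, qtype)
             for entries in store.values()
             for qtype, snippet in entries]
    retrain = len(store) >= RETRAIN_SESSION_THRESHOLD
    return pairs, retrain
```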
Caller-declared type
The platform generates PLATFORM prompts and knows the type at generation time. Adding query_type to AgentRequest (proto field 7) lets the caller declare the type explicitly, bypassing all three layers. This makes RC-01 and RC-02 redundant for platform-generated traffic.
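The bypass semantics can be sketched as follows. The real AgentRequest is a protobuf message and query_type the proposed field 7; the dataclass below is only an illustrative stand-in.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class AgentRequest:
    message: str
    query_type: Optional[str] = None  # proposed proto field 7

def effective_type(req: AgentRequest,
                   classify: Callable[[str], str]) -> str:
    """Caller-declared type bypasses all classifier layers."""
    if req.query_type is not None:
        return req.query_type
    return classify(req.message)
```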
Convergence path
| Phase | What changes | Layer 3 traffic |
|---|---|---|
| Now — bootstrap | LLM classifier for all unmatched queries | ~95% |
| Phase 1 | Collect labels via classify_history_store | ~95% |
| Phase 2 | Deploy embedding classifier (Layer 2) | ~10–20% |
| Phase 3 | Caller-declared type for platform prompts | <5% |
| Phase 4 | LLM classifier as anomaly handler only | <2% |
Phase 2 is the highest-leverage step: it replaces the dominant code path (LLM inference per request) with CPU-only inference, with no change to the routing contract or the downstream graph.