PRD-0003: Adaptive Query Routing with Platform Intent and Model Specialization
Date: 2026-04-09
Status: Implemented
Requested by: Rafael Ruiz (CTO)
Purpose: Route platform queries correctly and reduce inference cost for non-RAG requests
Related ADR: ADR-0008 (Adaptive Query Routing — Taxonomy, Contract, and Classifier Strategy)
Problem
The assistance engine treated all incoming queries identically: every request went through the same RAG pipeline (Elasticsearch retrieval) and was answered by the same model (qwen3:1.7b), regardless of what the user was actually asking.
This caused two observable problems in production:
1. Platform queries answered with AVAP documentation context
When a user or the platform sent a query about their account, usage metrics, or subscription — for example, "You have a project usage percentage of 20%, provide a recommendation" — the engine would retrieve AVAP language documentation chunks from Elasticsearch and attempt to answer using that context. The result was irrelevant or hallucinated responses, because the relevant data (account metrics) was already in the request, not in the knowledge base.
2. Classifier anchoring bias corrupted routing in mixed sessions
When a user had a conversation mixing AVAP language questions and platform queries, the classifier — a 1.7B generative model receiving raw message history — misclassified platform queries as RETRIEVAL or CODE_GENERATION. The model computed P(type | history) instead of P(type | message_content), biasing toward the dominant type of the session. A session with 5 prior RETRIEVAL turns made the 6th query — regardless of content — likely to be classified as RETRIEVAL.
Solution
New query type: PLATFORM
A fourth classification category is added to the existing taxonomy:
| Type | What the user wants | RAG | Model |
|---|---|---|---|
| RETRIEVAL | Understand AVAP language concepts or documentation | Yes | qwen3:1.7b |
| CODE_GENERATION | Get working AVAP code | Yes | qwen3:1.7b |
| CONVERSATIONAL | Rephrase or continue what was already said | No | qwen3:0.6b |
| PLATFORM | Information about their account, usage, metrics, or billing | No | qwen3:0.6b |
PLATFORM queries skip Elasticsearch retrieval entirely. The engine answers using only the data already present in the request (`extra_context`, `user_info`) via a dedicated `PLATFORM_PROMPT`.
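A minimal sketch of this routing rule, with a hypothetical helper name (the real decision is wired into `build_prepare_graph`):

```python
# Only the two knowledge-base query types hit Elasticsearch; the set name
# and helper below are illustrative, not the actual graph code.
NO_RAG_TYPES = {"CONVERSATIONAL", "PLATFORM"}

def needs_retrieval(query_type: str) -> bool:
    """RETRIEVAL and CODE_GENERATION go through RAG; the rest skip it."""
    return query_type not in NO_RAG_TYPES
```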
Model specialization
Two model slots, configured independently:
```
OLLAMA_MODEL_NAME=qwen3:1.7b                 # RETRIEVAL + CODE_GENERATION
OLLAMA_MODEL_NAME_CONVERSATIONAL=qwen3:0.6b  # CONVERSATIONAL + PLATFORM
```
If OLLAMA_MODEL_NAME_CONVERSATIONAL is not set, both fall back to the main model.
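A sketch of the fallback logic; `resolve_model_slots` is a hypothetical helper reading a plain environment mapping, not the actual server code:

```python
def resolve_model_slots(env: dict[str, str]) -> tuple[str, str]:
    """Return (main_model, light_model); the light slot falls back to main."""
    main = env["OLLAMA_MODEL_NAME"]  # required, raises KeyError if missing
    light = env.get("OLLAMA_MODEL_NAME_CONVERSATIONAL") or main
    return main, light
```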
Classifier bias fix: intent history
The classifier no longer receives raw conversation messages. Instead, a compact intent history is passed — one entry per prior turn:
```
[RETRIEVAL] "What is addVar in AVAP?"
[CODE_GENERATION] "Write an API endpoint that retur"
[PLATFORM] "You have a project usage percentag"
```
This gives the classifier enough context to resolve ambiguous references ("this", or Spanish "esto", "lo anterior") without the topical content that causes anchoring: the distribution of prior intent types no longer skews the prior for the current classification.
Intent history is persisted per session in `classify_history_store`, parallel to `session_store`.
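A sketch of how the compact history could be rendered for the classifier. The helper name follows `_format_intent_history` from the design section, but the exact output format is an assumption based on the example above:

```python
def format_intent_history(history: list[dict]) -> str:
    """Render ClassifyEntry-shaped dicts as one '[TYPE] "snippet"' line each."""
    lines = []
    for entry in history:
        snippet = entry["topic"][:60]  # 60-char cap per the spec
        lines.append(f'[{entry["type"]}] "{snippet}"')
    return "\n".join(lines)
```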
User experience
Scenario 1 — Platform dashboard widget
The platform injects a prompt: "You have a project usage percentage of 20%. Provide an insight. Rules: Exactly 3 sentences." The engine detects this is a platform query, skips retrieval, and answers using the injected data with the qwen3:0.6b model. Response is fast and relevant.
Scenario 2 — Mixed session: AVAP coding then account check
A developer asks 4 questions about AVAP syntax, then asks "cuántas llamadas llevo este mes?" ("how many calls have I made this month?"). The classifier receives the intent history [RETRIEVAL, RETRIEVAL, CODE_GENERATION, RETRIEVAL] and the current message. Although the history is dominated by RETRIEVAL, the classifier identifies the current message as PLATFORM because its content is about account metrics, not the AVAP language.
Scenario 3 — Ambiguous reference resolved via history
During a debugging session (history: [CODE_GENERATION, CODE_GENERATION]), the user asks "explain this". The history tells the classifier the user is working with code, so it correctly returns CODE_GENERATION EDITOR — the question is about the code in the editor, not a platform topic.
Scenario 4 — Conversational follow-up (unchanged)
A user asks a general AVAP question, then "en menos palabras" ("in fewer words"). The engine classifies this as CONVERSATIONAL, skips retrieval, and reformulates the prior answer using qwen3:0.6b. No change to existing behavior.
Scope
In scope:
- Add `PLATFORM` query type with dedicated routing, no RAG, and `PLATFORM_PROMPT`
- Add `OLLAMA_MODEL_NAME_CONVERSATIONAL` environment variable for model slot assignment
- Replace raw message history in the classifier with compact `classify_history` (type + 60-char snippet per turn)
- Persist `classify_history` in `classify_history_store` per session
- Add `<history_rule>` to classifier prompt: history resolves references only, does not predict type
- Add `<platform_priority_rule>` to classifier prompt: usage percentages, account metrics, quota data → always `PLATFORM`
- Add fast-path detection for known platform-injected prompt prefixes (O(1), no LLM call)
- Route `PLATFORM` and `CONVERSATIONAL` in `build_prepare_graph` to `skip_retrieve`
- Select `active_llm` per request in `AskAgentStream` based on `query_type`
- Add `classify_history` field to `AgentState` and `ClassifyEntry` type to `state.py`
Out of scope:
- Changes to `EvaluateRAG` — golden dataset does not include platform queries
- `user_info` consumption in graph logic — available in state, not yet acted upon
- Sub-typing `PLATFORM` (e.g. `PLATFORM_METRICS` vs `PLATFORM_BILLING`) — deferred
- Proto changes to add a caller-declared `query_type` field — deferred to Phase 3 of the Future Path (ADR-0008)
Technical design
New state fields (state.py)
```python
from typing import TypedDict

class ClassifyEntry(TypedDict):
    type: str   # RETRIEVAL | CODE_GENERATION | CONVERSATIONAL | PLATFORM
    topic: str  # 60-char snippet of the query

class AgentState(TypedDict):
    ...
    classify_history: list[ClassifyEntry]  # persisted across turns
```
Graph changes (graph.py)
- `build_graph` accepts an `llm_conversational=None` parameter; `respond_conversational` and the new `respond_platform` nodes use `_llm_conv`
- Both `classify` nodes (in `build_graph` and `build_prepare_graph`) check `_is_platform_query()` before invoking the LLM — fast-path for known prefixes
- `classify_history` is read from state, passed to `_format_intent_history()`, appended with the new entry, and returned in state
- `build_prepare_graph` routes `PLATFORM` → `skip_retrieve` (same as `CONVERSATIONAL`)
- `build_final_messages` handles the `PLATFORM` type — returns `PLATFORM_PROMPT` + messages, no RAG context
Server changes (server.py)
- `__init__`: creates `self.llm_conversational` from the `OLLAMA_MODEL_NAME_CONVERSATIONAL` env var; falls back to `self.llm` if unset
- `__init__`: passes `llm_conversational=self.llm_conversational` to `build_graph`
- Both `AskAgent` and `AskAgentStream`: read `classify_history` from `classify_history_store` at request start; write back after the response
- `AskAgentStream`: selects `active_llm = self.llm_conversational if query_type in ("CONVERSATIONAL", "PLATFORM") else self.llm`
New prompt (prompts.py)
`PLATFORM_PROMPT` — instructs the model to answer using `extra_context` as the primary source, to be concise, and not to invent account data.
Classifier prompt additions:
- `<history_rule>` — use history only to resolve deictic references; the distribution of previous intents must not influence the prior probability of the current classification
- `<platform_priority_rule>` — hard semantic override for account/metrics/quota/billing content
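As an illustrative sketch only (the actual wording lives in prompts.py and is not reproduced here), a prompt satisfying the `PLATFORM_PROMPT` constraints above might read:

```python
# Hypothetical wording; the real PLATFORM_PROMPT is defined in prompts.py.
PLATFORM_PROMPT = (
    "You are answering a question about the user's account on the platform. "
    "Use the data provided in extra_context as your primary source. "
    "Be concise. Do not invent account data that is not present in the request."
)
```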
Environment variables
| Variable | Purpose | Default |
|---|---|---|
| `OLLAMA_MODEL_NAME` | Main model: RETRIEVAL + CODE_GENERATION | required |
| `OLLAMA_MODEL_NAME_CONVERSATIONAL` | Light model: CONVERSATIONAL + PLATFORM | falls back to `OLLAMA_MODEL_NAME` |
Validation
Acceptance criteria:
- `"You have a project usage percentage of 20%, provide a recommendation"` → `PLATFORM`, zero Elasticsearch calls, answered with `qwen3:0.6b`
- `"You are a direct and concise assistant..."` prefix → fast-path `PLATFORM`, no LLM classifier call in logs
- Mixed session (5 AVAP turns + 1 platform query) → platform query classified as `PLATFORM` correctly
- Ambiguous reference (`"explain this"`) during a code session → resolved as `CODE_GENERATION EDITOR` via intent history
- `OLLAMA_MODEL_NAME_CONVERSATIONAL` unset → system operates normally using the main model for all types
- No regression on existing `RETRIEVAL` and `CODE_GENERATION` flows
Signals to watch in logs:
| Log line | Expected |
|---|---|
| `[classify] platform prefix detected -> PLATFORM` | Fast-path firing for known prefixes |
| `[prepare/classify] ... -> PLATFORM` | LLM classifier correctly routing platform queries |
| `[AskAgentStream] query_type=PLATFORM context_len=0` | Zero retrieval for PLATFORM |
| `[hybrid] RRF -> N final docs` | Must NOT appear for PLATFORM queries |
Impact on parallel workstreams
RAG evaluation (ADR-0007 / EvaluateRAG): No impact. The golden dataset contains only AVAP language questions. PLATFORM routing does not touch the retrieval pipeline or the embedding model.
Future classifier upgrade (ADR-0008 Future Path): `classify_history_store` is the data collection mechanism for Phase 1 of the discriminative classifier pipeline. Every session from this point forward contributes labeled training examples. The embedding classifier (Layer 2) can be trained once sufficient sessions accumulate (~500).
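Assuming `classify_history_store` maps session ids to lists of `ClassifyEntry`-shaped dicts, the Phase 1 harvest can be sketched as follows (the helper name is hypothetical):

```python
def export_training_pairs(store: dict[str, list[dict]]) -> list[tuple[str, str]]:
    """Flatten the per-session intent history into (snippet, type) pairs
    usable as weakly labeled examples for a discriminative classifier."""
    pairs = []
    for entries in store.values():
        for entry in entries:
            pairs.append((entry["topic"], entry["type"]))
    return pairs
```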