
PRD-0003: Adaptive Query Routing with Platform Intent and Model Specialization

Date: 2026-04-09
Last updated: 2026-04-10
Status: Implemented (Phases 1–3 complete)
Requested by: Rafael Ruiz (CTO)
Purpose: Route platform queries correctly and reduce inference cost for non-RAG requests
Related ADR: ADR-0008 (Adaptive Query Routing — Taxonomy, Contract, and Classifier Strategy)


Problem

The assistance engine treated all incoming queries identically: every request went through the same RAG pipeline (Elasticsearch retrieval) and was answered by the same model (qwen3:1.7b), regardless of what the user was actually asking.

This caused two observable problems in production:

1. Platform queries answered with AVAP documentation context

When a user or the platform sent a query about their account, usage metrics, or subscription — for example, "You have a project usage percentage of 20%, provide a recommendation" — the engine would retrieve AVAP language documentation chunks from Elasticsearch and attempt to answer using that context. The result was irrelevant or hallucinated responses, because the relevant data (account metrics) was already in the request, not in the knowledge base.

2. Classifier anchoring bias corrupted routing in mixed sessions

When a user had a conversation mixing AVAP language questions and platform queries, the classifier — a 1.7B generative model receiving raw message history — misclassified platform queries as RETRIEVAL or CODE_GENERATION. The model computed P(type | history) instead of P(type | message_content), biasing toward the dominant type of the session. A session with 5 prior RETRIEVAL turns made the 6th query — regardless of content — likely to be classified as RETRIEVAL.


Solution

New query type: PLATFORM

A fourth classification category is added to the existing taxonomy:

Type             What the user wants                                           RAG   Model
RETRIEVAL        Understand AVAP language concepts or documentation            Yes   qwen3:1.7b
CODE_GENERATION  Get working AVAP code                                         Yes   qwen3:1.7b
CONVERSATIONAL   Rephrase or continue what was already said                    No    qwen3:0.6b
PLATFORM         Information about their account, usage, metrics, or billing   No    qwen3:0.6b

PLATFORM queries skip Elasticsearch retrieval entirely. The engine answers using only the data already present in the request (extra_context, user_info) via a dedicated PLATFORM_PROMPT.
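As a rough illustration of that answer path, the final messages for a PLATFORM query could be assembled from the injected data alone (the prompt text and the helper name `build_platform_messages` are illustrative, not the engine's actual code):

```python
# Hypothetical sketch of the PLATFORM answer path: no retrieval; the
# system prompt is built only from data already present in the request.
PLATFORM_PROMPT = (
    "Answer using only the account data provided in the context. "
    "Be concise and do not invent metrics."
)

def build_platform_messages(extra_context: str, user_message: str) -> list[dict]:
    """Assemble final messages for a PLATFORM query: system prompt plus
    injected context, with no Elasticsearch chunks involved."""
    return [
        {"role": "system", "content": f"{PLATFORM_PROMPT}\n\nContext:\n{extra_context}"},
        {"role": "user", "content": user_message},
    ]

msgs = build_platform_messages("project usage: 20%", "Provide a recommendation.")
```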

Model specialization

Two model slots, configured independently:

OLLAMA_MODEL_NAME=qwen3:1.7b                # RETRIEVAL + CODE_GENERATION
OLLAMA_MODEL_NAME_CONVERSATIONAL=qwen3:0.6b # CONVERSATIONAL + PLATFORM

If OLLAMA_MODEL_NAME_CONVERSATIONAL is not set, both fall back to the main model.
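The fallback rule can be sketched as follows; `resolve_models` is an illustrative helper, not a function from the codebase:

```python
# Minimal sketch of the two-slot model configuration: the conversational
# slot falls back to the main model when its variable is unset or empty.
def resolve_models(env: dict) -> tuple[str, str]:
    """Return (main_model, conversational_model) from an env-like mapping."""
    main = env["OLLAMA_MODEL_NAME"]  # required
    conv = env.get("OLLAMA_MODEL_NAME_CONVERSATIONAL") or main
    return main, conv
```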

Classifier bias fix: intent history

The classifier no longer receives raw conversation messages. Instead, a compact intent history is passed — one entry per prior turn:

[RETRIEVAL] "What is addVar in AVAP?"
[CODE_GENERATION] "Write an API endpoint that retur"
[PLATFORM] "You have a project usage percentag"

This gives the classifier enough context to resolve ambiguous references ("this", "esto" ["this"], "lo anterior" ["the previous one"]) without the topical content that causes anchoring: the distribution of prior intent types no longer acts as a prior on the current classification.

Intent history is persisted per session in classify_history_store, parallel to session_store.
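A sketch of how that compact history might be rendered for the classifier; the function name `format_intent_history` is illustrative, and the entry shape mirrors ClassifyEntry (type plus a topic snippet):

```python
# Render one '[TYPE] "snippet"' line per prior turn, capping the topic
# at 60 characters as the PRD specifies.
def format_intent_history(entries: list[dict]) -> str:
    """Compact intent history string passed to the classifier."""
    return "\n".join(f'[{e["type"]}] "{e["topic"][:60]}"' for e in entries)

history = format_intent_history([
    {"type": "RETRIEVAL", "topic": "What is addVar in AVAP?"},
    {"type": "PLATFORM", "topic": "You have a project usage percentage of 20%"},
])
```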


User experience

Scenario 1 — Platform dashboard widget

The platform injects a prompt: "You have a project usage percentage of 20%. Provide an insight. Rules: Exactly 3 sentences." The engine detects this is a platform query, skips retrieval, and answers using the injected data with the qwen3:0.6b model. Response is fast and relevant.

Scenario 2 — Mixed session: AVAP coding then account check

A developer asks 4 questions about AVAP syntax, then asks "cuántas llamadas llevo este mes?". The classifier receives the intent history [RETRIEVAL, RETRIEVAL, CODE_GENERATION, RETRIEVAL] and the current message. Despite the history being dominated by RETRIEVAL, the classifier identifies the current message as PLATFORM because its content is about account metrics — not AVAP language.

Scenario 3 — Ambiguous reference resolved via history

During a debugging session (history: [CODE_GENERATION, CODE_GENERATION]), the user asks "explain this". The history tells the classifier the user is working with code, so it correctly returns CODE_GENERATION EDITOR — the question is about the code in the editor, not a platform topic.

Scenario 4 — Conversational follow-up (unchanged)

A user asks a general AVAP question, then "en menos palabras". The engine classifies this as CONVERSATIONAL, skips retrieval, and reformulates the prior answer using qwen3:0.6b. No change to existing behavior.


Scope

In scope:

  • Add PLATFORM query type with dedicated routing, no RAG, and PLATFORM_PROMPT
  • Add OLLAMA_MODEL_NAME_CONVERSATIONAL environment variable for model slot assignment
  • Replace raw message history in classifier with compact classify_history (type + 60-char snippet per turn)
  • Persist classify_history in classify_history_store per session
  • Add <history_rule> to classifier prompt: history resolves references only, does not predict type
  • Add <platform_priority_rule> to classifier prompt: usage percentages, account metrics, quota data → always PLATFORM
  • Add fast-path detection for known platform-injected prompt prefixes (O(1), no LLM call)
  • Route PLATFORM and CONVERSATIONAL in build_prepare_graph to skip_retrieve
  • Select active_llm per request in AskAgentStream based on query_type
  • Add classify_history field to AgentState and ClassifyEntry type to state.py
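The fast-path item above can be sketched as a plain prefix check that never touches an LLM; the prefix list here is an example, not the engine's actual set:

```python
# Hedged sketch of the Layer 1 fast path: a constant-cost prefix match
# against known platform-injected prompts, performed before any LLM call.
PLATFORM_PREFIXES = (
    "You have a project usage percentage",
    "You are a direct and concise assistant",
)

def is_platform_query(message: str) -> bool:
    """True when the message starts with a known platform prompt prefix."""
    return message.startswith(PLATFORM_PREFIXES)
```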

Also in scope (ADR-0008 Phases 1–3):

  • Export classify_history_store to labeled JSONL when threshold is reached (classifier_export.py) — data flywheel for Layer 2 retraining
  • Embedding classifier Layer 2: bge-m3 + LogisticRegression trained on seed dataset, loaded at startup, intercepts queries before LLM with ≥0.85 confidence threshold
  • Caller-declared query_type: proto field 7 in AgentRequest — when set, all three classifier layers are bypassed
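The Layer 2 gate described above can be sketched as a threshold on the classifier's top-class probability; the probability vector here is a stand-in for `LogisticRegression.predict_proba` over bge-m3 embeddings, and the function name is illustrative:

```python
# Accept the embedding classifier's answer only when its confidence
# clears the threshold; otherwise defer to the LLM classifier (Layer 3).
LABELS = ["RETRIEVAL", "CODE_GENERATION", "CONVERSATIONAL", "PLATFORM"]
CONFIDENCE_THRESHOLD = 0.85

def layer2_decision(probs: list[float]):
    """Return (label, confidence), or None to fall through to Layer 3."""
    conf = max(probs)
    if conf < CONFIDENCE_THRESHOLD:
        return None
    return LABELS[probs.index(conf)], conf
```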

Out of scope:

  • Changes to EvaluateRAG — golden dataset does not include platform queries
  • user_info consumption in graph logic — available in state, not yet acted upon
  • Sub-typing PLATFORM (e.g. PLATFORM_METRICS vs PLATFORM_BILLING) — deferred

Technical design

New state fields (state.py)

class ClassifyEntry(TypedDict):
    type:  str   # RETRIEVAL | CODE_GENERATION | CONVERSATIONAL | PLATFORM
    topic: str   # 60-char snippet of the query

class AgentState(TypedDict):
    ...
    classify_history: list[ClassifyEntry]  # persisted across turns

Graph changes (graph.py)

  • build_graph accepts llm_conversational=None parameter; respond_conversational and new respond_platform nodes use _llm_conv
  • Both classify nodes (in build_graph and build_prepare_graph) check _is_platform_query() before invoking the LLM — fast-path for known prefixes
  • classify_history is read from state, passed to _format_intent_history(), appended with the new entry, and returned in state
  • build_prepare_graph routes PLATFORM → skip_retrieve (same as CONVERSATIONAL)
  • build_final_messages handles PLATFORM type — returns PLATFORM_PROMPT + messages, no RAG context
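The routing edge from the list above reduces to a small conditional; the node names follow the PRD's terminology, but the function itself is an illustrative sketch:

```python
# PLATFORM and CONVERSATIONAL bypass Elasticsearch retrieval entirely;
# RETRIEVAL and CODE_GENERATION proceed through the RAG pipeline.
def route_after_classify(state: dict) -> str:
    """Pick the next node based on the classified query type."""
    if state["query_type"] in ("PLATFORM", "CONVERSATIONAL"):
        return "skip_retrieve"
    return "retrieve"
```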

Server changes (server.py)

  • __init__: creates self.llm_conversational from OLLAMA_MODEL_NAME_CONVERSATIONAL env var; falls back to self.llm if unset
  • __init__: passes llm_conversational=self.llm_conversational to build_graph
  • Both AskAgent and AskAgentStream: read classify_history from classify_history_store at request start; write back after response
  • AskAgentStream: selects active_llm = self.llm_conversational if query_type in ("CONVERSATIONAL", "PLATFORM") else self.llm
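The per-request model selection in AskAgentStream amounts to the following; a minimal sketch, assuming the two LLM handles created in `__init__`:

```python
# Light model for CONVERSATIONAL and PLATFORM, main model for the
# RAG-backed types; mirrors the active_llm selection described above.
def select_active_llm(query_type: str, llm, llm_conversational):
    """Choose the model handle to answer this request with."""
    if query_type in ("CONVERSATIONAL", "PLATFORM"):
        return llm_conversational
    return llm
```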

New prompt (prompts.py)

PLATFORM_PROMPT — instructs the model to answer using extra_context as primary source, be concise, and not invent account data.

Classifier prompt additions:

  • <history_rule> — use history only to resolve deictic references; the distribution of previous intents must not influence the prior probability of current classification
  • <platform_priority_rule> — hard semantic override for account/metrics/quota/billing content

Environment variables

Variable                          Purpose                                                                Default
OLLAMA_MODEL_NAME                 Main model: RETRIEVAL + CODE_GENERATION                                (required)
OLLAMA_MODEL_NAME_CONVERSATIONAL  Light model: CONVERSATIONAL + PLATFORM                                 falls back to OLLAMA_MODEL_NAME
CLASSIFIER_EXPORT_THRESHOLD       Sessions before auto-export of labeled data                            500
CLASSIFIER_EXPORT_DIR             Directory for exported JSONL label files                               /data/classifier_labels
CLASSIFIER_MODEL_PATH             Path to serialized Layer 2 model                                       /data/classifier_model.pkl
CLASSIFIER_CONFIDENCE_THRESHOLD   Minimum confidence for Layer 2 to classify (else fall through to LLM)  0.85

Validation

Acceptance criteria:

  • "You have a project usage percentage of 20%, provide a recommendation" → PLATFORM, zero Elasticsearch calls, answered with qwen3:0.6b
  • "You are a direct and concise assistant..." prefix → fast-path PLATFORM, no LLM classifier call in logs
  • Mixed session (5 AVAP turns + 1 platform query) → platform query classified as PLATFORM correctly
  • Ambiguous reference ("explain this") during a code session → resolved as CODE_GENERATION EDITOR via intent history
  • OLLAMA_MODEL_NAME_CONVERSATIONAL unset → system operates normally using main model for all types
  • No regression on existing RETRIEVAL and CODE_GENERATION flows

Signals to watch in logs:

Log line                                                       Expected
[classify] L1 → PLATFORM                                       Layer 1 fast-path firing for known prefixes
[classifier/L2] → PLATFORM (conf=0.97)                         Layer 2 classifying with high confidence
[classifier/L2] low confidence (0.72) → LLM fallback           Layer 2 falling through to Layer 3
[classify] caller-declared → PLATFORM                          Phase 3 — caller bypassing all layers
[prepare/classify] L3 ... -> PLATFORM                          Layer 3 LLM classifier correctly routing
[AskAgentStream] query_type=PLATFORM context_len=0             Zero retrieval for PLATFORM queries
[hybrid] RRF -> N final docs                                   Must NOT appear for PLATFORM queries
[classifier/L2] model loaded from /data/classifier_model.pkl   Layer 2 loaded at startup

Impact on parallel workstreams

RAG evaluation (ADR-0007 / EvaluateRAG): No impact. The golden dataset contains only AVAP language questions. PLATFORM routing does not touch the retrieval pipeline or the embedding model.

Future classifier upgrade (ADR-0008 Future Path): classify_history_store is the data collection mechanism for Phase 1 of the discriminative classifier pipeline. Every session from this point forward contributes labeled training examples. The embedding classifier (Layer 2) can be trained once sufficient sessions accumulate (~500).