
PRD-0003: Adaptive Query Routing with Platform Intent and Model Specialization

Date: 2026-04-09
Last updated: 2026-04-10
Status: Implemented (Phases 1–3 complete)
Requested by: Rafael Ruiz (CTO)
Purpose: Route platform queries correctly and reduce inference cost for non-RAG requests
Related ADR: ADR-0008 (Adaptive Query Routing — Taxonomy, Contract, and Classifier Strategy)


Problem

The assistance engine treated all incoming queries identically: every request went through the same RAG pipeline (Elasticsearch retrieval) and was answered by the same model (qwen3:1.7b), regardless of what the user was actually asking.

This caused two observable problems in production:

1. Platform queries answered with AVAP documentation context

When a user or the platform sent a query about their account, usage metrics, or subscription — for example, "You have a project usage percentage of 20%, provide a recommendation" — the engine would retrieve AVAP language documentation chunks from Elasticsearch and attempt to answer using that context. The result was irrelevant or hallucinated responses, because the relevant data (account metrics) was already in the request, not in the knowledge base.

2. Classifier anchoring bias corrupted routing in mixed sessions

When a user had a conversation mixing AVAP language questions and platform queries, the classifier — a 1.7B generative model receiving raw message history — misclassified platform queries as RETRIEVAL or CODE_GENERATION. The model computed P(type | history) instead of P(type | message_content), biasing toward the dominant type of the session. A session with 5 prior RETRIEVAL turns made the 6th query — regardless of content — likely to be classified as RETRIEVAL.


Solution

New query type: PLATFORM

A fourth classification category is added to the existing taxonomy:

Type             What the user wants                                           RAG   Model
RETRIEVAL        Understand AVAP language concepts or documentation            Yes   qwen3:1.7b
CODE_GENERATION  Get working AVAP code                                         Yes   qwen3:1.7b
CONVERSATIONAL   Rephrase or continue what was already said                    No    qwen3:0.6b
PLATFORM         Information about their account, usage, metrics, or billing   No    qwen3:0.6b

PLATFORM queries skip Elasticsearch retrieval entirely. The engine answers using only the data already present in the request (extra_context, user_info) via a dedicated PLATFORM_PROMPT.
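As a rough illustration of that answer path, the final messages for a PLATFORM query could be assembled from the injected data alone (the prompt text and the helper name `build_platform_messages` are illustrative, not the engine's actual code):

```python
# Hypothetical sketch of the PLATFORM answer path: no retrieval; the
# system prompt is built only from data already present in the request.
PLATFORM_PROMPT = (
    "Answer using only the account data provided in the context. "
    "Be concise and do not invent metrics."
)

def build_platform_messages(extra_context: str, user_message: str) -> list[dict]:
    """Assemble final messages for a PLATFORM query: system prompt plus
    injected context, with no Elasticsearch chunks involved."""
    return [
        {"role": "system", "content": f"{PLATFORM_PROMPT}\n\nContext:\n{extra_context}"},
        {"role": "user", "content": user_message},
    ]

msgs = build_platform_messages("project usage: 20%", "Provide a recommendation.")
```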

Model specialization

Two model slots, configured independently:

OLLAMA_MODEL_NAME=qwen3:1.7b                # RETRIEVAL + CODE_GENERATION
OLLAMA_MODEL_NAME_CONVERSATIONAL=qwen3:0.6b # CONVERSATIONAL + PLATFORM

If OLLAMA_MODEL_NAME_CONVERSATIONAL is not set, both fall back to the main model.
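The fallback rule can be sketched as follows; `resolve_models` is an illustrative helper, not a function from the codebase:

```python
# Minimal sketch of the two-slot model configuration: the conversational
# slot falls back to the main model when its variable is unset or empty.
def resolve_models(env: dict) -> tuple[str, str]:
    """Return (main_model, conversational_model) from an env-like mapping."""
    main = env["OLLAMA_MODEL_NAME"]  # required
    conv = env.get("OLLAMA_MODEL_NAME_CONVERSATIONAL") or main
    return main, conv
```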

Classifier bias fix: intent history

The classifier no longer receives raw conversation messages. Instead, a compact intent history is passed — one entry per prior turn:

[RETRIEVAL] "What is addVar in AVAP?"
[CODE_GENERATION] "Write an API endpoint that retur"
[PLATFORM] "You have a project usage percentag"

This gives the classifier enough context to resolve ambiguous references ("this", "esto" ["this"], "lo anterior" ["the previous one"]) without the topical content that causes anchoring: the distribution of prior intent types no longer acts as a prior on the current classification.

Intent history is persisted per session in classify_history_store, parallel to session_store.
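A sketch of how that compact history might be rendered for the classifier; the function name `format_intent_history` is illustrative, and the entry shape mirrors ClassifyEntry (type plus a topic snippet):

```python
# Render one '[TYPE] "snippet"' line per prior turn, capping the topic
# at 60 characters as the PRD specifies.
def format_intent_history(entries: list[dict]) -> str:
    """Compact intent history string passed to the classifier."""
    return "\n".join(f'[{e["type"]}] "{e["topic"][:60]}"' for e in entries)

history = format_intent_history([
    {"type": "RETRIEVAL", "topic": "What is addVar in AVAP?"},
    {"type": "PLATFORM", "topic": "You have a project usage percentage of 20%"},
])
```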


User experience

Scenario 1 — Platform dashboard widget

The platform injects a prompt: "You have a project usage percentage of 20%. Provide an insight. Rules: Exactly 3 sentences." The engine detects this is a platform query, skips retrieval, and answers using the injected data with the qwen3:0.6b model. Response is fast and relevant.

Scenario 2 — Mixed session: AVAP coding then account check

A developer asks 4 questions about AVAP syntax, then asks "cuántas llamadas llevo este mes?". The classifier receives the intent history [RETRIEVAL, RETRIEVAL, CODE_GENERATION, RETRIEVAL] and the current message. Despite the history being dominated by RETRIEVAL, the classifier identifies the current message as PLATFORM because its content is about account metrics — not AVAP language.

Scenario 3 — Ambiguous reference resolved via history

During a debugging session (history: [CODE_GENERATION, CODE_GENERATION]), the user asks "explain this". The history tells the classifier the user is working with code, so it correctly returns CODE_GENERATION EDITOR — the question is about the code in the editor, not a platform topic.

Scenario 4 — Conversational follow-up (unchanged)

A user asks a general AVAP question, then "en menos palabras". The engine classifies this as CONVERSATIONAL, skips retrieval, and reformulates the prior answer using qwen3:0.6b. No change to existing behavior.


Scope

In scope:

  • Add PLATFORM query type with dedicated routing, no RAG, and PLATFORM_PROMPT
  • Add OLLAMA_MODEL_NAME_CONVERSATIONAL environment variable for model slot assignment
  • Replace raw message history in classifier with compact classify_history (type + 60-char snippet per turn)
  • Persist classify_history in classify_history_store per session
  • Add <history_rule> to classifier prompt: history resolves references only, does not predict type
  • Add <platform_priority_rule> to classifier prompt: usage percentages, account metrics, quota data → always PLATFORM
  • Add fast-path detection for known platform-injected prompt prefixes (O(1), no LLM call)
  • Route PLATFORM and CONVERSATIONAL in build_prepare_graph to skip_retrieve
  • Select active_llm per request in AskAgentStream based on query_type
  • Add classify_history field to AgentState and ClassifyEntry type to state.py
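The fast-path item above can be sketched as a plain prefix check that never touches an LLM; the prefix list here is an example, not the engine's actual set:

```python
# Hedged sketch of the Layer 1 fast path: a constant-cost prefix match
# against known platform-injected prompts, performed before any LLM call.
PLATFORM_PREFIXES = (
    "You have a project usage percentage",
    "You are a direct and concise assistant",
)

def is_platform_query(message: str) -> bool:
    """True when the message starts with a known platform prompt prefix."""
    return message.startswith(PLATFORM_PREFIXES)
```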

Also in scope (ADR-0008 Phases 1–3):

  • Export classify_history_store to labeled JSONL when threshold is reached (classifier_export.py) — data flywheel for Layer 2 retraining
  • Embedding classifier Layer 2: bge-m3 + LogisticRegression trained on seed dataset, loaded at startup, intercepts queries before LLM with ≥0.85 confidence threshold
  • Caller-declared query_type: proto field 7 in AgentRequest — when set, all three classifier layers are bypassed
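The Layer 2 gate described above can be sketched as a threshold on the classifier's top-class probability; the probability vector here is a stand-in for `LogisticRegression.predict_proba` over bge-m3 embeddings, and the function name is illustrative:

```python
# Accept the embedding classifier's answer only when its confidence
# clears the threshold; otherwise defer to the LLM classifier (Layer 3).
LABELS = ["RETRIEVAL", "CODE_GENERATION", "CONVERSATIONAL", "PLATFORM"]
CONFIDENCE_THRESHOLD = 0.85

def layer2_decision(probs: list[float]):
    """Return (label, confidence), or None to fall through to Layer 3."""
    conf = max(probs)
    if conf < CONFIDENCE_THRESHOLD:
        return None
    return LABELS[probs.index(conf)], conf
```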

Out of scope:

  • Changes to EvaluateRAG — golden dataset does not include platform queries
  • user_info consumption in graph logic — available in state, not yet acted upon
  • Sub-typing PLATFORM (e.g. PLATFORM_METRICS vs PLATFORM_BILLING) — deferred

Technical design

New state fields (state.py)

class ClassifyEntry(TypedDict):
    type:  str   # RETRIEVAL | CODE_GENERATION | CONVERSATIONAL | PLATFORM
    topic: str   # 60-char snippet of the query

class AgentState(TypedDict):
    ...
    classify_history: list[ClassifyEntry]  # persisted across turns

Graph changes (graph.py)

  • build_graph accepts llm_conversational=None parameter; respond_conversational and new respond_platform nodes use _llm_conv
  • Both classify nodes (in build_graph and build_prepare_graph) check _is_platform_query() before invoking the LLM — fast-path for known prefixes
  • classify_history is read from state, passed to _format_intent_history(), appended with the new entry, and returned in state
  • build_prepare_graph routes PLATFORM → skip_retrieve (same as CONVERSATIONAL)
  • build_final_messages handles PLATFORM type — returns PLATFORM_PROMPT + messages, no RAG context
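The routing edge from the list above reduces to a small conditional; the node names follow the PRD's terminology, but the function itself is an illustrative sketch:

```python
# PLATFORM and CONVERSATIONAL bypass Elasticsearch retrieval entirely;
# RETRIEVAL and CODE_GENERATION proceed through the RAG pipeline.
def route_after_classify(state: dict) -> str:
    """Pick the next node based on the classified query type."""
    if state["query_type"] in ("PLATFORM", "CONVERSATIONAL"):
        return "skip_retrieve"
    return "retrieve"
```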

Server changes (server.py)

  • __init__: creates self.llm_conversational from OLLAMA_MODEL_NAME_CONVERSATIONAL env var; falls back to self.llm if unset
  • __init__: passes llm_conversational=self.llm_conversational to build_graph
  • Both AskAgent and AskAgentStream: read classify_history from classify_history_store at request start; write back after response
  • AskAgentStream: selects active_llm = self.llm_conversational if query_type in ("CONVERSATIONAL", "PLATFORM") else self.llm
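The per-request model selection in AskAgentStream amounts to the following; a minimal sketch, assuming the two LLM handles created in `__init__`:

```python
# Light model for CONVERSATIONAL and PLATFORM, main model for the
# RAG-backed types; mirrors the active_llm selection described above.
def select_active_llm(query_type: str, llm, llm_conversational):
    """Choose the model handle to answer this request with."""
    if query_type in ("CONVERSATIONAL", "PLATFORM"):
        return llm_conversational
    return llm
```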

New prompt (prompts.py)

PLATFORM_PROMPT — instructs the model to answer using extra_context as primary source, be concise, and not invent account data.

Classifier prompt additions:

  • <history_rule> — use history only to resolve deictic references; the distribution of previous intents must not influence the prior probability of current classification
  • <platform_priority_rule> — hard semantic override for account/metrics/quota/billing content

Environment variables

Variable                          Purpose                                                                Default
OLLAMA_MODEL_NAME                 Main model: RETRIEVAL + CODE_GENERATION                                (required)
OLLAMA_MODEL_NAME_CONVERSATIONAL  Light model: CONVERSATIONAL + PLATFORM                                 falls back to OLLAMA_MODEL_NAME
CLASSIFIER_EXPORT_THRESHOLD       Sessions before auto-export of labeled data                            500
CLASSIFIER_EXPORT_DIR             Directory for exported JSONL label files                               /data/classifier_labels
CLASSIFIER_MODEL_PATH             Path to serialized Layer 2 model                                       /data/classifier_model.pkl
CLASSIFIER_CONFIDENCE_THRESHOLD   Minimum confidence for Layer 2 to classify (else fall through to LLM)  0.85

Validation

Acceptance criteria:

  • "You have a project usage percentage of 20%, provide a recommendation" → PLATFORM, zero Elasticsearch calls, answered with qwen3:0.6b
  • "You are a direct and concise assistant..." prefix → fast-path PLATFORM, no LLM classifier call in logs
  • Mixed session (5 AVAP turns + 1 platform query) → platform query classified as PLATFORM correctly
  • Ambiguous reference ("explain this") during a code session → resolved as CODE_GENERATION EDITOR via intent history
  • OLLAMA_MODEL_NAME_CONVERSATIONAL unset → system operates normally using main model for all types
  • No regression on existing RETRIEVAL and CODE_GENERATION flows

Signals to watch in logs:

Log line                                                       Expected
[classify] L1 → PLATFORM                                       Layer 1 fast-path firing for known prefixes
[classifier/L2] → PLATFORM (conf=0.97)                         Layer 2 classifying with high confidence
[classifier/L2] low confidence (0.72) → LLM fallback           Layer 2 falling through to Layer 3
[classify] caller-declared → PLATFORM                          Phase 3 — caller bypassing all layers
[prepare/classify] L3 ... -> PLATFORM                          Layer 3 LLM classifier correctly routing
[AskAgentStream] query_type=PLATFORM context_len=0             Zero retrieval for PLATFORM queries
[hybrid] RRF -> N final docs                                   Must NOT appear for PLATFORM queries
[classifier/L2] model loaded from /data/classifier_model.pkl   Layer 2 loaded at startup

Impact on parallel workstreams

RAG evaluation (ADR-0007 / EvaluateRAG): No impact. The golden dataset contains only AVAP language questions. PLATFORM routing does not touch the retrieval pipeline or the embedding model.

Future classifier upgrade (ADR-0008 Future Path): classify_history_store is the data collection mechanism for Phase 1 of the discriminative classifier pipeline. Every session from this point forward contributes labeled training examples. The embedding classifier (Layer 2) can be trained once sufficient sessions accumulate (~500).