From 5cb67bdc6ce332ac3e7c7be28cfbe419d0efbdc3 Mon Sep 17 00:00:00 2001
From: rafa-ruiz
Date: Thu, 9 Apr 2026 20:19:13 -0700
Subject: [PATCH] update

---
 .../PRD-0003-adaptive-query-routing.md | 183 ++++++++++++++++++
 1 file changed, 183 insertions(+)
 create mode 100644 docs/product/PRD-0003-adaptive-query-routing.md

diff --git a/docs/product/PRD-0003-adaptive-query-routing.md b/docs/product/PRD-0003-adaptive-query-routing.md
new file mode 100644
index 0000000..b5aea2e
--- /dev/null
+++ b/docs/product/PRD-0003-adaptive-query-routing.md
@@ -0,0 +1,183 @@

# PRD-0003: Adaptive Query Routing with Platform Intent and Model Specialization

**Date:** 2026-04-09
**Status:** Implemented
**Requested by:** Rafael Ruiz (CTO)
**Purpose:** Route platform queries correctly and reduce inference cost for non-RAG requests
**Related ADR:** ADR-0008 (Adaptive Query Routing — Taxonomy, Contract, and Classifier Strategy)

---

## Problem

The assistance engine treated all incoming queries identically: every request went through the same RAG pipeline (Elasticsearch retrieval) and was answered by the same model (`qwen3:1.7b`), regardless of what the user was actually asking.

This caused two observable problems in production:

**1. Platform queries answered with AVAP documentation context**

When a user or the platform sent a query about their account, usage metrics, or subscription — for example, `"You have a project usage percentage of 20%, provide a recommendation"` — the engine would retrieve AVAP language documentation chunks from Elasticsearch and attempt to answer using that context. The result was irrelevant or hallucinated responses, because the relevant data (account metrics) was already in the request, not in the knowledge base.

**2. Classifier anchoring bias corrupted routing in mixed sessions**

When a user had a conversation mixing AVAP language questions and platform queries, the classifier — a 1.7B generative model receiving raw message history — misclassified platform queries as `RETRIEVAL` or `CODE_GENERATION`. The model effectively computed `P(type | history)` instead of `P(type | message_content)`, biasing toward the dominant type of the session. A session with 5 prior `RETRIEVAL` turns made the 6th query — regardless of content — likely to be classified as `RETRIEVAL`.

---

## Solution

### New query type: `PLATFORM`

A fourth classification category is added to the existing taxonomy:

| Type | What the user wants | RAG | Model |
|---|---|---|---|
| `RETRIEVAL` | Understand AVAP language concepts or documentation | Yes | `qwen3:1.7b` |
| `CODE_GENERATION` | Get working AVAP code | Yes | `qwen3:1.7b` |
| `CONVERSATIONAL` | Rephrase or continue what was already said | No | `qwen3:0.6b` |
| `PLATFORM` | Information about their account, usage, metrics, or billing | No | `qwen3:0.6b` |

`PLATFORM` queries skip Elasticsearch retrieval entirely. The engine answers using only the data already present in the request (`extra_context`, `user_info`) via a dedicated `PLATFORM_PROMPT`.

### Model specialization

Two model slots, configured independently:

```
OLLAMA_MODEL_NAME=qwen3:1.7b                 # RETRIEVAL + CODE_GENERATION
OLLAMA_MODEL_NAME_CONVERSATIONAL=qwen3:0.6b  # CONVERSATIONAL + PLATFORM
```

If `OLLAMA_MODEL_NAME_CONVERSATIONAL` is not set, both slots fall back to the main model.

### Classifier bias fix: intent history

The classifier no longer receives raw conversation messages. Instead, a compact **intent history** is passed — one entry per prior turn:

```
[RETRIEVAL] "What is addVar in AVAP?"
[CODE_GENERATION] "Write an API endpoint that retur"
[PLATFORM] "You have a project usage percentag"
```

This gives the classifier enough context to resolve ambiguous references (`"this"`, `"esto"` ["this"], `"lo anterior"` ["the previous one"]) without the topical content that causes anchoring. The distribution of prior intent types no longer biases the classification of the current message.

Intent history is persisted per session in `classify_history_store`, parallel to `session_store`.

---

## User experience

**Scenario 1 — Platform dashboard widget**

The platform injects a prompt: `"You have a project usage percentage of 20%. Provide an insight. Rules: Exactly 3 sentences."` The engine detects this is a platform query, skips retrieval, and answers using the injected data with the `qwen3:0.6b` model. The response is fast and relevant.

**Scenario 2 — Mixed session: AVAP coding then account check**

A developer asks 4 questions about AVAP syntax, then asks `"cuántas llamadas llevo este mes?"` ("how many calls have I made this month?"). The classifier receives the intent history `[RETRIEVAL, RETRIEVAL, CODE_GENERATION, RETRIEVAL]` and the current message. Despite the history being dominated by `RETRIEVAL`, the classifier identifies the current message as `PLATFORM` because its content is about account metrics — not AVAP language.

**Scenario 3 — Ambiguous reference resolved via history**

During a debugging session (history: `[CODE_GENERATION, CODE_GENERATION]`), the user asks `"explain this"`. The history tells the classifier the user is working with code, so it correctly returns `CODE_GENERATION EDITOR` — the question is about the code in the editor, not a platform topic.

**Scenario 4 — Conversational follow-up (unchanged)**

A user asks a general AVAP question, then `"en menos palabras"` ("in fewer words"). The engine classifies this as `CONVERSATIONAL`, skips retrieval, and reformulates the prior answer using `qwen3:0.6b`. No change to existing behavior.
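The compact intent-history rendering described above can be sketched as follows. This is an illustrative sketch, not the engine's actual `_format_intent_history()`; the entry shape mirrors the `ClassifyEntry` fields defined later in this PRD (`type`, `topic`):

```python
from typing import TypedDict

class ClassifyEntry(TypedDict):
    type: str   # RETRIEVAL | CODE_GENERATION | CONVERSATIONAL | PLATFORM
    topic: str  # 60-char snippet of the query

def format_intent_history(history: list[ClassifyEntry]) -> str:
    """Render one '[TYPE] "snippet"' line per prior turn: the intent plus a
    capped snippet, without the full topical content that causes anchoring."""
    return "\n".join(f'[{e["type"]}] "{e["topic"][:60]}"' for e in history)

print(format_intent_history([
    {"type": "RETRIEVAL", "topic": "What is addVar in AVAP?"},
    {"type": "CODE_GENERATION", "topic": "Write an API endpoint that returns user data"},
]))
```

Because the classifier sees only intent labels and short snippets, a session dominated by one type carries no extra weight in the prompt beyond the labels themselves.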
---

## Scope

**In scope:**
- Add `PLATFORM` query type with dedicated routing, no RAG, and `PLATFORM_PROMPT`
- Add `OLLAMA_MODEL_NAME_CONVERSATIONAL` environment variable for model slot assignment
- Replace raw message history in classifier with compact `classify_history` (type + 60-char snippet per turn)
- Persist `classify_history` in `classify_history_store` per session
- Add a history-scoping rule to the classifier prompt: history resolves references only, does not predict type
- Add a platform-content override to the classifier prompt: usage percentages, account metrics, quota data → always `PLATFORM`
- Add fast-path detection for known platform-injected prompt prefixes (O(1), no LLM call)
- Route `PLATFORM` and `CONVERSATIONAL` in `build_prepare_graph` to `skip_retrieve`
- Select `active_llm` per request in `AskAgentStream` based on `query_type`
- Add `classify_history` field to `AgentState` and `ClassifyEntry` type to `state.py`

**Out of scope:**
- Changes to `EvaluateRAG` — the golden dataset does not include platform queries
- `user_info` consumption in graph logic — available in state, not yet acted upon
- Sub-typing `PLATFORM` (e.g. `PLATFORM_METRICS` vs `PLATFORM_BILLING`) — deferred
- Proto changes to add a caller-declared `query_type` field — deferred to Phase 3 of the Future Path (ADR-0008)

---

## Technical design

### New state fields (`state.py`)

```python
from typing import TypedDict

class ClassifyEntry(TypedDict):
    type: str   # RETRIEVAL | CODE_GENERATION | CONVERSATIONAL | PLATFORM
    topic: str  # 60-char snippet of the query

class AgentState(TypedDict):
    ...
    classify_history: list[ClassifyEntry]  # persisted across turns
```

### Graph changes (`graph.py`)

- `build_graph` accepts an `llm_conversational=None` parameter; `respond_conversational` and the new `respond_platform` node both use `_llm_conv`
- Both `classify` nodes (in `build_graph` and `build_prepare_graph`) check `_is_platform_query()` before invoking the LLM — a fast path for known prefixes
- `classify_history` is read from state, passed to `_format_intent_history()`, appended with the new entry, and returned in state
- `build_prepare_graph` routes `PLATFORM` → `skip_retrieve` (same as `CONVERSATIONAL`)
- `build_final_messages` handles the `PLATFORM` type — returns `PLATFORM_PROMPT` + messages, no RAG context

### Server changes (`server.py`)

- `__init__`: creates `self.llm_conversational` from the `OLLAMA_MODEL_NAME_CONVERSATIONAL` env var; falls back to `self.llm` if unset
- `__init__`: passes `llm_conversational=self.llm_conversational` to `build_graph`
- Both `AskAgent` and `AskAgentStream`: read `classify_history` from `classify_history_store` at request start; write back after the response
- `AskAgentStream`: selects `active_llm = self.llm_conversational if query_type in ("CONVERSATIONAL", "PLATFORM") else self.llm`

### New prompt (`prompts.py`)

`PLATFORM_PROMPT` — instructs the model to answer using `extra_context` as the primary source, be concise, and not invent account data.
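The model-slot selection and fallback described in the server changes above can be sketched as follows; `select_model` and the constants are illustrative names, not the actual `server.py` attributes:

```python
import os

# Env-var contract from this PRD: the conversational slot falls back to the
# main model when OLLAMA_MODEL_NAME_CONVERSATIONAL is unset. (The default
# "qwen3:1.7b" here is only for the sketch; the real variable is required.)
MAIN_MODEL = os.environ.get("OLLAMA_MODEL_NAME", "qwen3:1.7b")
CONV_MODEL = os.environ.get("OLLAMA_MODEL_NAME_CONVERSATIONAL", MAIN_MODEL)

def select_model(query_type: str) -> str:
    """Light model for CONVERSATIONAL and PLATFORM; main model otherwise."""
    if query_type in ("CONVERSATIONAL", "PLATFORM"):
        return CONV_MODEL
    return MAIN_MODEL
```

With the conversational variable unset, every query type resolves to the main model, which is the degraded-but-correct behavior the acceptance criteria require.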
Classifier prompt additions:
- A history-scoping rule — use history only to resolve deictic references; the distribution of previous intents must not influence the prior probability of the current classification
- A platform-content override — a hard semantic override for account/metrics/quota/billing content

### Environment variables

| Variable | Purpose | Default |
|---|---|---|
| `OLLAMA_MODEL_NAME` | Main model: RETRIEVAL + CODE_GENERATION | required |
| `OLLAMA_MODEL_NAME_CONVERSATIONAL` | Light model: CONVERSATIONAL + PLATFORM | falls back to `OLLAMA_MODEL_NAME` |

---

## Validation

**Acceptance criteria:**

- `"You have a project usage percentage of 20%, provide a recommendation"` → `PLATFORM`, zero Elasticsearch calls, answered with `qwen3:0.6b`
- `"You are a direct and concise assistant..."` prefix → fast-path `PLATFORM`, no LLM classifier call in logs
- Mixed session (5 AVAP turns + 1 platform query) → platform query classified as `PLATFORM` correctly
- Ambiguous reference (`"explain this"`) during a code session → resolved as `CODE_GENERATION EDITOR` via intent history
- `OLLAMA_MODEL_NAME_CONVERSATIONAL` unset → system operates normally using the main model for all types
- No regression on existing `RETRIEVAL` and `CODE_GENERATION` flows

**Signals to watch in logs:**

| Log line | Expected |
|---|---|
| `[classify] platform prefix detected -> PLATFORM` | Fast-path firing for known prefixes |
| `[prepare/classify] ... -> PLATFORM` | LLM classifier correctly routing platform queries |
| `[AskAgentStream] query_type=PLATFORM context_len=0` | Zero retrieval for PLATFORM |
| `[hybrid] RRF -> N final docs` | Must NOT appear for PLATFORM queries |

---

## Impact on parallel workstreams

**RAG evaluation (ADR-0007 / EvaluateRAG):** No impact. The golden dataset contains only AVAP language questions. `PLATFORM` routing does not touch the retrieval pipeline or the embedding model.
+ +**Future classifier upgrade (ADR-0008 Future Path):** `classify_history_store` is the data collection mechanism for Phase 1 of the discriminative classifier pipeline. Every session from this point forward contributes labeled training examples. The embedding classifier (Layer 2) can be trained once sufficient sessions accumulate (~500).
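As an illustration of the Phase 1 data collection described above, labeled examples could be exported from `classify_history_store` along these lines. The store's exact shape and the function name are assumptions for the sketch, not the implemented pipeline:

```python
# Assumed store shape: session_id -> list of {"type": ..., "topic": ...}
# entries, mirroring ClassifyEntry. Each entry yields one labeled
# (text, intent) pair for the future embedding classifier (ADR-0008, Layer 2).
def export_training_examples(store: dict) -> list[tuple[str, str]]:
    examples = []
    for session_id, history in store.items():
        for entry in history:
            examples.append((entry["topic"], entry["type"]))
    return examples

store = {"s1": [{"type": "PLATFORM", "topic": "cuántas llamadas llevo este mes?"}]}
print(export_training_examples(store))
```

Note that only the 60-char snippets are stored, so any trained classifier would learn from truncated queries; whether that is sufficient signal is a question for the Phase 1 evaluation.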