From c955f69ad86a925599e54118a55ac712165c91cb Mon Sep 17 00:00:00 2001
From: rafa-ruiz
Date: Thu, 9 Apr 2026 20:02:10 -0700
Subject: [PATCH] =?UTF-8?q?[DOC]=20ADR-0008:=20refactor=20=E2=80=94=20sepa?=
 =?UTF-8?q?rate=20permanent=20decisions=20from=20tactical=20debt?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- Retitle to reflect actual scope: taxonomy, contract, classifier strategy
- Split Decision section into permanent (taxonomy, model assignment) vs tactical [BOOTSTRAP] (LLM classifier)
- Mark LLM classifier explicitly as interim implementation with pointer to Future Path
- Clarify that Routing Contract is implementation-independent
- Consolidate prompt engineering rules as symptoms of architectural mismatch

Co-Authored-By: Claude Sonnet 4.6
---
 ...8-adaptive-query-routing-intent-history.md | 178 ++++++++----------
 1 file changed, 83 insertions(+), 95 deletions(-)

diff --git a/docs/ADR/ADR-0008-adaptive-query-routing-intent-history.md b/docs/ADR/ADR-0008-adaptive-query-routing-intent-history.md
index 8541408..1cc0d44 100644
--- a/docs/ADR/ADR-0008-adaptive-query-routing-intent-history.md
+++ b/docs/ADR/ADR-0008-adaptive-query-routing-intent-history.md
@@ -1,4 +1,4 @@
-# ADR-0008: Adaptive Query Routing with Intent History and Model Specialization
+# ADR-0008: Adaptive Query Routing — Taxonomy, Contract, and Classifier Strategy
 
 **Date:** 2026-04-09
 **Status:** Accepted
@@ -9,49 +9,60 @@
 
 ## Context
 
-The assistance engine previously used a single Ollama model (`qwen3:1.7b`) for all query types and a single LLM-based classifier that received raw conversation history. Two problems emerged in production:
+The assistance engine previously used a single Ollama model (`qwen3:1.7b`) for all query types with no differentiation in routing, retrieval, or model selection. Two problems emerged in production:
 
-### Problem 1 — Model oversizing for lightweight queries
+### Problem 1 — No query taxonomy
 
-Platform queries (account status, usage metrics, subscription data) and conversational follow-ups do not require retrieval or a large model. Running `qwen3:1.7b` for a one-sentence platform insight wastes resources and adds latency.
+All queries were treated identically. Platform queries (account status, usage metrics, billing) were sent through the same RAG pipeline as AVAP language questions, wasting retrieval resources and producing irrelevant context.
 
-### Problem 2 — Classifier bias from raw message history
+### Problem 2 — Classifier anchoring bias
 
-When the classifier received raw conversation messages as history, a small model (1.7B parameters) exhibited **anchoring bias**: it would classify new messages as the same type as recent messages, regardless of the actual content of the new query. This caused platform queries (`"You have a project usage percentage of 20%, provide a recommendation"`) to be misclassified as `RETRIEVAL` or `CODE_GENERATION` during sessions that had previously handled AVAP language questions.
-
-Root cause: passing full message content to a small classifier is too noisy. The model uses conversation topic as a proxy for intent type.
+The LLM-based classifier received raw conversation messages as history. A 1.7B model exhibited **anchoring bias**: it computed `P(type | history)` instead of `P(type | message_content)`, misclassifying new queries as the same type as recent turns regardless of actual content.
 
 ---
 
 ## Decision
 
-### 1. New query type: `PLATFORM`
+This ADR makes three decisions with different time horizons:
 
-A fourth classification category is introduced alongside `RETRIEVAL`, `CODE_GENERATION`, and `CONVERSATIONAL`:
+1. **Permanent** — query taxonomy and routing contract
+2. **Permanent** — model assignment per type
+3. **Tactical / bootstrap** — LLM classifier as interim implementation
 
-| Type | Purpose | RAG | Model |
+### Decision 1 — Query taxonomy (permanent)
+
+Four query types with fixed routing semantics:
+
+| Type | Purpose | RAG | Model slot |
 |---|---|---|---|
-| `RETRIEVAL` | AVAP language documentation | Yes | `OLLAMA_MODEL_NAME` |
-| `CODE_GENERATION` | Produce working AVAP code | Yes | `OLLAMA_MODEL_NAME` |
-| `CONVERSATIONAL` | Rephrase / continue prior answer | No | `OLLAMA_MODEL_NAME_CONVERSATIONAL` |
-| `PLATFORM` | Account, metrics, usage, billing | No | `OLLAMA_MODEL_NAME_CONVERSATIONAL` |
+| `RETRIEVAL` | AVAP language documentation and concepts | Yes | `main` |
+| `CODE_GENERATION` | Produce working AVAP code | Yes | `main` |
+| `CONVERSATIONAL` | Rephrase or continue prior answer | No | `conversational` |
+| `PLATFORM` | Account, metrics, usage, quota, billing | No | `conversational` |
 
-`PLATFORM` queries skip RAG entirely and are served with a dedicated `PLATFORM_PROMPT` that instructs the model to use `extra_context` (where user account data is injected) as primary source.
+These types and their RAG/model assignments are stable. Any future classifier implementation must preserve this taxonomy.
 
-### 2. Model specialization via environment variables
+### Decision 2 — Model specialization (permanent)
 
-Two model slots are configured independently:
+Two model slots configured via environment variables:
 
 ```
-OLLAMA_MODEL_NAME=qwen3:1.7b # RETRIEVAL + CODE_GENERATION
-OLLAMA_MODEL_NAME_CONVERSATIONAL=qwen3:0.6b # CONVERSATIONAL + PLATFORM
+OLLAMA_MODEL_NAME=qwen3:1.7b # main slot: RETRIEVAL + CODE_GENERATION
+OLLAMA_MODEL_NAME_CONVERSATIONAL=qwen3:0.6b # conversational slot: CONVERSATIONAL + PLATFORM
 ```
 
-If `OLLAMA_MODEL_NAME_CONVERSATIONAL` is not set, both slots fall back to `OLLAMA_MODEL_NAME` (backward compatible).
+If `OLLAMA_MODEL_NAME_CONVERSATIONAL` is unset, both slots fall back to `OLLAMA_MODEL_NAME`.
 
-### 3. Intent history instead of raw message history for classification
+### Decision 3 — LLM classifier as bootstrap `[TACTICAL DEBT]`
 
-The classifier no longer receives raw conversation messages. Instead, a compact **intent history** (`classify_history`) is maintained per session:
+> **This is an acknowledged interim implementation, not the target architecture.**
+> See [Future Path](#future-path-discriminative-classifier-pipeline) for the correct steady-state design.
+
+A generative LLM is used for classification because no labeled training data exists yet. The design includes two mitigations for its known weaknesses:
+
+**a) Compact intent history instead of raw messages**
+
+`classify_history` replaces raw message history in the classifier context. Each entry stores only `type` + 60-char topic snippet:
 
 ```
 [RETRIEVAL] "What is addVar in AVAP?"
@@ -59,89 +70,74 @@ The classifier no longer receives raw conversation messages. Instead, a compact
 [PLATFORM] "You have a project usage percentag"
 ```
 
-Each entry stores only the `type` and a 60-character topic snippet. This gives the classifier the conversational thread (useful for resolving ambiguous references like "this", "esto", "lo anterior") without the topical noise that causes anchoring bias.
+This preserves reference resolution (`"this"`, `"esto"`, `"lo anterior"`) without the topical noise that causes anchoring. `classify_history` is persisted in `classify_history_store` per session.
 
-`classify_history` is persisted in `classify_history_store` (parallel to `session_store`) and passed in `AgentState` across turns.
+**b) Prompt constraints to counteract generative bias**
 
-### 4. Classifier prompt redesign
+- `` — explicit instruction that intent distribution of prior turns must not influence prior probability of current classification
+- `` — hard semantic override: usage percentages, account metrics, quota or billing data → always `PLATFORM`
 
-The prompt now includes:
-
-- **``** — explicit instruction: use history only to resolve ambiguous references, not to predict the category of the new message
-- **``** — hard override: if the message contains usage percentages, account metrics, quota data, or billing information, classify as `PLATFORM` regardless of history
-- **``** replaced by inline role instruction that each message must be evaluated independently
-
-### 5. Fast-path for known platform prefixes
-
-Queries containing `"you are a direct and concise assistant"` (a system-injected prefix used by the platform) are classified as `PLATFORM` deterministically without invoking the LLM classifier. This is justified because this prefix is controlled by the platform itself, not by user input, so deterministic detection is both correct and cheaper.
+These prompt rules are compensations for the architectural mismatch between a generative model and a discriminative task. They become unnecessary once the LLM classifier is replaced.
 
 ---
 
 ## Routing Contract
 
-This section is normative. Any reimplementation of the classifier or the graph must satisfy all rules below. Rules are ordered by priority — a higher-priority rule always wins.
+This section is normative and **implementation-independent**. Any reimplementation — including the discriminative classifier described in Future Path — must satisfy all rules below. Rules are ordered by priority.
 
 ### RC-01 — Fast-path override (priority: highest)
 
-If the query contains a known platform-injected prefix, the system **MUST** classify it as `PLATFORM` without invoking any LLM.
+If the query contains a known platform-injected prefix, classify as `PLATFORM` without invoking any classifier.
 
 ```
 ∀ q : query contains(q, known_platform_prefix) → route(q) = PLATFORM
 ```
 
-Current registered prefixes (see `_PLATFORM_PATTERNS` in `graph.py`):
+Current registered prefixes (`_PLATFORM_PATTERNS` in `graph.py`):
 
 - `"you are a direct and concise assistant"`
 
-Adding a new prefix requires a code change to `_PLATFORM_PATTERNS` and a corresponding update to this list.
+Adding a prefix requires updating `_PLATFORM_PATTERNS` and this list.
 
 ### RC-02 — Platform data signal (priority: high)
 
-If the query contains any of the following signals, the classifier **MUST** output `PLATFORM` regardless of conversation history:
+If the query contains usage percentages, account metrics, consumption figures, quota data, or billing information, the output **MUST** be `PLATFORM` regardless of history or classifier confidence.
 
-- Usage percentages (e.g. `"20%"` in the context of project/account usage)
-- Account metrics or consumption figures
-- Quota, limit, or billing data
-
-This rule is enforced via `` in the classifier prompt. It cannot be overridden by history.
+In the current bootstrap implementation this is enforced via ``. In the future discriminative classifier it should be a hard pre-filter in Layer 1.
 
 ### RC-03 — Intent history scoping (priority: medium)
 
-The classifier **MUST** use `classify_history` only to resolve ambiguous pronoun or deictic references (`"this"`, `"esto"`, `"lo anterior"`, `"that function"`). It **MUST NOT** use history to predict or bias the type of the current message.
+The classifier **MUST** use `classify_history` only to resolve ambiguous deictic references. It **MUST NOT** use history to predict or bias the type of the current message.
 
 ```
 classify(q, history) ≠ f(dominant_type(history))
 classify(q, history) = f(intent(q), resolve_references(q, history))
 ```
 
-**Rationale:** Small LLMs implicitly compute `P(type | history)` instead of `P(type | message_content)`. The distribution of previous intents must not influence the prior probability of the current classification. Each message is an independent classification event — a session with 10 `RETRIEVAL` turns does not make the next message more likely to be `RETRIEVAL`. The `` in the classifier prompt enforces this explicitly.
+**Rationale:** Small LLMs implicitly compute `P(type | history)` instead of `P(type | message_content)`. The distribution of previous intents must not influence the prior probability of the current classification. Each message is an independent classification event — a session with 10 `RETRIEVAL` turns does not make the next message more likely to be `RETRIEVAL`.
 
 ### RC-04 — RAG bypass (priority: medium)
 
-Query types that bypass Elasticsearch retrieval:
-
 | Type | RAG | Justification |
 |---|---|---|
 | `RETRIEVAL` | Yes | Requires documentation context |
 | `CODE_GENERATION` | Yes | Requires syntax examples |
-| `CONVERSATIONAL` | No | Reformulates prior answer already in context |
-| `PLATFORM` | No | Data is injected via `extra_context`, not retrieved |
+| `CONVERSATIONAL` | No | Prior answer already in context |
+| `PLATFORM` | No | Data injected via `extra_context` |
 
-A `PLATFORM` or `CONVERSATIONAL` query that triggers a retrieval step is a contract violation.
+A `PLATFORM` or `CONVERSATIONAL` query that triggers Elasticsearch retrieval is a contract violation.
 
 ### RC-05 — Model assignment (priority: medium)
 
 ```
 route(q) ∈ {RETRIEVAL, CODE_GENERATION} → model = OLLAMA_MODEL_NAME
 route(q) ∈ {CONVERSATIONAL, PLATFORM} → model = OLLAMA_MODEL_NAME_CONVERSATIONAL
- ?? OLLAMA_MODEL_NAME # fallback if unset
+ ?? OLLAMA_MODEL_NAME # fallback
 ```
 
-Changing which types map to which model slot requires updating this contract.
-
 ### RC-06 — History growth bound (priority: low)
 
-`classify_history` per session **MUST** be bounded. The classifier reads at most the last 6 entries. The store may grow unbounded in memory but the classifier input is always capped.
+`classify_history` input to the classifier **MUST** be capped at 6 entries per session.
 
 ### Contract violations to monitor
 
@@ -149,7 +145,7 @@ Changing which types map to which model slot requires updating this contract.
 |---|---|
 | Platform query hits Elasticsearch | RC-04 |
 | `qwen3:1.7b` used for a `PLATFORM` response | RC-05 |
-| Platform prefix query triggers LLM classifier | RC-01 |
+| Platform prefix triggers LLM classifier | RC-01 |
 | Classifier output mirrors dominant history type | RC-03 |
 
 ---
@@ -158,30 +154,26 @@ Changing which types map to which model slot requires updating this contract.
 
 ### Positive
 
-- Platform and conversational queries are served by a smaller, faster model
-- Classifier bias from conversation history is eliminated while preserving the ability to resolve ambiguous references
-- `PLATFORM` queries never hit Elasticsearch, reducing unnecessary retrieval load
-- The system is more predictable: platform-injected prompts are classified in O(1) without an LLM call
+- Query taxonomy is formalized and stable — downstream graph, model assignment, and RAG decisions are decoupled from classifier implementation
+- `classify_history_store` acts as a data flywheel for future classifier training
+- Platform-injected prompts classified in O(1) via RC-01
+- `PLATFORM` queries never hit Elasticsearch
 
 ### Negative / Trade-offs
 
-- `classify_history` adds a small amount of state per session (bounded to last 6 entries)
-- Two model slots mean two warm-up calls at startup if models differ
-- The `qwen3:1.7b` classifier can still misclassify edge cases where no platform signals are present in the text — this is inherent to using a 1.7B model for semantic classification
-
-### Open questions
-
-- Whether `PLATFORM` should eventually split into sub-types (e.g. `PLATFORM_METRICS` vs `PLATFORM_BILLING`) as the platform data schema grows
+- The LLM classifier is a generative model doing discriminative work — this is the accepted tactical debt
+- Prompt engineering (``, ``) is a symptom of this mismatch, not a solution
+- `qwen3:1.7b` can still misclassify edge cases without platform signals — inherent to the bootstrap design
 
 ---
 
 ## Future Path: Discriminative Classifier Pipeline
 
-### The fundamental problem with the current design
+### The fundamental problem with the bootstrap design
 
-The LLM classifier is a **generative model doing discriminative work**. Generating tokens to produce a 4-class label wastes orders of magnitude more compute than the task requires, introduces non-determinism, and forces prompt engineering as a substitute for proper model design. The rules in RC-01–RC-06 exist precisely to compensate for this architectural mismatch.
+The LLM classifier is a generative model doing discriminative work. Generating tokens to produce a 4-class label wastes orders of magnitude more compute than the task requires, introduces non-determinism, and forces prompt engineering to compensate for what should be model properties. RC-01 through RC-06 exist precisely because of this mismatch.
 
-The current design is correct as a **bootstrap mechanism** — it lets the system operate before labeled training data exists. But it should not be the steady-state architecture.
+The bootstrap design is justified while no labeled data exists. It should not be the steady-state architecture.
 
 ### Target architecture
 
@@ -197,53 +189,49 @@ Query
 [Layer 2] Embedding similarity classifier ← ~1ms, CPU, no LLM
 │ confidence < threshold
 ▼
-[Layer 3] LLM classifier (current design) ← fallback only
+[Layer 3] LLM classifier (current design) ← fallback for ambiguous queries only
 │
 ▼
 Classification result
 ```
 
-In steady state, Layer 3 should handle fewer than 5% of requests — only genuinely ambiguous queries that neither rules nor the trained classifier can resolve with confidence.
+In steady state, Layer 3 handles fewer than 5% of requests.
 
-### Layer 2: Embedding similarity classifier
+### Layer 2: embedding classifier on `bge-m3`
 
-`bge-m3` is already running in the stack. The right move is to reuse it as the backbone of a lightweight discriminative classifier rather than adding a second LLM.
+`bge-m3` is already running in the stack. The implementation:
 
-**Implementation:**
+1. Embed each query via `bge-m3` → fixed-size vector
+2. Train logistic regression (or SVM with RBF kernel) on labeled `(query, type)` pairs
+3. At inference: embed → class centroids → argmax with confidence score
+4. If `max(softmax(logits)) < 0.85` → fall through to Layer 3
 
-1. Embed each query using `bge-m3` → fixed-size vector representation
-2. Train a logistic regression (or SVM with RBF kernel) over those embeddings on a labeled dataset of `(query, type)` pairs
-3. At inference: embed query → dot product against class centroids → argmax with confidence score
-4. If `max(softmax(logits)) < threshold` (e.g. 0.85), fall through to Layer 3
+This is microseconds of CPU inference. No GPU, no Ollama call, no prompt templating. RC-02 becomes a hard pre-filter in Layer 1, making it implementation-independent rather than prompt-dependent.
 
-This is microseconds of CPU inference, not LLM inference. No GPU, no Ollama call, no prompt templating.
+### The data flywheel
 
-### The data flywheel: `classify_history_store` as training set
-
-Every session already generates labeled examples. `classify_history_store` stores `(topic_snippet, type)` pairs that are implicitly validated by the system — if the user continued the conversation without correcting the assistant, the classification was likely correct.
+`classify_history_store` already generates labeled training data. Every session produces `(topic_snippet, type)` pairs implicitly validated by user continuation.
 
 ```
-classify_history_store → periodic export → labeled dataset → retrain classifier
+classify_history_store → periodic export → labeled dataset → retrain Layer 2
 ```
 
-The LLM classifier is the **teacher**. The embedding classifier is the **student**. This is knowledge distillation without the overhead of explicit distillation training — the teacher labels production traffic automatically.
+The LLM classifier is the **teacher**. The embedding classifier is the **student**. This is knowledge distillation over production traffic without manual labeling.
 
-**Data collection trigger:** When `classify_history_store` accumulates N sessions (suggested: 500), export and retrain. The classifier improves continuously without human labeling.
+**Trigger:** retrain when `classify_history_store` accumulates 500 sessions.
 
-### Caller-declared type (for platform-injected prompts)
+### Caller-declared type
 
-The platform generates `PLATFORM` prompts — it always knows the type at generation time. Adding a `query_type` field to `AgentRequest` (proto field 7) allows the caller to declare the type explicitly. When set, all three classifier layers are bypassed entirely.
-
-This makes RC-01 and RC-02 redundant for platform-generated traffic and eliminates the only remaining case where a generative model is used to classify structured platform data.
+The platform generates `PLATFORM` prompts and knows the type at generation time. Adding `query_type` to `AgentRequest` (proto field 7) lets the caller declare the type explicitly, bypassing all three layers. This makes RC-01 and RC-02 redundant for platform-generated traffic.
 
 ### Convergence path
 
-| Phase | What changes | Expected Layer 3 traffic |
+| Phase | What changes | Layer 3 traffic |
 |---|---|---|
-| Now (bootstrap) | LLM classifier for all unmatched queries | ~95% |
+| Now — bootstrap | LLM classifier for all unmatched queries | ~95% |
 | Phase 1 | Collect labels via `classify_history_store` | ~95% |
 | Phase 2 | Deploy embedding classifier (Layer 2) | ~10–20% |
 | Phase 3 | Caller-declared type for platform prompts | <5% |
-| Phase 4 | LLM classifier becomes anomaly handler only | <2% |
+| Phase 4 | LLM classifier as anomaly handler only | <2% |
 
-Phase 2 is the highest-leverage step: it replaces the most frequent code path (LLM inference per request) with a CPU-only operation, with no change to the routing contract or the downstream graph.
+Phase 2 is the highest-leverage step: it replaces the dominant code path (LLM inference per request) with CPU-only inference, with no change to the routing contract or the downstream graph.