[DOC] ADR-0008: refactor — separate permanent decisions from tactical debt

- Retitle to reflect actual scope: taxonomy, contract, classifier strategy
- Split Decision section into permanent (taxonomy, model assignment) vs
  tactical [BOOTSTRAP] (LLM classifier)
- Mark LLM classifier explicitly as interim implementation with pointer
  to Future Path
- Clarify that Routing Contract is implementation-independent
- Consolidate prompt engineering rules as symptoms of architectural mismatch

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
rafa-ruiz 2026-04-09 20:02:10 -07:00
parent 273049b705
commit c955f69ad8
1 changed files with 83 additions and 95 deletions

@@ -1,4 +1,4 @@
# ADR-0008: Adaptive Query Routing with Intent History and Model Specialization # ADR-0008: Adaptive Query Routing — Taxonomy, Contract, and Classifier Strategy
**Date:** 2026-04-09 **Date:** 2026-04-09
**Status:** Accepted **Status:** Accepted
@@ -9,49 +9,60 @@
## Context ## Context
The assistance engine previously used a single Ollama model (`qwen3:1.7b`) for all query types and a single LLM-based classifier that received raw conversation history. Two problems emerged in production: The assistance engine previously used a single Ollama model (`qwen3:1.7b`) for all query types with no differentiation in routing, retrieval, or model selection. Two problems emerged in production:
### Problem 1 — Model oversizing for lightweight queries ### Problem 1 — No query taxonomy
Platform queries (account status, usage metrics, subscription data) and conversational follow-ups do not require retrieval or a large model. Running `qwen3:1.7b` for a one-sentence platform insight wastes resources and adds latency. All queries were treated identically. Platform queries (account status, usage metrics, billing) were sent through the same RAG pipeline as AVAP language questions, wasting retrieval resources and producing irrelevant context.
### Problem 2 — Classifier bias from raw message history ### Problem 2 — Classifier anchoring bias
When the classifier received raw conversation messages as history, a small model (1.7B parameters) exhibited **anchoring bias**: it would classify new messages as the same type as recent messages, regardless of the actual content of the new query. This caused platform queries (`"You have a project usage percentage of 20%, provide a recommendation"`) to be misclassified as `RETRIEVAL` or `CODE_GENERATION` during sessions that had previously handled AVAP language questions. The LLM-based classifier received raw conversation messages as history. A 1.7B model exhibited **anchoring bias**: it computed `P(type | history)` instead of `P(type | message_content)`, misclassifying new queries as the same type as recent turns regardless of actual content.
Root cause: passing full message content to a small classifier is too noisy. The model uses conversation topic as a proxy for intent type.
--- ---
## Decision ## Decision
### 1. New query type: `PLATFORM` This ADR makes three decisions with different time horizons:
A fourth classification category is introduced alongside `RETRIEVAL`, `CODE_GENERATION`, and `CONVERSATIONAL`: 1. **Permanent** — query taxonomy and routing contract
2. **Permanent** — model assignment per type
3. **Tactical / bootstrap** — LLM classifier as interim implementation
| Type | Purpose | RAG | Model | ### Decision 1 — Query taxonomy (permanent)
Four query types with fixed routing semantics:
| Type | Purpose | RAG | Model slot |
|---|---|---|---| |---|---|---|---|
| `RETRIEVAL` | AVAP language documentation | Yes | `OLLAMA_MODEL_NAME` | | `RETRIEVAL` | AVAP language documentation and concepts | Yes | `main` |
| `CODE_GENERATION` | Produce working AVAP code | Yes | `OLLAMA_MODEL_NAME` | | `CODE_GENERATION` | Produce working AVAP code | Yes | `main` |
| `CONVERSATIONAL` | Rephrase / continue prior answer | No | `OLLAMA_MODEL_NAME_CONVERSATIONAL` | | `CONVERSATIONAL` | Rephrase or continue prior answer | No | `conversational` |
| `PLATFORM` | Account, metrics, usage, billing | No | `OLLAMA_MODEL_NAME_CONVERSATIONAL` | | `PLATFORM` | Account, metrics, usage, quota, billing | No | `conversational` |
`PLATFORM` queries skip RAG entirely and are served with a dedicated `PLATFORM_PROMPT` that instructs the model to use `extra_context` (where user account data is injected) as primary source. These types and their RAG/model assignments are stable. Any future classifier implementation must preserve this taxonomy.
### 2. Model specialization via environment variables ### Decision 2 — Model specialization (permanent)
Two model slots are configured independently: Two model slots configured via environment variables:
``` ```
OLLAMA_MODEL_NAME=qwen3:1.7b # RETRIEVAL + CODE_GENERATION OLLAMA_MODEL_NAME=qwen3:1.7b # main slot: RETRIEVAL + CODE_GENERATION
OLLAMA_MODEL_NAME_CONVERSATIONAL=qwen3:0.6b # CONVERSATIONAL + PLATFORM OLLAMA_MODEL_NAME_CONVERSATIONAL=qwen3:0.6b # conversational slot: CONVERSATIONAL + PLATFORM
``` ```
If `OLLAMA_MODEL_NAME_CONVERSATIONAL` is not set, both slots fall back to `OLLAMA_MODEL_NAME` (backward compatible). If `OLLAMA_MODEL_NAME_CONVERSATIONAL` is unset, both slots fall back to `OLLAMA_MODEL_NAME`.
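The fallback rule can be sketched as follows. This is a minimal illustration, not the code in the repository: `resolve_model` and `_SLOT_BY_TYPE` are hypothetical names, and treating an empty variable the same as an unset one is an assumption.

```python
import os

# Hypothetical mapping from query type to model slot (mirrors the taxonomy table).
_SLOT_BY_TYPE = {
    "RETRIEVAL": "main",
    "CODE_GENERATION": "main",
    "CONVERSATIONAL": "conversational",
    "PLATFORM": "conversational",
}


def resolve_model(query_type: str, env=None) -> str:
    """Return the Ollama model name for a query type, falling back to the main slot."""
    env = os.environ if env is None else env
    main = env["OLLAMA_MODEL_NAME"]
    if _SLOT_BY_TYPE[query_type] == "main":
        return main
    # Assumption: an empty value is treated like an unset variable.
    return env.get("OLLAMA_MODEL_NAME_CONVERSATIONAL") or main
```

Passing `env` explicitly keeps the function testable without mutating the process environment.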
### 3. Intent history instead of raw message history for classification ### Decision 3 — LLM classifier as bootstrap `[TACTICAL DEBT]`
The classifier no longer receives raw conversation messages. Instead, a compact **intent history** (`classify_history`) is maintained per session: > **This is an acknowledged interim implementation, not the target architecture.**
> See [Future Path](#future-path-discriminative-classifier-pipeline) for the correct steady-state design.
A generative LLM is used for classification because no labeled training data exists yet. The design includes two mitigations for its known weaknesses:
**a) Compact intent history instead of raw messages**
`classify_history` replaces raw message history in the classifier context. Each entry stores only `type` + 60-char topic snippet:
``` ```
[RETRIEVAL] "What is addVar in AVAP?" [RETRIEVAL] "What is addVar in AVAP?"
@@ -59,89 +70,74 @@ The classifier no longer receives raw conversation messages. Instead, a compact
[PLATFORM] "You have a project usage percentag" [PLATFORM] "You have a project usage percentag"
``` ```
Each entry stores only the `type` and a 60-character topic snippet. This gives the classifier the conversational thread (useful for resolving ambiguous references like "this", "esto", "lo anterior") without the topical noise that causes anchoring bias. This preserves reference resolution (`"this"`, `"esto"`, `"lo anterior"`) without the topical noise that causes anchoring. `classify_history` is persisted in `classify_history_store` per session.
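A rough sketch of how such entries could be built and rendered, assuming history is a list of `(type, snippet)` tuples. `append_intent` and `render_for_classifier` are hypothetical helpers, and the whitespace normalization is an assumption; the 60-character snippet and the 6-entry cap come from this ADR (the cap from RC-06).

```python
from typing import List, Tuple

SNIPPET_CHARS = 60  # from the ADR: 60-character topic snippet
MAX_ENTRIES = 6     # from RC-06: the classifier reads at most the last 6 entries

HistoryEntry = Tuple[str, str]  # (type, topic snippet)


def append_intent(history: List[HistoryEntry], query_type: str, message: str) -> List[HistoryEntry]:
    """Return a new history with a compact (type, snippet) entry appended."""
    # Assumption: collapse whitespace before truncating the snippet.
    snippet = " ".join(message.split())[:SNIPPET_CHARS]
    return history + [(query_type, snippet)]


def render_for_classifier(history: List[HistoryEntry]) -> str:
    """Format only the most recent entries, one per line, for the classifier context."""
    return "\n".join(f'[{t}] "{s}"' for t, s in history[-MAX_ENTRIES:])
```

Only the rendered string would reach the classifier, so the full message text never enters its context.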
`classify_history` is persisted in `classify_history_store` (parallel to `session_store`) and passed in `AgentState` across turns. **b) Prompt constraints to counteract generative bias**
### 4. Classifier prompt redesign - `<history_rule>` — explicit instruction that intent distribution of prior turns must not influence prior probability of current classification
- `<platform_priority_rule>` — hard semantic override: usage percentages, account metrics, quota or billing data → always `PLATFORM`
The prompt now includes: These prompt rules are compensations for the architectural mismatch between a generative model and a discriminative task. They become unnecessary once the LLM classifier is replaced.
- **`<history_rule>`** — explicit instruction: use history only to resolve ambiguous references, not to predict the category of the new message
- **`<platform_priority_rule>`** — hard override: if the message contains usage percentages, account metrics, quota data, or billing information, classify as `PLATFORM` regardless of history
- **`<step1_purpose>`** replaced by inline role instruction that each message must be evaluated independently
### 5. Fast-path for known platform prefixes
Queries containing `"you are a direct and concise assistant"` (a system-injected prefix used by the platform) are classified as `PLATFORM` deterministically without invoking the LLM classifier. This is justified because this prefix is controlled by the platform itself, not by user input, so deterministic detection is both correct and cheaper.
--- ---
## Routing Contract ## Routing Contract
This section is normative. Any reimplementation of the classifier or the graph must satisfy all rules below. Rules are ordered by priority — a higher-priority rule always wins. This section is normative and **implementation-independent**. Any reimplementation — including the discriminative classifier described in Future Path — must satisfy all rules below. Rules are ordered by priority.
### RC-01 — Fast-path override (priority: highest) ### RC-01 — Fast-path override (priority: highest)
If the query contains a known platform-injected prefix, the system **MUST** classify it as `PLATFORM` without invoking any LLM. If the query contains a known platform-injected prefix, classify as `PLATFORM` without invoking any classifier.
``` ```
∀ q : query ∀ q : query
contains(q, known_platform_prefix) → route(q) = PLATFORM contains(q, known_platform_prefix) → route(q) = PLATFORM
``` ```
Current registered prefixes (see `_PLATFORM_PATTERNS` in `graph.py`): Current registered prefixes (`_PLATFORM_PATTERNS` in `graph.py`):
- `"you are a direct and concise assistant"` - `"you are a direct and concise assistant"`
Adding a new prefix requires a code change to `_PLATFORM_PATTERNS` and a corresponding update to this list. Adding a prefix requires updating `_PLATFORM_PATTERNS` and this list.
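RC-01 amounts to a substring check that runs before any classifier. The sketch below is illustrative only: the real list lives in `_PLATFORM_PATTERNS` in `graph.py`, while `PLATFORM_PATTERNS`, `fast_path`, and the case-insensitive match are assumptions made here.

```python
from typing import Optional

# Registered prefixes from RC-01 (illustrative copy of the list above).
PLATFORM_PATTERNS = ("you are a direct and concise assistant",)


def fast_path(query: str) -> Optional[str]:
    """Return "PLATFORM" when a registered prefix appears in the query, else None."""
    lowered = query.lower()  # assumption: matching is case-insensitive
    if any(pattern in lowered for pattern in PLATFORM_PATTERNS):
        return "PLATFORM"
    return None  # the caller continues to the classifier
```

A `None` result means no rule matched, which keeps the fast path separate from whatever classifier runs next.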
### RC-02 — Platform data signal (priority: high) ### RC-02 — Platform data signal (priority: high)
If the query contains any of the following signals, the classifier **MUST** output `PLATFORM` regardless of conversation history: If the query contains usage percentages, account metrics, consumption figures, quota data, or billing information, the output **MUST** be `PLATFORM` regardless of history or classifier confidence.
- Usage percentages (e.g. `"20%"` in the context of project/account usage) In the current bootstrap implementation this is enforced via `<platform_priority_rule>`. In the future discriminative classifier it should be a hard pre-filter in Layer 1.
- Account metrics or consumption figures
- Quota, limit, or billing data
This rule is enforced via `<platform_priority_rule>` in the classifier prompt. It cannot be overridden by history.
### RC-03 — Intent history scoping (priority: medium) ### RC-03 — Intent history scoping (priority: medium)
The classifier **MUST** use `classify_history` only to resolve ambiguous pronoun or deictic references (`"this"`, `"esto"`, `"lo anterior"`, `"that function"`). It **MUST NOT** use history to predict or bias the type of the current message. The classifier **MUST** use `classify_history` only to resolve ambiguous deictic references. It **MUST NOT** use history to predict or bias the type of the current message.
``` ```
classify(q, history) ≠ f(dominant_type(history)) classify(q, history) ≠ f(dominant_type(history))
classify(q, history) = f(intent(q), resolve_references(q, history)) classify(q, history) = f(intent(q), resolve_references(q, history))
``` ```
**Rationale:** Small LLMs implicitly compute `P(type | history)` instead of `P(type | message_content)`. The distribution of previous intents must not influence the prior probability of the current classification. Each message is an independent classification event — a session with 10 `RETRIEVAL` turns does not make the next message more likely to be `RETRIEVAL`. The `<history_rule>` in the classifier prompt enforces this explicitly. **Rationale:** Small LLMs implicitly compute `P(type | history)` instead of `P(type | message_content)`. The distribution of previous intents must not influence the prior probability of the current classification. Each message is an independent classification event — a session with 10 `RETRIEVAL` turns does not make the next message more likely to be `RETRIEVAL`.
### RC-04 — RAG bypass (priority: medium) ### RC-04 — RAG bypass (priority: medium)
Query types that bypass Elasticsearch retrieval:
| Type | RAG | Justification | | Type | RAG | Justification |
|---|---|---| |---|---|---|
| `RETRIEVAL` | Yes | Requires documentation context | | `RETRIEVAL` | Yes | Requires documentation context |
| `CODE_GENERATION` | Yes | Requires syntax examples | | `CODE_GENERATION` | Yes | Requires syntax examples |
| `CONVERSATIONAL` | No | Reformulates prior answer already in context | | `CONVERSATIONAL` | No | Prior answer already in context |
| `PLATFORM` | No | Data is injected via `extra_context`, not retrieved | | `PLATFORM` | No | Data injected via `extra_context` |
A `PLATFORM` or `CONVERSATIONAL` query that triggers a retrieval step is a contract violation. A `PLATFORM` or `CONVERSATIONAL` query that triggers Elasticsearch retrieval is a contract violation.
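One way to surface this violation in tests or monitoring is a small guard such as the one below. It is a sketch under assumptions: `check_rc04` and the `retrieval_ran` flag are hypothetical, and how the graph records that retrieval happened is not specified in this ADR.

```python
# RAG requirement per query type, copied from the RC-04 table.
USES_RAG = {
    "RETRIEVAL": True,
    "CODE_GENERATION": True,
    "CONVERSATIONAL": False,
    "PLATFORM": False,
}


def check_rc04(query_type: str, retrieval_ran: bool) -> None:
    """Raise if retrieval ran for a query type that must bypass it."""
    if retrieval_ran and not USES_RAG[query_type]:
        raise AssertionError(f"RC-04 violation: {query_type} query triggered retrieval")
```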
### RC-05 — Model assignment (priority: medium) ### RC-05 — Model assignment (priority: medium)
``` ```
route(q) ∈ {RETRIEVAL, CODE_GENERATION} → model = OLLAMA_MODEL_NAME route(q) ∈ {RETRIEVAL, CODE_GENERATION} → model = OLLAMA_MODEL_NAME
route(q) ∈ {CONVERSATIONAL, PLATFORM} → model = OLLAMA_MODEL_NAME_CONVERSATIONAL route(q) ∈ {CONVERSATIONAL, PLATFORM} → model = OLLAMA_MODEL_NAME_CONVERSATIONAL
?? OLLAMA_MODEL_NAME # fallback if unset ?? OLLAMA_MODEL_NAME # fallback
``` ```
Changing which types map to which model slot requires updating this contract.
### RC-06 — History growth bound (priority: low) ### RC-06 — History growth bound (priority: low)
`classify_history` per session **MUST** be bounded. The classifier reads at most the last 6 entries. The store may grow unbounded in memory but the classifier input is always capped. `classify_history` input to the classifier **MUST** be capped at 6 entries per session.
### Contract violations to monitor ### Contract violations to monitor
@@ -149,7 +145,7 @@ Changing which types map to which model slot requires updating this contract.
|---|---| |---|---|
| Platform query hits Elasticsearch | RC-04 | | Platform query hits Elasticsearch | RC-04 |
| `qwen3:1.7b` used for a `PLATFORM` response | RC-05 | | `qwen3:1.7b` used for a `PLATFORM` response | RC-05 |
| Platform prefix query triggers LLM classifier | RC-01 | | Platform prefix triggers LLM classifier | RC-01 |
| Classifier output mirrors dominant history type | RC-03 | | Classifier output mirrors dominant history type | RC-03 |
--- ---
@@ -158,30 +154,26 @@ Changing which types map to which model slot requires updating this contract.
### Positive ### Positive
- Platform and conversational queries are served by a smaller, faster model - Query taxonomy is formalized and stable — downstream graph, model assignment, and RAG decisions are decoupled from classifier implementation
- Classifier bias from conversation history is eliminated while preserving the ability to resolve ambiguous references - `classify_history_store` acts as a data flywheel for future classifier training
- `PLATFORM` queries never hit Elasticsearch, reducing unnecessary retrieval load - Platform-injected prompts classified in O(1) via RC-01
- The system is more predictable: platform-injected prompts are classified in O(1) without an LLM call - `PLATFORM` queries never hit Elasticsearch
### Negative / Trade-offs ### Negative / Trade-offs
- `classify_history` adds a small amount of state per session (bounded to last 6 entries) - The LLM classifier is a generative model doing discriminative work — this is the accepted tactical debt
- Two model slots mean two warm-up calls at startup if models differ - Prompt engineering (`<history_rule>`, `<platform_priority_rule>`) is a symptom of this mismatch, not a solution
- The `qwen3:1.7b` classifier can still misclassify edge cases where no platform signals are present in the text — this is inherent to using a 1.7B model for semantic classification - `qwen3:1.7b` can still misclassify edge cases without platform signals — inherent to the bootstrap design
### Open questions
- Whether `PLATFORM` should eventually split into sub-types (e.g. `PLATFORM_METRICS` vs `PLATFORM_BILLING`) as the platform data schema grows
--- ---
## Future Path: Discriminative Classifier Pipeline ## Future Path: Discriminative Classifier Pipeline
### The fundamental problem with the current design ### The fundamental problem with the bootstrap design
The LLM classifier is a **generative model doing discriminative work**. Generating tokens to produce a 4-class label wastes orders of magnitude more compute than the task requires, introduces non-determinism, and forces prompt engineering as a substitute for proper model design. The rules in RC-01–RC-06 exist precisely to compensate for this architectural mismatch. The LLM classifier is a generative model doing discriminative work. Generating tokens to produce a 4-class label wastes orders of magnitude more compute than the task requires, introduces non-determinism, and forces prompt engineering to compensate for what should be model properties. RC-01 through RC-06 exist precisely because of this mismatch.
The current design is correct as a **bootstrap mechanism** — it lets the system operate before labeled training data exists. But it should not be the steady-state architecture. The bootstrap design is justified while no labeled data exists. It should not be the steady-state architecture.
### Target architecture ### Target architecture
@@ -197,53 +189,49 @@ Query
[Layer 2] Embedding similarity classifier ← ~1ms, CPU, no LLM [Layer 2] Embedding similarity classifier ← ~1ms, CPU, no LLM
│ confidence < threshold │ confidence < threshold
[Layer 3] LLM classifier (current design) ← fallback only [Layer 3] LLM classifier (current design) ← fallback for ambiguous queries only
Classification result Classification result
``` ```
In steady state, Layer 3 should handle fewer than 5% of requests — only genuinely ambiguous queries that neither rules nor the trained classifier can resolve with confidence. In steady state, Layer 3 handles fewer than 5% of requests.
### Layer 2: Embedding similarity classifier ### Layer 2: embedding classifier on `bge-m3`
`bge-m3` is already running in the stack. The right move is to reuse it as the backbone of a lightweight discriminative classifier rather than adding a second LLM. `bge-m3` is already running in the stack. The implementation:
**Implementation:** 1. Embed each query via `bge-m3` → fixed-size vector
2. Train logistic regression (or SVM with RBF kernel) on labeled `(query, type)` pairs
3. At inference: embed → class centroids → argmax with confidence score
4. If `max(softmax(logits)) < 0.85` → fall through to Layer 3
1. Embed each query using `bge-m3` → fixed-size vector representation This is microseconds of CPU inference. No GPU, no Ollama call, no prompt templating. RC-02 becomes a hard pre-filter in Layer 1, making it implementation-independent rather than prompt-dependent.
2. Train a logistic regression (or SVM with RBF kernel) over those embeddings on a labeled dataset of `(query, type)` pairs
3. At inference: embed query → dot product against class centroids → argmax with confidence score
4. If `max(softmax(logits)) < threshold` (e.g. 0.85), fall through to Layer 3
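The confidence gate in step 4 can be sketched without any ML library. This assumes the trained classifier emits one logit per type in the order of `LABELS`; `layer2_decide` is a hypothetical name, and 0.85 is the example threshold from the list above.

```python
import math
from typing import List, Optional, Sequence

# Assumption: logits arrive in this label order.
LABELS = ("RETRIEVAL", "CODE_GENERATION", "CONVERSATIONAL", "PLATFORM")
THRESHOLD = 0.85  # example value from the ADR


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def layer2_decide(logits: Sequence[float]) -> Optional[str]:
    """Return the argmax label, or None to fall through to Layer 3."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best] if probs[best] >= THRESHOLD else None
```

Returning `None` instead of a low-confidence label makes the fall-through to Layer 3 explicit at the call site.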
This is microseconds of CPU inference, not LLM inference. No GPU, no Ollama call, no prompt templating. ### The data flywheel
### The data flywheel: `classify_history_store` as training set `classify_history_store` already generates labeled training data. Every session produces `(topic_snippet, type)` pairs implicitly validated by user continuation.
Every session already generates labeled examples. `classify_history_store` stores `(topic_snippet, type)` pairs that are implicitly validated by the system — if the user continued the conversation without correcting the assistant, the classification was likely correct.
``` ```
classify_history_store → periodic export → labeled dataset → retrain classifier classify_history_store → periodic export → labeled dataset → retrain Layer 2
``` ```
The LLM classifier is the **teacher**. The embedding classifier is the **student**. This is knowledge distillation without the overhead of explicit distillation training — the teacher labels production traffic automatically. The LLM classifier is the **teacher**. The embedding classifier is the **student**. This is knowledge distillation over production traffic without manual labeling.
**Data collection trigger:** When `classify_history_store` accumulates N sessions (suggested: 500), export and retrain. The classifier improves continuously without human labeling. **Trigger:** retrain when `classify_history_store` accumulates 500 sessions.
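The export step could look roughly like this, assuming the store maps a session id to a list of `(type, snippet)` entries. That shape and the name `export_if_ready` are assumptions; the 500-session trigger is the value suggested above.

```python
from typing import Dict, List, Tuple

RETRAIN_AFTER_SESSIONS = 500  # suggested trigger from the ADR


def export_if_ready(store: Dict[str, List[Tuple[str, str]]]) -> List[Tuple[str, str]]:
    """Flatten the store into (topic_snippet, type) pairs once enough sessions exist."""
    if len(store) < RETRAIN_AFTER_SESSIONS:
        return []  # not enough sessions yet
    return [
        (snippet, query_type)
        for entries in store.values()
        for query_type, snippet in entries
    ]
```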
### Caller-declared type (for platform-injected prompts) ### Caller-declared type
The platform generates `PLATFORM` prompts — it always knows the type at generation time. Adding a `query_type` field to `AgentRequest` (proto field 7) allows the caller to declare the type explicitly. When set, all three classifier layers are bypassed entirely. The platform generates `PLATFORM` prompts and knows the type at generation time. Adding `query_type` to `AgentRequest` (proto field 7) lets the caller declare the type explicitly, bypassing all three layers. This makes RC-01 and RC-02 redundant for platform-generated traffic.
This makes RC-01 and RC-02 redundant for platform-generated traffic and eliminates the only remaining case where a generative model is used to classify structured platform data.
### Convergence path ### Convergence path
| Phase | What changes | Expected Layer 3 traffic | | Phase | What changes | Layer 3 traffic |
|---|---|---| |---|---|---|
| Now (bootstrap) | LLM classifier for all unmatched queries | ~95% | | Now — bootstrap | LLM classifier for all unmatched queries | ~95% |
| Phase 1 | Collect labels via `classify_history_store` | ~95% | | Phase 1 | Collect labels via `classify_history_store` | ~95% |
| Phase 2 | Deploy embedding classifier (Layer 2) | ~10–20% | | Phase 2 | Deploy embedding classifier (Layer 2) | ~10–20% |
| Phase 3 | Caller-declared type for platform prompts | <5% | | Phase 3 | Caller-declared type for platform prompts | <5% |
| Phase 4 | LLM classifier becomes anomaly handler only | <2% | | Phase 4 | LLM classifier as anomaly handler only | <2% |
Phase 2 is the highest-leverage step: it replaces the most frequent code path (LLM inference per request) with a CPU-only operation, with no change to the routing contract or the downstream graph. Phase 2 is the highest-leverage step: it replaces the dominant code path (LLM inference per request) with CPU-only inference, with no change to the routing contract or the downstream graph.