[DOC] ADR-0008: add Future Path — discriminative classifier pipeline

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
rafa-ruiz 2026-04-09 19:59:34 -07:00
parent ce2306c4e5
commit 273049b705
1 changed file with 75 additions and 1 deletion


@@ -171,5 +171,79 @@ Changing which types map to which model slot requires updating this contract.
### Open questions
- Whether the classifier should be upgraded to a more capable model in the future (at the cost of latency/resources)
- Whether `PLATFORM` should eventually split into sub-types (e.g. `PLATFORM_METRICS` vs `PLATFORM_BILLING`) as the platform data schema grows
---
## Future Path: Discriminative Classifier Pipeline
### The fundamental problem with the current design
The LLM classifier is a **generative model doing discriminative work**. Generating tokens to produce a 4-class label wastes orders of magnitude more compute than the task requires, introduces non-determinism, and forces prompt engineering as a substitute for proper model design. The rules in RC-01–RC-06 exist precisely to compensate for this architectural mismatch.
The current design is correct as a **bootstrap mechanism** — it lets the system operate before labeled training data exists. But it should not be the steady-state architecture.
### Target architecture
A layered pipeline where each layer is only invoked if the previous layer cannot produce a confident answer:
```
Query
  │
  ▼
[Layer 1] Hard rules (RC-01, RC-02)           ← O(1), deterministic
  │ no match
  ▼
[Layer 2] Embedding similarity classifier     ← ~1ms, CPU, no LLM
  │ confidence < threshold
  ▼
[Layer 3] LLM classifier (current design)     ← fallback only
  │
  ▼
Classification result
```
In steady state, Layer 3 should handle fewer than 5% of requests — only genuinely ambiguous queries that neither rules nor the trained classifier can resolve with confidence.
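The fall-through behavior above can be sketched as a simple dispatcher. This is a minimal illustration, not the system's actual code: the layer implementations are stand-ins passed as callables, and only the layering order and the 0.85 confidence threshold come from this ADR.

```python
# Hypothetical sketch of the three-layer dispatch: later layers run
# only when earlier layers cannot produce a confident answer.
from typing import Callable, Optional, Tuple

CONFIDENCE_THRESHOLD = 0.85  # Layer 2 fall-through threshold from this ADR


def classify(query: str,
             hard_rules: Callable[[str], Optional[str]],
             embedding_clf: Callable[[str], Tuple[str, float]],
             llm_clf: Callable[[str], str]) -> str:
    # Layer 1: deterministic rules (RC-01, RC-02) -- O(1), no model call.
    label = hard_rules(query)
    if label is not None:
        return label
    # Layer 2: embedding classifier -- CPU-only, returns (label, confidence).
    label, confidence = embedding_clf(query)
    if confidence >= CONFIDENCE_THRESHOLD:
        return label
    # Layer 3: LLM classifier -- reached only for ambiguous queries.
    return llm_clf(query)
```

Because each layer short-circuits, the expensive Layer 3 callable is never even invoked when a rule matches or the embedding classifier is confident.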
### Layer 2: Embedding similarity classifier
`bge-m3` is already running in the stack. The right move is to reuse it as the backbone of a lightweight discriminative classifier rather than adding a second LLM.
**Implementation:**
1. Embed each query using `bge-m3` → fixed-size vector representation
2. Train a logistic regression (or an SVM with an RBF kernel) over those embeddings on a labeled dataset of `(query, type)` pairs
3. At inference: embed the query → score it with the trained model (for logistic regression, a single matrix multiply) → argmax with a confidence score
4. If the top-class probability is below a threshold (e.g. 0.85), fall through to Layer 3
The classifier head itself runs in microseconds on CPU; including the `bge-m3` embedding, the whole layer stays around a millisecond. No GPU, no Ollama call, no prompt templating.
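A toy version of steps 2–4, using scikit-learn's `LogisticRegression`. The real embeddings would come from `bge-m3`; here they are stubbed with synthetic vectors, and everything except the 0.85 threshold and the logistic-regression choice is illustrative.

```python
# Toy Layer 2: logistic regression over (stubbed) query embeddings.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Stand-in for bge-m3 embeddings of labeled (query, type) pairs:
# two well-separated clusters so the toy model can be confident.
X_train = np.vstack([rng.normal(0.0, 0.1, (50, 8)),
                     rng.normal(3.0, 0.1, (50, 8))])
y_train = np.array(["PLATFORM"] * 50 + ["GENERAL"] * 50)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)


def classify_or_fall_through(embedding: np.ndarray, threshold: float = 0.85):
    """Return (label, confidence); the caller falls through to
    Layer 3 whenever confidence < threshold."""
    probs = clf.predict_proba(embedding.reshape(1, -1))[0]
    best = int(np.argmax(probs))
    return clf.classes_[best], float(probs[best])
```

An embedding near a training cluster yields a high-confidence label, while one near the decision boundary falls below the threshold and would route to Layer 3.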
### The data flywheel: `classify_history_store` as training set
Every session already generates labeled examples. `classify_history_store` stores `(topic_snippet, type)` pairs that are implicitly validated by the system — if the user continued the conversation without correcting the assistant, the classification was likely correct.
```
classify_history_store → periodic export → labeled dataset → retrain classifier
```
The LLM classifier is the **teacher**. The embedding classifier is the **student**. This is knowledge distillation without the overhead of explicit distillation training — the teacher labels production traffic automatically.
**Data collection trigger:** When `classify_history_store` accumulates N sessions (suggested: 500), export and retrain. The classifier improves continuously without human labeling.
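The export-and-retrain trigger can be sketched as follows. The `(topic_snippet, type)` pair and the 500-session trigger come from this ADR; the record shape and function names are assumptions for illustration.

```python
# Hypothetical flywheel plumbing over classify_history_store records.
from dataclasses import dataclass
from typing import Iterable, List, Tuple

RETRAIN_EVERY_N_SESSIONS = 500  # suggested trigger from this ADR


@dataclass
class HistoryRecord:
    session_id: str       # assumed: which session produced the label
    topic_snippet: str    # stored by classify_history_store
    query_type: str       # teacher (LLM) label, implicitly validated


def export_training_set(records: Iterable[HistoryRecord]) -> List[Tuple[str, str]]:
    """Turn teacher-labeled history into student training pairs."""
    return [(r.topic_snippet, r.query_type) for r in records]


def should_retrain(records: Iterable[HistoryRecord]) -> bool:
    """Fire once enough distinct sessions have accumulated."""
    return len({r.session_id for r in records}) >= RETRAIN_EVERY_N_SESSIONS
```

When `should_retrain` fires, the exported pairs are embedded and the Layer 2 head is refit; no human labeling is involved at any point.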
### Caller-declared type (for platform-injected prompts)
The platform generates `PLATFORM` prompts — it always knows the type at generation time. Adding a `query_type` field to `AgentRequest` (proto field 7) allows the caller to declare the type explicitly. When set, all three classifier layers are bypassed entirely.
This makes RC-01 and RC-02 redundant for platform-generated traffic and eliminates the only remaining case where a generative model is used to classify structured platform data.
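A possible shape for the proto change. Only the field number 7 is from this ADR; the enum values (beyond `PLATFORM`) and the surrounding message layout are illustrative.

```proto
// Illustrative only: field number 7 comes from the ADR; names and
// enum membership beyond PLATFORM are assumptions.
enum QueryType {
  QUERY_TYPE_UNSPECIFIED = 0;  // caller did not declare; run the classifier layers
  PLATFORM = 1;
  // ... remaining types as defined by the routing contract ...
}

message AgentRequest {
  // ... existing fields 1-6 ...
  QueryType query_type = 7;  // when set, bypass all three classifier layers
}
```

Keeping `QUERY_TYPE_UNSPECIFIED = 0` as the default preserves backward compatibility: existing callers that never set the field continue to flow through the classifier pipeline unchanged.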
### Convergence path
| Phase | What changes | Expected Layer 3 traffic |
|---|---|---|
| Now (bootstrap) | LLM classifier for all unmatched queries | ~95% |
| Phase 1 | Collect labels via `classify_history_store` | ~95% |
| Phase 2 | Deploy embedding classifier (Layer 2) | ~10–20% |
| Phase 3 | Caller-declared type for platform prompts | <5% |
| Phase 4 | LLM classifier becomes anomaly handler only | <2% |
Phase 2 is the highest-leverage step: it replaces the most frequent code path (LLM inference per request) with a CPU-only operation, with no change to the routing contract or the downstream graph.