From 273049b7051e2777404450541d2a39bfca940a9d Mon Sep 17 00:00:00 2001 From: rafa-ruiz Date: Thu, 9 Apr 2026 19:59:34 -0700 Subject: [PATCH] =?UTF-8?q?[DOC]=20ADR-0008:=20add=20Future=20Path=20?= =?UTF-8?q?=E2=80=94=20discriminative=20classifier=20pipeline?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Claude Sonnet 4.6 --- ...8-adaptive-query-routing-intent-history.md | 76 ++++++++++++++++++- 1 file changed, 75 insertions(+), 1 deletion(-) diff --git a/docs/ADR/ADR-0008-adaptive-query-routing-intent-history.md b/docs/ADR/ADR-0008-adaptive-query-routing-intent-history.md index 31225a8..8541408 100644 --- a/docs/ADR/ADR-0008-adaptive-query-routing-intent-history.md +++ b/docs/ADR/ADR-0008-adaptive-query-routing-intent-history.md @@ -171,5 +171,79 @@ Changing which types map to which model slot requires updating this contract. ### Open questions -- Whether the classifier should be upgraded to a more capable model in the future (at the cost of latency/resources) - Whether `PLATFORM` should eventually split into sub-types (e.g. `PLATFORM_METRICS` vs `PLATFORM_BILLING`) as the platform data schema grows + +--- + +## Future Path: Discriminative Classifier Pipeline + +### The fundamental problem with the current design + +The LLM classifier is a **generative model doing discriminative work**. Generating tokens to produce a 4-class label wastes orders of magnitude more compute than the task requires, introduces non-determinism, and forces prompt engineering as a substitute for proper model design. The rules in RC-01–RC-06 exist precisely to compensate for this architectural mismatch. + +The current design is correct as a **bootstrap mechanism** — it lets the system operate before labeled training data exists. But it should not be the steady-state architecture. 

### Target architecture

A layered pipeline in which each layer is invoked only if the previous layer cannot produce a confident answer:

```
Query
 │
 ▼
[Layer 1] Hard rules (RC-01, RC-02) ← O(1), deterministic
 │ no match
 ▼
[Layer 2] Embedding similarity classifier ← ~1ms, CPU, no LLM
 │ confidence < threshold
 ▼
[Layer 3] LLM classifier (current design) ← fallback only
 │
 ▼
Classification result
```

In steady state, Layer 3 should handle fewer than 5% of requests — only genuinely ambiguous queries that neither rules nor the trained classifier can resolve with confidence.

### Layer 2: Embedding similarity classifier

`bge-m3` is already running in the stack. The right move is to reuse it as the backbone of a lightweight discriminative classifier rather than adding a second LLM.

**Implementation:**

1. Embed each query using `bge-m3` → fixed-size vector representation
2. Train a logistic regression over those embeddings on a labeled dataset of `(query, type)` pairs (an SVM also works, but logistic regression yields the calibrated class probabilities that step 4 relies on)
3. At inference: embed query → score the embedding with the trained classifier head → argmax with confidence score
4. If `max(softmax(logits)) < threshold` (e.g. 0.85), fall through to Layer 3

The embedding call costs roughly a millisecond on CPU and the classifier head itself runs in microseconds. No GPU, no Ollama call, no prompt templating.

### The data flywheel: `classify_history_store` as training set

Every session already generates labeled examples. `classify_history_store` stores `(topic_snippet, type)` pairs that are implicitly validated by the system — if the user continued the conversation without correcting the assistant, the classification was likely correct. These are weak labels, so the export step should filter or down-weight sessions that contain correction signals.

```
classify_history_store → periodic export → labeled dataset → retrain classifier
```

The LLM classifier is the **teacher**. The embedding classifier is the **student**. This is knowledge distillation without the overhead of explicit distillation training — the teacher labels production traffic automatically.
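The Layer 2 inference path described above can be sketched as follows. This is a minimal illustration, not the implementation: `embed` stands in for the real `bge-m3` call, the weights and biases would come from the trained logistic regression, and all names and labels here are hypothetical.

```python
import math

def softmax(logits):
    # Numerically stable softmax over raw per-class scores.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify_layer2(query, embed, weights, biases, classes, threshold=0.85):
    """Layer 2 inference: return (label, confidence), or None to fall
    through to the Layer 3 LLM classifier."""
    vec = embed(query)  # in production: bge-m3 embedding (~1 ms on CPU)
    # One linear head per class, as learned by the logistic regression.
    logits = [sum(wi * vi for wi, vi in zip(w, vec)) + b
              for w, b in zip(weights, biases)]
    probs = softmax(logits)
    confidence = max(probs)
    if confidence < threshold:
        return None  # not confident enough: let Layer 3 handle it
    return classes[probs.index(confidence)], confidence
```

With a stub embedder and toy weights, a high-margin query yields a label while an ambiguous embedding returns `None` and falls through, which is exactly the threshold behavior of step 4.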
+ +**Data collection trigger:** When `classify_history_store` accumulates N sessions (suggested: 500), export and retrain. The classifier improves continuously without human labeling. + +### Caller-declared type (for platform-injected prompts) + +The platform generates `PLATFORM` prompts — it always knows the type at generation time. Adding a `query_type` field to `AgentRequest` (proto field 7) allows the caller to declare the type explicitly. When set, all three classifier layers are bypassed entirely. + +This makes RC-01 and RC-02 redundant for platform-generated traffic and eliminates the only remaining case where a generative model is used to classify structured platform data. + +### Convergence path + +| Phase | What changes | Expected Layer 3 traffic | +|---|---|---| +| Now (bootstrap) | LLM classifier for all unmatched queries | ~95% | +| Phase 1 | Collect labels via `classify_history_store` | ~95% | +| Phase 2 | Deploy embedding classifier (Layer 2) | ~10–20% | +| Phase 3 | Caller-declared type for platform prompts | <5% | +| Phase 4 | LLM classifier becomes anomaly handler only | <2% | + +Phase 2 is the highest-leverage step: it replaces the most frequent code path (LLM inference per request) with a CPU-only operation, with no change to the routing contract or the downstream graph.
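
The caller-declared bypass plus layered fall-through described above can be sketched as follows. The Python `AgentRequest` shape merely mirrors the proposed proto message (`query_type` as field 7), and the layer callables are placeholders; this is an assumption-laden sketch, not the routing contract.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

@dataclass
class AgentRequest:
    # Illustrative mirror of the proto message; query_type stands in
    # for the proposed field 7, set by the platform-generated prompts.
    query: str
    query_type: Optional[str] = None

def classify(req: AgentRequest,
             layers: Sequence[Callable[[str], Optional[str]]]) -> str:
    """Run the layered pipeline. Each layer returns a type label,
    or None to fall through to the next layer."""
    if req.query_type is not None:
        # Caller-declared type: skip all three classifier layers.
        return req.query_type
    for layer in layers:
        label = layer(req.query)
        if label is not None:
            return label
    raise RuntimeError("the final (LLM) layer must always return a label")
```

Layers 1 and 2 return `None` when they cannot decide; Layer 3, the LLM fallback, always returns a label, which keeps the pipeline total.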