From 273049b7051e2777404450541d2a39bfca940a9d Mon Sep 17 00:00:00 2001 From: rafa-ruiz Date: Thu, 9 Apr 2026 19:59:34 -0700 Subject: [PATCH] =?UTF-8?q?[DOC]=20ADR-0008:=20add=20Future=20Path=20?= =?UTF-8?q?=E2=80=94=20discriminative=20classifier=20pipeline?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Claude Sonnet 4.6 --- ...8-adaptive-query-routing-intent-history.md | 76 ++++++++++++++++++- 1 file changed, 75 insertions(+), 1 deletion(-) diff --git a/docs/ADR/ADR-0008-adaptive-query-routing-intent-history.md b/docs/ADR/ADR-0008-adaptive-query-routing-intent-history.md index 31225a8..8541408 100644 --- a/docs/ADR/ADR-0008-adaptive-query-routing-intent-history.md +++ b/docs/ADR/ADR-0008-adaptive-query-routing-intent-history.md @@ -171,5 +171,79 @@ Changing which types map to which model slot requires updating this contract. ### Open questions -- Whether the classifier should be upgraded to a more capable model in the future (at the cost of latency/resources) - Whether `PLATFORM` should eventually split into sub-types (e.g. `PLATFORM_METRICS` vs `PLATFORM_BILLING`) as the platform data schema grows + +--- + +## Future Path: Discriminative Classifier Pipeline + +### The fundamental problem with the current design + +The LLM classifier is a **generative model doing discriminative work**. Generating tokens to produce a 4-class label wastes orders of magnitude more compute than the task requires, introduces non-determinism, and forces prompt engineering as a substitute for proper model design. The rules in RC-01–RC-06 exist precisely to compensate for this architectural mismatch. + +The current design is correct as a **bootstrap mechanism** — it lets the system operate before labeled training data exists. But it should not be the steady-state architecture. 

### Target architecture

A layered pipeline in which each layer is invoked only if the previous layer cannot produce a confident answer:

```
Query
 │
 ▼
[Layer 1] Hard rules (RC-01, RC-02) ← O(1), deterministic
 │ no match
 ▼
[Layer 2] Embedding similarity classifier ← ~1ms, CPU, no LLM
 │ confidence < threshold
 ▼
[Layer 3] LLM classifier (current design) ← fallback only
 │
 ▼
Classification result
```

In steady state, Layer 3 should handle fewer than 5% of requests — only genuinely ambiguous queries that neither rules nor the trained classifier can resolve with confidence.

### Layer 2: Embedding similarity classifier

`bge-m3` is already running in the stack. The right move is to reuse it as the backbone of a lightweight discriminative classifier rather than adding a second LLM.

**Implementation:**

1. Embed each query using `bge-m3` → fixed-size vector representation
2. Train a logistic regression over those embeddings on a labeled dataset of `(query, type)` pairs (an SVM also works, but logistic regression yields the calibrated class probabilities that step 4 relies on)
3. At inference: embed query → score the embedding with the trained classifier head → argmax with confidence score
4. If `max(softmax(logits)) < threshold` (e.g. 0.85), fall through to Layer 3

The embedding call costs roughly a millisecond on CPU and the classifier head itself runs in microseconds. No GPU, no Ollama call, no prompt templating.

### The data flywheel: `classify_history_store` as training set

Every session already generates labeled examples. `classify_history_store` stores `(topic_snippet, type)` pairs that are implicitly validated by the system — if the user continued the conversation without correcting the assistant, the classification was likely correct. These are weak labels, so the export step should filter or down-weight sessions that contain correction signals.

```
classify_history_store → periodic export → labeled dataset → retrain classifier
```

The LLM classifier is the **teacher**. The embedding classifier is the **student**. This is knowledge distillation without the overhead of explicit distillation training — the teacher labels production traffic automatically.
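The Layer 2 inference path described above can be sketched as follows. This is a minimal illustration, not the implementation: `embed` stands in for the real `bge-m3` call, the weights and biases would come from the trained logistic regression, and all names and labels here are hypothetical.

```python
import math

def softmax(logits):
    # Numerically stable softmax over raw per-class scores.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify_layer2(query, embed, weights, biases, classes, threshold=0.85):
    """Layer 2 inference: return (label, confidence), or None to fall
    through to the Layer 3 LLM classifier."""
    vec = embed(query)  # in production: bge-m3 embedding (~1 ms on CPU)
    # One linear head per class, as learned by the logistic regression.
    logits = [sum(wi * vi for wi, vi in zip(w, vec)) + b
              for w, b in zip(weights, biases)]
    probs = softmax(logits)
    confidence = max(probs)
    if confidence < threshold:
        return None  # not confident enough: let Layer 3 handle it
    return classes[probs.index(confidence)], confidence
```

With a stub embedder and toy weights, a high-margin query yields a label while an ambiguous embedding returns `None` and falls through, which is exactly the threshold behavior of step 4.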
+ +**Data collection trigger:** When `classify_history_store` accumulates N sessions (suggested: 500), export and retrain. The classifier improves continuously without human labeling. + +### Caller-declared type (for platform-injected prompts) + +The platform generates `PLATFORM` prompts — it always knows the type at generation time. Adding a `query_type` field to `AgentRequest` (proto field 7) allows the caller to declare the type explicitly. When set, all three classifier layers are bypassed entirely. + +This makes RC-01 and RC-02 redundant for platform-generated traffic and eliminates the only remaining case where a generative model is used to classify structured platform data. + +### Convergence path + +| Phase | What changes | Expected Layer 3 traffic | +|---|---|---| +| Now (bootstrap) | LLM classifier for all unmatched queries | ~95% | +| Phase 1 | Collect labels via `classify_history_store` | ~95% | +| Phase 2 | Deploy embedding classifier (Layer 2) | ~10–20% | +| Phase 3 | Caller-declared type for platform prompts | <5% | +| Phase 4 | LLM classifier becomes anomaly handler only | <2% | + +Phase 2 is the highest-leverage step: it replaces the most frequent code path (LLM inference per request) with a CPU-only operation, with no change to the routing contract or the downstream graph.
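
The caller-declared bypass plus layered fall-through described above can be sketched as follows. The Python `AgentRequest` shape merely mirrors the proposed proto message (`query_type` as field 7), and the layer callables are placeholders; this is an assumption-laden sketch, not the routing contract.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

@dataclass
class AgentRequest:
    # Illustrative mirror of the proto message; query_type stands in
    # for the proposed field 7, set by the platform-generated prompts.
    query: str
    query_type: Optional[str] = None

def classify(req: AgentRequest,
             layers: Sequence[Callable[[str], Optional[str]]]) -> str:
    """Run the layered pipeline. Each layer returns a type label,
    or None to fall through to the next layer."""
    if req.query_type is not None:
        # Caller-declared type: skip all three classifier layers.
        return req.query_type
    for layer in layers:
        label = layer(req.query)
        if label is not None:
            return label
    raise RuntimeError("the final (LLM) layer must always return a label")
```

Layers 1 and 2 return `None` when they cannot decide; Layer 3, the LLM fallback, always returns a label, which keeps the pipeline total.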