# Layer 2 Classifier — Training Pipeline

**Author:** Rafael Ruiz (CTO, 101OBEX Corp)
**Related:** ADR-0008 — Adaptive Query Routing

Part of **ADR-0008 Phase 2**: trains the embedding-based classifier that intercepts
queries before they reach the LLM (Layer 3), reducing per-request Ollama calls to
near zero for well-represented query types.

---

## Overview

The classifier embeds each query with **bge-m3** (already running in the stack),
trains a **LogisticRegression** on the resulting vectors, and serializes the model
with joblib. At engine startup, `graph.py` loads the model and uses it as Layer 2
in the classification pipeline.

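The embedding step can be sketched as follows. This is a minimal sketch, not the script itself: the helper names (`chunks`, `embed_query`, `embed_all`) and the batch size of 32 are assumptions, and Ollama's `/api/embeddings` endpoint is called one query at a time.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # or $OLLAMA_LOCAL_URL

def chunks(items, size=32):
    """Split a list into consecutive batches of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def embed_query(text, model="bge-m3", base_url=OLLAMA_URL):
    """Embed a single query via Ollama's /api/embeddings endpoint."""
    payload = json.dumps({"model": model, "prompt": text}).encode()
    req = urllib.request.Request(
        f"{base_url}/api/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["embedding"]  # 1024-dim vector for bge-m3

def embed_all(queries):
    """Embed queries batch by batch, mirroring the script's progress output."""
    vectors = []
    for i, batch in enumerate(chunks(queries), start=1):
        print(f"  Embedding batch {i} ({len(batch)} queries)...")
        vectors.extend(embed_query(q) for q in batch)
    return vectors
```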
```mermaid
flowchart TD
    Q([Query]) --> L1

    L1["Layer 1 — Hard rules\nRC-01 · RC-02\nO(1), deterministic"]
    L1 -->|match| R([Classification result])
    L1 -->|no match| L2

    L2["Layer 2 — Embedding classifier\nbge-m3 + LogisticRegression\n~1ms · CPU only · no LLM"]
    L2 -->|confidence ≥ 0.85| R
    L2 -->|confidence < 0.85| L3

    L3["Layer 3 — LLM classifier\nOllama fallback\n~300–800ms"]
    L3 --> R
```

If the model file does not exist, the engine starts normally and uses L3 only.

---

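The load-or-fallback behaviour and the confidence gate can be sketched roughly like this. The function names and the shape of the serialized payload are assumptions, not the actual `graph.py` code; the 0.85 threshold matches the diagram above.

```python
import os
import joblib

MODEL_PATH = os.environ.get("CLASSIFIER_MODEL_PATH", "/data/classifier_model.pkl")
CONFIDENCE_THRESHOLD = 0.85

def load_layer2():
    """Load the serialized classifier at startup; None means 'L3 only'."""
    if not os.path.exists(MODEL_PATH):
        print(f"[classifier/L2] model not found at {MODEL_PATH} — using LLM fallback only")
        return None
    return joblib.load(MODEL_PATH)

def classify_l2(model, embedding):
    """Return (label, confidence) when the classifier is confident, else None."""
    if model is None:
        return None  # no model on disk — fall through to Layer 3
    probs = model.predict_proba([embedding])[0]
    best = probs.argmax()
    if probs[best] < CONFIDENCE_THRESHOLD:
        return None  # low confidence — defer to the LLM classifier
    return model.classes_[best], probs[best]
```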
## Files

| File | Purpose |
|---|---|
| `train_classifier.py` | Training script |
| `seed_classifier_dataset.jsonl` | Labeled dataset (seed, 100 examples) |
| `requirements.txt` | Python dependencies for the training venv |

---

## Setup

```bash
cd scripts/pipelines/classifier

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

---

## Running the training

```bash
python train_classifier.py
```

Default behaviour (no flags needed if running from this directory):

- **data**: `seed_classifier_dataset.jsonl` in the same folder
- **ollama**: `http://localhost:11434` (or `$OLLAMA_LOCAL_URL`)
- **output**: `/data/classifier_model.pkl` (or `$CLASSIFIER_MODEL_PATH`)
- **min CV accuracy**: 0.90

The script exits with code 1 if CV accuracy is below the threshold — the model is
**not saved** in that case.

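The accuracy gate described above can be sketched as follows, assuming scikit-learn's `cross_val_score` and a joblib-serialized payload; the metadata keys (`classes`, `n_train`, `cv_mean`) mirror the startup log shown later, but the exact payload format is an assumption.

```python
import sys
import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def train_and_gate(X, y, output_path, min_cv_accuracy=0.90):
    """Train on embedding matrix X; refuse to save if 5-fold CV accuracy is too low."""
    clf = LogisticRegression(C=1.0, max_iter=1000)
    scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
    print(f"CV accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
    if scores.mean() < min_cv_accuracy:
        print("CV accuracy below threshold — model NOT saved")
        sys.exit(1)
    clf.fit(X, y)  # refit on the full dataset before serializing
    joblib.dump({"model": clf, "classes": sorted(set(y)), "n_train": len(y),
                 "cv_mean": round(float(scores.mean()), 2)}, output_path)
    return clf
```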
### All arguments

| Argument | Default | Description |
|---|---|---|
| `--data` | `seed_classifier_dataset.jsonl` | Path to labeled JSONL dataset |
| `--output` | `/data/classifier_model.pkl` | Output path for serialized model |
| `--ollama` | `http://localhost:11434` | Ollama base URL |
| `--min-cv-accuracy` | `0.90` | Minimum CV accuracy to save the model |

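A CLI matching the table could be declared with `argparse` as below. The env-var fallbacks follow the defaults listed earlier; the actual script's parser may differ in detail.

```python
import argparse
import os

def build_parser():
    """Argument parser mirroring the table above, with env-var fallbacks."""
    p = argparse.ArgumentParser(description="Train the Layer 2 classifier")
    p.add_argument("--data", default="seed_classifier_dataset.jsonl",
                   help="Path to labeled JSONL dataset")
    p.add_argument("--output",
                   default=os.environ.get("CLASSIFIER_MODEL_PATH",
                                          "/data/classifier_model.pkl"),
                   help="Output path for serialized model")
    p.add_argument("--ollama",
                   default=os.environ.get("OLLAMA_LOCAL_URL",
                                          "http://localhost:11434"),
                   help="Ollama base URL")
    p.add_argument("--min-cv-accuracy", type=float, default=0.90,
                   help="Minimum CV accuracy to save the model")
    return p
```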
### Custom output path (recommended when running from host)

```bash
python train_classifier.py \
  --output ../../../Docker/data/classifier_model.pkl
```

---

## Deploying the model to Docker

```mermaid
sequenceDiagram
    participant Host
    participant Docker

    Host->>Host: python train_classifier.py --output ./Docker/data/classifier_model.pkl
    Host->>Docker: docker cp classifier_model.pkl brunix-assistance-engine:/data/
    Host->>Docker: docker restart brunix-assistance-engine
    Docker-->>Host: [classifier/L2] model loaded from /data/classifier_model.pkl
```

On startup you will see in the logs:

```
[classifier/L2] model loaded from /data/classifier_model.pkl — {'classes': [...], 'n_train': 100, 'cv_mean': 0.96, ...}
```

If the model is missing:

```
[classifier/L2] model not found at /data/classifier_model.pkl — using LLM fallback only
```

---

## Dataset format

Every line is a JSON object with two required fields:

```json
{"query": "What is addVar in AVAP?", "type": "RETRIEVAL"}
```

Valid types: `RETRIEVAL`, `CODE_GENERATION`, `CONVERSATIONAL`, `PLATFORM`.

Lines with a missing `query` or `type`, or invalid JSON, are skipped with a warning.

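The validation described above can be sketched like this (a hypothetical `load_dataset` helper, not the script's actual function):

```python
import json

VALID_TYPES = {"RETRIEVAL", "CODE_GENERATION", "CONVERSATIONAL", "PLATFORM"}

def load_dataset(path):
    """Load (query, type) pairs, skipping malformed lines with a warning."""
    examples = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                obj = json.loads(line)
            except json.JSONDecodeError:
                print(f"  warning: line {lineno} is not valid JSON — skipped")
                continue
            if not obj.get("query") or obj.get("type") not in VALID_TYPES:
                print(f"  warning: line {lineno} missing query/type — skipped")
                continue
            examples.append((obj["query"], obj["type"]))
    return examples
```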
### Merging production exports with the seed

```mermaid
flowchart LR
    S[seed_classifier_dataset.jsonl] --> M
    E["/data/classifier_labels/\nclassifier_labels_*.jsonl\n(production exports)"] --> M
    M([merge]) --> D[merged_dataset.jsonl]
    D --> T[train_classifier.py]
    T --> P[classifier_model.pkl]
    P --> E2[engine restart]
```

```bash
cat seed_classifier_dataset.jsonl \
    /data/classifier_labels/classifier_labels_*.jsonl \
    > merged_dataset.jsonl

python train_classifier.py --data merged_dataset.jsonl
```

The engine exports labeled data automatically to `CLASSIFIER_EXPORT_DIR`
(default `/data/classifier_labels/`) once `CLASSIFIER_EXPORT_THRESHOLD` sessions
accumulate. Those files use the same format and can be merged directly.

---

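Note that a plain `cat` keeps duplicate queries when a production export overlaps the seed. If that matters, a deduplicating merge might look like this; `merge_datasets` is a hypothetical helper, and keeping the last label seen is an assumed policy, not the project's:

```python
import json

def merge_datasets(paths, output_path):
    """Concatenate JSONL datasets, keeping one label per unique query (last wins)."""
    by_query = {}
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                try:
                    obj = json.loads(line)
                except json.JSONDecodeError:
                    continue  # malformed lines are dropped, as in training
                if obj.get("query"):
                    by_query[obj["query"]] = obj
    with open(output_path, "w", encoding="utf-8") as out:
        for obj in by_query.values():
            out.write(json.dumps(obj, ensure_ascii=False) + "\n")
    return len(by_query)
```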
## Expected output

```
[1/4] Loading data from seed_classifier_dataset.jsonl
  100 examples loaded
  Distribution: {'RETRIEVAL': 25, 'CODE_GENERATION': 25, 'CONVERSATIONAL': 25, 'PLATFORM': 25}

[2/4] Embedding with bge-m3 via http://localhost:11434
  Embedding batch 1/4 (32 queries)...
  Embedding batch 2/4 (32 queries)...
  Embedding batch 3/4 (32 queries)...
  Embedding batch 4/4 (4 queries)...
  Embedding matrix: (100, 1024)

[3/4] Training LogisticRegression (C=1.0) with 5-fold CV
  CV accuracy: 0.970 ± 0.021 (folds: [0.95, 1.0, 0.95, 0.95, 1.0])

  Per-class report:
                   precision    recall  f1-score   support
  CODE_GENERATION       1.00      0.96      0.98        25
   CONVERSATIONAL       0.96      1.00      0.98        25
         PLATFORM       1.00      1.00      1.00        25
        RETRIEVAL       0.96      0.96      0.96        25

[4/4] Saving model to ../../../Docker/data/classifier_model.pkl
  Model saved → ../../../Docker/data/classifier_model.pkl

Done. Classes: ['CODE_GENERATION', 'CONVERSATIONAL', 'PLATFORM', 'RETRIEVAL']
```

---

## Troubleshooting

**`CV accuracy below threshold`** — Add more examples to the underperforming class
(check the per-class recall column). 5–10 extra examples per class usually suffice.

**`langchain-ollama not installed`** — Run `pip install -r requirements.txt` inside
the venv.

**Ollama connection error** — Verify that Ollama is running and bge-m3 is pulled:

```bash
curl http://localhost:11434/api/tags | grep bge-m3
# if missing:
ollama pull bge-m3
```

**Container not picking up the model** — The engine loads the model once at startup.
A `docker restart` is required after every `docker cp`.