# Layer 2 Classifier — Training Pipeline
**Author:** Rafael Ruiz (CTO, 101OBEX Corp)
**Related:** ADR-0008 — Adaptive Query Routing
Part of **ADR-0008 Phase 2**: trains the embedding-based classifier that intercepts
queries before they reach the LLM (Layer 3), reducing per-request Ollama calls to
near zero for well-represented query types.
---
## Overview
The classifier embeds each query with **bge-m3** (already running in the stack),
trains a **LogisticRegression** on the resulting vectors, and serializes the model
with joblib. At engine startup, `graph.py` loads the model and uses it as Layer 2
in the classification pipeline.
```mermaid
flowchart TD
Q([Query]) --> L1
L1["Layer 1 — Hard rules\nRC-01 · RC-02\nO(1), deterministic"]
L1 -->|match| R([Classification result])
L1 -->|no match| L2
L2["Layer 2 — Embedding classifier\nbge-m3 + LogisticRegression\n~1ms · CPU only · no LLM"]
L2 -->|confidence ≥ 0.85| R
L2 -->|confidence < 0.85| L3
L3["Layer 3 — LLM classifier\nOllama fallback\n~300–800ms"]
L3 --> R
```
If the model file does not exist, the engine starts normally and uses L3 only.
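The Layer 2 gate in the diagram can be sketched as follows. This is a minimal illustration, not the actual `graph.py` code: `classify_layer2` and the stub model interface are hypothetical names, but the 0.85 threshold and the missing-model fallback match the behaviour described here.

```python
# Sketch of the Layer 2 confidence gate (hypothetical helper; the real
# logic lives in graph.py). Works with any model exposing scikit-learn's
# predict_proba / classes_ interface, e.g. the joblib-loaded model.
CONFIDENCE_THRESHOLD = 0.85

def classify_layer2(model, embedding):
    """Return (label, confident). confident=False means: fall through to L3."""
    if model is None:
        # Model file missing: engine runs with the LLM fallback (L3) only.
        return None, False
    probs = model.predict_proba([embedding])[0]
    best = max(range(len(probs)), key=probs.__getitem__)
    return model.classes_[best], probs[best] >= CONFIDENCE_THRESHOLD
```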
---
## Files
| File | Purpose |
|---|---|
| `train_classifier.py` | Training script |
| `seed_classifier_dataset.jsonl` | Labeled dataset (seed, 100 examples) |
| `requirements.txt` | Python dependencies for the training venv |
---
## Setup
```bash
cd scripts/pipelines/classifier
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
---
## Running the training
```bash
python train_classifier.py
```
Default behaviour (no flags needed if running from this directory):
- **data**: `seed_classifier_dataset.jsonl` in the same folder
- **ollama**: `http://localhost:11434` (or `$OLLAMA_LOCAL_URL`)
- **output**: `/data/classifier_model.pkl` (or `$CLASSIFIER_MODEL_PATH`)
- **min CV accuracy**: 0.90
The script exits with code 1 if CV accuracy is below the threshold — the model is
**not saved** in that case.
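The accuracy gate can be sketched like this. It is illustrative only (`cv_gate` is a hypothetical name), but the threshold semantics and the exit code mirror the behaviour above.

```python
# Illustrative version of the CV gate: the model is only saved when the
# mean cross-validation accuracy clears --min-cv-accuracy.
def cv_gate(cv_scores, min_cv_accuracy=0.90):
    """Return the process exit code: 0 = save the model, 1 = reject it."""
    mean = sum(cv_scores) / len(cv_scores)
    if mean < min_cv_accuracy:
        print(f"CV accuracy {mean:.3f} below threshold {min_cv_accuracy}: model not saved")
        return 1
    return 0
```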
### All arguments
| Argument | Default | Description |
|---|---|---|
| `--data` | `seed_classifier_dataset.jsonl` | Path to labeled JSONL dataset |
| `--output` | `/data/classifier_model.pkl` | Output path for serialized model |
| `--ollama` | `http://localhost:11434` | Ollama base URL |
| `--min-cv-accuracy` | `0.90` | Minimum CV accuracy to save the model |
### Custom output path (recommended when running from host)
```bash
python train_classifier.py \
--output ../../../Docker/data/classifier_model.pkl
```
---
## Deploying the model to Docker
```mermaid
sequenceDiagram
participant Host
participant Docker
Host->>Host: python train_classifier.py --output ./Docker/data/classifier_model.pkl
Host->>Docker: docker cp classifier_model.pkl brunix-assistance-engine:/data/
Host->>Docker: docker restart brunix-assistance-engine
Docker-->>Host: [classifier/L2] model loaded from /data/classifier_model.pkl
```
On startup you will see in the logs:
```
[classifier/L2] model loaded from /data/classifier_model.pkl — {'classes': [...], 'n_train': 100, 'cv_mean': 0.96, ...}
```
If the model is missing:
```
[classifier/L2] model not found at /data/classifier_model.pkl — using LLM fallback only
```
---
## Dataset format
Every line is a JSON object with two required fields:
```json
{"query": "What is addVar in AVAP?", "type": "RETRIEVAL"}
```
Valid types: `RETRIEVAL`, `CODE_GENERATION`, `CONVERSATIONAL`, `PLATFORM`.
Lines with invalid JSON, or with a missing `query` or `type`, are skipped with a warning.
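A loader matching these rules could look like the following sketch (hypothetical `load_dataset` helper; the validation rules are the ones stated above):

```python
# Minimal JSONL loader: each line must be valid JSON with a non-empty
# "query" and a recognized "type"; anything else is skipped with a warning.
import json

VALID_TYPES = {"RETRIEVAL", "CODE_GENERATION", "CONVERSATIONAL", "PLATFORM"}

def load_dataset(lines):
    examples = []
    for i, line in enumerate(lines, 1):
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            print(f"warning: line {i}: invalid JSON, skipped")
            continue
        if not obj.get("query") or obj.get("type") not in VALID_TYPES:
            print(f"warning: line {i}: missing/invalid query or type, skipped")
            continue
        examples.append((obj["query"], obj["type"]))
    return examples
```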
### Merging production exports with the seed
```mermaid
flowchart LR
S[seed_classifier_dataset.jsonl] --> M
E["/data/classifier_labels/\nclassifier_labels_*.jsonl\n(production exports)"] --> M
M([merge]) --> D[merged_dataset.jsonl]
D --> T[train_classifier.py]
T --> P[classifier_model.pkl]
P --> E2[engine restart]
```
```bash
cat seed_classifier_dataset.jsonl \
/data/classifier_labels/classifier_labels_*.jsonl \
> merged_dataset.jsonl
python train_classifier.py --data merged_dataset.jsonl
```
The engine exports labeled data automatically to `CLASSIFIER_EXPORT_DIR`
(default `/data/classifier_labels/`) once `CLASSIFIER_EXPORT_THRESHOLD` sessions
accumulate. Those files use the same format and can be merged directly.
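The plain `cat` merge above works as-is; if the exports start repeating queries already present in the seed, a de-duplicating variant can be sketched like this (`merge_datasets` is a hypothetical helper, keyed on exact query text):

```python
# Merge several JSONL dataset files into one, keeping only the first
# occurrence of each query string. Invalid lines are silently dropped.
import json

def merge_datasets(paths, out_path):
    seen, kept = set(), 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in paths:
            with open(path, encoding="utf-8") as f:
                for line in f:
                    line = line.strip()
                    if not line:
                        continue
                    try:
                        query = json.loads(line).get("query")
                    except json.JSONDecodeError:
                        continue
                    if not query or query in seen:
                        continue
                    seen.add(query)
                    out.write(line + "\n")
                    kept += 1
    return kept
```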
---
## Expected output
```
[1/4] Loading data from seed_classifier_dataset.jsonl
100 examples loaded
Distribution: {'RETRIEVAL': 25, 'CODE_GENERATION': 25, 'CONVERSATIONAL': 25, 'PLATFORM': 25}
[2/4] Embedding with bge-m3 via http://localhost:11434
Embedding batch 1/4 (32 queries)...
Embedding batch 2/4 (32 queries)...
Embedding batch 3/4 (32 queries)...
Embedding batch 4/4 (4 queries)...
Embedding matrix: (100, 1024)
[3/4] Training LogisticRegression (C=1.0) with 5-fold CV
CV accuracy: 0.970 ± 0.021 (folds: [0.95, 1.0, 0.95, 0.95, 1.0])
    Per-class report:
                     precision    recall  f1-score   support
    CODE_GENERATION       1.00      0.96      0.98        25
    CONVERSATIONAL        0.96      1.00      0.98        25
    PLATFORM              1.00      1.00      1.00        25
    RETRIEVAL             0.96      0.96      0.96        25
[4/4] Saving model to ../../../Docker/data/classifier_model.pkl
Model saved → ../../../Docker/data/classifier_model.pkl
Done. Classes: ['CODE_GENERATION', 'CONVERSATIONAL', 'PLATFORM', 'RETRIEVAL']
```
---
## Troubleshooting
**`CV accuracy below threshold`** — Add more examples to the underperforming class
(check the per-class recall column). 5–10 extra examples per class usually suffice.
**`langchain-ollama not installed`** — Run `pip install -r requirements.txt` inside
the venv.
**Ollama connection error** — Verify Ollama is running and bge-m3 is pulled:
```bash
curl http://localhost:11434/api/tags | grep bge-m3
# if missing:
ollama pull bge-m3
```
**Container not picking up the model** — The engine loads the model once at startup.
A `docker restart` is required after every `docker cp`.