# Layer 2 Classifier — Training Pipeline
**Author:** Rafael Ruiz (CTO, 101OBEX Corp)
**Related:** ADR-0008 — Adaptive Query Routing
Part of **ADR-0008 Phase 2**: trains the embedding-based classifier that intercepts
queries before they reach the LLM (Layer 3), reducing per-request Ollama calls to
near zero for well-represented query types.
---
## Overview
The classifier embeds each query with **bge-m3** (already running in the stack),
trains a **LogisticRegression** on the resulting vectors, and serializes the model
with joblib. At engine startup, `graph.py` loads the model and uses it as Layer 2
in the classification pipeline.
```mermaid
flowchart TD
Q([Query]) --> L1
L1["Layer 1 — Hard rules\nRC-01 · RC-02\nO(1), deterministic"]
L1 -->|match| R([Classification result])
L1 -->|no match| L2
L2["Layer 2 — Embedding classifier\nbge-m3 + LogisticRegression\n~1ms · CPU only · no LLM"]
L2 -->|confidence ≥ 0.85| R
L2 -->|confidence < 0.85| L3
L3["Layer 3 — LLM classifier\nOllama fallback\n~300–800ms"]
L3 --> R
```
If the model file does not exist, the engine starts normally and uses L3 only.
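The Layer 2 gate in the diagram can be sketched as follows. This is a minimal illustration, not the actual `graph.py` code: `classify_layer2` and the stub model interface are hypothetical names, but the 0.85 threshold and the missing-model fallback match the behaviour described here.

```python
# Sketch of the Layer 2 confidence gate (hypothetical helper; the real
# logic lives in graph.py). Works with any model exposing scikit-learn's
# predict_proba / classes_ interface, e.g. the joblib-loaded model.
CONFIDENCE_THRESHOLD = 0.85

def classify_layer2(model, embedding):
    """Return (label, confident). confident=False means: fall through to L3."""
    if model is None:
        # Model file missing: engine runs with the LLM fallback (L3) only.
        return None, False
    probs = model.predict_proba([embedding])[0]
    best = max(range(len(probs)), key=probs.__getitem__)
    return model.classes_[best], probs[best] >= CONFIDENCE_THRESHOLD
```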
---
## Files
| File | Purpose |
|---|---|
| `train_classifier.py` | Training script |
| `seed_classifier_dataset.jsonl` | Labeled dataset (seed, 100 examples) |
| `requirements.txt` | Python dependencies for the training venv |
---
## Setup
```bash
cd scripts/pipelines/classifier
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
---
## Running the training
```bash
python train_classifier.py
```
Default behaviour (no flags needed if running from this directory):
- **data**: `seed_classifier_dataset.jsonl` in the same folder
- **ollama**: `http://localhost:11434` (or `$OLLAMA_LOCAL_URL`)
- **output**: `/data/classifier_model.pkl` (or `$CLASSIFIER_MODEL_PATH`)
- **min CV accuracy**: 0.90
The script exits with code 1 if CV accuracy is below the threshold — the model is
**not saved** in that case.
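The accuracy gate can be sketched like this. It is illustrative only (`cv_gate` is a hypothetical name), but the threshold semantics and the exit code mirror the behaviour above.

```python
# Illustrative version of the CV gate: the model is only saved when the
# mean cross-validation accuracy clears --min-cv-accuracy.
def cv_gate(cv_scores, min_cv_accuracy=0.90):
    """Return the process exit code: 0 = save the model, 1 = reject it."""
    mean = sum(cv_scores) / len(cv_scores)
    if mean < min_cv_accuracy:
        print(f"CV accuracy {mean:.3f} below threshold {min_cv_accuracy}: model not saved")
        return 1
    return 0
```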
### All arguments
| Argument | Default | Description |
|---|---|---|
| `--data` | `seed_classifier_dataset.jsonl` | Path to labeled JSONL dataset |
| `--output` | `/data/classifier_model.pkl` | Output path for serialized model |
| `--ollama` | `http://localhost:11434` | Ollama base URL |
| `--min-cv-accuracy` | `0.90` | Minimum CV accuracy to save the model |
### Custom output path (recommended when running from host)
```bash
python train_classifier.py \
--output ../../../Docker/data/classifier_model.pkl
```
---
## Deploying the model to Docker
```mermaid
sequenceDiagram
participant Host
participant Docker
Host->>Host: python train_classifier.py --output ./Docker/data/classifier_model.pkl
Host->>Docker: docker cp classifier_model.pkl brunix-assistance-engine:/data/
Host->>Docker: docker restart brunix-assistance-engine
Docker-->>Host: [classifier/L2] model loaded from /data/classifier_model.pkl
```
On startup you will see in the logs:
```
[classifier/L2] model loaded from /data/classifier_model.pkl — {'classes': [...], 'n_train': 100, 'cv_mean': 0.96, ...}
```
If the model is missing:
```
[classifier/L2] model not found at /data/classifier_model.pkl — using LLM fallback only
```
---
## Dataset format
Every line is a JSON object with two required fields:
```json
{"query": "What is addVar in AVAP?", "type": "RETRIEVAL"}
```
Valid types: `RETRIEVAL`, `CODE_GENERATION`, `CONVERSATIONAL`, `PLATFORM`.
Lines with invalid JSON, or with a missing `query` or `type`, are skipped with a warning.
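A loader matching these rules could look like the following sketch (hypothetical `load_dataset` helper; the validation rules are the ones stated above):

```python
# Minimal JSONL loader: each line must be valid JSON with a non-empty
# "query" and a recognized "type"; anything else is skipped with a warning.
import json

VALID_TYPES = {"RETRIEVAL", "CODE_GENERATION", "CONVERSATIONAL", "PLATFORM"}

def load_dataset(lines):
    examples = []
    for i, line in enumerate(lines, 1):
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            print(f"warning: line {i}: invalid JSON, skipped")
            continue
        if not obj.get("query") or obj.get("type") not in VALID_TYPES:
            print(f"warning: line {i}: missing/invalid query or type, skipped")
            continue
        examples.append((obj["query"], obj["type"]))
    return examples
```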
### Merging production exports with the seed
```mermaid
flowchart LR
S[seed_classifier_dataset.jsonl] --> M
E["/data/classifier_labels/\nclassifier_labels_*.jsonl\n(production exports)"] --> M
M([merge]) --> D[merged_dataset.jsonl]
D --> T[train_classifier.py]
T --> P[classifier_model.pkl]
P --> E2[engine restart]
```
```bash
cat seed_classifier_dataset.jsonl \
/data/classifier_labels/classifier_labels_*.jsonl \
> merged_dataset.jsonl
python train_classifier.py --data merged_dataset.jsonl
```
The engine exports labeled data automatically to `CLASSIFIER_EXPORT_DIR`
(default `/data/classifier_labels/`) once `CLASSIFIER_EXPORT_THRESHOLD` sessions
accumulate. Those files use the same format and can be merged directly.
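The plain `cat` merge above works as-is; if the exports start repeating queries already present in the seed, a de-duplicating variant can be sketched like this (`merge_datasets` is a hypothetical helper, keyed on exact query text):

```python
# Merge several JSONL dataset files into one, keeping only the first
# occurrence of each query string. Invalid lines are silently dropped.
import json

def merge_datasets(paths, out_path):
    seen, kept = set(), 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in paths:
            with open(path, encoding="utf-8") as f:
                for line in f:
                    line = line.strip()
                    if not line:
                        continue
                    try:
                        query = json.loads(line).get("query")
                    except json.JSONDecodeError:
                        continue
                    if not query or query in seen:
                        continue
                    seen.add(query)
                    out.write(line + "\n")
                    kept += 1
    return kept
```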
---
## Expected output
```
[1/4] Loading data from seed_classifier_dataset.jsonl
100 examples loaded
Distribution: {'RETRIEVAL': 25, 'CODE_GENERATION': 25, 'CONVERSATIONAL': 25, 'PLATFORM': 25}
[2/4] Embedding with bge-m3 via http://localhost:11434
Embedding batch 1/4 (32 queries)...
Embedding batch 2/4 (32 queries)...
Embedding batch 3/4 (32 queries)...
Embedding batch 4/4 (4 queries)...
Embedding matrix: (100, 1024)
[3/4] Training LogisticRegression (C=1.0) with 5-fold CV
CV accuracy: 0.970 ± 0.021 (folds: [0.95, 1.0, 0.95, 0.95, 1.0])
    Per-class report:
                     precision    recall  f1-score   support
    CODE_GENERATION       1.00      0.96      0.98        25
    CONVERSATIONAL        0.96      1.00      0.98        25
    PLATFORM              1.00      1.00      1.00        25
    RETRIEVAL             0.96      0.96      0.96        25
[4/4] Saving model to ../../../Docker/data/classifier_model.pkl
Model saved → ../../../Docker/data/classifier_model.pkl
Done. Classes: ['CODE_GENERATION', 'CONVERSATIONAL', 'PLATFORM', 'RETRIEVAL']
```
---
## Troubleshooting
**`CV accuracy below threshold`** — Add more examples to the underperforming class
(check the per-class recall column). 5–10 extra examples per class usually suffice.
**`langchain-ollama not installed`** — Run `pip install -r requirements.txt` inside
the venv.
**Ollama connection error** — Verify Ollama is running and bge-m3 is pulled:
```bash
curl http://localhost:11434/api/tags | grep bge-m3
# if missing:
ollama pull bge-m3
```
**Container not picking up the model** — The engine loads the model once at startup.
A `docker restart` is required after every `docker cp`.