
# Layer 2 Classifier — Training Pipeline

**Author:** Rafael Ruiz (CTO, 101OBEX Corp)  
**Related:** ADR-0008 — Adaptive Query Routing

Part of ADR-0008 Phase 2: trains the embedding-based classifier that intercepts queries before they reach the LLM (Layer 3), reducing per-request Ollama calls to near zero for well-represented query types.


## Overview

The classifier embeds each query with `bge-m3` (already running in the stack), trains a `LogisticRegression` on the resulting vectors, and serializes the model with `joblib`. At engine startup, `graph.py` loads the model and uses it as Layer 2 in the classification pipeline.
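At its core, the training step is a few lines of scikit-learn. The sketch below is illustrative, not the script's actual code: it substitutes small synthetic vectors for the 1024-dimensional bge-m3 embeddings, and the variable names are its own:

```python
import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for bge-m3 embeddings: two well-separated clusters.
# In the real pipeline, X holds one 1024-dim vector per labeled query.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (50, 8)),
               rng.normal(1.0, 0.1, (50, 8))])
y = np.array(["CONVERSATIONAL"] * 50 + ["RETRIEVAL"] * 50)

# Measure 5-fold CV accuracy, then fit on the full set.
clf = LogisticRegression(C=1.0, max_iter=1000)
cv_mean = cross_val_score(clf, X, y, cv=5).mean()
clf.fit(X, y)

# Serialize the fitted model plus metadata in one artifact with joblib.
joblib.dump({"model": clf, "cv_mean": cv_mean}, "classifier_model.pkl")
print(f"CV accuracy: {cv_mean:.3f}")
```

The resulting `.pkl` is the single file the engine loads at startup.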

```mermaid
flowchart TD
    Q([Query]) --> L1

    L1["Layer 1 — Hard rules\nRC-01 · RC-02\nO(1), deterministic"]
    L1 -->|match| R([Classification result])
    L1 -->|no match| L2

    L2["Layer 2 — Embedding classifier\nbge-m3 + LogisticRegression\n~1ms · CPU only · no LLM"]
    L2 -->|confidence ≥ 0.85| R
    L2 -->|confidence < 0.85| L3

    L3["Layer 3 — LLM classifier\nOllama fallback\n~300–800ms"]
    L3 --> R
```

If the model file does not exist, the engine starts normally and uses L3 only.
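The Layer 2 decision rule from the diagram can be sketched as follows (a hypothetical `classify_l2` helper, not necessarily how `graph.py` names things; the tiny 1-D model stands in for the 1024-dim bge-m3 space):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def classify_l2(model, vector, threshold=0.85):
    """Return (label, confidence) when the classifier is confident,
    or None to escalate the query to the LLM (Layer 3)."""
    probs = model.predict_proba([vector])[0]
    best = int(np.argmax(probs))
    if probs[best] < threshold:
        return None
    return model.classes_[best], float(probs[best])

# Tiny 1-D model for illustration only.
X = np.array([[-3.0], [-2.9], [3.0], [3.1]])
y = np.array(["CONVERSATIONAL", "CONVERSATIONAL", "RETRIEVAL", "RETRIEVAL"])
model = LogisticRegression(C=1.0).fit(X, y)

print(classify_l2(model, [-3.0]))  # far from the boundary → confident
print(classify_l2(model, [0.0]))   # near the boundary → None (go to L3)
```

Returning `None` instead of a low-confidence guess is what keeps Layer 3 as the safety net rather than the default path.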


## Files

| File | Purpose |
|---|---|
| `train_classifier.py` | Training script |
| `seed_classifier_dataset.jsonl` | Labeled dataset (seed, 100 examples) |
| `requirements.txt` | Python dependencies for the training venv |

## Setup

```shell
cd scripts/pipelines/classifier

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

## Running the training

```shell
python train_classifier.py
```

Default behaviour (no flags needed if running from this directory):

- data: `seed_classifier_dataset.jsonl` in the same folder
- ollama: `http://localhost:11434` (or `$OLLAMA_LOCAL_URL`)
- output: `/data/classifier_model.pkl` (or `$CLASSIFIER_MODEL_PATH`)
- min CV accuracy: `0.90`

The script exits with code 1 if CV accuracy is below the threshold — the model is not saved in that case.
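A minimal sketch of that accuracy gate, using the 5-fold CV the script reports (illustrative names; synthetic data in place of real embeddings):

```python
import sys
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def check_cv_gate(X, y, min_cv_accuracy=0.90):
    """Exit with code 1 (model not saved) when 5-fold CV accuracy
    falls below the threshold; otherwise return the mean accuracy."""
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
    mean = scores.mean()
    print(f"CV accuracy: {mean:.3f} ± {scores.std():.3f}")
    if mean < min_cv_accuracy:
        print(f"Below --min-cv-accuracy={min_cv_accuracy} — not saving")
        sys.exit(1)
    return mean

# Demo with cleanly separable synthetic data: the gate passes.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (20, 4)), rng.normal(2, 0.1, (20, 4))])
y = np.array(["A"] * 20 + ["B"] * 20)
mean_acc = check_cv_gate(X, y)
```

Exiting non-zero makes the gate easy to wire into CI: a bad dataset fails the build instead of silently shipping a weak model.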

### All arguments

| Argument | Default | Description |
|---|---|---|
| `--data` | `seed_classifier_dataset.jsonl` | Path to labeled JSONL dataset |
| `--output` | `/data/classifier_model.pkl` | Output path for serialized model |
| `--ollama` | `http://localhost:11434` | Ollama base URL |
| `--min-cv-accuracy` | `0.90` | Minimum CV accuracy to save the model |

Example:

```shell
python train_classifier.py \
  --output ../../../Docker/data/classifier_model.pkl
```

## Deploying the model to Docker

```mermaid
sequenceDiagram
    participant Host
    participant Docker

    Host->>Host: python train_classifier.py --output ./Docker/data/classifier_model.pkl
    Host->>Docker: docker cp classifier_model.pkl brunix-assistance-engine:/data/
    Host->>Docker: docker restart brunix-assistance-engine
    Docker-->>Host: [classifier/L2] model loaded from /data/classifier_model.pkl
```

On startup you will see in the logs:

```
[classifier/L2] model loaded from /data/classifier_model.pkl — {'classes': [...], 'n_train': 100, 'cv_mean': 0.96, ...}
```

If the model is missing:

```
[classifier/L2] model not found at /data/classifier_model.pkl — using LLM fallback only
```

## Dataset format

Every line is a JSON object with two required fields:

```json
{"query": "What is addVar in AVAP?", "type": "RETRIEVAL"}
```

Valid types: `RETRIEVAL`, `CODE_GENERATION`, `CONVERSATIONAL`, `PLATFORM`.

Lines with a missing `query` or `type`, or with invalid JSON, are skipped with a warning.
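The skip logic can be sketched like this (an illustrative `load_examples` helper that accepts any iterable of lines; the real script's names may differ):

```python
import json

VALID_TYPES = {"RETRIEVAL", "CODE_GENERATION", "CONVERSATIONAL", "PLATFORM"}

def load_examples(lines):
    """Parse JSONL lines into (query, type) pairs, skipping bad ones."""
    examples = []
    for n, line in enumerate(lines, 1):
        line = line.strip()
        if not line:
            continue
        try:
            obj = json.loads(line)
            query, qtype = obj["query"], obj["type"]
        except (json.JSONDecodeError, KeyError):
            print(f"warning: line {n} skipped (invalid JSON or missing field)")
            continue
        if qtype not in VALID_TYPES:
            print(f"warning: line {n} skipped (unknown type {qtype!r})")
            continue
        examples.append((query, qtype))
    return examples

rows = load_examples([
    '{"query": "What is addVar in AVAP?", "type": "RETRIEVAL"}',
    '{"query": "missing type"}',
    'not json at all',
])
print(rows)  # only the first line survives
```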

## Merging production exports with the seed

```mermaid
flowchart LR
    S[seed_classifier_dataset.jsonl] --> M
    E["/data/classifier_labels/\nclassifier_labels_*.jsonl\n(production exports)"] --> M
    M([merge]) --> D[merged_dataset.jsonl]
    D --> T[train_classifier.py]
    T --> P[classifier_model.pkl]
    P --> E2[engine restart]
```

```shell
cat seed_classifier_dataset.jsonl \
    /data/classifier_labels/classifier_labels_*.jsonl \
    > merged_dataset.jsonl

python train_classifier.py --data merged_dataset.jsonl
```

The engine exports labeled data automatically to `CLASSIFIER_EXPORT_DIR` (default `/data/classifier_labels/`) once `CLASSIFIER_EXPORT_THRESHOLD` sessions accumulate. Those files use the same format and can be merged directly.
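Plain `cat` keeps duplicates, which production exports can introduce over time. A hypothetical Python alternative that deduplicates exact (query, type) pairs while merging:

```python
import json

def merge_jsonl(paths, out_path):
    """Concatenate JSONL datasets into out_path, dropping exact
    duplicate (query, type) pairs that plain `cat` would keep."""
    seen = set()
    kept = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in paths:
            with open(path, encoding="utf-8") as f:
                for line in f:
                    line = line.strip()
                    if not line:
                        continue
                    obj = json.loads(line)
                    key = (obj.get("query"), obj.get("type"))
                    if key in seen:
                        continue
                    seen.add(key)
                    out.write(json.dumps(obj, ensure_ascii=False) + "\n")
                    kept += 1
    return kept
```

Deduplication matters here because repeated examples inflate CV accuracy without adding information.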


## Expected output

```
[1/4] Loading data from seed_classifier_dataset.jsonl
  100 examples loaded
  Distribution: {'RETRIEVAL': 25, 'CODE_GENERATION': 25, 'CONVERSATIONAL': 25, 'PLATFORM': 25}

[2/4] Embedding with bge-m3 via http://localhost:11434
  Embedding batch 1/4 (32 queries)...
  Embedding batch 2/4 (32 queries)...
  Embedding batch 3/4 (32 queries)...
  Embedding batch 4/4 (4 queries)...
  Embedding matrix: (100, 1024)

[3/4] Training LogisticRegression (C=1.0) with 5-fold CV
  CV accuracy: 0.970 ± 0.021  (folds: [0.95, 1.0, 0.95, 0.95, 1.0])

  Per-class report:
                   precision  recall  f1-score  support
    CODE_GENERATION     1.00    0.96      0.98       25
    CONVERSATIONAL      0.96    1.00      0.98       25
    PLATFORM            1.00    1.00      1.00       25
    RETRIEVAL           0.96    0.96      0.96       25

[4/4] Saving model to ../../../Docker/data/classifier_model.pkl
  Model saved → ../../../Docker/data/classifier_model.pkl

Done. Classes: ['CODE_GENERATION', 'CONVERSATIONAL', 'PLATFORM', 'RETRIEVAL']
```

## Troubleshooting

**CV accuracy below threshold** — Add more examples to the underperforming class (check the per-class recall column). 5–10 extra examples per class usually suffice.

**langchain-ollama not installed** — Run `pip install -r requirements.txt` inside the venv.

**Ollama connection error** — Verify Ollama is running and `bge-m3` is pulled:

```shell
curl http://localhost:11434/api/tags | grep bge-m3
# if missing:
ollama pull bge-m3
```

**Container not picking up the model** — The engine loads the model once at startup. A `docker restart` is required after every `docker cp`.