# Layer 2 Classifier — Training Pipeline

Author: Rafael Ruiz (CTO, 101OBEX Corp)
Related: ADR-0008 — Adaptive Query Routing
Part of ADR-0008 Phase 2: trains the embedding-based classifier that intercepts queries before they reach the LLM (Layer 3), reducing per-request Ollama calls to near zero for well-represented query types.
## Overview
The classifier embeds each query with bge-m3 (already running in the stack),
trains a LogisticRegression on the resulting vectors, and serializes the model
with joblib. At engine startup, graph.py loads the model and uses it as Layer 2
in the classification pipeline.
```mermaid
flowchart TD
    Q([Query]) --> L1
    L1["Layer 1 — Hard rules\nRC-01 · RC-02\nO(1), deterministic"]
    L1 -->|match| R([Classification result])
    L1 -->|no match| L2
    L2["Layer 2 — Embedding classifier\nbge-m3 + LogisticRegression\n~1ms · CPU only · no LLM"]
    L2 -->|confidence ≥ 0.85| R
    L2 -->|confidence < 0.85| L3
    L3["Layer 3 — LLM classifier\nOllama fallback\n~300–800ms"]
    L3 --> R
```
If the model file does not exist, the engine starts normally and uses L3 only.
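The confidence gate at Layer 2 can be sketched like this (illustrative only; `classify_l2` is a hypothetical name, and the engine's actual code in `graph.py` may differ):

```python
from typing import Optional

import numpy as np

CONFIDENCE_THRESHOLD = 0.85  # below this, fall through to the Layer 3 LLM


def classify_l2(query_vec: np.ndarray, model) -> Optional[str]:
    """Return a label if the classifier is confident enough, else None.

    `query_vec` is the bge-m3 embedding of the query (a 1024-dim vector);
    `model` is the joblib-loaded LogisticRegression.
    """
    probs = model.predict_proba(query_vec.reshape(1, -1))[0]
    best = int(np.argmax(probs))
    if probs[best] >= CONFIDENCE_THRESHOLD:
        return model.classes_[best]
    return None  # caller falls back to the Layer 3 LLM classifier
```

A `None` return is the "confidence < 0.85" edge in the diagram above: the query continues to Layer 3.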
## Files

| File | Purpose |
|---|---|
| `train_classifier.py` | Training script |
| `seed_classifier_dataset.jsonl` | Labeled dataset (seed, 100 examples) |
| `requirements.txt` | Python dependencies for the training venv |
## Setup

```bash
cd scripts/pipelines/classifier
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
## Running the training

```bash
python train_classifier.py
```

Default behaviour (no flags needed if running from this directory):

- data: `seed_classifier_dataset.jsonl` in the same folder
- ollama: `http://localhost:11434` (or `$OLLAMA_LOCAL_URL`)
- output: `/data/classifier_model.pkl` (or `$CLASSIFIER_MODEL_PATH`)
- min CV accuracy: 0.90
The script exits with code 1 if CV accuracy is below the threshold — the model is not saved in that case.
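That accuracy gate can be sketched with scikit-learn (a minimal illustration; `train_and_gate` is a hypothetical condensation, not the script's actual internals):

```python
import sys

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def train_and_gate(X: np.ndarray, y: list, output_path: str,
                   min_cv_accuracy: float = 0.90) -> None:
    """Train on the embedding matrix X, but only save if 5-fold CV clears the bar."""
    clf = LogisticRegression(C=1.0, max_iter=1000)
    scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
    print(f"CV accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
    if scores.mean() < min_cv_accuracy:
        print("CV accuracy below threshold — model NOT saved")
        sys.exit(1)  # matches the documented exit code
    clf.fit(X, y)  # refit on the full dataset before serializing
    joblib.dump(clf, output_path)
```

Refitting on the full dataset after CV is the usual pattern: CV estimates generalization, then the deployed model gets every labeled example.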
## All arguments

| Argument | Default | Description |
|---|---|---|
| `--data` | `seed_classifier_dataset.jsonl` | Path to labeled JSONL dataset |
| `--output` | `/data/classifier_model.pkl` | Output path for serialized model |
| `--ollama` | `http://localhost:11434` | Ollama base URL |
| `--min-cv-accuracy` | `0.90` | Minimum CV accuracy to save the model |
### Custom output path (recommended when running from host)

```bash
python train_classifier.py \
  --output ../../../Docker/data/classifier_model.pkl
```
## Deploying the model to Docker

```mermaid
sequenceDiagram
    participant Host
    participant Docker
    Host->>Host: python train_classifier.py --output ./Docker/data/classifier_model.pkl
    Host->>Docker: docker cp classifier_model.pkl brunix-assistance-engine:/data/
    Host->>Docker: docker restart brunix-assistance-engine
    Docker-->>Host: [classifier/L2] model loaded from /data/classifier_model.pkl
```
On startup you will see in the logs:

```
[classifier/L2] model loaded from /data/classifier_model.pkl — {'classes': [...], 'n_train': 100, 'cv_mean': 0.96, ...}
```

If the model is missing:

```
[classifier/L2] model not found at /data/classifier_model.pkl — using LLM fallback only
```
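The load-or-fall-back behaviour behind those log lines amounts to roughly the following (a sketch; the real loader in `graph.py` may differ):

```python
import os
from typing import Optional

import joblib

MODEL_PATH = os.environ.get("CLASSIFIER_MODEL_PATH", "/data/classifier_model.pkl")


def load_l2_model(path: str = MODEL_PATH) -> Optional[object]:
    """Load the Layer 2 model once at startup; None means 'LLM fallback only'."""
    if not os.path.exists(path):
        print(f"[classifier/L2] model not found at {path} — using LLM fallback only")
        return None  # engine still starts; every query goes to Layer 3
    model = joblib.load(path)
    print(f"[classifier/L2] model loaded from {path}")
    return model
```

Because the model is read exactly once at startup, replacing the `.pkl` inside the container has no effect until the engine restarts.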
## Dataset format

Every line is a JSON object with two required fields:

```json
{"query": "What is addVar in AVAP?", "type": "RETRIEVAL"}
```

Valid types: `RETRIEVAL`, `CODE_GENERATION`, `CONVERSATIONAL`, `PLATFORM`.

Lines with a missing `query` or `type`, or invalid JSON, are skipped with a warning.
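That validation rule can be sketched as a loader (illustrative; the script's exact warning messages may differ):

```python
import json
import sys

VALID_TYPES = {"RETRIEVAL", "CODE_GENERATION", "CONVERSATIONAL", "PLATFORM"}


def load_dataset(path: str) -> list:
    """Parse a labeled JSONL file, skipping malformed lines with a warning."""
    examples = []
    with open(path, encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            try:
                obj = json.loads(line)
            except json.JSONDecodeError:
                print(f"warning: line {lineno}: invalid JSON — skipped", file=sys.stderr)
                continue
            if not obj.get("query") or obj.get("type") not in VALID_TYPES:
                print(f"warning: line {lineno}: missing query/type — skipped", file=sys.stderr)
                continue
            examples.append(obj)
    return examples
```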
## Merging production exports with the seed

```mermaid
flowchart LR
    S[seed_classifier_dataset.jsonl] --> M
    E["/data/classifier_labels/\nclassifier_labels_*.jsonl\n(production exports)"] --> M
    M([merge]) --> D[merged_dataset.jsonl]
    D --> T[train_classifier.py]
    T --> P[classifier_model.pkl]
    P --> E2[engine restart]
```
```bash
cat seed_classifier_dataset.jsonl \
    /data/classifier_labels/classifier_labels_*.jsonl \
    > merged_dataset.jsonl

python train_classifier.py --data merged_dataset.jsonl
```
The engine exports labeled data automatically to `CLASSIFIER_EXPORT_DIR`
(default `/data/classifier_labels/`) once `CLASSIFIER_EXPORT_THRESHOLD` sessions
accumulate. Those files use the same format and can be merged directly.
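If the same export gets merged more than once, plain `cat` will duplicate examples and skew the class distribution. A small Python alternative that also drops exact duplicates (hypothetical helper `merge_datasets`; the `cat` command above is all the pipeline strictly requires):

```python
import json


def merge_datasets(paths: list, output: str) -> int:
    """Concatenate JSONL datasets, dropping exact duplicate (query, type) pairs.

    Returns the number of examples written.
    """
    seen = set()
    kept = 0
    with open(output, "w", encoding="utf-8") as out:
        for path in paths:
            with open(path, encoding="utf-8") as fh:
                for line in fh:
                    line = line.strip()
                    if not line:
                        continue
                    obj = json.loads(line)
                    key = (obj.get("query"), obj.get("type"))
                    if key in seen:
                        continue  # exact duplicate from an earlier file
                    seen.add(key)
                    out.write(json.dumps(obj, ensure_ascii=False) + "\n")
                    kept += 1
    return kept
```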
## Expected output

```
[1/4] Loading data from seed_classifier_dataset.jsonl
      100 examples loaded
      Distribution: {'RETRIEVAL': 25, 'CODE_GENERATION': 25, 'CONVERSATIONAL': 25, 'PLATFORM': 25}
[2/4] Embedding with bge-m3 via http://localhost:11434
      Embedding batch 1/4 (32 queries)...
      Embedding batch 2/4 (32 queries)...
      Embedding batch 3/4 (32 queries)...
      Embedding batch 4/4 (4 queries)...
      Embedding matrix: (100, 1024)
[3/4] Training LogisticRegression (C=1.0) with 5-fold CV
      CV accuracy: 0.970 ± 0.021 (folds: [0.95, 1.0, 0.95, 0.95, 1.0])
      Per-class report:
                       precision  recall  f1-score  support
      CODE_GENERATION       1.00    0.96      0.98       25
      CONVERSATIONAL        0.96    1.00      0.98       25
      PLATFORM              1.00    1.00      1.00       25
      RETRIEVAL             0.96    0.96      0.96       25
[4/4] Saving model to ../../../Docker/data/classifier_model.pkl
      Model saved → ../../../Docker/data/classifier_model.pkl

Done. Classes: ['CODE_GENERATION', 'CONVERSATIONAL', 'PLATFORM', 'RETRIEVAL']
```
## Troubleshooting

**CV accuracy below threshold** — Add more examples to the underperforming class
(check the per-class recall column). 5–10 extra examples per class usually suffice.

**langchain-ollama not installed** — Run `pip install -r requirements.txt` inside
the venv.

**Ollama connection error** — Verify Ollama is running and bge-m3 is pulled:

```bash
curl http://localhost:11434/api/tags | grep bge-m3
# if missing:
ollama pull bge-m3
```

**Container not picking up the model** — The engine loads the model once at startup.
A `docker restart` is required after every `docker cp`.