# Layer 2 Classifier — Training Pipeline

**Author:** Rafael Ruiz (CTO, 101OBEX Corp)
**Related:** ADR-0008 — Adaptive Query Routing

Part of **ADR-0008 Phase 2**: trains the embedding-based classifier that intercepts queries before they reach the LLM (Layer 3), reducing per-request Ollama calls to near zero for well-represented query types.

---

## Overview

The classifier embeds each query with **bge-m3** (already running in the stack), trains a **LogisticRegression** on the resulting vectors, and serializes the model with joblib. At engine startup, `graph.py` loads the model and uses it as Layer 2 in the classification pipeline.

```mermaid
flowchart TD
    Q([Query]) --> L1
    L1["Layer 1 — Hard rules\nRC-01 · RC-02\nO(1), deterministic"]
    L1 -->|match| R([Classification result])
    L1 -->|no match| L2
    L2["Layer 2 — Embedding classifier\nbge-m3 + LogisticRegression\n~1ms · CPU only · no LLM"]
    L2 -->|confidence ≥ 0.85| R
    L2 -->|confidence < 0.85| L3
    L3["Layer 3 — LLM classifier\nOllama fallback\n~300–800ms"]
    L3 --> R
```

If the model file does not exist, the engine starts normally and uses L3 only.

---

## Files

| File | Purpose |
|---|---|
| `train_classifier.py` | Training script |
| `seed_classifier_dataset.jsonl` | Labeled dataset (seed, 100 examples) |
| `requirements.txt` | Python dependencies for the training venv |

---

## Setup

```bash
cd scripts/pipelines/classifier
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

---

## Running the training

```bash
python train_classifier.py
```

Default behaviour (no flags needed if running from this directory):

- **data**: `seed_classifier_dataset.jsonl` in the same folder
- **ollama**: `http://localhost:11434` (or `$OLLAMA_LOCAL_URL`)
- **output**: `/data/classifier_model.pkl` (or `$CLASSIFIER_MODEL_PATH`)
- **min CV accuracy**: 0.90

The script exits with code 1 if CV accuracy is below the threshold — the model is **not saved** in that case.
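The Layer 2 confidence gate described above can be sketched in a few lines. This is a minimal illustration, not the engine's actual code: it uses toy 2-D vectors in place of real 1024-dim bge-m3 embeddings, and `classify_l2` is a hypothetical name.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy stand-in for bge-m3 vectors: two tight 2-D clusters.
# (Hypothetical data; the real Layer 2 operates on 1024-dim embeddings.)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.1, size=(20, 2)) for c in (0.0, 1.0)])
y = ["RETRIEVAL"] * 20 + ["CODE_GENERATION"] * 20

clf = LogisticRegression(C=1.0).fit(X, y)

CONFIDENCE_THRESHOLD = 0.85  # the Layer 2 cut-off from the diagram above

def classify_l2(vec):
    """Return (label, confidence), or None to signal fallback to Layer 3."""
    proba = clf.predict_proba([vec])[0]
    best = int(np.argmax(proba))
    if proba[best] >= CONFIDENCE_THRESHOLD:
        return clf.classes_[best], float(proba[best])
    return None  # low confidence -> defer to the LLM classifier

print(classify_l2([0.0, 0.0]))  # in-cluster query -> confident label
print(classify_l2([0.5, 0.5]))  # near the boundary -> None (defer to LLM)
```

A `None` return is what makes the pipeline cheap: only ambiguous queries pay the ~300–800ms Layer 3 cost.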
### All arguments

| Argument | Default | Description |
|---|---|---|
| `--data` | `seed_classifier_dataset.jsonl` | Path to labeled JSONL dataset |
| `--output` | `/data/classifier_model.pkl` | Output path for serialized model |
| `--ollama` | `http://localhost:11434` | Ollama base URL |
| `--min-cv-accuracy` | `0.90` | Minimum CV accuracy to save the model |

### Custom output path (recommended when running from host)

```bash
python train_classifier.py \
  --output ../../../Docker/data/classifier_model.pkl
```

---

## Deploying the model to Docker

```mermaid
sequenceDiagram
    participant Host
    participant Docker
    Host->>Host: python train_classifier.py --output ./Docker/data/classifier_model.pkl
    Host->>Docker: docker cp classifier_model.pkl brunix-assistance-engine:/data/
    Host->>Docker: docker restart brunix-assistance-engine
    Docker-->>Host: [classifier/L2] model loaded from /data/classifier_model.pkl
```

On startup you will see in the logs:

```
[classifier/L2] model loaded from /data/classifier_model.pkl — {'classes': [...], 'n_train': 100, 'cv_mean': 0.96, ...}
```

If the model is missing:

```
[classifier/L2] model not found at /data/classifier_model.pkl — using LLM fallback only
```

---

## Dataset format

Every line is a JSON object with two required fields:

```json
{"query": "What is addVar in AVAP?", "type": "RETRIEVAL"}
```

Valid types: `RETRIEVAL`, `CODE_GENERATION`, `CONVERSATIONAL`, `PLATFORM`. Lines with missing `query` or `type`, or invalid JSON, are skipped with a warning.
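The validation rules above (required fields, valid types, skip-with-warning on bad lines) can be mirrored by a small loader. This is a hypothetical helper for inspecting your own datasets; the actual logic inside `train_classifier.py` may differ.

```python
import json

VALID_TYPES = {"RETRIEVAL", "CODE_GENERATION", "CONVERSATIONAL", "PLATFORM"}

def load_dataset(path):
    """Return (query, type) pairs, skipping malformed lines with a warning."""
    examples = []
    with open(path, encoding="utf-8") as fh:
        for n, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            try:
                row = json.loads(line)
            except json.JSONDecodeError:
                print(f"warning: line {n}: invalid JSON, skipped")
                continue
            if not row.get("query") or row.get("type") not in VALID_TYPES:
                print(f"warning: line {n}: missing query/type, skipped")
                continue
            examples.append((row["query"], row["type"]))
    return examples
```

Running it against a merged dataset before training is a cheap way to confirm production exports parsed cleanly.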
### Merging production exports with the seed

```mermaid
flowchart LR
    S[seed_classifier_dataset.jsonl] --> M
    E["/data/classifier_labels/\nclassifier_labels_*.jsonl\n(production exports)"] --> M
    M([merge]) --> D[merged_dataset.jsonl]
    D --> T[train_classifier.py]
    T --> P[classifier_model.pkl]
    P --> E2[engine restart]
```

```bash
cat seed_classifier_dataset.jsonl \
    /data/classifier_labels/classifier_labels_*.jsonl \
    > merged_dataset.jsonl
python train_classifier.py --data merged_dataset.jsonl
```

The engine exports labeled data automatically to `CLASSIFIER_EXPORT_DIR` (default `/data/classifier_labels/`) once `CLASSIFIER_EXPORT_THRESHOLD` sessions accumulate. Those files use the same format and can be merged directly.

---

## Expected output

```
[1/4] Loading data from seed_classifier_dataset.jsonl
      100 examples loaded
      Distribution: {'RETRIEVAL': 25, 'CODE_GENERATION': 25, 'CONVERSATIONAL': 25, 'PLATFORM': 25}
[2/4] Embedding with bge-m3 via http://localhost:11434
      Embedding batch 1/4 (32 queries)...
      Embedding batch 2/4 (32 queries)...
      Embedding batch 3/4 (32 queries)...
      Embedding batch 4/4 (4 queries)...
      Embedding matrix: (100, 1024)
[3/4] Training LogisticRegression (C=1.0) with 5-fold CV
      CV accuracy: 0.970 ± 0.021 (folds: [0.95, 1.0, 0.95, 0.95, 1.0])
      Per-class report:
                       precision    recall  f1-score   support
      CODE_GENERATION       1.00      0.96      0.98        25
      CONVERSATIONAL        0.96      1.00      0.98        25
      PLATFORM              1.00      1.00      1.00        25
      RETRIEVAL             0.96      0.96      0.96        25
[4/4] Saving model to ../../../Docker/data/classifier_model.pkl
      Model saved → ../../../Docker/data/classifier_model.pkl

Done. Classes: ['CODE_GENERATION', 'CONVERSATIONAL', 'PLATFORM', 'RETRIEVAL']
```

---

## Troubleshooting

**`CV accuracy below threshold`** — Add more examples to the underperforming class (check the per-class recall column). 5–10 extra examples per class usually suffice.

**`langchain-ollama not installed`** — Run `pip install -r requirements.txt` inside the venv.
**Ollama connection error** — Verify Ollama is running and bge-m3 is pulled:

```bash
curl http://localhost:11434/api/tags | grep bge-m3
# if missing:
ollama pull bge-m3
```

**Container not picking up the model** — The engine loads the model once at startup. A `docker restart` is required after every `docker cp`.
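Before copying the artifact into the container, it can be worth confirming the file deserializes at all. The joblib round-trip below shows the pattern on a stand-in model; it is a sketch only, since the real pickle may bundle metadata alongside the estimator (as the `{'classes': [...], 'n_train': ...}` startup log suggests).

```python
import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in for the real artifact: train a tiny model and round-trip it
# through joblib, mimicking what train_classifier.py serializes.
# (Hypothetical data; the real model is trained on bge-m3 embeddings.)
X = np.random.default_rng(1).normal(size=(8, 4))
y = ["RETRIEVAL", "PLATFORM"] * 4
joblib.dump(LogisticRegression().fit(X, y), "classifier_model.pkl")

model = joblib.load("classifier_model.pkl")
print(list(model.classes_))              # labels survive the round-trip
print(model.predict_proba(X[:1]).shape)  # one row of class probabilities
```

If `joblib.load` raises here but the file was produced by a different environment, check that the training venv and the container run compatible scikit-learn versions — pickled estimators are not guaranteed portable across releases.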