# ADR-0010: Classifier Continuous Retraining — Champion/Challenger Pipeline

**Date:** 2026-04-10
**Status:** Accepted
**Deciders:** Rafael Ruiz (CTO)
**Related ADRs:** ADR-0008 (Adaptive Query Routing — Layer 2 classifier), ADR-0009 (Per-Type Response Validation)

---

## Context

ADR-0008 Phase 2 deployed a Layer 2 embedding classifier trained on a **seed dataset of 94 hand-crafted examples**. This model works well for the initial distribution of queries but has two structural limitations:

1. **The seed dataset does not reflect production traffic.** Hand-crafted examples are idealized. Real users ask questions with typos, mixed languages, ambiguous phrasing, and domain-specific vocabulary that is not in the seed.
2. **The model never improves without manual intervention.** The data flywheel (ADR-0008 Phase 1) accumulates labeled examples automatically via `classify_history_store`, but nothing uses them. Data piles up in `/data/classifier_labels/` and the model stays frozen at its initial accuracy.

The consequence is that Layer 2 confidence degrades over time relative to the actual query distribution, pushing more requests to Layer 3 (the LLM classifier) than necessary.

---

## Decision

Implement a **Champion/Challenger automatic retraining pipeline** that triggers every time a new batch of labeled data is exported.
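The export-then-retrain trigger can be sketched as follows. This is a hypothetical illustration, not the real `classifier_export.py` API: the class name, hook argument, and counter logic are assumptions; only the environment variable names and defaults come from this ADR.

```python
# Hypothetical sketch of the session-count trigger. ExportTrigger and
# record_session are illustrative names, not the real implementation.
import os
import threading

EXPORT_THRESHOLD = int(os.environ.get("CLASSIFIER_EXPORT_THRESHOLD", "100"))
RETRAIN_ON_EXPORT = os.environ.get("RETRAIN_ON_EXPORT", "true") == "true"


class ExportTrigger:
    """Counts labeled sessions and fires the retrain hook at the threshold."""

    def __init__(self, retrain_hook):
        self._count = 0
        self._retrain_hook = retrain_hook

    def record_session(self):
        """Returns True when this session completed an export batch."""
        self._count += 1
        if self._count >= EXPORT_THRESHOLD:
            self._count = 0
            # Export happens first; retraining runs only when enabled,
            # and in a daemon thread so gRPC handling is never blocked.
            if RETRAIN_ON_EXPORT:
                threading.Thread(target=self._retrain_hook, daemon=True).start()
            return True
        return False
```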
### Core design

```mermaid
flowchart TD
    S[classify_history_store\naccumulates sessions] -->|100 sessions| EX[classifier_export.py\nexport JSONL]
    EX -->|RETRAIN_ON_EXPORT=true| RT[retrain_pipeline.py\nbackground thread]
    RT --> LOAD[Load seed + all exports]
    LOAD --> EMB[Embed with bge-m3]
    EMB --> SPLIT[Stratified 80/20 split\ntrain / held-out]
    SPLIT --> CV[Cross-validate challenger\nStratifiedKFold]
    CV -->|CV < 0.90| ABORT[Abort — do not deploy\nlog alert]
    CV -->|CV ≥ 0.90| TRAIN[Train challenger\non full train split]
    TRAIN --> EVAL_CH[Evaluate challenger\non held-out set]
    EVAL_CH --> EVAL_CP[Evaluate champion\non same held-out set]
    EVAL_CP --> DECISION{challenger ≥ champion?}
    DECISION -->|yes| BACKUP[Backup champion]
    BACKUP --> DEPLOY[Deploy challenger\noverwrite CLASSIFIER_MODEL_PATH]
    DEPLOY --> ARCHIVE[Archive processed exports]
    DECISION -->|no| KEEP[Keep champion\nlog alert\ndiscard challenger]
    KEEP --> ARCHIVE
```

### Champion/Challenger semantics

The model currently in production is the **champion**. The newly trained model is the **challenger**. The challenger is promoted only if its accuracy on the held-out set is **greater than or equal to** the champion's accuracy on the same set. This guarantees that the production model never regresses: a retraining run triggered by noisy or unbalanced data produces a challenger that loses the comparison and is discarded automatically.

If no champion exists (first deployment), the challenger is promoted unconditionally, provided its CV accuracy is ≥ 0.90.

### Trigger

The pipeline is triggered by `classifier_export.py` after every successful export, which happens every `CLASSIFIER_EXPORT_THRESHOLD` sessions (default: **100**).

The pipeline runs in a **background daemon thread** inside the engine container and does not block gRPC request handling. A 10-minute hard timeout prevents runaway retraining from consuming resources indefinitely.
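The promotion gate at the heart of the flow above reduces to a small decision function. This is a minimal sketch, not the actual `retrain_pipeline.py` code: the function name and return labels are assumptions, while the 0.90 CV threshold and the greater-than-or-equal rule come from this ADR.

```python
# Sketch of the champion/challenger promotion gate. The return labels
# ("abort", "promote", "keep_champion") are illustrative, not the real API.
MIN_CV_ACCURACY = 0.90  # mirrors CLASSIFIER_MIN_CV_ACCURACY


def promotion_decision(cv_accuracy, challenger_heldout, champion_heldout):
    """Decide the fate of a freshly trained challenger.

    cv_accuracy        : cross-validation accuracy of the challenger
    challenger_heldout : challenger accuracy on the held-out split
    champion_heldout   : champion accuracy on the same split, or None
                         when no champion exists yet (first deployment)
    """
    if cv_accuracy < MIN_CV_ACCURACY:
        return "abort"  # data too noisy or unbalanced: do not deploy
    if champion_heldout is None:
        return "promote"  # first deployment: no model to beat
    if challenger_heldout >= champion_heldout:
        return "promote"  # equal or better: backup champion, then deploy
    return "keep_champion"  # regression: discard challenger, log alert
```

Ties promote the challenger deliberately: the newer model saw more production data, so at equal held-out accuracy it is the safer bet against future drift.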
### Model loading after promotion

The engine does **not** hot-reload the model mid-operation. The new champion is written to `CLASSIFIER_MODEL_PATH` and loaded on the next engine restart. This is intentional:

- Hot-reload would require locking around every inference call, adding latency.
- Mid-session model changes could produce inconsistent classification for the same user within a conversation.
- Docker container restarts are cheap and already part of the deployment workflow.

The engine logs a clear message after promotion:

```
[classifier_export] retraining completed — restart engine to load new model
```

### Safety mechanisms

| Mechanism | Purpose |
|---|---|
| CV accuracy gate (≥ 0.90) | Rejects challengers trained on insufficient or unbalanced data before the held-out comparison |
| Champion comparison on held-out | Prevents regression — challenger must equal or beat the current production model |
| Champion backup before overwrite | `classifier_model_backup_{timestamp}.pkl` — roll back manually if needed |
| Export archiving | Processed files moved to `CLASSIFIER_ARCHIVE_DIR` — prevents re-inclusion in future runs |
| 10-minute subprocess timeout | Prevents runaway retraining from blocking the engine indefinitely |
| `RETRAIN_ON_EXPORT=false` | Disables automatic retraining without code changes — useful in staging or during debugging |

---

## Environment variables

| Variable | Default | Purpose |
|---|---|---|
| `CLASSIFIER_EXPORT_THRESHOLD` | `100` | Sessions before export + retrain trigger |
| `RETRAIN_ON_EXPORT` | `true` | Enable/disable automatic retraining |
| `RETRAIN_SCRIPT_PATH` | `/app/scripts/pipelines/classifier/retrain_pipeline.py` | Path to retrain script inside container |
| `CLASSIFIER_ARCHIVE_DIR` | `/data/classifier_labels/archived` | Where processed exports are moved after retraining |
| `CLASSIFIER_SEED_DATASET` | `/app/scripts/pipelines/classifier/seed_classifier_dataset.jsonl` | Seed dataset always included in retraining |
| `CLASSIFIER_MIN_CV_ACCURACY` | `0.90` | Minimum CV accuracy for challenger to proceed |
| `CLASSIFIER_HELD_OUT_RATIO` | `0.20` | Fraction of merged dataset reserved for champion/challenger comparison |

---

## Files

| File | Role |
|---|---|
| `scripts/pipelines/classifier/retrain_pipeline.py` | Champion/Challenger training, evaluation, promotion, and archiving |
| `Docker/src/utils/classifier_export.py` | Export trigger — launches `retrain_pipeline.py` in background after export |
| `scripts/pipelines/classifier/seed_classifier_dataset.jsonl` | Always included in retraining — anchors the model on known-good examples |

---

## Convergence behavior

Each retraining cycle merges the seed dataset with all accumulated production exports. As production traffic grows, the model progressively reflects real user queries rather than the hand-crafted seed.

```mermaid
flowchart LR
    T0["Cycle 0\n94 seed examples\nCV 1.0 on seed"] --> T1["Cycle 1\n94 + ~100 production\nreal query distribution"]
    T1 --> T2["Cycle 2\n94 + ~200 production\nincreasing coverage"]
    T2 --> TN["Cycle N\nseed becomes minority\nmodel reflects production traffic"]
```

The seed dataset is never removed — it acts as a regularizer that prevents the model from drifting entirely toward production-distribution edge cases.

---

## Consequences

### Positive

- Layer 2 accuracy improves automatically with usage — no human intervention required after initial deployment.
- The champion/challenger gate prevents production regressions from noisy batches.
- `RETRAIN_ON_EXPORT=false` provides a complete off switch without code changes.
- Export archiving keeps `/data/classifier_labels/` clean — only unprocessed exports accumulate.
- The backup mechanism allows manual rollback in the rare case a promoted challenger performs unexpectedly in production.

### Negative / Trade-offs

- **Retraining uses Ollama (bge-m3) on the host.** The retrain script runs inside the container but needs `OLLAMA_LOCAL_URL` reachable.
If Ollama is down at retraining time, the pipeline fails and logs an error — the champion is unchanged.
- **The engine requires a restart to load the new model.** Promotions are invisible to users until restart. In low-traffic periods this is acceptable; high-traffic deployments may need an orchestrated restart strategy.
- **Accumulated exports grow unboundedly if retraining fails.** If the pipeline consistently fails (e.g., Ollama unreachable), exports accumulate in `/data/classifier_labels/` without being archived. A monitoring alert on directory size is recommended.

### Open questions

1. **Orchestrated restart policy:** Who triggers the engine restart after a promotion? Currently manual. Could be automated via a health-check endpoint that returns a `model_updated` flag, allowing the orchestrator to restart when traffic is low.
2. **Held-out set stability:** The held-out set is resampled on every retraining cycle from the merged dataset. As the dataset grows, the held-out set changes between cycles, making champion accuracy scores not directly comparable across cycles. A fixed held-out set (frozen after the first N examples) would improve comparability. Deferred.
3. **Class imbalance over time:** Production traffic may not be balanced across the four query types. If `CODE_GENERATION` is rare in production, its representation in the training set shrinks relative to the seed. The CV gate catches catastrophic imbalance but not gradual drift. A per-class recall threshold could be added to the gate.
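One possible shape for the per-class recall threshold floated in open question 3 is sketched below. The function names and the 0.80 floor are illustrative assumptions, not decided values; the idea is simply that the gate would reject a challenger whose recall on *any* query type falls below the floor, even when overall accuracy looks fine.

```python
# Pure-Python sketch of a per-class recall gate (no sklearn dependency).
# passes_recall_gate and the 0.80 floor are hypothetical, not decided.
from collections import Counter


def per_class_recall(y_true, y_pred):
    """Recall for each class present in y_true."""
    support = Counter(y_true)                               # examples per class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / n for label, n in support.items()}


def passes_recall_gate(y_true, y_pred, floor=0.80):
    """True only if every class clears the recall floor."""
    return all(r >= floor for r in per_class_recall(y_true, y_pred).values())
```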