ADR-0010: Classifier Continuous Retraining — Champion/Challenger Pipeline
Date: 2026-04-10
Status: Accepted
Deciders: Rafael Ruiz (CTO)
Related ADRs: ADR-0008 (Adaptive Query Routing — Layer 2 classifier), ADR-0009 (Per-Type Response Validation)
Context
ADR-0008 Phase 2 deployed a Layer 2 embedding classifier trained on a seed dataset of 94 hand-crafted examples. This model works well for the initial distribution of queries but has two structural limitations:
- The seed dataset does not reflect production traffic. Hand-crafted examples are idealized. Real users ask questions with typos, mixed languages, ambiguous phrasing, and domain-specific vocabulary that is not in the seed.
- The model never improves without manual intervention. The data flywheel (ADR-0008 Phase 1) accumulates labeled examples automatically via `classify_history_store`, but nothing uses them. Data piles up in `/data/classifier_labels/` and the model stays frozen at its initial accuracy.
The consequence is that Layer 2 confidence degrades over time relative to the actual query distribution, pushing more requests to Layer 3 (LLM classifier) than necessary.
Decision
Implement a Champion/Challenger automatic retraining pipeline that triggers every time a new batch of labeled data is exported.
Core design
```mermaid
flowchart TD
    S[classify_history_store\naccumulates sessions] -->|100 sessions| EX[classifier_export.py\nexport JSONL]
    EX -->|RETRAIN_ON_EXPORT=true| RT[retrain_pipeline.py\nbackground thread]
    RT --> LOAD[Load seed + all exports]
    LOAD --> EMB[Embed with bge-m3]
    EMB --> SPLIT[Stratified 80/20 split\ntrain / held-out]
    SPLIT --> CV[Cross-validate challenger\nStratifiedKFold]
    CV -->|CV < 0.90| ABORT[Abort — do not deploy\nlog alert]
    CV -->|CV ≥ 0.90| TRAIN[Train challenger\non full train split]
    TRAIN --> EVAL_CH[Evaluate challenger\non held-out set]
    EVAL_CH --> EVAL_CP[Evaluate champion\non same held-out set]
    EVAL_CP --> DECISION{challenger ≥ champion?}
    DECISION -->|yes| BACKUP[Backup champion]
    BACKUP --> DEPLOY[Deploy challenger\noverwrite CLASSIFIER_MODEL_PATH]
    DEPLOY --> ARCHIVE[Archive processed exports]
    DECISION -->|no| KEEP[Keep champion\nlog alert\ndiscard challenger]
    KEEP --> ARCHIVE
```
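The stratified 80/20 split in the middle of this flow can be sketched in pure Python. This is an illustrative helper, not the actual `retrain_pipeline.py` code; the function name and the `{"label": ..., "text": ...}` example schema are assumptions.

```python
import random
from collections import defaultdict

def stratified_split(examples, held_out_ratio=0.20, seed=42):
    """Split labeled examples per class so the held-out set mirrors the
    class distribution of the merged dataset (hypothetical helper)."""
    by_label = defaultdict(list)
    for ex in examples:
        by_label[ex["label"]].append(ex)
    rng = random.Random(seed)  # fixed seed: reproducible splits per cycle
    train, held_out = [], []
    for label, items in by_label.items():
        rng.shuffle(items)
        k = max(1, int(len(items) * held_out_ratio))  # at least 1 per class
        held_out.extend(items[:k])
        train.extend(items[k:])
    return train, held_out
```

A per-class `k` guarantees every query type appears in the held-out set, which a naive random split does not.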
Champion/Challenger semantics
The model currently in production is the champion. The newly trained model is the challenger. The challenger is only promoted if its accuracy on the held-out set is greater than or equal to the champion's accuracy on the same set.
This guarantees that the production model never regresses on the held-out benchmark. A retraining run triggered by noisy or unbalanced data will produce a challenger that loses the comparison and is discarded automatically.
If no champion exists (first deployment), the challenger is promoted unconditionally provided CV accuracy ≥ 0.90.
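The promotion rules described above reduce to a small decision function. A minimal sketch, with illustrative names (the real pipeline's internals may differ):

```python
def promotion_decision(champion_acc, challenger_acc, cv_acc, min_cv=0.90):
    """Champion/Challenger gate. Returns 'abort', 'promote', or 'keep_champion'.
    champion_acc is None on first deployment (no champion exists yet)."""
    if cv_acc < min_cv:
        return "abort"            # CV gate fails: never reach the comparison
    if champion_acc is None:
        return "promote"          # first deployment: promote unconditionally
    if challenger_acc >= champion_acc:
        return "promote"          # ties go to the challenger (fresher data)
    return "keep_champion"        # discard challenger, log alert
```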
Trigger
The pipeline is triggered by classifier_export.py after every successful export — which happens every CLASSIFIER_EXPORT_THRESHOLD sessions (default: 100).
The pipeline runs in a background daemon thread inside the engine container. It does not block gRPC request handling. A 10-minute hard timeout prevents runaway retraining from consuming resources indefinitely.
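A non-blocking launcher with a hard timeout might look like the following. This is a sketch of the pattern, not the actual `classifier_export.py` internals; the function name is an assumption.

```python
import subprocess
import sys
import threading

def launch_retrain(script_path, timeout_s=600):
    """Run the retrain pipeline in a daemon thread so gRPC handling is
    never blocked; the subprocess is killed after timeout_s seconds."""
    def _run():
        try:
            subprocess.run([sys.executable, script_path],
                           timeout=timeout_s, check=True)
        except subprocess.TimeoutExpired:
            print("[classifier_export] retraining timed out; champion unchanged")
        except subprocess.CalledProcessError as e:
            print(f"[classifier_export] retraining failed (exit {e.returncode})")
    t = threading.Thread(target=_run, daemon=True)
    t.start()
    return t
```

The daemon flag matters: if the engine container stops, the thread dies with it instead of keeping the process alive.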
Model loading after promotion
The engine does not hot-reload the model mid-operation. The new champion is written to CLASSIFIER_MODEL_PATH and loaded on the next engine restart. This is intentional:
- Hot-reload would require locking around every inference call, adding latency.
- Mid-session model changes could produce inconsistent classification for the same user within a conversation.
- Docker container restarts are cheap and already part of the deployment workflow.
The engine logs a clear message after promotion:

```
[classifier_export] retraining completed — restart engine to load new model
```
Safety mechanisms
| Mechanism | Purpose |
|---|---|
| CV accuracy gate (≥ 0.90) | Rejects challengers trained on insufficient or unbalanced data before the held-out comparison |
| Champion comparison on held-out | Prevents regression — challenger must equal or beat the current production model |
| Champion backup before overwrite | `classifier_model_backup_{timestamp}.pkl` — roll back manually if needed |
| Export archiving | Processed files moved to `CLASSIFIER_ARCHIVE_DIR` — prevents re-inclusion in future runs |
| 10-minute subprocess timeout | Prevents runaway retraining from blocking the engine indefinitely |
| `RETRAIN_ON_EXPORT=false` | Disables automatic retraining without code changes — useful in staging or during debugging |
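The archiving mechanism is simple enough to sketch directly. Function and parameter names are illustrative; the real pipeline may organize this differently.

```python
import shutil
from pathlib import Path

def archive_exports(export_dir, archive_dir):
    """Move processed export files into the archive directory so they are
    never re-included in a future retraining run."""
    archive = Path(archive_dir)
    archive.mkdir(parents=True, exist_ok=True)
    moved = []
    # glob is non-recursive, so already-archived files are never touched
    for path in sorted(Path(export_dir).glob("*.jsonl")):
        dest = archive / path.name
        shutil.move(str(path), str(dest))
        moved.append(str(dest))
    return moved
```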
Environment variables
| Variable | Default | Purpose |
|---|---|---|
| `CLASSIFIER_EXPORT_THRESHOLD` | `100` | Sessions before export + retrain trigger |
| `RETRAIN_ON_EXPORT` | `true` | Enable/disable automatic retraining |
| `RETRAIN_SCRIPT_PATH` | `/app/scripts/pipelines/classifier/retrain_pipeline.py` | Path to retrain script inside container |
| `CLASSIFIER_ARCHIVE_DIR` | `/data/classifier_labels/archived` | Where processed exports are moved after retraining |
| `CLASSIFIER_SEED_DATASET` | `/app/scripts/pipelines/classifier/seed_classifier_dataset.jsonl` | Seed dataset always included in retraining |
| `CLASSIFIER_MIN_CV_ACCURACY` | `0.90` | Minimum CV accuracy for challenger to proceed |
| `CLASSIFIER_HELD_OUT_RATIO` | `0.20` | Fraction of merged dataset reserved for champion/challenger comparison |
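Reading this configuration with the defaults from the table might look like the following sketch (the real scripts may parse these differently; the dict keys are assumptions):

```python
import os

def retrain_config():
    """Load pipeline settings from the environment, falling back to the
    documented defaults when a variable is unset."""
    return {
        "export_threshold": int(os.getenv("CLASSIFIER_EXPORT_THRESHOLD", "100")),
        "retrain_on_export": os.getenv("RETRAIN_ON_EXPORT", "true").lower() == "true",
        "min_cv_accuracy": float(os.getenv("CLASSIFIER_MIN_CV_ACCURACY", "0.90")),
        "held_out_ratio": float(os.getenv("CLASSIFIER_HELD_OUT_RATIO", "0.20")),
        "archive_dir": os.getenv("CLASSIFIER_ARCHIVE_DIR",
                                 "/data/classifier_labels/archived"),
    }
```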
Files
| File | Role |
|---|---|
| `scripts/pipelines/classifier/retrain_pipeline.py` | Champion/Challenger training, evaluation, promotion, and archiving |
| `Docker/src/utils/classifier_export.py` | Export trigger — launches `retrain_pipeline.py` in background after export |
| `scripts/pipelines/classifier/seed_classifier_dataset.jsonl` | Always included in retraining — anchors the model on known-good examples |
Convergence behavior
Each retraining cycle merges the seed dataset with all accumulated production exports. As production traffic grows, the model progressively reflects real user queries rather than the hand-crafted seed.
```mermaid
flowchart LR
    T0["Cycle 0\n94 seed examples\nCV 1.0 on seed"] -->
    T1["Cycle 1\n94 + ~100 production\nreal query distribution"] -->
    T2["Cycle 2\n94 + ~200 production\nincreasing coverage"] -->
    TN["Cycle N\nseed becomes minority\nmodel reflects production traffic"]
```
The seed dataset is never removed — it acts as a regularizer that prevents the model from drifting entirely to production distribution edge cases.
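The merge step at the start of each cycle can be sketched as follows. The JSONL layout and function name are assumptions based on this ADR, not the real `retrain_pipeline.py` code.

```python
import json
from pathlib import Path

def load_training_set(seed_path, export_dir):
    """Merge the seed dataset with every accumulated export file.
    The seed is always first, so it is included in every cycle."""
    examples = []
    for path in [Path(seed_path), *sorted(Path(export_dir).glob("*.jsonl"))]:
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if line:  # tolerate blank lines between records
                    examples.append(json.loads(line))
    return examples
```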
Consequences
Positive
- Layer 2 accuracy improves automatically with usage — no human intervention required after initial deployment.
- The champion/challenger gate prevents production regressions from noisy batches.
- `RETRAIN_ON_EXPORT=false` provides a complete off switch without code changes.
- Export archiving keeps `/data/classifier_labels/` clean — only unprocessed exports accumulate.
- The backup mechanism allows manual rollback in the rare case a promoted challenger performs unexpectedly in production.
Negative / Trade-offs
- Retraining uses Ollama (bge-m3) on the host. The retrain script runs inside the container but needs `OLLAMA_LOCAL_URL` reachable. If Ollama is down at retraining time, the pipeline fails and logs an error — the champion is unchanged.
- The engine requires a restart to load the new model. Promotions are invisible to users until restart. In low-traffic periods this is acceptable; high-traffic deployments may need an orchestrated restart strategy.
- Accumulated exports grow without bound if retraining fails. If the pipeline consistently fails (e.g., Ollama unreachable), exports accumulate in `/data/classifier_labels/` without being archived. A monitoring alert on directory size is recommended.
Open questions
- Orchestrated restart policy: Who triggers the engine restart after a promotion? Currently manual. Could be automated via a health-check endpoint that returns a `model_updated` flag, allowing the orchestrator to restart when traffic is low.
- Held-out set stability: The held-out set is resampled on every retraining cycle from the merged dataset. As the dataset grows, the held-out set changes between cycles, making champion accuracy scores not directly comparable across cycles. A fixed held-out set (frozen after the first N examples) would improve comparability. Deferred.
- Class imbalance over time: Production traffic may not be balanced across the four query types. If `CODE_GENERATION` is rare in production, its representation in the training set shrinks relative to the seed. The CV gate catches catastrophic imbalance but not gradual drift. A per-class recall threshold could be added to the gate.