# ADR-0010: Classifier Continuous Retraining — Champion/Challenger Pipeline

**Date:** 2026-04-10

**Status:** Accepted

**Deciders:** Rafael Ruiz (CTO)

**Related ADRs:** ADR-0008 (Adaptive Query Routing — Layer 2 classifier), ADR-0009 (Per-Type Response Validation)

---

## Context

ADR-0008 Phase 2 deployed a Layer 2 embedding classifier trained on a **seed dataset of 94 hand-crafted examples**. This model works well for the initial distribution of queries but has two structural limitations:

1. **The seed dataset does not reflect production traffic.** Hand-crafted examples are idealized. Real users ask questions with typos, mixed languages, ambiguous phrasing, and domain-specific vocabulary that is not in the seed.

2. **The model never improves without manual intervention.** The data flywheel (ADR-0008 Phase 1) accumulates labeled examples automatically via `classify_history_store`, but nothing uses them. Data piles up in `/data/classifier_labels/` and the model stays frozen at its initial accuracy.

The consequence is that Layer 2 confidence degrades over time relative to the actual query distribution, pushing more requests to Layer 3 (the LLM classifier) than necessary.

---

## Decision

Implement a **Champion/Challenger automatic retraining pipeline** that triggers every time a new batch of labeled data is exported.

### Core design

```mermaid
flowchart TD
    S[classify_history_store\naccumulates sessions] -->|100 sessions| EX[classifier_export.py\nexport JSONL]
    EX -->|RETRAIN_ON_EXPORT=true| RT[retrain_pipeline.py\nbackground thread]

    RT --> LOAD[Load seed + all exports]
    LOAD --> EMB[Embed with bge-m3]
    EMB --> SPLIT[Stratified 80/20 split\ntrain / held-out]

    SPLIT --> CV[Cross-validate challenger\nStratifiedKFold]
    CV -->|CV < 0.90| ABORT[Abort — do not deploy\nlog alert]
    CV -->|CV ≥ 0.90| TRAIN[Train challenger\non full train split]

    TRAIN --> EVAL_CH[Evaluate challenger\non held-out set]
    EVAL_CH --> EVAL_CP[Evaluate champion\non same held-out set]

    EVAL_CP --> DECISION{challenger ≥ champion?}
    DECISION -->|yes| BACKUP[Backup champion]
    BACKUP --> DEPLOY[Deploy challenger\noverwrite CLASSIFIER_MODEL_PATH]
    DEPLOY --> ARCHIVE[Archive processed exports]

    DECISION -->|no| KEEP[Keep champion\nlog alert\ndiscard challenger]
    KEEP --> ARCHIVE
```
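
The stratified 80/20 split step above can be sketched in pure Python. This is a hypothetical helper for illustration; the actual `retrain_pipeline.py` may well use scikit-learn's stratified utilities instead:

```python
import random
from collections import defaultdict

def stratified_split(examples, held_out_ratio=0.20, seed=42):
    """Split (text, label) pairs so every label keeps roughly the same
    proportion in both the train split and the held-out split."""
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))
    rng = random.Random(seed)
    train, held_out = [], []
    for label, items in sorted(by_label.items()):
        rng.shuffle(items)
        # reserve held_out_ratio of each class, at least one example
        cut = max(1, int(len(items) * held_out_ratio))
        held_out.extend(items[:cut])
        train.extend(items[cut:])
    return train, held_out
```

Stratifying per class matters here because production exports are unlikely to be balanced across the four query types, and a plain random split could leave a rare class out of the held-out set entirely.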

### Champion/Challenger semantics

The model currently in production is the **champion**; the newly trained model is the **challenger**. The challenger is promoted only if its accuracy on the held-out set is **greater than or equal to** the champion's accuracy on the same set.

This keeps the production model from regressing on the held-out comparison: a retraining run triggered by noisy or unbalanced data produces a challenger that loses the comparison and is discarded automatically.

If no champion exists (first deployment), the challenger is promoted unconditionally, provided its cross-validation accuracy clears the 0.90 gate.

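
The promotion rule reduces to a small pure function. This is a sketch with illustrative names, not the actual `retrain_pipeline.py` API:

```python
def should_promote(challenger_acc, champion_acc, cv_acc, min_cv=0.90):
    """Champion/Challenger gate: cross-validation must clear the floor,
    and the challenger must equal or beat the champion on the held-out set.
    champion_acc is None on first deployment (no champion yet)."""
    if cv_acc < min_cv:
        return False            # CV gate failed: abort, do not deploy
    if champion_acc is None:
        return True             # first deployment: promote unconditionally
    return challenger_acc >= champion_acc
```

Note that the CV gate is checked first, so a first deployment still cannot promote a model trained on insufficient data.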
### Trigger
|
|
|
|
The pipeline is triggered by `classifier_export.py` after every successful export — which happens every `CLASSIFIER_EXPORT_THRESHOLD` sessions (default: **100**).
|
|
|
|
The pipeline runs in a **background daemon thread** inside the engine container. It does not block gRPC request handling. A 10-minute hard timeout prevents runaway retraining from consuming resources indefinitely.
|
|
|
|
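
A minimal sketch of this trigger, assuming the pipeline is run via `subprocess.run` with a hard timeout (the actual wiring in `classifier_export.py` may differ):

```python
import subprocess
import sys
import threading

RETRAIN_TIMEOUT_S = 600  # 10-minute hard timeout

def launch_retrain(script_path, log):
    """Run the retrain pipeline in a daemon thread so the caller returns
    immediately; the subprocess is killed if it exceeds the timeout."""
    def _run():
        try:
            subprocess.run([sys.executable, script_path],
                           timeout=RETRAIN_TIMEOUT_S, check=True)
            log("[classifier_export] retraining completed — restart engine to load new model")
        except subprocess.TimeoutExpired:
            log("[classifier_export] retraining timed out; champion unchanged")
        except subprocess.CalledProcessError as exc:
            log(f"[classifier_export] retraining failed (exit {exc.returncode})")

    thread = threading.Thread(target=_run, daemon=True)
    thread.start()
    return thread
```

Running the pipeline as a subprocess rather than in-process also means a crash or hang in retraining can never take the engine down with it.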
### Model loading after promotion
|
|
|
|
The engine does **not** hot-reload the model mid-operation. The new champion is written to `CLASSIFIER_MODEL_PATH` and loaded on the next engine restart. This is intentional:
|
|
|
|
- Hot-reload would require locking around every inference call, adding latency.
|
|
- Mid-session model changes could produce inconsistent classification for the same user within a conversation.
|
|
- Docker container restarts are cheap and already part of the deployment workflow.
|
|
|
|
The engine logs a clear message after promotion:
|
|
|
|
```
|
|
[classifier_export] retraining completed — restart engine to load new model
|
|
```
|
|
|
|
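
At startup the engine simply loads whatever model file is at `CLASSIFIER_MODEL_PATH`. A sketch, assuming the champion is stored as a pickle (the loader name and storage format are assumptions):

```python
import os
import pickle

def load_champion(env=None):
    """Load the current champion once, at engine startup. A challenger
    promoted mid-run is only picked up on the next restart."""
    env = os.environ if env is None else env
    path = env["CLASSIFIER_MODEL_PATH"]
    with open(path, "rb") as f:
        return pickle.load(f)
```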
### Safety mechanisms
|
|
|
|
| Mechanism | Purpose |
|
|
|---|---|
|
|
| CV accuracy gate (≥ 0.90) | Rejects challengers trained on insufficient or unbalanced data before the held-out comparison |
|
|
| Champion comparison on held-out | Prevents regression — challenger must equal or beat the current production model |
|
|
| Champion backup before overwrite | `classifier_model_backup_{timestamp}.pkl` — roll back manually if needed |
|
|
| Export archiving | Processed files moved to `CLASSIFIER_ARCHIVE_DIR` — prevents re-inclusion in future runs |
|
|
| 10-minute subprocess timeout | Prevents runaway retraining from blocking the engine indefinitely |
|
|
| `RETRAIN_ON_EXPORT=false` | Disables automatic retraining without code changes — useful in staging or during debugging |
|
|
|
|
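
The backup and overwrite steps from the table can be sketched as follows (a hypothetical helper; `os.replace` keeps the final overwrite atomic on a single filesystem, so the engine never sees a half-written model file):

```python
import os
import shutil
import time

def backup_and_deploy(champion_path, challenger_path):
    """Back up the current champion, then overwrite it with the challenger.
    Backup naming follows classifier_model_backup_{timestamp}.pkl."""
    if os.path.exists(champion_path):
        ts = time.strftime("%Y%m%d_%H%M%S")
        backup = os.path.join(
            os.path.dirname(champion_path),
            f"classifier_model_backup_{ts}.pkl",
        )
        shutil.copy2(champion_path, backup)
    # atomic rename: readers see either the old model or the new one
    os.replace(challenger_path, champion_path)
```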
---
## Environment variables

| Variable | Default | Purpose |
|---|---|---|
| `CLASSIFIER_EXPORT_THRESHOLD` | `100` | Sessions accumulated before an export + retrain trigger |
| `RETRAIN_ON_EXPORT` | `true` | Enable/disable automatic retraining |
| `RETRAIN_SCRIPT_PATH` | `/app/scripts/pipelines/classifier/retrain_pipeline.py` | Path to the retrain script inside the container |
| `CLASSIFIER_ARCHIVE_DIR` | `/data/classifier_labels/archived` | Where processed exports are moved after retraining |
| `CLASSIFIER_SEED_DATASET` | `/app/scripts/pipelines/classifier/seed_classifier_dataset.jsonl` | Seed dataset always included in retraining |
| `CLASSIFIER_MIN_CV_ACCURACY` | `0.90` | Minimum CV accuracy for a challenger to proceed |
| `CLASSIFIER_HELD_OUT_RATIO` | `0.20` | Fraction of the merged dataset reserved for the champion/challenger comparison |

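
How these knobs might be read at startup (a sketch; the function name is illustrative, and the defaults mirror the table above):

```python
import os

def retrain_config(env=None):
    """Read the retraining knobs, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return {
        "export_threshold": int(env.get("CLASSIFIER_EXPORT_THRESHOLD", "100")),
        "retrain_on_export": env.get("RETRAIN_ON_EXPORT", "true").lower() == "true",
        "min_cv_accuracy": float(env.get("CLASSIFIER_MIN_CV_ACCURACY", "0.90")),
        "held_out_ratio": float(env.get("CLASSIFIER_HELD_OUT_RATIO", "0.20")),
        "archive_dir": env.get("CLASSIFIER_ARCHIVE_DIR", "/data/classifier_labels/archived"),
    }
```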
---
## Files

| File | Role |
|---|---|
| `scripts/pipelines/classifier/retrain_pipeline.py` | Champion/Challenger training, evaluation, promotion, and archiving |
| `Docker/src/utils/classifier_export.py` | Export trigger; launches `retrain_pipeline.py` in the background after each export |
| `scripts/pipelines/classifier/seed_classifier_dataset.jsonl` | Always included in retraining; anchors the model on known-good examples |

---

## Convergence behavior

Each retraining cycle merges the seed dataset with all accumulated production exports. As production traffic grows, the model progressively reflects real user queries rather than the hand-crafted seed.

```mermaid
flowchart LR
    T0["Cycle 0\n94 seed examples\nCV 1.0 on seed"] --> T1["Cycle 1\n94 + ~100 production\nreal query distribution"]
    T1 --> T2["Cycle 2\n94 + ~200 production\nincreasing coverage"]
    T2 --> TN["Cycle N\nseed becomes minority\nmodel reflects production traffic"]
```

The seed dataset is never removed; it acts as a regularizer that prevents the model from drifting entirely toward edge cases of the production distribution.

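
The merge step could look like this (a sketch; it assumes exports are JSONL files with one labeled example per line, which matches the export format described in ADR-0008 but is not confirmed here):

```python
import json
from pathlib import Path

def load_training_set(seed_path, exports_dir):
    """Merge the seed dataset with every accumulated export.
    The seed always comes first; exports are read in sorted order."""
    examples = []
    paths = [Path(seed_path)] + sorted(Path(exports_dir).glob("*.jsonl"))
    for p in paths:
        with p.open(encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:
                    examples.append(json.loads(line))
    return examples
```

Because archived exports are moved out of the export directory after each run, only the seed plus unprocessed exports are ever merged.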
---
## Consequences

### Positive

- Layer 2 accuracy improves automatically with usage; no human intervention is required after the initial deployment.
- The champion/challenger gate prevents production regressions from noisy batches.
- `RETRAIN_ON_EXPORT=false` provides a complete off switch without code changes.
- Export archiving keeps `/data/classifier_labels/` clean; only unprocessed exports accumulate.
- The backup mechanism allows manual rollback in the rare case a promoted challenger performs unexpectedly in production.

### Negative / Trade-offs

- **Retraining uses Ollama (bge-m3) on the host.** The retrain script runs inside the container but needs `OLLAMA_LOCAL_URL` to be reachable. If Ollama is down at retraining time, the pipeline fails and logs an error; the champion is unchanged.
- **The engine requires a restart to load the new model.** Promotions are invisible to users until the restart. In low-traffic periods this is acceptable; high-traffic deployments may need an orchestrated restart strategy.
- **Accumulated exports grow without bound if retraining fails.** If the pipeline consistently fails (e.g., Ollama unreachable), exports accumulate in `/data/classifier_labels/` without being archived. A monitoring alert on directory size is recommended.

### Open questions

1. **Orchestrated restart policy:** Who triggers the engine restart after a promotion? Currently this is manual. It could be automated via a health-check endpoint that returns a `model_updated` flag, allowing the orchestrator to restart when traffic is low.

2. **Held-out set stability:** The held-out set is resampled from the merged dataset on every retraining cycle. As the dataset grows, the held-out set changes between cycles, so champion accuracy scores are not directly comparable across cycles. A fixed held-out set (frozen after the first N examples) would improve comparability. Deferred.

3. **Class imbalance over time:** Production traffic may not be balanced across the four query types. If `CODE_GENERATION` is rare in production, its representation in the training set shrinks relative to the seed. The CV gate catches catastrophic imbalance but not gradual drift. A per-class recall threshold could be added to the gate.