ADR-0010: Classifier Continuous Retraining — Champion/Challenger Pipeline

Date: 2026-04-10
Status: Accepted
Deciders: Rafael Ruiz (CTO)
Related ADRs: ADR-0008 (Adaptive Query Routing — Layer 2 classifier), ADR-0009 (Per-Type Response Validation)


Context

ADR-0008 Phase 2 deployed a Layer 2 embedding classifier trained on a seed dataset of 94 hand-crafted examples. This model works well for the initial distribution of queries but has two structural limitations:

  1. The seed dataset does not reflect production traffic. Hand-crafted examples are idealized. Real users ask questions with typos, mixed languages, ambiguous phrasing, and domain-specific vocabulary that is not in the seed.

  2. The model never improves without manual intervention. The data flywheel (ADR-0008 Phase 1) accumulates labeled examples automatically via classify_history_store, but nothing uses them. Data piles up in /data/classifier_labels/ and the model stays frozen at its initial accuracy.

The consequence is that Layer 2 confidence degrades over time relative to the actual query distribution, pushing more requests to Layer 3 (LLM classifier) than necessary.


Decision

Implement a Champion/Challenger automatic retraining pipeline that triggers every time a new batch of labeled data is exported.

Core design

flowchart TD
    S[classify_history_store\naccumulates sessions] -->|100 sessions| EX[classifier_export.py\nexport JSONL]
    EX -->|RETRAIN_ON_EXPORT=true| RT[retrain_pipeline.py\nbackground thread]

    RT --> LOAD[Load seed + all exports]
    LOAD --> EMB[Embed with bge-m3]
    EMB --> SPLIT[Stratified 80/20 split\ntrain / held-out]

    SPLIT --> CV[Cross-validate challenger\nStratifiedKFold]
    CV -->|CV < 0.90| ABORT[Abort — do not deploy\nlog alert]
    CV -->|CV ≥ 0.90| TRAIN[Train challenger\non full train split]

    TRAIN --> EVAL_CH[Evaluate challenger\non held-out set]
    EVAL_CH --> EVAL_CP[Evaluate champion\non same held-out set]

    EVAL_CP --> DECISION{challenger ≥ champion?}
    DECISION -->|yes| BACKUP[Backup champion]
    BACKUP --> DEPLOY[Deploy challenger\noverwrite CLASSIFIER_MODEL_PATH]
    DEPLOY --> ARCHIVE[Archive processed exports]

    DECISION -->|no| KEEP[Keep champion\nlog alert\ndiscard challenger]
    KEEP --> ARCHIVE
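As a concrete illustration, the accept/reject decision at the CV gate might look like the following minimal sketch (the names `cv_gate` and `MIN_CV_ACCURACY` are illustrative, not taken from retrain_pipeline.py, and the fold scores stand in for StratifiedKFold output):

```python
from statistics import mean

MIN_CV_ACCURACY = 0.90  # mirrors the CLASSIFIER_MIN_CV_ACCURACY default

def cv_gate(fold_accuracies: list, threshold: float = MIN_CV_ACCURACY) -> bool:
    """Return True if the challenger may proceed to full training."""
    return mean(fold_accuracies) >= threshold

# A challenger trained on insufficient or unbalanced data fails the gate:
assert cv_gate([0.95, 0.92, 0.91, 0.93, 0.94])      # mean 0.93: proceed
assert not cv_gate([0.88, 0.85, 0.90, 0.87, 0.86])  # mean 0.872: abort, log alert
```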

Champion/Challenger semantics

The model currently in production is the champion. The newly trained model is the challenger. The challenger is only promoted if its accuracy on the held-out set is greater than or equal to the champion's accuracy on the same set.

This ensures the production model never regresses on the held-out metric. A retraining run triggered by noisy or unbalanced data will produce a challenger that loses the comparison and is discarded automatically.

If no champion exists (first deployment), the challenger is promoted unconditionally provided CV accuracy ≥ 0.90.
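The promotion rule above, including the first-deployment case, can be sketched as follows (`promote_challenger` is a hypothetical name; accuracies are assumed to be measured on the shared held-out set):

```python
from typing import Optional

def promote_challenger(challenger_acc: float,
                       champion_acc: Optional[float],
                       cv_passed: bool) -> bool:
    """Decide whether the challenger replaces the champion."""
    if not cv_passed:
        return False      # failed the CV gate: never deploy
    if champion_acc is None:
        return True       # first deployment: no champion to beat
    return challenger_acc >= champion_acc  # a tie or better promotes

assert promote_challenger(0.92, None, cv_passed=True)      # first deployment
assert promote_challenger(0.93, 0.93, cv_passed=True)      # tie still promotes
assert not promote_challenger(0.91, 0.93, cv_passed=True)  # regression: keep champion
```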

Trigger

The pipeline is triggered by classifier_export.py after every successful export — which happens every CLASSIFIER_EXPORT_THRESHOLD sessions (default: 100).

The pipeline runs in a background daemon thread inside the engine container. It does not block gRPC request handling. A 10-minute hard timeout prevents runaway retraining from consuming resources indefinitely.
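The fire-and-forget trigger could be sketched like this (illustrative only; the real logic lives in classifier_export.py, and `launch_retrain` is a hypothetical name):

```python
import subprocess
import sys
import threading

RETRAIN_TIMEOUT_S = 600  # the 10-minute hard timeout from this ADR

def launch_retrain(cmd: list) -> threading.Thread:
    """Run the retrain pipeline in a background daemon thread.

    gRPC handling continues while the subprocess runs; the hard timeout
    stops runaway retraining. In production `cmd` would be something
    like ["python", RETRAIN_SCRIPT_PATH].
    """
    def _run():
        try:
            subprocess.run(cmd, timeout=RETRAIN_TIMEOUT_S, check=True)
        except subprocess.TimeoutExpired:
            print("[classifier_export] retraining timed out; champion unchanged")
        except subprocess.CalledProcessError as exc:
            print(f"[classifier_export] retraining failed: {exc}; champion unchanged")

    thread = threading.Thread(target=_run, daemon=True, name="retrain")
    thread.start()
    return thread

# Demo with a trivial stand-in command (the real caller never joins;
# the thread is fire-and-forget):
t = launch_retrain([sys.executable, "-c", "pass"])
t.join(timeout=30)
```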

Model loading after promotion

The engine does not hot-reload the model mid-operation. The new champion is written to CLASSIFIER_MODEL_PATH and loaded on the next engine restart. This is intentional:

  • Hot-reload would require locking around every inference call, adding latency.
  • Mid-session model changes could produce inconsistent classification for the same user within a conversation.
  • Docker container restarts are cheap and already part of the deployment workflow.

The engine logs a clear message after promotion:

[classifier_export] retraining completed — restart engine to load new model

Safety mechanisms

| Mechanism | Purpose |
| --- | --- |
| CV accuracy gate (≥ 0.90) | Rejects challengers trained on insufficient or unbalanced data before the held-out comparison |
| Champion comparison on held-out | Prevents regression — challenger must equal or beat the current production model |
| Champion backup before overwrite | classifier_model_backup_{timestamp}.pkl — roll back manually if needed |
| Export archiving | Processed files moved to CLASSIFIER_ARCHIVE_DIR — prevents re-inclusion in future runs |
| 10-minute subprocess timeout | Prevents runaway retraining from blocking the engine indefinitely |
| RETRAIN_ON_EXPORT=false | Disables automatic retraining without code changes — useful in staging or during debugging |

Environment variables

| Variable | Default | Purpose |
| --- | --- | --- |
| CLASSIFIER_EXPORT_THRESHOLD | 100 | Sessions before export + retrain trigger |
| RETRAIN_ON_EXPORT | true | Enable/disable automatic retraining |
| RETRAIN_SCRIPT_PATH | /app/scripts/pipelines/classifier/retrain_pipeline.py | Path to retrain script inside container |
| CLASSIFIER_ARCHIVE_DIR | /data/classifier_labels/archived | Where processed exports are moved after retraining |
| CLASSIFIER_SEED_DATASET | /app/scripts/pipelines/classifier/seed_classifier_dataset.jsonl | Seed dataset always included in retraining |
| CLASSIFIER_MIN_CV_ACCURACY | 0.90 | Minimum CV accuracy for challenger to proceed |
| CLASSIFIER_HELD_OUT_RATIO | 0.20 | Fraction of merged dataset reserved for champion/challenger comparison |
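Assuming standard environment-variable lookups, reading these knobs with their defaults might look like the following hypothetical helper (the real scripts may parse them differently):

```python
import os

def retrain_config() -> dict:
    """Read the retraining knobs listed above, with their documented defaults."""
    return {
        "export_threshold": int(os.environ.get("CLASSIFIER_EXPORT_THRESHOLD", "100")),
        "retrain_on_export": os.environ.get("RETRAIN_ON_EXPORT", "true").lower() == "true",
        "min_cv_accuracy": float(os.environ.get("CLASSIFIER_MIN_CV_ACCURACY", "0.90")),
        "held_out_ratio": float(os.environ.get("CLASSIFIER_HELD_OUT_RATIO", "0.20")),
    }

# RETRAIN_ON_EXPORT=false acts as the complete off switch:
os.environ["RETRAIN_ON_EXPORT"] = "false"
assert retrain_config()["retrain_on_export"] is False
```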

Files

| File | Role |
| --- | --- |
| scripts/pipelines/classifier/retrain_pipeline.py | Champion/Challenger training, evaluation, promotion, and archiving |
| Docker/src/utils/classifier_export.py | Export trigger — launches retrain_pipeline.py in background after export |
| scripts/pipelines/classifier/seed_classifier_dataset.jsonl | Always included in retraining — anchors the model on known-good examples |

Convergence behavior

Each retraining cycle merges the seed dataset with all accumulated production exports. As production traffic grows, the model progressively reflects real user queries rather than the hand-crafted seed.

flowchart LR
    T0["Cycle 0\n94 seed examples\nCV 1.0 on seed"] --> T1["Cycle 1\n94 + ~100 production\nreal query distribution"]
    T1 --> T2["Cycle 2\n94 + ~200 production\nincreasing coverage"]
    T2 --> TN["Cycle N\nseed becomes minority\nmodel reflects production traffic"]

The seed dataset is never removed — it acts as a regularizer that prevents the model from drifting entirely to production distribution edge cases.
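The merge step described above can be sketched as follows (`load_training_data` is a hypothetical name and the JSONL record shape is an assumption; the labels in the demo are illustrative placeholders):

```python
import json
import tempfile
from pathlib import Path

def load_training_data(seed_path: Path, export_dir: Path) -> list:
    """Merge the seed dataset with every accumulated export (seed first).

    Each JSONL line is assumed to hold an object like
    {"text": ..., "label": ...}.
    """
    examples = []
    for path in [seed_path, *sorted(export_dir.glob("*.jsonl"))]:
        with path.open() as f:
            examples.extend(json.loads(line) for line in f if line.strip())
    return examples

# Demo with a throwaway seed plus one export batch:
root = Path(tempfile.mkdtemp())
(root / "seed.jsonl").write_text('{"text": "hi", "label": "SMALL_TALK"}\n')
exports = root / "exports"
exports.mkdir()
(exports / "batch_001.jsonl").write_text(
    '{"text": "fix this bug", "label": "CODE_GENERATION"}\n'
)
merged = load_training_data(root / "seed.jsonl", exports)
assert len(merged) == 2 and merged[0]["text"] == "hi"
```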


Consequences

Positive

  • Layer 2 accuracy improves automatically with usage — no human intervention required after initial deployment.
  • The champion/challenger gate prevents production regressions from noisy batches.
  • RETRAIN_ON_EXPORT=false provides a complete off switch without code changes.
  • Export archiving keeps /data/classifier_labels/ clean — only unprocessed exports accumulate.
  • The backup mechanism allows manual rollback in the rare case a promoted challenger performs unexpectedly in production.

Negative / Trade-offs

  • Retraining uses Ollama (bge-m3) on the host. The retrain script runs inside the container but needs OLLAMA_LOCAL_URL reachable. If Ollama is down at retraining time, the pipeline fails and logs an error — the champion is unchanged.
  • The engine requires a restart to load the new model. Promotions are invisible to users until restart. In low-traffic periods this is acceptable; high-traffic deployments may need an orchestrated restart strategy.
  • Accumulated exports grow unboundedly if retraining fails. If the pipeline consistently fails (e.g., Ollama unreachable), exports accumulate in /data/classifier_labels/ without being archived. A monitoring alert on directory size is recommended.

Open questions

  1. Orchestrated restart policy: Who triggers the engine restart after a promotion? Currently manual. Could be automated via a health-check endpoint that returns a model_updated flag, allowing the orchestrator to restart when traffic is low.

  2. Held-out set stability: The held-out set is resampled on every retraining cycle from the merged dataset. As the dataset grows, the held-out set changes between cycles, making champion accuracy scores not directly comparable across cycles. A fixed held-out set (frozen after the first N examples) would improve comparability. Deferred.

  3. Class imbalance over time: Production traffic may not be balanced across the four query types. If CODE_GENERATION is rare in production, its representation in the training set shrinks relative to the seed. The CV gate catches catastrophic imbalance but not gradual drift. A per-class recall threshold could be added to the gate.
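A per-class recall addition to the gate, as suggested in question 3, could be sketched like this (the 0.80 floor and every label except CODE_GENERATION are assumptions, not values defined by this ADR):

```python
from collections import defaultdict

def per_class_recall(y_true: list, y_pred: list) -> dict:
    """Recall for each true class on the held-out set."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        totals[truth] += 1
        if pred == truth:
            hits[truth] += 1
    return {cls: hits[cls] / totals[cls] for cls in totals}

def recall_gate(recalls: dict, floor: float = 0.80) -> bool:
    """Reject a challenger whose recall on any single class falls below
    the floor, even if its overall accuracy passes."""
    return all(r >= floor for r in recalls.values())

# A challenger that neglects a rare class fails this gate despite
# 75% overall accuracy ("OTHER" is an illustrative label):
recalls = per_class_recall(
    ["CODE_GENERATION", "OTHER", "OTHER", "OTHER"],
    ["OTHER", "OTHER", "OTHER", "OTHER"],
)
assert recalls["CODE_GENERATION"] == 0.0
assert not recall_gate(recalls)
```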