# ADR-0010: Classifier Continuous Retraining — Champion/Challenger Pipeline

**Date:** 2026-04-10

**Status:** Accepted

**Deciders:** Rafael Ruiz (CTO)

**Related ADRs:** ADR-0008 (Adaptive Query Routing — Layer 2 classifier), ADR-0009 (Per-Type Response Validation)

---

## Context

ADR-0008 Phase 2 deployed a Layer 2 embedding classifier trained on a **seed dataset of 94 hand-crafted examples**. This model works well for the initial distribution of queries but has two structural limitations:

1. **The seed dataset does not reflect production traffic.** Hand-crafted examples are idealized. Real users ask questions with typos, mixed languages, ambiguous phrasing, and domain-specific vocabulary that is not in the seed.

2. **The model never improves without manual intervention.** The data flywheel (ADR-0008 Phase 1) accumulates labeled examples automatically via `classify_history_store`, but nothing uses them. Data piles up in `/data/classifier_labels/` and the model stays frozen at its initial accuracy.

The consequence is that Layer 2 confidence degrades over time relative to the actual query distribution, pushing more requests to Layer 3 (the LLM classifier) than necessary.

---

## Decision

Implement a **Champion/Challenger automatic retraining pipeline** that triggers every time a new batch of labeled data is exported.

### Core design

```mermaid
flowchart TD
    S[classify_history_store\naccumulates sessions] -->|100 sessions| EX[classifier_export.py\nexport JSONL]
    EX -->|RETRAIN_ON_EXPORT=true| RT[retrain_pipeline.py\nbackground thread]

    RT --> LOAD[Load seed + all exports]
    LOAD --> EMB[Embed with bge-m3]
    EMB --> SPLIT[Stratified 80/20 split\ntrain / held-out]

    SPLIT --> CV[Cross-validate challenger\nStratifiedKFold]
    CV -->|CV < 0.90| ABORT[Abort — do not deploy\nlog alert]
    CV -->|CV ≥ 0.90| TRAIN[Train challenger\non full train split]

    TRAIN --> EVAL_CH[Evaluate challenger\non held-out set]
    EVAL_CH --> EVAL_CP[Evaluate champion\non same held-out set]

    EVAL_CP --> DECISION{challenger ≥ champion?}
    DECISION -->|yes| BACKUP[Backup champion]
    BACKUP --> DEPLOY[Deploy challenger\noverwrite CLASSIFIER_MODEL_PATH]
    DEPLOY --> ARCHIVE[Archive processed exports]

    DECISION -->|no| KEEP[Keep champion\nlog alert\ndiscard challenger]
    KEEP --> ARCHIVE
```
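
The stratified 80/20 split step above can be sketched in pure Python. This is a hypothetical helper for illustration; the actual `retrain_pipeline.py` may well use scikit-learn's stratified utilities instead:

```python
import random
from collections import defaultdict

def stratified_split(examples, held_out_ratio=0.20, seed=42):
    """Split (text, label) pairs so every label keeps roughly the same
    proportion in both the train split and the held-out split."""
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))
    rng = random.Random(seed)
    train, held_out = [], []
    for label, items in sorted(by_label.items()):
        rng.shuffle(items)
        # reserve held_out_ratio of each class, at least one example
        cut = max(1, int(len(items) * held_out_ratio))
        held_out.extend(items[:cut])
        train.extend(items[cut:])
    return train, held_out
```

Stratifying per class matters here because production exports are unlikely to be balanced across the four query types, and a plain random split could leave a rare class out of the held-out set entirely.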

### Champion/Challenger semantics

The model currently in production is the **champion**; the newly trained model is the **challenger**. The challenger is promoted only if its accuracy on the held-out set is **greater than or equal to** the champion's accuracy on the same set.

This keeps the production model from regressing on the held-out comparison: a retraining run triggered by noisy or unbalanced data produces a challenger that loses the comparison and is discarded automatically.

If no champion exists (first deployment), the challenger is promoted unconditionally, provided its cross-validation accuracy clears the 0.90 gate.

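
The promotion rule reduces to a small pure function. This is a sketch with illustrative names, not the actual `retrain_pipeline.py` API:

```python
def should_promote(challenger_acc, champion_acc, cv_acc, min_cv=0.90):
    """Champion/Challenger gate: cross-validation must clear the floor,
    and the challenger must equal or beat the champion on the held-out set.
    champion_acc is None on first deployment (no champion yet)."""
    if cv_acc < min_cv:
        return False            # CV gate failed: abort, do not deploy
    if champion_acc is None:
        return True             # first deployment: promote unconditionally
    return challenger_acc >= champion_acc
```

Note that the CV gate is checked first, so a first deployment still cannot promote a model trained on insufficient data.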
### Trigger
|
|
|
|
The pipeline is triggered by `classifier_export.py` after every successful export — which happens every `CLASSIFIER_EXPORT_THRESHOLD` sessions (default: **100**).
|
|
|
|
The pipeline runs in a **background daemon thread** inside the engine container. It does not block gRPC request handling. A 10-minute hard timeout prevents runaway retraining from consuming resources indefinitely.
|
|
|
|
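
A minimal sketch of this trigger, assuming the pipeline is run via `subprocess.run` with a hard timeout (the actual wiring in `classifier_export.py` may differ):

```python
import subprocess
import sys
import threading

RETRAIN_TIMEOUT_S = 600  # 10-minute hard timeout

def launch_retrain(script_path, log):
    """Run the retrain pipeline in a daemon thread so the caller returns
    immediately; the subprocess is killed if it exceeds the timeout."""
    def _run():
        try:
            subprocess.run([sys.executable, script_path],
                           timeout=RETRAIN_TIMEOUT_S, check=True)
            log("[classifier_export] retraining completed — restart engine to load new model")
        except subprocess.TimeoutExpired:
            log("[classifier_export] retraining timed out; champion unchanged")
        except subprocess.CalledProcessError as exc:
            log(f"[classifier_export] retraining failed (exit {exc.returncode})")

    thread = threading.Thread(target=_run, daemon=True)
    thread.start()
    return thread
```

Running the pipeline as a subprocess rather than in-process also means a crash or hang in retraining can never take the engine down with it.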
### Model loading after promotion
|
|
|
|
The engine does **not** hot-reload the model mid-operation. The new champion is written to `CLASSIFIER_MODEL_PATH` and loaded on the next engine restart. This is intentional:
|
|
|
|
- Hot-reload would require locking around every inference call, adding latency.
|
|
- Mid-session model changes could produce inconsistent classification for the same user within a conversation.
|
|
- Docker container restarts are cheap and already part of the deployment workflow.
|
|
|
|
The engine logs a clear message after promotion:
|
|
|
|
```
|
|
[classifier_export] retraining completed — restart engine to load new model
|
|
```
|
|
|
|
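
At startup the engine simply loads whatever model file is at `CLASSIFIER_MODEL_PATH`. A sketch, assuming the champion is stored as a pickle (the loader name and storage format are assumptions):

```python
import os
import pickle

def load_champion(env=None):
    """Load the current champion once, at engine startup. A challenger
    promoted mid-run is only picked up on the next restart."""
    env = os.environ if env is None else env
    path = env["CLASSIFIER_MODEL_PATH"]
    with open(path, "rb") as f:
        return pickle.load(f)
```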
### Safety mechanisms
|
|
|
|
| Mechanism | Purpose |
|
|
|---|---|
|
|
| CV accuracy gate (≥ 0.90) | Rejects challengers trained on insufficient or unbalanced data before the held-out comparison |
|
|
| Champion comparison on held-out | Prevents regression — challenger must equal or beat the current production model |
|
|
| Champion backup before overwrite | `classifier_model_backup_{timestamp}.pkl` — roll back manually if needed |
|
|
| Export archiving | Processed files moved to `CLASSIFIER_ARCHIVE_DIR` — prevents re-inclusion in future runs |
|
|
| 10-minute subprocess timeout | Prevents runaway retraining from blocking the engine indefinitely |
|
|
| `RETRAIN_ON_EXPORT=false` | Disables automatic retraining without code changes — useful in staging or during debugging |
|
|
|
|
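
The backup and overwrite steps from the table can be sketched as follows (a hypothetical helper; `os.replace` keeps the final overwrite atomic on a single filesystem, so the engine never sees a half-written model file):

```python
import os
import shutil
import time

def backup_and_deploy(champion_path, challenger_path):
    """Back up the current champion, then overwrite it with the challenger.
    Backup naming follows classifier_model_backup_{timestamp}.pkl."""
    if os.path.exists(champion_path):
        ts = time.strftime("%Y%m%d_%H%M%S")
        backup = os.path.join(
            os.path.dirname(champion_path),
            f"classifier_model_backup_{ts}.pkl",
        )
        shutil.copy2(champion_path, backup)
    # atomic rename: readers see either the old model or the new one
    os.replace(challenger_path, champion_path)
```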
---
## Environment variables

| Variable | Default | Purpose |
|---|---|---|
| `CLASSIFIER_EXPORT_THRESHOLD` | `100` | Sessions accumulated before an export + retrain trigger |
| `RETRAIN_ON_EXPORT` | `true` | Enable/disable automatic retraining |
| `RETRAIN_SCRIPT_PATH` | `/app/scripts/pipelines/classifier/retrain_pipeline.py` | Path to the retrain script inside the container |
| `CLASSIFIER_ARCHIVE_DIR` | `/data/classifier_labels/archived` | Where processed exports are moved after retraining |
| `CLASSIFIER_SEED_DATASET` | `/app/scripts/pipelines/classifier/seed_classifier_dataset.jsonl` | Seed dataset always included in retraining |
| `CLASSIFIER_MIN_CV_ACCURACY` | `0.90` | Minimum CV accuracy for a challenger to proceed |
| `CLASSIFIER_HELD_OUT_RATIO` | `0.20` | Fraction of the merged dataset reserved for the champion/challenger comparison |

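
How these knobs might be read at startup (a sketch; the function name is illustrative, and the defaults mirror the table above):

```python
import os

def retrain_config(env=None):
    """Read the retraining knobs, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return {
        "export_threshold": int(env.get("CLASSIFIER_EXPORT_THRESHOLD", "100")),
        "retrain_on_export": env.get("RETRAIN_ON_EXPORT", "true").lower() == "true",
        "min_cv_accuracy": float(env.get("CLASSIFIER_MIN_CV_ACCURACY", "0.90")),
        "held_out_ratio": float(env.get("CLASSIFIER_HELD_OUT_RATIO", "0.20")),
        "archive_dir": env.get("CLASSIFIER_ARCHIVE_DIR", "/data/classifier_labels/archived"),
    }
```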
---
## Files

| File | Role |
|---|---|
| `scripts/pipelines/classifier/retrain_pipeline.py` | Champion/Challenger training, evaluation, promotion, and archiving |
| `Docker/src/utils/classifier_export.py` | Export trigger; launches `retrain_pipeline.py` in the background after each export |
| `scripts/pipelines/classifier/seed_classifier_dataset.jsonl` | Always included in retraining; anchors the model on known-good examples |

---

## Convergence behavior

Each retraining cycle merges the seed dataset with all accumulated production exports. As production traffic grows, the model progressively reflects real user queries rather than the hand-crafted seed.

```mermaid
flowchart LR
    T0["Cycle 0\n94 seed examples\nCV 1.0 on seed"] --> T1["Cycle 1\n94 + ~100 production\nreal query distribution"]
    T1 --> T2["Cycle 2\n94 + ~200 production\nincreasing coverage"]
    T2 --> TN["Cycle N\nseed becomes minority\nmodel reflects production traffic"]
```

The seed dataset is never removed; it acts as a regularizer that prevents the model from drifting entirely toward edge cases of the production distribution.

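
The merge step could look like this (a sketch; it assumes exports are JSONL files with one labeled example per line, which matches the export format described in ADR-0008 but is not confirmed here):

```python
import json
from pathlib import Path

def load_training_set(seed_path, exports_dir):
    """Merge the seed dataset with every accumulated export.
    The seed always comes first; exports are read in sorted order."""
    examples = []
    paths = [Path(seed_path)] + sorted(Path(exports_dir).glob("*.jsonl"))
    for p in paths:
        with p.open(encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:
                    examples.append(json.loads(line))
    return examples
```

Because archived exports are moved out of the export directory after each run, only the seed plus unprocessed exports are ever merged.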
---
## Consequences

### Positive

- Layer 2 accuracy improves automatically with usage; no human intervention is required after the initial deployment.
- The champion/challenger gate prevents production regressions from noisy batches.
- `RETRAIN_ON_EXPORT=false` provides a complete off switch without code changes.
- Export archiving keeps `/data/classifier_labels/` clean; only unprocessed exports accumulate.
- The backup mechanism allows manual rollback in the rare case a promoted challenger performs unexpectedly in production.

### Negative / Trade-offs

- **Retraining uses Ollama (bge-m3) on the host.** The retrain script runs inside the container but needs `OLLAMA_LOCAL_URL` to be reachable. If Ollama is down at retraining time, the pipeline fails and logs an error; the champion is unchanged.
- **The engine requires a restart to load the new model.** Promotions are invisible to users until the restart. In low-traffic periods this is acceptable; high-traffic deployments may need an orchestrated restart strategy.
- **Accumulated exports grow without bound if retraining fails.** If the pipeline consistently fails (e.g., Ollama unreachable), exports accumulate in `/data/classifier_labels/` without being archived. A monitoring alert on directory size is recommended.

### Open questions

1. **Orchestrated restart policy:** Who triggers the engine restart after a promotion? Currently this is manual. It could be automated via a health-check endpoint that returns a `model_updated` flag, allowing the orchestrator to restart when traffic is low.

2. **Held-out set stability:** The held-out set is resampled from the merged dataset on every retraining cycle. As the dataset grows, the held-out set changes between cycles, so champion accuracy scores are not directly comparable across cycles. A fixed held-out set (frozen after the first N examples) would improve comparability. Deferred.

3. **Class imbalance over time:** Production traffic may not be balanced across the four query types. If `CODE_GENERATION` is rare in production, its representation in the training set shrinks relative to the seed. The CV gate catches catastrophic imbalance but not gradual drift. A per-class recall threshold could be added to the gate.