assistance-engine/docs/RUNBOOK.md


# Brunix Assistance Engine — Operations Runbook
> **Audience:** Engineers on-call, DevOps, and anyone debugging the Brunix Engine in a live environment.
> **Last updated:** 2026-03-18
---
## Table of Contents
1. [Health Checks](#1-health-checks)
2. [Starting the Engine](#2-starting-the-engine)
3. [Stopping & Restarting](#3-stopping--restarting)
4. [Tunnel Management](#4-tunnel-management)
5. [Incident Playbooks](#5-incident-playbooks)
- [Engine fails to start](#51-engine-fails-to-start)
- [Elasticsearch unreachable](#52-elasticsearch-unreachable)
- [Ollama unreachable / model not found](#53-ollama-unreachable--model-not-found)
- [AskAgent returns `[ENG] Error`](#54-askagent-returns-eng-error)
- [EvaluateRAG returns ANTHROPIC_API_KEY error](#55-evaluaterag-returns-anthropic_api_key-error)
- [Container memory / OOM](#56-container-memory--oom)
- [Session history not persisting between requests](#57-session-history-not-persisting-between-requests)
6. [Log Reference](#6-log-reference)
7. [Useful Commands](#7-useful-commands)
8. [Escalation Path](#8-escalation-path)
---
## 1. Health Checks
### Is the gRPC server up?
```bash
grpcurl -plaintext localhost:50052 list
# Expected: brunix.AssistanceEngine
```
If `grpcurl` hangs or returns a connection error, the container is not running or the port is not mapped.
### Is Elasticsearch reachable?
```bash
curl -s http://localhost:9200/_cluster/health | python3 -m json.tool
# Expected: "status": "green" or "yellow"
```
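The expected statuses can be checked programmatically. A minimal Python sketch (the verdicts mirror the green/yellow expectation above; `classify_es_health` is a hypothetical helper, not part of the engine):

```python
import json

def classify_es_health(payload: str) -> str:
    """Map an Elasticsearch _cluster/health payload to a verdict.

    green/yellow are acceptable per the check above; red (or an
    unparseable payload) should be treated as an incident.
    """
    try:
        status = json.loads(payload).get("status")
    except json.JSONDecodeError:
        return "DOWN"
    if status in ("green", "yellow"):
        return "OK"
    if status == "red":
        return "DEGRADED"
    return "DOWN"
```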
### Is Ollama reachable?
```bash
curl -s http://localhost:11434/api/tags | python3 -m json.tool
# Expected: list of available models including qwen2.5:1.5b
```
### Is the embedding model loaded?
```bash
curl -s http://localhost:11434/api/tags | grep qwen3-0.6B-emb
# Expected: model entry present
```
### Is Langfuse reachable?
```bash
curl -s http://45.77.119.180/api/public/health
# Expected: {"status":"ok"}
```
---
## 2. Starting the Engine
### Prerequisites checklist
- [ ] Kubeconfig present at `./kubernetes/kubeconfig.yaml`
- [ ] `.env` file populated with all required variables (see `README.md`)
- [ ] All three kubectl tunnels active (see [§4](#4-tunnel-management))
- [ ] Docker daemon running
### Start command
```bash
cd Docker/
docker-compose up -d --build
```
### Verify startup
```bash
# Watch logs until you see "Brunix Engine initialized."
docker logs -f brunix-assistance-engine
# Expected log sequence:
# [ESEARCH] Connected: 8.x.x — index: avap-docs-test
# [ENGINE] listen on 50051 (gRPC)
# Brunix Engine initialized.
# [entrypoint] Starting OpenAI Proxy (HTTP :8000)...
```
**Startup typically takes 20-60 seconds** depending on Ollama model loading time.
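Waiting for readiness can be scripted rather than eyeballed. A sketch that scans streamed log lines for the initialization marker (the marker text is taken from the expected log sequence above; the line cap is an illustrative safeguard):

```python
from typing import Iterable

READY_MARKER = "Brunix Engine initialized."

def wait_for_ready(log_lines: Iterable[str], max_lines: int = 500) -> bool:
    """Return True once the readiness marker appears in the log stream."""
    for i, line in enumerate(log_lines):
        if READY_MARKER in line:
            return True
        if i >= max_lines:
            break
    return False

# Typical use: feed it the output of `docker logs -f brunix-assistance-engine`
# line by line (e.g. via subprocess.Popen and proc.stdout).
```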
---
## 3. Stopping & Restarting
```bash
# Graceful stop
docker-compose down
# Hard stop (if container is unresponsive)
docker stop brunix-assistance-engine
docker rm brunix-assistance-engine
# Restart only the engine (no rebuild)
docker-compose restart brunix-engine
# Rebuild and restart (after code changes)
docker-compose up -d --build
```
> ⚠️ **Restart clears all in-memory session history.** All active conversations will lose context.
---
## 4. Tunnel Management
All three tunnels must be active for the engine to function. Run each in a separate terminal or as a background process.
```bash
# Tunnel 1 — Ollama (LLM + embeddings)
kubectl port-forward --address 0.0.0.0 svc/ollama-light-service 11434:11434 \
  -n brunix --kubeconfig ./kubernetes/kubeconfig.yaml
# Tunnel 2 — Elasticsearch (vector knowledge base)
kubectl port-forward --address 0.0.0.0 svc/brunix-vector-db 9200:9200 \
  -n brunix --kubeconfig ./kubernetes/kubeconfig.yaml
# Tunnel 3 — PostgreSQL (Langfuse observability)
kubectl port-forward --address 0.0.0.0 svc/brunix-postgres 5432:5432 \
  -n brunix --kubeconfig ./kubernetes/kubeconfig.yaml
```
### Check tunnel status
```bash
# List active port-forwards
ps aux | grep "kubectl port-forward"
# Alternatively
lsof -i :11434
lsof -i :9200
lsof -i :5432
```
### Tunnel dropped?
kubectl port-forwards drop silently. Symptoms:
- Elasticsearch: `[ESEARCH] Cant Connect` in engine logs
- Ollama: requests timeout or return connection errors
- Langfuse: tracing data stops appearing in the dashboard
**Fix:** Re-run the affected tunnel command. The engine will reconnect automatically on the next request.
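Whether a tunnel is up can be probed directly from Python as well. A sketch that checks each local port from the tunnel commands above (`down_tunnels` is a hypothetical helper):

```python
import socket

# Local ports from the three tunnel commands above.
TUNNEL_PORTS = {"ollama": 11434, "elasticsearch": 9200, "postgres": 5432}

def port_open(port: int, host: str = "localhost", timeout: float = 2.0) -> bool:
    """True if something is listening on host:port (i.e. the tunnel is up)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def down_tunnels() -> list:
    """Names of tunnels whose local port is not accepting connections."""
    return [name for name, port in TUNNEL_PORTS.items() if not port_open(port)]
```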
---
## 5. Incident Playbooks
### 5.1 Engine fails to start
**Symptom:** `docker-compose up` exits immediately, or container restarts in a loop.
**Diagnosis:**
```bash
docker logs brunix-assistance-engine 2>&1 | head -50
```
**Common causes and fixes:**
| Log message | Cause | Fix |
|---|---|---|
| `Cannot connect to Ollama` | Ollama tunnel not running | Start Tunnel 1 |
| `model 'qwen2.5:1.5b' not found` | Model not loaded in Ollama | See [§5.3](#53-ollama-unreachable--model-not-found) |
| `ELASTICSEARCH_URL not set` | Missing `.env` | Check `.env` file exists and is complete |
| `No module named 'brunix_pb2'` | Proto stubs not generated | Run `docker-compose up --build` |
| `Port 50051 already in use` | Another instance running | `docker stop brunix-assistance-engine && docker rm brunix-assistance-engine` |
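The table above can double as a quick triage script. A sketch (`KNOWN_FAILURES` simply mirrors the table; substrings are matched against the head of the log output):

```python
# Log substring -> suggested fix, mirroring the table above.
KNOWN_FAILURES = {
    "Cannot connect to Ollama": "Start Tunnel 1 (Ollama port-forward)",
    "model 'qwen2.5:1.5b' not found": "Pull the model (see section 5.3)",
    "ELASTICSEARCH_URL not set": "Check .env exists and is complete",
    "No module named 'brunix_pb2'": "Run docker-compose up --build",
    "Port 50051 already in use": "Stop and remove the stale container",
}

def triage(log_head: str):
    """Return the suggested fix for the first known failure found, else None."""
    for substring, fix in KNOWN_FAILURES.items():
        if substring in log_head:
            return fix
    return None
```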
---
### 5.2 Elasticsearch unreachable
**Symptom:** Log shows `[ESEARCH] Cant Connect`. Queries return empty context.
**Step 1 — Verify tunnel:**
```bash
curl -s http://localhost:9200/_cluster/health
```
**Step 2 — Restart tunnel if down:**
```bash
kubectl port-forward --address 0.0.0.0 svc/brunix-vector-db 9200:9200 \
  -n brunix --kubeconfig ./kubernetes/kubeconfig.yaml
```
**Step 3 — Check index exists:**
```bash
curl -s http://localhost:9200/_cat/indices?v | grep avap
```
If the index is missing, the knowledge base has not been ingested. Run:
```bash
cd scripts/pipelines/flows/
python elasticsearch_ingestion.py
```
**Step 4 — Verify authentication:**
If your cluster uses authentication, confirm `ELASTICSEARCH_USER` + `ELASTICSEARCH_PASSWORD` or `ELASTICSEARCH_API_KEY` are set in `.env`.
---
### 5.3 Ollama unreachable / model not found
**Symptom:** Engine logs show connection errors to `http://host.docker.internal:11434`, or `validate_model_on_init=True` raises a model-not-found error on startup.
**Step 1 — Verify Ollama tunnel is active:**
```bash
curl -s http://localhost:11434/api/tags
```
**Step 2 — List available models:**
```bash
curl -s http://localhost:11434/api/tags | python3 -c "
import json, sys
data = json.load(sys.stdin)
for m in data.get('models', []):
    print(m['name'])
"
```
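Step 2 can go one step further and report exactly which required models are absent. A sketch assuming the standard `/api/tags` response shape shown above:

```python
import json

# Models the engine expects (LLM + embedding model), per this runbook.
REQUIRED_MODELS = {"qwen2.5:1.5b", "qwen3-0.6B-emb:latest"}

def missing_models(tags_payload: str) -> set:
    """Required models not present in an Ollama /api/tags payload."""
    data = json.loads(tags_payload)
    available = {m["name"] for m in data.get("models", [])}
    return REQUIRED_MODELS - available
```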
**Step 3 — Pull missing models if needed:**
```bash
# On the Devaron cluster (via kubectl exec or direct access):
ollama pull qwen2.5:1.5b
ollama pull qwen3-0.6B-emb:latest
```
**Step 4 — Restart engine** after models are available:
```bash
docker-compose restart brunix-engine
```
---
### 5.4 AskAgent returns `[ENG] Error`
**Symptom:** Client receives `{"text": "[ENG] Error: ...", "is_final": true}`.
**Diagnosis:**
```bash
docker logs brunix-assistance-engine 2>&1 | grep -A 10 "Error"
```
| Error substring | Cause | Fix |
|---|---|---|
| `Connection refused` to `11434` | Ollama tunnel down | Restart Tunnel 1 |
| `Connection refused` to `9200` | ES tunnel down | Restart Tunnel 2 |
| `Index not found` | ES index missing | Run ingestion pipeline |
| `context length exceeded` | Query + history too long for model | Reduce session history or use a larger context model |
| `Traceback` / `KeyError` | Code bug | Check full traceback, open GitHub Issue |
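On the client side, these responses can be detected before they are surfaced to users. A minimal sketch based on the response shape above (`is_engine_error` is a hypothetical helper, not part of the engine API):

```python
def is_engine_error(response: dict) -> bool:
    """True if a final AskAgent message carries the [ENG] Error marker."""
    text = str(response.get("text", ""))
    return bool(response.get("is_final")) and text.startswith("[ENG] Error")
```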
---
### 5.5 EvaluateRAG returns ANTHROPIC_API_KEY error
**Symptom:** `EvalResponse.status` = `"ANTHROPIC_API_KEY no configurada en .env"` ("ANTHROPIC_API_KEY not configured in .env").
**Fix:**
1. Add `ANTHROPIC_API_KEY=sk-ant-...` to your `.env` file.
2. Add `ANTHROPIC_MODEL=claude-sonnet-4-20250514` (optional, has default).
3. Restart the engine: `docker-compose restart brunix-engine`.
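A quick pre-flight check that `.env` actually defines the key. A sketch that parses simple `KEY=VALUE` lines (ignores comments; does not handle quoting or `export` syntax):

```python
def env_keys(env_text: str) -> set:
    """Keys defined in simple KEY=VALUE .env content."""
    keys = set()
    for line in env_text.splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            keys.add(line.split("=", 1)[0].strip())
    return keys

def anthropic_configured(env_text: str) -> bool:
    """True if ANTHROPIC_API_KEY is defined (value is not validated)."""
    return "ANTHROPIC_API_KEY" in env_keys(env_text)
```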
---
### 5.6 Container memory / OOM
**Symptom:** Container is killed by the OOM killer. `docker inspect brunix-assistance-engine` shows `OOMKilled: true`.
**Diagnosis:**
```bash
docker stats brunix-assistance-engine
```
**Common causes:**
- Large context window being passed to Ollama (many retrieved chunks × long document).
- Session history growing unbounded over a long-running session.
**Mitigation:**
- Set `mem_limit` in `docker-compose.yaml`:
```yaml
services:
  brunix-engine:
    mem_limit: 4g
```
- Restart the container to clear session store.
- Consider reducing `k` (default 8) in `hybrid_search_native` to limit context size.
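If unbounded session history is the culprit, capping it is the usual mitigation. An illustrative sketch (the engine's actual session store may differ; `MAX_TURNS` is an assumed knob, not an existing config value):

```python
MAX_TURNS = 20  # assumed cap; tune against the model's context window

def trim_history(history: list, max_turns: int = MAX_TURNS) -> list:
    """Keep only the most recent messages so the context stays bounded."""
    return history[-max_turns:]
```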
---
### 5.7 Session history not persisting between requests
**Expected behaviour:** Sending two requests with the same `session_id` should maintain context.
**If Turn 2 does not seem to know about Turn 1:**
1. Confirm both requests use **identical** `session_id` strings (case-sensitive, no trailing spaces).
2. Confirm the engine was **not restarted** between the two requests (restart wipes `session_store`).
3. Check logs for `[AskAgentStream] conversation: N previous messages.` — if `N=0` on Turn 2, the session was not found.
4. Confirm the stream for Turn 1 was **fully consumed** (client read all messages including `is_final=true`) — the engine only persists history after the stream ends.
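Check 1 can be automated when comparing two captured requests. A sketch that flags the usual mismatches (surrounding whitespace, case; `session_id_mismatch` is a hypothetical helper):

```python
def session_id_mismatch(a: str, b: str):
    """Explain why two session_id values would not match, or None if identical."""
    if a == b:
        return None
    if a.strip() == b.strip():
        return "leading/trailing whitespace differs"
    if a.strip().lower() == b.strip().lower():
        return "case differs (session_id is case-sensitive)"
    return "different session ids"
```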
---
## 6. Log Reference
| Log prefix | Module | What it means |
|---|---|---|
| `[ESEARCH] Connected` | `server.py` | Elasticsearch OK on startup |
| `[ESEARCH] Cant Connect` | `server.py` | Elasticsearch unreachable on startup |
| `[ENGINE] listen on 50051` | `server.py` | gRPC server ready |
| `[AskAgent] session=... query=...` | `server.py` | New non-streaming request |
| `[AskAgent] conversation: N messages` | `server.py` | History loaded for session |
| `[AskAgentStream] done — chunks=N` | `server.py` | Stream completed, history saved |
| `[classify] raw=... -> TYPE` | `graph.py` | Query classification result |
| `[reformulate] -> '...'` | `graph.py` | Reformulated query |
| `[hybrid] BM25 -> N hits` | `graph.py` | BM25 retrieval result |
| `[hybrid] kNN -> N hits` | `graph.py` | kNN retrieval result |
| `[hybrid] RRF -> N final docs` | `graph.py` | After RRF fusion |
| `[retrieve] N docs, context len=X` | `graph.py` | Context assembled |
| `[generate] X chars` | `graph.py` | Non-streaming answer generated |
| `[eval] Iniciando: N preguntas` | `evaluate.py` | Evaluation started |
| `[eval] Completado — global=X` | `evaluate.py` | Evaluation finished |
---
## 7. Useful Commands
```bash
# Real-time log streaming
docker logs -f brunix-assistance-engine
# Filter for errors only
docker logs brunix-assistance-engine 2>&1 | grep -i error
# Check container resource usage
docker stats brunix-assistance-engine --no-stream
# Enter container for debugging
docker exec -it brunix-assistance-engine /bin/bash
# Send a test query
grpcurl -plaintext \
  -d '{"query": "What is AVAP?", "session_id": "test"}' \
  localhost:50052 brunix.AssistanceEngine/AskAgent
# Check ES index document count
curl -s "http://localhost:9200/avap-docs-test/_count" | python3 -m json.tool
# Check ES index mapping
curl -s "http://localhost:9200/avap-docs-test/_mapping" | python3 -m json.tool
# List active containers
docker ps --filter name=brunix
# Check port bindings
docker port brunix-assistance-engine
```
---
## 8. Escalation Path
| Severity | Condition | Action |
|---|---|---|
| P1 | Engine completely down, not recoverable in 15 min | Notify via Slack `#brunix-incidents` immediately. Tag CTO. |
| P2 | Degraded quality (bad answers) or evaluation score drops below 0.60 | Open GitHub Issue with full log output and evaluation report. |
| P3 | Tunnel instability, intermittent errors | Report in daily standup. Document in GitHub Issue within 24h. |
| P4 | Documentation gap or non-critical config issue | Open GitHub Issue with label `documentation` or `improvement`. |
**For all P1/P2 incidents, the GitHub Issue must include:**
1. Exact command that triggered the failure
2. Full terminal output / error log
3. Status of all three kubectl tunnels at the time of failure
4. Docker container status (`docker inspect brunix-assistance-engine`)
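The four required items can be pre-filled into an issue body so nothing is forgotten under pressure. A small sketch (section titles mirror the checklist above; `incident_issue_body` is a hypothetical helper):

```python
def incident_issue_body(command: str, log_output: str,
                        tunnel_status: str, container_status: str) -> str:
    """Render the mandatory P1/P2 incident report sections as Markdown."""
    return "\n".join([
        "## Command that triggered the failure",
        "```\n" + command + "\n```",
        "## Full terminal output / error log",
        "```\n" + log_output + "\n```",
        "## kubectl tunnel status at time of failure",
        tunnel_status,
        "## Docker container status",
        container_status,
    ])
```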