assistance-engine/docs/RUNBOOK.md


# Brunix Assistance Engine — Operations Runbook
> **Audience:** Engineers on-call, DevOps, and anyone debugging the Brunix Engine in a live environment.
> **Last updated:** 2026-03-18
---
## Table of Contents
1. [Health Checks](#1-health-checks)
2. [Starting the Engine](#2-starting-the-engine)
3. [Stopping & Restarting](#3-stopping--restarting)
4. [Tunnel Management](#4-tunnel-management)
5. [Incident Playbooks](#5-incident-playbooks)
- [Engine fails to start](#51-engine-fails-to-start)
- [Elasticsearch unreachable](#52-elasticsearch-unreachable)
- [Ollama unreachable / model not found](#53-ollama-unreachable--model-not-found)
- [AskAgent returns `[ENG] Error`](#54-askagent-returns-eng-error)
- [EvaluateRAG returns ANTHROPIC_API_KEY error](#55-evaluaterag-returns-anthropic_api_key-error)
- [Container memory / OOM](#56-container-memory--oom)
- [Session history not persisting between requests](#57-session-history-not-persisting-between-requests)
6. [Log Reference](#6-log-reference)
7. [Useful Commands](#7-useful-commands)
8. [Escalation Path](#8-escalation-path)
---
## 1. Health Checks
### Is the gRPC server up?
```bash
grpcurl -plaintext localhost:50052 list
# Expected: brunix.AssistanceEngine
```
If `grpcurl` hangs or returns a connection error, the container is not running or the port is not mapped.
### Is Elasticsearch reachable?
```bash
curl -s http://localhost:9200/_cluster/health | python3 -m json.tool
# Expected: "status": "green" or "yellow"
```
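The expected statuses can be checked programmatically. A minimal Python sketch (the verdicts mirror the green/yellow expectation above; `classify_es_health` is a hypothetical helper, not part of the engine):

```python
import json

def classify_es_health(payload: str) -> str:
    """Map an Elasticsearch _cluster/health payload to a verdict.

    green/yellow are acceptable per the check above; red (or an
    unparseable payload) should be treated as an incident.
    """
    try:
        status = json.loads(payload).get("status")
    except json.JSONDecodeError:
        return "DOWN"
    if status in ("green", "yellow"):
        return "OK"
    if status == "red":
        return "DEGRADED"
    return "DOWN"
```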
### Is Ollama reachable?
```bash
curl -s http://localhost:11434/api/tags | python3 -m json.tool
# Expected: list of available models including qwen2.5:1.5b
```
### Is the embedding model loaded?
```bash
curl -s http://localhost:11434/api/tags | grep qwen3-0.6B-emb
# Expected: model entry present
```
### Is Langfuse reachable?
```bash
curl -s http://45.77.119.180/api/public/health
# Expected: {"status":"ok"}
```
---
## 2. Starting the Engine
### Prerequisites checklist
- [ ] Kubeconfig present at `./kubernetes/kubeconfig.yaml`
- [ ] `.env` file populated with all required variables (see `README.md`)
- [ ] All three kubectl tunnels active (see [§4](#4-tunnel-management))
- [ ] Docker daemon running
### Start command
```bash
cd Docker/
docker-compose up -d --build
```
### Verify startup
```bash
# Watch logs until you see "Brunix Engine initialized."
docker logs -f brunix-assistance-engine
# Expected log sequence:
# [ESEARCH] Connected: 8.x.x — index: avap-docs-test
# [ENGINE] listen on 50051 (gRPC)
# Brunix Engine initialized.
# [entrypoint] Starting OpenAI Proxy (HTTP :8000)...
```
**Startup typically takes 20-60 seconds** depending on Ollama model loading time.
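Waiting for readiness can be scripted rather than eyeballed. A sketch that scans streamed log lines for the initialization marker (the marker text is taken from the expected log sequence above; the line cap is an illustrative safeguard):

```python
from typing import Iterable

READY_MARKER = "Brunix Engine initialized."

def wait_for_ready(log_lines: Iterable[str], max_lines: int = 500) -> bool:
    """Return True once the readiness marker appears in the log stream."""
    for i, line in enumerate(log_lines):
        if READY_MARKER in line:
            return True
        if i >= max_lines:
            break
    return False

# Typical use: feed it the output of `docker logs -f brunix-assistance-engine`
# line by line (e.g. via subprocess.Popen and proc.stdout).
```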
---
## 3. Stopping & Restarting
```bash
# Graceful stop
docker-compose down
# Hard stop (if container is unresponsive)
docker stop brunix-assistance-engine
docker rm brunix-assistance-engine
# Restart only the engine (no rebuild)
docker-compose restart brunix-engine
# Rebuild and restart (after code changes)
docker-compose up -d --build
```
> ⚠️ **Restart clears all in-memory session history.** All active conversations will lose context.
---
## 4. Tunnel Management
All three tunnels must be active for the engine to function. Run each in a separate terminal or as a background process.
```bash
# Tunnel 1 — Ollama (LLM + embeddings)
kubectl port-forward --address 0.0.0.0 svc/ollama-light-service 11434:11434 \
  -n brunix --kubeconfig ./kubernetes/kubeconfig.yaml
# Tunnel 2 — Elasticsearch (vector knowledge base)
kubectl port-forward --address 0.0.0.0 svc/brunix-vector-db 9200:9200 \
  -n brunix --kubeconfig ./kubernetes/kubeconfig.yaml
# Tunnel 3 — PostgreSQL (Langfuse observability)
kubectl port-forward --address 0.0.0.0 svc/brunix-postgres 5432:5432 \
  -n brunix --kubeconfig ./kubernetes/kubeconfig.yaml
```
### Check tunnel status
```bash
# List active port-forwards
ps aux | grep "kubectl port-forward"
# Alternatively
lsof -i :11434
lsof -i :9200
lsof -i :5432
```
### Tunnel dropped?
kubectl port-forwards drop silently. Symptoms:
- Elasticsearch: `[ESEARCH] Cant Connect` in engine logs
- Ollama: requests timeout or return connection errors
- Langfuse: tracing data stops appearing in the dashboard
**Fix:** Re-run the affected tunnel command. The engine will reconnect automatically on the next request.
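Whether a tunnel is up can be probed directly from Python as well. A sketch that checks each local port from the tunnel commands above (`down_tunnels` is a hypothetical helper):

```python
import socket

# Local ports from the three tunnel commands above.
TUNNEL_PORTS = {"ollama": 11434, "elasticsearch": 9200, "postgres": 5432}

def port_open(port: int, host: str = "localhost", timeout: float = 2.0) -> bool:
    """True if something is listening on host:port (i.e. the tunnel is up)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def down_tunnels() -> list:
    """Names of tunnels whose local port is not accepting connections."""
    return [name for name, port in TUNNEL_PORTS.items() if not port_open(port)]
```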
---
## 5. Incident Playbooks
### 5.1 Engine fails to start
**Symptom:** `docker-compose up` exits immediately, or container restarts in a loop.
**Diagnosis:**
```bash
docker logs brunix-assistance-engine 2>&1 | head -50
```
**Common causes and fixes:**
| Log message | Cause | Fix |
|---|---|---|
| `Cannot connect to Ollama` | Ollama tunnel not running | Start Tunnel 1 |
| `model 'qwen2.5:1.5b' not found` | Model not loaded in Ollama | See [§5.3](#53-ollama-unreachable--model-not-found) |
| `ELASTICSEARCH_URL not set` | Missing `.env` | Check `.env` file exists and is complete |
| `No module named 'brunix_pb2'` | Proto stubs not generated | Run `docker-compose up --build` |
| `Port 50051 already in use` | Another instance running | `docker stop brunix-assistance-engine && docker rm brunix-assistance-engine` |
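The table above can double as a quick triage script. A sketch (`KNOWN_FAILURES` simply mirrors the table; substrings are matched against the head of the log output):

```python
# Log substring -> suggested fix, mirroring the table above.
KNOWN_FAILURES = {
    "Cannot connect to Ollama": "Start Tunnel 1 (Ollama port-forward)",
    "model 'qwen2.5:1.5b' not found": "Pull the model (see section 5.3)",
    "ELASTICSEARCH_URL not set": "Check .env exists and is complete",
    "No module named 'brunix_pb2'": "Run docker-compose up --build",
    "Port 50051 already in use": "Stop and remove the stale container",
}

def triage(log_head: str):
    """Return the suggested fix for the first known failure found, else None."""
    for substring, fix in KNOWN_FAILURES.items():
        if substring in log_head:
            return fix
    return None
```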
---
### 5.2 Elasticsearch unreachable
**Symptom:** Log shows `[ESEARCH] Cant Connect`. Queries return empty context.
**Step 1 — Verify tunnel:**
```bash
curl -s http://localhost:9200/_cluster/health
```
**Step 2 — Restart tunnel if down:**
```bash
kubectl port-forward --address 0.0.0.0 svc/brunix-vector-db 9200:9200 \
  -n brunix --kubeconfig ./kubernetes/kubeconfig.yaml
```
**Step 3 — Check index exists:**
```bash
curl -s http://localhost:9200/_cat/indices?v | grep avap
```
If the index is missing, the knowledge base has not been ingested. Run:
```bash
cd scripts/pipelines/flows/
python elasticsearch_ingestion.py
```
**Step 4 — Verify authentication:**
If your cluster uses authentication, confirm `ELASTICSEARCH_USER` + `ELASTICSEARCH_PASSWORD` or `ELASTICSEARCH_API_KEY` are set in `.env`.
---
### 5.3 Ollama unreachable / model not found
**Symptom:** Engine logs show connection errors to `http://host.docker.internal:11434`, or `validate_model_on_init=True` raises a model-not-found error on startup.
**Step 1 — Verify Ollama tunnel is active:**
```bash
curl -s http://localhost:11434/api/tags
```
**Step 2 — List available models:**
```bash
curl -s http://localhost:11434/api/tags | python3 -c "
import json, sys
data = json.load(sys.stdin)
for m in data.get('models', []):
    print(m['name'])
"
```
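Step 2 can go one step further and report exactly which required models are absent. A sketch assuming the standard `/api/tags` response shape shown above:

```python
import json

# Models the engine expects (LLM + embedding model), per this runbook.
REQUIRED_MODELS = {"qwen2.5:1.5b", "qwen3-0.6B-emb:latest"}

def missing_models(tags_payload: str) -> set:
    """Required models not present in an Ollama /api/tags payload."""
    data = json.loads(tags_payload)
    available = {m["name"] for m in data.get("models", [])}
    return REQUIRED_MODELS - available
```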
**Step 3 — Pull missing models if needed:**
```bash
# On the Devaron cluster (via kubectl exec or direct access):
ollama pull qwen2.5:1.5b
ollama pull qwen3-0.6B-emb:latest
```
**Step 4 — Restart engine** after models are available:
```bash
docker-compose restart brunix-engine
```
---
### 5.4 AskAgent returns `[ENG] Error`
**Symptom:** Client receives `{"text": "[ENG] Error: ...", "is_final": true}`.
**Diagnosis:**
```bash
docker logs brunix-assistance-engine 2>&1 | grep -A 10 "Error"
```
| Error substring | Cause | Fix |
|---|---|---|
| `Connection refused` to `11434` | Ollama tunnel down | Restart Tunnel 1 |
| `Connection refused` to `9200` | ES tunnel down | Restart Tunnel 2 |
| `Index not found` | ES index missing | Run ingestion pipeline |
| `context length exceeded` | Query + history too long for model | Reduce session history or use a larger context model |
| `Traceback` / `KeyError` | Code bug | Check full traceback, open GitHub Issue |
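On the client side, these responses can be detected before they are surfaced to users. A minimal sketch based on the response shape above (`is_engine_error` is a hypothetical helper, not part of the engine API):

```python
def is_engine_error(response: dict) -> bool:
    """True if a final AskAgent message carries the [ENG] Error marker."""
    text = str(response.get("text", ""))
    return bool(response.get("is_final")) and text.startswith("[ENG] Error")
```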
---
### 5.5 EvaluateRAG returns ANTHROPIC_API_KEY error
**Symptom:** `EvalResponse.status` = `"ANTHROPIC_API_KEY no configurada en .env"` ("ANTHROPIC_API_KEY not configured in .env").
**Fix:**
1. Add `ANTHROPIC_API_KEY=sk-ant-...` to your `.env` file.
2. Add `ANTHROPIC_MODEL=claude-sonnet-4-20250514` (optional, has default).
3. Restart the engine: `docker-compose restart brunix-engine`.
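A quick pre-flight check that `.env` actually defines the key. A sketch that parses simple `KEY=VALUE` lines (ignores comments; does not handle quoting or `export` syntax):

```python
def env_keys(env_text: str) -> set:
    """Keys defined in simple KEY=VALUE .env content."""
    keys = set()
    for line in env_text.splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            keys.add(line.split("=", 1)[0].strip())
    return keys

def anthropic_configured(env_text: str) -> bool:
    """True if ANTHROPIC_API_KEY is defined (value is not validated)."""
    return "ANTHROPIC_API_KEY" in env_keys(env_text)
```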
---
### 5.6 Container memory / OOM
**Symptom:** Container is killed by the OOM killer. `docker inspect brunix-assistance-engine` shows `OOMKilled: true`.
**Diagnosis:**
```bash
docker stats brunix-assistance-engine
```
**Common causes:**
- Large context window being passed to Ollama (many retrieved chunks × long document).
- Session history growing unbounded over a long-running session.
**Mitigation:**
- Set `mem_limit` in `docker-compose.yaml`:
```yaml
services:
  brunix-engine:
    mem_limit: 4g
```
- Restart the container to clear session store.
- Consider reducing `k` (default 8) in `hybrid_search_native` to limit context size.
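If unbounded session history is the culprit, capping it is the usual mitigation. An illustrative sketch (the engine's actual session store may differ; `MAX_TURNS` is an assumed knob, not an existing config value):

```python
MAX_TURNS = 20  # assumed cap; tune against the model's context window

def trim_history(history: list, max_turns: int = MAX_TURNS) -> list:
    """Keep only the most recent messages so the context stays bounded."""
    return history[-max_turns:]
```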
---
### 5.7 Session history not persisting between requests
**Expected behaviour:** Sending two requests with the same `session_id` should maintain context.
**If Turn 2 does not seem to know about Turn 1:**
1. Confirm both requests use **identical** `session_id` strings (case-sensitive, no trailing spaces).
2. Confirm the engine was **not restarted** between the two requests (restart wipes `session_store`).
3. Check logs for `[AskAgentStream] conversation: N previous messages.` — if `N=0` on Turn 2, the session was not found.
4. Confirm the stream for Turn 1 was **fully consumed** (client read all messages including `is_final=true`) — the engine only persists history after the stream ends.
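Check 1 can be automated when comparing two captured requests. A sketch that flags the usual mismatches (surrounding whitespace, case; `session_id_mismatch` is a hypothetical helper):

```python
def session_id_mismatch(a: str, b: str):
    """Explain why two session_id values would not match, or None if identical."""
    if a == b:
        return None
    if a.strip() == b.strip():
        return "leading/trailing whitespace differs"
    if a.strip().lower() == b.strip().lower():
        return "case differs (session_id is case-sensitive)"
    return "different session ids"
```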
---
## 6. Log Reference
| Log prefix | Module | What it means |
|---|---|---|
| `[ESEARCH] Connected` | `server.py` | Elasticsearch OK on startup |
| `[ESEARCH] Cant Connect` | `server.py` | Elasticsearch unreachable on startup |
| `[ENGINE] listen on 50051` | `server.py` | gRPC server ready |
| `[AskAgent] session=... query=...` | `server.py` | New non-streaming request |
| `[AskAgent] conversation: N messages` | `server.py` | History loaded for session |
| `[AskAgentStream] done — chunks=N` | `server.py` | Stream completed, history saved |
| `[classify] raw=... -> TYPE` | `graph.py` | Query classification result |
| `[reformulate] -> '...'` | `graph.py` | Reformulated query |
| `[hybrid] BM25 -> N hits` | `graph.py` | BM25 retrieval result |
| `[hybrid] kNN -> N hits` | `graph.py` | kNN retrieval result |
| `[hybrid] RRF -> N final docs` | `graph.py` | After RRF fusion |
| `[retrieve] N docs, context len=X` | `graph.py` | Context assembled |
| `[generate] X chars` | `graph.py` | Non-streaming answer generated |
| `[eval] Iniciando: N preguntas` | `evaluate.py` | Evaluation started |
| `[eval] Completado — global=X` | `evaluate.py` | Evaluation finished |
---
## 7. Useful Commands
```bash
# Real-time log streaming
docker logs -f brunix-assistance-engine
# Filter for errors only
docker logs brunix-assistance-engine 2>&1 | grep -i error
# Check container resource usage
docker stats brunix-assistance-engine --no-stream
# Enter container for debugging
docker exec -it brunix-assistance-engine /bin/bash
# Send a test query
grpcurl -plaintext \
  -d '{"query": "What is AVAP?", "session_id": "test"}' \
  localhost:50052 brunix.AssistanceEngine/AskAgent
# Check ES index document count
curl -s "http://localhost:9200/avap-docs-test/_count" | python3 -m json.tool
# Check ES index mapping
curl -s "http://localhost:9200/avap-docs-test/_mapping" | python3 -m json.tool
# List active containers
docker ps --filter name=brunix
# Check port bindings
docker port brunix-assistance-engine
```
---
## 8. Escalation Path
| Severity | Condition | Action |
|---|---|---|
| P1 | Engine completely down, not recoverable in 15 min | Notify via Slack `#brunix-incidents` immediately. Tag CTO. |
| P2 | Degraded quality (bad answers) or evaluation score drops below 0.60 | Open GitHub Issue with full log output and evaluation report. |
| P3 | Tunnel instability, intermittent errors | Report in daily standup. Document in GitHub Issue within 24h. |
| P4 | Documentation gap or non-critical config issue | Open GitHub Issue with label `documentation` or `improvement`. |
**For all P1/P2 incidents, the GitHub Issue must include:**
1. Exact command that triggered the failure
2. Full terminal output / error log
3. Status of all three kubectl tunnels at the time of failure
4. Docker container status (`docker inspect brunix-assistance-engine`)
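The four required items can be pre-filled into an issue body so nothing is forgotten under pressure. A small sketch (section titles mirror the checklist above; `incident_issue_body` is a hypothetical helper):

```python
def incident_issue_body(command: str, log_output: str,
                        tunnel_status: str, container_status: str) -> str:
    """Render the mandatory P1/P2 incident report sections as Markdown."""
    return "\n".join([
        "## Command that triggered the failure",
        "```\n" + command + "\n```",
        "## Full terminal output / error log",
        "```\n" + log_output + "\n```",
        "## kubectl tunnel status at time of failure",
        tunnel_status,
        "## Docker container status",
        container_status,
    ])
```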