# Brunix Assistance Engine — Operations Runbook

> **Audience:** Engineers on-call, DevOps, and anyone debugging the Brunix Engine in a live environment.

> **Last updated:** 2026-03-18
---

## Table of Contents

1. [Health Checks](#1-health-checks)
2. [Starting the Engine](#2-starting-the-engine)
3. [Stopping & Restarting](#3-stopping--restarting)
4. [Tunnel Management](#4-tunnel-management)
5. [Incident Playbooks](#5-incident-playbooks)
   - [Engine fails to start](#51-engine-fails-to-start)
   - [Elasticsearch unreachable](#52-elasticsearch-unreachable)
   - [Ollama unreachable / model not found](#53-ollama-unreachable--model-not-found)
   - [AskAgent returns `[ENG] Error`](#54-askagent-returns-eng-error)
   - [EvaluateRAG returns ANTHROPIC_API_KEY error](#55-evaluaterag-returns-anthropic_api_key-error)
   - [Container memory / OOM](#56-container-memory--oom)
   - [Session history not persisting between requests](#57-session-history-not-persisting-between-requests)
6. [Log Reference](#6-log-reference)
7. [Useful Commands](#7-useful-commands)
8. [Escalation Path](#8-escalation-path)

---
## 1. Health Checks

### Is the gRPC server up?

```bash
grpcurl -plaintext localhost:50052 list
# Expected: brunix.AssistanceEngine
```

If `grpcurl` hangs or returns a connection error, the container is not running or the port is not mapped.

### Is Elasticsearch reachable?

```bash
curl -s http://localhost:9200/_cluster/health | python3 -m json.tool
# Expected: "status": "green" or "yellow"
```

### Is Ollama reachable?

```bash
curl -s http://localhost:11434/api/tags | python3 -m json.tool
# Expected: list of available models including qwen2.5:1.5b
```

### Is the embedding model loaded?

```bash
curl -s http://localhost:11434/api/tags | grep qwen3-0.6B-emb
# Expected: model entry present
```

### Is Langfuse reachable?

```bash
curl -s http://45.77.119.180/api/public/health
# Expected: {"status":"ok"}
```
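The model checks above can be made more robust than a bare `grep`. A minimal sketch, assuming `python3` is available (as in the checks above); `model_present` is a hypothetical helper, not part of the engine:

```bash
# Hypothetical helper: read Ollama /api/tags JSON from stdin and report
# whether a model whose name starts with $1 is present.
model_present() {
  python3 -c '
import json, sys
name = sys.argv[1]
models = json.load(sys.stdin).get("models", [])
print("yes" if any(m.get("name", "").startswith(name) for m in models) else "no")
' "$1"
}

# Live usage (requires Tunnel 1):
#   curl -s http://localhost:11434/api/tags | model_present qwen3-0.6B-emb
echo '{"models":[{"name":"qwen2.5:1.5b"}]}' | model_present qwen2.5:1.5b  # prints "yes"
```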
---

## 2. Starting the Engine

### Prerequisites checklist

- [ ] Kubeconfig present at `./kubernetes/kubeconfig.yaml`
- [ ] `.env` file populated with all required variables (see `README.md`)
- [ ] All three kubectl tunnels active (see [§4](#4-tunnel-management))
- [ ] Docker daemon running

### Start command

```bash
cd Docker/
docker-compose up -d --build
```

### Verify startup

```bash
# Watch logs until you see "Brunix Engine initialized."
docker logs -f brunix-assistance-engine

# Expected log sequence:
# [ESEARCH] Connected: 8.x.x — index: avap-docs-test
# [ENGINE] listen on 50051 (gRPC)
# Brunix Engine initialized.
# [entrypoint] Starting OpenAI Proxy (HTTP :8000)...
```

**Startup typically takes 20–60 seconds**, depending on Ollama model loading time.
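Instead of watching the logs by hand, startup can be polled. A sketch under assumptions: `wait_for_line` is a hypothetical helper, and the 2-second poll interval is an arbitrary choice.

```bash
# Hypothetical helper: poll <command>'s output until it contains <pattern>.
wait_for_line() {
  pattern=$1; shift
  until "$@" 2>&1 | grep -q "$pattern"; do
    sleep 2
  done
}

# Usage (wrap in `timeout` for an upper bound):
#   wait_for_line "Brunix Engine initialized." docker logs brunix-assistance-engine
```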
---

## 3. Stopping & Restarting

```bash
# Graceful stop
docker-compose down

# Hard stop (if container is unresponsive)
docker stop brunix-assistance-engine
docker rm brunix-assistance-engine

# Restart only the engine (no rebuild)
docker-compose restart brunix-engine

# Rebuild and restart (after code changes)
docker-compose up -d --build
```

> ⚠️ **Restart clears all in-memory session history.** All active conversations will lose context.
---

## 4. Tunnel Management

All three tunnels must be active for the engine to function. Run each in a separate terminal or as a background process.

```bash
# Tunnel 1 — Ollama (LLM + embeddings)
kubectl port-forward --address 0.0.0.0 svc/ollama-light-service 11434:11434 \
  -n brunix --kubeconfig ./kubernetes/kubeconfig.yaml

# Tunnel 2 — Elasticsearch (vector knowledge base)
kubectl port-forward --address 0.0.0.0 svc/brunix-vector-db 9200:9200 \
  -n brunix --kubeconfig ./kubernetes/kubeconfig.yaml

# Tunnel 3 — PostgreSQL (Langfuse observability)
kubectl port-forward --address 0.0.0.0 svc/brunix-postgres 5432:5432 \
  -n brunix --kubeconfig ./kubernetes/kubeconfig.yaml
```

### Check tunnel status

```bash
# List active port-forwards
ps aux | grep "kubectl port-forward"

# Alternatively
lsof -i :11434
lsof -i :9200
lsof -i :5432
```

### Tunnel dropped?

kubectl port-forward tunnels can drop silently. Symptoms:

- Elasticsearch: `[ESEARCH] Cant Connect` in engine logs
- Ollama: requests time out or return connection errors
- Langfuse: tracing data stops appearing in the dashboard

**Fix:** Re-run the affected tunnel command. The engine will reconnect automatically on the next request.
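The three port checks can be collapsed into one loop. A sketch using bash's `/dev/tcp` virtual device (a bashism: it needs no extra tools, but will not work under plain `sh`); `port_open` is a hypothetical helper, not part of the project.

```bash
# Hypothetical helper: succeed if something is listening on 127.0.0.1:$1.
port_open() {
  (exec 3<>"/dev/tcp/127.0.0.1/$1") 2>/dev/null
}

for port in 11434 9200 5432; do
  if port_open "$port"; then
    echo "port $port: open"
  else
    echo "port $port: CLOSED (restart the matching tunnel)"
  fi
done
```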
---

## 5. Incident Playbooks

### 5.1 Engine fails to start

**Symptom:** `docker-compose up` exits immediately, or the container restarts in a loop.

**Diagnosis:**

```bash
docker logs brunix-assistance-engine 2>&1 | head -50
```

**Common causes and fixes:**

| Log message | Cause | Fix |
|---|---|---|
| `Cannot connect to Ollama` | Ollama tunnel not running | Start Tunnel 1 |
| `model 'qwen2.5:1.5b' not found` | Model not loaded in Ollama | See [§5.3](#53-ollama-unreachable--model-not-found) |
| `ELASTICSEARCH_URL not set` | Missing `.env` | Check that the `.env` file exists and is complete |
| `No module named 'brunix_pb2'` | Proto stubs not generated | Run `docker-compose up --build` |
| `Port 50051 already in use` | Another instance running | `docker stop brunix-assistance-engine && docker rm brunix-assistance-engine` |
---

### 5.2 Elasticsearch unreachable

**Symptom:** Log shows `[ESEARCH] Cant Connect`. Queries return empty context.

**Step 1 — Verify tunnel:**

```bash
curl -s http://localhost:9200/_cluster/health
```

**Step 2 — Restart tunnel if down:**

```bash
kubectl port-forward --address 0.0.0.0 svc/brunix-vector-db 9200:9200 \
  -n brunix --kubeconfig ./kubernetes/kubeconfig.yaml
```

**Step 3 — Check the index exists:**

```bash
curl -s "http://localhost:9200/_cat/indices?v" | grep avap
```

If the index is missing, the knowledge base has not been ingested. Run:

```bash
cd scripts/pipelines/flows/
python elasticsearch_ingestion.py
```

**Step 4 — Verify authentication:**

If your cluster uses authentication, confirm that `ELASTICSEARCH_USER` + `ELASTICSEARCH_PASSWORD` or `ELASTICSEARCH_API_KEY` is set in `.env`.
---

### 5.3 Ollama unreachable / model not found

**Symptom:** Engine logs show connection errors to `http://host.docker.internal:11434`, or `validate_model_on_init=True` raises a model-not-found error on startup.

**Step 1 — Verify the Ollama tunnel is active:**

```bash
curl -s http://localhost:11434/api/tags
```

**Step 2 — List available models:**

```bash
curl -s http://localhost:11434/api/tags | python3 -c "
import json, sys
data = json.load(sys.stdin)
for m in data.get('models', []):
    print(m['name'])
"
```

**Step 3 — Pull missing models if needed:**

```bash
# On the Devaron cluster (via kubectl exec or direct access):
ollama pull qwen2.5:1.5b
ollama pull qwen3-0.6B-emb:latest
```

**Step 4 — Restart the engine** after the models are available:

```bash
docker-compose restart brunix-engine
```
---

### 5.4 AskAgent returns `[ENG] Error`

**Symptom:** Client receives `{"text": "[ENG] Error: ...", "is_final": true}`.

**Diagnosis:**

```bash
docker logs brunix-assistance-engine 2>&1 | grep -A 10 "Error"
```

| Error substring | Cause | Fix |
|---|---|---|
| `Connection refused` to `11434` | Ollama tunnel down | Restart Tunnel 1 |
| `Connection refused` to `9200` | ES tunnel down | Restart Tunnel 2 |
| `Index not found` | ES index missing | Run the ingestion pipeline |
| `context length exceeded` | Query + history too long for the model | Reduce session history or use a larger-context model |
| `Traceback` / `KeyError` | Code bug | Check the full traceback, open a GitHub Issue |
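The table above can be mirrored in a small first-pass triage helper. A sketch only: `triage` is hypothetical, and the pattern strings are assumptions about the exact log wording, so adjust them to what you actually see.

```bash
# Hypothetical helper: map an engine error line to a likely cause
# (pattern strings are guesses at the log wording, not verified).
triage() {
  case "$1" in
    *11434*)                     echo "Ollama tunnel down: restart Tunnel 1" ;;
    *9200*)                      echo "ES tunnel down: restart Tunnel 2" ;;
    *"Index not found"*)         echo "ES index missing: run the ingestion pipeline" ;;
    *"context length exceeded"*) echo "Reduce session history or use a larger-context model" ;;
    *Traceback*|*KeyError*)      echo "Code bug: capture the traceback and open a GitHub Issue" ;;
    *)                           echo "Unrecognized: read the full log" ;;
  esac
}

# Usage:
#   triage "$(docker logs brunix-assistance-engine 2>&1 | grep -i error | tail -1)"
```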
---

### 5.5 EvaluateRAG returns ANTHROPIC_API_KEY error

**Symptom:** `EvalResponse.status` = `"ANTHROPIC_API_KEY no configurada en .env"` (Spanish for "ANTHROPIC_API_KEY not configured in .env").

**Fix:**

1. Add `ANTHROPIC_API_KEY=sk-ant-...` to your `.env` file.
2. Add `ANTHROPIC_MODEL=claude-sonnet-4-20250514` (optional; a default is used if unset).
3. Restart the engine: `docker-compose restart brunix-engine`.

---
### 5.6 Container memory / OOM

**Symptom:** The container is killed by the OOM killer. `docker inspect brunix-assistance-engine` shows `OOMKilled: true`.

**Diagnosis:**

```bash
docker stats brunix-assistance-engine
```

**Common causes:**

- A large context window being passed to Ollama (many retrieved chunks × long documents).
- Session history growing unbounded over a long-running session.

**Mitigation:**

- Set `mem_limit` in `docker-compose.yaml`:

  ```yaml
  services:
    brunix-engine:
      mem_limit: 4g
  ```

- Restart the container to clear the session store.
- Consider lowering `k` (currently 8) in `hybrid_search_native` to limit context size.
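To catch memory pressure before the OOM killer does, the `docker stats` output can be thresholded. A sketch under assumptions: `mem_over` is a hypothetical helper, the 90% threshold is arbitrary, and the check expects `{{.MemPerc}}` output like `87.50%`.

```bash
# Hypothetical helper: succeed if a percentage string like "95.2%" exceeds $2.
mem_over() {
  awk -v p="${1%\%}" -v t="$2" 'BEGIN { exit !(p + 0 > t + 0) }'
}

# Usage (one-shot check; run from cron or a watch loop):
#   pct=$(docker stats --no-stream --format '{{.MemPerc}}' brunix-assistance-engine)
#   mem_over "$pct" 90 && echo "memory high: $pct"
```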
---
### 5.7 Session history not persisting between requests

**Expected behaviour:** Sending two requests with the same `session_id` should maintain context.

**If Turn 2 does not seem to know about Turn 1:**

1. Confirm both requests use **identical** `session_id` strings (case-sensitive, no trailing spaces).
2. Confirm the engine was **not restarted** between the two requests (restart wipes `session_store`).
3. Check logs for `[AskAgentStream] conversation: N previous messages.` — if `N=0` on Turn 2, the session was not found.
4. Confirm the stream for Turn 1 was **fully consumed** (client read all messages including `is_final=true`) — the engine only persists history after the stream ends.
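Step 3 above can be scripted. A sketch: `history_count` is a hypothetical helper that assumes the log line format shown in step 3.

```bash
# Hypothetical helper: print the most recent session history count
# from engine logs read on stdin (assumes the step-3 log format).
history_count() {
  grep -o 'conversation: [0-9]* previous messages' | tail -1 | grep -oE '[0-9]+'
}

# Live usage:
#   docker logs brunix-assistance-engine 2>&1 | history_count
echo '[AskAgentStream] conversation: 3 previous messages.' | history_count  # prints "3"
```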
---

## 6. Log Reference

| Log prefix | Module | What it means |
|---|---|---|
| `[ESEARCH] Connected` | `server.py` | Elasticsearch OK on startup |
| `[ESEARCH] Cant Connect` | `server.py` | Elasticsearch unreachable on startup |
| `[ENGINE] listen on 50051` | `server.py` | gRPC server ready |
| `[AskAgent] session=... query=...` | `server.py` | New non-streaming request |
| `[AskAgent] conversation: N messages` | `server.py` | History loaded for session |
| `[AskAgentStream] done — chunks=N` | `server.py` | Stream completed, history saved |
| `[classify] raw=... -> TYPE` | `graph.py` | Query classification result |
| `[reformulate] -> '...'` | `graph.py` | Reformulated query |
| `[hybrid] BM25 -> N hits` | `graph.py` | BM25 retrieval result |
| `[hybrid] kNN -> N hits` | `graph.py` | kNN retrieval result |
| `[hybrid] RRF -> N final docs` | `graph.py` | Document count after RRF fusion |
| `[retrieve] N docs, context len=X` | `graph.py` | Context assembled |
| `[generate] X chars` | `graph.py` | Non-streaming answer generated |
| `[eval] Iniciando: N preguntas` | `evaluate.py` | Evaluation started (N questions) |
| `[eval] Completado — global=X` | `evaluate.py` | Evaluation finished with global score X |
---

## 7. Useful Commands

```bash
# Real-time log streaming
docker logs -f brunix-assistance-engine

# Filter for errors only
docker logs brunix-assistance-engine 2>&1 | grep -i error

# Check container resource usage
docker stats brunix-assistance-engine --no-stream

# Enter the container for debugging
docker exec -it brunix-assistance-engine /bin/bash

# Send a test query
grpcurl -plaintext \
  -d '{"query": "What is AVAP?", "session_id": "test"}' \
  localhost:50052 brunix.AssistanceEngine/AskAgent

# Check ES index document count
curl -s "http://localhost:9200/avap-docs-test/_count" | python3 -m json.tool

# Check ES index mapping
curl -s "http://localhost:9200/avap-docs-test/_mapping" | python3 -m json.tool

# List active containers
docker ps --filter name=brunix

# Check port bindings
docker port brunix-assistance-engine
```
---

## 8. Escalation Path

| Severity | Condition | Action |
|---|---|---|
| P1 | Engine completely down, not recoverable within 15 min | Notify via Slack `#brunix-incidents` immediately. Tag the CTO. |
| P2 | Degraded quality (bad answers) or evaluation score drops below 0.60 | Open a GitHub Issue with full log output and the evaluation report. |
| P3 | Tunnel instability, intermittent errors | Report in the daily standup. Document in a GitHub Issue within 24h. |
| P4 | Documentation gap or non-critical config issue | Open a GitHub Issue with the label `documentation` or `improvement`. |

**For all P1/P2 incidents, the GitHub Issue must include:**

1. The exact command that triggered the failure
2. Full terminal output / error log
3. Status of all three kubectl tunnels at the time of failure
4. Docker container status (`docker inspect brunix-assistance-engine`)