# Brunix Assistance Engine — Operations Runbook > **Audience:** Engineers on-call, DevOps, and anyone debugging the Brunix Engine in a live environment. > **Last updated:** 2026-03-18 --- ## Table of Contents 1. [Health Checks](#1-health-checks) 2. [Starting the Engine](#2-starting-the-engine) 3. [Stopping & Restarting](#3-stopping--restarting) 4. [Tunnel Management](#4-tunnel-management) 5. [Incident Playbooks](#5-incident-playbooks) - [Engine fails to start](#51-engine-fails-to-start) - [Elasticsearch unreachable](#52-elasticsearch-unreachable) - [Ollama unreachable / model not found](#53-ollama-unreachable--model-not-found) - [AskAgent returns `[ENG] Error`](#54-askagent-returns-eng-error) - [EvaluateRAG returns ANTHROPIC_API_KEY error](#55-evaluaterag-returns-anthropic_api_key-error) - [Container memory / OOM](#56-container-memory--oom) - [Session history not persisting between requests](#57-session-history-not-persisting-between-requests) 6. [Log Reference](#6-log-reference) 7. [Useful Commands](#7-useful-commands) 8. [Escalation Path](#8-escalation-path) --- ## 1. Health Checks ### Is the gRPC server up? ```bash grpcurl -plaintext localhost:50052 list # Expected: brunix.AssistanceEngine ``` If `grpcurl` hangs or returns a connection error, the container is not running or the port is not mapped. ### Is Elasticsearch reachable? ```bash curl -s http://localhost:9200/_cluster/health | python3 -m json.tool # Expected: "status": "green" or "yellow" ``` ### Is Ollama reachable? ```bash curl -s http://localhost:11434/api/tags | python3 -m json.tool # Expected: list of available models including qwen2.5:1.5b ``` ### Is the embedding model loaded? ```bash curl -s http://localhost:11434/api/tags | grep qwen3-0.6B-emb # Expected: model entry present ``` ### Is Langfuse reachable? ```bash curl -s http://45.77.119.180/api/public/health # Expected: {"status":"ok"} ``` --- ## 2. 
Starting the Engine ### Prerequisites checklist - [ ] Kubeconfig present at `./kubernetes/kubeconfig.yaml` - [ ] `.env` file populated with all required variables (see `README.md`) - [ ] All three kubectl tunnels active (see [§4](#4-tunnel-management)) - [ ] Docker daemon running ### Start command ```bash cd Docker/ docker-compose up -d --build ``` ### Verify startup ```bash # Watch logs until you see "Brunix Engine initialized." docker logs -f brunix-assistance-engine # Expected log sequence: # [ESEARCH] Connected: 8.x.x — index: avap-docs-test # [ENGINE] listen on 50051 (gRPC) # Brunix Engine initialized. # [entrypoint] Starting OpenAI Proxy (HTTP :8000)... ``` **Startup typically takes 20–60 seconds** depending on Ollama model loading time. --- ## 3. Stopping & Restarting ```bash # Graceful stop docker-compose down # Hard stop (if container is unresponsive) docker stop brunix-assistance-engine docker rm brunix-assistance-engine # Restart only the engine (no rebuild) docker-compose restart brunix-engine # Rebuild and restart (after code changes) docker-compose up -d --build ``` > ⚠️ **Restart clears all in-memory session history.** All active conversations will lose context. --- ## 4. Tunnel Management All three tunnels must be active for the engine to function. Run each in a separate terminal or as a background process. 
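To see at a glance which of the three local ports already have a listener (and therefore which tunnels still need starting), the probe below can help. It is a sketch that relies only on bash's built-in `/dev/tcp` redirection, so it needs no extra tools; the port numbers match the tunnel commands that follow:

```bash
# tunnel-probe.sh — report which of the three forwarded ports
# currently have a listener, using bash's /dev/tcp redirection.
check_port() {
  local name="$1" port="$2"
  # Opening fd 3 to /dev/tcp/host/port succeeds only if something listens.
  if (exec 3<>"/dev/tcp/127.0.0.1/${port}") 2>/dev/null; then
    echo "UP    ${name} (:${port})"
  else
    echo "DOWN  ${name} (:${port})"
  fi
}

check_port "Ollama"        11434
check_port "Elasticsearch" 9200
check_port "PostgreSQL"    5432
```

Any `DOWN` line means the matching tunnel command below needs to be run (or re-run).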
```bash
# Tunnel 1 — Ollama (LLM + embeddings)
kubectl port-forward --address 0.0.0.0 svc/ollama-light-service 11434:11434 \
  -n brunix --kubeconfig ./kubernetes/kubeconfig.yaml

# Tunnel 2 — Elasticsearch (vector knowledge base)
kubectl port-forward --address 0.0.0.0 svc/brunix-vector-db 9200:9200 \
  -n brunix --kubeconfig ./kubernetes/kubeconfig.yaml

# Tunnel 3 — PostgreSQL (Langfuse observability)
kubectl port-forward --address 0.0.0.0 svc/brunix-postgres 5432:5432 \
  -n brunix --kubeconfig ./kubernetes/kubeconfig.yaml
```

### Check tunnel status

```bash
# List active port-forwards
ps aux | grep "kubectl port-forward"

# Alternatively
lsof -i :11434
lsof -i :9200
lsof -i :5432
```

### Tunnel dropped?

`kubectl port-forward` sessions drop silently when the connection to the target pod is interrupted (for example, after a pod restart). Symptoms:

- Elasticsearch: `[ESEARCH] Cant Connect` in engine logs
- Ollama: requests time out or return connection errors
- Langfuse: tracing data stops appearing in the dashboard

**Fix:** Re-run the affected tunnel command. The engine will reconnect automatically on the next request.

---

## 5. Incident Playbooks

### 5.1 Engine fails to start

**Symptom:** `docker-compose up` exits immediately, or the container restarts in a loop.

**Diagnosis:**

```bash
docker logs brunix-assistance-engine 2>&1 | head -50
```

**Common causes and fixes:**

| Log message | Cause | Fix |
|---|---|---|
| `Cannot connect to Ollama` | Ollama tunnel not running | Start Tunnel 1 |
| `model 'qwen2.5:1.5b' not found` | Model not loaded in Ollama | See [§5.3](#53-ollama-unreachable--model-not-found) |
| `ELASTICSEARCH_URL not set` | Missing `.env` | Check `.env` file exists and is complete |
| `No module named 'brunix_pb2'` | Proto stubs not generated | Run `docker-compose up --build` |
| `Port 50051 already in use` | Another instance running | `docker stop brunix-assistance-engine && docker rm brunix-assistance-engine` |

---

### 5.2 Elasticsearch unreachable

**Symptom:** Log shows `[ESEARCH] Cant Connect`. Queries return empty context.
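Before walking through the steps, a one-shot triage can tell a dead tunnel apart from a reachable-but-unhealthy cluster. This is a sketch that wraps the same `_cluster/health` endpoint used below:

```bash
# es-triage.sh — print "unreachable" when the port/tunnel is down,
# otherwise print the cluster status reported by Elasticsearch.
es_status() {
  local url="${1:-http://localhost:9200/_cluster/health}"
  local body
  if ! body=$(curl -s --max-time 3 "$url"); then
    echo "unreachable"   # connection refused or timeout: tunnel is down
    return
  fi
  # Crude extraction of "status":"green|yellow|red" without requiring jq.
  echo "$body" | sed -n 's/.*"status" *: *"\([a-z]*\)".*/\1/p'
}

es_status
```

`unreachable` points at the tunnel (Step 2 below); `red` points at the cluster itself.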
**Step 1 — Verify tunnel:** ```bash curl -s http://localhost:9200/_cluster/health ``` **Step 2 — Restart tunnel if down:** ```bash kubectl port-forward --address 0.0.0.0 svc/brunix-vector-db 9200:9200 \ -n brunix --kubeconfig ./kubernetes/kubeconfig.yaml ``` **Step 3 — Check index exists:** ```bash curl -s http://localhost:9200/_cat/indices?v | grep avap ``` If the index is missing, the knowledge base has not been ingested. Run: ```bash cd scripts/pipelines/flows/ python elasticsearch_ingestion.py ``` **Step 4 — Verify authentication:** If your cluster uses authentication, confirm `ELASTICSEARCH_USER` + `ELASTICSEARCH_PASSWORD` or `ELASTICSEARCH_API_KEY` are set in `.env`. --- ### 5.3 Ollama unreachable / model not found **Symptom:** Engine logs show connection errors to `http://host.docker.internal:11434`, or `validate_model_on_init=True` raises a model-not-found error on startup. **Step 1 — Verify Ollama tunnel is active:** ```bash curl -s http://localhost:11434/api/tags ``` **Step 2 — List available models:** ```bash curl -s http://localhost:11434/api/tags | python3 -c " import json, sys data = json.load(sys.stdin) for m in data.get('models', []): print(m['name']) " ``` **Step 3 — Pull missing models if needed:** ```bash # On the Devaron cluster (via kubectl exec or direct access): ollama pull qwen2.5:1.5b ollama pull qwen3-0.6B-emb:latest ``` **Step 4 — Restart engine** after models are available: ```bash docker-compose restart brunix-engine ``` --- ### 5.4 AskAgent returns `[ENG] Error` **Symptom:** Client receives `{"text": "[ENG] Error: ...", "is_final": true}`. 
**Diagnosis:**

```bash
docker logs brunix-assistance-engine 2>&1 | grep -A 10 "Error"
```

| Error substring | Cause | Fix |
|---|---|---|
| `Connection refused` to `11434` | Ollama tunnel down | Restart Tunnel 1 |
| `Connection refused` to `9200` | ES tunnel down | Restart Tunnel 2 |
| `Index not found` | ES index missing | Run ingestion pipeline |
| `context length exceeded` | Query + history too long for model | Reduce session history or use a larger context model |
| `Traceback` / `KeyError` | Code bug | Check full traceback, open GitHub Issue |

---

### 5.5 EvaluateRAG returns ANTHROPIC_API_KEY error

**Symptom:** `EvalResponse.status` = `"ANTHROPIC_API_KEY no configurada en .env"` (i.e. "ANTHROPIC_API_KEY not configured in `.env`").

**Fix:**

1. Add `ANTHROPIC_API_KEY=sk-ant-...` to your `.env` file.
2. Add `ANTHROPIC_MODEL=claude-sonnet-4-20250514` (optional; a default is used if unset).
3. Restart the engine: `docker-compose restart brunix-engine`.

---

### 5.6 Container memory / OOM

**Symptom:** Container is killed by the OOM killer. `docker inspect brunix-assistance-engine` shows `OOMKilled: true`.

**Diagnosis:**

```bash
docker stats brunix-assistance-engine
```

**Common causes:**

- Large context window being passed to Ollama (many retrieved chunks × long document).
- Session history growing unbounded over a long-running session.

**Mitigation:**

- Set `mem_limit` in `docker-compose.yaml`:

```yaml
services:
  brunix-engine:
    mem_limit: 4g
```

- Restart the container to clear the session store.
- Consider lowering the retrieval depth (`k=8` by default) in `hybrid_search_native` to limit context size.

---

### 5.7 Session history not persisting between requests

**Expected behaviour:** Sending two requests with the same `session_id` should maintain context.

**If Turn 2 does not seem to know about Turn 1:**

1. Confirm both requests use **identical** `session_id` strings (case-sensitive, no trailing spaces).
2. Confirm the engine was **not restarted** between the two requests (restart wipes `session_store`).
3.
Check logs for `[AskAgentStream] conversation: N previous messages.` — if `N=0` on Turn 2, the session was not found. 4. Confirm the stream for Turn 1 was **fully consumed** (client read all messages including `is_final=true`) — the engine only persists history after the stream ends. --- ## 6. Log Reference | Log prefix | Module | What it means | |---|---|---| | `[ESEARCH] Connected` | `server.py` | Elasticsearch OK on startup | | `[ESEARCH] Cant Connect` | `server.py` | Elasticsearch unreachable on startup | | `[ENGINE] listen on 50051` | `server.py` | gRPC server ready | | `[AskAgent] session=... query=...` | `server.py` | New non-streaming request | | `[AskAgent] conversation: N messages` | `server.py` | History loaded for session | | `[AskAgentStream] done — chunks=N` | `server.py` | Stream completed, history saved | | `[classify] raw=... -> TYPE` | `graph.py` | Query classification result | | `[reformulate] -> '...'` | `graph.py` | Reformulated query | | `[hybrid] BM25 -> N hits` | `graph.py` | BM25 retrieval result | | `[hybrid] kNN -> N hits` | `graph.py` | kNN retrieval result | | `[hybrid] RRF -> N final docs` | `graph.py` | After RRF fusion | | `[retrieve] N docs, context len=X` | `graph.py` | Context assembled | | `[generate] X chars` | `graph.py` | Non-streaming answer generated | | `[eval] Iniciando: N preguntas` | `evaluate.py` | Evaluation started | | `[eval] Completado — global=X` | `evaluate.py` | Evaluation finished | --- ## 7. 
Useful Commands ```bash # Real-time log streaming docker logs -f brunix-assistance-engine # Filter for errors only docker logs brunix-assistance-engine 2>&1 | grep -i error # Check container resource usage docker stats brunix-assistance-engine --no-stream # Enter container for debugging docker exec -it brunix-assistance-engine /bin/bash # Send a test query grpcurl -plaintext \ -d '{"query": "What is AVAP?", "session_id": "test"}' \ localhost:50052 brunix.AssistanceEngine/AskAgent # Check ES index document count curl -s "http://localhost:9200/avap-docs-test/_count" | python3 -m json.tool # Check ES index mapping curl -s "http://localhost:9200/avap-docs-test/_mapping" | python3 -m json.tool # List active containers docker ps --filter name=brunix # Check port bindings docker port brunix-assistance-engine ``` --- ## 8. Escalation Path | Severity | Condition | Action | |---|---|---| | P1 | Engine completely down, not recoverable in 15 min | Notify via Slack `#brunix-incidents` immediately. Tag CTO. | | P2 | Degraded quality (bad answers) or evaluation score drops below 0.60 | Open GitHub Issue with full log output and evaluation report. | | P3 | Tunnel instability, intermittent errors | Report in daily standup. Document in GitHub Issue within 24h. | | P4 | Documentation gap or non-critical config issue | Open GitHub Issue with label `documentation` or `improvement`. | **For all P1/P2 incidents, the GitHub Issue must include:** 1. Exact command that triggered the failure 2. Full terminal output / error log 3. Status of all three kubectl tunnels at the time of failure 4. Docker container status (`docker inspect brunix-assistance-engine`)
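The four required items can be gathered in one step with a helper along these lines. It is a sketch: the container name and tunnel checks are the ones used throughout this runbook, and each step degrades to a placeholder when the tool or container is absent:

```bash
# incident-report.sh — assemble the four items required for P1/P2
# issues into a single markdown snippet, ready to paste into GitHub.
collect() {
  echo "## Incident report: $(date -u +%Y-%m-%dT%H:%MZ)"
  echo "### 1. Triggering command"
  echo "${1:-<fill in the exact command>}"
  echo "### 2. Error log (last 50 lines)"
  docker logs brunix-assistance-engine 2>&1 | tail -50
  echo "### 3. Tunnel status"
  # [k] trick stops grep from matching its own process entry.
  ps aux | grep "[k]ubectl port-forward" || echo "(no active port-forwards)"
  echo "### 4. Container status"
  docker inspect --format 'status={{.State.Status}} OOMKilled={{.State.OOMKilled}}' \
    brunix-assistance-engine 2>/dev/null || echo "(container not found)"
}

collect "$@"
```

Run it as `./incident-report.sh "docker-compose up -d --build" > report.md` and attach `report.md` to the issue.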