Brunix Assistance Engine — Operations Runbook
Audience: Engineers on-call, DevOps, and anyone debugging the Brunix Engine in a live environment. Last updated: 2026-03-18
Table of Contents
- Health Checks
- Starting the Engine
- Stopping & Restarting
- Tunnel Management
- Incident Playbooks
- Log Reference
- Useful Commands
- Escalation Path
1. Health Checks
Is the gRPC server up?
grpcurl -plaintext localhost:50052 list
# Expected: brunix.AssistanceEngine
If grpcurl hangs or returns a connection error, the container is not running or the port is not mapped.
Is Elasticsearch reachable?
curl -s http://localhost:9200/_cluster/health | python3 -m json.tool
# Expected: "status": "green" or "yellow"
Is Ollama reachable?
curl -s http://localhost:11434/api/tags | python3 -m json.tool
# Expected: list of available models including qwen2.5:1.5b
Is the embedding model loaded?
curl -s http://localhost:11434/api/tags | grep qwen3-0.6B-emb
# Expected: model entry present
Is Langfuse reachable?
curl -s http://45.77.119.180/api/public/health
# Expected: {"status":"ok"}
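The expected responses above can also be checked programmatically when wiring these probes into a script or monitor. A minimal sketch in Python (function names are illustrative helpers, not part of the engine):

```python
import json

def es_healthy(payload: str) -> bool:
    """Elasticsearch is usable when cluster status is green or yellow."""
    try:
        return json.loads(payload).get("status") in ("green", "yellow")
    except (json.JSONDecodeError, AttributeError):
        return False

def ollama_has_model(payload: str, model: str) -> bool:
    """True when the /api/tags payload lists the given model."""
    try:
        models = json.loads(payload).get("models", [])
    except json.JSONDecodeError:
        return False
    return any(m.get("name", "").startswith(model) for m in models)
```

Feed the raw body of `curl -s http://localhost:9200/_cluster/health` into `es_healthy`, and the `/api/tags` body plus a model name (e.g. `qwen2.5:1.5b`) into `ollama_has_model`.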
2. Starting the Engine
Prerequisites checklist
- Kubeconfig present at ./kubernetes/kubeconfig.yaml
- .env file populated with all required variables (see README.md)
- All three kubectl tunnels active (see §4)
- Docker daemon running
Start command
cd Docker/
docker-compose up -d --build
Verify startup
# Watch logs until you see "Brunix Engine initialized."
docker logs -f brunix-assistance-engine
# Expected log sequence:
# [ESEARCH] Connected: 8.x.x — index: avap-docs-test
# [ENGINE] listen on 50051 (gRPC)
# Brunix Engine initialized.
# [entrypoint] Starting OpenAI Proxy (HTTP :8000)...
Startup typically takes 20–60 seconds depending on Ollama model loading time.
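If you script the startup, you can wait for readiness instead of tailing logs by hand. A sketch that polls `docker logs` for the marker line shown above (`wait_for_ready` and its defaults are illustrative, not part of the repo):

```python
import subprocess
import time

READY_MARKER = "Brunix Engine initialized."

def is_ready(log_text: str) -> bool:
    """True once the readiness marker has appeared in the container logs."""
    return READY_MARKER in log_text

def wait_for_ready(container: str = "brunix-assistance-engine",
                   timeout_s: int = 120, poll_s: int = 5) -> bool:
    """Poll `docker logs` until the marker appears or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        result = subprocess.run(["docker", "logs", container],
                                capture_output=True, text=True)
        if is_ready(result.stdout + result.stderr):
            return True
        time.sleep(poll_s)
    return False
```

A 120-second timeout comfortably covers the 20–60 second typical startup noted above.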
3. Stopping & Restarting
# Graceful stop
docker-compose down
# Hard stop (if container is unresponsive)
docker stop brunix-assistance-engine
docker rm brunix-assistance-engine
# Restart only the engine (no rebuild)
docker-compose restart brunix-engine
# Rebuild and restart (after code changes)
docker-compose up -d --build
⚠️ Restart clears all in-memory session history. All active conversations will lose context.
4. Tunnel Management
All three tunnels must be active for the engine to function. Run each in a separate terminal or as a background process.
# Tunnel 1 — Ollama (LLM + embeddings)
kubectl port-forward --address 0.0.0.0 svc/ollama-light-service 11434:11434 \
-n brunix --kubeconfig ./kubernetes/kubeconfig.yaml
# Tunnel 2 — Elasticsearch (vector knowledge base)
kubectl port-forward --address 0.0.0.0 svc/brunix-vector-db 9200:9200 \
-n brunix --kubeconfig ./kubernetes/kubeconfig.yaml
# Tunnel 3 — PostgreSQL (Langfuse observability)
kubectl port-forward --address 0.0.0.0 svc/brunix-postgres 5432:5432 \
-n brunix --kubeconfig ./kubernetes/kubeconfig.yaml
Check tunnel status
# List active port-forwards
ps aux | grep "kubectl port-forward"
# Alternatively
lsof -i :11434
lsof -i :9200
lsof -i :5432
Tunnel dropped?
kubectl tunnels drop silently. Symptoms:
- Elasticsearch: [ESEARCH] Cant Connect in engine logs
- Ollama: requests time out or return connection errors
- Langfuse: tracing data stops appearing in the dashboard
Fix: Re-run the affected tunnel command. The engine will reconnect automatically on the next request.
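Because tunnels drop silently, a quick programmatic check of the three local ports is handy before digging into engine logs. A minimal sketch using only the standard library (`TUNNEL_PORTS` mirrors the three tunnels above; the helper names are illustrative):

```python
import socket

def port_open(host: str, port: int, timeout_s: float = 2.0) -> bool:
    """True when something is listening on host:port (i.e. the tunnel is up)."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False

# Local ports forwarded by Tunnels 1-3.
TUNNEL_PORTS = {"ollama": 11434, "elasticsearch": 9200, "postgres": 5432}

def check_tunnels(host: str = "127.0.0.1") -> dict:
    """Return {tunnel_name: is_up} for each required tunnel."""
    return {name: port_open(host, port) for name, port in TUNNEL_PORTS.items()}
```

Any `False` entry in the result means the corresponding port-forward command needs to be re-run.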
5. Incident Playbooks
5.1 Engine fails to start
Symptom: docker-compose up exits immediately, or container restarts in a loop.
Diagnosis:
docker logs brunix-assistance-engine 2>&1 | head -50
Common causes and fixes:
| Log message | Cause | Fix |
|---|---|---|
| `Cannot connect to Ollama` | Ollama tunnel not running | Start Tunnel 1 |
| `model 'qwen2.5:1.5b' not found` | Model not loaded in Ollama | See §5.3 |
| `ELASTICSEARCH_URL not set` | Missing `.env` | Check `.env` file exists and is complete |
| `No module named 'brunix_pb2'` | Proto stubs not generated | Run `docker-compose up --build` |
| `Port 50051 already in use` | Another instance running | `docker stop brunix-assistance-engine && docker rm brunix-assistance-engine` |
5.2 Elasticsearch unreachable
Symptom: Log shows [ESEARCH] Cant Connect. Queries return empty context.
Step 1 — Verify tunnel:
curl -s http://localhost:9200/_cluster/health
Step 2 — Restart tunnel if down:
kubectl port-forward --address 0.0.0.0 svc/brunix-vector-db 9200:9200 \
-n brunix --kubeconfig ./kubernetes/kubeconfig.yaml
Step 3 — Check index exists:
curl -s http://localhost:9200/_cat/indices?v | grep avap
If the index is missing, the knowledge base has not been ingested. Run:
cd scripts/pipelines/flows/
python elasticsearch_ingestion.py
Step 4 — Verify authentication:
If your cluster uses authentication, confirm ELASTICSEARCH_USER + ELASTICSEARCH_PASSWORD or ELASTICSEARCH_API_KEY are set in .env.
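On an authenticated cluster, manual curl probes need the same credentials the engine uses. A sketch of building the right `Authorization` header from those `.env` keys (the precedence shown, API key over basic auth, is an assumption about this deployment):

```python
import base64
import os

def es_auth_header() -> dict:
    """Build an Authorization header from .env-style variables.
    Prefers ELASTICSEARCH_API_KEY; falls back to user/password basic auth."""
    api_key = os.environ.get("ELASTICSEARCH_API_KEY")
    if api_key:
        return {"Authorization": f"ApiKey {api_key}"}
    user = os.environ.get("ELASTICSEARCH_USER")
    password = os.environ.get("ELASTICSEARCH_PASSWORD")
    if user and password:
        token = base64.b64encode(f"{user}:{password}".encode()).decode()
        return {"Authorization": f"Basic {token}"}
    return {}  # unauthenticated cluster
```

An empty dict means no credentials are configured, which on a secured cluster will show up as 401 responses.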
5.3 Ollama unreachable / model not found
Symptom: Engine logs show connection errors to http://host.docker.internal:11434, or validate_model_on_init=True raises a model-not-found error on startup.
Step 1 — Verify Ollama tunnel is active:
curl -s http://localhost:11434/api/tags
Step 2 — List available models:
curl -s http://localhost:11434/api/tags | python3 -c "
import json, sys
data = json.load(sys.stdin)
for m in data.get('models', []):
print(m['name'])
"
Step 3 — Pull missing models if needed:
# On the Devaron cluster (via kubectl exec or direct access):
ollama pull qwen2.5:1.5b
ollama pull qwen3-0.6B-emb:latest
Step 4 — Restart engine after models are available:
docker-compose restart brunix-engine
5.4 AskAgent returns [ENG] Error
Symptom: Client receives {"text": "[ENG] Error: ...", "is_final": true}.
Diagnosis:
docker logs brunix-assistance-engine 2>&1 | grep -A 10 "Error"
| Error substring | Cause | Fix |
|---|---|---|
| `Connection refused` to 11434 | Ollama tunnel down | Restart Tunnel 1 |
| `Connection refused` to 9200 | ES tunnel down | Restart Tunnel 2 |
| `Index not found` | ES index missing | Run ingestion pipeline |
| `context length exceeded` | Query + history too long for model | Reduce session history or use a larger-context model |
| `Traceback` / `KeyError` | Code bug | Check full traceback, open GitHub Issue |
5.5 EvaluateRAG returns ANTHROPIC_API_KEY error
Symptom: EvalResponse.status = "ANTHROPIC_API_KEY no configurada en .env".
Fix:
- Add ANTHROPIC_API_KEY=sk-ant-... to your .env file.
- Add ANTHROPIC_MODEL=claude-sonnet-4-20250514 (optional, has a default).
- Restart the engine: docker-compose restart brunix-engine.
5.6 Container memory / OOM
Symptom: Container is killed by the OOM killer. docker inspect brunix-assistance-engine shows OOMKilled: true.
Diagnosis:
docker stats brunix-assistance-engine
Common causes:
- Large context window being passed to Ollama (many retrieved chunks × long document).
- Session history growing unbounded over a long-running session.
Mitigation:
- Set mem_limit in docker-compose.yaml:
    services:
      brunix-engine:
        mem_limit: 4g
- Restart the container to clear the session store.
- Consider reducing k=8 in hybrid_search_native to limit context size.
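Bounding session history is another way to cap memory growth. A sketch of a trim step that could run before each turn (the engine's actual session_store shape may differ; this assumes a chronological list of {role, content} dicts and illustrative budget values):

```python
def trim_history(history: list, max_messages: int = 20,
                 max_chars: int = 8000) -> list:
    """Keep only the most recent messages, within both count and size budgets."""
    trimmed = history[-max_messages:]   # hard cap on message count
    total = 0
    kept = []
    for msg in reversed(trimmed):       # walk newest-first
        total += len(msg.get("content", ""))
        if total > max_chars and kept:  # size budget hit (keep at least one)
            break
        kept.append(msg)
    return list(reversed(kept))         # restore chronological order
```

Dropping old turns trades recall of early conversation context for a bounded prompt size, which also helps with the `context length exceeded` errors in §5.4.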
5.7 Session history not persisting between requests
Expected behaviour: Sending two requests with the same session_id should maintain context.
If Turn 2 does not seem to know about Turn 1:
- Confirm both requests use identical session_id strings (case-sensitive, no trailing spaces).
- Confirm the engine was not restarted between the two requests (a restart wipes session_store).
- Check logs for [AskAgentStream] conversation: N previous messages; if N=0 on Turn 2, the session was not found.
- Confirm the stream for Turn 1 was fully consumed (client read all messages including is_final=true); the engine only persists history after the stream ends.
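The checklist reflects how an in-memory session store typically behaves. A minimal sketch illustrating the failure modes above (this is an illustration, not the engine's actual code):

```python
session_store = {}  # session_id -> list of messages; lost on restart

def save_turn(session_id: str, user_msg: str, answer: str) -> None:
    """Persist a completed turn. In the engine this happens only after the
    stream has been fully consumed (is_final=true delivered)."""
    history = session_store.setdefault(session_id, [])
    history.append({"role": "user", "content": user_msg})
    history.append({"role": "assistant", "content": answer})

def load_history(session_id: str) -> list:
    """Exact-match lookup: 'test' and 'test ' are different sessions."""
    return session_store.get(session_id, [])
```

Note that `load_history("test ")` returns an empty list even after `save_turn("test", ...)`, which is exactly the trailing-space pitfall from item 1.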
6. Log Reference
| Log prefix | Module | What it means |
|---|---|---|
| `[ESEARCH] Connected` | server.py | Elasticsearch OK on startup |
| `[ESEARCH] Cant Connect` | server.py | Elasticsearch unreachable on startup |
| `[ENGINE] listen on 50051` | server.py | gRPC server ready |
| `[AskAgent] session=... query=...` | server.py | New non-streaming request |
| `[AskAgent] conversation: N messages` | server.py | History loaded for session |
| `[AskAgentStream] done — chunks=N` | server.py | Stream completed, history saved |
| `[classify] raw=... -> TYPE` | graph.py | Query classification result |
| `[reformulate] -> '...'` | graph.py | Reformulated query |
| `[hybrid] BM25 -> N hits` | graph.py | BM25 retrieval result |
| `[hybrid] kNN -> N hits` | graph.py | kNN retrieval result |
| `[hybrid] RRF -> N final docs` | graph.py | After RRF fusion |
| `[retrieve] N docs, context len=X` | graph.py | Context assembled |
| `[generate] X chars` | graph.py | Non-streaming answer generated |
| `[eval] Iniciando: N preguntas` | evaluate.py | Evaluation started |
| `[eval] Completado — global=X` | evaluate.py | Evaluation finished |
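The [hybrid] lines trace a BM25 + kNN retrieval fused with reciprocal rank fusion. For readers debugging those log counts, a generic RRF sketch (the engine's constant may differ; k=60 is the commonly used default):

```python
def rrf_fuse(rankings: list, k: int = 60) -> list:
    """Fuse ranked lists of doc ids: score(d) = sum over lists of 1/(k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest combined score first.
    return sorted(scores, key=scores.get, reverse=True)
```

A document ranked well by both BM25 and kNN outscores one ranked highly by only a single retriever, which is why the `[hybrid] RRF -> N final docs` count can differ from either input list.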
7. Useful Commands
# Real-time log streaming
docker logs -f brunix-assistance-engine
# Filter for errors only
docker logs brunix-assistance-engine 2>&1 | grep -i error
# Check container resource usage
docker stats brunix-assistance-engine --no-stream
# Enter container for debugging
docker exec -it brunix-assistance-engine /bin/bash
# Send a test query
grpcurl -plaintext \
-d '{"query": "What is AVAP?", "session_id": "test"}' \
localhost:50052 brunix.AssistanceEngine/AskAgent
# Check ES index document count
curl -s "http://localhost:9200/avap-docs-test/_count" | python3 -m json.tool
# Check ES index mapping
curl -s "http://localhost:9200/avap-docs-test/_mapping" | python3 -m json.tool
# List active containers
docker ps --filter name=brunix
# Check port bindings
docker port brunix-assistance-engine
8. Escalation Path
| Severity | Condition | Action |
|---|---|---|
| P1 | Engine completely down, not recoverable in 15 min | Notify via Slack #brunix-incidents immediately. Tag CTO. |
| P2 | Degraded quality (bad answers) or evaluation score drops below 0.60 | Open GitHub Issue with full log output and evaluation report. |
| P3 | Tunnel instability, intermittent errors | Report in daily standup. Document in GitHub Issue within 24h. |
| P4 | Documentation gap or non-critical config issue | Open GitHub Issue with label documentation or improvement. |
For all P1/P2 incidents, the GitHub Issue must include:
- Exact command that triggered the failure
- Full terminal output / error log
- Status of all three kubectl tunnels at the time of failure
- Docker container status (
docker inspect brunix-assistance-engine)