assistance-engine/docs/RUNBOOK.md

Brunix Assistance Engine — Operations Runbook

Audience: Engineers on-call, DevOps, and anyone debugging the Brunix Engine in a live environment. Last updated: 2026-03-18


Table of Contents

  1. Health Checks
  2. Starting the Engine
  3. Stopping & Restarting
  4. Tunnel Management
  5. Incident Playbooks
  6. Log Reference
  7. Useful Commands
  8. Escalation Path

1. Health Checks

Is the gRPC server up?

grpcurl -plaintext localhost:50052 list
# Expected: brunix.AssistanceEngine

If grpcurl hangs or returns a connection error, the container is not running or the port is not mapped.

Is Elasticsearch reachable?

curl -s http://localhost:9200/_cluster/health | python3 -m json.tool
# Expected: "status": "green" or "yellow"

Is Ollama reachable?

curl -s http://localhost:11434/api/tags | python3 -m json.tool
# Expected: list of available models including qwen2.5:1.5b

Is the embedding model loaded?

curl -s http://localhost:11434/api/tags | grep qwen3-0.6B-emb
# Expected: model entry present

Is Langfuse reachable?

curl -s http://45.77.119.180/api/public/health
# Expected: {"status":"ok"}
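The checks above can be rolled into a single pass. A minimal sketch — `check_all.sh`, `check`, and `main` are hypothetical helpers, not part of the engine; the URLs and ports are the ones listed in this section (the embedding-model grep is omitted):

```shell
#!/usr/bin/env bash
# check_all.sh — run the runbook health checks in one pass, printing OK/FAIL per dependency.

# check NAME URL — print OK or FAIL for one HTTP dependency.
check() {
  local name=$1 url=$2
  if curl -sf --max-time 5 "$url" > /dev/null; then
    echo "OK    $name"
  else
    echo "FAIL  $name ($url)"
    return 1
  fi
}

main() {
  check "Elasticsearch" "http://localhost:9200/_cluster/health"  || true
  check "Ollama"        "http://localhost:11434/api/tags"        || true
  check "Langfuse"      "http://45.77.119.180/api/public/health" || true
  # The gRPC port needs grpcurl rather than curl:
  if grpcurl -plaintext -max-time 5 localhost:50052 list > /dev/null 2>&1; then
    echo "OK    gRPC server"
  else
    echo "FAIL  gRPC server (localhost:50052)"
  fi
}

main "$@"
```

Any FAIL line maps directly onto one of the checks above; start debugging with the corresponding playbook in §5.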

2. Starting the Engine

Prerequisites checklist

  • Kubeconfig present at ./kubernetes/kubeconfig.yaml
  • .env file populated with all required variables (see README.md)
  • All three kubectl tunnels active (see §4)
  • Docker daemon running

Start command

cd Docker/
docker-compose up -d --build

Verify startup

# Watch logs until you see "Brunix Engine initialized."
docker logs -f brunix-assistance-engine

# Expected log sequence:
# [ESEARCH] Connected: 8.x.x — index: avap-docs-test
# [ENGINE] listen on 50051 (gRPC)
# Brunix Engine initialized.
# [entrypoint] Starting OpenAI Proxy (HTTP :8000)...

Startup typically takes 20–60 seconds, depending on Ollama model loading time.
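Since startup is not instant, scripts that run right after `docker-compose up` can poll instead of sleeping blindly. A small sketch — `wait_for` is a hypothetical helper, and the 90-second ceiling is an assumption chosen to sit above the startup time noted above:

```shell
# wait_for CMD TIMEOUT_SECONDS — poll a command once per second until it
# succeeds or the timeout expires. Returns non-zero on timeout.
wait_for() {
  local deadline=$(( $(date +%s) + $2 ))
  until eval "$1" > /dev/null 2>&1; do
    if (( $(date +%s) >= deadline )); then
      echo "timed out waiting for: $1" >&2
      return 1
    fi
    sleep 1
  done
}

# Example (run while the engine is starting):
# wait_for "grpcurl -plaintext localhost:50052 list" 90
```

`wait_for "grpcurl -plaintext localhost:50052 list" 90` blocks until the gRPC server answers, so a deployment script can chain the smoke test directly after it.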


3. Stopping & Restarting

# Graceful stop
docker-compose down

# Hard stop (if container is unresponsive)
docker stop brunix-assistance-engine
docker rm brunix-assistance-engine

# Restart only the engine (no rebuild)
docker-compose restart brunix-engine

# Rebuild and restart (after code changes)
docker-compose up -d --build

⚠️ Restart clears all in-memory session history. All active conversations will lose context.


4. Tunnel Management

All three tunnels must be active for the engine to function. Run each in a separate terminal or as a background process.

# Tunnel 1 — Ollama (LLM + embeddings)
kubectl port-forward --address 0.0.0.0 svc/ollama-light-service 11434:11434 \
  -n brunix --kubeconfig ./kubernetes/kubeconfig.yaml

# Tunnel 2 — Elasticsearch (vector knowledge base)
kubectl port-forward --address 0.0.0.0 svc/brunix-vector-db 9200:9200 \
  -n brunix --kubeconfig ./kubernetes/kubeconfig.yaml

# Tunnel 3 — PostgreSQL (Langfuse observability)
kubectl port-forward --address 0.0.0.0 svc/brunix-postgres 5432:5432 \
  -n brunix --kubeconfig ./kubernetes/kubeconfig.yaml

Check tunnel status

# List active port-forwards
ps aux | grep "kubectl port-forward"

# Alternatively
lsof -i :11434
lsof -i :9200
lsof -i :5432
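Where `lsof` is unavailable, bash itself can probe the three ports. A sketch using bash's `/dev/tcp` pseudo-device (bash-specific, will not work under plain `sh`; `port_open` is a hypothetical helper):

```shell
# port_open PORT — succeed if something accepts connections on localhost:PORT.
# Relies on bash's /dev/tcp pseudo-device, so no lsof or netcat is needed.
port_open() {
  (exec 3<> "/dev/tcp/127.0.0.1/$1") 2>/dev/null
}

# Probe the three tunnel ports from §4:
for port in 11434 9200 5432; do
  if port_open "$port"; then
    echo "tunnel on :$port is up"
  else
    echo "tunnel on :$port is DOWN"
  fi
done
```

Note this only confirms something is listening on the port, not that the far end of the tunnel is healthy; pair it with the curl checks from §1.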

Tunnel dropped?

kubectl port-forward tunnels drop silently. Symptoms:

  • Elasticsearch: [ESEARCH] Cant Connect in engine logs
  • Ollama: requests timeout or return connection errors
  • Langfuse: tracing data stops appearing in the dashboard

Fix: Re-run the affected tunnel command. The engine will reconnect automatically on the next request.


5. Incident Playbooks

5.1 Engine fails to start

Symptom: docker-compose up exits immediately, or container restarts in a loop.

Diagnosis:

docker logs brunix-assistance-engine 2>&1 | head -50

Common causes and fixes:

| Log message | Cause | Fix |
| --- | --- | --- |
| Cannot connect to Ollama | Ollama tunnel not running | Start Tunnel 1 |
| model 'qwen2.5:1.5b' not found | Model not loaded in Ollama | See §5.3 |
| ELASTICSEARCH_URL not set | Missing .env | Check that .env exists and is complete |
| No module named 'brunix_pb2' | Proto stubs not generated | Run docker-compose up --build |
| Port 50051 already in use | Another instance running | docker stop brunix-assistance-engine && docker rm brunix-assistance-engine |

5.2 Elasticsearch unreachable

Symptom: Log shows [ESEARCH] Cant Connect. Queries return empty context.

Step 1 — Verify tunnel:

curl -s http://localhost:9200/_cluster/health

Step 2 — Restart tunnel if down:

kubectl port-forward --address 0.0.0.0 svc/brunix-vector-db 9200:9200 \
  -n brunix --kubeconfig ./kubernetes/kubeconfig.yaml

Step 3 — Check index exists:

curl -s http://localhost:9200/_cat/indices?v | grep avap

If the index is missing, the knowledge base has not been ingested. Run:

cd scripts/pipelines/flows/
python elasticsearch_ingestion.py

Step 4 — Verify authentication:
If your cluster uses authentication, confirm ELASTICSEARCH_USER + ELASTICSEARCH_PASSWORD or ELASTICSEARCH_API_KEY are set in .env.


5.3 Ollama unreachable / model not found

Symptom: Engine logs show connection errors to http://host.docker.internal:11434, or validate_model_on_init=True raises a model-not-found error on startup.

Step 1 — Verify Ollama tunnel is active:

curl -s http://localhost:11434/api/tags

Step 2 — List available models:

curl -s http://localhost:11434/api/tags | python3 -c "
import json, sys
data = json.load(sys.stdin)
for m in data.get('models', []):
    print(m['name'])
"

Step 3 — Pull missing models if needed:

# On the Devaron cluster (via kubectl exec or direct access):
ollama pull qwen2.5:1.5b
ollama pull qwen3-0.6B-emb:latest

Step 4 — Restart engine after models are available:

docker-compose restart brunix-engine

5.4 AskAgent returns [ENG] Error

Symptom: Client receives {"text": "[ENG] Error: ...", "is_final": true}.

Diagnosis:

docker logs brunix-assistance-engine 2>&1 | grep -A 10 "Error"

| Error substring | Cause | Fix |
| --- | --- | --- |
| Connection refused to 11434 | Ollama tunnel down | Restart Tunnel 1 |
| Connection refused to 9200 | ES tunnel down | Restart Tunnel 2 |
| Index not found | ES index missing | Run ingestion pipeline |
| context length exceeded | Query + history too long for model | Reduce session history or use a larger-context model |
| Traceback / KeyError | Code bug | Check full traceback, open a GitHub Issue |

5.5 EvaluateRAG returns ANTHROPIC_API_KEY error

Symptom: EvalResponse.status = "ANTHROPIC_API_KEY no configurada en .env" (i.e. "ANTHROPIC_API_KEY not configured in .env").

Fix:

  1. Add ANTHROPIC_API_KEY=sk-ant-... to your .env file.
  2. Add ANTHROPIC_MODEL=claude-sonnet-4-20250514 (optional, has default).
  3. Restart the engine: docker-compose restart brunix-engine.

5.6 Container memory / OOM

Symptom: Container is killed by the OOM killer. docker inspect brunix-assistance-engine shows OOMKilled: true.

Diagnosis:

docker stats brunix-assistance-engine

Common causes:

  • Large context window being passed to Ollama (many retrieved chunks × long document).
  • Session history growing unbounded over a long-running session.

Mitigation:

  • Set mem_limit in docker-compose.yaml:
    services:
      brunix-engine:
        mem_limit: 4g
    
  • Restart the container to clear session store.
  • Consider reducing the retrieval depth k (currently 8) in hybrid_search_native to limit context size.

5.7 Session history not persisting between requests

Expected behaviour: Sending two requests with the same session_id should maintain context.

If Turn 2 does not seem to know about Turn 1:

  1. Confirm both requests use identical session_id strings (case-sensitive, no trailing spaces).
  2. Confirm the engine was not restarted between the two requests (restart wipes session_store).
  3. Check logs for [AskAgentStream] conversation: N previous messages. — if N=0 on Turn 2, the session was not found.
  4. Confirm the stream for Turn 1 was fully consumed (client read all messages including is_final=true) — the engine only persists history after the stream ends.
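The checklist above can be exercised with a two-turn smoke test. A sketch using the grpcurl request shape from §7 — `mk_payload` and `ask` are hypothetical helpers, and the log line to look for is the one from step 3:

```shell
# session_smoke.sh — send two turns with the same session_id, then check
# whether Turn 2 loaded Turn 1 from the session store.

SESSION="smoke-$(date +%s)"   # fresh id, so stale history cannot mask a bug

# mk_payload QUERY SESSION_ID — build the AskAgent request JSON.
mk_payload() {
  printf '{"query": "%s", "session_id": "%s"}' "$1" "$2"
}

# ask QUERY — one non-streaming AskAgent call against the local engine.
ask() {
  grpcurl -plaintext -d "$(mk_payload "$1" "$SESSION")" \
    localhost:50052 brunix.AssistanceEngine/AskAgent
}

if command -v grpcurl > /dev/null; then
  ask "My name is Alice."    || true   # tolerate failure when engine is down
  ask "What is my name?"     || true
  # Turn 2 should log a non-zero previous-message count:
  docker logs brunix-assistance-engine 2>&1 | grep "conversation:" | tail -2 || true
else
  echo "grpcurl not installed; skipping live calls" >&2
fi
```

If the second logged count is 0, work through the four checklist items above in order.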

6. Log Reference

| Log prefix | Module | What it means |
| --- | --- | --- |
| [ESEARCH] Connected | server.py | Elasticsearch OK on startup |
| [ESEARCH] Cant Connect | server.py | Elasticsearch unreachable on startup |
| [ENGINE] listen on 50051 | server.py | gRPC server ready |
| [AskAgent] session=... query=... | server.py | New non-streaming request |
| [AskAgent] conversation: N messages | server.py | History loaded for session |
| [AskAgentStream] done — chunks=N | server.py | Stream completed, history saved |
| [classify] raw=... -> TYPE | graph.py | Query classification result |
| [reformulate] -> '...' | graph.py | Reformulated query |
| [hybrid] BM25 -> N hits | graph.py | BM25 retrieval result |
| [hybrid] kNN -> N hits | graph.py | kNN retrieval result |
| [hybrid] RRF -> N final docs | graph.py | After RRF fusion |
| [retrieve] N docs, context len=X | graph.py | Context assembled |
| [generate] X chars | graph.py | Non-streaming answer generated |
| [eval] Iniciando: N preguntas | evaluate.py | Evaluation started |
| [eval] Completado — global=X | evaluate.py | Evaluation finished |

7. Useful Commands

# Real-time log streaming
docker logs -f brunix-assistance-engine

# Filter for errors only
docker logs brunix-assistance-engine 2>&1 | grep -i error

# Check container resource usage
docker stats brunix-assistance-engine --no-stream

# Enter container for debugging
docker exec -it brunix-assistance-engine /bin/bash

# Send a test query
grpcurl -plaintext \
  -d '{"query": "What is AVAP?", "session_id": "test"}' \
  localhost:50052 brunix.AssistanceEngine/AskAgent

# Check ES index document count
curl -s "http://localhost:9200/avap-docs-test/_count" | python3 -m json.tool

# Check ES index mapping
curl -s "http://localhost:9200/avap-docs-test/_mapping" | python3 -m json.tool

# List active containers
docker ps --filter name=brunix

# Check port bindings
docker port brunix-assistance-engine

8. Escalation Path

| Severity | Condition | Action |
| --- | --- | --- |
| P1 | Engine completely down, not recoverable in 15 min | Notify via Slack #brunix-incidents immediately. Tag CTO. |
| P2 | Degraded quality (bad answers) or evaluation score drops below 0.60 | Open GitHub Issue with full log output and evaluation report. |
| P3 | Tunnel instability, intermittent errors | Report in daily standup. Document in GitHub Issue within 24h. |
| P4 | Documentation gap or non-critical config issue | Open GitHub Issue with label documentation or improvement. |

For all P1/P2 incidents, the GitHub Issue must include:

  1. Exact command that triggered the failure
  2. Full terminal output / error log
  3. Status of all three kubectl tunnels at the time of failure
  4. Docker container status (docker inspect brunix-assistance-engine)
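Items 2–4 of that list can be gathered automatically while the incident is still live. A sketch — `collect_incident.sh` and `collect` are hypothetical; item 1, the exact triggering command, still has to be added by hand:

```shell
# collect_incident.sh — gather logs, tunnel status, and container status for
# a P1/P2 GitHub Issue. Error output is captured into the report rather than
# aborting, so partial environments still produce a file.

# collect [OUTFILE] — write the report and print its path.
collect() {
  local out="${1:-incident-$(date +%Y%m%d-%H%M%S).txt}"
  {
    echo "== Container status =="
    docker inspect brunix-assistance-engine 2>&1 || echo "unavailable"
    echo "== Last 200 log lines =="
    docker logs --tail 200 brunix-assistance-engine 2>&1 || echo "unavailable"
    echo "== Tunnel status =="
    for port in 11434 9200 5432; do
      echo "--- port $port ---"
      lsof -i ":$port" 2>&1 || echo "no listener on :$port"
    done
  } > "$out"
  echo "$out"
}

collect "$@"
```

Attach the resulting file to the Issue as-is; "unavailable" entries are themselves useful evidence of what was down at the time.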