assistance-engine/docs/RUNBOOK.md

Brunix Assistance Engine — Operations Runbook

Audience: Engineers on-call, DevOps, and anyone debugging the Brunix Engine in a live environment. Last updated: 2026-03-18


Table of Contents

  1. Health Checks
  2. Starting the Engine
  3. Stopping & Restarting
  4. Tunnel Management
  5. Incident Playbooks
  6. Log Reference
  7. Useful Commands
  8. Escalation Path

1. Health Checks

Is the gRPC server up?

grpcurl -plaintext localhost:50052 list
# Expected: brunix.AssistanceEngine

If grpcurl hangs or returns a connection error, the container is not running or the port is not mapped.

Is Elasticsearch reachable?

curl -s http://localhost:9200/_cluster/health | python3 -m json.tool
# Expected: "status": "green" or "yellow"

Is Ollama reachable?

curl -s http://localhost:11434/api/tags | python3 -m json.tool
# Expected: list of available models including qwen2.5:1.5b

Is the embedding model loaded?

curl -s http://localhost:11434/api/tags | grep qwen3-0.6B-emb
# Expected: model entry present

Is Langfuse reachable?

curl -s http://45.77.119.180/api/public/health
# Expected: {"status":"ok"}
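The checks above can be rolled into a single pass. A minimal sketch — `check_all.sh`, `check`, and `main` are hypothetical helpers, not part of the engine; the URLs and ports are the ones listed in this section (the embedding-model grep is omitted):

```shell
#!/usr/bin/env bash
# check_all.sh — run the runbook health checks in one pass, printing OK/FAIL per dependency.

# check NAME URL — print OK or FAIL for one HTTP dependency.
check() {
  local name=$1 url=$2
  if curl -sf --max-time 5 "$url" > /dev/null; then
    echo "OK    $name"
  else
    echo "FAIL  $name ($url)"
    return 1
  fi
}

main() {
  check "Elasticsearch" "http://localhost:9200/_cluster/health"  || true
  check "Ollama"        "http://localhost:11434/api/tags"        || true
  check "Langfuse"      "http://45.77.119.180/api/public/health" || true
  # The gRPC port needs grpcurl rather than curl:
  if grpcurl -plaintext -max-time 5 localhost:50052 list > /dev/null 2>&1; then
    echo "OK    gRPC server"
  else
    echo "FAIL  gRPC server (localhost:50052)"
  fi
}

main "$@"
```

Any FAIL line maps directly onto one of the checks above; start debugging with the corresponding playbook in §5.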

2. Starting the Engine

Prerequisites checklist

  • Kubeconfig present at ./kubernetes/kubeconfig.yaml
  • .env file populated with all required variables (see README.md)
  • All three kubectl tunnels active (see §4)
  • Docker daemon running

Start command

cd Docker/
docker-compose up -d --build

Verify startup

# Watch logs until you see "Brunix Engine initialized."
docker logs -f brunix-assistance-engine

# Expected log sequence:
# [ESEARCH] Connected: 8.x.x — index: avap-docs-test
# [ENGINE] listen on 50051 (gRPC)
# Brunix Engine initialized.
# [entrypoint] Starting OpenAI Proxy (HTTP :8000)...

Startup typically takes 20–60 seconds, depending on Ollama model loading time.
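Since startup is not instant, scripts that run right after `docker-compose up` can poll instead of sleeping blindly. A small sketch — `wait_for` is a hypothetical helper, and the 90-second ceiling is an assumption chosen to sit above the startup time noted above:

```shell
# wait_for CMD TIMEOUT_SECONDS — poll a command once per second until it
# succeeds or the timeout expires. Returns non-zero on timeout.
wait_for() {
  local deadline=$(( $(date +%s) + $2 ))
  until eval "$1" > /dev/null 2>&1; do
    if (( $(date +%s) >= deadline )); then
      echo "timed out waiting for: $1" >&2
      return 1
    fi
    sleep 1
  done
}

# Example (run while the engine is starting):
# wait_for "grpcurl -plaintext localhost:50052 list" 90
```

`wait_for "grpcurl -plaintext localhost:50052 list" 90` blocks until the gRPC server answers, so a deployment script can chain the smoke test directly after it.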


3. Stopping & Restarting

# Graceful stop
docker-compose down

# Hard stop (if container is unresponsive)
docker stop brunix-assistance-engine
docker rm brunix-assistance-engine

# Restart only the engine (no rebuild)
docker-compose restart brunix-engine

# Rebuild and restart (after code changes)
docker-compose up -d --build

⚠️ Restart clears all in-memory session history. All active conversations will lose context.


4. Tunnel Management

All three tunnels must be active for the engine to function. Run each in a separate terminal or as a background process.

# Tunnel 1 — Ollama (LLM + embeddings)
kubectl port-forward --address 0.0.0.0 svc/ollama-light-service 11434:11434 \
  -n brunix --kubeconfig ./kubernetes/kubeconfig.yaml

# Tunnel 2 — Elasticsearch (vector knowledge base)
kubectl port-forward --address 0.0.0.0 svc/brunix-vector-db 9200:9200 \
  -n brunix --kubeconfig ./kubernetes/kubeconfig.yaml

# Tunnel 3 — PostgreSQL (Langfuse observability)
kubectl port-forward --address 0.0.0.0 svc/brunix-postgres 5432:5432 \
  -n brunix --kubeconfig ./kubernetes/kubeconfig.yaml

Check tunnel status

# List active port-forwards
ps aux | grep "kubectl port-forward"

# Alternatively
lsof -i :11434
lsof -i :9200
lsof -i :5432
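Where `lsof` is unavailable, bash itself can probe the three ports. A sketch using bash's `/dev/tcp` pseudo-device (bash-specific, will not work under plain `sh`; `port_open` is a hypothetical helper):

```shell
# port_open PORT — succeed if something accepts connections on localhost:PORT.
# Relies on bash's /dev/tcp pseudo-device, so no lsof or netcat is needed.
port_open() {
  (exec 3<> "/dev/tcp/127.0.0.1/$1") 2>/dev/null
}

# Probe the three tunnel ports from §4:
for port in 11434 9200 5432; do
  if port_open "$port"; then
    echo "tunnel on :$port is up"
  else
    echo "tunnel on :$port is DOWN"
  fi
done
```

Note this only confirms something is listening on the port, not that the far end of the tunnel is healthy; pair it with the curl checks from §1.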

Tunnel dropped?

kubectl port-forward tunnels drop silently. Symptoms:

  • Elasticsearch: [ESEARCH] Cant Connect in engine logs
  • Ollama: requests timeout or return connection errors
  • Langfuse: tracing data stops appearing in the dashboard

Fix: Re-run the affected tunnel command. The engine will reconnect automatically on the next request.


5. Incident Playbooks

5.1 Engine fails to start

Symptom: docker-compose up exits immediately, or container restarts in a loop.

Diagnosis:

docker logs brunix-assistance-engine 2>&1 | head -50

Common causes and fixes:

| Log message | Cause | Fix |
| --- | --- | --- |
| Cannot connect to Ollama | Ollama tunnel not running | Start Tunnel 1 |
| model 'qwen2.5:1.5b' not found | Model not loaded in Ollama | See §5.3 |
| ELASTICSEARCH_URL not set | Missing .env | Check that .env exists and is complete |
| No module named 'brunix_pb2' | Proto stubs not generated | Run docker-compose up --build |
| Port 50051 already in use | Another instance running | docker stop brunix-assistance-engine && docker rm brunix-assistance-engine |

5.2 Elasticsearch unreachable

Symptom: Log shows [ESEARCH] Cant Connect. Queries return empty context.

Step 1 — Verify tunnel:

curl -s http://localhost:9200/_cluster/health

Step 2 — Restart tunnel if down:

kubectl port-forward --address 0.0.0.0 svc/brunix-vector-db 9200:9200 \
  -n brunix --kubeconfig ./kubernetes/kubeconfig.yaml

Step 3 — Check index exists:

curl -s http://localhost:9200/_cat/indices?v | grep avap

If the index is missing, the knowledge base has not been ingested. Run:

cd scripts/pipelines/flows/
python elasticsearch_ingestion.py

Step 4 — Verify authentication:
If your cluster uses authentication, confirm ELASTICSEARCH_USER + ELASTICSEARCH_PASSWORD or ELASTICSEARCH_API_KEY are set in .env.


5.3 Ollama unreachable / model not found

Symptom: Engine logs show connection errors to http://host.docker.internal:11434, or validate_model_on_init=True raises a model-not-found error on startup.

Step 1 — Verify Ollama tunnel is active:

curl -s http://localhost:11434/api/tags

Step 2 — List available models:

curl -s http://localhost:11434/api/tags | python3 -c "
import json, sys
data = json.load(sys.stdin)
for m in data.get('models', []):
    print(m['name'])
"

Step 3 — Pull missing models if needed:

# On the Devaron cluster (via kubectl exec or direct access):
ollama pull qwen2.5:1.5b
ollama pull qwen3-0.6B-emb:latest

Step 4 — Restart engine after models are available:

docker-compose restart brunix-engine

5.4 AskAgent returns [ENG] Error

Symptom: Client receives {"text": "[ENG] Error: ...", "is_final": true}.

Diagnosis:

docker logs brunix-assistance-engine 2>&1 | grep -A 10 "Error"

| Error substring | Cause | Fix |
| --- | --- | --- |
| Connection refused to 11434 | Ollama tunnel down | Restart Tunnel 1 |
| Connection refused to 9200 | ES tunnel down | Restart Tunnel 2 |
| Index not found | ES index missing | Run ingestion pipeline |
| context length exceeded | Query + history too long for model | Reduce session history or use a larger-context model |
| Traceback / KeyError | Code bug | Check full traceback, open a GitHub Issue |

5.5 EvaluateRAG returns ANTHROPIC_API_KEY error

Symptom: EvalResponse.status = "ANTHROPIC_API_KEY no configurada en .env" (i.e. "ANTHROPIC_API_KEY not configured in .env").

Fix:

  1. Add ANTHROPIC_API_KEY=sk-ant-... to your .env file.
  2. Add ANTHROPIC_MODEL=claude-sonnet-4-20250514 (optional, has default).
  3. Restart the engine: docker-compose restart brunix-engine.

5.6 Container memory / OOM

Symptom: Container is killed by the OOM killer. docker inspect brunix-assistance-engine shows OOMKilled: true.

Diagnosis:

docker stats brunix-assistance-engine

Common causes:

  • Large context window being passed to Ollama (many retrieved chunks × long document).
  • Session history growing unbounded over a long-running session.

Mitigation:

  • Set mem_limit in docker-compose.yaml:
    services:
      brunix-engine:
        mem_limit: 4g
    
  • Restart the container to clear session store.
  • Consider reducing the retrieval depth k (currently 8) in hybrid_search_native to limit context size.

5.7 Session history not persisting between requests

Expected behaviour: Sending two requests with the same session_id should maintain context.

If Turn 2 does not seem to know about Turn 1:

  1. Confirm both requests use identical session_id strings (case-sensitive, no trailing spaces).
  2. Confirm the engine was not restarted between the two requests (restart wipes session_store).
  3. Check logs for [AskAgentStream] conversation: N previous messages. — if N=0 on Turn 2, the session was not found.
  4. Confirm the stream for Turn 1 was fully consumed (client read all messages including is_final=true) — the engine only persists history after the stream ends.
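The checklist above can be exercised with a two-turn smoke test. A sketch using the grpcurl request shape from §7 — `mk_payload` and `ask` are hypothetical helpers, and the log line to look for is the one from step 3:

```shell
# session_smoke.sh — send two turns with the same session_id, then check
# whether Turn 2 loaded Turn 1 from the session store.

SESSION="smoke-$(date +%s)"   # fresh id, so stale history cannot mask a bug

# mk_payload QUERY SESSION_ID — build the AskAgent request JSON.
mk_payload() {
  printf '{"query": "%s", "session_id": "%s"}' "$1" "$2"
}

# ask QUERY — one non-streaming AskAgent call against the local engine.
ask() {
  grpcurl -plaintext -d "$(mk_payload "$1" "$SESSION")" \
    localhost:50052 brunix.AssistanceEngine/AskAgent
}

if command -v grpcurl > /dev/null; then
  ask "My name is Alice."    || true   # tolerate failure when engine is down
  ask "What is my name?"     || true
  # Turn 2 should log a non-zero previous-message count:
  docker logs brunix-assistance-engine 2>&1 | grep "conversation:" | tail -2 || true
else
  echo "grpcurl not installed; skipping live calls" >&2
fi
```

If the second logged count is 0, work through the four checklist items above in order.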

6. Log Reference

| Log prefix | Module | What it means |
| --- | --- | --- |
| [ESEARCH] Connected | server.py | Elasticsearch OK on startup |
| [ESEARCH] Cant Connect | server.py | Elasticsearch unreachable on startup |
| [ENGINE] listen on 50051 | server.py | gRPC server ready |
| [AskAgent] session=... query=... | server.py | New non-streaming request |
| [AskAgent] conversation: N messages | server.py | History loaded for session |
| [AskAgentStream] done — chunks=N | server.py | Stream completed, history saved |
| [classify] raw=... -> TYPE | graph.py | Query classification result |
| [reformulate] -> '...' | graph.py | Reformulated query |
| [hybrid] BM25 -> N hits | graph.py | BM25 retrieval result |
| [hybrid] kNN -> N hits | graph.py | kNN retrieval result |
| [hybrid] RRF -> N final docs | graph.py | After RRF fusion |
| [retrieve] N docs, context len=X | graph.py | Context assembled |
| [generate] X chars | graph.py | Non-streaming answer generated |
| [eval] Iniciando: N preguntas | evaluate.py | Evaluation started |
| [eval] Completado — global=X | evaluate.py | Evaluation finished |

7. Useful Commands

# Real-time log streaming
docker logs -f brunix-assistance-engine

# Filter for errors only
docker logs brunix-assistance-engine 2>&1 | grep -i error

# Check container resource usage
docker stats brunix-assistance-engine --no-stream

# Enter container for debugging
docker exec -it brunix-assistance-engine /bin/bash

# Send a test query
grpcurl -plaintext \
  -d '{"query": "What is AVAP?", "session_id": "test"}' \
  localhost:50052 brunix.AssistanceEngine/AskAgent

# Check ES index document count
curl -s "http://localhost:9200/avap-docs-test/_count" | python3 -m json.tool

# Check ES index mapping
curl -s "http://localhost:9200/avap-docs-test/_mapping" | python3 -m json.tool

# List active containers
docker ps --filter name=brunix

# Check port bindings
docker port brunix-assistance-engine

8. Escalation Path

| Severity | Condition | Action |
| --- | --- | --- |
| P1 | Engine completely down, not recoverable in 15 min | Notify via Slack #brunix-incidents immediately. Tag CTO. |
| P2 | Degraded quality (bad answers) or evaluation score drops below 0.60 | Open GitHub Issue with full log output and evaluation report. |
| P3 | Tunnel instability, intermittent errors | Report in daily standup. Document in GitHub Issue within 24h. |
| P4 | Documentation gap or non-critical config issue | Open GitHub Issue with label documentation or improvement. |

For all P1/P2 incidents, the GitHub Issue must include:

  1. Exact command that triggered the failure
  2. Full terminal output / error log
  3. Status of all three kubectl tunnels at the time of failure
  4. Docker container status (docker inspect brunix-assistance-engine)
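Items 2–4 of that list can be gathered automatically while the incident is still live. A sketch — `collect_incident.sh` and `collect` are hypothetical; item 1, the exact triggering command, still has to be added by hand:

```shell
# collect_incident.sh — gather logs, tunnel status, and container status for
# a P1/P2 GitHub Issue. Error output is captured into the report rather than
# aborting, so partial environments still produce a file.

# collect [OUTFILE] — write the report and print its path.
collect() {
  local out="${1:-incident-$(date +%Y%m%d-%H%M%S).txt}"
  {
    echo "== Container status =="
    docker inspect brunix-assistance-engine 2>&1 || echo "unavailable"
    echo "== Last 200 log lines =="
    docker logs --tail 200 brunix-assistance-engine 2>&1 || echo "unavailable"
    echo "== Tunnel status =="
    for port in 11434 9200 5432; do
      echo "--- port $port ---"
      lsof -i ":$port" 2>&1 || echo "no listener on :$port"
    done
  } > "$out"
  echo "$out"
}

collect "$@"
```

Attach the resulting file to the Issue as-is; "unavailable" entries are themselves useful evidence of what was down at the time.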