# Brunix Assistance Engine — Architecture Reference

> **Audience:** Engineers contributing to this repository, architects reviewing the system design, and operators responsible for its deployment.
> **Last updated:** 2026-03-18
> **Version:** 1.5.x

---

## Table of Contents

1. [System Overview](#1-system-overview)
2. [Component Inventory](#2-component-inventory)
3. [Request Lifecycle](#3-request-lifecycle)
4. [LangGraph Workflow](#4-langgraph-workflow)
5. [RAG Pipeline — Hybrid Search](#5-rag-pipeline--hybrid-search)
6. [Streaming Architecture (AskAgentStream)](#6-streaming-architecture-askagentstream)
7. [Evaluation Pipeline (EvaluateRAG)](#7-evaluation-pipeline-evaluaterag)
8. [Data Ingestion Pipeline](#8-data-ingestion-pipeline)
9. [Infrastructure Layout](#9-infrastructure-layout)
10. [Session State & Conversation Memory](#10-session-state--conversation-memory)
11. [Observability Stack](#11-observability-stack)
12. [Security Boundaries](#12-security-boundaries)
13. [Known Limitations & Future Work](#13-known-limitations--future-work)

---

## 1. System Overview

The **Brunix Assistance Engine** is a stateful, streaming-capable AI service that answers questions about the AVAP programming language. It combines:

- **gRPC** as the primary communication interface (port `50051` inside the container, `50052` on the host)
- **LangGraph** for deterministic, multi-step agentic orchestration
- **Hybrid RAG** (BM25 + kNN with RRF fusion) over an Elasticsearch vector index
- **Ollama** as the local LLM and embedding backend
- **RAGAS + Claude** as the automated evaluation judge

A secondary **OpenAI-compatible HTTP proxy** (port `8000`) is served via FastAPI/Uvicorn, enabling integration with tools that expect the OpenAI API format.
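For tools that speak the OpenAI API, a request to the proxy can be sketched with the standard library alone. This is a minimal sketch: the model name `"brunix"` and the use of the `user` field to carry the Brunix `session_id` are assumptions for illustration, not confirmed proxy behavior.

```python
import json
from urllib import request

def build_chat_request(query: str, session_id: str) -> dict:
    """Build an OpenAI-style /v1/chat/completions payload (sketch)."""
    return {
        "model": "brunix",   # illustrative model name; the proxy may ignore it
        "messages": [{"role": "user", "content": query}],
        "stream": True,      # the proxy wraps AskAgentStream, so streaming fits
        "user": session_id,  # assumed carrier for the Brunix session_id
    }

def send_chat_request(payload: dict, base_url: str = "http://localhost:8000"):
    """POST the payload to the proxy (requires the container to be running)."""
    req = request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    return request.urlopen(req)
```

Any OpenAI-compatible SDK pointed at `http://localhost:8000/v1` should produce an equivalent request.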
```
┌─────────────────────────────────────────────────────────────┐
│                      External Clients                       │
│    grpcurl / App SDK        │   OpenAI-compatible client    │
└────────────┬────────────────┴──────────────┬────────────────┘
             │ gRPC :50052                   │ HTTP :8000
             ▼                               ▼
┌────────────────────────────────────────────────────────────┐
│                      Docker Container                      │
│                                                            │
│  ┌─────────────────────┐    ┌──────────────────────────┐   │
│  │  server.py (gRPC)   │    │  openai_proxy.py (HTTP)  │   │
│  │  BrunixEngine       │    │  FastAPI / Uvicorn       │   │
│  └──────────┬──────────┘    └──────────────────────────┘   │
│             │                                              │
│  ┌──────────▼──────────────────────────────────────────┐   │
│  │              LangGraph Orchestration                │   │
│  │    classify → reformulate → retrieve → generate     │   │
│  └──────────────────────────┬──────────────────────────┘   │
│                             │                              │
│          ┌──────────────────┼────────────────────┐         │
│          ▼                  ▼                    ▼         │
│    Ollama (LLM)       Ollama (Embed)       Elasticsearch   │
│    via tunnel         via tunnel           via tunnel      │
└────────────────────────────────────────────────────────────┘
              │   kubectl port-forward tunnels   │
              ▼                                  ▼
         Devaron Cluster (Vultr Kubernetes)
           ollama-light-service:11434
           brunix-vector-db:9200
           brunix-postgres:5432
           Langfuse UI
```

---

## 2. Component Inventory

| Component | File / Service | Responsibility |
|---|---|---|
| **gRPC Server** | `Docker/src/server.py` | Entry point. Implements the `AssistanceEngine` servicer. Initializes LLM, embeddings, ES client, and both graphs. |
| **Full Graph** | `Docker/src/graph.py` → `build_graph()` | Complete workflow: classify → reformulate → retrieve → generate. Used by `AskAgent` and `EvaluateRAG`. |
| **Prepare Graph** | `Docker/src/graph.py` → `build_prepare_graph()` | Partial workflow: classify → reformulate → retrieve. Does **not** call the LLM for generation. Used by `AskAgentStream` to enable manual token streaming. |
| **Message Builder** | `Docker/src/graph.py` → `build_final_messages()` | Reconstructs the final prompt list from prepared state for `llm.stream()`. |
| **Prompt Library** | `Docker/src/prompts.py` | Centralized definitions for `CLASSIFY`, `REFORMULATE`, `GENERATE`, `CODE_GENERATION`, and `CONVERSATIONAL` prompts. |
| **Agent State** | `Docker/src/state.py` | `AgentState` TypedDict shared across all graph nodes. |
| **Evaluation Suite** | `Docker/src/evaluate.py` | RAGAS-based pipeline. Uses the production retriever + Ollama LLM for generation, and Claude as the impartial judge. |
| **OpenAI Proxy** | `Docker/src/openai_proxy.py` | FastAPI application that wraps `AskAgentStream` under a `/v1/chat/completions` endpoint. |
| **LLM Factory** | `Docker/src/utils/llm_factory.py` | Provider-agnostic factory for chat models (Ollama, AWS Bedrock). |
| **Embedding Factory** | `Docker/src/utils/emb_factory.py` | Provider-agnostic factory for embedding models (Ollama, HuggingFace). |
| **Ingestion Pipeline** | `scripts/pipelines/flows/elasticsearch_ingestion.py` | Chunks and ingests AVAP documents into Elasticsearch with embeddings. |
| **Dataset Generator** | `scripts/pipelines/flows/generate_mbap.py` | Generates synthetic MBPP-style AVAP problems using Claude. |
| **MBPP Translator** | `scripts/pipelines/flows/translate_mbpp.py` | Translates the MBPP Python dataset into AVAP equivalents. |
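The state shared across graph nodes can be sketched as a `TypedDict`. Only fields named elsewhere in this reference are included; the field types, `total=False`, and any omitted fields are assumptions about the real `Docker/src/state.py`.

```python
from typing import TypedDict

class AgentState(TypedDict, total=False):
    """Sketch of the state dict passed between LangGraph nodes.

    Fields are inferred from this document; the real state.py may differ.
    """
    messages: list            # conversation history + current user message
    query_type: str           # RETRIEVAL | CODE_GENERATION | CONVERSATIONAL
    reformulated_query: str   # keyword-optimized query from the reformulate node
    context: str              # formatted chunks returned by the retrieve node
```

Each node reads the fields it needs and returns a partial update, which LangGraph merges back into the state.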
---

## 3. Request Lifecycle

### 3.1 `AskAgent` (non-streaming)

```
Client → gRPC AgentRequest{query, session_id}
  │
  ├─ Load conversation history from session_store[session_id]
  ├─ Build initial_state = {messages: history + [user_msg], ...}
  │
  └─ graph.invoke(initial_state)
       ├─ classify    → query_type ∈ {RETRIEVAL, CODE_GENERATION, CONVERSATIONAL}
       ├─ reformulate → reformulated_query (keyword-optimized for semantic search)
       ├─ retrieve    → context (top-8 hybrid RRF chunks from Elasticsearch)
       └─ generate    → final AIMessage (llm.invoke)
  │
  ├─ Persist updated history to session_store[session_id]
  └─ yield AgentResponse{text, avap_code="AVAP-2026", is_final=True}
```

### 3.2 `AskAgentStream` (token streaming)

```
Client → gRPC AgentRequest{query, session_id}
  │
  ├─ Load history from session_store[session_id]
  ├─ Build initial_state
  │
  ├─ prepare_graph.invoke(initial_state)     ← Phase 1: no LLM generation
  │    ├─ classify
  │    ├─ reformulate
  │    └─ retrieve (or skip_retrieve if CONVERSATIONAL)
  │
  ├─ build_final_messages(prepared_state)    ← Reconstruct prompt list
  │
  └─ for chunk in llm.stream(final_messages):
       └─ yield AgentResponse{text=token, is_final=False}
  │
  ├─ Persist full assembled response to session_store
  └─ yield AgentResponse{text="", is_final=True}
```

### 3.3 `EvaluateRAG`

```
Client → gRPC EvalRequest{category?, limit?, index?}
  │
  └─ evaluate.run_evaluation(...)
       ├─ Load golden_dataset.json
       ├─ Filter by category / limit
       ├─ For each question:
       │    ├─ retrieve_context (hybrid BM25+kNN, same as production)
       │    └─ generate_answer (Ollama LLM + GENERATE_PROMPT)
       ├─ Build RAGAS Dataset
       ├─ Run RAGAS metrics with Claude as judge:
       │      faithfulness / answer_relevancy / context_recall / context_precision
       └─ Compute global_score + verdict (EXCELLENT / ACCEPTABLE / INSUFFICIENT)
  │
  └─ return EvalResponse{scores, global_score, verdict, details[]}
```
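The `AskAgent` lifecycle can be condensed into a runnable sketch with a stub standing in for the compiled graph. The stub's behavior, the tuple message format, and the canned answer are all illustrative; the real service uses LangChain message objects and `llm.invoke`.

```python
from collections import defaultdict

# In-memory history keyed by session_id, as described in section 10.
session_store: dict[str, list] = defaultdict(list)

class StubGraph:
    """Stands in for the compiled LangGraph; appends a canned AI turn."""
    def invoke(self, state: dict) -> dict:
        state["messages"] = state["messages"] + [("ai", "addVar creates a variable.")]
        return state

graph = StubGraph()

def ask_agent(query: str, session_id: str) -> dict:
    # Load history and append the new user turn.
    history = session_store[session_id]
    initial_state = {"messages": history + [("user", query)]}
    # Run the full graph (classify → reformulate → retrieve → generate).
    final_state = graph.invoke(initial_state)
    # Persist the updated history so the next turn sees this exchange.
    session_store[session_id] = final_state["messages"]
    answer = final_state["messages"][-1][1]
    return {"text": answer, "avap_code": "AVAP-2026", "is_final": True}
```

The key invariant is that the store is updated only after a successful graph run, so a failed request does not pollute the session history.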
---

## 4. LangGraph Workflow

### 4.1 Full Graph (`build_graph`)

```
                 ┌─────────────┐
                 │  classify   │
                 └──────┬──────┘
                        │
       ┌────────────────┼──────────────────┐
       ▼                ▼                  ▼
   RETRIEVAL     CODE_GENERATION    CONVERSATIONAL
       │                │                  │
       └────────┬───────┘                  │
                ▼                          ▼
        ┌──────────────┐      ┌────────────────────────┐
        │ reformulate  │      │ respond_conversational │
        └──────┬───────┘      └───────────┬────────────┘
               ▼                          │
        ┌──────────────┐                  │
        │   retrieve   │                  │
        └──────┬───────┘                  │
               │                          │
      ┌────────┴────────┐                 │
      ▼                 ▼                 │
 ┌──────────┐   ┌───────────────┐         │
 │ generate │   │ generate_code │         │
 └────┬─────┘   └───────┬───────┘         │
      │                 │                 │
      └─────────────────┴─────────────────┘
                        │
                       END
```

### 4.2 Prepare Graph (`build_prepare_graph`)

Identical routing for classify, but the generation nodes are replaced by `END`. The `CONVERSATIONAL` branch uses `skip_retrieve` (returns empty context without querying Elasticsearch).

### 4.3 Query Type Routing

| `query_type` | Triggers retrieve? | Generation prompt |
|---|---|---|
| `RETRIEVAL` | Yes | `GENERATE_PROMPT` (explanation-focused) |
| `CODE_GENERATION` | Yes | `CODE_GENERATION_PROMPT` (code-focused, returns AVAP blocks) |
| `CONVERSATIONAL` | No | `CONVERSATIONAL_PROMPT` (reformulation of prior answer) |

---

## 5. RAG Pipeline — Hybrid Search

The retrieval system (`hybrid_search_native`) fuses BM25 lexical search and kNN dense vector search using **Reciprocal Rank Fusion (RRF)**.

```
User query
  │
  ├─ embeddings.embed_query(query) → query_vector [768-dim]
  │
  ├─ ES multi_match (BM25) on fields [content^2, text^2]
  │    └─ top-k BM25 hits
  │
  └─ ES knn on field [embedding], num_candidates = k×5
       └─ top-k kNN hits
  │
  ├─ RRF fusion: score(doc) = Σ 1/(rank + 60)
  │
  └─ Top-8 documents → format_context() → context string
```

**RRF constant:** `60` (standard value; prevents high-rank documents from dominating while still rewarding consensus between both retrieval modes).
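The RRF step can be shown in isolation. This sketch operates on plain document-ID lists rather than Elasticsearch hit objects; `top_n=8` mirrors the top-8 cut described above, and `k=60` is the RRF constant.

```python
def rrf_fuse(bm25_ids: list[str], knn_ids: list[str],
             k: int = 60, top_n: int = 8) -> list[str]:
    """Reciprocal Rank Fusion: score(doc) = Σ over lists of 1/(rank + k)."""
    scores: dict[str, float] = {}
    for ranked in (bm25_ids, knn_ids):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank + k)
    # Highest fused score first; documents found by both lists rise to the top.
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

Because `k` dominates the denominator, a document ranked first in both lists only modestly outscores one ranked first in a single list, which is exactly the consensus-rewarding behavior described above.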
**Chunk metadata** attached to each retrieved document:

| Field | Description |
|---|---|
| `chunk_id` | Unique identifier within the index |
| `source_file` | Origin document filename |
| `doc_type` | `prose`, `code`, `code_example`, `bnf` |
| `block_type` | AVAP block type: `function`, `if`, `startLoop`, `try` |
| `section` | Document section/chapter heading |

Documents of type `code`, `code_example`, `bnf`, or block type `function / if / startLoop / try` are tagged as `[AVAP CODE]` in the formatted context, signaling the LLM to treat them as executable syntax rather than prose.

---

## 6. Streaming Architecture (AskAgentStream)

Understanding the two-phase streaming design is critical:

**Why not stream through LangGraph?** LangGraph's `stream()` method yields full state snapshots per node, not individual tokens. To achieve true per-token streaming to the gRPC client, the generation step is deliberately extracted from the graph and called directly via `llm.stream()`.

**Phase 1 — Deterministic preparation (graph-managed):**

- Classification, query reformulation, and retrieval run through `prepare_graph.invoke()`.
- This phase runs synchronously and produces the complete context before any token is emitted to the client.

**Phase 2 — Token streaming (manual):**

- `build_final_messages()` reconstructs the exact prompt that `generate` / `generate_code` / `respond_conversational` would have used.
- `llm.stream(final_messages)` yields one `AIMessageChunk` per token from Ollama.
- Each token is immediately forwarded to the gRPC client as `AgentResponse{text=token, is_final=False}`.
- After the stream ends, the full assembled text is persisted to `session_store`.

**Backpressure:** gRPC streaming is flow-controlled by the client. If the client stops reading, the Ollama token stream will block at the `yield` point. No explicit buffer overflow protection is implemented (acceptable for the current single-client dev mode).
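Phase 2 can be sketched as a generator over a stubbed token stream. The stub LLM and the `full_text` field on the final marker are illustrative (the real service persists the assembled text to `session_store` rather than returning it).

```python
class StubLLM:
    """Stands in for the Ollama chat model; yields pre-split tokens."""
    def stream(self, messages):
        for token in ["addVar ", "declares ", "a ", "variable."]:
            yield token

def ask_agent_stream(final_messages, llm=StubLLM()):
    """Phase 2 of AskAgentStream: forward each token, then a final marker."""
    assembled = []
    for token in llm.stream(final_messages):
        assembled.append(token)
        # One gRPC AgentResponse per token; the client renders incrementally.
        yield {"text": token, "is_final": False}
    # In the real service the assembled text is persisted to session_store here;
    # returning it on the final marker is purely for this sketch.
    yield {"text": "", "is_final": True, "full_text": "".join(assembled)}
```

Because this is a generator, backpressure falls out naturally: if the consumer stops iterating, the loop blocks at the `yield`, exactly as described above.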
---

## 7. Evaluation Pipeline (EvaluateRAG)

The evaluation suite implements an **offline RAG evaluation** pattern using RAGAS metrics.

### Judge model separation

The production LLM (Ollama `qwen2.5:1.5b`) is used for **answer generation** — the same pipeline as production to measure real-world quality. Claude (`claude-sonnet-4-20250514`) is used as the **evaluation judge** — an independent, high-capability model that scores the generated answers against ground truth.

### RAGAS metrics

| Metric | Measures | Input |
|---|---|---|
| `faithfulness` | Are claims in the answer supported by the retrieved context? | answer + contexts |
| `answer_relevancy` | Is the answer relevant to the question? | answer + question |
| `context_recall` | Does the retrieved context cover the ground truth? | contexts + ground_truth |
| `context_precision` | Are the retrieved chunks useful (signal-to-noise)? | contexts + ground_truth |

### Global score & verdict

```
global_score = mean(non-zero metric scores)

verdict:  ≥ 0.80 → EXCELLENT
          ≥ 0.60 → ACCEPTABLE
          < 0.60 → INSUFFICIENT
```

### Golden dataset

Located at `Docker/src/golden_dataset.json`. Each entry follows this schema:

```json
{
  "id": "avap-001",
  "category": "core_syntax",
  "question": "How do you declare a variable in AVAP?",
  "ground_truth": "Use addVar to declare a variable..."
}
```
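The scoring rule above is simple enough to state as code. This is a sketch: how zero scores arise (metric failure vs. genuine zero) is an implementation detail of `evaluate.py` not covered here.

```python
def global_score_and_verdict(scores: dict[str, float]) -> tuple[float, str]:
    """Mean of the non-zero metric scores, mapped to a verdict band."""
    non_zero = [v for v in scores.values() if v > 0]
    score = sum(non_zero) / len(non_zero) if non_zero else 0.0
    if score >= 0.80:
        verdict = "EXCELLENT"
    elif score >= 0.60:
        verdict = "ACCEPTABLE"
    else:
        verdict = "INSUFFICIENT"
    return score, verdict
```

Excluding zeros keeps one failed metric (e.g. a judge timeout scored as 0) from dragging down an otherwise healthy run, at the cost of silently shrinking the denominator.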
---

## 8. Data Ingestion Pipeline

Documents flow into the Elasticsearch index through two paths:

### Path A — AVAP documentation (structured markdown)

```
docs/LRM/avap.md
docs/avap_language_github_docs/*.md
docs/developer.avapframework.com/*.md
  │
  ▼
scripts/pipelines/flows/elasticsearch_ingestion.py
  │
  ├─ Load markdown files
  ├─ Chunk using scripts/pipelines/tasks/chunk.py
  │    (semantic chunking via Chonkie library)
  ├─ Generate embeddings via scripts/pipelines/tasks/embeddings.py
  │    (Ollama or HuggingFace embedding model)
  └─ Bulk index into Elasticsearch
       index: avap-docs-* (configurable via ELASTICSEARCH_INDEX)
       mapping: {content, embedding, source_file, doc_type, section, ...}
```

### Path B — Synthetic AVAP code samples

```
docs/samples/*.avap
  │
  ▼
scripts/pipelines/flows/generate_mbap.py
  │
  ├─ Read AVAP LRM (docs/LRM/avap.md)
  ├─ Call Claude API to generate MBPP-style problems
  └─ Output synthetic_datasets/mbpp_avap.json
       (used for fine-tuning and few-shot examples)
```

---

## 9. Infrastructure Layout

### Devaron Cluster (Vultr Kubernetes)

| Service | K8s Name | Port | Purpose |
|---|---|---|---|
| LLM inference | `ollama-light-service` | `11434` | Text generation + embeddings |
| Vector database | `brunix-vector-db` | `9200` | Elasticsearch 8.x |
| Observability DB | `brunix-postgres` | `5432` | PostgreSQL for Langfuse |
| Langfuse UI | — | `80` | `http://45.77.119.180` |

### Kubernetes tunnel commands

```bash
# Terminal 1 — LLM
kubectl port-forward --address 0.0.0.0 svc/ollama-light-service 11434:11434 \
  -n brunix --kubeconfig ./kubernetes/kubeconfig.yaml

# Terminal 2 — Elasticsearch
kubectl port-forward --address 0.0.0.0 svc/brunix-vector-db 9200:9200 \
  -n brunix --kubeconfig ./kubernetes/kubeconfig.yaml

# Terminal 3 — PostgreSQL (Langfuse)
kubectl port-forward --address 0.0.0.0 svc/brunix-postgres 5432:5432 \
  -n brunix --kubeconfig ./kubernetes/kubeconfig.yaml
```

### Port map summary

| Port | Protocol | Service | Scope |
|---|---|---|---|
| `50051` | gRPC | Brunix Engine (inside container) | Internal |
| `50052` | gRPC | Brunix Engine (host-mapped) | External |
| `8000` | HTTP | OpenAI proxy | External |
| `11434` | HTTP | Ollama (via tunnel) | Tunnel |
| `9200` | HTTP | Elasticsearch (via tunnel) | Tunnel |
| `5432` | TCP | PostgreSQL/Langfuse (via tunnel) | Tunnel |

---

## 10. Session State & Conversation Memory

Conversation history is managed via an in-process dictionary:

```python
session_store: dict[str, list] = defaultdict(list)
# key:   session_id (string, provided by client)
# value: list of LangChain BaseMessage objects
```

**Characteristics:**

- **In-memory only.** History is lost on container restart.
- **No TTL or eviction.** Sessions grow unbounded for the lifetime of the process.
- **Thread safety:** Python's GIL provides basic safety for the `ThreadPoolExecutor(max_workers=10)` gRPC server, but concurrent writes to the same `session_id` from two simultaneous requests are not explicitly protected.
- **History window:** `format_history_for_classify()` uses only the last 6 messages for query classification to keep the classify prompt short and deterministic.

> **Future work:** Replace `session_store` with a Redis-backed persistent store to survive restarts and support horizontal scaling.

---

## 11. Observability Stack

### Langfuse tracing

The server integrates Langfuse for end-to-end LLM tracing. Every `AskAgent` / `AskAgentStream` request creates a trace that captures:

- Input query and session ID
- Each LangGraph node execution (classify, reformulate, retrieve, generate)
- LLM token counts, latency, and cost
- Final response

**Access:** `http://45.77.119.180` — requires a project API key configured via `LANGFUSE_PUBLIC_KEY` and `LANGFUSE_SECRET_KEY`.

### Logging

Structured logging via Python's `logging` module, configured at `INFO` level.
Log format:

```
[MODULE] context_info — key=value key=value
```

Key log markers:

| Marker | Module | Meaning |
|---|---|---|
| `[ESEARCH]` | `server.py` | Elasticsearch connection status |
| `[classify]` | `graph.py` | Query type decision + raw LLM output |
| `[reformulate]` | `graph.py` | Reformulated query string |
| `[hybrid]` | `graph.py` | BM25 / kNN hit counts and RRF result count |
| `[retrieve]` | `graph.py` | Number of docs retrieved and context length |
| `[generate]` | `graph.py` | Response character count |
| `[AskAgentStream]` | `server.py` | Token count and total chars per stream |
| `[eval]` | `evaluate.py` | Per-question retrieval and generation status |

---

## 12. Security Boundaries

| Boundary | Current state | Risk |
|---|---|---|
| gRPC transport | **Insecure** (`add_insecure_port`) | Network interception possible. Acceptable in dev/tunnel setup; requires mTLS for production. |
| Elasticsearch auth | Optional (user/pass or API key via env vars) | Index is accessible without auth if `ELASTICSEARCH_USER` and `ELASTICSEARCH_API_KEY` are unset. |
| Container user | Non-root (`python:3.11-slim` default) | Low risk. Do not override with `root`. |
| Secrets in env | Via `.env` / `docker-compose` env injection | Never commit real values. See [CONTRIBUTING.md](../CONTRIBUTING.md#6-environment-variables-policy). |
| Session store | In-memory, no auth | Any caller with access to the gRPC port can read/write any session by guessing its ID. |
| Kubeconfig | `./kubernetes/kubeconfig.yaml` (local only) | Grants cluster access. Never commit. Listed in `.gitignore`. |
---

## 13. Known Limitations & Future Work

| Area | Limitation | Proposed solution |
|---|---|---|
| Session persistence | In-memory, lost on restart | Redis-backed `session_store` |
| Horizontal scaling | `session_store` is per-process | Sticky sessions or external session store |
| gRPC security | Insecure port | Add TLS + optional mTLS |
| Elasticsearch auth | Not enforced if vars unset | Make auth required; fail-fast on startup |
| Context window | Full history passed to generate; no truncation | Sliding window or summarization for long sessions |
| Evaluation | Golden dataset must be manually maintained | Automated golden dataset refresh pipeline |
| Rate limiting | None on gRPC server | Add interceptor-based rate limiter |
| Health check | No gRPC health protocol | Implement `grpc.health.v1` |
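As one possible shape for the context-window mitigation listed above, a sliding window that pins the first message (assumed here to be a system prompt) might look like the following sketch; the window size and the pinning rule are assumptions, not existing behavior.

```python
def truncate_history(messages: list, max_messages: int = 20) -> list:
    """Keep the first message (assumed system prompt) plus the most recent turns.

    Sketch of the 'sliding window' proposal; summarization of the dropped
    middle would be a further refinement.
    """
    if len(messages) <= max_messages:
        return messages
    head = messages[:1]                       # pinned system prompt
    tail = messages[-(max_messages - 1):]     # most recent turns
    return head + tail
```

Applied just before `generate`, this bounds the prompt size per request while leaving `session_store` itself untouched.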