# Brunix Assistance Engine — Architecture Reference

> **Audience:** Engineers contributing to this repository, architects reviewing the system design, and operators responsible for its deployment.
> **Last updated:** 2026-03-18
> **Version:** 1.5.x

---

## Table of Contents

1. [System Overview](#1-system-overview)
2. [Component Inventory](#2-component-inventory)
3. [Request Lifecycle](#3-request-lifecycle)
4. [LangGraph Workflow](#4-langgraph-workflow)
5. [RAG Pipeline — Hybrid Search](#5-rag-pipeline--hybrid-search)
6. [Streaming Architecture (AskAgentStream)](#6-streaming-architecture-askagentstream)
7. [Evaluation Pipeline (EvaluateRAG)](#7-evaluation-pipeline-evaluaterag)
8. [Data Ingestion Pipeline](#8-data-ingestion-pipeline)
9. [Infrastructure Layout](#9-infrastructure-layout)
10. [Session State & Conversation Memory](#10-session-state--conversation-memory)
11. [Observability Stack](#11-observability-stack)
12. [Security Boundaries](#12-security-boundaries)
13. [Known Limitations & Future Work](#13-known-limitations--future-work)

---

## 1. System Overview

The **Brunix Assistance Engine** is a stateful, streaming-capable AI service that answers questions about the AVAP programming language. It combines:

- **gRPC** as the primary communication interface (port `50051` inside container, `50052` on host)
- **LangGraph** for deterministic, multi-step agentic orchestration
- **Hybrid RAG** (BM25 + kNN with RRF fusion) over an Elasticsearch vector index
- **Ollama** as the local LLM and embedding backend
- **RAGAS + Claude** as the automated evaluation judge

A secondary **OpenAI-compatible HTTP proxy** (port `8000`) is served via FastAPI/Uvicorn, enabling integration with tools that expect the OpenAI API format.

```
┌─────────────────────────────────────────────────────────────┐
│ External Clients │
│ grpcurl / App SDK │ OpenAI-compatible client │
└────────────┬────────────────┴──────────────┬────────────────┘
│ gRPC :50052 │ HTTP :8000
▼ ▼
┌────────────────────────────────────────────────────────────┐
│ Docker Container │
│ │
│ ┌─────────────────────┐ ┌──────────────────────────┐ │
│ │ server.py (gRPC) │ │ openai_proxy.py (HTTP) │ │
│ │ BrunixEngine │ │ FastAPI / Uvicorn │ │
│ └──────────┬──────────┘ └──────────────────────────┘ │
│ │ │
│ ┌──────────▼──────────────────────────────────────────┐ │
│ │ LangGraph Orchestration │ │
│ │ classify → reformulate → retrieve → generate │ │
│ └──────────────────────────┬───────────────────────────┘ │
│ │ │
│ ┌───────────────────┼────────────────────┐ │
│ ▼ ▼ ▼ │
│ Ollama (LLM) Ollama (Embed) Elasticsearch │
│ via tunnel via tunnel via tunnel │
└────────────────────────────────────────────────────────────┘
│ kubectl port-forward tunnels │
▼ ▼
Devaron Cluster (Vultr Kubernetes)
ollama-light-service:11434 brunix-vector-db:9200
brunix-postgres:5432 Langfuse UI
```

---

## 2. Component Inventory

| Component | File / Service | Responsibility |
|---|---|---|
| **gRPC Server** | `Docker/src/server.py` | Entry point. Implements the `AssistanceEngine` servicer. Initializes LLM, embeddings, ES client, and both graphs. |
| **Full Graph** | `Docker/src/graph.py` → `build_graph()` | Complete workflow: classify → reformulate → retrieve → generate. Used by `AskAgent` and `EvaluateRAG`. |
| **Prepare Graph** | `Docker/src/graph.py` → `build_prepare_graph()` | Partial workflow: classify → reformulate → retrieve. Does **not** call the LLM for generation. Used by `AskAgentStream` to enable manual token streaming. |
| **Message Builder** | `Docker/src/graph.py` → `build_final_messages()` | Reconstructs the final prompt list from prepared state for `llm.stream()`. |
| **Prompt Library** | `Docker/src/prompts.py` | Centralized definitions for `CLASSIFY`, `REFORMULATE`, `GENERATE`, `CODE_GENERATION`, and `CONVERSATIONAL` prompts. |
| **Agent State** | `Docker/src/state.py` | `AgentState` TypedDict shared across all graph nodes. |
| **Evaluation Suite** | `Docker/src/evaluate.py` | RAGAS-based pipeline. Uses the production retriever + Ollama LLM for generation, and Claude as the impartial judge. |
| **OpenAI Proxy** | `Docker/src/openai_proxy.py` | FastAPI application that wraps `AskAgentStream` under a `/v1/chat/completions` endpoint. |
| **LLM Factory** | `Docker/src/utils/llm_factory.py` | Provider-agnostic factory for chat models (Ollama, AWS Bedrock). |
| **Embedding Factory** | `Docker/src/utils/emb_factory.py` | Provider-agnostic factory for embedding models (Ollama, HuggingFace). |
| **Ingestion Pipeline** | `scripts/pipelines/flows/elasticsearch_ingestion.py` | Chunks and ingests AVAP documents into Elasticsearch with embeddings. |
| **Dataset Generator** | `scripts/pipelines/flows/generate_mbap.py` | Generates synthetic MBPP-style AVAP problems using Claude. |
| **MBPP Translator** | `scripts/pipelines/flows/translate_mbpp.py` | Translates the MBPP Python dataset into AVAP equivalents. |

---

## 3. Request Lifecycle

### 3.1 `AskAgent` (non-streaming)

```
Client → gRPC AgentRequest{query, session_id}
│
├─ Load conversation history from session_store[session_id]
├─ Build initial_state = {messages: history + [user_msg], ...}
│
└─ graph.invoke(initial_state)
├─ classify → query_type ∈ {RETRIEVAL, CODE_GENERATION, CONVERSATIONAL}
├─ reformulate → reformulated_query (keyword-optimized for semantic search)
├─ retrieve → context (top-8 hybrid RRF chunks from Elasticsearch)
└─ generate → final AIMessage (llm.invoke)
│
├─ Persist updated history to session_store[session_id]
└─ yield AgentResponse{text, avap_code="AVAP-2026", is_final=True}
```
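
The load → invoke → persist steps above can be sketched as follows. This is a hedged reconstruction, not the servicer's actual code: `handle_ask_agent` is a hypothetical name, and plain `(role, text)` tuples stand in for the LangChain `BaseMessage` objects the real `session_store` holds.

```python
from collections import defaultdict

# Illustrative in-process store, mirroring the one described in §10.
session_store: dict[str, list] = defaultdict(list)

def handle_ask_agent(query: str, session_id: str, graph_invoke) -> str:
    """Load history, run the full graph, persist the updated conversation."""
    history = session_store[session_id]                  # load prior turns
    initial_state = {"messages": history + [("user", query)]}
    final_state = graph_invoke(initial_state)            # classify→…→generate
    answer = final_state["messages"][-1][1]              # final AI message text
    session_store[session_id] = final_state["messages"]  # persist updated history
    return answer
```

With a stub `graph_invoke` that appends an assistant reply, two calls for the same `session_id` leave four messages in the store.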

### 3.2 `AskAgentStream` (token streaming)

```
Client → gRPC AgentRequest{query, session_id}
│
├─ Load history from session_store[session_id]
├─ Build initial_state
│
├─ prepare_graph.invoke(initial_state) ← Phase 1: no LLM generation
│ ├─ classify
│ ├─ reformulate
│ └─ retrieve (or skip_retrieve if CONVERSATIONAL)
│
├─ build_final_messages(prepared_state) ← Reconstruct prompt list
│
└─ for chunk in llm.stream(final_messages):
└─ yield AgentResponse{text=token, is_final=False}
│
├─ Persist full assembled response to session_store
└─ yield AgentResponse{text="", is_final=True}
```

### 3.3 `EvaluateRAG`

```
Client → gRPC EvalRequest{category?, limit?, index?}
│
└─ evaluate.run_evaluation(...)
├─ Load golden_dataset.json
├─ Filter by category / limit
├─ For each question:
│ ├─ retrieve_context (hybrid BM25+kNN, same as production)
│ └─ generate_answer (Ollama LLM + GENERATE_PROMPT)
├─ Build RAGAS Dataset
├─ Run RAGAS metrics with Claude as judge:
│ faithfulness / answer_relevancy / context_recall / context_precision
└─ Compute global_score + verdict (EXCELLENT / ACCEPTABLE / INSUFFICIENT)
│
└─ return EvalResponse{scores, global_score, verdict, details[]}
```

---

## 4. LangGraph Workflow

### 4.1 Full Graph (`build_graph`)

```
┌─────────────┐
│ classify │
└──────┬──────┘
│
┌────────────────┼──────────────────┐
▼ ▼ ▼
RETRIEVAL CODE_GENERATION CONVERSATIONAL
│ │ │
└────────┬───────┘ │
▼ ▼
┌──────────────┐ ┌────────────────────────┐
│ reformulate │ │ respond_conversational │
└──────┬───────┘ └───────────┬────────────┘
▼ │
┌──────────────┐ │
│ retrieve │ │
└──────┬───────┘ │
│ │
┌────────┴───────────┐ │
▼ ▼ │
┌──────────┐ ┌───────────────┐ │
│ generate │ │ generate_code │ │
└────┬─────┘ └───────┬───────┘ │
│ │ │
└────────────────────┴────────────────┘
│
END
```

### 4.2 Prepare Graph (`build_prepare_graph`)

Identical classify routing, but the generation nodes are replaced by `END`. The `CONVERSATIONAL` branch uses `skip_retrieve` (returns empty context without querying Elasticsearch).

### 4.3 Query Type Routing

| `query_type` | Triggers retrieve? | Generation prompt |
|---|---|---|
| `RETRIEVAL` | Yes | `GENERATE_PROMPT` (explanation-focused) |
| `CODE_GENERATION` | Yes | `CODE_GENERATION_PROMPT` (code-focused, returns AVAP blocks) |
| `CONVERSATIONAL` | No | `CONVERSATIONAL_PROMPT` (reformulation of the prior answer) |
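
The routing table above can be expressed as a small lookup. This is an illustrative sketch: `ROUTING` and `route_query` are hypothetical names, and the prompt names are the constants listed in §2's prompt library.

```python
# Maps each classified query type to (needs retrieval?, generation prompt name).
ROUTING = {
    "RETRIEVAL": {"retrieve": True, "prompt": "GENERATE_PROMPT"},
    "CODE_GENERATION": {"retrieve": True, "prompt": "CODE_GENERATION_PROMPT"},
    "CONVERSATIONAL": {"retrieve": False, "prompt": "CONVERSATIONAL_PROMPT"},
}

def route_query(query_type: str) -> tuple[bool, str]:
    """Return (needs_retrieval, prompt_name) for a classified query type."""
    entry = ROUTING[query_type]
    return entry["retrieve"], entry["prompt"]
```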

---

## 5. RAG Pipeline — Hybrid Search

The retrieval system (`hybrid_search_native`) fuses BM25 lexical search and kNN dense vector search using **Reciprocal Rank Fusion (RRF)**.

```
User query
│
├─ embeddings.embed_query(query) → query_vector [768-dim]
│
├─ ES multi_match (BM25) on fields [content^2, text^2]
│ └─ top-k BM25 hits
│
└─ ES knn on field [embedding], num_candidates = k×5
└─ top-k kNN hits
│
├─ RRF fusion: score(doc) = Σ 1/(rank + 60)
│
└─ Top-8 documents → format_context() → context string
```
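
The two request bodies in the diagram can be built as plain dictionaries. This is a hedged sketch using standard Elasticsearch DSL shapes; the field names (`content`, `text`, `embedding`) come from the diagram and mapping above, but the exact options `hybrid_search_native` passes may differ.

```python
# Builds the BM25 leg: a multi_match over the boosted text fields.
def build_bm25_query(query: str, k: int) -> dict:
    return {
        "size": k,
        "query": {
            "multi_match": {"query": query, "fields": ["content^2", "text^2"]}
        },
    }

# Builds the kNN leg: dense-vector search with a widened candidate pool
# (num_candidates = k×5, as noted in the diagram).
def build_knn_query(query_vector: list[float], k: int) -> dict:
    return {
        "knn": {
            "field": "embedding",
            "query_vector": query_vector,
            "k": k,
            "num_candidates": k * 5,
        }
    }
```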
**RRF constant:** `60` (standard value; prevents high-rank documents from dominating while still rewarding consensus between both retrieval modes).
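
A minimal implementation of the fusion formula above (rank is 1-based here; `rrf_fuse` is an illustrative name, not the repository's):

```python
# score(doc) = Σ 1/(rank + 60) over the rankings the document appears in.
def rrf_fuse(bm25_ids: list[str], knn_ids: list[str],
             k: int = 60, top_n: int = 8) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in (bm25_ids, knn_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank + k)
    # Documents found by both retrieval modes accumulate two terms and rise.
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

Note how `"b"`, present in both lists, outranks `"a"` even though `"a"` is ranked first by BM25: consensus between the two modes dominates any single-list rank.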

**Chunk metadata** attached to each retrieved document:

| Field | Description |
|---|---|
| `chunk_id` | Unique identifier within the index |
| `source_file` | Origin document filename |
| `doc_type` | `prose`, `code`, `code_example`, `bnf` |
| `block_type` | AVAP block type: `function`, `if`, `startLoop`, `try` |
| `section` | Document section/chapter heading |

Documents of type `code`, `code_example`, `bnf`, or block type `function / if / startLoop / try` are tagged as `[AVAP CODE]` in the formatted context, signaling the LLM to treat them as executable syntax rather than prose.
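
The tagging rule can be sketched as below. `format_context` is the real function's name per the diagram, but this body is a reconstruction: the `[DOC]` label for prose chunks and the exact output layout are assumptions.

```python
# Doc types and block types that mark a chunk as executable AVAP syntax.
CODE_DOC_TYPES = {"code", "code_example", "bnf"}
CODE_BLOCK_TYPES = {"function", "if", "startLoop", "try"}

def format_context(docs: list[dict]) -> str:
    """Join retrieved chunks, tagging code-like chunks as [AVAP CODE]."""
    parts = []
    for doc in docs:
        is_code = (doc.get("doc_type") in CODE_DOC_TYPES
                   or doc.get("block_type") in CODE_BLOCK_TYPES)
        tag = "[AVAP CODE]" if is_code else "[DOC]"
        parts.append(f"{tag} ({doc.get('source_file', '?')})\n{doc['content']}")
    return "\n\n".join(parts)
```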

---

## 6. Streaming Architecture (AskAgentStream)

The two-phase streaming design is critical to understand:

**Why not stream through LangGraph?**
LangGraph's `stream()` method yields full state snapshots per node, not individual tokens. To achieve true per-token streaming to the gRPC client, the generation step is deliberately extracted from the graph and called directly via `llm.stream()`.

**Phase 1 — Deterministic preparation (graph-managed):**
- Classification, query reformulation, and retrieval run through `prepare_graph.invoke()`.
- This phase runs synchronously and produces the complete context before any token is emitted to the client.

**Phase 2 — Token streaming (manual):**
- `build_final_messages()` reconstructs the exact prompt that `generate` / `generate_code` / `respond_conversational` would have used.
- `llm.stream(final_messages)` yields one `AIMessageChunk` per token from Ollama.
- Each token is immediately forwarded to the gRPC client as `AgentResponse{text=token, is_final=False}`.
- After the stream ends, the full assembled text is persisted to `session_store`.
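
The two phases condense into a generator like the following. This is a sketch under stated assumptions: `prepare`, `build_final_messages`, and `llm_stream` are stand-ins for the real graph and LLM objects, responses are modeled as plain dicts instead of `AgentResponse` messages, and persistence is left to the caller.

```python
from typing import Callable, Iterator

def ask_agent_stream(initial_state: dict,
                     prepare: Callable[[dict], dict],
                     build_final_messages: Callable[[dict], list],
                     llm_stream: Callable[[list], Iterator[str]]) -> Iterator[dict]:
    prepared = prepare(initial_state)          # Phase 1: classify/reformulate/retrieve
    messages = build_final_messages(prepared)  # reconstruct the generation prompt
    chunks = []
    for token in llm_stream(messages):         # Phase 2: manual token streaming
        chunks.append(token)
        yield {"text": token, "is_final": False}
    prepared["answer"] = "".join(chunks)       # caller persists this to session_store
    yield {"text": "", "is_final": True}       # empty sentinel closes the stream
```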

**Backpressure:** gRPC streaming is flow-controlled by the client. If the client stops reading, the Ollama token stream will block at the `yield` point. No explicit buffer-overflow protection is implemented (acceptable for the current single-client dev mode).

---

## 7. Evaluation Pipeline (EvaluateRAG)

The evaluation suite implements an **offline RAG evaluation** pattern using RAGAS metrics.

### Judge model separation

The production LLM (Ollama `qwen2.5:1.5b`) is used for **answer generation** — the same pipeline as production, to measure real-world quality. Claude (`claude-sonnet-4-20250514`) is used as the **evaluation judge** — an independent, high-capability model that scores the generated answers against ground truth.

### RAGAS metrics

| Metric | Measures | Input |
|---|---|---|
| `faithfulness` | Are claims in the answer supported by the retrieved context? | answer + contexts |
| `answer_relevancy` | Is the answer relevant to the question? | answer + question |
| `context_recall` | Does the retrieved context cover the ground truth? | contexts + ground_truth |
| `context_precision` | Are the retrieved chunks useful (signal-to-noise)? | contexts + ground_truth |

### Global score & verdict

```
global_score = mean(non-zero metric scores)
verdict:
≥ 0.80 → EXCELLENT
≥ 0.60 → ACCEPTABLE
< 0.60 → INSUFFICIENT
```
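
A direct transcription of the rule above (the function name is illustrative; per the "non-zero metric scores" wording, zero-valued metrics are excluded from the mean):

```python
def global_score_and_verdict(scores: dict[str, float]) -> tuple[float, str]:
    """Average the non-zero metric scores and map the mean to a verdict."""
    nonzero = [v for v in scores.values() if v > 0]
    global_score = sum(nonzero) / len(nonzero) if nonzero else 0.0
    if global_score >= 0.80:
        verdict = "EXCELLENT"
    elif global_score >= 0.60:
        verdict = "ACCEPTABLE"
    else:
        verdict = "INSUFFICIENT"
    return global_score, verdict
```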

### Golden dataset

Located at `Docker/src/golden_dataset.json`. Each entry follows this schema:

```json
{
"id": "avap-001",
"category": "core_syntax",
"question": "How do you declare a variable in AVAP?",
"ground_truth": "Use addVar to declare a variable..."
}
```

---

## 8. Data Ingestion Pipeline

Documents flow into the Elasticsearch index through two paths:

### Path A — AVAP documentation (structured markdown)

```
docs/LRM/avap.md
docs/avap_language_github_docs/*.md
docs/developer.avapframework.com/*.md
│
▼
scripts/pipelines/flows/elasticsearch_ingestion.py
│
├─ Load markdown files
├─ Chunk using scripts/pipelines/tasks/chunk.py
│ (semantic chunking via Chonkie library)
├─ Generate embeddings via scripts/pipelines/tasks/embeddings.py
│ (Ollama or HuggingFace embedding model)
└─ Bulk index into Elasticsearch
index: avap-docs-* (configurable via ELASTICSEARCH_INDEX)
mapping: {content, embedding, source_file, doc_type, section, ...}
```
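
One bulk-index action per chunk would have roughly this shape, matching the mapping sketched above. This is a hypothetical sketch: `to_bulk_action` is not the pipeline's actual helper, and the default index name `avap-docs-dev` is a placeholder for whatever `ELASTICSEARCH_INDEX` selects from the `avap-docs-*` family.

```python
def to_bulk_action(chunk_id: str, text: str, vector: list[float],
                   source_file: str, doc_type: str, section: str,
                   index: str = "avap-docs-dev") -> dict:
    """Shape one document for elasticsearch.helpers.bulk-style ingestion."""
    return {
        "_index": index,
        "_id": chunk_id,
        "_source": {
            "content": text,         # chunk text (Chonkie semantic chunk)
            "embedding": vector,     # dense vector from the embedding model
            "source_file": source_file,
            "doc_type": doc_type,    # prose / code / code_example / bnf
            "section": section,
        },
    }
```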

### Path B — Synthetic AVAP code samples

```
docs/samples/*.avap
│
▼
scripts/pipelines/flows/generate_mbap.py
│
├─ Read AVAP LRM (docs/LRM/avap.md)
├─ Call Claude API to generate MBPP-style problems
└─ Output synthetic_datasets/mbpp_avap.json
(used for fine-tuning and few-shot examples)
```

---

## 9. Infrastructure Layout

### Devaron Cluster (Vultr Kubernetes)

| Service | K8s Name | Port | Purpose |
|---|---|---|---|
| LLM inference | `ollama-light-service` | `11434` | Text generation + embeddings |
| Vector database | `brunix-vector-db` | `9200` | Elasticsearch 8.x |
| Observability DB | `brunix-postgres` | `5432` | PostgreSQL for Langfuse |
| Langfuse UI | — | `80` | `http://45.77.119.180` |

### Kubernetes tunnel commands

```bash
# Terminal 1 — LLM
kubectl port-forward --address 0.0.0.0 svc/ollama-light-service 11434:11434 \
-n brunix --kubeconfig ./kubernetes/kubeconfig.yaml
# Terminal 2 — Elasticsearch
kubectl port-forward --address 0.0.0.0 svc/brunix-vector-db 9200:9200 \
-n brunix --kubeconfig ./kubernetes/kubeconfig.yaml
# Terminal 3 — PostgreSQL (Langfuse)
kubectl port-forward --address 0.0.0.0 svc/brunix-postgres 5432:5432 \
-n brunix --kubeconfig ./kubernetes/kubeconfig.yaml
```

### Port map summary

| Port | Protocol | Service | Scope |
|---|---|---|---|
| `50051` | gRPC | Brunix Engine (inside container) | Internal |
| `50052` | gRPC | Brunix Engine (host-mapped) | External |
| `8000` | HTTP | OpenAI proxy | External |
| `11434` | HTTP | Ollama (via tunnel) | Tunnel |
| `9200` | HTTP | Elasticsearch (via tunnel) | Tunnel |
| `5432` | TCP | PostgreSQL/Langfuse (via tunnel) | Tunnel |

---

## 10. Session State & Conversation Memory

Conversation history is managed via an in-process dictionary:

```python
session_store: dict[str, list] = defaultdict(list)
# key: session_id (string, provided by client)
# value: list of LangChain BaseMessage objects
```

**Characteristics:**
- **In-memory only.** History is lost on container restart.
- **No TTL or eviction.** Sessions grow unbounded for the lifetime of the process.
- **Thread safety:** Python's GIL provides basic safety for the `ThreadPoolExecutor(max_workers=10)` gRPC server, but concurrent writes to the same `session_id` from two simultaneous requests are not explicitly protected.
- **History window:** `format_history_for_classify()` uses only the last 6 messages for query classification to keep the classify prompt short and deterministic.
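
The missing per-session write protection and eviction could be addressed with a small wrapper like the one below. This is a possible hardening, not current behavior; `append_messages`, `_locks`, and the `MAX_MESSAGES` cap are all hypothetical.

```python
import threading
from collections import defaultdict

_locks: dict[str, threading.Lock] = defaultdict(threading.Lock)
session_store: dict[str, list] = defaultdict(list)
MAX_MESSAGES = 50  # illustrative cap; the real store has no eviction

def append_messages(session_id: str, new_messages: list) -> list:
    """Append to a session under a per-session lock, trimming old history."""
    with _locks[session_id]:            # serialize concurrent writers per session
        history = session_store[session_id]
        history.extend(new_messages)
        del history[:-MAX_MESSAGES]     # keep only the newest window
        return list(history)
```

A per-session lock keeps unrelated sessions from contending with each other, while the trim bounds memory growth for long-lived sessions.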

> **Future work:** Replace `session_store` with a Redis-backed persistent store to survive restarts and support horizontal scaling.

---

## 11. Observability Stack

### Langfuse tracing

The server integrates Langfuse for end-to-end LLM tracing. Every `AskAgent` / `AskAgentStream` request creates a trace that captures:
- Input query and session ID
- Each LangGraph node execution (classify, reformulate, retrieve, generate)
- LLM token counts, latency, and cost
- Final response

**Access:** `http://45.77.119.180` — requires a project API key configured via `LANGFUSE_PUBLIC_KEY` and `LANGFUSE_SECRET_KEY`.

### Logging

Structured logging via Python's `logging` module, configured at `INFO` level. Log format:

```
[MODULE] context_info — key=value key=value
```
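
A minimal stdlib reproduction of that shape, assuming the marker is simply part of the message text (the logger name, handler setup, and example values here are illustrative, not the server's actual configuration):

```python
import io
import logging

def make_logger(stream: io.StringIO) -> logging.Logger:
    """Build an isolated logger that writes bare messages to a stream."""
    logger = logging.getLogger("brunix-demo")
    logger.setLevel(logging.INFO)
    logger.propagate = False           # keep output off the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(message)s"))
    logger.handlers = [handler]
    return logger

buf = io.StringIO()
log = make_logger(buf)
# Emits: [retrieve] hybrid search done — docs=8 context_chars=4211
log.info("[retrieve] hybrid search done — docs=%d context_chars=%d", 8, 4211)
```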

Key log markers:

| Marker | Module | Meaning |
|---|---|---|
| `[ESEARCH]` | `server.py` | Elasticsearch connection status |
| `[classify]` | `graph.py` | Query type decision + raw LLM output |
| `[reformulate]` | `graph.py` | Reformulated query string |
| `[hybrid]` | `graph.py` | BM25 / kNN hit counts and RRF result count |
| `[retrieve]` | `graph.py` | Number of docs retrieved and context length |
| `[generate]` | `graph.py` | Response character count |
| `[AskAgentStream]` | `server.py` | Token count and total chars per stream |
| `[eval]` | `evaluate.py` | Per-question retrieval and generation status |

---

## 12. Security Boundaries

| Boundary | Current state | Risk |
|---|---|---|
| gRPC transport | **Insecure** (`add_insecure_port`) | Network interception possible. Acceptable in the dev/tunnel setup; requires mTLS for production. |
| Elasticsearch auth | Optional (user/pass or API key via env vars) | The index is accessible without auth if `ELASTICSEARCH_USER` and `ELASTICSEARCH_API_KEY` are unset. |
| Container user | Non-root (`python:3.11-slim` default) | Low risk. Do not override with `root`. |
| Secrets in env | Via `.env` / `docker-compose` env injection | Never commit real values. See [CONTRIBUTING.md](../CONTRIBUTING.md#6-environment-variables-policy). |
| Session store | In-memory, no auth | Any caller with access to the gRPC port can read/write any session by guessing its ID. |
| Kubeconfig | `./kubernetes/kubeconfig.yaml` (local only) | Grants cluster access. Never commit; listed in `.gitignore`. |

---

## 13. Known Limitations & Future Work

| Area | Limitation | Proposed solution |
|---|---|---|
| Session persistence | In-memory, lost on restart | Redis-backed `session_store` |
| Horizontal scaling | `session_store` is per-process | Sticky sessions or an external session store |
| gRPC security | Insecure port | Add TLS + optional mTLS |
| Elasticsearch auth | Not enforced if vars unset | Make auth required; fail fast on startup |
| Context window | Full history passed to generate; no truncation | Sliding window or summarization for long sessions |
| Evaluation | Golden dataset must be manually maintained | Automated golden-dataset refresh pipeline |
| Rate limiting | None on gRPC server | Add an interceptor-based rate limiter |
| Health check | No gRPC health protocol | Implement `grpc.health.v1` |
|