# Brunix Assistance Engine — Architecture Reference
> **Audience:** Engineers contributing to this repository, architects reviewing the system design, and operators responsible for its deployment.
> **Last updated:** 2026-03-18
> **Version:** 1.5.x
---
## Table of Contents
1. [System Overview](#1-system-overview)
2. [Component Inventory](#2-component-inventory)
3. [Request Lifecycle](#3-request-lifecycle)
4. [LangGraph Workflow](#4-langgraph-workflow)
5. [RAG Pipeline — Hybrid Search](#5-rag-pipeline--hybrid-search)
6. [Streaming Architecture (AskAgentStream)](#6-streaming-architecture-askagentstream)
7. [Evaluation Pipeline (EvaluateRAG)](#7-evaluation-pipeline-evaluaterag)
8. [Data Ingestion Pipeline](#8-data-ingestion-pipeline)
9. [Infrastructure Layout](#9-infrastructure-layout)
10. [Session State & Conversation Memory](#10-session-state--conversation-memory)
11. [Observability Stack](#11-observability-stack)
12. [Security Boundaries](#12-security-boundaries)
13. [Known Limitations & Future Work](#13-known-limitations--future-work)
---
## 1. System Overview
The **Brunix Assistance Engine** is a stateful, streaming-capable AI service that answers questions about the AVAP programming language. It combines:
- **gRPC** as the primary communication interface (port `50051` inside container, `50052` on host)
- **LangGraph** for deterministic, multi-step agentic orchestration
- **Hybrid RAG** (BM25 + kNN with RRF fusion) over an Elasticsearch vector index
- **Ollama** as the local LLM and embedding backend
- **RAGAS + Claude** as the automated evaluation judge
A secondary **OpenAI-compatible HTTP proxy** (port `8000`) is served via FastAPI/Uvicorn, enabling integration with tools that expect the OpenAI API format.
```
┌────────────────────────────────────────────────────────────┐
│                      External Clients                      │
│ grpcurl / App SDK          │ OpenAI-compatible client      │
└────────────┬───────────────┴───────────────┬───────────────┘
             │ gRPC :50052                   │ HTTP :8000
             ▼                               ▼
┌────────────────────────────────────────────────────────────┐
│                      Docker Container                      │
│                                                            │
│  ┌─────────────────────┐     ┌──────────────────────────┐  │
│  │ server.py (gRPC)    │     │ openai_proxy.py (HTTP)   │  │
│  │ BrunixEngine        │     │ FastAPI / Uvicorn        │  │
│  └──────────┬──────────┘     └──────────────────────────┘  │
│             │                                              │
│  ┌──────────▼───────────────────────────────────────┐      │
│  │             LangGraph Orchestration              │      │
│  │   classify → reformulate → retrieve → generate   │      │
│  └─────────────────────────┬────────────────────────┘      │
│                            │                               │
│       ┌────────────────────┼────────────────────┐          │
│       ▼                    ▼                    ▼          │
│ Ollama (LLM)        Ollama (Embed)        Elasticsearch    │
│  via tunnel          via tunnel            via tunnel      │
└────────────────────────────────────────────────────────────┘
             │ kubectl port-forward tunnels  │
             ▼                               ▼
      Devaron Cluster (Vultr Kubernetes)
      ollama-light-service:11434     brunix-vector-db:9200
      brunix-postgres:5432           Langfuse UI
```
---
## 2. Component Inventory
| Component | File / Service | Responsibility |
|---|---|---|
| **gRPC Server** | `Docker/src/server.py` | Entry point. Implements the `AssistanceEngine` servicer. Initializes LLM, embeddings, ES client, and both graphs. |
| **Full Graph** | `Docker/src/graph.py` (`build_graph()`) | Complete workflow: classify → reformulate → retrieve → generate. Used by `AskAgent` and `EvaluateRAG`. |
| **Prepare Graph** | `Docker/src/graph.py` (`build_prepare_graph()`) | Partial workflow: classify → reformulate → retrieve. Does **not** call the LLM for generation. Used by `AskAgentStream` to enable manual token streaming. |
| **Message Builder** | `Docker/src/graph.py` (`build_final_messages()`) | Reconstructs the final prompt list from prepared state for `llm.stream()`. |
| **Prompt Library** | `Docker/src/prompts.py` | Centralized definitions for `CLASSIFY`, `REFORMULATE`, `GENERATE`, `CODE_GENERATION`, and `CONVERSATIONAL` prompts. |
| **Agent State** | `Docker/src/state.py` | `AgentState` TypedDict shared across all graph nodes. |
| **Evaluation Suite** | `Docker/src/evaluate.py` | RAGAS-based pipeline. Uses the production retriever + Ollama LLM for generation, and Claude as the impartial judge. |
| **OpenAI Proxy** | `Docker/src/openai_proxy.py` | FastAPI application that wraps `AskAgentStream` behind a `/v1/chat/completions` endpoint. |
| **LLM Factory** | `Docker/src/utils/llm_factory.py` | Provider-agnostic factory for chat models (Ollama, AWS Bedrock). |
| **Embedding Factory** | `Docker/src/utils/emb_factory.py` | Provider-agnostic factory for embedding models (Ollama, HuggingFace). |
| **Ingestion Pipeline** | `scripts/pipelines/flows/elasticsearch_ingestion.py` | Chunks and ingests AVAP documents into Elasticsearch with embeddings. |
| **Dataset Generator** | `scripts/pipelines/flows/generate_mbap.py` | Generates synthetic MBPP-style AVAP problems using Claude. |
| **MBPP Translator** | `scripts/pipelines/flows/translate_mbpp.py` | Translates MBPP Python dataset into AVAP equivalents. |
---
## 3. Request Lifecycle
### 3.1 `AskAgent` (non-streaming)
```
Client → gRPC AgentRequest{query, session_id}
  ├─ Load conversation history from session_store[session_id]
  ├─ Build initial_state = {messages: history + [user_msg], ...}
  ├─ graph.invoke(initial_state)
  │    ├─ classify → query_type ∈ {RETRIEVAL, CODE_GENERATION, CONVERSATIONAL}
  │    ├─ reformulate → reformulated_query (keyword-optimized for semantic search)
  │    ├─ retrieve → context (top-8 hybrid RRF chunks from Elasticsearch)
  │    └─ generate → final AIMessage (llm.invoke)
  ├─ Persist updated history to session_store[session_id]
  └─ yield AgentResponse{text, avap_code="AVAP-2026", is_final=True}
```
### 3.2 `AskAgentStream` (token streaming)
```
Client → gRPC AgentRequest{query, session_id}
  ├─ Load history from session_store[session_id]
  ├─ Build initial_state
  ├─ prepare_graph.invoke(initial_state)      ← Phase 1: no LLM generation
  │    ├─ classify
  │    ├─ reformulate
  │    └─ retrieve (or skip_retrieve if CONVERSATIONAL)
  ├─ build_final_messages(prepared_state)     ← Reconstruct prompt list
  ├─ for chunk in llm.stream(final_messages):
  │    └─ yield AgentResponse{text=token, is_final=False}
  ├─ Persist full assembled response to session_store
  └─ yield AgentResponse{text="", is_final=True}
```
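The forward-and-persist loop above can be sketched as a plain Python generator. This is a minimal sketch, not the repository's exact code: `token_source` stands in for `llm.stream(final_messages)`, and the yielded dicts stand in for gRPC `AgentResponse` messages.

```python
from collections import defaultdict

# In-process history store, mirroring session_store in server.py
session_store: dict[str, list] = defaultdict(list)

def stream_response(token_source, session_id: str):
    """Yield each token as a non-final response, then persist and finalize."""
    parts = []
    for token in token_source:
        parts.append(token)
        yield {"text": token, "is_final": False}
    # Persist the fully assembled answer before emitting the final marker
    session_store[session_id].append("".join(parts))
    yield {"text": "", "is_final": True}

responses = list(stream_response(iter(["Hello", ", ", "world"]), "s1"))
```

Note that persistence happens only after the token source is exhausted: if the client disconnects mid-stream, the partial answer is never written to the session history.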
### 3.3 `EvaluateRAG`
```
Client → gRPC EvalRequest{category?, limit?, index?}
  └─ evaluate.run_evaluation(...)
       ├─ Load golden_dataset.json
       ├─ Filter by category / limit
       ├─ For each question:
       │    ├─ retrieve_context (hybrid BM25+kNN, same as production)
       │    └─ generate_answer (Ollama LLM + GENERATE_PROMPT)
       ├─ Build RAGAS Dataset
       ├─ Run RAGAS metrics with Claude as judge:
       │    faithfulness / answer_relevancy / context_recall / context_precision
       ├─ Compute global_score + verdict (EXCELLENT / ACCEPTABLE / INSUFFICIENT)
       └─ return EvalResponse{scores, global_score, verdict, details[]}
```
---
## 4. LangGraph Workflow
### 4.1 Full Graph (`build_graph`)
```
                      ┌─────────────┐
                      │  classify   │
                      └──────┬──────┘
         ┌───────────────────┼───────────────────┐
         ▼                   ▼                   ▼
     RETRIEVAL        CODE_GENERATION      CONVERSATIONAL
         │                   │                   │
         └─────────┬─────────┘                   │
                   ▼                             ▼
          ┌──────────────┐          ┌────────────────────────┐
          │ reformulate  │          │ respond_conversational │
          └──────┬───────┘          └───────────┬────────────┘
                 ▼                              │
          ┌──────────────┐                      │
          │   retrieve   │                      │
          └──────┬───────┘                      │
                 │                              │
        ┌────────┴───────────┐                  │
        ▼                    ▼                  │
   ┌──────────┐      ┌───────────────┐          │
   │ generate │      │ generate_code │          │
   └────┬─────┘      └───────┬───────┘          │
        │                    │                  │
        └──────────┬─────────┴──────────────────┘
                   ▼
                  END
```
### 4.2 Prepare Graph (`build_prepare_graph`)
Identical routing for classify, but generation nodes are replaced by `END`. The `CONVERSATIONAL` branch uses `skip_retrieve` (returns empty context without querying Elasticsearch).
### 4.3 Query Type Routing
| `query_type` | Triggers retrieve? | Generation prompt |
|---|---|---|
| `RETRIEVAL` | Yes | `GENERATE_PROMPT` (explanation-focused) |
| `CODE_GENERATION` | Yes | `CODE_GENERATION_PROMPT` (code-focused, returns AVAP blocks) |
| `CONVERSATIONAL` | No | `CONVERSATIONAL_PROMPT` (reformulation of prior answer) |
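In code, the routing table above reduces to a single conditional-edge function. This is a hypothetical sketch of the decision logic, not the repository's exact implementation:

```python
def route_after_classify(state: dict) -> str:
    """Map the classifier's query_type to the next graph node (sketch)."""
    query_type = state.get("query_type", "RETRIEVAL")
    if query_type == "CONVERSATIONAL":
        return "respond_conversational"  # no retrieval needed
    # RETRIEVAL and CODE_GENERATION both pass through reformulate → retrieve
    return "reformulate"
```

In LangGraph, a function of this shape would be wired in via `add_conditional_edges` on the `classify` node.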
---
## 5. RAG Pipeline — Hybrid Search
The retrieval system (`hybrid_search_native`) fuses BM25 lexical search and kNN dense vector search using **Reciprocal Rank Fusion (RRF)**.
```
User query
  ├─ embeddings.embed_query(query) → query_vector [768-dim]
  ├─ ES multi_match (BM25) on fields [content^2, text^2]
  │    └─ top-k BM25 hits
  ├─ ES knn on field [embedding], num_candidates = k×5
  │    └─ top-k kNN hits
  ├─ RRF fusion: score(doc) = Σ 1/(rank + 60)
  └─ Top-8 documents → format_context() → context string
```
**RRF constant:** `60` (standard value; prevents high-rank documents from dominating while still rewarding consensus between both retrieval modes).
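The fusion step itself is only a few lines. The following is a self-contained sketch of the RRF formula above; the real `hybrid_search_native` operates on Elasticsearch hit objects rather than bare document IDs:

```python
def rrf_fuse(bm25_ids: list[str], knn_ids: list[str], k: int = 60, top_n: int = 8) -> list[str]:
    """Fuse two ranked ID lists with Reciprocal Rank Fusion: score = Σ 1/(rank + k)."""
    scores: dict[str, float] = {}
    for ranking in (bm25_ids, knn_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank + k)
    # Highest fused score first; documents found by both retrievers win
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

fused = rrf_fuse(["a", "b", "c"], ["b", "d"])
```

In this toy example `"b"` ranks first because it appears in both result lists, which is exactly the consensus effect RRF is chosen for.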
**Chunk metadata** attached to each retrieved document:
| Field | Description |
|---|---|
| `chunk_id` | Unique identifier within the index |
| `source_file` | Origin document filename |
| `doc_type` | `prose`, `code`, `code_example`, `bnf` |
| `block_type` | AVAP block type: `function`, `if`, `startLoop`, `try` |
| `section` | Document section/chapter heading |
Documents of type `code`, `code_example`, `bnf`, or block type `function / if / startLoop / try` are tagged as `[AVAP CODE]` in the formatted context, signaling the LLM to treat them as executable syntax rather than prose.
---
## 6. Streaming Architecture (AskAgentStream)
Understanding the two-phase streaming design is essential:
**Why not stream through LangGraph?**
LangGraph's `stream()` method yields full state snapshots per node, not individual tokens. To achieve true per-token streaming to the gRPC client, the generation step is deliberately extracted from the graph and called directly via `llm.stream()`.
**Phase 1 — Deterministic preparation (graph-managed):**
- Classification, query reformulation, and retrieval run through `prepare_graph.invoke()`.
- This phase runs synchronously and produces the complete context before any token is emitted to the client.
**Phase 2 — Token streaming (manual):**
- `build_final_messages()` reconstructs the exact prompt that `generate` / `generate_code` / `respond_conversational` would have used.
- `llm.stream(final_messages)` yields one `AIMessageChunk` per token from Ollama.
- Each token is immediately forwarded to the gRPC client as `AgentResponse{text=token, is_final=False}`.
- After the stream ends, the full assembled text is persisted to `session_store`.
**Backpressure:** gRPC streaming is flow-controlled by the client. If the client stops reading, the Ollama token stream will block at the `yield` point. No explicit buffer overflow protection is implemented (acceptable for the current single-client dev mode).
---
## 7. Evaluation Pipeline (EvaluateRAG)
The evaluation suite implements an **offline RAG evaluation** pattern using RAGAS metrics.
### Judge model separation
The production LLM (Ollama `qwen2.5:1.5b`) is used for **answer generation** — the same pipeline as production to measure real-world quality. Claude (`claude-sonnet-4-20250514`) is used as the **evaluation judge** — an independent, high-capability model that scores the generated answers against ground truth.
### RAGAS metrics
| Metric | Measures | Input |
|---|---|---|
| `faithfulness` | Are claims in the answer supported by the retrieved context? | answer + contexts |
| `answer_relevancy` | Is the answer relevant to the question? | answer + question |
| `context_recall` | Does the retrieved context cover the ground truth? | contexts + ground_truth |
| `context_precision` | Are the retrieved chunks useful (signal-to-noise)? | contexts + ground_truth |
### Global score & verdict
```
global_score = mean(non-zero metric scores)

verdict:
  ≥ 0.80 → EXCELLENT
  ≥ 0.60 → ACCEPTABLE
  < 0.60 → INSUFFICIENT
```
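As a sketch, the aggregation rule reads as follows (mirroring the thresholds above; the exact implementation lives in `evaluate.py`):

```python
def compute_verdict(metric_scores: dict[str, float]) -> tuple[float, str]:
    """Average the non-zero RAGAS metrics and map the mean to a verdict."""
    non_zero = [s for s in metric_scores.values() if s > 0]
    global_score = sum(non_zero) / len(non_zero) if non_zero else 0.0
    if global_score >= 0.80:
        verdict = "EXCELLENT"
    elif global_score >= 0.60:
        verdict = "ACCEPTABLE"
    else:
        verdict = "INSUFFICIENT"
    return global_score, verdict
```

Excluding zero scores prevents a single metric that failed to compute (e.g. a judge API error) from dragging the mean down.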
### Golden dataset
Located at `Docker/src/golden_dataset.json`. Each entry follows this schema:
```json
{
  "id": "avap-001",
  "category": "core_syntax",
  "question": "How do you declare a variable in AVAP?",
  "ground_truth": "Use addVar to declare a variable..."
}
```
---
## 8. Data Ingestion Pipeline
Documents flow into the Elasticsearch index through two paths:
### Path A — AVAP documentation (structured markdown)
```
docs/LRM/avap.md
docs/avap_language_github_docs/*.md
docs/developer.avapframework.com/*.md

scripts/pipelines/flows/elasticsearch_ingestion.py
  ├─ Load markdown files
  ├─ Chunk using scripts/pipelines/tasks/chunk.py
  │    (semantic chunking via Chonkie library)
  ├─ Generate embeddings via scripts/pipelines/tasks/embeddings.py
  │    (Ollama or HuggingFace embedding model)
  └─ Bulk index into Elasticsearch
       index: avap-docs-* (configurable via ELASTICSEARCH_INDEX)
       mapping: {content, embedding, source_file, doc_type, section, ...}
```
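The bulk-index step can be sketched as a generator of Elasticsearch bulk actions. The field names follow the mapping above, but the function, its ID scheme, and the default index name are hypothetical illustrations; the real flow lives in `elasticsearch_ingestion.py` and reads the index from `ELASTICSEARCH_INDEX`:

```python
def build_bulk_actions(chunks: list[dict], embeddings: list[list[float]],
                       index_name: str = "avap-docs-example"):
    """Pair each chunk with its embedding and yield one bulk action per chunk.

    index_name is a placeholder; production reads ELASTICSEARCH_INDEX.
    """
    for i, (chunk, vector) in enumerate(zip(chunks, embeddings)):
        yield {
            "_index": index_name,
            "_id": f"{chunk['source_file']}-{i}",  # hypothetical ID scheme
            "_source": {
                "content": chunk["content"],
                "embedding": vector,
                "source_file": chunk["source_file"],
                "doc_type": chunk.get("doc_type", "prose"),
                "section": chunk.get("section", ""),
            },
        }
```

A generator like this would typically be fed to the official client's `helpers.bulk()`.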
### Path B — Synthetic AVAP code samples
```
docs/samples/*.avap

scripts/pipelines/flows/generate_mbap.py
  ├─ Read AVAP LRM (docs/LRM/avap.md)
  ├─ Call Claude API to generate MBPP-style problems
  └─ Output synthetic_datasets/mbpp_avap.json
       (used for fine-tuning and few-shot examples)
```
---
## 9. Infrastructure Layout
### Devaron Cluster (Vultr Kubernetes)
| Service | K8s Name | Port | Purpose |
|---|---|---|---|
| LLM inference | `ollama-light-service` | `11434` | Text generation + embeddings |
| Vector database | `brunix-vector-db` | `9200` | Elasticsearch 8.x |
| Observability DB | `brunix-postgres` | `5432` | PostgreSQL for Langfuse |
| Langfuse UI | — | `80` | `http://45.77.119.180` |
### Kubernetes tunnel commands
```bash
# Terminal 1 — LLM
kubectl port-forward --address 0.0.0.0 svc/ollama-light-service 11434:11434 \
-n brunix --kubeconfig ./kubernetes/kubeconfig.yaml
# Terminal 2 — Elasticsearch
kubectl port-forward --address 0.0.0.0 svc/brunix-vector-db 9200:9200 \
-n brunix --kubeconfig ./kubernetes/kubeconfig.yaml
# Terminal 3 — PostgreSQL (Langfuse)
kubectl port-forward --address 0.0.0.0 svc/brunix-postgres 5432:5432 \
-n brunix --kubeconfig ./kubernetes/kubeconfig.yaml
```
### Port map summary
| Port | Protocol | Service | Scope |
|---|---|---|---|
| `50051` | gRPC | Brunix Engine (inside container) | Internal |
| `50052` | gRPC | Brunix Engine (host-mapped) | External |
| `8000` | HTTP | OpenAI proxy | External |
| `11434` | HTTP | Ollama (via tunnel) | Tunnel |
| `9200` | HTTP | Elasticsearch (via tunnel) | Tunnel |
| `5432` | TCP | PostgreSQL/Langfuse (via tunnel) | Tunnel |
---
## 10. Session State & Conversation Memory
Conversation history is managed via an in-process dictionary:
```python
from collections import defaultdict

session_store: dict[str, list] = defaultdict(list)
# key:   session_id (string, provided by client)
# value: list of LangChain BaseMessage objects
```
**Characteristics:**
- **In-memory only.** History is lost on container restart.
- **No TTL or eviction.** Sessions grow unbounded for the lifetime of the process.
- **Thread safety:** Python's GIL provides basic safety for the `ThreadPoolExecutor(max_workers=10)` gRPC server, but concurrent writes to the same `session_id` from two simultaneous requests are not explicitly protected.
- **History window:** `format_history_for_classify()` uses only the last 6 messages for query classification to keep the classify prompt short and deterministic.
> **Future work:** Replace `session_store` with a Redis-backed persistent store to survive restarts and support horizontal scaling.
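A minimal sketch of the concurrency caveat's fix: guard each session with its own lock. This is a hypothetical improvement, not something the current code does; under the GIL, `defaultdict.__missing__` makes lock creation effectively atomic here.

```python
import threading
from collections import defaultdict

session_store: dict[str, list] = defaultdict(list)
_session_locks: dict[str, threading.Lock] = defaultdict(threading.Lock)

def append_to_session(session_id: str, message) -> None:
    """Serialize concurrent writes to the same session_id (sketch)."""
    with _session_locks[session_id]:
        session_store[session_id].append(message)
```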
---
## 11. Observability Stack
### Langfuse tracing
The server integrates Langfuse for end-to-end LLM tracing. Every `AskAgent` / `AskAgentStream` request creates a trace that captures:
- Input query and session ID
- Each LangGraph node execution (classify, reformulate, retrieve, generate)
- LLM token counts, latency, and cost
- Final response
**Access:** `http://45.77.119.180` — requires a project API key configured via `LANGFUSE_PUBLIC_KEY` and `LANGFUSE_SECRET_KEY`.
### Logging
Structured logging via Python's `logging` module, configured at `INFO` level. Log format:
```
[MODULE] context_info — key=value key=value
```
Key log markers:
| Marker | Module | Meaning |
|---|---|---|
| `[ESEARCH]` | `server.py` | Elasticsearch connection status |
| `[classify]` | `graph.py` | Query type decision + raw LLM output |
| `[reformulate]` | `graph.py` | Reformulated query string |
| `[hybrid]` | `graph.py` | BM25 / kNN hit counts and RRF result count |
| `[retrieve]` | `graph.py` | Number of docs retrieved and context length |
| `[generate]` | `graph.py` | Response character count |
| `[AskAgentStream]` | `server.py` | Token count and total chars per stream |
| `[eval]` | `evaluate.py` | Per-question retrieval and generation status |
---
## 12. Security Boundaries
| Boundary | Current state | Risk |
|---|---|---|
| gRPC transport | **Insecure** (`add_insecure_port`) | Network interception possible. Acceptable in dev/tunnel setup; requires mTLS for production. |
| Elasticsearch auth | Optional (user/pass or API key via env vars) | Index is accessible without auth if `ELASTICSEARCH_USER` and `ELASTICSEARCH_API_KEY` are unset. |
| Container user | Non-root (`python:3.11-slim` default) | Low risk. Do not override with `root`. |
| Secrets in env | Via `.env` / `docker-compose` env injection | Never commit real values. See [CONTRIBUTING.md](../CONTRIBUTING.md#6-environment-variables-policy). |
| Session store | In-memory, no auth | Any caller with access to the gRPC port can read/write any session by guessing its ID. |
| Kubeconfig | `./kubernetes/kubeconfig.yaml` (local only) | Grants cluster access. Never commit. Listed in `.gitignore`. |
---
## 13. Known Limitations & Future Work
| Area | Limitation | Proposed solution |
|---|---|---|
| Session persistence | In-memory, lost on restart | Redis-backed `session_store` |
| Horizontal scaling | `session_store` is per-process | Sticky sessions or external session store |
| gRPC security | Insecure port | Add TLS + optional mTLS |
| Elasticsearch auth | Not enforced if vars unset | Make auth required; fail-fast on startup |
| Context window | Full history passed to generate; no truncation | Sliding window or summarization for long sessions |
| Evaluation | Golden dataset must be manually maintained | Automated golden dataset refresh pipeline |
| Rate limiting | None on gRPC server | Add interceptor-based rate limiter |
| Health check | No gRPC health protocol | Implement `grpc.health.v1` |