# Brunix Assistance Engine — Architecture Reference
> **Audience:** Engineers contributing to this repository, architects reviewing the system design, and operators responsible for its deployment.
> **Last updated:** 2026-03-18
> **Version:** 1.5.x
---
## Table of Contents
1. [System Overview](#1-system-overview)
2. [Component Inventory](#2-component-inventory)
3. [Request Lifecycle](#3-request-lifecycle)
4. [LangGraph Workflow](#4-langgraph-workflow)
5. [RAG Pipeline — Hybrid Search](#5-rag-pipeline--hybrid-search)
6. [Streaming Architecture (AskAgentStream)](#6-streaming-architecture-askagentstream)
7. [Evaluation Pipeline (EvaluateRAG)](#7-evaluation-pipeline-evaluaterag)
8. [Data Ingestion Pipeline](#8-data-ingestion-pipeline)
9. [Infrastructure Layout](#9-infrastructure-layout)
10. [Session State & Conversation Memory](#10-session-state--conversation-memory)
11. [Observability Stack](#11-observability-stack)
12. [Security Boundaries](#12-security-boundaries)
13. [Known Limitations & Future Work](#13-known-limitations--future-work)
---
## 1. System Overview
The **Brunix Assistance Engine** is a stateful, streaming-capable AI service that answers questions about the AVAP programming language. It combines:
- **gRPC** as the primary communication interface (port `50051` inside container, `50052` on host)
- **LangGraph** for deterministic, multi-step agentic orchestration
- **Hybrid RAG** (BM25 + kNN with RRF fusion) over an Elasticsearch vector index
- **Ollama** as the local LLM and embedding backend
- **RAGAS + Claude** as the automated evaluation judge
A secondary **OpenAI-compatible HTTP proxy** (port `8000`) is served via FastAPI/Uvicorn, enabling integration with tools that expect the OpenAI API format.
```
┌────────────────────────────────────────────────────────────┐
│                      External Clients                      │
│ grpcurl / App SDK          │ OpenAI-compatible client      │
└────────────┬───────────────┴───────────────┬───────────────┘
             │ gRPC :50052                   │ HTTP :8000
             ▼                               ▼
┌────────────────────────────────────────────────────────────┐
│                      Docker Container                      │
│                                                            │
│  ┌─────────────────────┐     ┌──────────────────────────┐  │
│  │ server.py (gRPC)    │     │ openai_proxy.py (HTTP)   │  │
│  │ BrunixEngine        │     │ FastAPI / Uvicorn        │  │
│  └──────────┬──────────┘     └──────────────────────────┘  │
│             │                                              │
│  ┌──────────▼───────────────────────────────────────┐      │
│  │             LangGraph Orchestration              │      │
│  │   classify → reformulate → retrieve → generate   │      │
│  └─────────────────────────┬────────────────────────┘      │
│                            │                               │
│       ┌────────────────────┼────────────────────┐          │
│       ▼                    ▼                    ▼          │
│ Ollama (LLM)        Ollama (Embed)        Elasticsearch    │
│  via tunnel          via tunnel            via tunnel      │
└────────────────────────────────────────────────────────────┘
             │ kubectl port-forward tunnels  │
             ▼                               ▼
      Devaron Cluster (Vultr Kubernetes)
      ollama-light-service:11434     brunix-vector-db:9200
      brunix-postgres:5432           Langfuse UI
```
---
## 2. Component Inventory
| Component | File / Service | Responsibility |
|---|---|---|
| **gRPC Server** | `Docker/src/server.py` | Entry point. Implements the `AssistanceEngine` servicer. Initializes LLM, embeddings, ES client, and both graphs. |
| **Full Graph** | `Docker/src/graph.py` (`build_graph()`) | Complete workflow: classify → reformulate → retrieve → generate. Used by `AskAgent` and `EvaluateRAG`. |
| **Prepare Graph** | `Docker/src/graph.py` (`build_prepare_graph()`) | Partial workflow: classify → reformulate → retrieve. Does **not** call the LLM for generation. Used by `AskAgentStream` to enable manual token streaming. |
| **Message Builder** | `Docker/src/graph.py` (`build_final_messages()`) | Reconstructs the final prompt list from prepared state for `llm.stream()`. |
| **Prompt Library** | `Docker/src/prompts.py` | Centralized definitions for `CLASSIFY`, `REFORMULATE`, `GENERATE`, `CODE_GENERATION`, and `CONVERSATIONAL` prompts. |
| **Agent State** | `Docker/src/state.py` | `AgentState` TypedDict shared across all graph nodes. |
| **Evaluation Suite** | `Docker/src/evaluate.py` | RAGAS-based pipeline. Uses the production retriever + Ollama LLM for generation, and Claude as the impartial judge. |
| **OpenAI Proxy** | `Docker/src/openai_proxy.py` | FastAPI application that wraps `AskAgentStream` behind a `/v1/chat/completions` endpoint. |
| **LLM Factory** | `Docker/src/utils/llm_factory.py` | Provider-agnostic factory for chat models (Ollama, AWS Bedrock). |
| **Embedding Factory** | `Docker/src/utils/emb_factory.py` | Provider-agnostic factory for embedding models (Ollama, HuggingFace). |
| **Ingestion Pipeline** | `scripts/pipelines/flows/elasticsearch_ingestion.py` | Chunks and ingests AVAP documents into Elasticsearch with embeddings. |
| **Dataset Generator** | `scripts/pipelines/flows/generate_mbap.py` | Generates synthetic MBPP-style AVAP problems using Claude. |
| **MBPP Translator** | `scripts/pipelines/flows/translate_mbpp.py` | Translates MBPP Python dataset into AVAP equivalents. |
---
## 3. Request Lifecycle
### 3.1 `AskAgent` (non-streaming)
```
Client → gRPC AgentRequest{query, session_id}
  ├─ Load conversation history from session_store[session_id]
  ├─ Build initial_state = {messages: history + [user_msg], ...}
  ├─ graph.invoke(initial_state)
  │    ├─ classify → query_type ∈ {RETRIEVAL, CODE_GENERATION, CONVERSATIONAL}
  │    ├─ reformulate → reformulated_query (keyword-optimized for semantic search)
  │    ├─ retrieve → context (top-8 hybrid RRF chunks from Elasticsearch)
  │    └─ generate → final AIMessage (llm.invoke)
  ├─ Persist updated history to session_store[session_id]
  └─ yield AgentResponse{text, avap_code="AVAP-2026", is_final=True}
```
### 3.2 `AskAgentStream` (token streaming)
```
Client → gRPC AgentRequest{query, session_id}
  ├─ Load history from session_store[session_id]
  ├─ Build initial_state
  ├─ prepare_graph.invoke(initial_state)      ← Phase 1: no LLM generation
  │    ├─ classify
  │    ├─ reformulate
  │    └─ retrieve (or skip_retrieve if CONVERSATIONAL)
  ├─ build_final_messages(prepared_state)     ← Reconstruct prompt list
  ├─ for chunk in llm.stream(final_messages):
  │    └─ yield AgentResponse{text=token, is_final=False}
  ├─ Persist full assembled response to session_store
  └─ yield AgentResponse{text="", is_final=True}
```
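The forward-and-persist loop above can be sketched as a plain Python generator. This is a minimal sketch, not the repository's exact code: `token_source` stands in for `llm.stream(final_messages)`, and the yielded dicts stand in for gRPC `AgentResponse` messages.

```python
from collections import defaultdict

# In-process history store, mirroring session_store in server.py
session_store: dict[str, list] = defaultdict(list)

def stream_response(token_source, session_id: str):
    """Yield each token as a non-final response, then persist and finalize."""
    parts = []
    for token in token_source:
        parts.append(token)
        yield {"text": token, "is_final": False}
    # Persist the fully assembled answer before emitting the final marker
    session_store[session_id].append("".join(parts))
    yield {"text": "", "is_final": True}

responses = list(stream_response(iter(["Hello", ", ", "world"]), "s1"))
```

Note that persistence happens only after the token source is exhausted: if the client disconnects mid-stream, the partial answer is never written to the session history.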
### 3.3 `EvaluateRAG`
```
Client → gRPC EvalRequest{category?, limit?, index?}
  └─ evaluate.run_evaluation(...)
       ├─ Load golden_dataset.json
       ├─ Filter by category / limit
       ├─ For each question:
       │    ├─ retrieve_context (hybrid BM25+kNN, same as production)
       │    └─ generate_answer (Ollama LLM + GENERATE_PROMPT)
       ├─ Build RAGAS Dataset
       ├─ Run RAGAS metrics with Claude as judge:
       │    faithfulness / answer_relevancy / context_recall / context_precision
       ├─ Compute global_score + verdict (EXCELLENT / ACCEPTABLE / INSUFFICIENT)
       └─ return EvalResponse{scores, global_score, verdict, details[]}
```
---
## 4. LangGraph Workflow
### 4.1 Full Graph (`build_graph`)
```
                      ┌─────────────┐
                      │  classify   │
                      └──────┬──────┘
         ┌───────────────────┼───────────────────┐
         ▼                   ▼                   ▼
     RETRIEVAL        CODE_GENERATION      CONVERSATIONAL
         │                   │                   │
         └─────────┬─────────┘                   │
                   ▼                             ▼
          ┌──────────────┐          ┌────────────────────────┐
          │ reformulate  │          │ respond_conversational │
          └──────┬───────┘          └───────────┬────────────┘
                 ▼                              │
          ┌──────────────┐                      │
          │   retrieve   │                      │
          └──────┬───────┘                      │
                 │                              │
        ┌────────┴───────────┐                  │
        ▼                    ▼                  │
   ┌──────────┐      ┌───────────────┐          │
   │ generate │      │ generate_code │          │
   └────┬─────┘      └───────┬───────┘          │
        │                    │                  │
        └──────────┬─────────┴──────────────────┘
                   ▼
                  END
```
### 4.2 Prepare Graph (`build_prepare_graph`)
Identical routing for classify, but generation nodes are replaced by `END`. The `CONVERSATIONAL` branch uses `skip_retrieve` (returns empty context without querying Elasticsearch).
### 4.3 Query Type Routing
| `query_type` | Triggers retrieve? | Generation prompt |
|---|---|---|
| `RETRIEVAL` | Yes | `GENERATE_PROMPT` (explanation-focused) |
| `CODE_GENERATION` | Yes | `CODE_GENERATION_PROMPT` (code-focused, returns AVAP blocks) |
| `CONVERSATIONAL` | No | `CONVERSATIONAL_PROMPT` (reformulation of prior answer) |
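In code, the routing table above reduces to a single conditional-edge function. This is a hypothetical sketch of the decision logic, not the repository's exact implementation:

```python
def route_after_classify(state: dict) -> str:
    """Map the classifier's query_type to the next graph node (sketch)."""
    query_type = state.get("query_type", "RETRIEVAL")
    if query_type == "CONVERSATIONAL":
        return "respond_conversational"  # no retrieval needed
    # RETRIEVAL and CODE_GENERATION both pass through reformulate → retrieve
    return "reformulate"
```

In LangGraph, a function of this shape would be wired in via `add_conditional_edges` on the `classify` node.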
---
## 5. RAG Pipeline — Hybrid Search
The retrieval system (`hybrid_search_native`) fuses BM25 lexical search and kNN dense vector search using **Reciprocal Rank Fusion (RRF)**.
```
User query
  ├─ embeddings.embed_query(query) → query_vector [768-dim]
  ├─ ES multi_match (BM25) on fields [content^2, text^2]
  │    └─ top-k BM25 hits
  ├─ ES knn on field [embedding], num_candidates = k×5
  │    └─ top-k kNN hits
  ├─ RRF fusion: score(doc) = Σ 1/(rank + 60)
  └─ Top-8 documents → format_context() → context string
```
**RRF constant:** `60` (standard value; prevents high-rank documents from dominating while still rewarding consensus between both retrieval modes).
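The fusion step itself is only a few lines. The following is a self-contained sketch of the RRF formula above; the real `hybrid_search_native` operates on Elasticsearch hit objects rather than bare document IDs:

```python
def rrf_fuse(bm25_ids: list[str], knn_ids: list[str], k: int = 60, top_n: int = 8) -> list[str]:
    """Fuse two ranked ID lists with Reciprocal Rank Fusion: score = Σ 1/(rank + k)."""
    scores: dict[str, float] = {}
    for ranking in (bm25_ids, knn_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank + k)
    # Highest fused score first; documents found by both retrievers win
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

fused = rrf_fuse(["a", "b", "c"], ["b", "d"])
```

In this toy example `"b"` ranks first because it appears in both result lists, which is exactly the consensus effect RRF is chosen for.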
**Chunk metadata** attached to each retrieved document:
| Field | Description |
|---|---|
| `chunk_id` | Unique identifier within the index |
| `source_file` | Origin document filename |
| `doc_type` | `prose`, `code`, `code_example`, `bnf` |
| `block_type` | AVAP block type: `function`, `if`, `startLoop`, `try` |
| `section` | Document section/chapter heading |
Documents of type `code`, `code_example`, `bnf`, or block type `function / if / startLoop / try` are tagged as `[AVAP CODE]` in the formatted context, signaling the LLM to treat them as executable syntax rather than prose.
---
## 6. Streaming Architecture (AskAgentStream)
Understanding the two-phase streaming design is essential:
**Why not stream through LangGraph?**
LangGraph's `stream()` method yields full state snapshots per node, not individual tokens. To achieve true per-token streaming to the gRPC client, the generation step is deliberately extracted from the graph and called directly via `llm.stream()`.
**Phase 1 — Deterministic preparation (graph-managed):**
- Classification, query reformulation, and retrieval run through `prepare_graph.invoke()`.
- This phase runs synchronously and produces the complete context before any token is emitted to the client.
**Phase 2 — Token streaming (manual):**
- `build_final_messages()` reconstructs the exact prompt that `generate` / `generate_code` / `respond_conversational` would have used.
- `llm.stream(final_messages)` yields one `AIMessageChunk` per token from Ollama.
- Each token is immediately forwarded to the gRPC client as `AgentResponse{text=token, is_final=False}`.
- After the stream ends, the full assembled text is persisted to `session_store`.
**Backpressure:** gRPC streaming is flow-controlled by the client. If the client stops reading, the Ollama token stream will block at the `yield` point. No explicit buffer overflow protection is implemented (acceptable for the current single-client dev mode).
---
## 7. Evaluation Pipeline (EvaluateRAG)
The evaluation suite implements an **offline RAG evaluation** pattern using RAGAS metrics.
### Judge model separation
The production LLM (Ollama `qwen2.5:1.5b`) is used for **answer generation** — the same pipeline as production to measure real-world quality. Claude (`claude-sonnet-4-20250514`) is used as the **evaluation judge** — an independent, high-capability model that scores the generated answers against ground truth.
### RAGAS metrics
| Metric | Measures | Input |
|---|---|---|
| `faithfulness` | Are claims in the answer supported by the retrieved context? | answer + contexts |
| `answer_relevancy` | Is the answer relevant to the question? | answer + question |
| `context_recall` | Does the retrieved context cover the ground truth? | contexts + ground_truth |
| `context_precision` | Are the retrieved chunks useful (signal-to-noise)? | contexts + ground_truth |
### Global score & verdict
```
global_score = mean(non-zero metric scores)

verdict:
  ≥ 0.80 → EXCELLENT
  ≥ 0.60 → ACCEPTABLE
  < 0.60 → INSUFFICIENT
```
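As a sketch, the aggregation rule reads as follows (mirroring the thresholds above; the exact implementation lives in `evaluate.py`):

```python
def compute_verdict(metric_scores: dict[str, float]) -> tuple[float, str]:
    """Average the non-zero RAGAS metrics and map the mean to a verdict."""
    non_zero = [s for s in metric_scores.values() if s > 0]
    global_score = sum(non_zero) / len(non_zero) if non_zero else 0.0
    if global_score >= 0.80:
        verdict = "EXCELLENT"
    elif global_score >= 0.60:
        verdict = "ACCEPTABLE"
    else:
        verdict = "INSUFFICIENT"
    return global_score, verdict
```

Excluding zero scores prevents a single metric that failed to compute (e.g. a judge API error) from dragging the mean down.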
### Golden dataset
Located at `Docker/src/golden_dataset.json`. Each entry follows this schema:
```json
{
  "id": "avap-001",
  "category": "core_syntax",
  "question": "How do you declare a variable in AVAP?",
  "ground_truth": "Use addVar to declare a variable..."
}
```
---
## 8. Data Ingestion Pipeline
Documents flow into the Elasticsearch index through two paths:
### Path A — AVAP documentation (structured markdown)
```
docs/LRM/avap.md
docs/avap_language_github_docs/*.md
docs/developer.avapframework.com/*.md

scripts/pipelines/flows/elasticsearch_ingestion.py
  ├─ Load markdown files
  ├─ Chunk using scripts/pipelines/tasks/chunk.py
  │    (semantic chunking via Chonkie library)
  ├─ Generate embeddings via scripts/pipelines/tasks/embeddings.py
  │    (Ollama or HuggingFace embedding model)
  └─ Bulk index into Elasticsearch
       index: avap-docs-* (configurable via ELASTICSEARCH_INDEX)
       mapping: {content, embedding, source_file, doc_type, section, ...}
```
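The bulk-index step can be sketched as a generator of Elasticsearch bulk actions. The field names follow the mapping above, but the function, its ID scheme, and the default index name are hypothetical illustrations; the real flow lives in `elasticsearch_ingestion.py` and reads the index from `ELASTICSEARCH_INDEX`:

```python
def build_bulk_actions(chunks: list[dict], embeddings: list[list[float]],
                       index_name: str = "avap-docs-example"):
    """Pair each chunk with its embedding and yield one bulk action per chunk.

    index_name is a placeholder; production reads ELASTICSEARCH_INDEX.
    """
    for i, (chunk, vector) in enumerate(zip(chunks, embeddings)):
        yield {
            "_index": index_name,
            "_id": f"{chunk['source_file']}-{i}",  # hypothetical ID scheme
            "_source": {
                "content": chunk["content"],
                "embedding": vector,
                "source_file": chunk["source_file"],
                "doc_type": chunk.get("doc_type", "prose"),
                "section": chunk.get("section", ""),
            },
        }
```

A generator like this would typically be fed to the official client's `helpers.bulk()`.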
### Path B — Synthetic AVAP code samples
```
docs/samples/*.avap

scripts/pipelines/flows/generate_mbap.py
  ├─ Read AVAP LRM (docs/LRM/avap.md)
  ├─ Call Claude API to generate MBPP-style problems
  └─ Output synthetic_datasets/mbpp_avap.json
       (used for fine-tuning and few-shot examples)
```
---
## 9. Infrastructure Layout
### Devaron Cluster (Vultr Kubernetes)
| Service | K8s Name | Port | Purpose |
|---|---|---|---|
| LLM inference | `ollama-light-service` | `11434` | Text generation + embeddings |
| Vector database | `brunix-vector-db` | `9200` | Elasticsearch 8.x |
| Observability DB | `brunix-postgres` | `5432` | PostgreSQL for Langfuse |
| Langfuse UI | — | `80` | `http://45.77.119.180` |
### Kubernetes tunnel commands
```bash
# Terminal 1 — LLM
kubectl port-forward --address 0.0.0.0 svc/ollama-light-service 11434:11434 \
-n brunix --kubeconfig ./kubernetes/kubeconfig.yaml
# Terminal 2 — Elasticsearch
kubectl port-forward --address 0.0.0.0 svc/brunix-vector-db 9200:9200 \
-n brunix --kubeconfig ./kubernetes/kubeconfig.yaml
# Terminal 3 — PostgreSQL (Langfuse)
kubectl port-forward --address 0.0.0.0 svc/brunix-postgres 5432:5432 \
-n brunix --kubeconfig ./kubernetes/kubeconfig.yaml
```
### Port map summary
| Port | Protocol | Service | Scope |
|---|---|---|---|
| `50051` | gRPC | Brunix Engine (inside container) | Internal |
| `50052` | gRPC | Brunix Engine (host-mapped) | External |
| `8000` | HTTP | OpenAI proxy | External |
| `11434` | HTTP | Ollama (via tunnel) | Tunnel |
| `9200` | HTTP | Elasticsearch (via tunnel) | Tunnel |
| `5432` | TCP | PostgreSQL/Langfuse (via tunnel) | Tunnel |
---
## 10. Session State & Conversation Memory
Conversation history is managed via an in-process dictionary:
```python
from collections import defaultdict

session_store: dict[str, list] = defaultdict(list)
# key:   session_id (string, provided by client)
# value: list of LangChain BaseMessage objects
```
**Characteristics:**
- **In-memory only.** History is lost on container restart.
- **No TTL or eviction.** Sessions grow unbounded for the lifetime of the process.
- **Thread safety:** Python's GIL provides basic safety for the `ThreadPoolExecutor(max_workers=10)` gRPC server, but concurrent writes to the same `session_id` from two simultaneous requests are not explicitly protected.
- **History window:** `format_history_for_classify()` uses only the last 6 messages for query classification to keep the classify prompt short and deterministic.
> **Future work:** Replace `session_store` with a Redis-backed persistent store to survive restarts and support horizontal scaling.
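A minimal sketch of the concurrency caveat's fix: guard each session with its own lock. This is a hypothetical improvement, not something the current code does; under the GIL, `defaultdict.__missing__` makes lock creation effectively atomic here.

```python
import threading
from collections import defaultdict

session_store: dict[str, list] = defaultdict(list)
_session_locks: dict[str, threading.Lock] = defaultdict(threading.Lock)

def append_to_session(session_id: str, message) -> None:
    """Serialize concurrent writes to the same session_id (sketch)."""
    with _session_locks[session_id]:
        session_store[session_id].append(message)
```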
---
## 11. Observability Stack
### Langfuse tracing
The server integrates Langfuse for end-to-end LLM tracing. Every `AskAgent` / `AskAgentStream` request creates a trace that captures:
- Input query and session ID
- Each LangGraph node execution (classify, reformulate, retrieve, generate)
- LLM token counts, latency, and cost
- Final response
**Access:** `http://45.77.119.180` — requires a project API key configured via `LANGFUSE_PUBLIC_KEY` and `LANGFUSE_SECRET_KEY`.
### Logging
Structured logging via Python's `logging` module, configured at `INFO` level. Log format:
```
[MODULE] context_info — key=value key=value
```
Key log markers:
| Marker | Module | Meaning |
|---|---|---|
| `[ESEARCH]` | `server.py` | Elasticsearch connection status |
| `[classify]` | `graph.py` | Query type decision + raw LLM output |
| `[reformulate]` | `graph.py` | Reformulated query string |
| `[hybrid]` | `graph.py` | BM25 / kNN hit counts and RRF result count |
| `[retrieve]` | `graph.py` | Number of docs retrieved and context length |
| `[generate]` | `graph.py` | Response character count |
| `[AskAgentStream]` | `server.py` | Token count and total chars per stream |
| `[eval]` | `evaluate.py` | Per-question retrieval and generation status |
---
## 12. Security Boundaries
| Boundary | Current state | Risk |
|---|---|---|
| gRPC transport | **Insecure** (`add_insecure_port`) | Network interception possible. Acceptable in dev/tunnel setup; requires mTLS for production. |
| Elasticsearch auth | Optional (user/pass or API key via env vars) | Index is accessible without auth if `ELASTICSEARCH_USER` and `ELASTICSEARCH_API_KEY` are unset. |
| Container user | Non-root (`python:3.11-slim` default) | Low risk. Do not override with `root`. |
| Secrets in env | Via `.env` / `docker-compose` env injection | Never commit real values. See [CONTRIBUTING.md](../CONTRIBUTING.md#6-environment-variables-policy). |
| Session store | In-memory, no auth | Any caller with access to the gRPC port can read/write any session by guessing its ID. |
| Kubeconfig | `./kubernetes/kubeconfig.yaml` (local only) | Grants cluster access. Never commit. Listed in `.gitignore`. |
---
## 13. Known Limitations & Future Work
| Area | Limitation | Proposed solution |
|---|---|---|
| Session persistence | In-memory, lost on restart | Redis-backed `session_store` |
| Horizontal scaling | `session_store` is per-process | Sticky sessions or external session store |
| gRPC security | Insecure port | Add TLS + optional mTLS |
| Elasticsearch auth | Not enforced if vars unset | Make auth required; fail-fast on startup |
| Context window | Full history passed to generate; no truncation | Sliding window or summarization for long sessions |
| Evaluation | Golden dataset must be manually maintained | Automated golden dataset refresh pipeline |
| Rate limiting | None on gRPC server | Add interceptor-based rate limiter |
| Health check | No gRPC health protocol | Implement `grpc.health.v1` |