Brunix Assistance Engine — Architecture Reference
Audience: Engineers contributing to this repository, architects reviewing the system design, and operators responsible for its deployment.
Last updated: 2026-03-20. Version: 1.6.x
Table of Contents
- System Overview
- Component Inventory
- Request Lifecycle
- LangGraph Workflow
- RAG Pipeline — Hybrid Search
- Editor Context Pipeline
- Streaming Architecture (AskAgentStream)
- Evaluation Pipeline (EvaluateRAG)
- Data Ingestion Pipeline
- Infrastructure Layout
- Session State & Conversation Memory
- Observability Stack
- Security Boundaries
- Known Limitations & Future Work
1. System Overview
The Brunix Assistance Engine is a stateful, streaming-capable AI service that answers questions about the AVAP programming language. It combines:
- gRPC as the primary communication interface (port 50051 inside the container, 50052 on the host)
- LangGraph for deterministic, multi-step agentic orchestration
- Hybrid RAG (BM25 + kNN with RRF fusion) over an Elasticsearch vector index
- Ollama as the local LLM and embedding backend
- RAGAS + Claude as the automated evaluation judge
- Editor context injection — the VS Code extension can send active file content and selected code alongside each query; the engine decides whether to use it based on the user's intent
A secondary OpenAI-compatible HTTP proxy (port 8000) is served via FastAPI/Uvicorn, enabling integration with tools that expect the OpenAI API format.
┌─────────────────────────────────────────────────────────────┐
│ External Clients │
│ grpcurl / App SDK │ OpenAI-compatible client │
│ VS Code extension │ (continue.dev, LiteLLM) │
└────────────┬────────────────┴──────────────┬────────────────┘
│ gRPC :50052 │ HTTP :8000
▼ ▼
┌────────────────────────────────────────────────────────────┐
│ Docker Container │
│ │
│ ┌─────────────────────┐ ┌──────────────────────────┐ │
│ │ server.py (gRPC) │ │ openai_proxy.py (HTTP) │ │
│ │ BrunixEngine │ │ FastAPI / Uvicorn │ │
│ └──────────┬──────────┘ └──────────────────────────┘ │
│ │ │
│ ┌──────────▼──────────────────────────────────────────┐ │
│ │ LangGraph Orchestration │ │
│ │ classify → reformulate → retrieve → generate │ │
│ └──────────────────────────┬───────────────────────────┘ │
│ │ │
│ ┌───────────────────┼────────────────────┐ │
│ ▼ ▼ ▼ │
│ Ollama (LLM) Ollama (Embed) Elasticsearch │
│ via tunnel via tunnel via tunnel │
└────────────────────────────────────────────────────────────┘
│ kubectl port-forward tunnels │
▼ ▼
Devaron Cluster (Vultr Kubernetes)
ollama-light-service:11434 brunix-vector-db:9200
brunix-postgres:5432 Langfuse UI
2. Component Inventory
| Component | File / Service | Responsibility |
|---|---|---|
| gRPC Server | Docker/src/server.py | Entry point. Implements the AssistanceEngine servicer. Initializes LLM, embeddings, ES client, and both graphs. Decodes Base64 editor context fields from incoming requests. |
| Full Graph | Docker/src/graph.py → build_graph() | Complete workflow: classify → reformulate → retrieve → generate. Used by AskAgent and EvaluateRAG. |
| Prepare Graph | Docker/src/graph.py → build_prepare_graph() | Partial workflow: classify → reformulate → retrieve. Does not call the LLM for generation. Used by AskAgentStream to enable manual token streaming. |
| Message Builder | Docker/src/graph.py → build_final_messages() | Reconstructs the final prompt list from prepared state for llm.stream(). Injects editor context when use_editor_context is True. |
| Prompt Library | Docker/src/prompts.py | Centralized definitions for the CLASSIFY, REFORMULATE, GENERATE, CODE_GENERATION, and CONVERSATIONAL prompts. |
| Agent State | Docker/src/state.py | AgentState TypedDict shared across all graph nodes. Includes editor context fields and the use_editor_context flag. |
| Evaluation Suite | Docker/src/evaluate.py | RAGAS-based pipeline. Uses the production retriever + Ollama LLM for generation, and Claude as the impartial judge. |
| OpenAI Proxy | Docker/src/openai_proxy.py | FastAPI application that wraps AskAgent / AskAgentStream under OpenAI- and Ollama-compatible endpoints. Parses editor context from the user field. |
| LLM Factory | Docker/src/utils/llm_factory.py | Provider-agnostic factory for chat models (Ollama, AWS Bedrock). |
| Embedding Factory | Docker/src/utils/emb_factory.py | Provider-agnostic factory for embedding models (Ollama, HuggingFace). |
| Ingestion Pipeline | scripts/pipelines/flows/elasticsearch_ingestion.py | Chunks and ingests AVAP documents into Elasticsearch with embeddings. |
| AVAP Chunker | scripts/pipelines/ingestion/avap_chunker.py | Semantic chunker for .avap source files using avap_config.json as its grammar. |
| Unit Tests | Docker/tests/test_prd_0002.py | 40 unit tests covering editor context parsing, Base64 decoding, classifier output, the reformulate anchor, and injection logic. |
3. Request Lifecycle
3.1 AskAgent (non-streaming)
Client → gRPC AgentRequest{query, session_id, editor_content*, selected_text*, extra_context*, user_info*}
│ (* Base64-encoded; user_info is JSON string)
│
├─ Decode Base64 fields (editor_content, selected_text, extra_context)
├─ Load conversation history from session_store[session_id]
├─ Build initial_state = {messages, session_id, editor_content, selected_text, extra_context, user_info}
│
└─ graph.invoke(initial_state)
├─ classify → query_type ∈ {RETRIEVAL, CODE_GENERATION, CONVERSATIONAL}
│ use_editor_context ∈ {True, False}
├─ reformulate → reformulated_query
│ (anchored to selected_text if use_editor_context=True)
├─ retrieve → context (top-8 hybrid RRF chunks)
└─ generate → final AIMessage
(editor context injected only if use_editor_context=True)
│
├─ Persist updated history to session_store[session_id]
└─ yield AgentResponse{text, avap_code="AVAP-2026", is_final=True}
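A minimal sketch of the decode and state-building steps above; the helper names decode_b64_field and build_initial_state are illustrative, not necessarily the ones used in server.py:

```python
import base64
import logging

from langchain_core.messages import HumanMessage

logger = logging.getLogger(__name__)

def decode_b64_field(value: str, name: str) -> str:
    """Decode a Base64-encoded request field, returning '' on failure (logged as [base64])."""
    if not value:
        return ""
    try:
        return base64.b64decode(value).decode("utf-8")
    except Exception:
        logger.warning("[base64] failed to decode field %s", name)
        return ""

def build_initial_state(request, history: list) -> dict:
    """Assemble the initial graph state from a decoded AgentRequest and prior session history."""
    return {
        "messages": history + [HumanMessage(content=request.query)],
        "session_id": request.session_id,
        "editor_content": decode_b64_field(request.editor_content, "editor_content"),
        "selected_text": decode_b64_field(request.selected_text, "selected_text"),
        "extra_context": decode_b64_field(request.extra_context, "extra_context"),
        "user_info": request.user_info,  # JSON string, passed through as-is
    }
```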
3.2 AskAgentStream (token streaming)
Client → gRPC AgentRequest{query, session_id, editor_content*, selected_text*, extra_context*, user_info*}
│
├─ Decode Base64 fields
├─ Load history from session_store[session_id]
├─ Build initial_state
│
├─ prepare_graph.invoke(initial_state) ← Phase 1: no LLM generation
│ ├─ classify → query_type + use_editor_context
│ ├─ reformulate
│ └─ retrieve (or skip_retrieve if CONVERSATIONAL)
│
├─ build_final_messages(prepared_state) ← Reconstruct prompt with editor context if flagged
│
└─ for chunk in llm.stream(final_messages):
└─ yield AgentResponse{text=token, is_final=False}
│
├─ Persist full assembled response to session_store
└─ yield AgentResponse{text="", is_final=True}
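The two phases above can be sketched roughly as follows, assuming a LangChain chat model whose stream() yields AIMessageChunk objects; the yielded dicts stand in for gRPC AgentResponse messages:

```python
def ask_agent_stream(prepare_graph, llm, build_final_messages, initial_state):
    """Phase 1: run the prepare graph (no generation). Phase 2: stream tokens manually."""
    prepared_state = prepare_graph.invoke(initial_state)   # classify → reformulate → retrieve
    final_messages = build_final_messages(prepared_state)  # re-injects editor context if flagged

    assembled = []
    for chunk in llm.stream(final_messages):                # one AIMessageChunk per token
        assembled.append(chunk.content)
        yield {"text": chunk.content, "is_final": False}

    # The caller persists "".join(assembled) to session_store, then sends a final empty frame.
    yield {"text": "", "is_final": True}
```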
3.3 HTTP Proxy → gRPC
Client → POST /v1/chat/completions {messages, stream, session_id, user}
│
├─ Extract query from last user message in messages[]
├─ Read session_id from dedicated field (NOT from user)
├─ Parse user field as JSON → {editor_content, selected_text, extra_context, user_info}
│
├─ stream=false → _invoke_blocking() → AskAgent gRPC call
└─ stream=true → _iter_stream() → AskAgentStream gRPC call → SSE token stream
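A hedged sketch of the user-field parsing step; openai_proxy.py may structure this differently:

```python
import json

def parse_user_field(user: str | None) -> dict:
    """Extract editor context keys from the OpenAI-compatible 'user' field (a JSON string)."""
    # session_id is NOT read from here; it comes from the dedicated request field.
    defaults = {"editor_content": "", "selected_text": "", "extra_context": "", "user_info": ""}
    if not user:
        return defaults
    try:
        payload = json.loads(user)
    except json.JSONDecodeError:
        return defaults  # a plain (non-JSON) user id carries no editor context
    return {key: payload.get(key, defaults[key]) for key in defaults}
```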
3.4 EvaluateRAG
Client → gRPC EvalRequest{category?, limit?, index?}
│
└─ evaluate.run_evaluation(...)
├─ Load golden_dataset.json
├─ Filter by category / limit
├─ For each question:
│ ├─ retrieve_context (hybrid BM25+kNN, same as production)
│ └─ generate_answer (Ollama LLM + GENERATE_PROMPT)
├─ Build RAGAS Dataset
├─ Run RAGAS metrics with Claude as judge
└─ Compute global_score + verdict
│
└─ return EvalResponse{scores, global_score, verdict, details[]}
4. LangGraph Workflow
4.1 Agent State
class AgentState(TypedDict):
messages: Annotated[list, add_messages] # conversation history
session_id: str
query_type: str # RETRIEVAL | CODE_GENERATION | CONVERSATIONAL
reformulated_query: str
context: str # formatted RAG context string
editor_content: str # decoded from Base64
selected_text: str # decoded from Base64
extra_context: str # decoded from Base64
user_info: str # JSON string: {"dev_id", "project_id", "org_id"}
use_editor_context: bool # set by classifier — True only if query explicitly refers to editor
4.2 Full Graph (build_graph)
┌─────────────┐
│ classify │ ← sees: query + history + selected_text (if present)
│ │ outputs: query_type + use_editor_context
└──────┬──────┘
│
┌────────────────┼──────────────────┐
▼ ▼ ▼
RETRIEVAL CODE_GENERATION CONVERSATIONAL
│ │ │
└────────┬───────┘ │
▼ ▼
┌──────────────┐ ┌────────────────────────┐
│ reformulate │ │ respond_conversational │
│ │ └───────────┬────────────┘
│ if use_editor│ │
│ anchor query │ │
│ to selected │ │
└──────┬───────┘ │
▼ │
┌──────────────┐ │
│ retrieve │ │
└──────┬───────┘ │
│ │
┌────────┴───────────┐ │
▼ ▼ │
┌──────────┐ ┌───────────────┐ │
│ generate │ │ generate_code │ │
│ │ │ │ │
│ injects │ │ injects editor│ │
│ editor │ │ context only │ │
│ context │ │ if flag=True │ │
│ if flag │ └───────┬───────┘ │
└────┬─────┘ │ │
│ │ │
└────────────────────┴────────────────┘
│
END
4.3 Prepare Graph (build_prepare_graph)
Identical routing for classify, but generation nodes are replaced by END. The CONVERSATIONAL branch uses skip_retrieve (returns empty context). The use_editor_context flag is set here and carried forward into build_final_messages.
4.4 Classifier — Two-Token Output
The classifier outputs exactly two tokens separated by a space:
<query_type> <editor_signal>
Examples:
RETRIEVAL NO_EDITOR
CODE_GENERATION EDITOR
CONVERSATIONAL NO_EDITOR
EDITOR is set only when the user message explicitly refers to editor code using expressions such as "this code", "este codigo" ("this code"), "fix this", "que hace esto" ("what does this do"), "explain this", etc. General AVAP questions, code generation requests, and conversational follow-ups always return NO_EDITOR.
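One way to parse the two-token output into state fields is sketched below; the fallback to CONVERSATIONAL on malformed output is an assumption, not documented behavior of graph.py:

```python
def parse_classifier_output(raw: str) -> tuple[str, bool]:
    """Split the classifier's two-token output into (query_type, use_editor_context)."""
    tokens = raw.strip().split()
    query_type = tokens[0] if tokens else "CONVERSATIONAL"
    if query_type not in {"RETRIEVAL", "CODE_GENERATION", "CONVERSATIONAL"}:
        query_type = "CONVERSATIONAL"  # defensive fallback for malformed output
    use_editor_context = len(tokens) > 1 and tokens[1] == "EDITOR"
    return query_type, use_editor_context

# parse_classifier_output("CODE_GENERATION EDITOR") → ("CODE_GENERATION", True)
```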
4.5 Query Type Routing
| query_type | Triggers retrieve? | Generation prompt | Editor context injected? |
|---|---|---|---|
| RETRIEVAL | Yes | GENERATE_PROMPT | Only if use_editor_context=True |
| CODE_GENERATION | Yes | CODE_GENERATION_PROMPT | Only if use_editor_context=True |
| CONVERSATIONAL | No | CONVERSATIONAL_PROMPT | Never |
4.6 Reformulator — Mode-Aware & Language-Preserving
The reformulator receives [MODE: <query_type>] prepended to the query:
- MODE RETRIEVAL: Compresses the query into compact keywords. Does NOT expand with AVAP commands. Preserves original language — Spanish queries stay in Spanish, English queries stay in English.
- MODE CODE_GENERATION: Applies the AVAP command expansion mapping (registerEndpoint, addParam, ormAccessSelect, etc.).
- MODE CONVERSATIONAL: Returns the query as-is.
Language preservation is critical for BM25 retrieval — the AVAP LRM is written in Spanish, so a Spanish query must reach the retriever in Spanish for lexical matching to work correctly.
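A sketch of how the mode marker might be prepended, assuming LangChain message types; the function name is illustrative:

```python
from langchain_core.messages import HumanMessage, SystemMessage

def build_reformulate_messages(query_type: str, query: str, reformulate_prompt: str) -> list:
    """Prepend the [MODE: <query_type>] marker so the reformulator can branch on intent."""
    return [
        SystemMessage(content=reformulate_prompt),  # REFORMULATE prompt from prompts.py
        HumanMessage(content=f"[MODE: {query_type}]\n{query}"),
    ]

# MODE RETRIEVAL        → compress to keywords, keep the original language (Spanish stays Spanish)
# MODE CODE_GENERATION  → expand with AVAP command vocabulary (registerEndpoint, addParam, ...)
# MODE CONVERSATIONAL   → pass the query through unchanged
```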
5. RAG Pipeline — Hybrid Search
The retrieval system (hybrid_search_native) fuses BM25 lexical search and kNN dense vector search using Reciprocal Rank Fusion (RRF).
User query (reformulated, language-preserved)
│
├─ embeddings.embed_query(query) → query_vector [1024-dim]
│
├─ ES bool query:
│ ├─ must: multi_match (BM25) on [content^2, text^2]
│ └─ should: boost spec/narrative doc_types (2.0x / 1.5x)
│ └─ top-k BM25 hits
│
└─ ES knn on field [embedding], num_candidates = k×5
└─ top-k kNN hits
│
├─ RRF fusion: score(doc) = Σ 1/(rank + 60)
│
└─ Top-8 documents → format_context() → context string
RRF constant: 60 (standard value).
doc_type boost: spec and narrative chunks receive a score boost in the BM25 query to prioritize definitional and explanatory content over raw code examples when the query is about meaning or documentation.
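A sketch of the RRF step, assuming each hit is an Elasticsearch hit dict keyed by _id; hybrid_search_native may implement the fusion differently:

```python
def rrf_fuse(bm25_hits: list[dict], knn_hits: list[dict], k: int = 60, top_n: int = 8) -> list[dict]:
    """Reciprocal Rank Fusion: score(doc) = sum over result lists of 1 / (rank + k)."""
    scores: dict[str, float] = {}
    docs: dict[str, dict] = {}
    for hits in (bm25_hits, knn_hits):
        for rank, hit in enumerate(hits, start=1):
            doc_id = hit["_id"]
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank + k)
            docs[doc_id] = hit
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [docs[doc_id] for doc_id in ranked[:top_n]]
```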
Chunk metadata attached to each retrieved document:
| Field | Description |
|---|---|
| chunk_id | Unique identifier within the index |
| source_file | Origin document filename |
| doc_type | spec, code, code_example, bnf |
| block_type | AVAP block type: narrative, function, if, startLoop, try, etc. |
| section | Document section/chapter heading |
6. Editor Context Pipeline
The editor context pipeline (PRD-0002) allows the VS Code extension to send the user's active editor state alongside every query. The engine uses this context only when the user explicitly refers to their code.
Transport
Editor context travels differently depending on the client protocol:
Via gRPC directly (AgentRequest fields 3–6; a client-side encoding sketch follows these lists):
- editor_content (field 3) — Base64-encoded full file content
- selected_text (field 4) — Base64-encoded selected text
- extra_context (field 5) — Base64-encoded free-form context
- user_info (field 6) — JSON string {"dev_id":…, "project_id":…, "org_id":…}
Via HTTP proxy (OpenAI /v1/chat/completions):
- Transported in the standard user field as a JSON string
- Same four keys, same encodings
- The proxy parses, extracts, and forwards them to gRPC
Pipeline
AgentRequest arrives
│
├─ server.py: Base64 decode editor_content, selected_text, extra_context
├─ user_info passed as-is (JSON string)
│
└─ initial_state populated with all four fields
│
▼
classify node:
├─ If selected_text present → injected into classify prompt as <editor_selection>
├─ LLM outputs: RETRIEVAL EDITOR or RETRIEVAL NO_EDITOR (etc.)
└─ use_editor_context = True if second token == EDITOR
│
▼
reformulate node:
├─ If use_editor_context=True AND selected_text present:
│ anchor = selected_text + "\n\nUser question: " + query
│ → LLM reformulates using selected code as primary signal
└─ Else: reformulate query as normal
│
▼
retrieve node: (unchanged — uses reformulated_query)
│
▼
generate / generate_code node:
├─ If use_editor_context=True:
│ prompt = <selected_code> + <editor_file> + <extra_context> + RAG_prompt
│ Priority: selected_text > editor_content > RAG context > extra_context
└─ Else: standard RAG prompt — no editor content injected
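The injection step at the end of this flow could look roughly like the sketch below; the tag names follow the diagram above, but the exact prompt assembly in graph.py may differ:

```python
def build_generation_prompt(state: dict, rag_prompt: str) -> str:
    """Assemble the generation prompt, injecting editor context only when the classifier flagged it."""
    if not state.get("use_editor_context"):
        return rag_prompt  # standard RAG prompt, no editor content injected
    sections = []
    if state.get("selected_text"):
        sections.append(f"<selected_code>\n{state['selected_text']}\n</selected_code>")
    if state.get("editor_content"):
        sections.append(f"<editor_file>\n{state['editor_content']}\n</editor_file>")
    if state.get("extra_context"):
        sections.append(f"<extra_context>\n{state['extra_context']}\n</extra_context>")
    sections.append(rag_prompt)
    return "\n\n".join(sections)
```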
Intent detection examples
| User message | use_editor_context | Reason |
|---|---|---|
| "Que significa AVAP?" ("What does AVAP mean?") | False | General definition question |
| "dame un API de hello world" ("give me a hello world API") | False | Code generation, no editor reference |
| "que hace este codigo?" ("what does this code do?") | True | Explicit reference to "this code" |
| "fix this" | True | Explicit reference to current selection |
| "como mejoro esto?" ("how do I improve this?") | True | Explicit reference to current context |
| "how does addVar work?" | False | Documentation question, no editor reference |
7. Streaming Architecture (AskAgentStream)
The two-phase streaming design is critical to understand:
Why not stream through LangGraph?
LangGraph's stream() method yields full state snapshots per node, not individual tokens. To achieve true per-token streaming to the gRPC client, the generation step is deliberately extracted from the graph and called directly via llm.stream().
Phase 1 — Deterministic preparation (graph-managed):
Classification, query reformulation, and retrieval run through prepare_graph.invoke(). This phase runs synchronously and produces the complete context before any token is emitted to the client. Editor context classification also happens here — use_editor_context is set in the prepared state.
Phase 2 — Token streaming (manual):
build_final_messages() reconstructs the exact prompt, injecting editor context if use_editor_context is True. llm.stream(final_messages) yields one AIMessageChunk per token from Ollama. Each token is immediately forwarded as AgentResponse{text=token, is_final=False}.
Backpressure: gRPC streaming is flow-controlled by the client. If the client stops reading, the Ollama token stream will block at the yield point.
8. Evaluation Pipeline (EvaluateRAG)
The evaluation suite implements an offline RAG evaluation pattern using RAGAS metrics.
Judge model separation
The production LLM (Ollama qwen2.5:1.5b) is used for answer generation — the same pipeline as production to measure real-world quality. Claude (claude-sonnet-4-20250514) is used as the evaluation judge — an independent, high-capability model that scores the generated answers against ground truth.
RAGAS metrics
| Metric | Measures | Input |
|---|---|---|
| faithfulness | Are claims in the answer supported by the retrieved context? | answer + contexts |
| answer_relevancy | Is the answer relevant to the question? | answer + question |
| context_recall | Does the retrieved context cover the ground truth? | contexts + ground_truth |
| context_precision | Are the retrieved chunks useful (signal-to-noise)? | contexts + ground_truth |
Global score & verdict
global_score = mean(non-zero metric scores)
verdict:
≥ 0.80 → EXCELLENT
≥ 0.60 → ACCEPTABLE
< 0.60 → INSUFFICIENT
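A sketch of the scoring rule, assuming evaluate.py averages only the non-zero metric values as stated above:

```python
def compute_verdict(scores: dict[str, float]) -> tuple[float, str]:
    """Average the non-zero RAGAS metric scores and map the result to a verdict."""
    non_zero = [value for value in scores.values() if value > 0]
    global_score = sum(non_zero) / len(non_zero) if non_zero else 0.0
    if global_score >= 0.80:
        verdict = "EXCELLENT"
    elif global_score >= 0.60:
        verdict = "ACCEPTABLE"
    else:
        verdict = "INSUFFICIENT"
    return global_score, verdict

# compute_verdict({"faithfulness": 0.9, "answer_relevancy": 0.7,
#                  "context_recall": 0.0, "context_precision": 0.8})
# → approximately (0.8, "EXCELLENT")
```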
Golden dataset
Located at Docker/src/golden_dataset.json. Each entry:
{
"id": "avap-001",
"category": "core_syntax",
"question": "How do you declare a variable in AVAP?",
"ground_truth": "Use addVar to declare a variable..."
}
Note: The golden dataset does not include editor-context queries. EvaluateRAG measures the RAG pipeline in isolation. A separate editor-context golden dataset is planned as future work once the VS Code extension is validated.
9. Data Ingestion Pipeline
Documents flow into the Elasticsearch index through two paths:
Path A — AVAP documentation (structured markdown)
docs/LRM/avap.md
docs/avap_language_github_docs/*.md
docs/developer.avapframework.com/*.md
│
▼
scripts/pipelines/flows/elasticsearch_ingestion.py
│
├─ Load markdown files
├─ Chunk using scripts/pipelines/tasks/chunk.py
├─ Generate embeddings via scripts/pipelines/tasks/embeddings.py
└─ Bulk index into Elasticsearch
Path B — AVAP native code chunker
docs/samples/*.avap
│
▼
scripts/pipelines/ingestion/avap_chunker.py
│ (grammar: scripts/pipelines/ingestion/avap_config.json v2.0)
│
├─ Lexer strips comments and string contents
├─ Block detection (function, if, startLoop, try)
├─ Statement classification (30 types + catch-all)
├─ Semantic tag assignment (18 boolean tags)
└─ Output: JSONL chunks → avap_ingestor.py → Elasticsearch
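The final bulk-index step shared by both paths could be sketched as follows, assuming the standard elasticsearch-py bulk helper and an embed_fn wrapping the embedding factory; field names follow the chunk metadata table in Section 5:

```python
from elasticsearch import Elasticsearch, helpers

def bulk_index_chunks(es: Elasticsearch, index: str, chunks: list[dict], embed_fn) -> int:
    """Embed each chunk and bulk-index it with its metadata (doc_type, block_type, section, ...)."""
    actions = (
        {
            "_index": index,
            "_id": chunk["chunk_id"],
            "_source": {**chunk, "embedding": embed_fn(chunk["content"])},
        }
        for chunk in chunks
    )
    success, _errors = helpers.bulk(es, actions)
    return success
```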
10. Infrastructure Layout
Devaron Cluster (Vultr Kubernetes)
| Service | K8s Name | Port | Purpose |
|---|---|---|---|
| LLM inference | ollama-light-service | 11434 | Text generation + embeddings |
| Vector database | brunix-vector-db | 9200 | Elasticsearch 8.x |
| Observability DB | brunix-postgres | 5432 | PostgreSQL for Langfuse |
| Langfuse UI | — | 80 | http://45.77.119.180 |
Port map summary
| Port | Protocol | Service | Scope |
|---|---|---|---|
| 50051 | gRPC | Brunix Engine (inside container) | Internal |
| 50052 | gRPC | Brunix Engine (host-mapped) | External |
| 8000 | HTTP | OpenAI proxy | External |
| 11434 | HTTP | Ollama (via tunnel) | Tunnel |
| 9200 | HTTP | Elasticsearch (via tunnel) | Tunnel |
| 5432 | TCP | PostgreSQL/Langfuse (via tunnel) | Tunnel |
11. Session State & Conversation Memory
Conversation history is managed via an in-process dictionary:
session_store: dict[str, list] = defaultdict(list)
# key: session_id (string, provided by client)
# value: list of LangChain BaseMessage objects
Characteristics:
- In-memory only. History is lost on container restart.
- No TTL or eviction. Sessions grow unbounded for the lifetime of the process.
- Thread safety: Python's GIL provides basic safety for the ThreadPoolExecutor(max_workers=10) gRPC server, but concurrent writes to the same session_id from two simultaneous requests are not explicitly protected.
- History window: format_history_for_classify() uses only the last 6 messages for query classification.
Future work: Replace session_store with a Redis-backed persistent store to survive restarts and support horizontal scaling.
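As an illustration of that proposal (not existing code), a Redis-backed store might look like this, assuming the redis client and the langchain_core message serializers:

```python
import json

import redis
from langchain_core.messages import messages_from_dict, messages_to_dict

class RedisSessionStore:
    """Illustrative replacement for the in-process session_store dictionary."""

    def __init__(self, url: str = "redis://localhost:6379/0", ttl_seconds: int = 86400):
        self._client = redis.Redis.from_url(url)
        self._ttl = ttl_seconds  # adds the eviction policy the in-memory dict lacks

    def load(self, session_id: str) -> list:
        raw = self._client.get(f"session:{session_id}")
        return messages_from_dict(json.loads(raw)) if raw else []

    def save(self, session_id: str, messages: list) -> None:
        payload = json.dumps(messages_to_dict(messages))
        self._client.set(f"session:{session_id}", payload, ex=self._ttl)
```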
12. Observability Stack
Langfuse tracing
Every AskAgent / AskAgentStream request creates a trace capturing input query, session ID, each LangGraph node execution, LLM token counts, latency, and final response.
Access: http://45.77.119.180
Logging
Key log markers:
| Marker | Module | Meaning |
|---|---|---|
| [ESEARCH] | server.py | Elasticsearch connection status |
| [classify] | graph.py | Query type + use_editor_context flag + raw LLM output |
| [reformulate] | graph.py | Reformulated query string + whether selected_text was used as anchor |
| [hybrid] | graph.py | BM25 / kNN hit counts and RRF result count |
| [retrieve] | graph.py | Number of docs retrieved and context length |
| [generate] | graph.py | Response character count |
| [AskAgent] | server.py | Editor and selected flags, query preview |
| [AskAgentStream] | server.py | Token count and total chars per stream |
| [base64] | server.py | Warning when a Base64 field fails to decode |
13. Security Boundaries
| Boundary | Current state | Risk |
|---|---|---|
| gRPC transport | Insecure (add_insecure_port) | Network interception possible. Acceptable in dev/tunnel setup; requires mTLS for production. |
| Elasticsearch auth | Optional (user/pass or API key via env vars) | Index is accessible without auth if vars are unset. |
| Editor context | Transmitted in plaintext (Base64 is encoding, not encryption) | File contents visible to anyone intercepting gRPC traffic. Requires TLS for production. |
| Container user | Non-root (python:3.11-slim default) | Low risk. Do not override with root. |
| Secrets in env | Via .env / docker-compose env injection | Never commit real values. |
| Session store | In-memory, no auth | Any caller with gRPC access can read/write any session by guessing its ID. |
| user_info | JSON string, no validation | dev_id, project_id, org_id are not authenticated — passed as metadata only. |
14. Known Limitations & Future Work
| Area | Limitation | Proposed solution |
|---|---|---|
| Session persistence | In-memory, lost on restart | Redis-backed session_store |
| Horizontal scaling | session_store is per-process | Sticky sessions or an external session store |
| gRPC security | Insecure port | Add TLS + optional mTLS |
| Editor context security | Base64 is not encryption | TLS required before sending real file contents |
| user_info auth | Not validated or authenticated | JWT or API key validation on user_info fields |
| Elasticsearch auth | Not enforced if vars unset | Make auth required; fail fast on startup |
| Context window | Full history passed to generate; no truncation | Sliding window or summarization for long sessions |
| Evaluation | Golden dataset has no editor-context queries | Build a dedicated editor-context golden dataset after VS Code validation |
| Rate limiting | None on gRPC server | Add an interceptor-based rate limiter |
| Health check | No gRPC health protocol | Implement grpc.health.v1 |