Brunix Assistance Engine — Architecture Reference
Audience: Engineers contributing to this repository, architects reviewing the system design, and operators responsible for its deployment.
Last updated: 2026-04-09 · Version: 1.8.x · Architect: Rafael Ruiz (CTO, 101OBEX Corp)
Related ADRs: ADR-0001 · ADR-0002 · ADR-0003 · ADR-0004 · ADR-0005 · ADR-0006 · ADR-0007 · ADR-0008
Related PRDs: PRD-0001 · PRD-0002 · PRD-0003
Table of Contents
- System Classification
- Infrastructure Layout
- Component Inventory
- Request Lifecycle
- Query Classification & Routing
- LangGraph Workflow
- RAG Pipeline — Hybrid Search
- Streaming Architecture
- Editor Context Pipeline
- Platform Query Pipeline
- Session State & Intent History
- Evaluation Pipeline
- Data Ingestion Pipeline
- Observability Stack
- Known Limitations & Future Work
1. System Classification
The Brunix Assistance Engine is a Modular Agentic RAG system.
| Characteristic | Classification | Evidence |
|---|---|---|
| Pipeline structure | Agentic RAG | LangGraph classifier routes each query to a different pipeline — not a fixed retrieve-then-generate chain |
| Retrieval strategy | Advanced RAG | Hybrid BM25+kNN with RRF fusion, query reformulation before retrieval, two-phase streaming |
| Architecture style | Modular RAG | Each node (classify, reformulate, retrieve, generate) is independently replaceable; routing contract is implementation-independent |
| Self-evaluation | Not yet active | CONFIDENCE_PROMPT_TEMPLATE exists but is not wired into the graph — Self-RAG capability is scaffolded |
What it is not: naive RAG (no fixed pipeline), Graph RAG (no knowledge graph), Self-RAG (no confidence feedback loop, yet).
The classifier is the central architectural element. It determines which pipeline executes for each request — including whether retrieval happens at all.
2. Infrastructure Layout
```mermaid
graph TD
    subgraph Clients ["External Clients"]
        VSC["VS Code Extension"]
        AVS["AVS Platform"]
        DEV["grpcurl / tests"]
    end
    subgraph Docker ["Docker — brunix-assistance-engine"]
        PROXY["openai_proxy.py\n:8000 HTTP/SSE\nOpenAI + Ollama compatible"]
        SERVER["server.py\n:50051 gRPC internal\nAskAgent · AskAgentStream · EvaluateRAG"]
        GRAPH["graph.py — LangGraph\nclassify → route → execute"]
    end
    subgraph Mac ["Developer Machine (macOS)"]
        Docker
        subgraph Tunnels ["kubectl port-forward tunnels"]
            T1["localhost:11434 → Ollama"]
            T2["localhost:9200 → Elasticsearch"]
            T3["localhost:5432 → Postgres"]
        end
    end
    subgraph Vultr ["Vultr — Devaron K8s Cluster"]
        OL["ollama-light-service\nqwen3:1.7b · qwen3:0.6b · bge-m3"]
        ES["brunix-vector-db\nElasticsearch :9200\navap-knowledge-v2"]
        PG["brunix-postgres\nPostgres :5432\nLangfuse data"]
        LF["Langfuse UI\n45.77.119.180:80\nObservability + tracing"]
    end
    VSC -->|"HTTP/SSE :8000"| PROXY
    AVS -->|"HTTP/SSE :8000"| PROXY
    DEV -->|"gRPC :50052"| SERVER
    PROXY -->|"gRPC internal"| SERVER
    SERVER --> GRAPH
    GRAPH -->|"host.docker.internal:11434"| T1
    GRAPH -->|"host.docker.internal:9200"| T2
    SERVER -->|"host.docker.internal:5432"| T3
    T1 -->|"secure tunnel"| OL
    T2 -->|"secure tunnel"| ES
    T3 -->|"secure tunnel"| PG
    SERVER -->|"traces HTTP"| LF
    DEV -->|"browser direct"| LF
```
Key networking detail: Docker does not talk to the kubectl tunnels directly. The path is:
Docker container
→ host.docker.internal (resolves to macOS host via extra_hosts)
→ kubectl port-forward (active on macOS)
→ Vultr K8s service
Langfuse is the exception — it has a public IP (45.77.119.180) and is accessed directly, without a tunnel.
3. Component Inventory
| Component | File | Role |
|---|---|---|
| gRPC server | `server.py` | Entry point for all AI requests. Manages session store, model selection, and state initialization |
| HTTP proxy | `openai_proxy.py` | OpenAI + Ollama compatible HTTP layer. Translates REST → gRPC |
| LangGraph orchestrator | `graph.py` | Builds and executes the agentic routing graph |
| Prompt definitions | `prompts.py` | All prompt templates in one place: classifier, reformulator, generators, platform |
| Agent state | `state.py` | `AgentState` TypedDict shared across all graph nodes |
| LLM factory | `utils/llm_factory.py` | Provider-agnostic model instantiation (Ollama, OpenAI, Anthropic, Bedrock) |
| Embedding factory | `utils/emb_factory.py` | Provider-agnostic embedding model instantiation |
| Evaluation pipeline | `evaluate.py` | RAGAS evaluation with Claude as judge |
| Proto contract | `protos/brunix.proto` | Source of truth for the gRPC API |
Model slots:
| Slot | Env var | Used for | Current model |
|---|---|---|---|
| Main | `OLLAMA_MODEL_NAME` | RETRIEVAL, CODE_GENERATION, classification | qwen3:1.7b |
| Conversational | `OLLAMA_MODEL_NAME_CONVERSATIONAL` | CONVERSATIONAL, PLATFORM | qwen3:0.6b |
| Embeddings | `OLLAMA_EMB_MODEL_NAME` | Query embedding, document indexing | bge-m3 |
| Evaluation judge | `ANTHROPIC_MODEL` | RAGAS scoring | claude-sonnet-4-20250514 |
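As an illustration of how these slots resolve, here is a minimal sketch. The `resolve_slot` helper and the `DEFAULTS` table are hypothetical; the real factory lives in `utils/llm_factory.py`.

```python
import os

# Hypothetical sketch of slot resolution from the environment.
# Names mirror the table above, not the actual llm_factory API.
DEFAULTS = {
    "OLLAMA_MODEL_NAME": "qwen3:1.7b",
    "OLLAMA_MODEL_NAME_CONVERSATIONAL": "qwen3:0.6b",
    "OLLAMA_EMB_MODEL_NAME": "bge-m3",
}

def resolve_slot(env_var: str) -> str:
    """Return the configured model for a slot, falling back to the documented default."""
    return os.environ.get(env_var, DEFAULTS[env_var])
```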
4. Request Lifecycle
Two entry paths, both ending at the same LangGraph:
```mermaid
sequenceDiagram
    participant C as Client
    participant P as openai_proxy.py :8000
    participant S as server.py :50051
    participant G as graph.py
    participant O as Ollama (via tunnel)
    participant E as Elasticsearch (via tunnel)
    C->>P: POST /v1/chat/completions
    P->>P: parse user field (editor_content, selected_text, extra_context, user_info)
    P->>S: gRPC AskAgent / AskAgentStream
    S->>S: base64 decode context fields
    S->>S: load session_store + classify_history_store
    S->>G: invoke graph with AgentState
    G->>O: classify (LLM call)
    O-->>G: query_type + use_editor_context
    alt RETRIEVAL or CODE_GENERATION
        G->>O: reformulate query
        O-->>G: reformulated_query
        G->>O: embed reformulated_query
        O-->>G: query_vector
        G->>E: BM25 search + kNN search
        E-->>G: ranked chunks (RRF fusion)
        G->>O: generate with context
        O-->>G: response
    else CONVERSATIONAL
        G->>O: respond_conversational (no retrieval)
        O-->>G: response
    else PLATFORM
        G->>O: respond_platform (no retrieval, uses extra_context)
        O-->>G: response
    end
    G-->>S: final_state
    S->>S: update session_store + classify_history_store
    S-->>P: AgentResponse stream
    P-->>C: SSE / JSON response
```
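The base64 decode step in the lifecycle above can be sketched as follows; `decode_context_field` is an illustrative helper, not the actual server.py function, and error handling is omitted.

```python
import base64

# Sketch of the decode server.py performs on the optional context fields
# (editor_content, selected_text, extra_context). Helper name is hypothetical.
def decode_context_field(value: str) -> str:
    """Decode a base64-encoded context field; empty fields pass through unchanged."""
    if not value:
        return ""
    return base64.b64decode(value).decode("utf-8")

encoded = base64.b64encode("print('hola')".encode()).decode()
assert decode_context_field(encoded) == "print('hola')"
assert decode_context_field("") == ""
```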
5. Query Classification & Routing
The classifier is the most critical node. It determines the entire execution path for a request.
Taxonomy
| Type | Intent | RAG | Model slot | Prompt |
|---|---|---|---|---|
| `RETRIEVAL` | Understand AVAP language concepts | Yes | main | `GENERATE_PROMPT` |
| `CODE_GENERATION` | Produce working AVAP code | Yes | main | `CODE_GENERATION_PROMPT` |
| `CONVERSATIONAL` | Rephrase or continue prior answer | No | conversational | `CONVERSATIONAL_PROMPT` |
| `PLATFORM` | Account, metrics, usage, billing | No | conversational | `PLATFORM_PROMPT` |
Classification pipeline
```mermaid
flowchart TD
    Q[Incoming query] --> FP{Fast-path check\n_is_platform_query}
    FP -->|known platform prefix| PLATFORM[PLATFORM\nno LLM call]
    FP -->|no match| LLM[LLM Classifier\nqwen3:1.7b]
    LLM --> IH[Intent history\nlast 6 entries as context]
    IH --> OUT[Output: TYPE + EDITOR/NO_EDITOR]
    OUT --> R[RETRIEVAL]
    OUT --> C[CODE_GENERATION]
    OUT --> V[CONVERSATIONAL]
    OUT --> PL[PLATFORM]
```
Intent history — solving anchoring bias
The classifier does not receive raw conversation messages. It receives a compact trace of prior classifications:
```
[RETRIEVAL] "What is addVar in AVAP?"
[CODE_GENERATION] "Write an API endpoint that retur"
[PLATFORM] "You have a project usage percentag"
```
Why: A 1.7B model receiving full message history computes P(type | history) instead of P(type | message_content) — it biases toward the dominant type of the session. The intent history gives the classifier enough context to resolve ambiguous references ("this", "esto") without the topical noise that causes anchoring.
Rule enforced in prompt: <history_rule> — the distribution of previous intents must not influence the prior probability of the current classification.
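A minimal sketch of how such a trace could be rendered for the classifier prompt, assuming `ClassifyEntry` records are dict-shaped with the fields from Section 11; the helper name is illustrative, not the real implementation.

```python
# Render the last `cap` classifications as '[TYPE] "snippet"' lines.
# The 6-entry cap mirrors RC-06; the 60-char snippet mirrors ClassifyEntry.topic.
def render_intent_history(entries: list[dict], cap: int = 6) -> str:
    """Build the compact intent trace the classifier sees instead of raw messages."""
    recent = entries[-cap:]
    return "\n".join(f'[{e["type"]}] "{e["topic"][:60]}"' for e in recent)

history = [
    {"type": "RETRIEVAL", "topic": "What is addVar in AVAP?"},
    {"type": "CODE_GENERATION", "topic": "Write an API endpoint"},
]
trace = render_intent_history(history)
```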
Routing contract (from ADR-0008)
| Rule | Description | Priority |
|---|---|---|
| RC-01 | Known platform prefix → `PLATFORM` without LLM | Highest |
| RC-02 | Usage metrics / quota data in message → `PLATFORM` | High |
| RC-03 | History resolves references only, never predicts type | Medium |
| RC-04 | `PLATFORM` and `CONVERSATIONAL` never touch Elasticsearch | Medium |
| RC-05 | `RETRIEVAL`/`CODE_GENERATION` → main model; `CONVERSATIONAL`/`PLATFORM` → conversational model | Medium |
| RC-06 | Intent history capped at 6 entries | Low |
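RC-01's fast path can be sketched as a simple prefix check. The prefix list below is hypothetical, not the production one.

```python
# Illustrative RC-01 fast path: a prefix match routes to PLATFORM without
# spending an LLM call. PLATFORM_PREFIXES is an assumption for demonstration.
PLATFORM_PREFIXES = (
    "you have a project usage",
    "your current plan",
)

def is_platform_query(query: str) -> bool:
    """Return True when a known platform prefix lets us skip the LLM classifier."""
    normalized = query.strip().lower()
    return normalized.startswith(PLATFORM_PREFIXES)
```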
6. LangGraph Workflow
Two graphs are built at startup and reused across all requests:
build_graph — used by AskAgent (non-streaming)
```mermaid
flowchart TD
    START([start]) --> CL[classify]
    CL -->|RETRIEVAL| RF[reformulate]
    CL -->|CODE_GENERATION| RF
    CL -->|CONVERSATIONAL| RC[respond_conversational]
    CL -->|PLATFORM| RP[respond_platform]
    RF --> RT[retrieve]
    RT -->|RETRIEVAL| GE[generate]
    RT -->|CODE_GENERATION| GC[generate_code]
    GE --> END([end])
    GC --> END
    RC --> END
    RP --> END
```
build_prepare_graph — used by AskAgentStream (streaming)
This graph only runs the preparation phase. The generation is handled outside the graph by llm.stream() to enable true token-by-token streaming.
```mermaid
flowchart TD
    START([start]) --> CL[classify]
    CL -->|RETRIEVAL| RF[reformulate]
    CL -->|CODE_GENERATION| RF
    CL -->|CONVERSATIONAL| SK[skip_retrieve]
    CL -->|PLATFORM| SK
    RF --> RT[retrieve]
    RT --> END([end])
    SK --> END
```
After prepare_graph returns, server.py calls build_final_messages(prepared) to construct the prompt and then streams directly from the selected LLM.
7. RAG Pipeline — Hybrid Search
Only RETRIEVAL and CODE_GENERATION queries reach this pipeline.
```mermaid
flowchart TD
    Q[reformulated_query] --> EMB[embed_query\nbge-m3]
    Q --> BM25[BM25 search\nElasticsearch multi_match\nfields: content^2 text^2\nfuzziness: AUTO]
    EMB --> KNN[kNN search\nElasticsearch\nfield: embedding\nk=8 num_candidates=40]
    BM25 --> RRF[RRF Fusion\n1 / rank+60 per hit\ncombined score]
    KNN --> RRF
    RRF --> RANK[Ranked docs\ntop-8]
    RANK --> FMT[format_context\nchunk headers: id type block section source\nAVAP CODE flag for code blocks]
    FMT --> CTX[context string\ninjected into generation prompt]
```
Why hybrid search: BM25 is strong for exact AVAP command names (registerEndpoint, addVar) which are rare in embedding space. kNN captures semantic similarity. RRF fusion combines both without requiring score normalization.
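The RRF step can be sketched in a few lines; ranks are 1-based and the constant 60 matches the `1 / (rank + 60)` formula in the diagram. The function name is illustrative.

```python
# Minimal Reciprocal Rank Fusion sketch. Documents appearing in both ranked
# lists accumulate both contributions, which is what lets RRF combine BM25 and
# kNN results without score normalization.
def rrf_fuse(bm25_ids: list[str], knn_ids: list[str], k: int = 60) -> list[str]:
    """Fuse two ranked id lists; returns ids ordered by combined RRF score."""
    scores: dict[str, float] = {}
    for ranked in (bm25_ids, knn_ids):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank + k)
    return sorted(scores, key=scores.get, reverse=True)

rrf_fuse(["a", "b", "c"], ["b", "c", "d"])  # → ['b', 'c', 'a', 'd']
```

Note that "b" wins even though it tops neither list: appearing near the top of both outweighs a single first place.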
Elasticsearch index schema:
| Field | Type | Description |
|---|---|---|
| `content` / `text` | `text` | Chunk text (BM25 searchable) |
| `embedding` | `dense_vector` | bge-m3 embedding (kNN searchable) |
| `doc_type` | `keyword` | `code`, `spec`, `code_example`, `bnf` |
| `block_type` | `keyword` | `function`, `if`, `startLoop`, `try`, etc. |
| `section` | `keyword` | Document section heading |
| `source_file` | `keyword` | Origin file |
| `chunk_id` | `keyword` | Unique chunk identifier |
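Putting the schema and the diagram parameters together, the two request bodies could look roughly like this. This is a sketch assuming the standard Elasticsearch `multi_match` query and top-level `knn` search forms; the production queries may differ.

```python
# Hedged sketch of the hybrid search request bodies, built from the documented
# parameters: content^2/text^2 with fuzziness AUTO for BM25, and k=8 with
# num_candidates=40 for kNN over the `embedding` dense_vector field.
def build_hybrid_queries(query: str, vector: list[float]) -> tuple[dict, dict]:
    bm25 = {
        "query": {
            "multi_match": {
                "query": query,
                "fields": ["content^2", "text^2"],
                "fuzziness": "AUTO",
            }
        }
    }
    knn = {
        "knn": {
            "field": "embedding",
            "query_vector": vector,
            "k": 8,
            "num_candidates": 40,
        }
    }
    return bm25, knn
```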
8. Streaming Architecture
AskAgentStream implements true token-by-token streaming. It does not buffer the full response.
```mermaid
sequenceDiagram
    participant C as Client
    participant S as server.py
    participant PG as prepare_graph
    participant LLM as Ollama LLM
    C->>S: AskAgentStream(request)
    S->>PG: invoke prepare_graph(initial_state)
    Note over PG: classify → reformulate → retrieve<br/>(or skip_retrieve for CONV/PLATFORM)
    PG-->>S: prepared state (query_type, context, messages)
    S->>S: build_final_messages(prepared)
    S->>S: select active_llm based on query_type
    loop token streaming
        S->>LLM: active_llm.stream(final_messages)
        LLM-->>S: chunk.content
        S-->>C: AgentResponse(text=token, is_final=false)
    end
    S-->>C: AgentResponse(text="", is_final=true)
    S->>S: update session_store + classify_history_store
```
Model selection in stream path:
```python
active_llm = self.llm_conversational if query_type in ("CONVERSATIONAL", "PLATFORM") else self.llm
```
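The full stream loop then reduces to a generator along these lines; the object names (`server`, its `llm` attributes) are illustrative stand-ins for the real server instance.

```python
# Sketch of the Section 8 stream path: preparation has already run, so this
# only selects the model slot and forwards tokens as they arrive.
def stream_answer(server, final_messages, query_type: str):
    """Yield tokens from the slot chosen by query_type (selection rule above)."""
    active_llm = (
        server.llm_conversational
        if query_type in ("CONVERSATIONAL", "PLATFORM")
        else server.llm
    )
    for chunk in active_llm.stream(final_messages):
        # Each token is forwarded as AgentResponse(text=token, is_final=False);
        # the empty is_final=True message is sent after the loop ends.
        yield chunk.content
```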
9. Editor Context Pipeline
Implemented in PRD-0002. The VS Code extension can send three optional context fields with each query.
```mermaid
flowchart TD
    REQ[AgentRequest] --> B64[base64 decode\neditor_content\nselected_text\nextra_context]
    B64 --> STATE[AgentState]
    STATE --> CL[classify node\n_build_classify_prompt]
    CL --> ED{use_editor_context?}
    ED -->|EDITOR| INJ[inject into prompt\n1 selected_text — highest priority\n2 editor_content — file context\n3 RAG chunks\n4 extra_context]
    ED -->|NO_EDITOR| NONINJ[standard prompt\nno editor content injected]
    INJ --> GEN[generation node]
    NONINJ --> GEN
```
Classifier output format: Two tokens — TYPE EDITOR_SIGNAL
Examples: RETRIEVAL NO_EDITOR, CODE_GENERATION EDITOR, PLATFORM NO_EDITOR
EDITOR is set only when the user explicitly refers to the code in their editor: "this code", "fix this", "que hace esto", "explain this selection".
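The injection priority can be sketched as follows. Only the ordering comes from the diagram above; the helper name and section labels are assumptions.

```python
# Illustrative prompt assembly honoring the documented priority:
# selected_text > editor_content > RAG chunks > extra_context.
def build_context_sections(state: dict) -> str:
    sections = []
    if state.get("use_editor_context"):
        if state.get("selected_text"):
            sections.append("## Selected code\n" + state["selected_text"])
        if state.get("editor_content"):
            sections.append("## Open file\n" + state["editor_content"])
    if state.get("context"):
        sections.append("## Retrieved documentation\n" + state["context"])
    if state.get("extra_context"):
        sections.append("## Extra context\n" + state["extra_context"])
    return "\n\n".join(sections)
```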
10. Platform Query Pipeline
Implemented in PRD-0003. Queries about account, metrics, usage, or billing bypass RAG entirely.
```mermaid
flowchart TD
    Q[Incoming query] --> FP{_is_platform_query?}
    FP -->|yes — known prefix| SKIP[skip classifier LLM\nroute = PLATFORM]
    FP -->|no| CL[LLM classifier]
    CL -->|PLATFORM| ROUTE[route to respond_platform]
    SKIP --> ROUTE
    ROUTE --> PROMPT["PLATFORM_PROMPT\n+ extra_context injection\n+ user_info available"]
    PROMPT --> LLM[qwen3:0.6b\nconversational model slot]
    LLM --> RESP[response]
    style SKIP fill:#2d6a4f,color:#fff
    style ROUTE fill:#2d6a4f,color:#fff
```
No Elasticsearch call is made for PLATFORM queries. The data is already in the request via extra_context and user_info injected by the caller (AVS Platform).
11. Session State & Intent History
Two stores are maintained per session in memory:
```mermaid
flowchart LR
    subgraph Stores ["In-memory stores (server.py)"]
        SS["session_store\ndict[session_id → list[BaseMessage]]\nfull conversation history\nused by generation nodes"]
        CHS["classify_history_store\ndict[session_id → list[ClassifyEntry]]\ncompact intent trace\nused by classifier"]
    end
    subgraph Entry ["ClassifyEntry (state.py)"]
        CE["type: str\nRETRIEVAL | CODE_GENERATION\nCONVERSATIONAL | PLATFORM\n─────────────────\ntopic: str\n60-char query snippet"]
    end
    CHS --> CE
```
classify_history_store is also a data flywheel. Every session generates labeled (topic, type) pairs automatically. When sufficient sessions accumulate (~500), this store can be exported to train the Layer 2 embedding classifier described in ADR-0008 Future Path — eliminating the need for the LLM classifier on the majority of requests.
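A possible export step for that flywheel, assuming dict-shaped `ClassifyEntry` records; the function is hypothetical, since this export mechanism is exactly what Section 15 lists as not yet built.

```python
import json

# Hypothetical flywheel export: dump accumulated (topic, type) pairs as JSONL,
# ready for training the Layer 2 embedding classifier from ADR-0008.
def export_flywheel(classify_history_store: dict, path: str) -> int:
    """Write one {'text', 'label'} JSON object per ClassifyEntry; returns row count."""
    rows = 0
    with open(path, "w", encoding="utf-8") as f:
        for entries in classify_history_store.values():
            for e in entries:
                f.write(json.dumps({"text": e["topic"], "label": e["type"]}) + "\n")
                rows += 1
    return rows
```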
AgentState fields
```python
from typing import Annotated, TypedDict

from langgraph.graph.message import add_messages


class AgentState(TypedDict):
    # Core
    messages: Annotated[list, add_messages]  # full conversation
    session_id: str
    query_type: str           # RETRIEVAL | CODE_GENERATION | CONVERSATIONAL | PLATFORM
    reformulated_query: str
    context: str              # RAG retrieved context

    # Classifier intent history
    classify_history: list[ClassifyEntry]  # compact trace, persisted across turns

    # Editor context (PRD-0002)
    editor_content: str       # base64 decoded
    selected_text: str        # base64 decoded
    extra_context: str        # base64 decoded
    user_info: str            # JSON: {dev_id, project_id, org_id}
    use_editor_context: bool  # set by classifier
```
12. Evaluation Pipeline
EvaluateRAG runs an automated quality benchmark using RAGAS with Claude as the judge. It does not share infrastructure with the main request pipeline.
```mermaid
flowchart TD
    GD[golden_dataset.json\n50 Q&A pairs with ground_truth] --> FILTER[category + limit filter]
    FILTER --> LOOP[for each question]
    LOOP --> RET[retrieve_context\nhybrid BM25+kNN\nsame as main pipeline]
    RET --> GEN[generate_answer\nusing main LLM]
    GEN --> ROW[dataset row\nquestion · answer · contexts · ground_truth]
    ROW --> DS[HuggingFace Dataset]
    DS --> RAGAS[ragas.evaluate\nfaithfulness\nanswer_relevancy\ncontext_recall\ncontext_precision]
    RAGAS --> JUDGE[Claude judge\nclaude-sonnet-4-20250514\nRateLimitedChatAnthropic\n3s delay between calls]
    JUDGE --> SCORES[per-metric scores]
    SCORES --> GLOBAL[global_score = mean of valid metrics]
    GLOBAL --> VERDICT{verdict}
    VERDICT -->|≥ 0.80| EX[EXCELLENT]
    VERDICT -->|≥ 0.60| AC[ACCEPTABLE]
    VERDICT -->|< 0.60| IN[INSUFFICIENT]
```
Important: EvaluateRAG uses RateLimitedChatAnthropic — a subclass that injects a 3-second delay between calls to respect Anthropic API rate limits. RAGAS is configured with max_workers=1.
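The verdict bands from the flowchart, as a small helper (the function name is illustrative):

```python
# Map the mean RAGAS score to the documented verdict bands:
# >= 0.80 EXCELLENT, >= 0.60 ACCEPTABLE, otherwise INSUFFICIENT.
def verdict(global_score: float) -> str:
    if global_score >= 0.80:
        return "EXCELLENT"
    if global_score >= 0.60:
        return "ACCEPTABLE"
    return "INSUFFICIENT"
```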
13. Data Ingestion Pipeline
Two independent pipelines populate the Elasticsearch index. Both require the Ollama tunnel to be active on localhost:11434 and the Elasticsearch tunnel on localhost:9200.
```mermaid
flowchart TD
    subgraph PipelineA ["Pipeline A — Chonkie (recommended)"]
        DA[docs/**/*.md\ndocs/**/*.avap] --> CA[Chonkie\nMarkdownChef + TokenChunker]
        CA --> EA[OllamaEmbeddings\nbatch embed]
        EA --> ESA[Elasticsearch bulk index]
    end
    subgraph PipelineB ["Pipeline B — AVAP Native"]
        DB[docs/**/*.avap\ndocs/**/*.md] --> CB[avap_chunker.py\nGenericLexer + LanguageConfig\nblock detection + semantic tags\nMinHash LSH dedup]
        CB --> JSONL[chunks.jsonl]
        JSONL --> EB[avap_ingestor.py\nasync producer/consumer\nOllamaAsyncEmbedder\nbatch=8]
        EB --> ESB[Elasticsearch async_bulk\nbatch=50\nDeadLetterQueue on failure]
    end
    ESA --> IDX[(avap-knowledge-v2\nElasticsearch index)]
    ESB --> IDX
```
Pipeline B produces richer metadata — block type, section, semantic tags (uses_orm, uses_http, uses_auth, etc.), complexity score, and MinHash deduplication. Use it for .avap files that need full semantic analysis.
14. Observability Stack
```mermaid
flowchart LR
    S[server.py] -->|LANGFUSE_HOST\nLANGFUSE_PUBLIC_KEY\nLANGFUSE_SECRET_KEY| LF[Langfuse\n45.77.119.180:80]
    LF --> TR[Traces\nper-request spans]
    LF --> ME[Metrics\nlatency · token counts]
    LF --> EV[Evaluation scores\nfaithfulness · relevancy]
```
Langfuse is accessed directly via public IP — no kubectl tunnel required. The engine sends traces automatically on every request when the Langfuse environment variables are set.
15. Known Limitations & Future Work
Active tactical debt
| Item | Description | ADR |
|---|---|---|
| LLM classifier | Generative model doing discriminative work — non-deterministic, pays full inference cost for a 4-class label | ADR-0008 |
| RC-02 is soft | Platform data signal enforced via prompt `<platform_priority_rule>`, not code — can be overridden by model | ADR-0008 |
| `classify_history` not exported | Data flywheel accumulates but has no export mechanism yet | ADR-0008 |
| `user_info` unused | `dev_id`, `project_id`, `org_id` are in state but not consumed by any graph node | PRD-0002 |
| `CONFIDENCE_PROMPT_TEMPLATE` unused | Self-RAG capability is scaffolded in `prompts.py` but not wired into the graph | — |
Roadmap (ADR-0008 Future Path)
```mermaid
flowchart LR
    P0["Now\nLLM classifier\n~95% of traffic"] --> P1["Phase 1\nExport classify_history_store\nlabeled dataset"]
    P1 --> P2["Phase 2\nEmbedding classifier Layer 2\nbge-m3 + logistic regression\n~1ms CPU"]
    P2 --> P3["Phase 3\nCaller-declared query_type\nproto field 7"]
    P3 --> P4["Phase 4\nLLM classifier = anomaly handler\n<2% of traffic"]
```
Target steady-state: the LLM classifier handles fewer than 2% of requests — only genuinely ambiguous queries that neither hard rules nor the trained embedding classifier can resolve with confidence.
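A sketch of what the Phase 2 Layer 2 classifier could look like, using scikit-learn as an assumed implementation choice (ADR-0008 specifies only "bge-m3 + logistic regression"); `embed` is a stand-in for the real bge-m3 embedding call.

```python
# Hypothetical Layer 2 trainer: logistic regression over query embeddings,
# trained on the exported flywheel pairs. scikit-learn is an assumption here.
from sklearn.linear_model import LogisticRegression

def train_layer2(texts: list[str], labels: list[str], embed):
    """Fit a logistic regression over embed(text) vectors; labels are query types."""
    X = [embed(t) for t in texts]  # bge-m3 vectors in production
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, labels)
    # clf.predict_proba would provide the confidence gate that decides when to
    # fall back to the LLM classifier (<2% of traffic at steady state).
    return clf
```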