# Brunix Assistance Engine — Architecture Reference

> **Audience:** Engineers contributing to this repository, architects reviewing the system design, and operators responsible for its deployment.
> **Last updated:** 2026-04-10
> **Version:** 1.9.x
> **Architect:** Rafael Ruiz (CTO, 101OBEX Corp)
> **Related ADRs:** ADR-0001 · ADR-0002 · ADR-0003 · ADR-0004 · ADR-0005 · ADR-0006 · ADR-0007 · ADR-0008
> **Related PRDs:** PRD-0001 · PRD-0002 · PRD-0003

---

## Table of Contents

1. [System Classification](#1-system-classification)
2. [Infrastructure Layout](#2-infrastructure-layout)
3. [Component Inventory](#3-component-inventory)
4. [Request Lifecycle](#4-request-lifecycle)
5. [Query Classification & Routing](#5-query-classification--routing)
6. [LangGraph Workflow](#6-langgraph-workflow)
7. [RAG Pipeline — Hybrid Search](#7-rag-pipeline--hybrid-search)
8. [Streaming Architecture](#8-streaming-architecture)
9. [Editor Context Pipeline](#9-editor-context-pipeline)
10. [Platform Query Pipeline](#10-platform-query-pipeline)
11. [Session State & Intent History](#11-session-state--intent-history)
12. [Evaluation Pipeline](#12-evaluation-pipeline)
13. [Data Ingestion Pipeline](#13-data-ingestion-pipeline)
14. [Observability Stack](#14-observability-stack)
15. [Known Limitations & Future Work](#15-known-limitations--future-work)

---

## 1. System Classification

The Brunix Assistance Engine is a **Modular Agentic RAG** system.
| Characteristic | Classification | Evidence |
|---|---|---|
| Pipeline structure | **Agentic RAG** | LangGraph classifier routes each query to a different pipeline — not a fixed retrieve-then-generate chain |
| Retrieval strategy | **Advanced RAG** | Hybrid BM25+kNN with RRF fusion, query reformulation before retrieval, two-phase streaming |
| Architecture style | **Modular RAG** | Each node (classify, reformulate, retrieve, generate) is independently replaceable; routing contract is implementation-independent |
| Self-evaluation | Not yet active | `CONFIDENCE_PROMPT_TEMPLATE` exists but is not wired into the graph — Self-RAG capability is scaffolded |

**What it is not:** naive RAG (no fixed pipeline), Graph RAG (no knowledge graph), Self-RAG (no confidence feedback loop, yet).

The classifier is the central architectural element. It determines which pipeline executes for each request — including whether retrieval happens at all.

---

## 2. Infrastructure Layout

```mermaid
graph TD
    subgraph Clients ["External Clients"]
        VSC["VS Code Extension"]
        AVS["AVS Platform"]
        DEV["grpcurl / tests"]
    end
    subgraph Docker ["Docker — brunix-assistance-engine"]
        PROXY["openai_proxy.py\n:8000 HTTP/SSE\nOpenAI + Ollama compatible"]
        SERVER["server.py\n:50051 gRPC internal\nAskAgent · AskAgentStream · EvaluateRAG"]
        GRAPH["graph.py — LangGraph\nclassify → route → execute"]
    end
    subgraph Mac ["Developer Machine (macOS)"]
        Docker
        subgraph Tunnels ["kubectl port-forward tunnels"]
            T1["localhost:11434 → Ollama"]
            T2["localhost:9200 → Elasticsearch"]
            T3["localhost:5432 → Postgres"]
        end
    end
    subgraph Vultr ["Vultr — Devaron K8s Cluster"]
        OL["ollama-light-service\nqwen3:1.7b · qwen3:0.6b · bge-m3"]
        ES["brunix-vector-db\nElasticsearch :9200\navap-knowledge-v2"]
        PG["brunix-postgres\nPostgres :5432\nLangfuse data"]
        LF["Langfuse UI\n45.77.119.180:80\nObservability + tracing"]
    end
    VSC -->|"HTTP/SSE :8000"| PROXY
    AVS -->|"HTTP/SSE :8000"| PROXY
    DEV -->|"gRPC :50052"| SERVER
    PROXY -->|"gRPC internal"| SERVER
    SERVER --> GRAPH
    GRAPH -->|"host.docker.internal:11434"| T1
    GRAPH -->|"host.docker.internal:9200"| T2
    SERVER -->|"host.docker.internal:5432"| T3
    T1 -->|"secure tunnel"| OL
    T2 -->|"secure tunnel"| ES
    T3 -->|"secure tunnel"| PG
    SERVER -->|"traces HTTP"| LF
    DEV -->|"browser direct"| LF
```

**Key networking detail:** Docker does not talk to the kubectl tunnels directly. The path is:

```
Docker container
  → host.docker.internal (resolves to macOS host via extra_hosts)
  → kubectl port-forward (active on macOS)
  → Vultr K8s service
```

Langfuse is the exception — it has a public IP (`45.77.119.180`) and is accessed directly, without a tunnel.

---

## 3. Component Inventory

| Component | File | Role |
|---|---|---|
| gRPC server | `server.py` | Entry point for all AI requests. Manages session store, model selection, and state initialization |
| HTTP proxy | `openai_proxy.py` | OpenAI + Ollama compatible HTTP layer. Translates REST → gRPC |
| LangGraph orchestrator | `graph.py` | Builds and executes the agentic routing graph. Hosts L1, L2, and L3 classifier layers |
| Prompt definitions | `prompts.py` | All prompt templates in one place: classifier, reformulator, generators, platform |
| Agent state | `state.py` | `AgentState` TypedDict shared across all graph nodes |
| LLM factory | `utils/llm_factory.py` | Provider-agnostic model instantiation (Ollama, OpenAI, Anthropic, Bedrock) |
| Embedding factory | `utils/emb_factory.py` | Provider-agnostic embedding model instantiation |
| Classifier export | `utils/classifier_export.py` | Exports `classify_history_store` to labeled JSONL when threshold is reached. Data flywheel for Layer 2 retraining |
| Evaluation pipeline | `evaluate.py` | RAGAS evaluation with Claude as judge |
| Proto contract | `protos/brunix.proto` | Source of truth for the gRPC API |
| Classifier training | `scripts/pipelines/classifier/train_classifier.py` | Offline script. Embeds labeled queries with bge-m3, trains LogisticRegression, serializes model |

**Model slots:**

| Slot | Env var | Used for | Current model |
|---|---|---|---|
| Main | `OLLAMA_MODEL_NAME` | `RETRIEVAL`, `CODE_GENERATION`, classification | `qwen3:1.7b` |
| Conversational | `OLLAMA_MODEL_NAME_CONVERSATIONAL` | `CONVERSATIONAL`, `PLATFORM` | `qwen3:0.6b` |
| Embeddings | `OLLAMA_EMB_MODEL_NAME` | Query embedding, document indexing | `bge-m3` |
| Evaluation judge | `ANTHROPIC_MODEL` | RAGAS scoring | `claude-sonnet-4-20250514` |

---

## 4. Request Lifecycle

Two entry paths, both ending at the same LangGraph:

```mermaid
sequenceDiagram
    participant C as Client
    participant P as openai_proxy.py :8000
    participant S as server.py :50051
    participant G as graph.py
    participant O as Ollama (via tunnel)
    participant E as Elasticsearch (via tunnel)
    C->>P: POST /v1/chat/completions
    P->>P: parse user field (editor_content, selected_text, extra_context, user_info)
    P->>S: gRPC AskAgent / AskAgentStream
    S->>S: base64 decode context fields
    S->>S: load session_store + classify_history_store
    S->>G: invoke graph with AgentState
    G->>O: classify (LLM call)
    O-->>G: query_type + use_editor_context
    alt RETRIEVAL or CODE_GENERATION
        G->>O: reformulate query
        O-->>G: reformulated_query
        G->>O: embed reformulated_query
        O-->>G: query_vector
        G->>E: BM25 search + kNN search
        E-->>G: ranked chunks (RRF fusion)
        G->>O: generate with context
        O-->>G: response
    else CONVERSATIONAL
        G->>O: respond_conversational (no retrieval)
        O-->>G: response
    else PLATFORM
        G->>O: respond_platform (no retrieval, uses extra_context)
        O-->>G: response
    end
    G-->>S: final_state
    S->>S: update session_store + classify_history_store
    S-->>P: AgentResponse stream
    P-->>C: SSE / JSON response
```

---

## 5. Query Classification & Routing

The classifier is the most critical node. It determines the entire execution path for a request.
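The routing decision itself is simple once the classifier has produced a type. A minimal, self-contained sketch (function and mapping names here are hypothetical, mirroring the graph edges described in section 6; the real implementation lives in `graph.py`):

```python
# Hypothetical sketch of routing on query_type. Node names mirror the
# LangGraph edges documented in section 6; names are illustrative only.
ROUTES = {
    "RETRIEVAL": "reformulate",                  # RAG path: reformulate → retrieve → generate
    "CODE_GENERATION": "reformulate",            # same RAG path, different generation prompt
    "CONVERSATIONAL": "respond_conversational",  # no retrieval
    "PLATFORM": "respond_platform",              # no retrieval, uses extra_context
}

def route_after_classify(query_type: str) -> str:
    """Return the next graph node for a classified query."""
    try:
        return ROUTES[query_type]
    except KeyError:
        raise ValueError(f"unknown query_type: {query_type!r}")
```

The useful property is that the mapping is data, not control flow: swapping a pipeline only touches the table, which is what makes each node independently replaceable.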
### Taxonomy

| Type | Intent | RAG | Model slot | Prompt |
|---|---|---|---|---|
| `RETRIEVAL` | Understand AVAP language concepts | Yes | main | `GENERATE_PROMPT` |
| `CODE_GENERATION` | Produce working AVAP code | Yes | main | `CODE_GENERATION_PROMPT` |
| `CONVERSATIONAL` | Rephrase or continue prior answer | No | conversational | `CONVERSATIONAL_PROMPT` |
| `PLATFORM` | Account, metrics, usage, billing | No | conversational | `PLATFORM_PROMPT` |

### Classification pipeline

Three-layer pipeline. Each layer is only invoked if the previous one cannot produce a confident answer.

```mermaid
flowchart TD
    Q([Incoming query]) --> CD{Caller-declared\nquery_type?}
    CD -->|proto field 7 set| DECL[Use declared type\nbypass all layers]
    CD -->|empty| L1
    L1{"Layer 1\nHard rules\nRC-01 · RC-02\nO(1) deterministic"}
    L1 -->|match| R1[Classification result]
    L1 -->|no match| L2
    L2{"Layer 2\nEmbedding classifier\nbge-m3 + LogisticRegression\n~1ms · CPU · no LLM"}
    L2 -->|confidence ≥ 0.85| R1
    L2 -->|confidence < 0.85| L3
    L3{"Layer 3\nLLM classifier\nqwen3:1.7b\nfallback for ambiguous queries"}
    L3 --> R1
    DECL --> R1
    R1 --> RT[RETRIEVAL]
    R1 --> CG[CODE_GENERATION]
    R1 --> CV[CONVERSATIONAL]
    R1 --> PL[PLATFORM]
```

**Caller-declared type (Phase 3):** When the calling system already knows the query type — for example, the AVS Platform generating a `PLATFORM` prompt — it sets `query_type` in the proto request. All three classifier layers are skipped entirely.

**Layer 2** is loaded from `CLASSIFIER_MODEL_PATH` (`/data/classifier_model.pkl`) at startup. If the file does not exist, the engine starts normally and uses Layer 3 only.

### Intent history — solving anchoring bias

The classifier does not receive raw conversation messages. It receives a compact trace of prior classifications:

```
[RETRIEVAL] "What is addVar in AVAP?"
[CODE_GENERATION] "Write an API endpoint that retur"
[PLATFORM] "You have a project usage percentag"
```

**Why:** A 1.7B model receiving full message history computes `P(type | history)` instead of `P(type | message_content)` — it biases toward the dominant type of the session. The intent history gives the classifier enough context to resolve ambiguous references (`"this"`, `"esto"`) without the topical noise that causes anchoring.

**Rule enforced in prompt:** the distribution of previous intents must not influence the prior probability of the current classification.

### Routing contract (from ADR-0008)

| Rule | Description | Priority |
|---|---|---|
| RC-01 | Known platform prefix → `PLATFORM` without LLM | Highest |
| RC-02 | Usage metrics / quota data in message → `PLATFORM` | High |
| RC-03 | History resolves references only, never predicts type | Medium |
| RC-04 | `PLATFORM` and `CONVERSATIONAL` never touch Elasticsearch | Medium |
| RC-05 | `RETRIEVAL`/`CODE_GENERATION` → main model; `CONVERSATIONAL`/`PLATFORM` → conversational model | Medium |
| RC-06 | Intent history capped at 6 entries | Low |

---

## 6. LangGraph Workflow

Two graphs are built at startup and reused across all requests:

### `build_graph` — used by `AskAgent` (non-streaming)

```mermaid
flowchart TD
    START([start]) --> CL[classify]
    CL -->|RETRIEVAL| RF[reformulate]
    CL -->|CODE_GENERATION| RF
    CL -->|CONVERSATIONAL| RC[respond_conversational]
    CL -->|PLATFORM| RP[respond_platform]
    RF --> RT[retrieve]
    RT -->|RETRIEVAL| GE[generate]
    RT -->|CODE_GENERATION| GC[generate_code]
    GE --> END([end])
    GC --> END
    RC --> END
    RP --> END
```

### `build_prepare_graph` — used by `AskAgentStream` (streaming)

This graph only runs the preparation phase. The generation is handled outside the graph by `llm.stream()` to enable true token-by-token streaming.
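This prepare-then-stream split can be sketched with stand-ins (all names below are hypothetical; in the engine, `prepare_graph`, `build_final_messages`, and the selected LLM's `stream()` play these roles):

```python
# Hypothetical sketch of two-phase streaming: run preparation once,
# then forward tokens from the model without buffering a full response.
from typing import Callable, Iterator

def stream_answer(
    prepare: Callable[[], dict],
    build_messages: Callable[[dict], list],
    llm_stream: Callable[[list], Iterator[str]],
) -> Iterator[str]:
    """Preparation once, then token-by-token forwarding."""
    prepared = prepare()                 # stands in for prepare_graph.invoke(...)
    messages = build_messages(prepared)  # stands in for build_final_messages(...)
    for token in llm_stream(messages):   # stands in for active_llm.stream(...)
        yield token                      # emitted as it arrives, never accumulated
```

Because the generator never accumulates tokens, time-to-first-token depends only on the preparation phase, not on total answer length.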
```mermaid
flowchart TD
    START([start]) --> CL[classify]
    CL -->|RETRIEVAL| RF[reformulate]
    CL -->|CODE_GENERATION| RF
    CL -->|CONVERSATIONAL| SK[skip_retrieve]
    CL -->|PLATFORM| SK
    RF --> RT[retrieve]
    RT --> END([end])
    SK --> END
```

After `prepare_graph` returns, `server.py` calls `build_final_messages(prepared)` to construct the prompt and then streams directly from the selected LLM.

---

## 7. RAG Pipeline — Hybrid Search

Only `RETRIEVAL` and `CODE_GENERATION` queries reach this pipeline.

```mermaid
flowchart TD
    Q[reformulated_query] --> EMB[embed_query\nbge-m3]
    Q --> BM25[BM25 search\nElasticsearch multi_match\nfields: content^2 text^2\nfuzziness: AUTO]
    EMB --> KNN[kNN search\nElasticsearch\nfield: embedding\nk=8 num_candidates=40]
    BM25 --> RRF[RRF Fusion\n1 / rank+60 per hit\ncombined score]
    KNN --> RRF
    RRF --> RANK[Ranked docs\ntop-8]
    RANK --> FMT[format_context\nchunk headers: id type block section source\nAVAP CODE flag for code blocks]
    FMT --> CTX[context string\ninjected into generation prompt]
```

**Why hybrid search:** BM25 is strong for exact AVAP command names (`registerEndpoint`, `addVar`) which are rare in embedding space. kNN captures semantic similarity. RRF fusion combines both without requiring score normalization.

**Elasticsearch index schema:**

| Field | Type | Description |
|---|---|---|
| `content` / `text` | `text` | Chunk text (BM25 searchable) |
| `embedding` | `dense_vector` | bge-m3 embedding (kNN searchable) |
| `doc_type` | `keyword` | `code`, `spec`, `code_example`, `bnf` |
| `block_type` | `keyword` | `function`, `if`, `startLoop`, `try`, etc. |
| `section` | `keyword` | Document section heading |
| `source_file` | `keyword` | Origin file |
| `chunk_id` | `keyword` | Unique chunk identifier |

---

## 8. Streaming Architecture

`AskAgentStream` implements true token-by-token streaming. It does not buffer the full response.
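The fusion step in section 7 is standard Reciprocal Rank Fusion: each hit contributes `1 / (rank + 60)` per list it appears in, and the summed scores are sorted. A minimal sketch (document IDs stand in for Elasticsearch hits; function name is illustrative):

```python
# Hypothetical sketch of RRF fusion over two ranked result lists,
# matching the 1 / (rank + 60) scoring described in section 7.
def rrf_fuse(bm25_ids: list[str], knn_ids: list[str],
             k: int = 60, top_n: int = 8) -> list[str]:
    """Combine two ranked ID lists with Reciprocal Rank Fusion."""
    scores: dict[str, float] = {}
    for ranked in (bm25_ids, knn_ids):
        for rank, doc_id in enumerate(ranked, start=1):  # ranks start at 1
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank + k)
    # Highest combined score first; keep the top-n for the context window
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

A document ranked in both lists beats a document ranked high in only one, which is exactly the property that removes the need for score normalization between BM25 and cosine similarity.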
```mermaid
sequenceDiagram
    participant C as Client
    participant S as server.py
    participant PG as prepare_graph
    participant LLM as Ollama LLM
    C->>S: AskAgentStream(request)
    S->>PG: invoke prepare_graph(initial_state)
    Note over PG: classify → reformulate → retrieve<br/>(or skip_retrieve for CONV/PLATFORM)
    PG-->>S: prepared state (query_type, context, messages)
    S->>S: build_final_messages(prepared)
    S->>S: select active_llm based on query_type
    loop token streaming
        S->>LLM: active_llm.stream(final_messages)
        LLM-->>S: chunk.content
        S-->>C: AgentResponse(text=token, is_final=false)
    end
    S-->>C: AgentResponse(text="", is_final=true)
    S->>S: update session_store + classify_history_store
```

**Model selection in stream path:**

```python
active_llm = self.llm_conversational if query_type in ("CONVERSATIONAL", "PLATFORM") else self.llm
```

---

## 9. Editor Context Pipeline

Implemented in PRD-0002. The VS Code extension can send three optional context fields with each query.

```mermaid
flowchart TD
    REQ[AgentRequest] --> B64[base64 decode\neditor_content\nselected_text\nextra_context]
    B64 --> STATE[AgentState]
    STATE --> CL[classify node\n_build_classify_prompt]
    CL --> ED{use_editor_context?}
    ED -->|EDITOR| INJ[inject into prompt\n1 selected_text — highest priority\n2 editor_content — file context\n3 RAG chunks\n4 extra_context]
    ED -->|NO_EDITOR| NONINJ[standard prompt\nno editor content injected]
    INJ --> GEN[generation node]
    NONINJ --> GEN
```

**Classifier output format:** Two tokens — `TYPE EDITOR_SIGNAL`

Examples: `RETRIEVAL NO_EDITOR`, `CODE_GENERATION EDITOR`, `PLATFORM NO_EDITOR`

`EDITOR` is set only when the user explicitly refers to the code in their editor: *"this code"*, *"fix this"*, *"que hace esto"* (Spanish: "what does this do"), *"explain this selection"*.

---

## 10. Platform Query Pipeline

Implemented in PRD-0003. Queries about account, metrics, usage, or billing bypass RAG entirely.
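The two-token classifier output from section 9 (`TYPE EDITOR_SIGNAL`) lends itself to defensive parsing. A hypothetical helper (the engine's actual parsing lives in `graph.py`; the fallback to `RETRIEVAL` on malformed output is an assumption for illustration):

```python
# Hypothetical parser for the two-token classifier output, e.g.
# "CODE_GENERATION EDITOR". Fallback behavior is an assumption.
VALID_TYPES = {"RETRIEVAL", "CODE_GENERATION", "CONVERSATIONAL", "PLATFORM"}

def parse_classifier_output(raw: str) -> tuple[str, bool]:
    """Split 'TYPE EDITOR_SIGNAL' into (query_type, use_editor_context)."""
    parts = raw.strip().split()
    # Assumed fallback: malformed output routes to the safest pipeline
    qtype = parts[0] if parts and parts[0] in VALID_TYPES else "RETRIEVAL"
    use_editor = len(parts) > 1 and parts[1] == "EDITOR"
    return qtype, use_editor
```

Treating anything other than a literal `EDITOR` token as `NO_EDITOR` keeps editor-content injection strictly opt-in, matching the rule that `EDITOR` fires only on explicit references.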
```mermaid
flowchart TD
    Q[Incoming query] --> CD{query_type\ndeclared in proto?}
    CD -->|yes| DECL[bypass all layers\nroute = declared type]
    CD -->|no| FP
    FP{Layer 1\n_is_platform_query?\nRC-01 · RC-02}
    FP -->|yes| SKIP[skip L2 + L3\nroute = PLATFORM]
    FP -->|no| L2[Layer 2 + Layer 3\nnormal classification]
    L2 -->|PLATFORM| ROUTE
    DECL -->|PLATFORM| ROUTE
    SKIP --> ROUTE[route to respond_platform]
    ROUTE --> PROMPT["PLATFORM_PROMPT\n+ extra_context injection\n+ user_info available"]
    PROMPT --> LLM[qwen3:0.6b\nconversational model slot]
    LLM --> RESP[response]
    style DECL fill:#2d6a4f,color:#fff
    style SKIP fill:#2d6a4f,color:#fff
    style ROUTE fill:#2d6a4f,color:#fff
```

**No Elasticsearch call is made for PLATFORM queries.** The data is already in the request via `extra_context` and `user_info` injected by the caller (AVS Platform).

---

## 11. Session State & Intent History

Two stores are maintained per session in memory:

```mermaid
flowchart LR
    subgraph Stores ["In-memory stores (server.py)"]
        SS["session_store\ndict[session_id → list[BaseMessage]]\nfull conversation history\nused by generation nodes"]
        CHS["classify_history_store\ndict[session_id → list[ClassifyEntry]]\ncompact intent trace\nused by classifier"]
    end
    subgraph Entry ["ClassifyEntry (state.py)"]
        CE["type: str\nRETRIEVAL | CODE_GENERATION\nCONVERSATIONAL | PLATFORM\n─────────────────\ntopic: str\n60-char query snippet"]
    end
    CHS --> CE
```

**`classify_history_store` is the data flywheel for Layer 2 retraining.** Every session generates labeled `(topic, type)` pairs automatically. `utils/classifier_export.py` exports them to JSONL when `CLASSIFIER_EXPORT_THRESHOLD` sessions accumulate (default: 500), and flushes on shutdown. These files feed directly into `scripts/pipelines/classifier/train_classifier.py` for retraining Layer 2 with production data — no manual labeling required.
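The export trigger can be sketched as follows. This is a hypothetical sketch, not the real `utils/classifier_export.py`: the JSONL field names (`text`, `label`) and the in-memory entry shape (dicts with `topic`/`type` keys) are assumptions for illustration.

```python
# Hypothetical sketch of the threshold-based JSONL export that feeds
# Layer 2 retraining. Field names and entry shape are assumptions.
import json

CLASSIFIER_EXPORT_THRESHOLD = 500  # default per this document

def export_if_ready(classify_history_store: dict, path: str,
                    threshold: int = CLASSIFIER_EXPORT_THRESHOLD) -> int:
    """Write (topic, type) pairs to JSONL once enough sessions accumulate.

    Returns the number of examples written (0 if below threshold).
    """
    if len(classify_history_store) < threshold:
        return 0
    written = 0
    with open(path, "w", encoding="utf-8") as f:
        for entries in classify_history_store.values():
            for entry in entries:
                # One labeled example per classified turn
                record = {"text": entry["topic"], "label": entry["type"]}
                f.write(json.dumps(record) + "\n")
                written += 1
    return written
```

The point of the flywheel is that this output is already in the shape `train_classifier.py` needs: embed `text` with bge-m3, fit LogisticRegression against `label`.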
### AgentState fields

```python
from typing import Annotated, TypedDict

from langgraph.graph.message import add_messages

# ClassifyEntry is defined earlier in state.py

class AgentState(TypedDict):
    # Core
    messages: Annotated[list, add_messages]  # full conversation
    session_id: str
    query_type: str           # RETRIEVAL | CODE_GENERATION | CONVERSATIONAL | PLATFORM
    reformulated_query: str
    context: str              # RAG retrieved context

    # Classifier intent history
    classify_history: list[ClassifyEntry]  # compact trace, persisted across turns

    # Editor context (PRD-0002)
    editor_content: str       # base64 decoded
    selected_text: str        # base64 decoded
    extra_context: str        # base64 decoded
    user_info: str            # JSON: {dev_id, project_id, org_id}
    use_editor_context: bool  # set by classifier
```

---

## 12. Evaluation Pipeline

`EvaluateRAG` runs an automated quality benchmark using RAGAS with Claude as the judge. It does not share infrastructure with the main request pipeline.

```mermaid
flowchart TD
    GD[golden_dataset.json\n50 Q&A pairs with ground_truth] --> FILTER[category + limit filter]
    FILTER --> LOOP[for each question]
    LOOP --> RET[retrieve_context\nhybrid BM25+kNN\nsame as main pipeline]
    RET --> GEN[generate_answer\nusing main LLM]
    GEN --> ROW[dataset row\nquestion · answer · contexts · ground_truth]
    ROW --> DS[HuggingFace Dataset]
    DS --> RAGAS[ragas.evaluate\nfaithfulness\nanswer_relevancy\ncontext_recall\ncontext_precision]
    RAGAS --> JUDGE[Claude judge\nclaude-sonnet-4-20250514\nRateLimitedChatAnthropic\n3s delay between calls]
    JUDGE --> SCORES[per-metric scores]
    SCORES --> GLOBAL[global_score = mean of valid metrics]
    GLOBAL --> VERDICT{verdict}
    VERDICT -->|≥ 0.80| EX[EXCELLENT]
    VERDICT -->|≥ 0.60| AC[ACCEPTABLE]
    VERDICT -->|< 0.60| IN[INSUFFICIENT]
```

**Important:** `EvaluateRAG` uses `RateLimitedChatAnthropic` — a subclass that injects a 3-second delay between calls to respect Anthropic API rate limits. RAGAS is configured with `max_workers=1`.

---

## 13. Data Ingestion Pipeline

Two independent pipelines populate the Elasticsearch index. Both require the Ollama tunnel active on `localhost:11434` and Elasticsearch on `localhost:9200`.
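The final roll-up in section 12 reduces to a mean over valid metrics plus the verdict bands. A sketch (treating `NaN` as an invalid metric, and returning `INSUFFICIENT` when no metric is usable, are both assumptions):

```python
# Hypothetical sketch of the section 12 score roll-up and verdict bands.
# NaN handling and the no-valid-metrics case are assumptions.
import math

def verdict(metric_scores: dict[str, float]) -> tuple[float, str]:
    """global_score = mean of valid metrics, mapped to a verdict band."""
    valid = [v for v in metric_scores.values() if not math.isnan(v)]
    if not valid:
        return float("nan"), "INSUFFICIENT"  # assumed behavior
    score = sum(valid) / len(valid)
    if score >= 0.80:
        return score, "EXCELLENT"
    if score >= 0.60:
        return score, "ACCEPTABLE"
    return score, "INSUFFICIENT"
```

Excluding failed metrics from the mean (rather than scoring them as zero) keeps a single judge timeout from sinking an otherwise-good run, which matters with a rate-limited judge.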
```mermaid
flowchart TD
    subgraph PipelineA ["Pipeline A — Chonkie (recommended)"]
        DA[docs/**/*.md\ndocs/**/*.avap] --> CA[Chonkie\nMarkdownChef + TokenChunker]
        CA --> EA[OllamaEmbeddings\nbatch embed]
        EA --> ESA[Elasticsearch bulk index]
    end
    subgraph PipelineB ["Pipeline B — AVAP Native"]
        DB[docs/**/*.avap\ndocs/**/*.md] --> CB[avap_chunker.py\nGenericLexer + LanguageConfig\nblock detection + semantic tags\nMinHash LSH dedup]
        CB --> JSONL[chunks.jsonl]
        JSONL --> EB[avap_ingestor.py\nasync producer/consumer\nOllamaAsyncEmbedder\nbatch=8]
        EB --> ESB[Elasticsearch async_bulk\nbatch=50\nDeadLetterQueue on failure]
    end
    ESA --> IDX[(avap-knowledge-v2\nElasticsearch index)]
    ESB --> IDX
```

**Pipeline B produces richer metadata** — block type, section, semantic tags (`uses_orm`, `uses_http`, `uses_auth`, etc.), complexity score, and MinHash deduplication. Use it for `.avap` files that need full semantic analysis.

---

## 14. Observability Stack

```mermaid
flowchart LR
    S[server.py] -->|LANGFUSE_HOST\nLANGFUSE_PUBLIC_KEY\nLANGFUSE_SECRET_KEY| LF[Langfuse\n45.77.119.180:80]
    LF --> TR[Traces\nper-request spans]
    LF --> ME[Metrics\nlatency · token counts]
    LF --> EV[Evaluation scores\nfaithfulness · relevancy]
```

Langfuse is accessed directly via public IP — no kubectl tunnel required. The engine sends traces automatically on every request when the Langfuse environment variables are set.

---

## 15. Known Limitations & Future Work

### Active tactical debt

| Item | Description | ADR |
|---|---|---|
| LLM classifier (Layer 3) | Still the dominant path until Layer 2 is retrained with production data and confidence improves | ADR-0008 |
| `user_info` unused | `dev_id`, `project_id`, `org_id` are in state but not consumed by any graph node | PRD-0002 |
| `CONFIDENCE_PROMPT_TEMPLATE` unused | Self-RAG capability is scaffolded in `prompts.py` but not wired into the graph | — |
| Layer 2 seed dataset only | Current model trained on 94 hand-crafted examples. Must be retrained with production exports to reduce L3 fallback rate | ADR-0008 |

### ADR-0008 implementation status

```mermaid
flowchart LR
    P0["Bootstrap\nLLM classifier\n✅ Implemented"] --> P1["Phase 1\nData flywheel\nclassifier_export.py\n✅ Complete"]
    P1 --> P2["Phase 2\nLayer 2 — bge-m3\n+ LogisticRegression\n✅ Complete"]
    P2 --> P3["Phase 3\nCaller-declared type\nproto field 7\n✅ Complete"]
    P3 --> P4["Phase 4\nLLM = anomaly handler\n<2% traffic\n⏳ Production outcome"]
```

Phase 4 is not a code deliverable — it is a production state reached when the platform client sends `query_type` for generated prompts and Layer 2 confidence improves with production data. Monitor with:

```bash
docker logs brunix-assistance-engine 2>&1 | grep "classifier/L" | grep -c "L3"
```