assistance-engine/docs/ARCHITECTURE.md

Brunix Assistance Engine — Architecture Reference

Audience: Engineers contributing to this repository, architects reviewing the system design, and operators responsible for its deployment.
Last updated: 2026-04-09
Version: 1.8.x


Table of Contents

  1. System Classification
  2. Infrastructure Layout
  3. Component Inventory
  4. Request Lifecycle
  5. Query Classification & Routing
  6. LangGraph Workflow
  7. RAG Pipeline — Hybrid Search
  8. Streaming Architecture
  9. Editor Context Pipeline
  10. Platform Query Pipeline
  11. Session State & Intent History
  12. Evaluation Pipeline
  13. Data Ingestion Pipeline
  14. Observability Stack
  15. Known Limitations & Future Work

1. System Classification

The Brunix Assistance Engine is a Modular Agentic RAG system.

| Characteristic | Classification | Evidence |
| --- | --- | --- |
| Pipeline structure | Agentic RAG | LangGraph classifier routes each query to a different pipeline — not a fixed retrieve-then-generate chain |
| Retrieval strategy | Advanced RAG | Hybrid BM25+kNN with RRF fusion, query reformulation before retrieval, two-phase streaming |
| Architecture style | Modular RAG | Each node (classify, reformulate, retrieve, generate) is independently replaceable; routing contract is implementation-independent |
| Self-evaluation | Not yet active | CONFIDENCE_PROMPT_TEMPLATE exists but is not wired into the graph — Self-RAG capability is scaffolded |

What it is not: naive RAG (there is no fixed retrieve-then-generate chain), Graph RAG (no knowledge graph), or Self-RAG (no confidence feedback loop, yet).

The classifier is the central architectural element. It determines which pipeline executes for each request — including whether retrieval happens at all.


2. Infrastructure Layout

graph TD
    subgraph Clients ["External Clients"]
        VSC["VS Code Extension"]
        AVS["AVS Platform"]
        DEV["grpcurl / tests"]
    end

    subgraph Docker ["Docker — brunix-assistance-engine"]
        PROXY["openai_proxy.py\n:8000 HTTP/SSE\nOpenAI + Ollama compatible"]
        SERVER["server.py\n:50051 gRPC internal\nAskAgent · AskAgentStream · EvaluateRAG"]
        GRAPH["graph.py — LangGraph\nclassify → route → execute"]
    end

    subgraph Mac ["Developer Machine (macOS)"]
        Docker
        subgraph Tunnels ["kubectl port-forward tunnels"]
            T1["localhost:11434 → Ollama"]
            T2["localhost:9200  → Elasticsearch"]
            T3["localhost:5432  → Postgres"]
        end
    end

    subgraph Vultr ["Vultr — Devaron K8s Cluster"]
        OL["ollama-light-service\nqwen3:1.7b · qwen3:0.6b · bge-m3"]
        ES["brunix-vector-db\nElasticsearch :9200\navap-knowledge-v2"]
        PG["brunix-postgres\nPostgres :5432\nLangfuse data"]
        LF["Langfuse UI\n45.77.119.180:80\nObservability + tracing"]
    end

    VSC -->|"HTTP/SSE :8000"| PROXY
    AVS -->|"HTTP/SSE :8000"| PROXY
    DEV -->|"gRPC :50051"| SERVER

    PROXY -->|"gRPC internal"| SERVER
    SERVER --> GRAPH

    GRAPH -->|"host.docker.internal:11434"| T1
    GRAPH -->|"host.docker.internal:9200"| T2
    SERVER -->|"host.docker.internal:5432"| T3

    T1 -->|"secure tunnel"| OL
    T2 -->|"secure tunnel"| ES
    T3 -->|"secure tunnel"| PG

    SERVER -->|"traces HTTP"| LF
    DEV -->|"browser direct"| LF

Key networking detail: Docker does not talk to the kubectl tunnels directly. The path is:

Docker container
  → host.docker.internal (resolves to macOS host via extra_hosts)
    → kubectl port-forward (active on macOS)
      → Vultr K8s service

Langfuse is the exception — it has a public IP (45.77.119.180) and is accessed directly, without a tunnel.


3. Component Inventory

| Component | File | Role |
| --- | --- | --- |
| gRPC server | server.py | Entry point for all AI requests. Manages session store, model selection, and state initialization |
| HTTP proxy | openai_proxy.py | OpenAI + Ollama compatible HTTP layer. Translates REST → gRPC |
| LangGraph orchestrator | graph.py | Builds and executes the agentic routing graph |
| Prompt definitions | prompts.py | All prompt templates in one place: classifier, reformulator, generators, platform |
| Agent state | state.py | AgentState TypedDict shared across all graph nodes |
| LLM factory | utils/llm_factory.py | Provider-agnostic model instantiation (Ollama, OpenAI, Anthropic, Bedrock) |
| Embedding factory | utils/emb_factory.py | Provider-agnostic embedding model instantiation |
| Evaluation pipeline | evaluate.py | RAGAS evaluation with Claude as judge |
| Proto contract | protos/brunix.proto | Source of truth for the gRPC API |

Model slots:

| Slot | Env var | Used for | Current model |
| --- | --- | --- | --- |
| Main | OLLAMA_MODEL_NAME | RETRIEVAL, CODE_GENERATION, classification | qwen3:1.7b |
| Conversational | OLLAMA_MODEL_NAME_CONVERSATIONAL | CONVERSATIONAL, PLATFORM | qwen3:0.6b |
| Embeddings | OLLAMA_EMB_MODEL_NAME | Query embedding, document indexing | bge-m3 |
| Evaluation judge | ANTHROPIC_MODEL | RAGAS scoring | claude-sonnet-4-20250514 |
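As a sketch of how the slots map to query types (the `MODEL_SLOTS` dict and helper name are illustrative, not the actual utils/llm_factory.py API):

```python
import os

# Illustrative sketch only. The env var names and defaults come from the
# table above; the function name is an assumption, not the factory's API.
MODEL_SLOTS = {
    "main": os.getenv("OLLAMA_MODEL_NAME", "qwen3:1.7b"),
    "conversational": os.getenv("OLLAMA_MODEL_NAME_CONVERSATIONAL", "qwen3:0.6b"),
    "embeddings": os.getenv("OLLAMA_EMB_MODEL_NAME", "bge-m3"),
}

def model_for_query_type(query_type: str) -> str:
    """Map a classified query type to its model slot (mirrors RC-05)."""
    if query_type in ("CONVERSATIONAL", "PLATFORM"):
        return MODEL_SLOTS["conversational"]
    return MODEL_SLOTS["main"]
```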

4. Request Lifecycle

Two entry paths, both ending at the same LangGraph:

sequenceDiagram
    participant C as Client
    participant P as openai_proxy.py :8000
    participant S as server.py :50051
    participant G as graph.py
    participant O as Ollama (via tunnel)
    participant E as Elasticsearch (via tunnel)

    C->>P: POST /v1/chat/completions
    P->>P: parse user field (editor_content, selected_text, extra_context, user_info)
    P->>S: gRPC AskAgent / AskAgentStream

    S->>S: base64 decode context fields
    S->>S: load session_store + classify_history_store
    S->>G: invoke graph with AgentState

    G->>O: classify (LLM call)
    O-->>G: query_type + use_editor_context

    alt RETRIEVAL or CODE_GENERATION
        G->>O: reformulate query
        O-->>G: reformulated_query
        G->>O: embed reformulated_query
        O-->>G: query_vector
        G->>E: BM25 search + kNN search
        E-->>G: ranked chunks (RRF fusion)
        G->>O: generate with context
        O-->>G: response
    else CONVERSATIONAL
        G->>O: respond_conversational (no retrieval)
        O-->>G: response
    else PLATFORM
        G->>O: respond_platform (no retrieval, uses extra_context)
        O-->>G: response
    end

    G-->>S: final_state
    S->>S: update session_store + classify_history_store
    S-->>P: AgentResponse stream
    P-->>C: SSE / JSON response

5. Query Classification & Routing

The classifier is the most critical node. It determines the entire execution path for a request.

Taxonomy

| Type | Intent | RAG | Model slot | Prompt |
| --- | --- | --- | --- | --- |
| RETRIEVAL | Understand AVAP language concepts | Yes | main | GENERATE_PROMPT |
| CODE_GENERATION | Produce working AVAP code | Yes | main | CODE_GENERATION_PROMPT |
| CONVERSATIONAL | Rephrase or continue prior answer | No | conversational | CONVERSATIONAL_PROMPT |
| PLATFORM | Account, metrics, usage, billing | No | conversational | PLATFORM_PROMPT |

Classification pipeline

flowchart TD
    Q[Incoming query] --> FP{Fast-path check\n_is_platform_query}
    FP -->|known platform prefix| PLATFORM[PLATFORM\nno LLM call]
    FP -->|no match| LLM[LLM Classifier\nqwen3:1.7b]

    LLM --> IH[Intent history\nlast 6 entries as context]
    IH --> OUT[Output: TYPE + EDITOR/NO_EDITOR]

    OUT --> R[RETRIEVAL]
    OUT --> C[CODE_GENERATION]
    OUT --> V[CONVERSATIONAL]
    OUT --> PL[PLATFORM]

Intent history — solving anchoring bias

The classifier does not receive raw conversation messages. It receives a compact trace of prior classifications:

[RETRIEVAL] "What is addVar in AVAP?"
[CODE_GENERATION] "Write an API endpoint that retur"
[PLATFORM] "You have a project usage percentag"

Why: A 1.7B model receiving full message history computes P(type | history) instead of P(type | message_content) — it biases toward the dominant type of the session. The intent history gives the classifier enough context to resolve ambiguous references ("this", "esto") without the topical noise that causes anchoring.

Rule enforced in prompt: <history_rule> — the distribution of previous intents must not influence the prior probability of the current classification.

Routing contract (from ADR-0008)

| Rule | Description | Priority |
| --- | --- | --- |
| RC-01 | Known platform prefix → PLATFORM without LLM | Highest |
| RC-02 | Usage metrics / quota data in message → PLATFORM | High |
| RC-03 | History resolves references only, never predicts type | Medium |
| RC-04 | PLATFORM and CONVERSATIONAL never touch Elasticsearch | Medium |
| RC-05 | RETRIEVAL/CODE_GENERATION → main model; CONVERSATIONAL/PLATFORM → conversational model | Medium |
| RC-06 | Intent history capped at 6 entries | Low |

6. LangGraph Workflow

Two graphs are built at startup and reused across all requests:

build_graph — used by AskAgent (non-streaming)

flowchart TD
    START([start]) --> CL[classify]
    CL -->|RETRIEVAL| RF[reformulate]
    CL -->|CODE_GENERATION| RF
    CL -->|CONVERSATIONAL| RC[respond_conversational]
    CL -->|PLATFORM| RP[respond_platform]
    RF --> RT[retrieve]
    RT -->|RETRIEVAL| GE[generate]
    RT -->|CODE_GENERATION| GC[generate_code]
    GE --> END([end])
    GC --> END
    RC --> END
    RP --> END

build_prepare_graph — used by AskAgentStream (streaming)

This graph only runs the preparation phase. The generation is handled outside the graph by llm.stream() to enable true token-by-token streaming.

flowchart TD
    START([start]) --> CL[classify]
    CL -->|RETRIEVAL| RF[reformulate]
    CL -->|CODE_GENERATION| RF
    CL -->|CONVERSATIONAL| SK[skip_retrieve]
    CL -->|PLATFORM| SK
    RF --> RT[retrieve]
    RT --> END([end])
    SK --> END

After prepare_graph returns, server.py calls build_final_messages(prepared) to construct the prompt and then streams directly from the selected LLM.


7. RAG Pipeline — Hybrid Search

Only RETRIEVAL and CODE_GENERATION queries reach this pipeline.

flowchart TD
    Q[reformulated_query] --> EMB[embed_query\nbge-m3]
    Q --> BM25[BM25 search\nElasticsearch multi_match\nfields: content^2 text^2\nfuzziness: AUTO]
    EMB --> KNN[kNN search\nElasticsearch\nfield: embedding\nk=8 num_candidates=40]

    BM25 --> RRF[RRF Fusion\n1 / rank+60 per hit\ncombined score]
    KNN --> RRF

    RRF --> RANK[Ranked docs\ntop-8]
    RANK --> FMT[format_context\nchunk headers: id type block section source\nAVAP CODE flag for code blocks]
    FMT --> CTX[context string\ninjected into generation prompt]

Why hybrid search: BM25 is strong for exact AVAP command names (registerEndpoint, addVar) which are rare in embedding space. kNN captures semantic similarity. RRF fusion combines both without requiring score normalization.
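The fusion step in the diagram can be sketched as a small standalone function. This is a minimal version of the `1 / (rank + 60)` rule; the in-graph implementation may differ in tie-breaking and document ID handling:

```python
# Minimal Reciprocal Rank Fusion sketch: each hit contributes
# 1 / (rank + k) with k = 60, summed across the BM25 and kNN result lists.
def rrf_fuse(bm25_ids: list[str], knn_ids: list[str],
             k: int = 60, top_n: int = 8) -> list[str]:
    scores: dict[str, float] = {}
    for ranked in (bm25_ids, knn_ids):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank + k)
    # Documents appearing in both lists accumulate two contributions
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

Scores need no normalization because only ranks matter, which is exactly why RRF works across BM25 and cosine-similarity scales.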

Elasticsearch index schema:

| Field | Type | Description |
| --- | --- | --- |
| content / text | text | Chunk text (BM25 searchable) |
| embedding | dense_vector | bge-m3 embedding (kNN searchable) |
| doc_type | keyword | code, spec, code_example, bnf |
| block_type | keyword | function, if, startLoop, try, etc. |
| section | keyword | Document section heading |
| source_file | keyword | Origin file |
| chunk_id | keyword | Unique chunk identifier |
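Putting the schema and the pipeline diagram together, the two search request bodies look roughly like this. Field names come from the table above and the parameter values (k=8, num_candidates=40, field boosts, fuzziness) mirror the diagram; the exact bodies built in graph.py may differ:

```python
# Hedged sketch of the two Elasticsearch request bodies used for hybrid
# search. Values mirror the pipeline diagram; graph.py is authoritative.
def build_hybrid_queries(query_text: str, query_vector: list[float]) -> tuple[dict, dict]:
    bm25_body = {
        "query": {
            "multi_match": {
                "query": query_text,
                "fields": ["content^2", "text^2"],  # boosted text fields
                "fuzziness": "AUTO",
            }
        },
        "size": 8,
    }
    knn_body = {
        "knn": {
            "field": "embedding",       # dense_vector field from the schema
            "query_vector": query_vector,
            "k": 8,
            "num_candidates": 40,
        }
    }
    return bm25_body, knn_body
```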

8. Streaming Architecture

AskAgentStream implements true token-by-token streaming. It does not buffer the full response.

sequenceDiagram
    participant C as Client
    participant S as server.py
    participant PG as prepare_graph
    participant LLM as Ollama LLM

    C->>S: AskAgentStream(request)
    S->>PG: invoke prepare_graph(initial_state)
    Note over PG: classify → reformulate → retrieve<br/>(or skip_retrieve for CONV/PLATFORM)
    PG-->>S: prepared state (query_type, context, messages)

    S->>S: build_final_messages(prepared)
    S->>S: select active_llm based on query_type

    loop token streaming
        S->>LLM: active_llm.stream(final_messages)
        LLM-->>S: chunk.content
        S-->>C: AgentResponse(text=token, is_final=false)
    end

    S-->>C: AgentResponse(text="", is_final=true)
    S->>S: update session_store + classify_history_store

Model selection in stream path:

active_llm = self.llm_conversational if query_type in ("CONVERSATIONAL", "PLATFORM") else self.llm

9. Editor Context Pipeline

Implemented in PRD-0002. The VS Code extension can send three optional context fields with each query.

flowchart TD
    REQ[AgentRequest] --> B64[base64 decode\neditor_content\nselected_text\nextra_context]
    B64 --> STATE[AgentState]

    STATE --> CL[classify node\n_build_classify_prompt]
    CL --> ED{use_editor_context?}

    ED -->|EDITOR| INJ[inject into prompt\n1 selected_text — highest priority\n2 editor_content — file context\n3 RAG chunks\n4 extra_context]
    ED -->|NO_EDITOR| NONINJ[standard prompt\nno editor content injected]

    INJ --> GEN[generation node]
    NONINJ --> GEN

Classifier output format: Two tokens — TYPE EDITOR_SIGNAL

Examples: RETRIEVAL NO_EDITOR, CODE_GENERATION EDITOR, PLATFORM NO_EDITOR

EDITOR is set only when the user explicitly refers to the code in their editor: "this code", "fix this", "que hace esto", "explain this selection".
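The injection order in the flowchart can be sketched like this. The section labels are hypothetical; the actual prompt assembly lives in graph.py and prompts.py:

```python
# Hedged sketch of the priority order: selected_text, then editor_content,
# then RAG chunks, then extra_context. Heading strings are illustrative.
def build_context_block(state: dict) -> str:
    parts = []
    if state.get("use_editor_context"):  # set by the classifier (EDITOR signal)
        if state.get("selected_text"):
            parts.append("## Selected code (highest priority)\n" + state["selected_text"])
        if state.get("editor_content"):
            parts.append("## Open file\n" + state["editor_content"])
    if state.get("context"):
        parts.append("## Retrieved documentation\n" + state["context"])
    if state.get("extra_context"):
        parts.append("## Extra context\n" + state["extra_context"])
    return "\n\n".join(parts)
```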


10. Platform Query Pipeline

Implemented in PRD-0003. Queries about account, metrics, usage, or billing bypass RAG entirely.

flowchart TD
    Q[Incoming query] --> FP{_is_platform_query?}
    FP -->|yes — known prefix| SKIP[skip classifier LLM\nroute = PLATFORM]
    FP -->|no| CL[LLM classifier]
    CL -->|PLATFORM| ROUTE[route to respond_platform]
    SKIP --> ROUTE

    ROUTE --> PROMPT["PLATFORM_PROMPT\n+ extra_context injection\n+ user_info available"]
    PROMPT --> LLM[qwen3:0.6b\nconversational model slot]
    LLM --> RESP[response]

    style SKIP fill:#2d6a4f,color:#fff
    style ROUTE fill:#2d6a4f,color:#fff

No Elasticsearch call is made for PLATFORM queries. The data is already in the request via extra_context and user_info injected by the caller (AVS Platform).


11. Session State & Intent History

Two stores are maintained per session in memory:

flowchart LR
    subgraph Stores ["In-memory stores (server.py)"]
        SS["session_store\ndict[session_id → list[BaseMessage]]\nfull conversation history\nused by generation nodes"]
        CHS["classify_history_store\ndict[session_id → list[ClassifyEntry]]\ncompact intent trace\nused by classifier"]
    end

    subgraph Entry ["ClassifyEntry (state.py)"]
        CE["type: str\nRETRIEVAL | CODE_GENERATION\nCONVERSATIONAL | PLATFORM\n─────────────────\ntopic: str\n60-char query snippet"]
    end

    CHS --> CE

classify_history_store is also a data flywheel. Every session generates labeled (topic, type) pairs automatically. When sufficient sessions accumulate (~500), this store can be exported to train the Layer 2 embedding classifier described in ADR-0008 Future Path — eliminating the need for the LLM classifier on the majority of requests.

AgentState fields

class AgentState(TypedDict):
    # Core
    messages:           Annotated[list, add_messages]  # full conversation
    session_id:         str
    query_type:         str                             # RETRIEVAL | CODE_GENERATION | CONVERSATIONAL | PLATFORM
    reformulated_query: str
    context:            str                             # RAG retrieved context

    # Classifier intent history
    classify_history:   list[ClassifyEntry]             # compact trace, persisted across turns

    # Editor context (PRD-0002)
    editor_content:     str                             # base64 decoded
    selected_text:      str                             # base64 decoded
    extra_context:      str                             # base64 decoded
    user_info:          str                             # JSON: {dev_id, project_id, org_id}
    use_editor_context: bool                            # set by classifier

12. Evaluation Pipeline

EvaluateRAG runs an automated quality benchmark using RAGAS with Claude as the judge. It does not share infrastructure with the main request pipeline.

flowchart TD
    GD[golden_dataset.json\n50 Q&A pairs with ground_truth] --> FILTER[category + limit filter]
    FILTER --> LOOP[for each question]

    LOOP --> RET[retrieve_context\nhybrid BM25+kNN\nsame as main pipeline]
    RET --> GEN[generate_answer\nusing main LLM]
    GEN --> ROW[dataset row\nquestion · answer · contexts · ground_truth]

    ROW --> DS[HuggingFace Dataset]
    DS --> RAGAS[ragas.evaluate\nfaithfulness\nanswer_relevancy\ncontext_recall\ncontext_precision]

    RAGAS --> JUDGE[Claude judge\nclaude-sonnet-4-20250514\nRateLimitedChatAnthropic\n3s delay between calls]
    JUDGE --> SCORES[per-metric scores]
    SCORES --> GLOBAL[global_score = mean of valid metrics]
    GLOBAL --> VERDICT{verdict}
    VERDICT -->|≥ 0.80| EX[EXCELLENT]
    VERDICT -->|≥ 0.60| AC[ACCEPTABLE]
    VERDICT -->|< 0.60| IN[INSUFFICIENT]

Important: EvaluateRAG uses RateLimitedChatAnthropic — a subclass that injects a 3-second delay between calls to respect Anthropic API rate limits. RAGAS is configured with max_workers=1.
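The pacing idea behind RateLimitedChatAnthropic can be sketched generically. The real class subclasses the LangChain chat model; this wrapper only illustrates the minimum-interval delay, and the class and method names here are not the real ones:

```python
import time

# Generic pacing wrapper (illustrative). The production class is a
# ChatAnthropic subclass with a 3-second minimum interval between calls.
class RateLimited:
    def __init__(self, inner, min_interval: float = 3.0):
        self.inner = inner
        self.min_interval = min_interval
        self._last_call = 0.0

    def invoke(self, *args, **kwargs):
        wait = self.min_interval - (time.monotonic() - self._last_call)
        if wait > 0:
            time.sleep(wait)  # pace calls to respect API rate limits
        self._last_call = time.monotonic()
        return self.inner.invoke(*args, **kwargs)
```

Combined with max_workers=1, this serializes judge calls so the evaluation run never bursts past the Anthropic rate limit.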


13. Data Ingestion Pipeline

Two independent pipelines populate the Elasticsearch index. Both require the Ollama tunnel active on localhost:11434 and Elasticsearch on localhost:9200.

flowchart TD
    subgraph PipelineA ["Pipeline A — Chonkie (recommended)"]
        DA[docs/**/*.md\ndocs/**/*.avap] --> CA[Chonkie\nMarkdownChef + TokenChunker]
        CA --> EA[OllamaEmbeddings\nbatch embed]
        EA --> ESA[Elasticsearch bulk index]
    end

    subgraph PipelineB ["Pipeline B — AVAP Native"]
        DB[docs/**/*.avap\ndocs/**/*.md] --> CB[avap_chunker.py\nGenericLexer + LanguageConfig\nblock detection + semantic tags\nMinHash LSH dedup]
        CB --> JSONL[chunks.jsonl]
        JSONL --> EB[avap_ingestor.py\nasync producer/consumer\nOllamaAsyncEmbedder\nbatch=8]
        EB --> ESB[Elasticsearch async_bulk\nbatch=50\nDeadLetterQueue on failure]
    end

    ESA --> IDX[(avap-knowledge-v2\nElasticsearch index)]
    ESB --> IDX

Pipeline B produces richer metadata — block type, section, semantic tags (uses_orm, uses_http, uses_auth, etc.), complexity score, and MinHash deduplication. Use it for .avap files that need full semantic analysis.
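The MinHash deduplication step can be illustrated with a toy stdlib-only version. The real pipeline uses MinHash LSH inside avap_chunker.py; this sketch only shows the principle of estimating Jaccard similarity from hashed shingles:

```python
import hashlib

# Toy MinHash sketch (illustrative, not the avap_chunker.py implementation).
def minhash_signature(text: str, num_perm: int = 64, shingle_len: int = 4) -> tuple:
    shingles = {text[i:i + shingle_len]
                for i in range(max(1, len(text) - shingle_len + 1))}
    sig = []
    for seed in range(num_perm):
        # Simulate one hash permutation per seed; keep the minimum hash
        sig.append(min(
            int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingles
        ))
    return tuple(sig)

def estimated_jaccard(a: str, b: str) -> float:
    sa, sb = minhash_signature(a), minhash_signature(b)
    return sum(x == y for x, y in zip(sa, sb)) / len(sa)

def is_near_duplicate(a: str, b: str, threshold: float = 0.8) -> bool:
    return estimated_jaccard(a, b) >= threshold
```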


14. Observability Stack

flowchart LR
    S[server.py] -->|LANGFUSE_HOST\nLANGFUSE_PUBLIC_KEY\nLANGFUSE_SECRET_KEY| LF[Langfuse\n45.77.119.180:80]
    LF --> TR[Traces\nper-request spans]
    LF --> ME[Metrics\nlatency · token counts]
    LF --> EV[Evaluation scores\nfaithfulness · relevancy]

Langfuse is accessed directly via public IP — no kubectl tunnel required. The engine sends traces automatically on every request when the Langfuse environment variables are set.


15. Known Limitations & Future Work

Active tactical debt

| Item | Description | ADR |
| --- | --- | --- |
| LLM classifier | Generative model doing discriminative work — non-deterministic, pays full inference cost for a 4-class label | ADR-0008 |
| RC-02 is soft | Platform data signal enforced via prompt <platform_priority_rule>, not code — can be overridden by model | ADR-0008 |
| classify_history not exported | Data flywheel accumulates but has no export mechanism yet | ADR-0008 |
| user_info unused | dev_id, project_id, org_id are in state but not consumed by any graph node | PRD-0002 |
| CONFIDENCE_PROMPT_TEMPLATE unused | Self-RAG capability is scaffolded in prompts.py but not wired into the graph | (none) |

Roadmap (ADR-0008 Future Path)

flowchart LR
    P0["Now\nLLM classifier\n~95% of traffic"] --> P1["Phase 1\nExport classify_history_store\nlabeled dataset"]
    P1 --> P2["Phase 2\nEmbedding classifier Layer 2\nbge-m3 + logistic regression\n~1ms CPU"]
    P2 --> P3["Phase 3\nCaller-declared query_type\nproto field 7"]
    P3 --> P4["Phase 4\nLLM classifier = anomaly handler\n<2% of traffic"]

Target steady-state: the LLM classifier handles fewer than 2% of requests — only genuinely ambiguous queries that neither hard rules nor the trained embedding classifier can resolve with confidence.
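The Phase 2 layer can be sketched with a stdlib-only stand-in. The roadmap names bge-m3 embeddings feeding a logistic regression; the nearest-centroid version below substitutes for the trained model but shows the shape of the fast path, including the confidence fallback to the LLM classifier. All names and the 0.85 floor are assumptions:

```python
import math

# Illustrative Phase 2 stand-in: classify query embeddings cheaply, and
# abstain (return None) below a confidence floor so the LLM classifier
# handles the ambiguous remainder, as in the Phase 4 target.
def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

class EmbeddingClassifier:
    def __init__(self):
        self.centroids: dict[str, list[float]] = {}

    def fit(self, vectors: list[list[float]], labels: list[str]) -> None:
        by_label: dict[str, list[list[float]]] = {}
        for v, y in zip(vectors, labels):
            by_label.setdefault(y, []).append(v)
        for y, vs in by_label.items():
            self.centroids[y] = [sum(col) / len(vs) for col in zip(*vs)]

    def predict(self, vector: list[float], confidence_floor: float = 0.85):
        best = max(self.centroids, key=lambda y: cosine(vector, self.centroids[y]))
        score = cosine(vector, self.centroids[best])
        return (best, score) if score >= confidence_floor else (None, score)
```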