assistance-engine/docs/ARCHITECTURE.md

Brunix Assistance Engine — Architecture Reference

Audience: Engineers contributing to this repository, architects reviewing the system design, and operators responsible for its deployment.
Last updated: 2026-04-09
Version: 1.8.x


Table of Contents

  1. System Classification
  2. Infrastructure Layout
  3. Component Inventory
  4. Request Lifecycle
  5. Query Classification & Routing
  6. LangGraph Workflow
  7. RAG Pipeline — Hybrid Search
  8. Streaming Architecture
  9. Editor Context Pipeline
  10. Platform Query Pipeline
  11. Session State & Intent History
  12. Evaluation Pipeline
  13. Data Ingestion Pipeline
  14. Observability Stack
  15. Known Limitations & Future Work

1. System Classification

The Brunix Assistance Engine is a Modular Agentic RAG system.

| Characteristic | Classification | Evidence |
| --- | --- | --- |
| Pipeline structure | Agentic RAG | LangGraph classifier routes each query to a different pipeline — not a fixed retrieve-then-generate chain |
| Retrieval strategy | Advanced RAG | Hybrid BM25+kNN with RRF fusion, query reformulation before retrieval, two-phase streaming |
| Architecture style | Modular RAG | Each node (classify, reformulate, retrieve, generate) is independently replaceable; routing contract is implementation-independent |
| Self-evaluation | Not yet active | CONFIDENCE_PROMPT_TEMPLATE exists but is not wired into the graph — Self-RAG capability is scaffolded |

What it is not: naive RAG (there is no fixed retrieve-then-generate chain), Graph RAG (no knowledge graph), or Self-RAG (no confidence feedback loop, yet).

The classifier is the central architectural element. It determines which pipeline executes for each request — including whether retrieval happens at all.


2. Infrastructure Layout

graph TD
    subgraph Clients ["External Clients"]
        VSC["VS Code Extension"]
        AVS["AVS Platform"]
        DEV["grpcurl / tests"]
    end

    subgraph Docker ["Docker — brunix-assistance-engine"]
        PROXY["openai_proxy.py\n:8000 HTTP/SSE\nOpenAI + Ollama compatible"]
        SERVER["server.py\n:50051 gRPC internal\nAskAgent · AskAgentStream · EvaluateRAG"]
        GRAPH["graph.py — LangGraph\nclassify → route → execute"]
    end

    subgraph Mac ["Developer Machine (macOS)"]
        Docker
        subgraph Tunnels ["kubectl port-forward tunnels"]
            T1["localhost:11434 → Ollama"]
            T2["localhost:9200  → Elasticsearch"]
            T3["localhost:5432  → Postgres"]
        end
    end

    subgraph Vultr ["Vultr — Devaron K8s Cluster"]
        OL["ollama-light-service\nqwen3:1.7b · qwen3:0.6b · bge-m3"]
        ES["brunix-vector-db\nElasticsearch :9200\navap-knowledge-v2"]
        PG["brunix-postgres\nPostgres :5432\nLangfuse data"]
        LF["Langfuse UI\n45.77.119.180:80\nObservability + tracing"]
    end

    VSC -->|"HTTP/SSE :8000"| PROXY
    AVS -->|"HTTP/SSE :8000"| PROXY
    DEV -->|"gRPC :50051"| SERVER

    PROXY -->|"gRPC internal"| SERVER
    SERVER --> GRAPH

    GRAPH -->|"host.docker.internal:11434"| T1
    GRAPH -->|"host.docker.internal:9200"| T2
    SERVER -->|"host.docker.internal:5432"| T3

    T1 -->|"secure tunnel"| OL
    T2 -->|"secure tunnel"| ES
    T3 -->|"secure tunnel"| PG

    SERVER -->|"traces HTTP"| LF
    DEV -->|"browser direct"| LF

Key networking detail: Docker does not talk to the kubectl tunnels directly. The path is:

Docker container
  → host.docker.internal (resolves to macOS host via extra_hosts)
    → kubectl port-forward (active on macOS)
      → Vultr K8s service

Langfuse is the exception — it has a public IP (45.77.119.180) and is accessed directly, without a tunnel.


3. Component Inventory

| Component | File | Role |
| --- | --- | --- |
| gRPC server | server.py | Entry point for all AI requests. Manages session store, model selection, and state initialization |
| HTTP proxy | openai_proxy.py | OpenAI + Ollama compatible HTTP layer. Translates REST → gRPC |
| LangGraph orchestrator | graph.py | Builds and executes the agentic routing graph |
| Prompt definitions | prompts.py | All prompt templates in one place: classifier, reformulator, generators, platform |
| Agent state | state.py | AgentState TypedDict shared across all graph nodes |
| LLM factory | utils/llm_factory.py | Provider-agnostic model instantiation (Ollama, OpenAI, Anthropic, Bedrock) |
| Embedding factory | utils/emb_factory.py | Provider-agnostic embedding model instantiation |
| Evaluation pipeline | evaluate.py | RAGAS evaluation with Claude as judge |
| Proto contract | protos/brunix.proto | Source of truth for the gRPC API |

Model slots:

| Slot | Env var | Used for | Current model |
| --- | --- | --- | --- |
| Main | OLLAMA_MODEL_NAME | RETRIEVAL, CODE_GENERATION, classification | qwen3:1.7b |
| Conversational | OLLAMA_MODEL_NAME_CONVERSATIONAL | CONVERSATIONAL, PLATFORM | qwen3:0.6b |
| Embeddings | OLLAMA_EMB_MODEL_NAME | Query embedding, document indexing | bge-m3 |
| Evaluation judge | ANTHROPIC_MODEL | RAGAS scoring | claude-sonnet-4-20250514 |
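As a sketch of how the slots map to query types (the `MODEL_SLOTS` dict and helper name are illustrative, not the actual utils/llm_factory.py API):

```python
import os

# Illustrative sketch only. The env var names and defaults come from the
# table above; the function name is an assumption, not the factory's API.
MODEL_SLOTS = {
    "main": os.getenv("OLLAMA_MODEL_NAME", "qwen3:1.7b"),
    "conversational": os.getenv("OLLAMA_MODEL_NAME_CONVERSATIONAL", "qwen3:0.6b"),
    "embeddings": os.getenv("OLLAMA_EMB_MODEL_NAME", "bge-m3"),
}

def model_for_query_type(query_type: str) -> str:
    """Map a classified query type to its model slot (mirrors RC-05)."""
    if query_type in ("CONVERSATIONAL", "PLATFORM"):
        return MODEL_SLOTS["conversational"]
    return MODEL_SLOTS["main"]
```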

4. Request Lifecycle

Two entry paths, both ending at the same LangGraph:

sequenceDiagram
    participant C as Client
    participant P as openai_proxy.py :8000
    participant S as server.py :50051
    participant G as graph.py
    participant O as Ollama (via tunnel)
    participant E as Elasticsearch (via tunnel)

    C->>P: POST /v1/chat/completions
    P->>P: parse user field (editor_content, selected_text, extra_context, user_info)
    P->>S: gRPC AskAgent / AskAgentStream

    S->>S: base64 decode context fields
    S->>S: load session_store + classify_history_store
    S->>G: invoke graph with AgentState

    G->>O: classify (LLM call)
    O-->>G: query_type + use_editor_context

    alt RETRIEVAL or CODE_GENERATION
        G->>O: reformulate query
        O-->>G: reformulated_query
        G->>O: embed reformulated_query
        O-->>G: query_vector
        G->>E: BM25 search + kNN search
        E-->>G: ranked chunks (RRF fusion)
        G->>O: generate with context
        O-->>G: response
    else CONVERSATIONAL
        G->>O: respond_conversational (no retrieval)
        O-->>G: response
    else PLATFORM
        G->>O: respond_platform (no retrieval, uses extra_context)
        O-->>G: response
    end

    G-->>S: final_state
    S->>S: update session_store + classify_history_store
    S-->>P: AgentResponse stream
    P-->>C: SSE / JSON response

5. Query Classification & Routing

The classifier is the most critical node. It determines the entire execution path for a request.

Taxonomy

| Type | Intent | RAG | Model slot | Prompt |
| --- | --- | --- | --- | --- |
| RETRIEVAL | Understand AVAP language concepts | Yes | main | GENERATE_PROMPT |
| CODE_GENERATION | Produce working AVAP code | Yes | main | CODE_GENERATION_PROMPT |
| CONVERSATIONAL | Rephrase or continue prior answer | No | conversational | CONVERSATIONAL_PROMPT |
| PLATFORM | Account, metrics, usage, billing | No | conversational | PLATFORM_PROMPT |

Classification pipeline

flowchart TD
    Q[Incoming query] --> FP{Fast-path check\n_is_platform_query}
    FP -->|known platform prefix| PLATFORM[PLATFORM\nno LLM call]
    FP -->|no match| LLM[LLM Classifier\nqwen3:1.7b]

    LLM --> IH[Intent history\nlast 6 entries as context]
    IH --> OUT[Output: TYPE + EDITOR/NO_EDITOR]

    OUT --> R[RETRIEVAL]
    OUT --> C[CODE_GENERATION]
    OUT --> V[CONVERSATIONAL]
    OUT --> PL[PLATFORM]

Intent history — solving anchoring bias

The classifier does not receive raw conversation messages. It receives a compact trace of prior classifications:

[RETRIEVAL] "What is addVar in AVAP?"
[CODE_GENERATION] "Write an API endpoint that retur"
[PLATFORM] "You have a project usage percentag"

Why: A 1.7B model receiving full message history computes P(type | history) instead of P(type | message_content) — it biases toward the dominant type of the session. The intent history gives the classifier enough context to resolve ambiguous references ("this", "esto") without the topical noise that causes anchoring.

Rule enforced in prompt: <history_rule> — the distribution of previous intents must not influence the prior probability of the current classification.

Routing contract (from ADR-0008)

| Rule | Description | Priority |
| --- | --- | --- |
| RC-01 | Known platform prefix → PLATFORM without LLM | Highest |
| RC-02 | Usage metrics / quota data in message → PLATFORM | High |
| RC-03 | History resolves references only, never predicts type | Medium |
| RC-04 | PLATFORM and CONVERSATIONAL never touch Elasticsearch | Medium |
| RC-05 | RETRIEVAL/CODE_GENERATION → main model; CONVERSATIONAL/PLATFORM → conversational model | Medium |
| RC-06 | Intent history capped at 6 entries | Low |

6. LangGraph Workflow

Two graphs are built at startup and reused across all requests:

build_graph — used by AskAgent (non-streaming)

flowchart TD
    START([start]) --> CL[classify]
    CL -->|RETRIEVAL| RF[reformulate]
    CL -->|CODE_GENERATION| RF
    CL -->|CONVERSATIONAL| RC[respond_conversational]
    CL -->|PLATFORM| RP[respond_platform]
    RF --> RT[retrieve]
    RT -->|RETRIEVAL| GE[generate]
    RT -->|CODE_GENERATION| GC[generate_code]
    GE --> END([end])
    GC --> END
    RC --> END
    RP --> END

build_prepare_graph — used by AskAgentStream (streaming)

This graph only runs the preparation phase. The generation is handled outside the graph by llm.stream() to enable true token-by-token streaming.

flowchart TD
    START([start]) --> CL[classify]
    CL -->|RETRIEVAL| RF[reformulate]
    CL -->|CODE_GENERATION| RF
    CL -->|CONVERSATIONAL| SK[skip_retrieve]
    CL -->|PLATFORM| SK
    RF --> RT[retrieve]
    RT --> END([end])
    SK --> END

After prepare_graph returns, server.py calls build_final_messages(prepared) to construct the prompt and then streams directly from the selected LLM.


7. RAG Pipeline — Hybrid Search

Only RETRIEVAL and CODE_GENERATION queries reach this pipeline.

flowchart TD
    Q[reformulated_query] --> EMB[embed_query\nbge-m3]
    Q --> BM25[BM25 search\nElasticsearch multi_match\nfields: content^2 text^2\nfuzziness: AUTO]
    EMB --> KNN[kNN search\nElasticsearch\nfield: embedding\nk=8 num_candidates=40]

    BM25 --> RRF[RRF Fusion\n1 / rank+60 per hit\ncombined score]
    KNN --> RRF

    RRF --> RANK[Ranked docs\ntop-8]
    RANK --> FMT[format_context\nchunk headers: id type block section source\nAVAP CODE flag for code blocks]
    FMT --> CTX[context string\ninjected into generation prompt]

Why hybrid search: BM25 is strong for exact AVAP command names (registerEndpoint, addVar) which are rare in embedding space. kNN captures semantic similarity. RRF fusion combines both without requiring score normalization.
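The fusion step in the diagram can be sketched as a small standalone function. This is a minimal version of the `1 / (rank + 60)` rule; the in-graph implementation may differ in tie-breaking and document ID handling:

```python
# Minimal Reciprocal Rank Fusion sketch: each hit contributes
# 1 / (rank + k) with k = 60, summed across the BM25 and kNN result lists.
def rrf_fuse(bm25_ids: list[str], knn_ids: list[str],
             k: int = 60, top_n: int = 8) -> list[str]:
    scores: dict[str, float] = {}
    for ranked in (bm25_ids, knn_ids):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank + k)
    # Documents appearing in both lists accumulate two contributions
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

Scores need no normalization because only ranks matter, which is exactly why RRF works across BM25 and cosine-similarity scales.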

Elasticsearch index schema:

| Field | Type | Description |
| --- | --- | --- |
| content / text | text | Chunk text (BM25 searchable) |
| embedding | dense_vector | bge-m3 embedding (kNN searchable) |
| doc_type | keyword | code, spec, code_example, bnf |
| block_type | keyword | function, if, startLoop, try, etc. |
| section | keyword | Document section heading |
| source_file | keyword | Origin file |
| chunk_id | keyword | Unique chunk identifier |
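Putting the schema and the pipeline diagram together, the two search request bodies look roughly like this. Field names come from the table above and the parameter values (k=8, num_candidates=40, field boosts, fuzziness) mirror the diagram; the exact bodies built in graph.py may differ:

```python
# Hedged sketch of the two Elasticsearch request bodies used for hybrid
# search. Values mirror the pipeline diagram; graph.py is authoritative.
def build_hybrid_queries(query_text: str, query_vector: list[float]) -> tuple[dict, dict]:
    bm25_body = {
        "query": {
            "multi_match": {
                "query": query_text,
                "fields": ["content^2", "text^2"],  # boosted text fields
                "fuzziness": "AUTO",
            }
        },
        "size": 8,
    }
    knn_body = {
        "knn": {
            "field": "embedding",       # dense_vector field from the schema
            "query_vector": query_vector,
            "k": 8,
            "num_candidates": 40,
        }
    }
    return bm25_body, knn_body
```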

8. Streaming Architecture

AskAgentStream implements true token-by-token streaming. It does not buffer the full response.

sequenceDiagram
    participant C as Client
    participant S as server.py
    participant PG as prepare_graph
    participant LLM as Ollama LLM

    C->>S: AskAgentStream(request)
    S->>PG: invoke prepare_graph(initial_state)
    Note over PG: classify → reformulate → retrieve<br/>(or skip_retrieve for CONV/PLATFORM)
    PG-->>S: prepared state (query_type, context, messages)

    S->>S: build_final_messages(prepared)
    S->>S: select active_llm based on query_type

    loop token streaming
        S->>LLM: active_llm.stream(final_messages)
        LLM-->>S: chunk.content
        S-->>C: AgentResponse(text=token, is_final=false)
    end

    S-->>C: AgentResponse(text="", is_final=true)
    S->>S: update session_store + classify_history_store

Model selection in stream path:

active_llm = self.llm_conversational if query_type in ("CONVERSATIONAL", "PLATFORM") else self.llm

9. Editor Context Pipeline

Implemented in PRD-0002. The VS Code extension can send three optional context fields with each query.

flowchart TD
    REQ[AgentRequest] --> B64[base64 decode\neditor_content\nselected_text\nextra_context]
    B64 --> STATE[AgentState]

    STATE --> CL[classify node\n_build_classify_prompt]
    CL --> ED{use_editor_context?}

    ED -->|EDITOR| INJ[inject into prompt\n1 selected_text — highest priority\n2 editor_content — file context\n3 RAG chunks\n4 extra_context]
    ED -->|NO_EDITOR| NONINJ[standard prompt\nno editor content injected]

    INJ --> GEN[generation node]
    NONINJ --> GEN

Classifier output format: Two tokens — TYPE EDITOR_SIGNAL

Examples: RETRIEVAL NO_EDITOR, CODE_GENERATION EDITOR, PLATFORM NO_EDITOR

EDITOR is set only when the user explicitly refers to the code in their editor: "this code", "fix this", "que hace esto", "explain this selection".
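The injection order in the flowchart can be sketched like this. The section labels are hypothetical; the actual prompt assembly lives in graph.py and prompts.py:

```python
# Hedged sketch of the priority order: selected_text, then editor_content,
# then RAG chunks, then extra_context. Heading strings are illustrative.
def build_context_block(state: dict) -> str:
    parts = []
    if state.get("use_editor_context"):  # set by the classifier (EDITOR signal)
        if state.get("selected_text"):
            parts.append("## Selected code (highest priority)\n" + state["selected_text"])
        if state.get("editor_content"):
            parts.append("## Open file\n" + state["editor_content"])
    if state.get("context"):
        parts.append("## Retrieved documentation\n" + state["context"])
    if state.get("extra_context"):
        parts.append("## Extra context\n" + state["extra_context"])
    return "\n\n".join(parts)
```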


10. Platform Query Pipeline

Implemented in PRD-0003. Queries about account, metrics, usage, or billing bypass RAG entirely.

flowchart TD
    Q[Incoming query] --> FP{_is_platform_query?}
    FP -->|yes — known prefix| SKIP[skip classifier LLM\nroute = PLATFORM]
    FP -->|no| CL[LLM classifier]
    CL -->|PLATFORM| ROUTE[route to respond_platform]
    SKIP --> ROUTE

    ROUTE --> PROMPT["PLATFORM_PROMPT\n+ extra_context injection\n+ user_info available"]
    PROMPT --> LLM[qwen3:0.6b\nconversational model slot]
    LLM --> RESP[response]

    style SKIP fill:#2d6a4f,color:#fff
    style ROUTE fill:#2d6a4f,color:#fff

No Elasticsearch call is made for PLATFORM queries. The data is already in the request via extra_context and user_info injected by the caller (AVS Platform).


11. Session State & Intent History

Two stores are maintained per session in memory:

flowchart LR
    subgraph Stores ["In-memory stores (server.py)"]
        SS["session_store\ndict[session_id → list[BaseMessage]]\nfull conversation history\nused by generation nodes"]
        CHS["classify_history_store\ndict[session_id → list[ClassifyEntry]]\ncompact intent trace\nused by classifier"]
    end

    subgraph Entry ["ClassifyEntry (state.py)"]
        CE["type: str\nRETRIEVAL | CODE_GENERATION\nCONVERSATIONAL | PLATFORM\n─────────────────\ntopic: str\n60-char query snippet"]
    end

    CHS --> CE

classify_history_store is also a data flywheel. Every session generates labeled (topic, type) pairs automatically. When sufficient sessions accumulate (~500), this store can be exported to train the Layer 2 embedding classifier described in ADR-0008 Future Path — eliminating the need for the LLM classifier on the majority of requests.

AgentState fields

class AgentState(TypedDict):
    # Core
    messages:           Annotated[list, add_messages]  # full conversation
    session_id:         str
    query_type:         str                             # RETRIEVAL | CODE_GENERATION | CONVERSATIONAL | PLATFORM
    reformulated_query: str
    context:            str                             # RAG retrieved context

    # Classifier intent history
    classify_history:   list[ClassifyEntry]             # compact trace, persisted across turns

    # Editor context (PRD-0002)
    editor_content:     str                             # base64 decoded
    selected_text:      str                             # base64 decoded
    extra_context:      str                             # base64 decoded
    user_info:          str                             # JSON: {dev_id, project_id, org_id}
    use_editor_context: bool                            # set by classifier

12. Evaluation Pipeline

EvaluateRAG runs an automated quality benchmark using RAGAS with Claude as the judge. It does not share infrastructure with the main request pipeline.

flowchart TD
    GD[golden_dataset.json\n50 Q&A pairs with ground_truth] --> FILTER[category + limit filter]
    FILTER --> LOOP[for each question]

    LOOP --> RET[retrieve_context\nhybrid BM25+kNN\nsame as main pipeline]
    RET --> GEN[generate_answer\nusing main LLM]
    GEN --> ROW[dataset row\nquestion · answer · contexts · ground_truth]

    ROW --> DS[HuggingFace Dataset]
    DS --> RAGAS[ragas.evaluate\nfaithfulness\nanswer_relevancy\ncontext_recall\ncontext_precision]

    RAGAS --> JUDGE[Claude judge\nclaude-sonnet-4-20250514\nRateLimitedChatAnthropic\n3s delay between calls]
    JUDGE --> SCORES[per-metric scores]
    SCORES --> GLOBAL[global_score = mean of valid metrics]
    GLOBAL --> VERDICT{verdict}
    VERDICT -->|≥ 0.80| EX[EXCELLENT]
    VERDICT -->|≥ 0.60| AC[ACCEPTABLE]
    VERDICT -->|< 0.60| IN[INSUFFICIENT]

Important: EvaluateRAG uses RateLimitedChatAnthropic — a subclass that injects a 3-second delay between calls to respect Anthropic API rate limits. RAGAS is configured with max_workers=1.
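The pacing idea behind RateLimitedChatAnthropic can be sketched generically. The real class subclasses the LangChain chat model; this wrapper only illustrates the minimum-interval delay, and the class and method names here are not the real ones:

```python
import time

# Generic pacing wrapper (illustrative). The production class is a
# ChatAnthropic subclass with a 3-second minimum interval between calls.
class RateLimited:
    def __init__(self, inner, min_interval: float = 3.0):
        self.inner = inner
        self.min_interval = min_interval
        self._last_call = 0.0

    def invoke(self, *args, **kwargs):
        wait = self.min_interval - (time.monotonic() - self._last_call)
        if wait > 0:
            time.sleep(wait)  # pace calls to respect API rate limits
        self._last_call = time.monotonic()
        return self.inner.invoke(*args, **kwargs)
```

Combined with max_workers=1, this serializes judge calls so the evaluation run never bursts past the Anthropic rate limit.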


13. Data Ingestion Pipeline

Two independent pipelines populate the Elasticsearch index. Both require the Ollama tunnel active on localhost:11434 and Elasticsearch on localhost:9200.

flowchart TD
    subgraph PipelineA ["Pipeline A — Chonkie (recommended)"]
        DA[docs/**/*.md\ndocs/**/*.avap] --> CA[Chonkie\nMarkdownChef + TokenChunker]
        CA --> EA[OllamaEmbeddings\nbatch embed]
        EA --> ESA[Elasticsearch bulk index]
    end

    subgraph PipelineB ["Pipeline B — AVAP Native"]
        DB[docs/**/*.avap\ndocs/**/*.md] --> CB[avap_chunker.py\nGenericLexer + LanguageConfig\nblock detection + semantic tags\nMinHash LSH dedup]
        CB --> JSONL[chunks.jsonl]
        JSONL --> EB[avap_ingestor.py\nasync producer/consumer\nOllamaAsyncEmbedder\nbatch=8]
        EB --> ESB[Elasticsearch async_bulk\nbatch=50\nDeadLetterQueue on failure]
    end

    ESA --> IDX[(avap-knowledge-v2\nElasticsearch index)]
    ESB --> IDX

Pipeline B produces richer metadata — block type, section, semantic tags (uses_orm, uses_http, uses_auth, etc.), complexity score, and MinHash deduplication. Use it for .avap files that need full semantic analysis.
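The MinHash deduplication step can be illustrated with a toy stdlib-only version. The real pipeline uses MinHash LSH inside avap_chunker.py; this sketch only shows the principle of estimating Jaccard similarity from hashed shingles:

```python
import hashlib

# Toy MinHash sketch (illustrative, not the avap_chunker.py implementation).
def minhash_signature(text: str, num_perm: int = 64, shingle_len: int = 4) -> tuple:
    shingles = {text[i:i + shingle_len]
                for i in range(max(1, len(text) - shingle_len + 1))}
    sig = []
    for seed in range(num_perm):
        # Simulate one hash permutation per seed; keep the minimum hash
        sig.append(min(
            int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingles
        ))
    return tuple(sig)

def estimated_jaccard(a: str, b: str) -> float:
    sa, sb = minhash_signature(a), minhash_signature(b)
    return sum(x == y for x, y in zip(sa, sb)) / len(sa)

def is_near_duplicate(a: str, b: str, threshold: float = 0.8) -> bool:
    return estimated_jaccard(a, b) >= threshold
```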


14. Observability Stack

flowchart LR
    S[server.py] -->|LANGFUSE_HOST\nLANGFUSE_PUBLIC_KEY\nLANGFUSE_SECRET_KEY| LF[Langfuse\n45.77.119.180:80]
    LF --> TR[Traces\nper-request spans]
    LF --> ME[Metrics\nlatency · token counts]
    LF --> EV[Evaluation scores\nfaithfulness · relevancy]

Langfuse is accessed directly via public IP — no kubectl tunnel required. The engine sends traces automatically on every request when the Langfuse environment variables are set.


15. Known Limitations & Future Work

Active tactical debt

| Item | Description | ADR |
| --- | --- | --- |
| LLM classifier | Generative model doing discriminative work — non-deterministic, pays full inference cost for a 4-class label | ADR-0008 |
| RC-02 is soft | Platform data signal enforced via prompt <platform_priority_rule>, not code — can be overridden by model | ADR-0008 |
| classify_history not exported | Data flywheel accumulates but has no export mechanism yet | ADR-0008 |
| user_info unused | dev_id, project_id, org_id are in state but not consumed by any graph node | PRD-0002 |
| CONFIDENCE_PROMPT_TEMPLATE unused | Self-RAG capability is scaffolded in prompts.py but not wired into the graph | (none) |

Roadmap (ADR-0008 Future Path)

flowchart LR
    P0["Now\nLLM classifier\n~95% of traffic"] --> P1["Phase 1\nExport classify_history_store\nlabeled dataset"]
    P1 --> P2["Phase 2\nEmbedding classifier Layer 2\nbge-m3 + logistic regression\n~1ms CPU"]
    P2 --> P3["Phase 3\nCaller-declared query_type\nproto field 7"]
    P3 --> P4["Phase 4\nLLM classifier = anomaly handler\n<2% of traffic"]

Target steady-state: the LLM classifier handles fewer than 2% of requests — only genuinely ambiguous queries that neither hard rules nor the trained embedding classifier can resolve with confidence.
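The Phase 2 layer can be sketched with a stdlib-only stand-in. The roadmap names bge-m3 embeddings feeding a logistic regression; the nearest-centroid version below substitutes for the trained model but shows the shape of the fast path, including the confidence fallback to the LLM classifier. All names and the 0.85 floor are assumptions:

```python
import math

# Illustrative Phase 2 stand-in: classify query embeddings cheaply, and
# abstain (return None) below a confidence floor so the LLM classifier
# handles the ambiguous remainder, as in the Phase 4 target.
def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

class EmbeddingClassifier:
    def __init__(self):
        self.centroids: dict[str, list[float]] = {}

    def fit(self, vectors: list[list[float]], labels: list[str]) -> None:
        by_label: dict[str, list[list[float]]] = {}
        for v, y in zip(vectors, labels):
            by_label.setdefault(y, []).append(v)
        for y, vs in by_label.items():
            self.centroids[y] = [sum(col) / len(vs) for col in zip(*vs)]

    def predict(self, vector: list[float], confidence_floor: float = 0.85):
        best = max(self.centroids, key=lambda y: cosine(vector, self.centroids[y]))
        score = cosine(vector, self.centroids[best])
        return (best, score) if score >= confidence_floor else (None, score)
```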