# Brunix Assistance Engine — Architecture Reference

> **Audience:** Engineers contributing to this repository, architects reviewing the system design, and operators responsible for its deployment.
> **Last updated:** 2026-04-09
> **Version:** 1.8.x
> **Architect:** Rafael Ruiz (CTO, 101OBEX Corp)
> **Related ADRs:** ADR-0001 · ADR-0002 · ADR-0003 · ADR-0004 · ADR-0005 · ADR-0006 · ADR-0007 · ADR-0008
> **Related PRDs:** PRD-0001 · PRD-0002 · PRD-0003

---

## Table of Contents
1. [System Classification](#1-system-classification)
2. [Infrastructure Layout](#2-infrastructure-layout)
3. [Component Inventory](#3-component-inventory)
4. [Request Lifecycle](#4-request-lifecycle)
5. [Query Classification & Routing](#5-query-classification--routing)
6. [LangGraph Workflow](#6-langgraph-workflow)
7. [RAG Pipeline — Hybrid Search](#7-rag-pipeline--hybrid-search)
8. [Streaming Architecture](#8-streaming-architecture)
9. [Editor Context Pipeline](#9-editor-context-pipeline)
10. [Platform Query Pipeline](#10-platform-query-pipeline)
11. [Session State & Intent History](#11-session-state--intent-history)
12. [Evaluation Pipeline](#12-evaluation-pipeline)
13. [Data Ingestion Pipeline](#13-data-ingestion-pipeline)
14. [Observability Stack](#14-observability-stack)
15. [Known Limitations & Future Work](#15-known-limitations--future-work)

---

## 1. System Classification

The Brunix Assistance Engine is a **Modular Agentic RAG** system.

| Characteristic | Classification | Evidence |
|---|---|---|
| Pipeline structure | **Agentic RAG** | LangGraph classifier routes each query to a different pipeline — not a fixed retrieve-then-generate chain |
| Retrieval strategy | **Advanced RAG** | Hybrid BM25+kNN with RRF fusion, query reformulation before retrieval, two-phase streaming |
| Architecture style | **Modular RAG** | Each node (classify, reformulate, retrieve, generate) is independently replaceable; routing contract is implementation-independent |
| Self-evaluation | Not yet active | `CONFIDENCE_PROMPT_TEMPLATE` exists but is not wired into the graph — Self-RAG capability is scaffolded |

**What it is not:** naive RAG (no fixed pipeline), Graph RAG (no knowledge graph), Self-RAG (no confidence feedback loop, yet).

The classifier is the central architectural element. It determines which pipeline executes for each request — including whether retrieval happens at all.

---

## 2. Infrastructure Layout

```mermaid
graph TD
    subgraph Clients ["External Clients"]
        VSC["VS Code Extension"]
        AVS["AVS Platform"]
        DEV["grpcurl / tests"]
    end

    subgraph Docker ["Docker — brunix-assistance-engine"]
        PROXY["openai_proxy.py\n:8000 HTTP/SSE\nOpenAI + Ollama compatible"]
        SERVER["server.py\n:50051 gRPC internal\nAskAgent · AskAgentStream · EvaluateRAG"]
        GRAPH["graph.py — LangGraph\nclassify → route → execute"]
    end

    subgraph Mac ["Developer Machine (macOS)"]
        Docker
        subgraph Tunnels ["kubectl port-forward tunnels"]
            T1["localhost:11434 → Ollama"]
            T2["localhost:9200 → Elasticsearch"]
            T3["localhost:5432 → Postgres"]
        end
    end

    subgraph Vultr ["Vultr — Devaron K8s Cluster"]
        OL["ollama-light-service\nqwen3:1.7b · qwen3:0.6b · bge-m3"]
        ES["brunix-vector-db\nElasticsearch :9200\navap-knowledge-v2"]
        PG["brunix-postgres\nPostgres :5432\nLangfuse data"]
        LF["Langfuse UI\n45.77.119.180:80\nObservability + tracing"]
    end

    VSC -->|"HTTP/SSE :8000"| PROXY
    AVS -->|"HTTP/SSE :8000"| PROXY
    DEV -->|"gRPC :50052"| SERVER

    PROXY -->|"gRPC internal"| SERVER
    SERVER --> GRAPH

    GRAPH -->|"host.docker.internal:11434"| T1
    GRAPH -->|"host.docker.internal:9200"| T2
    SERVER -->|"host.docker.internal:5432"| T3

    T1 -->|"secure tunnel"| OL
    T2 -->|"secure tunnel"| ES
    T3 -->|"secure tunnel"| PG

    SERVER -->|"traces HTTP"| LF
    DEV -->|"browser direct"| LF
```

**Key networking detail:** Docker does not talk to the kubectl tunnels directly. The path is:

```
Docker container
  → host.docker.internal (resolves to macOS host via extra_hosts)
  → kubectl port-forward (active on macOS)
  → Vultr K8s service
```

Langfuse is the exception — it has a public IP (`45.77.119.180`) and is accessed directly, without a tunnel.

---

## 3. Component Inventory

| Component | File | Role |
|---|---|---|
| gRPC server | `server.py` | Entry point for all AI requests. Manages session store, model selection, and state initialization |
| HTTP proxy | `openai_proxy.py` | OpenAI + Ollama compatible HTTP layer. Translates REST → gRPC |
| LangGraph orchestrator | `graph.py` | Builds and executes the agentic routing graph |
| Prompt definitions | `prompts.py` | All prompt templates in one place: classifier, reformulator, generators, platform |
| Agent state | `state.py` | `AgentState` TypedDict shared across all graph nodes |
| LLM factory | `utils/llm_factory.py` | Provider-agnostic model instantiation (Ollama, OpenAI, Anthropic, Bedrock) |
| Embedding factory | `utils/emb_factory.py` | Provider-agnostic embedding model instantiation |
| Evaluation pipeline | `evaluate.py` | RAGAS evaluation with Claude as judge |
| Proto contract | `protos/brunix.proto` | Source of truth for the gRPC API |

**Model slots:**

| Slot | Env var | Used for | Current model |
|---|---|---|---|
| Main | `OLLAMA_MODEL_NAME` | `RETRIEVAL`, `CODE_GENERATION`, classification | `qwen3:1.7b` |
| Conversational | `OLLAMA_MODEL_NAME_CONVERSATIONAL` | `CONVERSATIONAL`, `PLATFORM` | `qwen3:0.6b` |
| Embeddings | `OLLAMA_EMB_MODEL_NAME` | Query embedding, document indexing | `bge-m3` |
| Evaluation judge | `ANTHROPIC_MODEL` | RAGAS scoring | `claude-sonnet-4-20250514` |
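
The slot resolution can be sketched in a few lines. This is an illustrative stand-in for `utils/llm_factory.py`, not its actual code: the env var names and default models come from the table above, while `_PROVIDERS`, `make_llm`, and the returned dicts are hypothetical (the real factory presumably instantiates LangChain chat models such as `ChatOllama`). It shows the design choice the table implies: slots are indirection over env vars, so swapping a model never touches call sites.

```python
import os

# Hypothetical provider registry; the dict values stand in for real
# chat-model instances (ChatOllama, ChatOpenAI, ChatAnthropic, ChatBedrock).
_PROVIDERS = {
    "ollama": lambda model: {"provider": "ollama", "model": model},
    "openai": lambda model: {"provider": "openai", "model": model},
    "anthropic": lambda model: {"provider": "anthropic", "model": model},
    "bedrock": lambda model: {"provider": "bedrock", "model": model},
}


def make_llm(slot: str = "main") -> dict:
    """Resolve a model slot to a concrete model via env vars, then build it."""
    slots = {
        "main": ("OLLAMA_MODEL_NAME", "qwen3:1.7b"),
        "conversational": ("OLLAMA_MODEL_NAME_CONVERSATIONAL", "qwen3:0.6b"),
    }
    env_var, default = slots[slot]
    model = os.environ.get(env_var, default)
    provider = os.environ.get("LLM_PROVIDER", "ollama")
    return _PROVIDERS[provider](model)
```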

---

## 4. Request Lifecycle

Two entry paths, both ending at the same LangGraph:

```mermaid
sequenceDiagram
    participant C as Client
    participant P as openai_proxy.py :8000
    participant S as server.py :50051
    participant G as graph.py
    participant O as Ollama (via tunnel)
    participant E as Elasticsearch (via tunnel)

    C->>P: POST /v1/chat/completions
    P->>P: parse user field (editor_content, selected_text, extra_context, user_info)
    P->>S: gRPC AskAgent / AskAgentStream

    S->>S: base64 decode context fields
    S->>S: load session_store + classify_history_store
    S->>G: invoke graph with AgentState

    G->>O: classify (LLM call)
    O-->>G: query_type + use_editor_context

    alt RETRIEVAL or CODE_GENERATION
        G->>O: reformulate query
        O-->>G: reformulated_query
        G->>O: embed reformulated_query
        O-->>G: query_vector
        G->>E: BM25 search + kNN search
        E-->>G: ranked chunks (RRF fusion)
        G->>O: generate with context
        O-->>G: response
    else CONVERSATIONAL
        G->>O: respond_conversational (no retrieval)
        O-->>G: response
    else PLATFORM
        G->>O: respond_platform (no retrieval, uses extra_context)
        O-->>G: response
    end

    G-->>S: final_state
    S->>S: update session_store + classify_history_store
    S-->>P: AgentResponse stream
    P-->>C: SSE / JSON response
```
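
The base64 decode step early in the lifecycle amounts to something like the following sketch. `decode_context_fields` and `_maybe_b64` are illustrative names (not necessarily the ones in `server.py`); the field names are the ones carried in the request above. Empty fields pass through as empty strings, so absent context never breaks decoding.

```python
import base64


def decode_context_fields(request: dict) -> dict:
    """Decode the optional base64-encoded context fields of a request.

    Context arrives base64-encoded so arbitrary editor content (newlines,
    quotes, non-ASCII) survives transport unmodified.
    """
    def _maybe_b64(value: str) -> str:
        if not value:
            return ""
        return base64.b64decode(value).decode("utf-8")

    return {
        field: _maybe_b64(request.get(field, ""))
        for field in ("editor_content", "selected_text", "extra_context")
    }
```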

---

## 5. Query Classification & Routing

The classifier is the most critical node. It determines the entire execution path for a request.

### Taxonomy

| Type | Intent | RAG | Model slot | Prompt |
|---|---|---|---|---|
| `RETRIEVAL` | Understand AVAP language concepts | Yes | main | `GENERATE_PROMPT` |
| `CODE_GENERATION` | Produce working AVAP code | Yes | main | `CODE_GENERATION_PROMPT` |
| `CONVERSATIONAL` | Rephrase or continue prior answer | No | conversational | `CONVERSATIONAL_PROMPT` |
| `PLATFORM` | Account, metrics, usage, billing | No | conversational | `PLATFORM_PROMPT` |

### Classification pipeline

```mermaid
flowchart TD
    Q[Incoming query] --> FP{Fast-path check\n_is_platform_query}
    FP -->|known platform prefix| PLATFORM[PLATFORM\nno LLM call]
    FP -->|no match| LLM[LLM Classifier\nqwen3:1.7b]

    LLM --> IH[Intent history\nlast 6 entries as context]
    IH --> OUT[Output: TYPE + EDITOR/NO_EDITOR]

    OUT --> R[RETRIEVAL]
    OUT --> C[CODE_GENERATION]
    OUT --> V[CONVERSATIONAL]
    OUT --> PL[PLATFORM]
```

### Intent history — solving anchoring bias

The classifier does not receive raw conversation messages. It receives a compact trace of prior classifications:

```
[RETRIEVAL] "What is addVar in AVAP?"
[CODE_GENERATION] "Write an API endpoint that retur"
[PLATFORM] "You have a project usage percentag"
```

**Why:** A 1.7B model receiving full message history computes `P(type | history)` instead of `P(type | message_content)` — it biases toward the dominant type of the session. The intent history gives the classifier enough context to resolve ambiguous references (`"this"`, `"esto"`) without the topical noise that causes anchoring.

**Rule enforced in prompt:** `<history_rule>` — the distribution of previous intents must not influence the prior probability of the current classification.
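
Rendering such a trace is a small amount of code. This sketch assumes entries carry `type` and `topic` keys, matching the `ClassifyEntry` fields described in Section 11; `format_intent_history` is an illustrative name. The 6-entry cap mirrors RC-06, and the 60-char cut mirrors the topic snippet length.

```python
def format_intent_history(entries: list[dict], max_entries: int = 6,
                          snippet_len: int = 60) -> str:
    """Render the compact intent trace shown to the classifier.

    Keeps only the last max_entries classifications and truncates each
    query to a short topic snippet, so the classifier sees intent
    structure without the full-message topical noise that causes anchoring.
    """
    recent = entries[-max_entries:]
    return "\n".join(
        f'[{e["type"]}] "{e["topic"][:snippet_len]}"' for e in recent
    )
```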

### Routing contract (from ADR-0008)

| Rule | Description | Priority |
|---|---|---|
| RC-01 | Known platform prefix → `PLATFORM` without LLM | Highest |
| RC-02 | Usage metrics / quota data in message → `PLATFORM` | High |
| RC-03 | History resolves references only, never predicts type | Medium |
| RC-04 | `PLATFORM` and `CONVERSATIONAL` never touch Elasticsearch | Medium |
| RC-05 | `RETRIEVAL`/`CODE_GENERATION` → main model; `CONVERSATIONAL`/`PLATFORM` → conversational model | Medium |
| RC-06 | Intent history capped at 6 entries | Low |

---

## 6. LangGraph Workflow

Two graphs are built at startup and reused across all requests:

### `build_graph` — used by `AskAgent` (non-streaming)

```mermaid
flowchart TD
    START([start]) --> CL[classify]
    CL -->|RETRIEVAL| RF[reformulate]
    CL -->|CODE_GENERATION| RF
    CL -->|CONVERSATIONAL| RC[respond_conversational]
    CL -->|PLATFORM| RP[respond_platform]
    RF --> RT[retrieve]
    RT -->|RETRIEVAL| GE[generate]
    RT -->|CODE_GENERATION| GC[generate_code]
    GE --> END([end])
    GC --> END
    RC --> END
    RP --> END
```
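
The conditional edges in this diagram reduce to two routing functions. The node names below come from the diagram; the function names are illustrative, and the real `graph.py` presumably attaches them with LangGraph's `add_conditional_edges`. Keeping routing in plain functions like these is what makes the contract testable independently of the graph runtime.

```python
def route_after_classify(state: dict) -> str:
    """Map the classifier's query_type to the next graph node."""
    query_type = state["query_type"]
    if query_type in ("RETRIEVAL", "CODE_GENERATION"):
        return "reformulate"  # both RAG paths share query reformulation
    if query_type == "CONVERSATIONAL":
        return "respond_conversational"
    return "respond_platform"  # PLATFORM


def route_after_retrieve(state: dict) -> str:
    """Split the shared RAG path into the two generation nodes."""
    return "generate" if state["query_type"] == "RETRIEVAL" else "generate_code"
```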

### `build_prepare_graph` — used by `AskAgentStream` (streaming)

This graph only runs the preparation phase. The generation is handled outside the graph by `llm.stream()` to enable true token-by-token streaming.

```mermaid
flowchart TD
    START([start]) --> CL[classify]
    CL -->|RETRIEVAL| RF[reformulate]
    CL -->|CODE_GENERATION| RF
    CL -->|CONVERSATIONAL| SK[skip_retrieve]
    CL -->|PLATFORM| SK
    RF --> RT[retrieve]
    RT --> END([end])
    SK --> END
```

After `prepare_graph` returns, `server.py` calls `build_final_messages(prepared)` to construct the prompt and then streams directly from the selected LLM.

---

## 7. RAG Pipeline — Hybrid Search

Only `RETRIEVAL` and `CODE_GENERATION` queries reach this pipeline.

```mermaid
flowchart TD
    Q[reformulated_query] --> EMB[embed_query\nbge-m3]
    Q --> BM25[BM25 search\nElasticsearch multi_match\nfields: content^2 text^2\nfuzziness: AUTO]
    EMB --> KNN[kNN search\nElasticsearch\nfield: embedding\nk=8 num_candidates=40]

    BM25 --> RRF[RRF Fusion\n1 / rank+60 per hit\ncombined score]
    KNN --> RRF

    RRF --> RANK[Ranked docs\ntop-8]
    RANK --> FMT[format_context\nchunk headers: id type block section source\nAVAP CODE flag for code blocks]
    FMT --> CTX[context string\ninjected into generation prompt]
```

**Why hybrid search:** BM25 is strong for exact AVAP command names (`registerEndpoint`, `addVar`) which are rare in embedding space. kNN captures semantic similarity. RRF fusion combines both without requiring score normalization.
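
The fusion step itself is tiny. This is a pure-Python sketch using the `1 / (rank + 60)` formula from the diagram; `rrf_fuse` is an illustrative name, and the inputs are the two already-ranked lists of document ids returned by the BM25 and kNN queries.

```python
def rrf_fuse(bm25_hits: list[str], knn_hits: list[str], k: int = 60,
             top_n: int = 8) -> list[str]:
    """Reciprocal Rank Fusion: each doc scores 1/(rank + k) per result
    list it appears in; scores are summed and the top_n docs returned.

    Only ranks are used, never raw scores, which is why BM25 and cosine
    similarity can be combined without normalization.
    """
    scores: dict[str, float] = {}
    for hits in (bm25_hits, knn_hits):
        for rank, doc_id in enumerate(hits, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank + k)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

A doc that appears in both lists accumulates two contributions, so `rrf_fuse(["a", "b"], ["b", "c"])` ranks `"b"` first even though it tops neither list alone.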

**Elasticsearch index schema:**

| Field | Type | Description |
|---|---|---|
| `content` / `text` | `text` | Chunk text (BM25 searchable) |
| `embedding` | `dense_vector` | bge-m3 embedding (kNN searchable) |
| `doc_type` | `keyword` | `code`, `spec`, `code_example`, `bnf` |
| `block_type` | `keyword` | `function`, `if`, `startLoop`, `try`, etc. |
| `section` | `keyword` | Document section heading |
| `source_file` | `keyword` | Origin file |
| `chunk_id` | `keyword` | Unique chunk identifier |

---

## 8. Streaming Architecture

`AskAgentStream` implements true token-by-token streaming. It does not buffer the full response.

```mermaid
sequenceDiagram
    participant C as Client
    participant S as server.py
    participant PG as prepare_graph
    participant LLM as Ollama LLM

    C->>S: AskAgentStream(request)
    S->>PG: invoke prepare_graph(initial_state)
    Note over PG: classify → reformulate → retrieve<br/>(or skip_retrieve for CONV/PLATFORM)
    PG-->>S: prepared state (query_type, context, messages)

    S->>S: build_final_messages(prepared)
    S->>S: select active_llm based on query_type

    loop token streaming
        S->>LLM: active_llm.stream(final_messages)
        LLM-->>S: chunk.content
        S-->>C: AgentResponse(text=token, is_final=false)
    end

    S-->>C: AgentResponse(text="", is_final=true)
    S->>S: update session_store + classify_history_store
```

**Model selection in stream path:**

```python
active_llm = self.llm_conversational if query_type in ("CONVERSATIONAL", "PLATFORM") else self.llm
```
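
Stripped of gRPC details, the forwarding loop is a generator that yields each chunk the moment the model emits it. The dicts below stand in for `AgentResponse` messages and `stream_response` is an illustrative name; the shape (per-token chunks, then an empty final marker) follows the sequence diagram above.

```python
from typing import Iterable, Iterator


def stream_response(token_stream: Iterable[str]) -> Iterator[dict]:
    """Forward tokens one at a time, then emit the final marker.

    Nothing is buffered: each token reaches the client as soon as the
    model produces it, which is what makes the streaming "true"
    token-by-token rather than chunked-after-completion.
    """
    for token in token_stream:
        yield {"text": token, "is_final": False}
    yield {"text": "", "is_final": True}
```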

---

## 9. Editor Context Pipeline

Implemented in PRD-0002. The VS Code extension can send three optional context fields with each query.

```mermaid
flowchart TD
    REQ[AgentRequest] --> B64[base64 decode\neditor_content\nselected_text\nextra_context]
    B64 --> STATE[AgentState]

    STATE --> CL[classify node\n_build_classify_prompt]
    CL --> ED{use_editor_context?}

    ED -->|EDITOR| INJ[inject into prompt\n1 selected_text — highest priority\n2 editor_content — file context\n3 RAG chunks\n4 extra_context]
    ED -->|NO_EDITOR| NONINJ[standard prompt\nno editor content injected]

    INJ --> GEN[generation node]
    NONINJ --> GEN
```

**Classifier output format:** Two tokens — `TYPE EDITOR_SIGNAL`

Examples: `RETRIEVAL NO_EDITOR`, `CODE_GENERATION EDITOR`, `PLATFORM NO_EDITOR`

`EDITOR` is set only when the user explicitly refers to the code in their editor: *"this code"*, *"fix this"*, *"que hace esto"* (Spanish: "what does this do"), *"explain this selection"*.
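
The injection order can be sketched as a context assembler. The function name and the section headers inside it are illustrative (the real prompt layout lives in `prompts.py`), but the priority order and the NO_EDITOR behavior follow the diagram above.

```python
def assemble_context(selected_text: str, editor_content: str,
                     rag_chunks: list[str], extra_context: str,
                     use_editor: bool) -> str:
    """Assemble generation context in priority order: selection first,
    then the open file, then retrieved chunks, then caller-supplied
    extra context. Editor fields are skipped entirely on NO_EDITOR.
    """
    parts: list[str] = []
    if use_editor and selected_text:
        parts.append("## Selected code\n" + selected_text)
    if use_editor and editor_content:
        parts.append("## Current file\n" + editor_content)
    if rag_chunks:
        parts.append("## Documentation\n" + "\n---\n".join(rag_chunks))
    if extra_context:
        parts.append("## Additional context\n" + extra_context)
    return "\n\n".join(parts)
```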

---

## 10. Platform Query Pipeline

Implemented in PRD-0003. Queries about account, metrics, usage, or billing bypass RAG entirely.

```mermaid
flowchart TD
    Q[Incoming query] --> FP{_is_platform_query?}
    FP -->|yes — known prefix| SKIP[skip classifier LLM\nroute = PLATFORM]
    FP -->|no| CL[LLM classifier]
    CL -->|PLATFORM| ROUTE[route to respond_platform]
    SKIP --> ROUTE

    ROUTE --> PROMPT["PLATFORM_PROMPT\n+ extra_context injection\n+ user_info available"]
    PROMPT --> LLM[qwen3:0.6b\nconversational model slot]
    LLM --> RESP[response]

    style SKIP fill:#2d6a4f,color:#fff
    style ROUTE fill:#2d6a4f,color:#fff
```

**No Elasticsearch call is made for PLATFORM queries.** The data is already in the request via `extra_context` and `user_info` injected by the caller (AVS Platform).
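
The RC-01 fast path is a plain prefix match that costs no LLM call. The prefix list below is invented for illustration (the real list lives with `_is_platform_query` in the codebase; only the first entry echoes the intent-history example in Section 5), but the shape is the point: normalize, then `startswith` against known platform phrasings.

```python
# Illustrative prefixes only; the real list ships with _is_platform_query.
_PLATFORM_PREFIXES = (
    "you have a project usage",
    "your current plan",
    "your api calls this month",
)


def is_platform_query(query: str) -> bool:
    """RC-01 fast path: a known platform prefix routes straight to
    PLATFORM without spending an LLM call on classification."""
    normalized = query.strip().lower()
    return normalized.startswith(_PLATFORM_PREFIXES)
```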

---

## 11. Session State & Intent History

Two stores are maintained per session in memory:

```mermaid
flowchart LR
    subgraph Stores ["In-memory stores (server.py)"]
        SS["session_store\ndict[session_id → list[BaseMessage]]\nfull conversation history\nused by generation nodes"]
        CHS["classify_history_store\ndict[session_id → list[ClassifyEntry]]\ncompact intent trace\nused by classifier"]
    end

    subgraph Entry ["ClassifyEntry (state.py)"]
        CE["type: str\nRETRIEVAL | CODE_GENERATION\nCONVERSATIONAL | PLATFORM\n─────────────────\ntopic: str\n60-char query snippet"]
    end

    CHS --> CE
```

**`classify_history_store` is also a data flywheel.** Every session generates labeled `(topic, type)` pairs automatically. When sufficient sessions accumulate (~500), this store can be exported to train the Layer 2 embedding classifier described in ADR-0008 Future Path — eliminating the need for the LLM classifier on the majority of requests.
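
The missing export step (see Section 15, tactical debt) could be as small as this sketch. `export_flywheel` is an illustrative name and the JSONL schema (`text`/`label` per line) is one reasonable choice for a classifier training set, not a format that exists in the repo.

```python
import json


def export_flywheel(classify_history_store: dict) -> str:
    """Serialize the accumulated (topic, type) pairs to JSONL: one
    labeled training example per line, ready for the Layer 2
    embedding-classifier training described in ADR-0008."""
    lines = []
    for session_id, entries in classify_history_store.items():
        for entry in entries:
            lines.append(json.dumps({
                "session_id": session_id,
                "text": entry["topic"],
                "label": entry["type"],
            }))
    return "\n".join(lines)
```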

### AgentState fields

```python
from typing import Annotated, TypedDict

from langgraph.graph.message import add_messages


class AgentState(TypedDict):
    # Core
    messages: Annotated[list, add_messages]  # full conversation
    session_id: str
    query_type: str  # RETRIEVAL | CODE_GENERATION | CONVERSATIONAL | PLATFORM
    reformulated_query: str
    context: str  # RAG retrieved context

    # Classifier intent history
    classify_history: list[ClassifyEntry]  # compact trace, persisted across turns

    # Editor context (PRD-0002)
    editor_content: str  # base64 decoded
    selected_text: str  # base64 decoded
    extra_context: str  # base64 decoded
    user_info: str  # JSON: {dev_id, project_id, org_id}
    use_editor_context: bool  # set by classifier
```

(`ClassifyEntry` is defined alongside `AgentState` in `state.py`.)

---

## 12. Evaluation Pipeline

`EvaluateRAG` runs an automated quality benchmark using RAGAS with Claude as the judge. It does not share infrastructure with the main request pipeline.

```mermaid
flowchart TD
    GD[golden_dataset.json\n50 Q&A pairs with ground_truth] --> FILTER[category + limit filter]
    FILTER --> LOOP[for each question]

    LOOP --> RET[retrieve_context\nhybrid BM25+kNN\nsame as main pipeline]
    RET --> GEN[generate_answer\nusing main LLM]
    GEN --> ROW[dataset row\nquestion · answer · contexts · ground_truth]

    ROW --> DS[HuggingFace Dataset]
    DS --> RAGAS[ragas.evaluate\nfaithfulness\nanswer_relevancy\ncontext_recall\ncontext_precision]

    RAGAS --> JUDGE[Claude judge\nclaude-sonnet-4-20250514\nRateLimitedChatAnthropic\n3s delay between calls]
    JUDGE --> SCORES[per-metric scores]
    SCORES --> GLOBAL[global_score = mean of valid metrics]
    GLOBAL --> VERDICT{verdict}
    VERDICT -->|≥ 0.80| EX[EXCELLENT]
    VERDICT -->|≥ 0.60| AC[ACCEPTABLE]
    VERDICT -->|< 0.60| IN[INSUFFICIENT]
```

**Important:** `EvaluateRAG` uses `RateLimitedChatAnthropic` — a subclass that injects a 3-second delay between calls to respect Anthropic API rate limits. RAGAS is configured with `max_workers=1`.
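
The scoring tail of the pipeline in code form; `verdict` is an illustrative name, but the thresholds and the mean-of-valid-metrics rule come straight from the diagram above. RAGAS can emit `NaN` for a metric when the judge fails, which is why invalid scores are filtered before averaging.

```python
import math


def verdict(metric_scores: dict) -> tuple[float, str]:
    """Average the valid (non-NaN, non-None) RAGAS metrics and map the
    global score onto the EXCELLENT / ACCEPTABLE / INSUFFICIENT bands."""
    valid = [s for s in metric_scores.values()
             if s is not None and not math.isnan(s)]
    global_score = sum(valid) / len(valid)
    if global_score >= 0.80:
        return global_score, "EXCELLENT"
    if global_score >= 0.60:
        return global_score, "ACCEPTABLE"
    return global_score, "INSUFFICIENT"
```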

---

## 13. Data Ingestion Pipeline

Two independent pipelines populate the Elasticsearch index. Both require the Ollama tunnel active on `localhost:11434` and Elasticsearch on `localhost:9200`.

```mermaid
flowchart TD
    subgraph PipelineA ["Pipeline A — Chonkie (recommended)"]
        DA[docs/**/*.md\ndocs/**/*.avap] --> CA[Chonkie\nMarkdownChef + TokenChunker]
        CA --> EA[OllamaEmbeddings\nbatch embed]
        EA --> ESA[Elasticsearch bulk index]
    end

    subgraph PipelineB ["Pipeline B — AVAP Native"]
        DB[docs/**/*.avap\ndocs/**/*.md] --> CB[avap_chunker.py\nGenericLexer + LanguageConfig\nblock detection + semantic tags\nMinHash LSH dedup]
        CB --> JSONL[chunks.jsonl]
        JSONL --> EB[avap_ingestor.py\nasync producer/consumer\nOllamaAsyncEmbedder\nbatch=8]
        EB --> ESB[Elasticsearch async_bulk\nbatch=50\nDeadLetterQueue on failure]
    end

    ESA --> IDX[(avap-knowledge-v2\nElasticsearch index)]
    ESB --> IDX
```

**Pipeline B produces richer metadata** — block type, section, semantic tags (`uses_orm`, `uses_http`, `uses_auth`, etc.), complexity score, and MinHash deduplication. Use it for `.avap` files that need full semantic analysis.
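
Both batch sizes in Pipeline B (embed at 8, bulk-index at 50) rely on the same grouping primitive, sketched here; `batched` is an illustrative helper (Python 3.12's `itertools.batched` provides similar behavior). Because it consumes a stream lazily, it works unchanged inside an async producer/consumer loop.

```python
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[list[T]]:
    """Group a chunk stream into fixed-size batches, e.g. size=8 for
    embedding calls and size=50 for Elasticsearch bulk indexing."""
    batch: list[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch
```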

---

## 14. Observability Stack

```mermaid
flowchart LR
    S[server.py] -->|LANGFUSE_HOST\nLANGFUSE_PUBLIC_KEY\nLANGFUSE_SECRET_KEY| LF[Langfuse\n45.77.119.180:80]
    LF --> TR[Traces\nper-request spans]
    LF --> ME[Metrics\nlatency · token counts]
    LF --> EV[Evaluation scores\nfaithfulness · relevancy]
```

Langfuse is accessed directly via public IP — no kubectl tunnel required. The engine sends traces automatically on every request when the Langfuse environment variables are set.

---

## 15. Known Limitations & Future Work

### Active tactical debt

| Item | Description | ADR |
|---|---|---|
| LLM classifier | Generative model doing discriminative work — non-deterministic, pays full inference cost for a 4-class label | ADR-0008 |
| RC-02 is soft | Platform data signal enforced via prompt `<platform_priority_rule>`, not code — can be overridden by model | ADR-0008 |
| `classify_history` not exported | Data flywheel accumulates but has no export mechanism yet | ADR-0008 |
| `user_info` unused | `dev_id`, `project_id`, `org_id` are in state but not consumed by any graph node | PRD-0002 |
| `CONFIDENCE_PROMPT_TEMPLATE` unused | Self-RAG capability is scaffolded in `prompts.py` but not wired into the graph | — |

### Roadmap (ADR-0008 Future Path)

```mermaid
flowchart LR
    P0["Now\nLLM classifier\n~95% of traffic"] --> P1["Phase 1\nExport classify_history_store\nlabeled dataset"]
    P1 --> P2["Phase 2\nEmbedding classifier Layer 2\nbge-m3 + logistic regression\n~1ms CPU"]
    P2 --> P3["Phase 3\nCaller-declared query_type\nproto field 7"]
    P3 --> P4["Phase 4\nLLM classifier = anomaly handler\n<2% of traffic"]
```

Target steady-state: the LLM classifier handles fewer than 2% of requests — only genuinely ambiguous queries that neither hard rules nor the trained embedding classifier can resolve with confidence.
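
To give a feel for Phase 2, here is a deliberately simplified stand-in: a nearest-centroid check over query embeddings with a confidence threshold, in pure Python. The planned Layer 2 is bge-m3 embeddings plus a trained logistic regression, so the centroid approach, the function names, and the 0.75 threshold here are all illustrative. The fallback-to-LLM behavior, however, is exactly the Phase 4 contract: below threshold, return nothing and let the LLM classifier handle the anomaly.

```python
import math


def _cos(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def classify_embedding(query_vec: list[float],
                       centroids: dict[str, list[float]],
                       threshold: float = 0.75):
    """Layer 2 sketch: compare the query embedding against one centroid
    per class. Returns the best label, or None when confidence falls
    below threshold, signalling fallback to the LLM classifier."""
    best_label, best_sim = None, -1.0
    for label, centroid in centroids.items():
        sim = _cos(query_vec, centroid)
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_label if best_sim >= threshold else None
```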
|