From 70e3f6216544ca9d8801e9354d4e29f03b75d914 Mon Sep 17 00:00:00 2001 From: rafa-ruiz Date: Thu, 9 Apr 2026 20:52:20 -0700 Subject: [PATCH] architecture.md --- docs/ARCHITECTURE.md | 996 +++++++++++++++++++++---------------------- 1 file changed, 480 insertions(+), 516 deletions(-) diff --git a/docs/ARCHITECTURE.md b/docs/ARCHITECTURE.md index 828b996..540d327 100644 --- a/docs/ARCHITECTURE.md +++ b/docs/ARCHITECTURE.md @@ -1,577 +1,541 @@ # Brunix Assistance Engine — Architecture Reference > **Audience:** Engineers contributing to this repository, architects reviewing the system design, and operators responsible for its deployment. -> **Last updated:** 2026-03-20 -> **Version:** 1.6.x +> **Last updated:** 2026-04-09 +> **Version:** 1.8.x --- ## Table of Contents -1. [System Overview](#1-system-overview) -2. [Component Inventory](#2-component-inventory) -3. [Request Lifecycle](#3-request-lifecycle) -4. [LangGraph Workflow](#4-langgraph-workflow) -5. [RAG Pipeline — Hybrid Search](#5-rag-pipeline--hybrid-search) -6. [Editor Context Pipeline](#6-editor-context-pipeline) -7. [Streaming Architecture (AskAgentStream)](#7-streaming-architecture-askagentstream) -8. [Evaluation Pipeline (EvaluateRAG)](#8-evaluation-pipeline-evaluaterag) -9. [Data Ingestion Pipeline](#9-data-ingestion-pipeline) -10. [Infrastructure Layout](#10-infrastructure-layout) -11. [Session State & Conversation Memory](#11-session-state--conversation-memory) -12. [Observability Stack](#12-observability-stack) -13. [Security Boundaries](#13-security-boundaries) -14. [Known Limitations & Future Work](#14-known-limitations--future-work) +1. [System Classification](#1-system-classification) +2. [Infrastructure Layout](#2-infrastructure-layout) +3. [Component Inventory](#3-component-inventory) +4. [Request Lifecycle](#4-request-lifecycle) +5. [Query Classification & Routing](#5-query-classification--routing) +6. [LangGraph Workflow](#6-langgraph-workflow) +7. 
[RAG Pipeline — Hybrid Search](#7-rag-pipeline--hybrid-search) +8. [Streaming Architecture](#8-streaming-architecture) +9. [Editor Context Pipeline](#9-editor-context-pipeline) +10. [Platform Query Pipeline](#10-platform-query-pipeline) +11. [Session State & Intent History](#11-session-state--intent-history) +12. [Evaluation Pipeline](#12-evaluation-pipeline) +13. [Data Ingestion Pipeline](#13-data-ingestion-pipeline) +14. [Observability Stack](#14-observability-stack) +15. [Known Limitations & Future Work](#15-known-limitations--future-work) --- -## 1. System Overview +## 1. System Classification -The **Brunix Assistance Engine** is a stateful, streaming-capable AI service that answers questions about the AVAP programming language. It combines: +The Brunix Assistance Engine is a **Modular Agentic RAG** system. -- **gRPC** as the primary communication interface (port `50051` inside container, `50052` on host) -- **LangGraph** for deterministic, multi-step agentic orchestration -- **Hybrid RAG** (BM25 + kNN with RRF fusion) over an Elasticsearch vector index -- **Ollama** as the local LLM and embedding backend -- **RAGAS + Claude** as the automated evaluation judge -- **Editor context injection** — the VS Code extension can send active file content and selected code alongside each query; the engine decides whether to use it based on the user's intent - -A secondary **OpenAI-compatible HTTP proxy** (port `8000`) is served via FastAPI/Uvicorn, enabling integration with tools that expect the OpenAI API format. 
- -``` -┌─────────────────────────────────────────────────────────────┐ -│ External Clients │ -│ grpcurl / App SDK │ OpenAI-compatible client │ -│ VS Code extension │ (continue.dev, LiteLLM) │ -└────────────┬────────────────┴──────────────┬────────────────┘ - │ gRPC :50052 │ HTTP :8000 - ▼ ▼ -┌────────────────────────────────────────────────────────────┐ -│ Docker Container │ -│ │ -│ ┌─────────────────────┐ ┌──────────────────────────┐ │ -│ │ server.py (gRPC) │ │ openai_proxy.py (HTTP) │ │ -│ │ BrunixEngine │ │ FastAPI / Uvicorn │ │ -│ └──────────┬──────────┘ └──────────────────────────┘ │ -│ │ │ -│ ┌──────────▼──────────────────────────────────────────┐ │ -│ │ LangGraph Orchestration │ │ -│ │ classify → reformulate → retrieve → generate │ │ -│ └──────────────────────────┬───────────────────────────┘ │ -│ │ │ -│ ┌───────────────────┼────────────────────┐ │ -│ ▼ ▼ ▼ │ -│ Ollama (LLM) Ollama (Embed) Elasticsearch │ -│ via tunnel via tunnel via tunnel │ -└────────────────────────────────────────────────────────────┘ - │ kubectl port-forward tunnels │ - ▼ ▼ - Devaron Cluster (Vultr Kubernetes) - ollama-light-service:11434 brunix-vector-db:9200 - brunix-postgres:5432 Langfuse UI -``` - ---- - -## 2. Component Inventory - -| Component | File / Service | Responsibility | +| Characteristic | Classification | Evidence | |---|---|---| -| **gRPC Server** | `Docker/src/server.py` | Entry point. Implements the `AssistanceEngine` servicer. Initializes LLM, embeddings, ES client, and both graphs. Decodes Base64 editor context fields from incoming requests. | -| **Full Graph** | `Docker/src/graph.py` → `build_graph()` | Complete workflow: classify → reformulate → retrieve → generate. Used by `AskAgent` and `EvaluateRAG`. | -| **Prepare Graph** | `Docker/src/graph.py` → `build_prepare_graph()` | Partial workflow: classify → reformulate → retrieve. Does **not** call the LLM for generation. Used by `AskAgentStream` to enable manual token streaming. 
| -| **Message Builder** | `Docker/src/graph.py` → `build_final_messages()` | Reconstructs the final prompt list from prepared state for `llm.stream()`. Injects editor context when `use_editor_context` is `True`. | -| **Prompt Library** | `Docker/src/prompts.py` | Centralized definitions for `CLASSIFY`, `REFORMULATE`, `GENERATE`, `CODE_GENERATION`, and `CONVERSATIONAL` prompts. | -| **Agent State** | `Docker/src/state.py` | `AgentState` TypedDict shared across all graph nodes. Includes editor context fields and `use_editor_context` flag. | -| **Evaluation Suite** | `Docker/src/evaluate.py` | RAGAS-based pipeline. Uses the production retriever + Ollama LLM for generation, and Claude as the impartial judge. | -| **OpenAI Proxy** | `Docker/src/openai_proxy.py` | FastAPI application that wraps `AskAgent` / `AskAgentStream` under OpenAI and Ollama compatible endpoints. Parses editor context from the `user` field. | -| **LLM Factory** | `Docker/src/utils/llm_factory.py` | Provider-agnostic factory for chat models (Ollama, AWS Bedrock). | -| **Embedding Factory** | `Docker/src/utils/emb_factory.py` | Provider-agnostic factory for embedding models (Ollama, HuggingFace). | -| **Ingestion Pipeline** | `scripts/pipelines/flows/elasticsearch_ingestion.py` | Chunks and ingests AVAP documents into Elasticsearch with embeddings. | -| **AVAP Chunker** | `scripts/pipelines/ingestion/avap_chunker.py` | Semantic chunker for `.avap` source files using `avap_config.json` as grammar. | -| **Unit Tests** | `Docker/tests/test_prd_0002.py` | 40 unit tests covering editor context parsing, Base64 decoding, classifier output, reformulate anchor, and injection logic. 
| +| Pipeline structure | **Agentic RAG** | LangGraph classifier routes each query to a different pipeline — not a fixed retrieve-then-generate chain | +| Retrieval strategy | **Advanced RAG** | Hybrid BM25+kNN with RRF fusion, query reformulation before retrieval, two-phase streaming | +| Architecture style | **Modular RAG** | Each node (classify, reformulate, retrieve, generate) is independently replaceable; routing contract is implementation-independent | +| Self-evaluation | Not yet active | `CONFIDENCE_PROMPT_TEMPLATE` exists but is not wired into the graph — Self-RAG capability is scaffolded | + +**What it is not:** naive RAG (no fixed pipeline), Graph RAG (no knowledge graph), Self-RAG (no confidence feedback loop, yet). + +The classifier is the central architectural element. It determines which pipeline executes for each request — including whether retrieval happens at all. --- -## 3. Request Lifecycle +## 2. Infrastructure Layout -### 3.1 `AskAgent` (non-streaming) +```mermaid +graph TD + subgraph Clients ["External Clients"] + VSC["VS Code Extension"] + AVS["AVS Platform"] + DEV["grpcurl / tests"] + end -``` -Client → gRPC AgentRequest{query, session_id, editor_content*, selected_text*, extra_context*, user_info*} - │ (* Base64-encoded; user_info is JSON string) - │ - ├─ Decode Base64 fields (editor_content, selected_text, extra_context) - ├─ Load conversation history from session_store[session_id] - ├─ Build initial_state = {messages, session_id, editor_content, selected_text, extra_context, user_info} - │ - └─ graph.invoke(initial_state) - ├─ classify → query_type ∈ {RETRIEVAL, CODE_GENERATION, CONVERSATIONAL} - │ use_editor_context ∈ {True, False} - ├─ reformulate → reformulated_query - │ (anchored to selected_text if use_editor_context=True) - ├─ retrieve → context (top-8 hybrid RRF chunks) - └─ generate → final AIMessage - (editor context injected only if use_editor_context=True) - │ - ├─ Persist updated history to session_store[session_id] - └─ 
yield AgentResponse{text, avap_code="AVAP-2026", is_final=True} + subgraph Docker ["Docker — brunix-assistance-engine"] + PROXY["openai_proxy.py\n:8000 HTTP/SSE\nOpenAI + Ollama compatible"] + SERVER["server.py\n:50051 gRPC internal\nAskAgent · AskAgentStream · EvaluateRAG"] + GRAPH["graph.py — LangGraph\nclassify → route → execute"] + end + + subgraph Mac ["Developer Machine (macOS)"] + Docker + subgraph Tunnels ["kubectl port-forward tunnels"] + T1["localhost:11434 → Ollama"] + T2["localhost:9200 → Elasticsearch"] + T3["localhost:5432 → Postgres"] + end + end + + subgraph Vultr ["Vultr — Devaron K8s Cluster"] + OL["ollama-light-service\nqwen3:1.7b · qwen3:0.6b · bge-m3"] + ES["brunix-vector-db\nElasticsearch :9200\navap-knowledge-v2"] + PG["brunix-postgres\nPostgres :5432\nLangfuse data"] + LF["Langfuse UI\n45.77.119.180:80\nObservability + tracing"] + end + + VSC -->|"HTTP/SSE :8000"| PROXY + AVS -->|"HTTP/SSE :8000"| PROXY + DEV -->|"gRPC :50052"| SERVER + + PROXY -->|"gRPC internal"| SERVER + SERVER --> GRAPH + + GRAPH -->|"host.docker.internal:11434"| T1 + GRAPH -->|"host.docker.internal:9200"| T2 + SERVER -->|"host.docker.internal:5432"| T3 + + T1 -->|"secure tunnel"| OL + T2 -->|"secure tunnel"| ES + T3 -->|"secure tunnel"| PG + + SERVER -->|"traces HTTP"| LF + DEV -->|"browser direct"| LF ``` -### 3.2 `AskAgentStream` (token streaming) +**Key networking detail:** Docker does not talk to the kubectl tunnels directly. 
The path is: ``` -Client → gRPC AgentRequest{query, session_id, editor_content*, selected_text*, extra_context*, user_info*} - │ - ├─ Decode Base64 fields - ├─ Load history from session_store[session_id] - ├─ Build initial_state - │ - ├─ prepare_graph.invoke(initial_state) ← Phase 1: no LLM generation - │ ├─ classify → query_type + use_editor_context - │ ├─ reformulate - │ └─ retrieve (or skip_retrieve if CONVERSATIONAL) - │ - ├─ build_final_messages(prepared_state) ← Reconstruct prompt with editor context if flagged - │ - └─ for chunk in llm.stream(final_messages): - └─ yield AgentResponse{text=token, is_final=False} - │ - ├─ Persist full assembled response to session_store - └─ yield AgentResponse{text="", is_final=True} +Docker container + → host.docker.internal (resolves to macOS host via extra_hosts) + → kubectl port-forward (active on macOS) + → Vultr K8s service ``` -### 3.3 HTTP Proxy → gRPC +Langfuse is the exception — it has a public IP (`45.77.119.180`) and is accessed directly, without a tunnel. -``` -Client → POST /v1/chat/completions {messages, stream, session_id, user} - │ - ├─ Extract query from last user message in messages[] - ├─ Read session_id from dedicated field (NOT from user) - ├─ Parse user field as JSON → {editor_content, selected_text, extra_context, user_info} - │ - ├─ stream=false → _invoke_blocking() → AskAgent gRPC call - └─ stream=true → _iter_stream() → AskAgentStream gRPC call → SSE token stream -``` +--- -### 3.4 `EvaluateRAG` +## 3. Component Inventory -``` -Client → gRPC EvalRequest{category?, limit?, index?} - │ - └─ evaluate.run_evaluation(...) 
- ├─ Load golden_dataset.json - ├─ Filter by category / limit - ├─ For each question: - │ ├─ retrieve_context (hybrid BM25+kNN, same as production) - │ └─ generate_answer (Ollama LLM + GENERATE_PROMPT) - ├─ Build RAGAS Dataset - ├─ Run RAGAS metrics with Claude as judge - └─ Compute global_score + verdict - │ - └─ return EvalResponse{scores, global_score, verdict, details[]} +| Component | File | Role | +|---|---|---| +| gRPC server | `server.py` | Entry point for all AI requests. Manages session store, model selection, and state initialization | +| HTTP proxy | `openai_proxy.py` | OpenAI + Ollama compatible HTTP layer. Translates REST → gRPC | +| LangGraph orchestrator | `graph.py` | Builds and executes the agentic routing graph | +| Prompt definitions | `prompts.py` | All prompt templates in one place: classifier, reformulator, generators, platform | +| Agent state | `state.py` | `AgentState` TypedDict shared across all graph nodes | +| LLM factory | `utils/llm_factory.py` | Provider-agnostic model instantiation (Ollama, OpenAI, Anthropic, Bedrock) | +| Embedding factory | `utils/emb_factory.py` | Provider-agnostic embedding model instantiation | +| Evaluation pipeline | `evaluate.py` | RAGAS evaluation with Claude as judge | +| Proto contract | `protos/brunix.proto` | Source of truth for the gRPC API | + +**Model slots:** + +| Slot | Env var | Used for | Current model | +|---|---|---|---| +| Main | `OLLAMA_MODEL_NAME` | `RETRIEVAL`, `CODE_GENERATION`, classification | `qwen3:1.7b` | +| Conversational | `OLLAMA_MODEL_NAME_CONVERSATIONAL` | `CONVERSATIONAL`, `PLATFORM` | `qwen3:0.6b` | +| Embeddings | `OLLAMA_EMB_MODEL_NAME` | Query embedding, document indexing | `bge-m3` | +| Evaluation judge | `ANTHROPIC_MODEL` | RAGAS scoring | `claude-sonnet-4-20250514` | + +--- + +## 4. 
Request Lifecycle + +Two entry paths, both ending at the same LangGraph: + +```mermaid +sequenceDiagram + participant C as Client + participant P as openai_proxy.py :8000 + participant S as server.py :50051 + participant G as graph.py + participant O as Ollama (via tunnel) + participant E as Elasticsearch (via tunnel) + + C->>P: POST /v1/chat/completions + P->>P: parse user field (editor_content, selected_text, extra_context, user_info) + P->>S: gRPC AskAgent / AskAgentStream + + S->>S: base64 decode context fields + S->>S: load session_store + classify_history_store + S->>G: invoke graph with AgentState + + G->>O: classify (LLM call) + O-->>G: query_type + use_editor_context + + alt RETRIEVAL or CODE_GENERATION + G->>O: reformulate query + O-->>G: reformulated_query + G->>O: embed reformulated_query + O-->>G: query_vector + G->>E: BM25 search + kNN search + E-->>G: ranked chunks (RRF fusion) + G->>O: generate with context + O-->>G: response + else CONVERSATIONAL + G->>O: respond_conversational (no retrieval) + O-->>G: response + else PLATFORM + G->>O: respond_platform (no retrieval, uses extra_context) + O-->>G: response + end + + G-->>S: final_state + S->>S: update session_store + classify_history_store + S-->>P: AgentResponse stream + P-->>C: SSE / JSON response ``` --- -## 4. LangGraph Workflow +## 5. Query Classification & Routing -### 4.1 Agent State +The classifier is the most critical node. It determines the entire execution path for a request. 
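The consequence of each classification can be sketched as a simple dispatch table. This is illustrative only — the real routing is expressed as LangGraph conditional edges in `graph.py`, and `route_for` is a hypothetical name:

```python
# Illustrative dispatch table — the actual routing lives in graph.py's
# conditional edges; PIPELINES / route_for are hypothetical names.
PIPELINES = {
    "RETRIEVAL": ["reformulate", "retrieve", "generate"],
    "CODE_GENERATION": ["reformulate", "retrieve", "generate_code"],
    "CONVERSATIONAL": ["respond_conversational"],  # no retrieval
    "PLATFORM": ["respond_platform"],              # no retrieval
}

def route_for(query_type: str) -> list[str]:
    """Return the node sequence that executes for a classified query type."""
    return PIPELINES[query_type]
```

Note that only the first two types ever touch Elasticsearch — the classifier's output decides whether retrieval happens at all.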
+ +### Taxonomy + +| Type | Intent | RAG | Model slot | Prompt | +|---|---|---|---|---| +| `RETRIEVAL` | Understand AVAP language concepts | Yes | main | `GENERATE_PROMPT` | +| `CODE_GENERATION` | Produce working AVAP code | Yes | main | `CODE_GENERATION_PROMPT` | +| `CONVERSATIONAL` | Rephrase or continue prior answer | No | conversational | `CONVERSATIONAL_PROMPT` | +| `PLATFORM` | Account, metrics, usage, billing | No | conversational | `PLATFORM_PROMPT` | + +### Classification pipeline + +```mermaid +flowchart TD + Q[Incoming query] --> FP{Fast-path check\n_is_platform_query} + FP -->|known platform prefix| PLATFORM[PLATFORM\nno LLM call] + FP -->|no match| LLM[LLM Classifier\nqwen3:1.7b] + + LLM --> IH[Intent history\nlast 6 entries as context] + IH --> OUT[Output: TYPE + EDITOR/NO_EDITOR] + + OUT --> R[RETRIEVAL] + OUT --> C[CODE_GENERATION] + OUT --> V[CONVERSATIONAL] + OUT --> PL[PLATFORM] +``` + +### Intent history — solving anchoring bias + +The classifier does not receive raw conversation messages. It receives a compact trace of prior classifications: + +``` +[RETRIEVAL] "What is addVar in AVAP?" +[CODE_GENERATION] "Write an API endpoint that retur" +[PLATFORM] "You have a project usage percentag" +``` + +**Why:** A 1.7B model receiving full message history computes `P(type | history)` instead of `P(type | message_content)` — it biases toward the dominant type of the session. The intent history gives the classifier enough context to resolve ambiguous references (`"this"`, `"esto"`) without the topical noise that causes anchoring. + +**Rule enforced in prompt:** `` — the distribution of previous intents must not influence the prior probability of the current classification. 
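The compact trace shown above is cheap to build. A minimal sketch, assuming entries are `(type, query)` pairs, the 6-entry cap, and ~60-character snippets described in this document (`format_intent_history` is an illustrative name, not the engine's API):

```python
def format_intent_history(entries, max_entries=6, snippet_len=60):
    """Render prior (type, query) classifications as a compact intent trace.

    Only the most recent `max_entries` classifications are kept, and each
    query is truncated to a short snippet — enough to resolve references
    like "this"/"esto" without feeding the classifier topical noise.
    """
    lines = []
    for qtype, query in entries[-max_entries:]:
        lines.append(f'[{qtype}] "{query[:snippet_len]}"')
    return "\n".join(lines)
```

The truncation is deliberate: the snippet carries the topic, while the bracketed type carries the label, which is all a small classifier needs for reference resolution.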
+ +### Routing contract (from ADR-0008) + +| Rule | Description | Priority | +|---|---|---| +| RC-01 | Known platform prefix → `PLATFORM` without LLM | Highest | +| RC-02 | Usage metrics / quota data in message → `PLATFORM` | High | +| RC-03 | History resolves references only, never predicts type | Medium | +| RC-04 | `PLATFORM` and `CONVERSATIONAL` never touch Elasticsearch | Medium | +| RC-05 | `RETRIEVAL`/`CODE_GENERATION` → main model; `CONVERSATIONAL`/`PLATFORM` → conversational model | Medium | +| RC-06 | Intent history capped at 6 entries | Low | + +--- + +## 6. LangGraph Workflow + +Two graphs are built at startup and reused across all requests: + +### `build_graph` — used by `AskAgent` (non-streaming) + +```mermaid +flowchart TD + START([start]) --> CL[classify] + CL -->|RETRIEVAL| RF[reformulate] + CL -->|CODE_GENERATION| RF + CL -->|CONVERSATIONAL| RC[respond_conversational] + CL -->|PLATFORM| RP[respond_platform] + RF --> RT[retrieve] + RT -->|RETRIEVAL| GE[generate] + RT -->|CODE_GENERATION| GC[generate_code] + GE --> END([end]) + GC --> END + RC --> END + RP --> END +``` + +### `build_prepare_graph` — used by `AskAgentStream` (streaming) + +This graph only runs the preparation phase. The generation is handled outside the graph by `llm.stream()` to enable true token-by-token streaming. + +```mermaid +flowchart TD + START([start]) --> CL[classify] + CL -->|RETRIEVAL| RF[reformulate] + CL -->|CODE_GENERATION| RF + CL -->|CONVERSATIONAL| SK[skip_retrieve] + CL -->|PLATFORM| SK + RF --> RT[retrieve] + RT --> END([end]) + SK --> END +``` + +After `prepare_graph` returns, `server.py` calls `build_final_messages(prepared)` to construct the prompt and then streams directly from the selected LLM. + +--- + +## 7. RAG Pipeline — Hybrid Search + +Only `RETRIEVAL` and `CODE_GENERATION` queries reach this pipeline. 
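The RRF fusion step at the heart of this pipeline is small enough to sketch. An illustrative version over two ranked lists of document ids (`rrf_fuse` is a hypothetical name — the engine fuses Elasticsearch BM25 and kNN hits; 60 is the standard RRF constant used here):

```python
def rrf_fuse(bm25_ids, knn_ids, k=60, top_n=8):
    """Reciprocal Rank Fusion: each list contributes 1/(rank + k) per hit.

    No score normalization is needed — only ranks matter, which is why
    RRF combines lexical and dense results so cleanly.
    """
    scores = {}
    for ranked in (bm25_ids, knn_ids):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank + k)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

A document ranked highly by both retrievers accumulates two large contributions and rises to the top, which is the behavior the hybrid design relies on.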
+ +```mermaid +flowchart TD + Q[reformulated_query] --> EMB[embed_query\nbge-m3] + Q --> BM25[BM25 search\nElasticsearch multi_match\nfields: content^2 text^2\nfuzziness: AUTO] + EMB --> KNN[kNN search\nElasticsearch\nfield: embedding\nk=8 num_candidates=40] + + BM25 --> RRF[RRF Fusion\n1 / rank+60 per hit\ncombined score] + KNN --> RRF + + RRF --> RANK[Ranked docs\ntop-8] + RANK --> FMT[format_context\nchunk headers: id type block section source\nAVAP CODE flag for code blocks] + FMT --> CTX[context string\ninjected into generation prompt] +``` + +**Why hybrid search:** BM25 is strong for exact AVAP command names (`registerEndpoint`, `addVar`) which are rare in embedding space. kNN captures semantic similarity. RRF fusion combines both without requiring score normalization. + +**Elasticsearch index schema:** + +| Field | Type | Description | +|---|---|---| +| `content` / `text` | `text` | Chunk text (BM25 searchable) | +| `embedding` | `dense_vector` | bge-m3 embedding (kNN searchable) | +| `doc_type` | `keyword` | `code`, `spec`, `code_example`, `bnf` | +| `block_type` | `keyword` | `function`, `if`, `startLoop`, `try`, etc. | +| `section` | `keyword` | Document section heading | +| `source_file` | `keyword` | Origin file | +| `chunk_id` | `keyword` | Unique chunk identifier | + +--- + +## 8. Streaming Architecture + +`AskAgentStream` implements true token-by-token streaming. It does not buffer the full response. + +```mermaid +sequenceDiagram + participant C as Client + participant S as server.py + participant PG as prepare_graph + participant LLM as Ollama LLM + + C->>S: AskAgentStream(request) + S->>PG: invoke prepare_graph(initial_state) + Note over PG: classify → reformulate → retrieve
(or skip_retrieve for CONV/PLATFORM) + PG-->>S: prepared state (query_type, context, messages) + + S->>S: build_final_messages(prepared) + S->>S: select active_llm based on query_type + + loop token streaming + S->>LLM: active_llm.stream(final_messages) + LLM-->>S: chunk.content + S-->>C: AgentResponse(text=token, is_final=false) + end + + S-->>C: AgentResponse(text="", is_final=true) + S->>S: update session_store + classify_history_store +``` + +**Model selection in stream path:** +```python +active_llm = self.llm_conversational if query_type in ("CONVERSATIONAL", "PLATFORM") else self.llm +``` + +--- + +## 9. Editor Context Pipeline + +Implemented in PRD-0002. The VS Code extension can send three optional context fields with each query. + +```mermaid +flowchart TD + REQ[AgentRequest] --> B64[base64 decode\neditor_content\nselected_text\nextra_context] + B64 --> STATE[AgentState] + + STATE --> CL[classify node\n_build_classify_prompt] + CL --> ED{use_editor_context?} + + ED -->|EDITOR| INJ[inject into prompt\n1 selected_text — highest priority\n2 editor_content — file context\n3 RAG chunks\n4 extra_context] + ED -->|NO_EDITOR| NONINJ[standard prompt\nno editor content injected] + + INJ --> GEN[generation node] + NONINJ --> GEN +``` + +**Classifier output format:** Two tokens — `TYPE EDITOR_SIGNAL` + +Examples: `RETRIEVAL NO_EDITOR`, `CODE_GENERATION EDITOR`, `PLATFORM NO_EDITOR` + +`EDITOR` is set only when the user explicitly refers to the code in their editor: *"this code"*, *"fix this"*, *"que hace esto"*, *"explain this selection"*. + +--- + +## 10. Platform Query Pipeline + +Implemented in PRD-0003. Queries about account, metrics, usage, or billing bypass RAG entirely. 
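The fast-path check that triggers this bypass can be sketched as a prefix match. The shape below is illustrative — the real prefix list lives in the engine (`_is_platform_query`), and the entries shown are assumptions based on the examples in this document:

```python
# Illustrative prefix list — the real one lives in the engine.
PLATFORM_PREFIXES = (
    "you have a project usage",  # matches the injected usage-metrics preamble
    "your current plan",         # hypothetical example
)

def is_platform_query(query: str) -> bool:
    """Route to PLATFORM without an LLM call when a known prefix matches."""
    return query.strip().lower().startswith(PLATFORM_PREFIXES)
```

Because `str.startswith` accepts a tuple, the check stays a single branch regardless of how many known prefixes accumulate.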
+ +```mermaid +flowchart TD + Q[Incoming query] --> FP{_is_platform_query?} + FP -->|yes — known prefix| SKIP[skip classifier LLM\nroute = PLATFORM] + FP -->|no| CL[LLM classifier] + CL -->|PLATFORM| ROUTE[route to respond_platform] + SKIP --> ROUTE + + ROUTE --> PROMPT["PLATFORM_PROMPT\n+ extra_context injection\n+ user_info available"] + PROMPT --> LLM[qwen3:0.6b\nconversational model slot] + LLM --> RESP[response] + + style SKIP fill:#2d6a4f,color:#fff + style ROUTE fill:#2d6a4f,color:#fff +``` + +**No Elasticsearch call is made for PLATFORM queries.** The data is already in the request via `extra_context` and `user_info` injected by the caller (AVS Platform). + +--- + +## 11. Session State & Intent History + +Two stores are maintained per session in memory: + +```mermaid +flowchart LR + subgraph Stores ["In-memory stores (server.py)"] + SS["session_store\ndict[session_id → list[BaseMessage]]\nfull conversation history\nused by generation nodes"] + CHS["classify_history_store\ndict[session_id → list[ClassifyEntry]]\ncompact intent trace\nused by classifier"] + end + + subgraph Entry ["ClassifyEntry (state.py)"] + CE["type: str\nRETRIEVAL | CODE_GENERATION\nCONVERSATIONAL | PLATFORM\n─────────────────\ntopic: str\n60-char query snippet"] + end + + CHS --> CE +``` + +**`classify_history_store` is also a data flywheel.** Every session generates labeled `(topic, type)` pairs automatically. When sufficient sessions accumulate (~500), this store can be exported to train the Layer 2 embedding classifier described in ADR-0008 Future Path — eliminating the need for the LLM classifier on the majority of requests. 
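Harvesting the flywheel could look like the sketch below — a plain export of every accumulated `(topic, type)` pair as JSONL training rows. Function and file names are hypothetical; it assumes entries carry the `type` and `topic` fields described above:

```python
import json

def export_training_pairs(classify_history_store, path="classifier_train.jsonl"):
    """Dump every (topic, type) pair from the intent store as JSONL rows.

    Each row is a labeled example for a future embedding classifier:
    {"text": <query snippet>, "label": <query type>}.
    """
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for entries in classify_history_store.values():
            for entry in entries:
                row = {"text": entry["topic"], "label": entry["type"]}
                f.write(json.dumps(row) + "\n")
                count += 1
    return count
```

The labels come for free from normal operation, which is what makes the store a flywheel rather than an annotation project.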
+ +### AgentState fields ```python class AgentState(TypedDict): - messages: Annotated[list, add_messages] # conversation history + # Core + messages: Annotated[list, add_messages] # full conversation session_id: str - query_type: str # RETRIEVAL | CODE_GENERATION | CONVERSATIONAL + query_type: str # RETRIEVAL | CODE_GENERATION | CONVERSATIONAL | PLATFORM reformulated_query: str - context: str # formatted RAG context string - editor_content: str # decoded from Base64 - selected_text: str # decoded from Base64 - extra_context: str # decoded from Base64 - user_info: str # JSON string: {"dev_id", "project_id", "org_id"} - use_editor_context: bool # set by classifier — True only if query explicitly refers to editor + context: str # RAG retrieved context + + # Classifier intent history + classify_history: list[ClassifyEntry] # compact trace, persisted across turns + + # Editor context (PRD-0002) + editor_content: str # base64 decoded + selected_text: str # base64 decoded + extra_context: str # base64 decoded + user_info: str # JSON: {dev_id, project_id, org_id} + use_editor_context: bool # set by classifier ``` -### 4.2 Full Graph (`build_graph`) - -``` - ┌─────────────┐ - │ classify │ ← sees: query + history + selected_text (if present) - │ │ outputs: query_type + use_editor_context - └──────┬──────┘ - │ - ┌────────────────┼──────────────────┐ - ▼ ▼ ▼ - RETRIEVAL CODE_GENERATION CONVERSATIONAL - │ │ │ - └────────┬───────┘ │ - ▼ ▼ - ┌──────────────┐ ┌────────────────────────┐ - │ reformulate │ │ respond_conversational │ - │ │ └───────────┬────────────┘ - │ if use_editor│ │ - │ anchor query │ │ - │ to selected │ │ - └──────┬───────┘ │ - ▼ │ - ┌──────────────┐ │ - │ retrieve │ │ - └──────┬───────┘ │ - │ │ - ┌────────┴───────────┐ │ - ▼ ▼ │ - ┌──────────┐ ┌───────────────┐ │ - │ generate │ │ generate_code │ │ - │ │ │ │ │ - │ injects │ │ injects editor│ │ - │ editor │ │ context only │ │ - │ context │ │ if flag=True │ │ - │ if flag │ └───────┬───────┘ │ - └────┬─────┘ │ │ - │ 
│ │ - └────────────────────┴────────────────┘ - │ - END -``` - -### 4.3 Prepare Graph (`build_prepare_graph`) - -Identical routing for classify, but generation nodes are replaced by `END`. The `CONVERSATIONAL` branch uses `skip_retrieve` (returns empty context). The `use_editor_context` flag is set here and carried forward into `build_final_messages`. - -### 4.4 Classifier — Two-Token Output - -The classifier outputs exactly two tokens separated by a space: - -``` - - -Examples: - RETRIEVAL NO_EDITOR - CODE_GENERATION EDITOR - CONVERSATIONAL NO_EDITOR -``` - -`EDITOR` is set only when the user message explicitly refers to editor code using expressions like "this code", "este codigo", "fix this", "que hace esto", "explain this", etc. General AVAP questions, code generation requests, and conversational follow-ups always return `NO_EDITOR`. - -### 4.5 Query Type Routing - -| `query_type` | Triggers retrieve? | Generation prompt | Editor context injected? | -|---|---|---|---| -| `RETRIEVAL` | Yes | `GENERATE_PROMPT` | Only if `use_editor_context=True` | -| `CODE_GENERATION` | Yes | `CODE_GENERATION_PROMPT` | Only if `use_editor_context=True` | -| `CONVERSATIONAL` | No | `CONVERSATIONAL_PROMPT` | Never | - -### 4.6 Reformulator — Mode-Aware & Language-Preserving - -The reformulator receives `[MODE: ]` prepended to the query: - -- **MODE RETRIEVAL:** Compresses the query into compact keywords. Does NOT expand with AVAP commands. Preserves original language — Spanish queries stay in Spanish, English queries stay in English. -- **MODE CODE_GENERATION:** Applies the AVAP command expansion mapping (registerEndpoint, addParam, ormAccessSelect, etc.). -- **MODE CONVERSATIONAL:** Returns the query as-is. - -Language preservation is critical for BM25 retrieval — the AVAP LRM is written in Spanish, so a Spanish query must reach the retriever in Spanish for lexical matching to work correctly. - --- -## 5. RAG Pipeline — Hybrid Search +## 12. 
Evaluation Pipeline -The retrieval system (`hybrid_search_native`) fuses BM25 lexical search and kNN dense vector search using **Reciprocal Rank Fusion (RRF)**. +`EvaluateRAG` runs an automated quality benchmark using RAGAS with Claude as the judge. It does not share infrastructure with the main request pipeline. -``` -User query (reformulated, language-preserved) - │ - ├─ embeddings.embed_query(query) → query_vector [1024-dim] - │ - ├─ ES bool query: - │ ├─ must: multi_match (BM25) on [content^2, text^2] - │ └─ should: boost spec/narrative doc_types (2.0x / 1.5x) - │ └─ top-k BM25 hits - │ - └─ ES knn on field [embedding], num_candidates = k×5 - └─ top-k kNN hits - │ - ├─ RRF fusion: score(doc) = Σ 1/(rank + 60) - │ - └─ Top-8 documents → format_context() → context string +```mermaid +flowchart TD + GD[golden_dataset.json\n50 Q&A pairs with ground_truth] --> FILTER[category + limit filter] + FILTER --> LOOP[for each question] + + LOOP --> RET[retrieve_context\nhybrid BM25+kNN\nsame as main pipeline] + RET --> GEN[generate_answer\nusing main LLM] + GEN --> ROW[dataset row\nquestion · answer · contexts · ground_truth] + + ROW --> DS[HuggingFace Dataset] + DS --> RAGAS[ragas.evaluate\nfaithfulness\nanswer_relevancy\ncontext_recall\ncontext_precision] + + RAGAS --> JUDGE[Claude judge\nclaude-sonnet-4-20250514\nRateLimitedChatAnthropic\n3s delay between calls] + JUDGE --> SCORES[per-metric scores] + SCORES --> GLOBAL[global_score = mean of valid metrics] + GLOBAL --> VERDICT{verdict} + VERDICT -->|≥ 0.80| EX[EXCELLENT] + VERDICT -->|≥ 0.60| AC[ACCEPTABLE] + VERDICT -->|< 0.60| IN[INSUFFICIENT] ``` -**RRF constant:** `60` (standard value). - -**doc_type boost:** `spec` and `narrative` chunks receive a score boost in the BM25 query to prioritize definitional and explanatory content over raw code examples when the query is about meaning or documentation. 
- -**Chunk metadata** attached to each retrieved document: - -| Field | Description | -|---|---| -| `chunk_id` | Unique identifier within the index | -| `source_file` | Origin document filename | -| `doc_type` | `spec`, `code`, `code_example`, `bnf` | -| `block_type` | AVAP block type: `narrative`, `function`, `if`, `startLoop`, `try`, etc. | -| `section` | Document section/chapter heading | +**Important:** `EvaluateRAG` uses `RateLimitedChatAnthropic` — a subclass that injects a 3-second delay between calls to respect Anthropic API rate limits. RAGAS is configured with `max_workers=1`. --- -## 6. Editor Context Pipeline +## 13. Data Ingestion Pipeline -The editor context pipeline (PRD-0002) allows the VS Code extension to send the user's active editor state alongside every query. The engine uses this context only when the user explicitly refers to their code. +Two independent pipelines populate the Elasticsearch index. Both require the Ollama tunnel active on `localhost:11434` and Elasticsearch on `localhost:9200`. 
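Both pipelines push documents through embedding and bulk indexing in fixed-size batches. The batching shape itself is trivial; a minimal sketch (helper name illustrative — the real pipelines wrap this around `OllamaEmbeddings` / `async_bulk` calls):

```python
def batched(items, size):
    """Yield consecutive fixed-size batches from a list.

    The ingestion pipelines use this shape twice: small batches for
    embedding calls, larger batches for Elasticsearch bulk indexing.
    """
    for i in range(0, len(items), size):
        yield items[i:i + size]
```

Keeping embedding and indexing batch sizes independent lets each side be tuned against its own bottleneck (model throughput vs. bulk request size).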
-### Transport +```mermaid +flowchart TD + subgraph PipelineA ["Pipeline A — Chonkie (recommended)"] + DA[docs/**/*.md\ndocs/**/*.avap] --> CA[Chonkie\nMarkdownChef + TokenChunker] + CA --> EA[OllamaEmbeddings\nbatch embed] + EA --> ESA[Elasticsearch bulk index] + end -Editor context travels differently depending on the client protocol: + subgraph PipelineB ["Pipeline B — AVAP Native"] + DB[docs/**/*.avap\ndocs/**/*.md] --> CB[avap_chunker.py\nGenericLexer + LanguageConfig\nblock detection + semantic tags\nMinHash LSH dedup] + CB --> JSONL[chunks.jsonl] + JSONL --> EB[avap_ingestor.py\nasync producer/consumer\nOllamaAsyncEmbedder\nbatch=8] + EB --> ESB[Elasticsearch async_bulk\nbatch=50\nDeadLetterQueue on failure] + end -**Via gRPC directly (`AgentRequest` fields 3–6):** -- `editor_content` (field 3) — Base64-encoded full file content -- `selected_text` (field 4) — Base64-encoded selected text -- `extra_context` (field 5) — Base64-encoded free-form context -- `user_info` (field 6) — JSON string `{"dev_id":…,"project_id":…,"org_id":…}` - -**Via HTTP proxy (OpenAI `/v1/chat/completions`):** -- Transported in the standard `user` field as a JSON string -- Same four keys, same encodings -- The proxy parses, extracts, and forwards to gRPC - -### Pipeline - -``` -AgentRequest arrives - │ - ├─ server.py: Base64 decode editor_content, selected_text, extra_context - ├─ user_info passed as-is (JSON string) - │ - └─ initial_state populated with all four fields - │ - ▼ - classify node: - ├─ If selected_text present → injected into classify prompt as - ├─ LLM outputs: RETRIEVAL EDITOR or RETRIEVAL NO_EDITOR (etc.) 
- └─ use_editor_context = True if second token == EDITOR - │ - ▼ - reformulate node: - ├─ If use_editor_context=True AND selected_text present: - │ anchor = selected_text + "\n\nUser question: " + query - │ → LLM reformulates using selected code as primary signal - └─ Else: reformulate query as normal - │ - ▼ - retrieve node: (unchanged — uses reformulated_query) - │ - ▼ - generate / generate_code node: - ├─ If use_editor_context=True: - │ prompt = + + + RAG_prompt - │ Priority: selected_text > editor_content > RAG context > extra_context - └─ Else: standard RAG prompt — no editor content injected + ESA --> IDX[(avap-knowledge-v2\nElasticsearch index)] + ESB --> IDX ``` -### Intent detection examples +**Pipeline B produces richer metadata** — block type, section, semantic tags (`uses_orm`, `uses_http`, `uses_auth`, etc.), complexity score, and MinHash deduplication. Use it for `.avap` files that need full semantic analysis. -| User message | `use_editor_context` | Reason | +--- + +## 14. Observability Stack + +```mermaid +flowchart LR + S[server.py] -->|LANGFUSE_HOST\nLANGFUSE_PUBLIC_KEY\nLANGFUSE_SECRET_KEY| LF[Langfuse\n45.77.119.180:80] + LF --> TR[Traces\nper-request spans] + LF --> ME[Metrics\nlatency · token counts] + LF --> EV[Evaluation scores\nfaithfulness · relevancy] +``` + +Langfuse is accessed directly via public IP — no kubectl tunnel required. The engine sends traces automatically on every request when the Langfuse environment variables are set. + +--- + +## 15. Known Limitations & Future Work + +### Active tactical debt + +| Item | Description | ADR | |---|---|---| -| "Que significa AVAP?" | `False` | General definition question | -| "dame un API de hello world" | `False` | Code generation, no editor reference | -| "que hace este codigo?" | `True` | Explicit reference to "this code" | -| "fix this" | `True` | Explicit reference to current selection | -| "como mejoro esto?" | `True` | Explicit reference to current context | -| "how does addVar work?" 
| `False` | Documentation question, no editor reference | +| LLM classifier | Generative model doing discriminative work — non-deterministic, pays full inference cost for a 4-class label | ADR-0008 | +| RC-02 is soft | Platform data signal enforced via prompt ``, not code — can be overridden by model | ADR-0008 | +| `classify_history` not exported | Data flywheel accumulates but has no export mechanism yet | ADR-0008 | +| `user_info` unused | `dev_id`, `project_id`, `org_id` are in state but not consumed by any graph node | PRD-0002 | +| `CONFIDENCE_PROMPT_TEMPLATE` unused | Self-RAG capability is scaffolded in `prompts.py` but not wired into the graph | — | ---- +### Roadmap (ADR-0008 Future Path) -## 7. Streaming Architecture (AskAgentStream) - -The two-phase streaming design is critical to understand: - -**Why not stream through LangGraph?** -LangGraph's `stream()` method yields full state snapshots per node, not individual tokens. To achieve true per-token streaming to the gRPC client, the generation step is deliberately extracted from the graph and called directly via `llm.stream()`. - -**Phase 1 — Deterministic preparation (graph-managed):** -Classification, query reformulation, and retrieval run through `prepare_graph.invoke()`. This phase runs synchronously and produces the complete context before any token is emitted to the client. Editor context classification also happens here — `use_editor_context` is set in the prepared state. - -**Phase 2 — Token streaming (manual):** -`build_final_messages()` reconstructs the exact prompt, injecting editor context if `use_editor_context` is `True`. `llm.stream(final_messages)` yields one `AIMessageChunk` per token from Ollama. Each token is immediately forwarded as `AgentResponse{text=token, is_final=False}`. - -**Backpressure:** gRPC streaming is flow-controlled by the client. If the client stops reading, the Ollama token stream will block at the `yield` point. - ---- - -## 8. 
Evaluation Pipeline (EvaluateRAG) - -The evaluation suite implements an **offline RAG evaluation** pattern using RAGAS metrics. - -### Judge model separation - -The production LLM (Ollama `qwen2.5:1.5b`) is used for **answer generation** — the same pipeline as production to measure real-world quality. Claude (`claude-sonnet-4-20250514`) is used as the **evaluation judge** — an independent, high-capability model that scores the generated answers against ground truth. - -### RAGAS metrics - -| Metric | Measures | Input | -|---|---|---| -| `faithfulness` | Are claims in the answer supported by the retrieved context? | answer + contexts | -| `answer_relevancy` | Is the answer relevant to the question? | answer + question | -| `context_recall` | Does the retrieved context cover the ground truth? | contexts + ground_truth | -| `context_precision` | Are the retrieved chunks useful (signal-to-noise)? | contexts + ground_truth | - -### Global score & verdict - -``` -global_score = mean(non-zero metric scores) - -verdict: - ≥ 0.80 → EXCELLENT - ≥ 0.60 → ACCEPTABLE - < 0.60 → INSUFFICIENT +```mermaid +flowchart LR + P0["Now\nLLM classifier\n~95% of traffic"] --> + P1["Phase 1\nExport classify_history_store\nlabeled dataset"] --> + P2["Phase 2\nEmbedding classifier Layer 2\nbge-m3 + logistic regression\n~1ms CPU"] --> + P3["Phase 3\nCaller-declared query_type\nproto field 7"] --> + P4["Phase 4\nLLM classifier = anomaly handler\n<2% of traffic"] ``` -### Golden dataset - -Located at `Docker/src/golden_dataset.json`. Each entry: - -```json -{ - "id": "avap-001", - "category": "core_syntax", - "question": "How do you declare a variable in AVAP?", - "ground_truth": "Use addVar to declare a variable..." -} -``` - -> **Note:** The golden dataset does not include editor-context queries. EvaluateRAG measures the RAG pipeline in isolation. A separate editor-context golden dataset is planned as future work once the VS Code extension is validated. - ---- - -## 9. 
Data Ingestion Pipeline - -Documents flow into the Elasticsearch index through two paths: - -### Path A — AVAP documentation (structured markdown) - -``` -docs/LRM/avap.md -docs/avap_language_github_docs/*.md -docs/developer.avapframework.com/*.md - │ - ▼ -scripts/pipelines/flows/elasticsearch_ingestion.py - │ - ├─ Load markdown files - ├─ Chunk using scripts/pipelines/tasks/chunk.py - ├─ Generate embeddings via scripts/pipelines/tasks/embeddings.py - └─ Bulk index into Elasticsearch -``` - -### Path B — AVAP native code chunker - -``` -docs/samples/*.avap - │ - ▼ -scripts/pipelines/ingestion/avap_chunker.py - │ (grammar: scripts/pipelines/ingestion/avap_config.json v2.0) - │ - ├─ Lexer strips comments and string contents - ├─ Block detection (function, if, startLoop, try) - ├─ Statement classification (30 types + catch-all) - ├─ Semantic tag assignment (18 boolean tags) - └─ Output: JSONL chunks → avap_ingestor.py → Elasticsearch -``` - ---- - -## 10. Infrastructure Layout - -### Devaron Cluster (Vultr Kubernetes) - -| Service | K8s Name | Port | Purpose | -|---|---|---|---| -| LLM inference | `ollama-light-service` | `11434` | Text generation + embeddings | -| Vector database | `brunix-vector-db` | `9200` | Elasticsearch 8.x | -| Observability DB | `brunix-postgres` | `5432` | PostgreSQL for Langfuse | -| Langfuse UI | — | `80` | `http://45.77.119.180` | - -### Port map summary - -| Port | Protocol | Service | Scope | -|---|---|---|---| -| `50051` | gRPC | Brunix Engine (inside container) | Internal | -| `50052` | gRPC | Brunix Engine (host-mapped) | External | -| `8000` | HTTP | OpenAI proxy | External | -| `11434` | HTTP | Ollama (via tunnel) | Tunnel | -| `9200` | HTTP | Elasticsearch (via tunnel) | Tunnel | -| `5432` | TCP | PostgreSQL/Langfuse (via tunnel) | Tunnel | - ---- - -## 11. 
Session State & Conversation Memory - -Conversation history is managed via an in-process dictionary: - -```python -session_store: dict[str, list] = defaultdict(list) -# key: session_id (string, provided by client) -# value: list of LangChain BaseMessage objects -``` - -**Characteristics:** -- **In-memory only.** History is lost on container restart. -- **No TTL or eviction.** Sessions grow unbounded for the lifetime of the process. -- **Thread safety:** Python's GIL provides basic safety for the `ThreadPoolExecutor(max_workers=10)` gRPC server, but concurrent writes to the same `session_id` from two simultaneous requests are not explicitly protected. -- **History window:** `format_history_for_classify()` uses only the last 6 messages for query classification. - -> **Future work:** Replace `session_store` with a Redis-backed persistent store to survive restarts and support horizontal scaling. - ---- - -## 12. Observability Stack - -### Langfuse tracing - -Every `AskAgent` / `AskAgentStream` request creates a trace capturing input query, session ID, each LangGraph node execution, LLM token counts, latency, and final response. - -**Access:** `http://45.77.119.180` - -### Logging - -Key log markers: - -| Marker | Module | Meaning | -|---|---|---| -| `[ESEARCH]` | `server.py` | Elasticsearch connection status | -| `[classify]` | `graph.py` | Query type + `use_editor_context` flag + raw LLM output | -| `[reformulate]` | `graph.py` | Reformulated query string + whether selected_text was used as anchor | -| `[hybrid]` | `graph.py` | BM25 / kNN hit counts and RRF result count | -| `[retrieve]` | `graph.py` | Number of docs retrieved and context length | -| `[generate]` | `graph.py` | Response character count | -| `[AskAgent]` | `server.py` | editor and selected flags, query preview | -| `[AskAgentStream]` | `server.py` | Token count and total chars per stream | -| `[base64]` | `server.py` | Warning when a Base64 field fails to decode | - ---- - -## 13. 
Security Boundaries - -| Boundary | Current state | Risk | -|---|---|---| -| gRPC transport | **Insecure** (`add_insecure_port`) | Network interception possible. Acceptable in dev/tunnel setup; requires mTLS for production. | -| Elasticsearch auth | Optional (user/pass or API key via env vars) | Index is accessible without auth if vars are unset. | -| Editor context | Transmitted in plaintext (Base64 is encoding, not encryption) | File contents visible to anyone intercepting gRPC traffic. Requires TLS for production. | -| Container user | Non-root (`python:3.11-slim` default) | Low risk. Do not override with `root`. | -| Secrets in env | Via `.env` / `docker-compose` env injection | Never commit real values. | -| Session store | In-memory, no auth | Any caller with gRPC access can read/write any session by guessing its ID. | -| `user_info` | JSON string, no validation | `dev_id`, `project_id`, `org_id` are not authenticated — passed as metadata only. | - ---- - -## 14. Known Limitations & Future Work - -| Area | Limitation | Proposed solution | -|---|---|---| -| Session persistence | In-memory, lost on restart | Redis-backed `session_store` | -| Horizontal scaling | `session_store` is per-process | Sticky sessions or external session store | -| gRPC security | Insecure port | Add TLS + optional mTLS | -| Editor context security | Base64 is not encryption | TLS required before sending real file contents | -| `user_info` auth | Not validated or authenticated | JWT or API key validation on `user_info` fields | -| Elasticsearch auth | Not enforced if vars unset | Make auth required; fail-fast on startup | -| Context window | Full history passed to generate; no truncation | Sliding window or summarization for long sessions | -| Evaluation | Golden dataset has no editor-context queries | Build dedicated editor-context golden dataset after VS Code validation | -| Rate limiting | None on gRPC server | Add interceptor-based rate limiter | -| Health check | No gRPC health 
protocol | Implement `grpc.health.v1` | +Target steady-state: the LLM classifier handles fewer than 2% of requests — only genuinely ambiguous queries that neither hard rules nor the trained embedding classifier can resolve with confidence.
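The steady-state routing described above can be sketched as a three-layer fallback. Everything in this snippet is an assumption for illustration: the label set, the 0.85 threshold, and the injected callables stand in for the hard rules, the bge-m3 plus logistic-regression classifier, and the LLM classifier; none of these names come from the engine's code.

```python
# Illustrative three-layer classifier routing (ADR-0008 target state).
# Label set, threshold, and callables are placeholders, not engine code.
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class LayeredClassifier:
    hard_rules: Callable[[str], Optional[str]]           # Layer 1: deterministic rules
    embed_classify: Callable[[str], tuple[str, float]]   # Layer 2: (label, confidence)
    llm_classify: Callable[[str], str]                   # Layer 3: anomaly handler
    threshold: float = 0.85                              # below this, fall through to the LLM

    def classify(self, query: str) -> tuple[str, str]:
        """Return (label, which_layer_decided)."""
        if (label := self.hard_rules(query)) is not None:
            return label, "rules"
        label, confidence = self.embed_classify(query)
        if confidence >= self.threshold:
            return label, "embedding"
        # Steady-state goal: this branch handles fewer than 2% of requests.
        return self.llm_classify(query), "llm"
```

Wired up with a caller-declared `query_type` (proto field 7) feeding the hard-rule layer, only low-confidence residual queries would ever pay LLM inference cost.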