# Brunix Assistance Engine — Architecture Reference

> **Audience:** Engineers contributing to this repository, architects reviewing the system design, and operators responsible for its deployment.
> **Last updated:** 2026-04-09
> **Version:** 1.8.x
> **Architect:** Rafael Ruiz (CTO, 101OBEX Corp)
> **Related ADRs:** ADR-0001 · ADR-0002 · ADR-0003 · ADR-0004 · ADR-0005 · ADR-0006 · ADR-0007 · ADR-0008
> **Related PRDs:** PRD-0001 · PRD-0002 · PRD-0003

---

## Table of Contents
1. [System Classification](#1-system-classification)
2. [Infrastructure Layout](#2-infrastructure-layout)
3. [Component Inventory](#3-component-inventory)
4. [Request Lifecycle](#4-request-lifecycle)
5. [Query Classification & Routing](#5-query-classification--routing)
6. [LangGraph Workflow](#6-langgraph-workflow)
7. [RAG Pipeline — Hybrid Search](#7-rag-pipeline--hybrid-search)
8. [Streaming Architecture](#8-streaming-architecture)
9. [Editor Context Pipeline](#9-editor-context-pipeline)
10. [Platform Query Pipeline](#10-platform-query-pipeline)
11. [Session State & Intent History](#11-session-state--intent-history)
12. [Evaluation Pipeline](#12-evaluation-pipeline)
13. [Data Ingestion Pipeline](#13-data-ingestion-pipeline)
14. [Observability Stack](#14-observability-stack)
15. [Known Limitations & Future Work](#15-known-limitations--future-work)

---

## 1. System Classification

The Brunix Assistance Engine is a **Modular Agentic RAG** system.

| Characteristic | Classification | Evidence |
|---|---|---|
| Pipeline structure | **Agentic RAG** | LangGraph classifier routes each query to a different pipeline — not a fixed retrieve-then-generate chain |
| Retrieval strategy | **Advanced RAG** | Hybrid BM25+kNN with RRF fusion, query reformulation before retrieval, two-phase streaming |
| Architecture style | **Modular RAG** | Each node (classify, reformulate, retrieve, generate) is independently replaceable; routing contract is implementation-independent |
| Self-evaluation | Not yet active | `CONFIDENCE_PROMPT_TEMPLATE` exists but is not wired into the graph — Self-RAG capability is scaffolded |

**What it is not:** naive RAG (no fixed pipeline), Graph RAG (no knowledge graph), Self-RAG (no confidence feedback loop, yet).

The classifier is the central architectural element. It determines which pipeline executes for each request — including whether retrieval happens at all.

---

## 2. Infrastructure Layout

```mermaid
graph TD
    subgraph Clients ["External Clients"]
        VSC["VS Code Extension"]
        AVS["AVS Platform"]
        DEV["grpcurl / tests"]
    end

    subgraph Docker ["Docker — brunix-assistance-engine"]
        PROXY["openai_proxy.py\n:8000 HTTP/SSE\nOpenAI + Ollama compatible"]
        SERVER["server.py\n:50051 gRPC internal\nAskAgent · AskAgentStream · EvaluateRAG"]
        GRAPH["graph.py — LangGraph\nclassify → route → execute"]
    end

    subgraph Mac ["Developer Machine (macOS)"]
        Docker
        subgraph Tunnels ["kubectl port-forward tunnels"]
            T1["localhost:11434 → Ollama"]
            T2["localhost:9200 → Elasticsearch"]
            T3["localhost:5432 → Postgres"]
        end
    end

    subgraph Vultr ["Vultr — Devaron K8s Cluster"]
        OL["ollama-light-service\nqwen3:1.7b · qwen3:0.6b · bge-m3"]
        ES["brunix-vector-db\nElasticsearch :9200\navap-knowledge-v2"]
        PG["brunix-postgres\nPostgres :5432\nLangfuse data"]
        LF["Langfuse UI\n45.77.119.180:80\nObservability + tracing"]
    end

    VSC -->|"HTTP/SSE :8000"| PROXY
    AVS -->|"HTTP/SSE :8000"| PROXY
    DEV -->|"gRPC :50052"| SERVER

    PROXY -->|"gRPC internal"| SERVER
    SERVER --> GRAPH

    GRAPH -->|"host.docker.internal:11434"| T1
    GRAPH -->|"host.docker.internal:9200"| T2
    SERVER -->|"host.docker.internal:5432"| T3

    T1 -->|"secure tunnel"| OL
    T2 -->|"secure tunnel"| ES
    T3 -->|"secure tunnel"| PG

    SERVER -->|"traces HTTP"| LF
    DEV -->|"browser direct"| LF
```

**Key networking detail:** Docker does not talk to the kubectl tunnels directly. The path is:

```
Docker container
  → host.docker.internal (resolves to macOS host via extra_hosts)
  → kubectl port-forward (active on macOS)
  → Vultr K8s service
```

Langfuse is the exception — it has a public IP (`45.77.119.180`) and is accessed directly, without a tunnel.

---

## 3. Component Inventory

| Component | File | Role |
|---|---|---|
| gRPC server | `server.py` | Entry point for all AI requests. Manages session store, model selection, and state initialization |
| HTTP proxy | `openai_proxy.py` | OpenAI + Ollama compatible HTTP layer. Translates REST → gRPC |
| LangGraph orchestrator | `graph.py` | Builds and executes the agentic routing graph |
| Prompt definitions | `prompts.py` | All prompt templates in one place: classifier, reformulator, generators, platform |
| Agent state | `state.py` | `AgentState` TypedDict shared across all graph nodes |
| LLM factory | `utils/llm_factory.py` | Provider-agnostic model instantiation (Ollama, OpenAI, Anthropic, Bedrock) |
| Embedding factory | `utils/emb_factory.py` | Provider-agnostic embedding model instantiation |
| Evaluation pipeline | `evaluate.py` | RAGAS evaluation with Claude as judge |
| Proto contract | `protos/brunix.proto` | Source of truth for the gRPC API |

**Model slots:**

| Slot | Env var | Used for | Current model |
|---|---|---|---|
| Main | `OLLAMA_MODEL_NAME` | `RETRIEVAL`, `CODE_GENERATION`, classification | `qwen3:1.7b` |
| Conversational | `OLLAMA_MODEL_NAME_CONVERSATIONAL` | `CONVERSATIONAL`, `PLATFORM` | `qwen3:0.6b` |
| Embeddings | `OLLAMA_EMB_MODEL_NAME` | Query embedding, document indexing | `bge-m3` |
| Evaluation judge | `ANTHROPIC_MODEL` | RAGAS scoring | `claude-sonnet-4-20250514` |
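
The slot resolution can be sketched in a few lines. This is an illustrative stand-in for `utils/llm_factory.py`, not its actual code: the env var names and default models come from the table above, while `_PROVIDERS`, `make_llm`, and the returned dicts are hypothetical (the real factory presumably instantiates LangChain chat models such as `ChatOllama`). It shows the design choice the table implies: slots are indirection over env vars, so swapping a model never touches call sites.

```python
import os

# Hypothetical provider registry; the dict values stand in for real
# chat-model instances (ChatOllama, ChatOpenAI, ChatAnthropic, ChatBedrock).
_PROVIDERS = {
    "ollama": lambda model: {"provider": "ollama", "model": model},
    "openai": lambda model: {"provider": "openai", "model": model},
    "anthropic": lambda model: {"provider": "anthropic", "model": model},
    "bedrock": lambda model: {"provider": "bedrock", "model": model},
}


def make_llm(slot: str = "main") -> dict:
    """Resolve a model slot to a concrete model via env vars, then build it."""
    slots = {
        "main": ("OLLAMA_MODEL_NAME", "qwen3:1.7b"),
        "conversational": ("OLLAMA_MODEL_NAME_CONVERSATIONAL", "qwen3:0.6b"),
    }
    env_var, default = slots[slot]
    model = os.environ.get(env_var, default)
    provider = os.environ.get("LLM_PROVIDER", "ollama")
    return _PROVIDERS[provider](model)
```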

---

## 4. Request Lifecycle

Two entry paths, both ending at the same LangGraph:

```mermaid
sequenceDiagram
    participant C as Client
    participant P as openai_proxy.py :8000
    participant S as server.py :50051
    participant G as graph.py
    participant O as Ollama (via tunnel)
    participant E as Elasticsearch (via tunnel)

    C->>P: POST /v1/chat/completions
    P->>P: parse user field (editor_content, selected_text, extra_context, user_info)
    P->>S: gRPC AskAgent / AskAgentStream

    S->>S: base64 decode context fields
    S->>S: load session_store + classify_history_store
    S->>G: invoke graph with AgentState

    G->>O: classify (LLM call)
    O-->>G: query_type + use_editor_context

    alt RETRIEVAL or CODE_GENERATION
        G->>O: reformulate query
        O-->>G: reformulated_query
        G->>O: embed reformulated_query
        O-->>G: query_vector
        G->>E: BM25 search + kNN search
        E-->>G: ranked chunks (RRF fusion)
        G->>O: generate with context
        O-->>G: response
    else CONVERSATIONAL
        G->>O: respond_conversational (no retrieval)
        O-->>G: response
    else PLATFORM
        G->>O: respond_platform (no retrieval, uses extra_context)
        O-->>G: response
    end

    G-->>S: final_state
    S->>S: update session_store + classify_history_store
    S-->>P: AgentResponse stream
    P-->>C: SSE / JSON response
```
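
The base64 decode step early in the lifecycle amounts to something like the following sketch. `decode_context_fields` and `_maybe_b64` are illustrative names (not necessarily the ones in `server.py`); the field names are the ones carried in the request above. Empty fields pass through as empty strings, so absent context never breaks decoding.

```python
import base64


def decode_context_fields(request: dict) -> dict:
    """Decode the optional base64-encoded context fields of a request.

    Context arrives base64-encoded so arbitrary editor content (newlines,
    quotes, non-ASCII) survives transport unmodified.
    """
    def _maybe_b64(value: str) -> str:
        if not value:
            return ""
        return base64.b64decode(value).decode("utf-8")

    return {
        field: _maybe_b64(request.get(field, ""))
        for field in ("editor_content", "selected_text", "extra_context")
    }
```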

---

## 5. Query Classification & Routing

The classifier is the most critical node. It determines the entire execution path for a request.

### Taxonomy

| Type | Intent | RAG | Model slot | Prompt |
|---|---|---|---|---|
| `RETRIEVAL` | Understand AVAP language concepts | Yes | main | `GENERATE_PROMPT` |
| `CODE_GENERATION` | Produce working AVAP code | Yes | main | `CODE_GENERATION_PROMPT` |
| `CONVERSATIONAL` | Rephrase or continue prior answer | No | conversational | `CONVERSATIONAL_PROMPT` |
| `PLATFORM` | Account, metrics, usage, billing | No | conversational | `PLATFORM_PROMPT` |

### Classification pipeline

```mermaid
flowchart TD
    Q[Incoming query] --> FP{Fast-path check\n_is_platform_query}
    FP -->|known platform prefix| PLATFORM[PLATFORM\nno LLM call]
    FP -->|no match| LLM[LLM Classifier\nqwen3:1.7b]

    LLM --> IH[Intent history\nlast 6 entries as context]
    IH --> OUT[Output: TYPE + EDITOR/NO_EDITOR]

    OUT --> R[RETRIEVAL]
    OUT --> C[CODE_GENERATION]
    OUT --> V[CONVERSATIONAL]
    OUT --> PL[PLATFORM]
```

### Intent history — solving anchoring bias

The classifier does not receive raw conversation messages. It receives a compact trace of prior classifications:

```
[RETRIEVAL] "What is addVar in AVAP?"
[CODE_GENERATION] "Write an API endpoint that retur"
[PLATFORM] "You have a project usage percentag"
```

**Why:** A 1.7B model receiving full message history computes `P(type | history)` instead of `P(type | message_content)` — it biases toward the dominant type of the session. The intent history gives the classifier enough context to resolve ambiguous references (`"this"`, `"esto"`) without the topical noise that causes anchoring.

**Rule enforced in prompt:** `<history_rule>` — the distribution of previous intents must not influence the prior probability of the current classification.
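
Rendering such a trace is a small amount of code. This sketch assumes entries carry `type` and `topic` keys, matching the `ClassifyEntry` fields described in Section 11; `format_intent_history` is an illustrative name. The 6-entry cap mirrors RC-06, and the 60-char cut mirrors the topic snippet length.

```python
def format_intent_history(entries: list[dict], max_entries: int = 6,
                          snippet_len: int = 60) -> str:
    """Render the compact intent trace shown to the classifier.

    Keeps only the last max_entries classifications and truncates each
    query to a short topic snippet, so the classifier sees intent
    structure without the full-message topical noise that causes anchoring.
    """
    recent = entries[-max_entries:]
    return "\n".join(
        f'[{e["type"]}] "{e["topic"][:snippet_len]}"' for e in recent
    )
```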

### Routing contract (from ADR-0008)

| Rule | Description | Priority |
|---|---|---|
| RC-01 | Known platform prefix → `PLATFORM` without LLM | Highest |
| RC-02 | Usage metrics / quota data in message → `PLATFORM` | High |
| RC-03 | History resolves references only, never predicts type | Medium |
| RC-04 | `PLATFORM` and `CONVERSATIONAL` never touch Elasticsearch | Medium |
| RC-05 | `RETRIEVAL`/`CODE_GENERATION` → main model; `CONVERSATIONAL`/`PLATFORM` → conversational model | Medium |
| RC-06 | Intent history capped at 6 entries | Low |

---

## 6. LangGraph Workflow

Two graphs are built at startup and reused across all requests:

### `build_graph` — used by `AskAgent` (non-streaming)

```mermaid
flowchart TD
    START([start]) --> CL[classify]
    CL -->|RETRIEVAL| RF[reformulate]
    CL -->|CODE_GENERATION| RF
    CL -->|CONVERSATIONAL| RC[respond_conversational]
    CL -->|PLATFORM| RP[respond_platform]
    RF --> RT[retrieve]
    RT -->|RETRIEVAL| GE[generate]
    RT -->|CODE_GENERATION| GC[generate_code]
    GE --> END([end])
    GC --> END
    RC --> END
    RP --> END
```
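
The conditional edges in this diagram reduce to two routing functions. The node names below come from the diagram; the function names are illustrative, and the real `graph.py` presumably attaches them with LangGraph's `add_conditional_edges`. Keeping routing in plain functions like these is what makes the contract testable independently of the graph runtime.

```python
def route_after_classify(state: dict) -> str:
    """Map the classifier's query_type to the next graph node."""
    query_type = state["query_type"]
    if query_type in ("RETRIEVAL", "CODE_GENERATION"):
        return "reformulate"  # both RAG paths share query reformulation
    if query_type == "CONVERSATIONAL":
        return "respond_conversational"
    return "respond_platform"  # PLATFORM


def route_after_retrieve(state: dict) -> str:
    """Split the shared RAG path into the two generation nodes."""
    return "generate" if state["query_type"] == "RETRIEVAL" else "generate_code"
```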

### `build_prepare_graph` — used by `AskAgentStream` (streaming)

This graph only runs the preparation phase. The generation is handled outside the graph by `llm.stream()` to enable true token-by-token streaming.

```mermaid
flowchart TD
    START([start]) --> CL[classify]
    CL -->|RETRIEVAL| RF[reformulate]
    CL -->|CODE_GENERATION| RF
    CL -->|CONVERSATIONAL| SK[skip_retrieve]
    CL -->|PLATFORM| SK
    RF --> RT[retrieve]
    RT --> END([end])
    SK --> END
```

After `prepare_graph` returns, `server.py` calls `build_final_messages(prepared)` to construct the prompt and then streams directly from the selected LLM.

---

## 7. RAG Pipeline — Hybrid Search

Only `RETRIEVAL` and `CODE_GENERATION` queries reach this pipeline.

```mermaid
flowchart TD
    Q[reformulated_query] --> EMB[embed_query\nbge-m3]
    Q --> BM25[BM25 search\nElasticsearch multi_match\nfields: content^2 text^2\nfuzziness: AUTO]
    EMB --> KNN[kNN search\nElasticsearch\nfield: embedding\nk=8 num_candidates=40]

    BM25 --> RRF[RRF Fusion\n1 / rank+60 per hit\ncombined score]
    KNN --> RRF

    RRF --> RANK[Ranked docs\ntop-8]
    RANK --> FMT[format_context\nchunk headers: id type block section source\nAVAP CODE flag for code blocks]
    FMT --> CTX[context string\ninjected into generation prompt]
```

**Why hybrid search:** BM25 is strong for exact AVAP command names (`registerEndpoint`, `addVar`) which are rare in embedding space. kNN captures semantic similarity. RRF fusion combines both without requiring score normalization.
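
The fusion step itself is tiny. This is a pure-Python sketch using the `1 / (rank + 60)` formula from the diagram; `rrf_fuse` is an illustrative name, and the inputs are the two already-ranked lists of document ids returned by the BM25 and kNN queries.

```python
def rrf_fuse(bm25_hits: list[str], knn_hits: list[str], k: int = 60,
             top_n: int = 8) -> list[str]:
    """Reciprocal Rank Fusion: each doc scores 1/(rank + k) per result
    list it appears in; scores are summed and the top_n docs returned.

    Only ranks are used, never raw scores, which is why BM25 and cosine
    similarity can be combined without normalization.
    """
    scores: dict[str, float] = {}
    for hits in (bm25_hits, knn_hits):
        for rank, doc_id in enumerate(hits, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank + k)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

A doc that appears in both lists accumulates two contributions, so `rrf_fuse(["a", "b"], ["b", "c"])` ranks `"b"` first even though it tops neither list alone.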

**Elasticsearch index schema:**

| Field | Type | Description |
|---|---|---|
| `content` / `text` | `text` | Chunk text (BM25 searchable) |
| `embedding` | `dense_vector` | bge-m3 embedding (kNN searchable) |
| `doc_type` | `keyword` | `code`, `spec`, `code_example`, `bnf` |
| `block_type` | `keyword` | `function`, `if`, `startLoop`, `try`, etc. |
| `section` | `keyword` | Document section heading |
| `source_file` | `keyword` | Origin file |
| `chunk_id` | `keyword` | Unique chunk identifier |

---

## 8. Streaming Architecture

`AskAgentStream` implements true token-by-token streaming. It does not buffer the full response.

```mermaid
sequenceDiagram
    participant C as Client
    participant S as server.py
    participant PG as prepare_graph
    participant LLM as Ollama LLM

    C->>S: AskAgentStream(request)
    S->>PG: invoke prepare_graph(initial_state)
    Note over PG: classify → reformulate → retrieve<br/>(or skip_retrieve for CONV/PLATFORM)
    PG-->>S: prepared state (query_type, context, messages)

    S->>S: build_final_messages(prepared)
    S->>S: select active_llm based on query_type

    loop token streaming
        S->>LLM: active_llm.stream(final_messages)
        LLM-->>S: chunk.content
        S-->>C: AgentResponse(text=token, is_final=false)
    end

    S-->>C: AgentResponse(text="", is_final=true)
    S->>S: update session_store + classify_history_store
```

**Model selection in stream path:**

```python
active_llm = self.llm_conversational if query_type in ("CONVERSATIONAL", "PLATFORM") else self.llm
```
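
Stripped of gRPC details, the forwarding loop is a generator that yields each chunk the moment the model emits it. The dicts below stand in for `AgentResponse` messages and `stream_response` is an illustrative name; the shape (per-token chunks, then an empty final marker) follows the sequence diagram above.

```python
from typing import Iterable, Iterator


def stream_response(token_stream: Iterable[str]) -> Iterator[dict]:
    """Forward tokens one at a time, then emit the final marker.

    Nothing is buffered: each token reaches the client as soon as the
    model produces it, which is what makes the streaming "true"
    token-by-token rather than chunked-after-completion.
    """
    for token in token_stream:
        yield {"text": token, "is_final": False}
    yield {"text": "", "is_final": True}
```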

---

## 9. Editor Context Pipeline

Implemented in PRD-0002. The VS Code extension can send three optional context fields with each query.

```mermaid
flowchart TD
    REQ[AgentRequest] --> B64[base64 decode\neditor_content\nselected_text\nextra_context]
    B64 --> STATE[AgentState]

    STATE --> CL[classify node\n_build_classify_prompt]
    CL --> ED{use_editor_context?}

    ED -->|EDITOR| INJ[inject into prompt\n1 selected_text — highest priority\n2 editor_content — file context\n3 RAG chunks\n4 extra_context]
    ED -->|NO_EDITOR| NONINJ[standard prompt\nno editor content injected]

    INJ --> GEN[generation node]
    NONINJ --> GEN
```

**Classifier output format:** Two tokens — `TYPE EDITOR_SIGNAL`

Examples: `RETRIEVAL NO_EDITOR`, `CODE_GENERATION EDITOR`, `PLATFORM NO_EDITOR`

`EDITOR` is set only when the user explicitly refers to the code in their editor: *"this code"*, *"fix this"*, *"que hace esto"* (Spanish: "what does this do"), *"explain this selection"*.
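
The injection order can be sketched as a context assembler. The function name and the section headers inside it are illustrative (the real prompt layout lives in `prompts.py`), but the priority order and the NO_EDITOR behavior follow the diagram above.

```python
def assemble_context(selected_text: str, editor_content: str,
                     rag_chunks: list[str], extra_context: str,
                     use_editor: bool) -> str:
    """Assemble generation context in priority order: selection first,
    then the open file, then retrieved chunks, then caller-supplied
    extra context. Editor fields are skipped entirely on NO_EDITOR.
    """
    parts: list[str] = []
    if use_editor and selected_text:
        parts.append("## Selected code\n" + selected_text)
    if use_editor and editor_content:
        parts.append("## Current file\n" + editor_content)
    if rag_chunks:
        parts.append("## Documentation\n" + "\n---\n".join(rag_chunks))
    if extra_context:
        parts.append("## Additional context\n" + extra_context)
    return "\n\n".join(parts)
```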

---

## 10. Platform Query Pipeline

Implemented in PRD-0003. Queries about account, metrics, usage, or billing bypass RAG entirely.

```mermaid
flowchart TD
    Q[Incoming query] --> FP{_is_platform_query?}
    FP -->|yes — known prefix| SKIP[skip classifier LLM\nroute = PLATFORM]
    FP -->|no| CL[LLM classifier]
    CL -->|PLATFORM| ROUTE[route to respond_platform]
    SKIP --> ROUTE

    ROUTE --> PROMPT["PLATFORM_PROMPT\n+ extra_context injection\n+ user_info available"]
    PROMPT --> LLM[qwen3:0.6b\nconversational model slot]
    LLM --> RESP[response]

    style SKIP fill:#2d6a4f,color:#fff
    style ROUTE fill:#2d6a4f,color:#fff
```

**No Elasticsearch call is made for PLATFORM queries.** The data is already in the request via `extra_context` and `user_info` injected by the caller (AVS Platform).
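
The RC-01 fast path is a plain prefix match that costs no LLM call. The prefix list below is invented for illustration (the real list lives with `_is_platform_query` in the codebase; only the first entry echoes the intent-history example in Section 5), but the shape is the point: normalize, then `startswith` against known platform phrasings.

```python
# Illustrative prefixes only; the real list ships with _is_platform_query.
_PLATFORM_PREFIXES = (
    "you have a project usage",
    "your current plan",
    "your api calls this month",
)


def is_platform_query(query: str) -> bool:
    """RC-01 fast path: a known platform prefix routes straight to
    PLATFORM without spending an LLM call on classification."""
    normalized = query.strip().lower()
    return normalized.startswith(_PLATFORM_PREFIXES)
```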

---

## 11. Session State & Intent History

Two stores are maintained per session in memory:

```mermaid
flowchart LR
    subgraph Stores ["In-memory stores (server.py)"]
        SS["session_store\ndict[session_id → list[BaseMessage]]\nfull conversation history\nused by generation nodes"]
        CHS["classify_history_store\ndict[session_id → list[ClassifyEntry]]\ncompact intent trace\nused by classifier"]
    end

    subgraph Entry ["ClassifyEntry (state.py)"]
        CE["type: str\nRETRIEVAL | CODE_GENERATION\nCONVERSATIONAL | PLATFORM\n─────────────────\ntopic: str\n60-char query snippet"]
    end

    CHS --> CE
```

**`classify_history_store` is also a data flywheel.** Every session generates labeled `(topic, type)` pairs automatically. When sufficient sessions accumulate (~500), this store can be exported to train the Layer 2 embedding classifier described in ADR-0008 Future Path — eliminating the need for the LLM classifier on the majority of requests.
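
The missing export step (see Section 15, tactical debt) could be as small as this sketch. `export_flywheel` is an illustrative name and the JSONL schema (`text`/`label` per line) is one reasonable choice for a classifier training set, not a format that exists in the repo.

```python
import json


def export_flywheel(classify_history_store: dict) -> str:
    """Serialize the accumulated (topic, type) pairs to JSONL: one
    labeled training example per line, ready for the Layer 2
    embedding-classifier training described in ADR-0008."""
    lines = []
    for session_id, entries in classify_history_store.items():
        for entry in entries:
            lines.append(json.dumps({
                "session_id": session_id,
                "text": entry["topic"],
                "label": entry["type"],
            }))
    return "\n".join(lines)
```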

### AgentState fields

```python
from typing import Annotated, TypedDict

from langgraph.graph.message import add_messages


class AgentState(TypedDict):
    # Core
    messages: Annotated[list, add_messages]  # full conversation
    session_id: str
    query_type: str  # RETRIEVAL | CODE_GENERATION | CONVERSATIONAL | PLATFORM
    reformulated_query: str
    context: str  # RAG retrieved context

    # Classifier intent history
    classify_history: list[ClassifyEntry]  # compact trace, persisted across turns

    # Editor context (PRD-0002)
    editor_content: str  # base64 decoded
    selected_text: str  # base64 decoded
    extra_context: str  # base64 decoded
    user_info: str  # JSON: {dev_id, project_id, org_id}
    use_editor_context: bool  # set by classifier
```

(`ClassifyEntry` is defined alongside `AgentState` in `state.py`.)

---

## 12. Evaluation Pipeline

`EvaluateRAG` runs an automated quality benchmark using RAGAS with Claude as the judge. It does not share infrastructure with the main request pipeline.

```mermaid
flowchart TD
    GD[golden_dataset.json\n50 Q&A pairs with ground_truth] --> FILTER[category + limit filter]
    FILTER --> LOOP[for each question]

    LOOP --> RET[retrieve_context\nhybrid BM25+kNN\nsame as main pipeline]
    RET --> GEN[generate_answer\nusing main LLM]
    GEN --> ROW[dataset row\nquestion · answer · contexts · ground_truth]

    ROW --> DS[HuggingFace Dataset]
    DS --> RAGAS[ragas.evaluate\nfaithfulness\nanswer_relevancy\ncontext_recall\ncontext_precision]

    RAGAS --> JUDGE[Claude judge\nclaude-sonnet-4-20250514\nRateLimitedChatAnthropic\n3s delay between calls]
    JUDGE --> SCORES[per-metric scores]
    SCORES --> GLOBAL[global_score = mean of valid metrics]
    GLOBAL --> VERDICT{verdict}
    VERDICT -->|≥ 0.80| EX[EXCELLENT]
    VERDICT -->|≥ 0.60| AC[ACCEPTABLE]
    VERDICT -->|< 0.60| IN[INSUFFICIENT]
```

**Important:** `EvaluateRAG` uses `RateLimitedChatAnthropic` — a subclass that injects a 3-second delay between calls to respect Anthropic API rate limits. RAGAS is configured with `max_workers=1`.
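
The scoring tail of the pipeline in code form; `verdict` is an illustrative name, but the thresholds and the mean-of-valid-metrics rule come straight from the diagram above. RAGAS can emit `NaN` for a metric when the judge fails, which is why invalid scores are filtered before averaging.

```python
import math


def verdict(metric_scores: dict) -> tuple[float, str]:
    """Average the valid (non-NaN, non-None) RAGAS metrics and map the
    global score onto the EXCELLENT / ACCEPTABLE / INSUFFICIENT bands."""
    valid = [s for s in metric_scores.values()
             if s is not None and not math.isnan(s)]
    global_score = sum(valid) / len(valid)
    if global_score >= 0.80:
        return global_score, "EXCELLENT"
    if global_score >= 0.60:
        return global_score, "ACCEPTABLE"
    return global_score, "INSUFFICIENT"
```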

---

## 13. Data Ingestion Pipeline

Two independent pipelines populate the Elasticsearch index. Both require the Ollama tunnel active on `localhost:11434` and Elasticsearch on `localhost:9200`.

```mermaid
flowchart TD
    subgraph PipelineA ["Pipeline A — Chonkie (recommended)"]
        DA[docs/**/*.md\ndocs/**/*.avap] --> CA[Chonkie\nMarkdownChef + TokenChunker]
        CA --> EA[OllamaEmbeddings\nbatch embed]
        EA --> ESA[Elasticsearch bulk index]
    end

    subgraph PipelineB ["Pipeline B — AVAP Native"]
        DB[docs/**/*.avap\ndocs/**/*.md] --> CB[avap_chunker.py\nGenericLexer + LanguageConfig\nblock detection + semantic tags\nMinHash LSH dedup]
        CB --> JSONL[chunks.jsonl]
        JSONL --> EB[avap_ingestor.py\nasync producer/consumer\nOllamaAsyncEmbedder\nbatch=8]
        EB --> ESB[Elasticsearch async_bulk\nbatch=50\nDeadLetterQueue on failure]
    end

    ESA --> IDX[(avap-knowledge-v2\nElasticsearch index)]
    ESB --> IDX
```

**Pipeline B produces richer metadata** — block type, section, semantic tags (`uses_orm`, `uses_http`, `uses_auth`, etc.), complexity score, and MinHash deduplication. Use it for `.avap` files that need full semantic analysis.
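
Both batch sizes in Pipeline B (embed at 8, bulk-index at 50) rely on the same grouping primitive, sketched here; `batched` is an illustrative helper (Python 3.12's `itertools.batched` provides similar behavior). Because it consumes a stream lazily, it works unchanged inside an async producer/consumer loop.

```python
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[list[T]]:
    """Group a chunk stream into fixed-size batches, e.g. size=8 for
    embedding calls and size=50 for Elasticsearch bulk indexing."""
    batch: list[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch
```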

---

## 14. Observability Stack

```mermaid
flowchart LR
    S[server.py] -->|LANGFUSE_HOST\nLANGFUSE_PUBLIC_KEY\nLANGFUSE_SECRET_KEY| LF[Langfuse\n45.77.119.180:80]
    LF --> TR[Traces\nper-request spans]
    LF --> ME[Metrics\nlatency · token counts]
    LF --> EV[Evaluation scores\nfaithfulness · relevancy]
```

Langfuse is accessed directly via public IP — no kubectl tunnel required. The engine sends traces automatically on every request when the Langfuse environment variables are set.

---

## 15. Known Limitations & Future Work

### Active tactical debt

| Item | Description | ADR |
|---|---|---|
| LLM classifier | Generative model doing discriminative work — non-deterministic, pays full inference cost for a 4-class label | ADR-0008 |
| RC-02 is soft | Platform data signal enforced via prompt `<platform_priority_rule>`, not code — can be overridden by model | ADR-0008 |
| `classify_history` not exported | Data flywheel accumulates but has no export mechanism yet | ADR-0008 |
| `user_info` unused | `dev_id`, `project_id`, `org_id` are in state but not consumed by any graph node | PRD-0002 |
| `CONFIDENCE_PROMPT_TEMPLATE` unused | Self-RAG capability is scaffolded in `prompts.py` but not wired into the graph | — |

### Roadmap (ADR-0008 Future Path)

```mermaid
flowchart LR
    P0["Now\nLLM classifier\n~95% of traffic"] --> P1["Phase 1\nExport classify_history_store\nlabeled dataset"]
    P1 --> P2["Phase 2\nEmbedding classifier Layer 2\nbge-m3 + logistic regression\n~1ms CPU"]
    P2 --> P3["Phase 3\nCaller-declared query_type\nproto field 7"]
    P3 --> P4["Phase 4\nLLM classifier = anomaly handler\n<2% of traffic"]
```

Target steady-state: the LLM classifier handles fewer than 2% of requests — only genuinely ambiguous queries that neither hard rules nor the trained embedding classifier can resolve with confidence.
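
To give a feel for Phase 2, here is a deliberately simplified stand-in: a nearest-centroid check over query embeddings with a confidence threshold, in pure Python. The planned Layer 2 is bge-m3 embeddings plus a trained logistic regression, so the centroid approach, the function names, and the 0.75 threshold here are all illustrative. The fallback-to-LLM behavior, however, is exactly the Phase 4 contract: below threshold, return nothing and let the LLM classifier handle the anomaly.

```python
import math


def _cos(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def classify_embedding(query_vec: list[float],
                       centroids: dict[str, list[float]],
                       threshold: float = 0.75):
    """Layer 2 sketch: compare the query embedding against one centroid
    per class. Returns the best label, or None when confidence falls
    below threshold, signalling fallback to the LLM classifier."""
    best_label, best_sim = None, -1.0
    for label, centroid in centroids.items():
        sim = _cos(query_vec, centroid)
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_label if best_sim >= threshold else None
```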
|