Brunix Assistance Engine — Architecture Reference

Audience: Engineers contributing to this repository, architects reviewing the system design, and operators responsible for its deployment.
Last updated: 2026-03-18
Version: 1.5.x


Table of Contents

  1. System Overview
  2. Component Inventory
  3. Request Lifecycle
  4. LangGraph Workflow
  5. RAG Pipeline — Hybrid Search
  6. Streaming Architecture (AskAgentStream)
  7. Evaluation Pipeline (EvaluateRAG)
  8. Data Ingestion Pipeline
  9. Infrastructure Layout
  10. Session State & Conversation Memory
  11. Observability Stack
  12. Security Boundaries
  13. Known Limitations & Future Work

1. System Overview

The Brunix Assistance Engine is a stateful, streaming-capable AI service that answers questions about the AVAP programming language. It combines:

  • gRPC as the primary communication interface (port 50051 inside container, 50052 on host)
  • LangGraph for deterministic, multi-step agentic orchestration
  • Hybrid RAG (BM25 + kNN with RRF fusion) over an Elasticsearch vector index
  • Ollama as the local LLM and embedding backend
  • RAGAS + Claude as the automated evaluation judge

A secondary OpenAI-compatible HTTP proxy (port 8000) is served via FastAPI/Uvicorn, enabling integration with tools that expect the OpenAI API format.
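For illustration, a minimal request body in the OpenAI chat-completions format that such a client would send to the proxy (the model name "brunix" is an assumption; the proxy forwards the conversation to AskAgentStream internally):

```python
import json

# Hypothetical request body for the proxy's /v1/chat/completions endpoint.
# The model name is illustrative; the proxy wraps AskAgentStream regardless.
payload = {
    "model": "brunix",
    "stream": True,  # token streaming, mirroring AskAgentStream
    "messages": [
        {"role": "user", "content": "How do I declare a variable in AVAP?"},
    ],
}

body = json.dumps(payload)  # what goes over the wire as the POST body
```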

┌─────────────────────────────────────────────────────────────┐
│                        External Clients                     │
│    grpcurl / App SDK        │     OpenAI-compatible client  │
└────────────┬────────────────┴──────────────┬────────────────┘
             │ gRPC :50052                    │ HTTP :8000
             ▼                                ▼
┌────────────────────────────────────────────────────────────┐
│                   Docker Container                         │
│                                                            │
│  ┌─────────────────────┐    ┌──────────────────────────┐  │
│  │   server.py (gRPC)  │    │  openai_proxy.py (HTTP)  │  │
│  │   BrunixEngine      │    │  FastAPI / Uvicorn       │  │
│  └──────────┬──────────┘    └──────────────────────────┘  │
│             │                                              │
│  ┌──────────▼──────────────────────────────────────────┐  │
│  │               LangGraph Orchestration                │  │
│  │   classify → reformulate → retrieve → generate      │  │
│  └──────────────────────────┬───────────────────────────┘  │
│                             │                              │
│         ┌───────────────────┼────────────────────┐        │
│         ▼                   ▼                    ▼        │
│   Ollama (LLM)       Ollama (Embed)        Elasticsearch  │
│   via tunnel         via tunnel            via tunnel     │
└────────────────────────────────────────────────────────────┘
             │ kubectl port-forward tunnels │
             ▼                             ▼
      Devaron Cluster (Vultr Kubernetes)
      ollama-light-service:11434   brunix-vector-db:9200
      brunix-postgres:5432         Langfuse UI

2. Component Inventory

| Component | File / Service | Responsibility |
|---|---|---|
| gRPC Server | Docker/src/server.py | Entry point. Implements the AssistanceEngine servicer. Initializes LLM, embeddings, ES client, and both graphs. |
| Full Graph | Docker/src/graph.py: build_graph() | Complete workflow: classify → reformulate → retrieve → generate. Used by AskAgent and EvaluateRAG. |
| Prepare Graph | Docker/src/graph.py: build_prepare_graph() | Partial workflow: classify → reformulate → retrieve. Does not call the LLM for generation. Used by AskAgentStream to enable manual token streaming. |
| Message Builder | Docker/src/graph.py: build_final_messages() | Reconstructs the final prompt list from prepared state for llm.stream(). |
| Prompt Library | Docker/src/prompts.py | Centralized definitions for CLASSIFY, REFORMULATE, GENERATE, CODE_GENERATION, and CONVERSATIONAL prompts. |
| Agent State | Docker/src/state.py | AgentState TypedDict shared across all graph nodes. |
| Evaluation Suite | Docker/src/evaluate.py | RAGAS-based pipeline. Uses the production retriever + Ollama LLM for generation, and Claude as the impartial judge. |
| OpenAI Proxy | Docker/src/openai_proxy.py | FastAPI application that wraps AskAgentStream under a /v1/chat/completions endpoint. |
| LLM Factory | Docker/src/utils/llm_factory.py | Provider-agnostic factory for chat models (Ollama, AWS Bedrock). |
| Embedding Factory | Docker/src/utils/emb_factory.py | Provider-agnostic factory for embedding models (Ollama, HuggingFace). |
| Ingestion Pipeline | scripts/pipelines/flows/elasticsearch_ingestion.py | Chunks and ingests AVAP documents into Elasticsearch with embeddings. |
| Dataset Generator | scripts/pipelines/flows/generate_mbap.py | Generates synthetic MBPP-style AVAP problems using Claude. |
| MBPP Translator | scripts/pipelines/flows/translate_mbpp.py | Translates the MBPP Python dataset into AVAP equivalents. |

3. Request Lifecycle

3.1 AskAgent (non-streaming)

Client → gRPC AgentRequest{query, session_id}
  │
  ├─ Load conversation history from session_store[session_id]
  ├─ Build initial_state = {messages: history + [user_msg], ...}
  │
  └─ graph.invoke(initial_state)
       ├─ classify    → query_type ∈ {RETRIEVAL, CODE_GENERATION, CONVERSATIONAL}
       ├─ reformulate → reformulated_query (keyword-optimized for semantic search)
       ├─ retrieve    → context (top-8 hybrid RRF chunks from Elasticsearch)
       └─ generate    → final AIMessage (llm.invoke)
  │
  ├─ Persist updated history to session_store[session_id]
  └─ yield AgentResponse{text, avap_code="AVAP-2026", is_final=True}

3.2 AskAgentStream (token streaming)

Client → gRPC AgentRequest{query, session_id}
  │
  ├─ Load history from session_store[session_id]
  ├─ Build initial_state
  │
  ├─ prepare_graph.invoke(initial_state)   ← Phase 1: no LLM generation
  │    ├─ classify
  │    ├─ reformulate
  │    └─ retrieve (or skip_retrieve if CONVERSATIONAL)
  │
  ├─ build_final_messages(prepared_state)  ← Reconstruct prompt list
  │
  └─ for chunk in llm.stream(final_messages):
       └─ yield AgentResponse{text=token, is_final=False}
  │
  ├─ Persist full assembled response to session_store
  └─ yield AgentResponse{text="", is_final=True}

3.3 EvaluateRAG

Client → gRPC EvalRequest{category?, limit?, index?}
  │
  └─ evaluate.run_evaluation(...)
       ├─ Load golden_dataset.json
       ├─ Filter by category / limit
       ├─ For each question:
       │    ├─ retrieve_context (hybrid BM25+kNN, same as production)
       │    └─ generate_answer (Ollama LLM + GENERATE_PROMPT)
       ├─ Build RAGAS Dataset
       ├─ Run RAGAS metrics with Claude as judge:
       │    faithfulness / answer_relevancy / context_recall / context_precision
       └─ Compute global_score + verdict (EXCELLENT / ACCEPTABLE / INSUFFICIENT)
  │
  └─ return EvalResponse{scores, global_score, verdict, details[]}

4. LangGraph Workflow

4.1 Full Graph (build_graph)

                    ┌─────────────┐
                    │   classify  │
                    └──────┬──────┘
                           │
          ┌────────────────┼──────────────────┐
          ▼                ▼                   ▼
    RETRIEVAL        CODE_GENERATION     CONVERSATIONAL
          │                │                   │
          └────────┬───────┘                   │
                   ▼                           ▼
            ┌──────────────┐        ┌────────────────────────┐
            │  reformulate │        │  respond_conversational │
            └──────┬───────┘        └───────────┬────────────┘
                   ▼                            │
            ┌──────────────┐                   │
            │   retrieve   │                   │
            └──────┬───────┘                   │
                   │                           │
          ┌────────┴───────────┐               │
          ▼                    ▼               │
    ┌──────────┐      ┌───────────────┐        │
    │ generate │      │ generate_code │        │
    └────┬─────┘      └───────┬───────┘        │
         │                    │                │
         └────────────────────┴────────────────┘
                              │
                             END

4.2 Prepare Graph (build_prepare_graph)

Identical routing for classify, but generation nodes are replaced by END. The CONVERSATIONAL branch uses skip_retrieve (returns empty context without querying Elasticsearch).

4.3 Query Type Routing

| query_type | Triggers retrieve? | Generation prompt |
|---|---|---|
| RETRIEVAL | Yes | GENERATE_PROMPT (explanation-focused) |
| CODE_GENERATION | Yes | CODE_GENERATION_PROMPT (code-focused, returns AVAP blocks) |
| CONVERSATIONAL | No | CONVERSATIONAL_PROMPT (reformulation of prior answer) |
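The conditional routing can be sketched as a plain function (an illustrative sketch, not the repository's actual conditional-edge code; node names mirror the graph diagram):

```python
def route_after_classify(query_type: str) -> str:
    """Map the classifier's decision to the next graph node.

    RETRIEVAL and CODE_GENERATION both pass through reformulate -> retrieve;
    CONVERSATIONAL skips retrieval entirely.
    """
    if query_type in ("RETRIEVAL", "CODE_GENERATION"):
        return "reformulate"
    if query_type == "CONVERSATIONAL":
        return "respond_conversational"
    raise ValueError(f"unknown query_type: {query_type}")
```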

5. RAG Pipeline — Hybrid Search

The retrieval system (hybrid_search_native) fuses BM25 lexical search and kNN dense vector search using Reciprocal Rank Fusion (RRF).

User query
    │
    ├─ embeddings.embed_query(query) → query_vector [768-dim]
    │
    ├─ ES multi_match (BM25) on fields [content^2, text^2]
    │    └─ top-k BM25 hits
    │
    └─ ES knn on field [embedding], num_candidates = k×5
         └─ top-k kNN hits
    │
    ├─ RRF fusion: score(doc) = Σ 1/(rank + 60)
    │
    └─ Top-8 documents → format_context() → context string

RRF constant: 60 (standard value; prevents high-rank documents from dominating while still rewarding consensus between both retrieval modes).
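The fusion step can be sketched in a few lines (illustrative only; hybrid_search_native operates on Elasticsearch hit objects rather than bare ID lists):

```python
def rrf_fuse(bm25_ids: list, knn_ids: list, k: int = 60, top_n: int = 8) -> list:
    """Reciprocal Rank Fusion over two ranked result lists.

    score(doc) = sum over both lists of 1 / (rank + k), with rank starting
    at 1 and k = 60, the standard constant noted above.
    """
    scores: dict = {}
    for ranked in (bm25_ids, knn_ids):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank + k)
    # Highest fused score first; keep the top_n documents for the context.
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

A document ranked well by both BM25 and kNN accumulates two reciprocal-rank terms, so consensus hits rise to the top without any single list dominating.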

Chunk metadata attached to each retrieved document:

| Field | Description |
|---|---|
| chunk_id | Unique identifier within the index |
| source_file | Origin document filename |
| doc_type | prose, code, code_example, bnf |
| block_type | AVAP block type: function, if, startLoop, try |
| section | Document section/chapter heading |

Documents of type code, code_example, bnf, or block type function / if / startLoop / try are tagged as [AVAP CODE] in the formatted context, signaling the LLM to treat them as executable syntax rather than prose.
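The tagging rule can be sketched as follows (field names come from the metadata table above; the "[DOC]" label for non-code chunks is an assumption, and the actual format_context() implementation may differ):

```python
CODE_DOC_TYPES = {"code", "code_example", "bnf"}
CODE_BLOCK_TYPES = {"function", "if", "startLoop", "try"}

def tag_chunk(meta: dict) -> str:
    """Return the context label for one retrieved chunk.

    Chunks whose doc_type or block_type marks them as code are tagged
    [AVAP CODE]; the prose label "[DOC]" here is illustrative.
    """
    if (meta.get("doc_type") in CODE_DOC_TYPES
            or meta.get("block_type") in CODE_BLOCK_TYPES):
        return "[AVAP CODE]"
    return "[DOC]"
```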


6. Streaming Architecture (AskAgentStream)

The two-phase streaming design is critical to understand:

Why not stream through LangGraph?
LangGraph's stream() method yields full state snapshots per node, not individual tokens. To achieve true per-token streaming to the gRPC client, the generation step is deliberately extracted from the graph and called directly via llm.stream().

Phase 1 — Deterministic preparation (graph-managed):

  • Classification, query reformulation, and retrieval run through prepare_graph.invoke().
  • This phase runs synchronously and produces the complete context before any token is emitted to the client.

Phase 2 — Token streaming (manual):

  • build_final_messages() reconstructs the exact prompt that generate / generate_code / respond_conversational would have used.
  • llm.stream(final_messages) yields one AIMessageChunk per token from Ollama.
  • Each token is immediately forwarded to the gRPC client as AgentResponse{text=token, is_final=False}.
  • After the stream ends, the full assembled text is persisted to session_store.
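Phase 2 can be sketched with a stand-in for llm.stream() (illustrative; the real code builds final_messages from the prepared graph state and yields protobuf AgentResponse messages):

```python
def stream_response(final_messages, llm_stream, session_history: list):
    """Forward tokens as they arrive, then persist the assembled answer.

    llm_stream is any callable yielding string chunks, standing in for
    llm.stream(final_messages).
    """
    parts = []
    for chunk in llm_stream(final_messages):
        parts.append(chunk)
        # One AgentResponse per token, forwarded immediately to the client.
        yield {"text": chunk, "is_final": False}
    # Persist the full assembled response before signaling completion.
    session_history.append("".join(parts))
    yield {"text": "", "is_final": True}
```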

Backpressure: gRPC streaming is flow-controlled by the client. If the client stops reading, the Ollama token stream will block at the yield point. No explicit buffer overflow protection is implemented (acceptable for the current single-client dev mode).


7. Evaluation Pipeline (EvaluateRAG)

The evaluation suite implements an offline RAG evaluation pattern using RAGAS metrics.

Judge model separation

The production LLM (Ollama qwen2.5:1.5b) is used for answer generation — the same pipeline as production to measure real-world quality. Claude (claude-sonnet-4-20250514) is used as the evaluation judge — an independent, high-capability model that scores the generated answers against ground truth.

RAGAS metrics

| Metric | Measures | Input |
|---|---|---|
| faithfulness | Are claims in the answer supported by the retrieved context? | answer + contexts |
| answer_relevancy | Is the answer relevant to the question? | answer + question |
| context_recall | Does the retrieved context cover the ground truth? | contexts + ground_truth |
| context_precision | Are the retrieved chunks useful (signal-to-noise)? | contexts + ground_truth |

Global score & verdict

global_score = mean(non-zero metric scores)

verdict:
  ≥ 0.80 → EXCELLENT
  ≥ 0.60 → ACCEPTABLE
  < 0.60 → INSUFFICIENT
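The scoring rule above translates directly (a sketch; evaluate.py may structure this differently):

```python
def global_verdict(scores: dict) -> tuple:
    """Mean of the non-zero metric scores, mapped to a verdict string."""
    nonzero = [v for v in scores.values() if v > 0]
    g = sum(nonzero) / len(nonzero) if nonzero else 0.0
    if g >= 0.80:
        return g, "EXCELLENT"
    if g >= 0.60:
        return g, "ACCEPTABLE"
    return g, "INSUFFICIENT"
```

Note that a metric scoring exactly zero (e.g. a failed judge call) is excluded from the mean rather than dragging the global score down.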

Golden dataset

Located at Docker/src/golden_dataset.json. Each entry follows this schema:

{
  "id": "avap-001",
  "category": "core_syntax",
  "question": "How do you declare a variable in AVAP?",
  "ground_truth": "Use addVar to declare a variable..."
}
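The optional category / limit filters carried by EvalRequest can be sketched as (illustrative; the actual filtering logic in evaluate.py may differ):

```python
def filter_entries(entries: list, category=None, limit=None) -> list:
    """Apply the optional EvalRequest filters to the golden dataset.

    Entries are dicts following the schema above (id, category,
    question, ground_truth).
    """
    if category is not None:
        entries = [e for e in entries if e.get("category") == category]
    if limit is not None:
        entries = entries[:limit]
    return entries
```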

8. Data Ingestion Pipeline

Documents flow into the Elasticsearch index through two paths:

Path A — AVAP documentation (structured markdown)

docs/LRM/avap.md
docs/avap_language_github_docs/*.md
docs/developer.avapframework.com/*.md
        │
        ▼
scripts/pipelines/flows/elasticsearch_ingestion.py
        │
        ├─ Load markdown files
        ├─ Chunk using scripts/pipelines/tasks/chunk.py
        │    (semantic chunking via Chonkie library)
        ├─ Generate embeddings via scripts/pipelines/tasks/embeddings.py
        │    (Ollama or HuggingFace embedding model)
        └─ Bulk index into Elasticsearch
             index: avap-docs-* (configurable via ELASTICSEARCH_INDEX)
             mapping: {content, embedding, source_file, doc_type, section, ...}
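Each bulk-indexed document roughly follows the mapping above (a sketch with placeholder values; the field list is abbreviated, and the 768-dim embedding size comes from Section 5):

```python
# Illustrative shape of one document sent to the bulk indexer.
# Field names mirror the mapping sketch; values are placeholders.
doc = {
    "content": "addVar declares a variable ...",   # chunk text from Chonkie
    "embedding": [0.0] * 768,                      # dense vector from the embedding model
    "source_file": "docs/LRM/avap.md",
    "doc_type": "prose",                           # prose | code | code_example | bnf
    "section": "Variables",                        # section/chapter heading
}
```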

Path B — Synthetic AVAP code samples

docs/samples/*.avap
        │
        ▼
scripts/pipelines/flows/generate_mbap.py
        │
        ├─ Read AVAP LRM (docs/LRM/avap.md)
        ├─ Call Claude API to generate MBPP-style problems
        └─ Output synthetic_datasets/mbpp_avap.json
             (used for fine-tuning and few-shot examples)

9. Infrastructure Layout

Devaron Cluster (Vultr Kubernetes)

| Service | K8s Name | Port | Purpose |
|---|---|---|---|
| LLM inference | ollama-light-service | 11434 | Text generation + embeddings |
| Vector database | brunix-vector-db | 9200 | Elasticsearch 8.x |
| Observability DB | brunix-postgres | 5432 | PostgreSQL for Langfuse |
| Langfuse UI | | 80 | http://45.77.119.180 |

Kubernetes tunnel commands

# Terminal 1 — LLM
kubectl port-forward --address 0.0.0.0 svc/ollama-light-service 11434:11434 \
  -n brunix --kubeconfig ./kubernetes/kubeconfig.yaml

# Terminal 2 — Elasticsearch
kubectl port-forward --address 0.0.0.0 svc/brunix-vector-db 9200:9200 \
  -n brunix --kubeconfig ./kubernetes/kubeconfig.yaml

# Terminal 3 — PostgreSQL (Langfuse)
kubectl port-forward --address 0.0.0.0 svc/brunix-postgres 5432:5432 \
  -n brunix --kubeconfig ./kubernetes/kubeconfig.yaml

Port map summary

| Port | Protocol | Service | Scope |
|---|---|---|---|
| 50051 | gRPC | Brunix Engine (inside container) | Internal |
| 50052 | gRPC | Brunix Engine (host-mapped) | External |
| 8000 | HTTP | OpenAI proxy | External |
| 11434 | HTTP | Ollama (via tunnel) | Tunnel |
| 9200 | HTTP | Elasticsearch (via tunnel) | Tunnel |
| 5432 | TCP | PostgreSQL/Langfuse (via tunnel) | Tunnel |

10. Session State & Conversation Memory

Conversation history is managed via an in-process dictionary:

session_store: dict[str, list] = defaultdict(list)
# key: session_id (string, provided by client)
# value: list of LangChain BaseMessage objects

Characteristics:

  • In-memory only. History is lost on container restart.
  • No TTL or eviction. Sessions grow unbounded for the lifetime of the process.
  • Thread safety: Python's GIL provides basic safety for the ThreadPoolExecutor(max_workers=10) gRPC server, but concurrent writes to the same session_id from two simultaneous requests are not explicitly protected.
  • History window: format_history_for_classify() uses only the last 6 messages for query classification to keep the classify prompt short and deterministic.
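A minimal sketch of the store plus the 6-message classification window (the lock is added here only to illustrate the missing write protection noted above; the current code relies on the GIL):

```python
from collections import defaultdict
from threading import Lock

# Same declaration as in the source; key is the client-provided session_id.
session_store: dict[str, list] = defaultdict(list)
_store_lock = Lock()  # illustrative: not present in the current implementation

def append_turn(session_id: str, message) -> None:
    """Persist one message, guarding against concurrent writers."""
    with _store_lock:
        session_store[session_id].append(message)

def classify_window(session_id: str, n: int = 6) -> list:
    """Last n messages, mirroring format_history_for_classify()."""
    return session_store[session_id][-n:]
```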

Future work: Replace session_store with a Redis-backed persistent store to survive restarts and support horizontal scaling.


11. Observability Stack

Langfuse tracing

The server integrates Langfuse for end-to-end LLM tracing. Every AskAgent / AskAgentStream request creates a trace that captures:

  • Input query and session ID
  • Each LangGraph node execution (classify, reformulate, retrieve, generate)
  • LLM token counts, latency, and cost
  • Final response

Access: http://45.77.119.180 — requires a project API key configured via LANGFUSE_PUBLIC_KEY and LANGFUSE_SECRET_KEY.

Logging

Structured logging via Python's logging module, configured at INFO level. Log format:

[MODULE] context_info — key=value key=value
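The marker convention can be produced with plain logging calls (a sketch; the exact formatter configuration in the codebase is an assumption):

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("brunix")

def log_marker(marker: str, context: str, **kv) -> str:
    """Emit one line in the documented [MODULE] context — key=value style."""
    line = f"[{marker}] {context} — " + " ".join(f"{k}={v}" for k, v in kv.items())
    log.info(line)
    return line
```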

Key log markers:

| Marker | Module | Meaning |
|---|---|---|
| [ESEARCH] | server.py | Elasticsearch connection status |
| [classify] | graph.py | Query type decision + raw LLM output |
| [reformulate] | graph.py | Reformulated query string |
| [hybrid] | graph.py | BM25 / kNN hit counts and RRF result count |
| [retrieve] | graph.py | Number of docs retrieved and context length |
| [generate] | graph.py | Response character count |
| [AskAgentStream] | server.py | Token count and total chars per stream |
| [eval] | evaluate.py | Per-question retrieval and generation status |

12. Security Boundaries

| Boundary | Current state | Risk |
|---|---|---|
| gRPC transport | Insecure (add_insecure_port) | Network interception possible. Acceptable in dev/tunnel setup; requires mTLS for production. |
| Elasticsearch auth | Optional (user/pass or API key via env vars) | Index is accessible without auth if ELASTICSEARCH_USER and ELASTICSEARCH_API_KEY are unset. |
| Container user | Non-root (python:3.11-slim default) | Low risk. Do not override with root. |
| Secrets in env | Via .env / docker-compose env injection | Never commit real values. See CONTRIBUTING.md. |
| Session store | In-memory, no auth | Any caller with access to the gRPC port can read/write any session by guessing its ID. |
| Kubeconfig | ./kubernetes/kubeconfig.yaml (local only) | Grants cluster access. Never commit. Listed in .gitignore. |

13. Known Limitations & Future Work

| Area | Limitation | Proposed solution |
|---|---|---|
| Session persistence | In-memory, lost on restart | Redis-backed session_store |
| Horizontal scaling | session_store is per-process | Sticky sessions or external session store |
| gRPC security | Insecure port | Add TLS + optional mTLS |
| Elasticsearch auth | Not enforced if vars unset | Make auth required; fail fast on startup |
| Context window | Full history passed to generate; no truncation | Sliding window or summarization for long sessions |
| Evaluation | Golden dataset must be manually maintained | Automated golden dataset refresh pipeline |
| Rate limiting | None on gRPC server | Add interceptor-based rate limiter |
| Health check | No gRPC health protocol | Implement grpc.health.v1 |