# Brunix Assistance Engine — API Reference
> **Protocol:** gRPC (proto3)
> **Port:** `50052` (host) → `50051` (container)
> **Reflection:** Enabled — service introspection available via `grpcurl`
> **Source of truth:** `Docker/protos/brunix.proto`
---
## Table of Contents
1. [Service Definition](#1-service-definition)
2. [Methods](#2-methods)
- [AskAgent](#21-askagent)
- [AskAgentStream](#22-askagentstream)
- [EvaluateRAG](#23-evaluaterag)
3. [Message Types](#3-message-types)
4. [Error Handling](#4-error-handling)
5. [Client Examples](#5-client-examples)
6. [OpenAI-Compatible Proxy](#6-openai-compatible-proxy)
---
## 1. Service Definition
```protobuf
package brunix;
service AssistanceEngine {
  rpc AskAgent (AgentRequest) returns (stream AgentResponse);
  rpc AskAgentStream (AgentRequest) returns (stream AgentResponse);
  rpc EvaluateRAG (EvalRequest) returns (EvalResponse);
}
```
Both `AskAgent` and `AskAgentStream` return a **server-side stream** of `AgentResponse` messages. They differ in how they produce and deliver the response — see [§2.1](#21-askagent) and [§2.2](#22-askagentstream).
---
## 2. Methods
### 2.1 `AskAgent`
**Behaviour:** Runs the full LangGraph pipeline (classify → reformulate → retrieve → generate) using `llm.invoke()`. Returns the complete answer as a **single** `AgentResponse` message with `is_final = true`.
**Use case:** Clients that do not support streaming or need a single atomic response.
**Request:**
```protobuf
message AgentRequest {
  string query = 1;       // The user's question. Required. Max recommended: 4096 chars.
  string session_id = 2;  // Conversation session identifier. Optional.
                          // If empty, defaults to "default" (shared session).
                          // Use a UUID per user/conversation for isolation.
}
```
**Response stream:**
| Message # | `text` | `avap_code` | `is_final` |
|---|---|---|---|
| 1 (only) | Full answer text | `"AVAP-2026"` | `true` |
**Latency characteristics:** Depends on LLM generation time (non-streaming). Typically 3–15 seconds for `qwen2.5:1.5b` on the Devaron cluster.
---
### 2.2 `AskAgentStream`
**Behaviour:** Runs `prepare_graph` (classify → reformulate → retrieve), then calls `llm.stream()` directly. Emits one `AgentResponse` per token from Ollama, followed by a terminal message.
**Use case:** Interactive clients (chat UIs, terminal tools) that need progressive rendering.
**Request:** Same `AgentRequest` as `AskAgent`.
**Response stream:**
| Message # | `text` | `avap_code` | `is_final` |
|---|---|---|---|
| 1…N | Single token | `""` | `false` |
| N+1 (final) | `""` | `""` | `true` |
**Client contract:**
- Accumulate `text` from all messages where `is_final == false` to reconstruct the full answer.
- The `is_final == true` message signals end-of-stream. Its `text` is always empty and should be discarded.
- Do not close the stream early — the engine will fail to persist conversation history if the stream is interrupted.
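The client contract above can be sketched in Python. The `AgentResponse` dataclass here is only a stand-in for the generated `brunix_pb2.AgentResponse` stub, and the fake stream is illustrative — a real client would iterate the response iterator returned by the `AskAgentStream` stub call:

```python
from dataclasses import dataclass

# Stand-in for the generated AgentResponse message (real clients would use
# the brunix_pb2 classes produced from Docker/protos/brunix.proto).
@dataclass
class AgentResponse:
    text: str
    is_final: bool

def collect_answer(stream):
    """Accumulate token messages until the terminal is_final marker."""
    parts = []
    for msg in stream:
        if msg.is_final:
            # Terminal message: its text is always empty; discard it and stop.
            break
        parts.append(msg.text)
    return "".join(parts)

# Simulated stream, mirroring the response table above.
fake_stream = [
    AgentResponse("Hello", False),
    AgentResponse(" world", False),
    AgentResponse("", True),
]
print(collect_answer(fake_stream))  # Hello world
```

The loop runs the stream to completion rather than breaking out after the first tokens, in line with the "do not close the stream early" rule.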
---
### 2.3 `EvaluateRAG`
**Behaviour:** Runs the RAGAS evaluation pipeline against the golden dataset. Uses the production Ollama LLM for answer generation and Claude as the evaluation judge.
> **Requirement:** `ANTHROPIC_API_KEY` must be configured in the environment. This endpoint will return an error response if it is missing.
**Request:**
```protobuf
message EvalRequest {
  string category = 1;  // Optional. Filter golden dataset by category name.
                        // If empty, all categories are evaluated.
  int32 limit = 2;      // Optional. Evaluate only the first N questions.
                        // If 0, all matching questions are evaluated.
  string index = 3;     // Optional. Elasticsearch index to evaluate against.
                        // If empty, uses the server's configured ELASTICSEARCH_INDEX.
}
```
**Response (single, non-streaming):**
```protobuf
message EvalResponse {
  string status = 1;              // "ok" or error description
  int32 questions_evaluated = 2;  // Number of questions actually processed
  float elapsed_seconds = 3;      // Total wall-clock time
  string judge_model = 4;         // Claude model used as judge
  string index = 5;               // Elasticsearch index evaluated
  // RAGAS metric scores (0.0–1.0)
  float faithfulness = 6;
  float answer_relevancy = 7;
  float context_recall = 8;
  float context_precision = 9;
  float global_score = 10;        // Mean of non-zero metric scores
  string verdict = 11;            // "EXCELLENT" | "ACCEPTABLE" | "INSUFFICIENT"
  repeated QuestionDetail details = 12;
}
message QuestionDetail {
  string id = 1;              // Question ID from golden dataset
  string category = 2;        // Question category
  string question = 3;        // Question text
  string answer_preview = 4;  // First 300 chars of generated answer
  int32 n_chunks = 5;         // Number of context chunks retrieved
}
```
**Verdict thresholds:**
| Score | Verdict |
|---|---|
| ≥ 0.80 | `EXCELLENT` |
| ≥ 0.60 | `ACCEPTABLE` |
| < 0.60 | `INSUFFICIENT` |
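The thresholds translate to a simple mapping. This is a client-side sketch for interpreting `global_score` locally; the server computes and returns `verdict` itself:

```python
def verdict(global_score: float) -> str:
    """Map a RAGAS global score (0.0-1.0) to the verdict labels above."""
    if global_score >= 0.80:
        return "EXCELLENT"
    if global_score >= 0.60:
        return "ACCEPTABLE"
    return "INSUFFICIENT"

print(verdict(0.7615))  # ACCEPTABLE
```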
---
## 3. Message Types
### `AgentRequest`
| Field | Type | Required | Description |
|---|---|---|---|
| `query` | `string` | Yes | User's natural language question |
| `session_id` | `string` | No | Conversation identifier for multi-turn context. Use a stable UUID per user session. |
### `AgentResponse`
| Field | Type | Description |
|---|---|---|
| `text` | `string` | Token text (streaming) or full answer text (non-streaming) |
| `avap_code` | `string` | Always `"AVAP-2026"` in the non-streaming response; empty string in streaming token messages |
| `is_final` | `bool` | `true` only on the last message of the stream |
### `EvalRequest`
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `category` | `string` | No | `""` (all) | Filter golden dataset by category |
| `limit` | `int32` | No | `0` (all) | Max questions to evaluate |
| `index` | `string` | No | `$ELASTICSEARCH_INDEX` | ES index to evaluate |
### `EvalResponse`
See full definition in [§2.3](#23-evaluaterag).
---
## 4. Error Handling
The engine catches all exceptions and returns them as terminal `AgentResponse` messages rather than gRPC status errors. This means:
- The stream will **not** be terminated with a non-OK gRPC status code on application-level errors.
- Check for error strings in the `text` field that begin with `[ENG] Error:`.
- The stream will still end with `is_final = true`.
**Example error response:**
```json
{"text": "[ENG] Error: Connection refused connecting to Ollama", "is_final": true}
```
**`EvaluateRAG` error response:**
Returned as a single `EvalResponse` with `status` set to the error description (this particular message is emitted by the server in Spanish; it means "ANTHROPIC_API_KEY not configured in `.env`"):
```json
{"status": "ANTHROPIC_API_KEY no configurada en .env", ...}
```
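Because errors arrive in-band rather than as gRPC status codes, clients need explicit checks on both response types. A minimal sketch of those checks:

```python
ERROR_PREFIX = "[ENG] Error:"

def is_engine_error(text: str) -> bool:
    """Detect an application-level error carried inside a normal AgentResponse."""
    return text.startswith(ERROR_PREFIX)

def eval_failed(status: str) -> bool:
    """EvaluateRAG signals failure via status: anything other than 'ok' is an error."""
    return status != "ok"

print(is_engine_error("[ENG] Error: Connection refused connecting to Ollama"))  # True
print(eval_failed("ok"))  # False
```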
---
## 5. Client Examples
### Introspect the service
```bash
grpcurl -plaintext localhost:50052 list
# Output: brunix.AssistanceEngine
grpcurl -plaintext localhost:50052 describe brunix.AssistanceEngine
```
### `AskAgent` — full response
```bash
grpcurl -plaintext \
  -d '{"query": "What is addVar in AVAP?", "session_id": "dev-001"}' \
  localhost:50052 \
  brunix.AssistanceEngine/AskAgent
```
Expected response:
```json
{
  "text": "addVar is an AVAP command that declares a new variable...",
  "avap_code": "AVAP-2026",
  "is_final": true
}
```
### `AskAgentStream` — token streaming
```bash
grpcurl -plaintext \
  -d '{"query": "Write an AVAP API that returns hello world", "session_id": "dev-001"}' \
  localhost:50052 \
  brunix.AssistanceEngine/AskAgentStream
```
Expected response (truncated):
```json
{"text": "Here", "is_final": false}
{"text": " is", "is_final": false}
{"text": " a", "is_final": false}
...
{"text": "", "is_final": true}
```
### `EvaluateRAG` — run evaluation
```bash
# Evaluate first 10 questions from the "core_syntax" category
grpcurl -plaintext \
  -d '{"category": "core_syntax", "limit": 10}' \
  localhost:50052 \
  brunix.AssistanceEngine/EvaluateRAG
```
Expected response:
```json
{
  "status": "ok",
  "questions_evaluated": 10,
  "elapsed_seconds": 142.3,
  "judge_model": "claude-sonnet-4-20250514",
  "index": "avap-docs-test",
  "faithfulness": 0.8421,
  "answer_relevancy": 0.7913,
  "context_recall": 0.7234,
  "context_precision": 0.6891,
  "global_score": 0.7615,
  "verdict": "ACCEPTABLE",
  "details": [...]
}
```
### Multi-turn conversation example
```bash
# Turn 1
grpcurl -plaintext \
  -d '{"query": "What is registerEndpoint?", "session_id": "user-abc"}' \
  localhost:50052 brunix.AssistanceEngine/AskAgentStream
# Turn 2 — the engine has history from Turn 1
grpcurl -plaintext \
  -d '{"query": "Can you show me an example?", "session_id": "user-abc"}' \
  localhost:50052 brunix.AssistanceEngine/AskAgentStream
```
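The stable-UUID pattern recommended in [§3](#3-message-types) looks like this in Python; the requests are shown as plain dicts standing in for generated-stub `AgentRequest` messages:

```python
import uuid

# Generate one session_id per conversation and reuse it on every turn,
# so the engine threads history across requests.
session_id = str(uuid.uuid4())

turn1 = {"query": "What is registerEndpoint?", "session_id": session_id}
turn2 = {"query": "Can you show me an example?", "session_id": session_id}
assert turn1["session_id"] == turn2["session_id"]
```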
### Regenerate gRPC stubs after modifying `brunix.proto`
```bash
python -m grpc_tools.protoc \
  -I./Docker/protos \
  --python_out=./Docker/src \
  --grpc_python_out=./Docker/src \
  ./Docker/protos/brunix.proto
```
---
## 6. OpenAI-Compatible Proxy
The container also exposes an HTTP server on port `8000` (`openai_proxy.py`) that wraps `AskAgentStream` under an OpenAI-compatible endpoint. This allows integration with any tool that supports the OpenAI Chat Completions API.
**Base URL:** `http://localhost:8000`
### `POST /v1/chat/completions`
**Request body:**
```json
{
  "model": "brunix",
  "messages": [
    {"role": "user", "content": "What is addVar in AVAP?"}
  ],
  "stream": true
```
**Notes:**
- The `model` field is ignored; the engine always uses the configured `OLLAMA_MODEL_NAME`.
- Session management is handled internally by the proxy. Conversation continuity across separate HTTP requests is not guaranteed.
- Only `stream: true` is fully supported. Non-streaming mode may be available but is not the primary use case.
**Example with curl:**
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "brunix",
    "messages": [{"role": "user", "content": "Explain AVAP loops"}],
    "stream": true
  }'
```