# Brunix Assistance Engine — API Reference
> **Protocol:** gRPC (proto3)
> **Port:** `50052` (host) → `50051` (container)
> **Reflection:** Enabled — service introspection available via `grpcurl`
> **Source of truth:** `Docker/protos/brunix.proto`
---
## Table of Contents
1. [Service Definition](#1-service-definition)
2. [Methods](#2-methods)
- [AskAgent](#21-askagent)
- [AskAgentStream](#22-askagentstream)
- [EvaluateRAG](#23-evaluaterag)
3. [Message Types](#3-message-types)
4. [Error Handling](#4-error-handling)
5. [Client Examples](#5-client-examples)
6. [OpenAI-Compatible Proxy](#6-openai-compatible-proxy)
---
## 1. Service Definition
```protobuf
package brunix;
service AssistanceEngine {
  rpc AskAgent (AgentRequest) returns (stream AgentResponse);
  rpc AskAgentStream (AgentRequest) returns (stream AgentResponse);
  rpc EvaluateRAG (EvalRequest) returns (EvalResponse);
}
```
Both `AskAgent` and `AskAgentStream` return a **server-side stream** of `AgentResponse` messages. They differ in how they produce and deliver the response — see [§2.1](#21-askagent) and [§2.2](#22-askagentstream).
---
## 2. Methods
### 2.1 `AskAgent`
**Behaviour:** Runs the full LangGraph pipeline (classify → reformulate → retrieve → generate) using `llm.invoke()`. Returns the complete answer as a **single** `AgentResponse` message with `is_final = true`.
**Use case:** Clients that do not support streaming or need a single atomic response.
**Request:**
```protobuf
message AgentRequest {
  string query = 1;       // The user's question. Required. Max recommended: 4096 chars.
  string session_id = 2;  // Conversation session identifier. Optional.
                          // If empty, defaults to "default" (shared session).
                          // Use a UUID per user/conversation for isolation.
}
```
**Response stream:**
| Message # | `text` | `avap_code` | `is_final` |
|---|---|---|---|
| 1 (only) | Full answer text | `"AVAP-2026"` | `true` |
**Latency characteristics:** Depends on LLM generation time (non-streaming). Typically 3–15 seconds for `qwen2.5:1.5b` on the Devaron cluster.
---
### 2.2 `AskAgentStream`
**Behaviour:** Runs `prepare_graph` (classify → reformulate → retrieve), then calls `llm.stream()` directly. Emits one `AgentResponse` per token from Ollama, followed by a terminal message.
**Use case:** Interactive clients (chat UIs, terminal tools) that need progressive rendering.
**Request:** Same `AgentRequest` as `AskAgent`.
**Response stream:**
| Message # | `text` | `avap_code` | `is_final` |
|---|---|---|---|
| 1…N | Single token | `""` | `false` |
| N+1 (final) | `""` | `""` | `true` |
**Client contract:**
- Accumulate `text` from all messages where `is_final == false` to reconstruct the full answer.
- The `is_final == true` message signals end-of-stream. Its `text` is always empty and should be discarded.
- Do not close the stream early — the engine will fail to persist conversation history if the stream is interrupted.
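The client contract above can be sketched in Python. The `AgentResponse` dataclass here is only a stand-in for the generated `brunix_pb2.AgentResponse` stub, and the fake stream is illustrative — a real client would iterate the response iterator returned by the `AskAgentStream` stub call:

```python
from dataclasses import dataclass

# Stand-in for the generated AgentResponse message (real clients would use
# the brunix_pb2 classes produced from Docker/protos/brunix.proto).
@dataclass
class AgentResponse:
    text: str
    is_final: bool

def collect_answer(stream):
    """Accumulate token messages until the terminal is_final marker."""
    parts = []
    for msg in stream:
        if msg.is_final:
            # Terminal message: its text is always empty; discard it and stop.
            break
        parts.append(msg.text)
    return "".join(parts)

# Simulated stream, mirroring the response table above.
fake_stream = [
    AgentResponse("Hello", False),
    AgentResponse(" world", False),
    AgentResponse("", True),
]
print(collect_answer(fake_stream))  # Hello world
```

The loop runs the stream to completion rather than breaking out after the first tokens, in line with the "do not close the stream early" rule.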
---
### 2.3 `EvaluateRAG`
**Behaviour:** Runs the RAGAS evaluation pipeline against the golden dataset. Uses the production Ollama LLM for answer generation and Claude as the evaluation judge.
> **Requirement:** `ANTHROPIC_API_KEY` must be configured in the environment. This endpoint will return an error response if it is missing.
**Request:**
```protobuf
message EvalRequest {
  string category = 1;  // Optional. Filter golden dataset by category name.
                        // If empty, all categories are evaluated.
  int32 limit = 2;      // Optional. Evaluate only the first N questions.
                        // If 0, all matching questions are evaluated.
  string index = 3;     // Optional. Elasticsearch index to evaluate against.
                        // If empty, uses the server's configured ELASTICSEARCH_INDEX.
}
```
**Response (single, non-streaming):**
```protobuf
message EvalResponse {
  string status = 1;              // "ok" or error description
  int32 questions_evaluated = 2;  // Number of questions actually processed
  float elapsed_seconds = 3;      // Total wall-clock time
  string judge_model = 4;         // Claude model used as judge
  string index = 5;               // Elasticsearch index evaluated
  // RAGAS metric scores (0.0–1.0)
  float faithfulness = 6;
  float answer_relevancy = 7;
  float context_recall = 8;
  float context_precision = 9;
  float global_score = 10;        // Mean of non-zero metric scores
  string verdict = 11;            // "EXCELLENT" | "ACCEPTABLE" | "INSUFFICIENT"
  repeated QuestionDetail details = 12;
}
message QuestionDetail {
  string id = 1;              // Question ID from golden dataset
  string category = 2;        // Question category
  string question = 3;        // Question text
  string answer_preview = 4;  // First 300 chars of generated answer
  int32 n_chunks = 5;         // Number of context chunks retrieved
}
```
**Verdict thresholds:**
| Score | Verdict |
|---|---|
| ≥ 0.80 | `EXCELLENT` |
| ≥ 0.60 | `ACCEPTABLE` |
| < 0.60 | `INSUFFICIENT` |
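The thresholds translate to a simple mapping. This is a client-side sketch for interpreting `global_score` locally; the server computes and returns `verdict` itself:

```python
def verdict(global_score: float) -> str:
    """Map a RAGAS global score (0.0-1.0) to the verdict labels above."""
    if global_score >= 0.80:
        return "EXCELLENT"
    if global_score >= 0.60:
        return "ACCEPTABLE"
    return "INSUFFICIENT"

print(verdict(0.7615))  # ACCEPTABLE
```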
---
## 3. Message Types
### `AgentRequest`
| Field | Type | Required | Description |
|---|---|---|---|
| `query` | `string` | Yes | User's natural language question |
| `session_id` | `string` | No | Conversation identifier for multi-turn context. Use a stable UUID per user session. |
### `AgentResponse`
| Field | Type | Description |
|---|---|---|
| `text` | `string` | Token text (streaming) or full answer text (non-streaming) |
| `avap_code` | `string` | Always `"AVAP-2026"` in the non-streaming response; empty string in streaming token messages |
| `is_final` | `bool` | `true` only on the last message of the stream |
### `EvalRequest`
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `category` | `string` | No | `""` (all) | Filter golden dataset by category |
| `limit` | `int32` | No | `0` (all) | Max questions to evaluate |
| `index` | `string` | No | `$ELASTICSEARCH_INDEX` | ES index to evaluate |
### `EvalResponse`
See full definition in [§2.3](#23-evaluaterag).
---
## 4. Error Handling
The engine catches all exceptions and returns them as terminal `AgentResponse` messages rather than gRPC status errors. This means:
- The stream will **not** be terminated with a non-OK gRPC status code on application-level errors.
- Check for error strings in the `text` field that begin with `[ENG] Error:`.
- The stream will still end with `is_final = true`.
**Example error response:**
```json
{"text": "[ENG] Error: Connection refused connecting to Ollama", "is_final": true}
```
**`EvaluateRAG` error response:**
Returned as a single `EvalResponse` with `status` set to the error description (this particular message is emitted by the server in Spanish; it means "ANTHROPIC_API_KEY not configured in `.env`"):
```json
{"status": "ANTHROPIC_API_KEY no configurada en .env", ...}
```
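Because errors arrive in-band rather than as gRPC status codes, clients need explicit checks on both response types. A minimal sketch of those checks:

```python
ERROR_PREFIX = "[ENG] Error:"

def is_engine_error(text: str) -> bool:
    """Detect an application-level error carried inside a normal AgentResponse."""
    return text.startswith(ERROR_PREFIX)

def eval_failed(status: str) -> bool:
    """EvaluateRAG signals failure via status: anything other than 'ok' is an error."""
    return status != "ok"

print(is_engine_error("[ENG] Error: Connection refused connecting to Ollama"))  # True
print(eval_failed("ok"))  # False
```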
---
## 5. Client Examples
### Introspect the service
```bash
grpcurl -plaintext localhost:50052 list
# Output: brunix.AssistanceEngine
grpcurl -plaintext localhost:50052 describe brunix.AssistanceEngine
```
### `AskAgent` — full response
```bash
grpcurl -plaintext \
  -d '{"query": "What is addVar in AVAP?", "session_id": "dev-001"}' \
  localhost:50052 \
  brunix.AssistanceEngine/AskAgent
```
Expected response:
```json
{
  "text": "addVar is an AVAP command that declares a new variable...",
  "avap_code": "AVAP-2026",
  "is_final": true
}
```
### `AskAgentStream` — token streaming
```bash
grpcurl -plaintext \
  -d '{"query": "Write an AVAP API that returns hello world", "session_id": "dev-001"}' \
  localhost:50052 \
  brunix.AssistanceEngine/AskAgentStream
```
Expected response (truncated):
```json
{"text": "Here", "is_final": false}
{"text": " is", "is_final": false}
{"text": " a", "is_final": false}
...
{"text": "", "is_final": true}
```
### `EvaluateRAG` — run evaluation
```bash
# Evaluate first 10 questions from the "core_syntax" category
grpcurl -plaintext \
  -d '{"category": "core_syntax", "limit": 10}' \
  localhost:50052 \
  brunix.AssistanceEngine/EvaluateRAG
```
Expected response:
```json
{
  "status": "ok",
  "questions_evaluated": 10,
  "elapsed_seconds": 142.3,
  "judge_model": "claude-sonnet-4-20250514",
  "index": "avap-docs-test",
  "faithfulness": 0.8421,
  "answer_relevancy": 0.7913,
  "context_recall": 0.7234,
  "context_precision": 0.6891,
  "global_score": 0.7615,
  "verdict": "ACCEPTABLE",
  "details": [...]
}
```
### Multi-turn conversation example
```bash
# Turn 1
grpcurl -plaintext \
  -d '{"query": "What is registerEndpoint?", "session_id": "user-abc"}' \
  localhost:50052 brunix.AssistanceEngine/AskAgentStream
# Turn 2 — the engine has history from Turn 1
grpcurl -plaintext \
  -d '{"query": "Can you show me an example?", "session_id": "user-abc"}' \
  localhost:50052 brunix.AssistanceEngine/AskAgentStream
```
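The stable-UUID pattern recommended in [§3](#3-message-types) looks like this in Python; the requests are shown as plain dicts standing in for generated-stub `AgentRequest` messages:

```python
import uuid

# Generate one session_id per conversation and reuse it on every turn,
# so the engine threads history across requests.
session_id = str(uuid.uuid4())

turn1 = {"query": "What is registerEndpoint?", "session_id": session_id}
turn2 = {"query": "Can you show me an example?", "session_id": session_id}
assert turn1["session_id"] == turn2["session_id"]
```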
### Regenerate gRPC stubs after modifying `brunix.proto`
```bash
python -m grpc_tools.protoc \
  -I./Docker/protos \
  --python_out=./Docker/src \
  --grpc_python_out=./Docker/src \
  ./Docker/protos/brunix.proto
```
---
## 6. OpenAI-Compatible Proxy
The container also exposes an HTTP server on port `8000` (`openai_proxy.py`) that wraps `AskAgentStream` under an OpenAI-compatible endpoint. This allows integration with any tool that supports the OpenAI Chat Completions API.
**Base URL:** `http://localhost:8000`
### `POST /v1/chat/completions`
**Request body:**
```json
{
  "model": "brunix",
  "messages": [
    {"role": "user", "content": "What is addVar in AVAP?"}
  ],
  "stream": true
```
**Notes:**
- The `model` field is ignored; the engine always uses the configured `OLLAMA_MODEL_NAME`.
- Session management is handled internally by the proxy. Conversation continuity across separate HTTP requests is not guaranteed.
- Only `stream: true` is fully supported. Non-streaming mode may be available but is not the primary use case.
**Example with curl:**
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "brunix",
    "messages": [{"role": "user", "content": "Explain AVAP loops"}],
    "stream": true
  }'
```