# PRD-0001: OpenAI-Compatible HTTP Proxy

**Date:** 2026-03-18

**Status:** Implemented

**Requested by:** Rafael Ruiz (CTO)

**Implemented in:** PR #58

**Related ADR:** ADR-0001 (gRPC as primary interface)

---

## Problem

The Brunix Assistance Engine exposes a gRPC interface as its primary API. gRPC is the right choice for performance and type safety in server-to-server communication, but it creates a significant adoption barrier for two categories of consumers:
**Existing OpenAI integrations.** Any tool or client already configured to call the OpenAI API — VS Code extensions using `continue.dev`, LiteLLM routers, Open WebUI instances, internal tooling at 101OBEX, Corp — requires code changes to switch to gRPC. The switching cost is non-trivial and creates friction that slows adoption.
**Model replacement use case.** The core strategic value of the Brunix RAG is that it can replace direct OpenAI API consumption with a locally-hosted, domain-specific assistant that has no per-token cost and no data privacy concerns. This value proposition is only actionable if the replacement is transparent — i.e., the client does not need to change to consume the Brunix RAG instead of OpenAI.
Without a compatibility layer, the Brunix engine cannot serve as a drop-in replacement for OpenAI models. Every potential adopter faces an integration project instead of a configuration change.
---

## Solution

Implement an HTTP server running alongside the gRPC server that exposes:
- The OpenAI Chat Completions API (`/v1/chat/completions`) — both streaming and non-streaming
- The OpenAI Completions API (`/v1/completions`) — legacy support
- The OpenAI Models API (`/v1/models`) — for compatibility with clients that enumerate available models
- The Ollama Chat API (`/api/chat`) — NDJSON streaming format
- The Ollama Generate API (`/api/generate`) — for Ollama-native clients
- The Ollama Tags API (`/api/tags`) — for clients that list available models
- A health endpoint (`/health`)

The proxy bridges HTTP → gRPC internally: `stream: false` routes to `AskAgent`, `stream: true` routes to `AskAgentStream`. The gRPC interface remains the primary interface and is not modified.
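
This dispatch can be sketched as follows; the stub object and its call shapes are assumptions for illustration, not the project's generated gRPC code:

```python
# Minimal sketch of the stream-flag routing described above. The stub and
# its method signatures are illustrative assumptions, not the real
# generated gRPC stub.
def dispatch(request: dict, stub):
    """Route an OpenAI-style request body to the matching gRPC method."""
    if request.get("stream", False):
        # Server-streaming RPC: yields tokens as they are generated.
        return stub.AskAgentStream(request)
    # Unary RPC: returns the full answer in one response.
    return stub.AskAgent(request)
```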
Any client that currently points to `https://api.openai.com` can be reconfigured to point to `http://localhost:8000` (or the server's address) with `model: brunix` and will work without any other change.
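
As an illustration, a minimal client along those lines might look like this. The endpoint path and response shape follow the OpenAI Chat Completions API; the helper names, the base URL, and the `demo-session` value are assumptions for the sketch:

```python
import json
import urllib.request

# Sketch of a drop-in client call, assuming the proxy runs on localhost:8000.
# The body follows the OpenAI Chat Completions shape; `session_id` is the
# proxy's extension field (not part of the upstream OpenAI API).
def build_chat_request(prompt: str, stream: bool = False) -> dict:
    return {
        "model": "brunix",
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
        "session_id": "demo-session",  # hypothetical session identifier
    }

def ask_brunix(prompt: str, base_url: str = "http://localhost:8000") -> str:
    """POST a non-streaming chat completion and return the answer text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```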
---

## Scope

**In scope:**

- OpenAI-compatible endpoints as listed above
- Ollama-compatible endpoints as listed above
- Routing `stream: false` to `AskAgent` and `stream: true` to `AskAgentStream`
- Session ID propagation via the `session_id` extension field in `ChatCompletionRequest`
- Health endpoint

**Out of scope:**

- OpenAI function calling / tool use
- OpenAI embeddings API (`/v1/embeddings`)
- OpenAI fine-tuning or moderation APIs
- Authentication / API key validation (handled at infrastructure level)
- Multi-turn conversation reconstruction from the message array (the proxy extracts only the last user message as the query)

---

## Technical implementation

**Stack:** FastAPI + uvicorn, running on port 8000 inside the same container as the gRPC server.
**Concurrency:** An asyncio event loop bridges FastAPI's async context with the synchronous gRPC calls via a dedicated `ThreadPoolExecutor` (configurable via `PROXY_THREAD_WORKERS`, default 20). This prevents gRPC blocking calls from stalling the async HTTP server.
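
A minimal sketch of this bridge, with the gRPC round trip simulated by a blocking sleep (function names are illustrative, not the proxy's actual code):

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

# Sketch of the executor bridge described above. The blocking gRPC call is
# simulated with time.sleep; the pool size mirrors the documented default.
executor = ThreadPoolExecutor(max_workers=20)  # PROXY_THREAD_WORKERS default

def blocking_ask_agent(query: str) -> str:
    time.sleep(0.01)  # stand-in for a synchronous gRPC round trip
    return f"answer to: {query}"

async def handle_request(query: str) -> str:
    # Run the blocking call in the pool so the event loop stays responsive.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(executor, blocking_ask_agent, query)
```

Outside FastAPI, `asyncio.run(handle_request("ping"))` drives the same path.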
**Streaming:** An `asyncio.Queue` connects the gRPC token stream (produced in a thread) with the FastAPI `StreamingResponse` (consumed in the async event loop). Tokens are forwarded as SSE events (OpenAI format) or NDJSON (Ollama format) as they arrive from `AskAgentStream`.
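
The queue hand-off can be sketched like this, with a stand-in token source in place of the real `AskAgentStream` iterator:

```python
import asyncio
import json
import threading

# Sketch of the streaming bridge described above: a worker thread feeds
# tokens into an asyncio.Queue, and an async generator drains it as SSE
# lines. The token source is a stand-in for the gRPC stream.
async def sse_stream(tokens):
    loop = asyncio.get_running_loop()
    queue: asyncio.Queue = asyncio.Queue()

    def produce():
        for tok in tokens:  # stand-in for iterating AskAgentStream
            loop.call_soon_threadsafe(queue.put_nowait, tok)
        loop.call_soon_threadsafe(queue.put_nowait, None)  # end sentinel

    threading.Thread(target=produce, daemon=True).start()
    while True:
        tok = await queue.get()
        if tok is None:
            yield "data: [DONE]\n\n"
            return
        # One SSE event per token, in a simplified OpenAI delta shape.
        payload = {"choices": [{"delta": {"content": tok}}]}
        yield f"data: {json.dumps(payload)}\n\n"
```

In FastAPI this generator would typically be wrapped in a `StreamingResponse` with `media_type="text/event-stream"`.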
**Entry point:** `entrypoint.sh` starts both the gRPC server and the HTTP proxy as parallel processes. If either crashes, the other is terminated — the container fails cleanly rather than entering a partially active state.
**Environment variables:**
| Variable | Default | Description |
|---|---|---|
| `BRUNIX_GRPC_TARGET` | `localhost:50051` | gRPC server address |
| `PROXY_MODEL_ID` | `brunix` | Model name returned by `/v1/models` and `/api/tags` |
| `PROXY_THREAD_WORKERS` | `20` | ThreadPoolExecutor size for gRPC calls |
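
These might be read at startup roughly as follows (a sketch with the documented defaults, not the proxy's actual code):

```python
import os

# Sketch of reading the configuration above, using the documented defaults.
GRPC_TARGET = os.getenv("BRUNIX_GRPC_TARGET", "localhost:50051")
MODEL_ID = os.getenv("PROXY_MODEL_ID", "brunix")
THREAD_WORKERS = int(os.getenv("PROXY_THREAD_WORKERS", "20"))
```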
---

## Validation

**Functional:** Any OpenAI-compatible client (continue.dev, LiteLLM, Open WebUI) can be pointed at `http://localhost:8000` with `model: brunix` and successfully send queries to the Brunix RAG without code changes.
**Strategic:** The VS Code extension and any 101OBEX, Corp internal tooling currently consuming OpenAI can switch to the Brunix RAG by changing one endpoint URL and one model name. No other changes required.
---

## Impact on existing interfaces

The gRPC interface (`AskAgent`, `AskAgentStream`, `EvaluateRAG`) is unchanged. Existing gRPC clients are not affected. The proxy is additive — it does not replace the gRPC interface, it complements it.