# ADR-0002: Two-Phase Streaming Design for `AskAgentStream`

**Date:** 2026-03-05

**Status:** Accepted

**Deciders:** Rafael Ruiz (CTO), MrHouston Engineering

---

## Context

The initial `AskAgent` implementation calls `graph.invoke()` — LangGraph's synchronous execution — and returns the complete answer as a single gRPC message. This blocks the gRPC connection for the full generation time (typically 3–15 seconds) with no intermediate feedback to the client.

A streaming variant is required that forwards Ollama's token output to the client as tokens are produced, enabling real-time rendering in client UIs.

The straightforward approach would be to use LangGraph's own `graph.stream()` method.

---

## Decision

Implement `AskAgentStream` using a **two-phase design**:

**Phase 1 — Graph-managed preparation:**
Run `build_prepare_graph()` (classify → reformulate → retrieve) via `prepare_graph.invoke()`. This phase runs synchronously and produces the classified, reformulated query and the retrieved context. It does **not** call the LLM for generation.

**Phase 2 — Manual LLM streaming:**
Call `build_final_messages()` to reconstruct the exact prompt that the full graph would have used, then call `llm.stream(final_messages)` directly. Each token chunk is yielded immediately as an `AgentResponse`.

A separate `build_prepare_graph()` function mirrors the routing logic of `build_graph()` but terminates at `END` before any generation node. A `build_final_messages()` function replicates the prompt-building logic of `generate`, `generate_code`, and `respond_conversational`.
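
The two phases can be sketched end to end with stubbed components. All names below (`FakeLLM`, `prepare`, `ask_agent_stream`, the token strings) are illustrative stand-ins for the real graph, prompt builder, and Ollama-backed model, not the project's actual code:

```python
class FakeLLM:
    """Stand-in for the Ollama-backed chat model; yields tokens one at a time."""
    def stream(self, messages):
        for token in ["Streaming", " ", "works"]:
            yield token

def prepare(question):
    # Phase 1 stand-in: classify -> reformulate -> retrieve, no LLM generation.
    return {"question": question, "route": "generate", "context": ["retrieved chunk"]}

def build_final_messages(state):
    # Phase 2 stand-in: rebuild the exact prompt the full graph would have used.
    return [("system", "Answer using the context."), ("user", state["question"])]

def ask_agent_stream(llm, question):
    state = prepare(question)               # Phase 1: synchronous graph run
    messages = build_final_messages(state)  # reconstruct the final prompt
    for token in llm.stream(messages):      # Phase 2: stream tokens directly
        yield token                         # each token -> one AgentResponse in gRPC

assert "".join(ask_agent_stream(FakeLLM(), "How does X work?")) == "Streaming works"
```

In the real servicer, the `yield token` line wraps each chunk in an `AgentResponse` message before returning it to the gRPC stream.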
---
## Rationale

### Why not use `graph.stream()`?

LangGraph's `stream()` yields **state snapshots** at node boundaries, not LLM tokens. When using `llm.invoke()` inside a graph node, the invocation is atomic — there are no intermediate yields. To get per-token streaming from `llm.stream()`, the call must happen outside the graph.
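
A toy model makes the granularity gap concrete: a node-boundary stream produces one update per node, so a generation node's entire answer arrives in a single chunk. This is an illustration only, not LangGraph's actual `stream()` API:

```python
def toy_graph_stream(question):
    # One snapshot per node boundary -- the caller sees nothing in between.
    yield {"classify": {"route": "generate"}}
    yield {"retrieve": {"context": ["chunk A"]}}
    # The generation node calls llm.invoke() atomically, so its whole
    # answer lands in one snapshot instead of token by token.
    yield {"generate": {"answer": "the complete answer, all at once"}}

updates = list(toy_graph_stream("How does X work?"))
assert len(updates) == 3  # three node snapshots, no per-token granularity
```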

### Why not inline the streaming call inside a graph node?

Yielding from inside a LangGraph node to an outer generator is architecturally complex and not idiomatic LangGraph usage. It requires either a callback mechanism or breaking the node abstraction.
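
The rejected alternative can be sketched to show the complexity it would carry: tokens produced inside a node have to be smuggled out through a queue (or callback) so an outer generator can yield them. Names here are illustrative, not project code:

```python
import queue
import threading

def node_with_callback(out_q):
    # Inside the "node", each token must be pushed out through the queue,
    # because the node itself cannot yield to the gRPC generator.
    for token in ["a", "b", "c"]:   # stands in for llm.stream(...)
        out_q.put(token)
    out_q.put(None)                 # sentinel: node finished

def stream_via_queue():
    out_q = queue.Queue()
    worker = threading.Thread(target=node_with_callback, args=(out_q,))
    worker.start()
    while (token := out_q.get()) is not None:
        yield token
    worker.join()

assert list(stream_via_queue()) == ["a", "b", "c"]
```

The extra thread, queue, and sentinel protocol are exactly the accidental complexity the two-phase design avoids.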

### Trade-offs

| Concern | Two-phase design | Alternative (streaming inside graph) |
|---|---|---|
| Code duplication | Medium — routing logic exists in both graphs | Low |
| Architectural clarity | High — phases are clearly separated | Low |
| LangGraph compatibility | High — standard usage | Low — requires framework internals |
| Maintainability | Requires keeping `build_prepare_graph` and `build_final_messages` in sync with `build_graph` | Single source of routing truth |

The duplication risk is accepted because: (1) the routing logic is simple (3 branches), (2) the prepare graph is strictly a subset of the full graph, and (3) both are tested via the same integration test queries.

---

## Consequences

- `graph.py` now exports three functions: `build_graph`, `build_prepare_graph`, and `build_final_messages`.
- Any change to query routing logic in `build_graph` must be mirrored in `build_prepare_graph`.
- Any change to prompt selection in `generate` / `generate_code` / `respond_conversational` must be mirrored in `build_final_messages`.
- Session history persistence happens **after the stream ends**, not mid-stream. If a client disconnects early, history for that turn is not saved.