
# ADR-0002: Two-Phase Streaming Design for `AskAgentStream`
**Date:** 2026-03-05
**Status:** Accepted
**Deciders:** Rafael Ruiz (CTO), MrHouston Engineering
---
## Context
The initial `AskAgent` implementation calls `graph.invoke()` — LangGraph's synchronous execution — and returns the complete answer as a single gRPC message. This blocks the gRPC call for the full generation time (typically 3–15 seconds) with no intermediate feedback to the client.
A streaming variant is required that forwards Ollama's token output to the client as tokens are produced, enabling real-time rendering in client UIs.
The straightforward approach would be to use LangGraph's own `graph.stream()` method.
---
## Decision
Implement `AskAgentStream` using a **two-phase design**:
**Phase 1 — Graph-managed preparation:**
Run `build_prepare_graph()` (classify → reformulate → retrieve) via `prepare_graph.invoke()`. This phase runs synchronously and produces the full classified, reformulated query and retrieved context. It does **not** call the LLM for generation.
**Phase 2 — Manual LLM streaming:**
Call `build_final_messages()` to reconstruct the exact prompt that the full graph would have used, then call `llm.stream(final_messages)` directly. Each token chunk is yielded immediately as an `AgentResponse`.
A separate `build_prepare_graph()` function mirrors the routing logic of `build_graph()` but terminates at `END` before any generation node. A `build_final_messages()` function replicates the prompt-building logic of `generate`, `generate_code`, and `respond_conversational`.
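The two phases compose into a single generator. The sketch below illustrates the shape of that generator; the function bodies are hypothetical stubs (the real `prepare_graph.invoke()`, `build_final_messages()`, and `llm.stream()` live in `graph.py` and the LLM client), and the real service wraps each chunk in an `AgentResponse` message rather than yielding raw strings:

```python
from typing import Iterator

# Hypothetical stand-ins for the real components. Names follow the ADR,
# but the bodies are stubs for illustration only.
def prepare_invoke(question: str) -> dict:
    """Phase 1: classify -> reformulate -> retrieve. No LLM generation."""
    return {"route": "generate", "question": question, "context": ["doc-1"]}

def build_final_messages(state: dict) -> list[str]:
    """Rebuild the exact prompt the full graph would have used."""
    return [f"Context: {c}" for c in state["context"]] + [state["question"]]

def llm_stream(messages: list[str]) -> Iterator[str]:
    """Stub for llm.stream(); yields token chunks as they are produced."""
    for token in ["The ", "answer", "."]:
        yield token

def ask_agent_stream(question: str) -> Iterator[str]:
    # Phase 1: synchronous graph run, terminates before any generation node.
    state = prepare_invoke(question)
    # Phase 2: stream the LLM call directly, outside the graph.
    final_messages = build_final_messages(state)
    for chunk in llm_stream(final_messages):
        yield chunk  # the real service wraps this in an AgentResponse

tokens = list(ask_agent_stream("How do I reset my password?"))
```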
---
## Rationale
### Why not use `graph.stream()`?
LangGraph's `stream()` yields **state snapshots** at node boundaries, not LLM tokens. When using `llm.invoke()` inside a graph node, the invocation is atomic — there are no intermediate yields. To get per-token streaming from `llm.stream()`, the call must happen outside the graph.
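The distinction can be seen with a minimal stub (a hypothetical stand-in, not the real Ollama client): `invoke()` is atomic and returns only when the full answer exists, while `stream()` yields each chunk as it is produced, which is the behavior the client UI needs.

```python
from typing import Iterator

class StubLLM:
    """Hypothetical stand-in for the LLM client, for illustration only."""

    def __init__(self, answer: str):
        self._answer = answer

    def invoke(self, messages: list) -> str:
        # Atomic: the caller sees nothing until the whole answer is ready.
        return self._answer

    def stream(self, messages: list) -> Iterator[str]:
        # Incremental: each chunk is yielded as soon as it is "produced".
        for word in self._answer.split(" "):
            yield word + " "

llm = StubLLM("streamed token output")
whole = llm.invoke([])          # one blocking call, one result
chunks = list(llm.stream([]))   # many small results, yielded over time
```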
### Why not inline the streaming call inside a graph node?
Yielding from inside a LangGraph node to an outer generator is architecturally complex and not idiomatic to LangGraph. It requires either a callback mechanism or breaking the node abstraction.
### Trade-offs
| Concern | Two-phase design | Alternative (streaming inside graph) |
|---|---|---|
| Code duplication | Medium — routing logic exists in both graphs | Low |
| Architectural clarity | High — phases are clearly separated | Low |
| LangGraph compatibility | High — standard usage | Low — requires framework internals |
| Maintainability | Medium — `build_prepare_graph` and `build_final_messages` must be kept in sync with `build_graph` | High — single source of routing truth |
The duplication risk is accepted because: (1) the routing logic is simple (3 branches), (2) the prepare graph is strictly a subset of the full graph, and (3) both are tested via the same integration test queries.
---
## Consequences
- `graph.py` now exports three functions: `build_graph`, `build_prepare_graph`, `build_final_messages`.
- Any change to query routing logic in `build_graph` must be mirrored in `build_prepare_graph`.
- Any change to prompt selection in `generate` / `generate_code` / `respond_conversational` must be mirrored in `build_final_messages`.
- Session history persistence happens **after the stream ends**, not mid-stream. If a client disconnects early, the history for that turn is not saved.
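The early-disconnect behavior follows directly from where the persistence call sits relative to the `yield` loop. A minimal sketch (the `history` store and function names are hypothetical, not the real session layer): code placed after the loop runs only if the consumer drains the generator; closing the generator mid-stream raises `GeneratorExit` at the `yield`, so the persistence line is never reached.

```python
from typing import Iterator

history: list[tuple[str, str]] = []  # stand-in for the session store

def stream_and_persist(question: str, chunks: Iterator[str]) -> Iterator[str]:
    """Yield chunks to the client; persist history only once the stream ends."""
    parts: list[str] = []
    for chunk in chunks:
        parts.append(chunk)
        yield chunk
    # Reached only if the consumer drained the whole stream.
    history.append((question, "".join(parts)))

# Full consumption: the turn is persisted.
list(stream_and_persist("q1", iter(["a", "b"])))

# Early disconnect: closing the generator mid-stream skips the code
# after the loop, so this turn is lost.
gen = stream_and_persist("q2", iter(["a", "b"]))
next(gen)
gen.close()
```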