ADR-0002: Two-Phase Streaming Design for AskAgentStream
Date: 2026-03-05
Status: Accepted
Deciders: Rafael Ruiz (CTO)
Context
The initial AskAgent implementation calls graph.invoke() — LangGraph's synchronous execution — and returns the complete answer as a single gRPC message. This blocks the gRPC connection for the full generation time (typically 3–15 seconds) with no intermediate feedback to the client.
A streaming variant is required that forwards Ollama's token output to the client as tokens are produced, enabling real-time rendering in client UIs.
The straightforward approach would be to use LangGraph's own graph.stream() method.
Decision
Implement AskAgentStream using a two-phase design:
Phase 1 — Graph-managed preparation:
Run build_prepare_graph() (classify → reformulate → retrieve) via prepare_graph.invoke(). This phase runs synchronously and produces the classified and reformulated query together with the retrieved context. It does not call the LLM for generation.
Phase 2 — Manual LLM streaming:
Call build_final_messages() to reconstruct the exact prompt that the full graph would have used, then call llm.stream(final_messages) directly. Each token chunk is yielded immediately as an AgentResponse.
A separate build_prepare_graph() function mirrors the routing logic of build_graph() but terminates at END before any generation node. A build_final_messages() function replicates the prompt-building logic of generate, generate_code, and respond_conversational.
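A condensed sketch of the two-phase flow. The graph invocation, prompt builder, and LLM are replaced here by stand-in functions so the example is self-contained; all internal names and state keys are illustrative assumptions, not the actual implementation:

```python
from dataclasses import dataclass
from typing import Iterator

@dataclass
class AgentResponse:
    token: str
    done: bool = False

def prepare_invoke(state: dict) -> dict:
    # Stand-in for prepare_graph.invoke(): classify -> reformulate -> retrieve,
    # no generation. Keys here are hypothetical.
    return {**state, "query_type": "qa", "context": ["retrieved chunk"]}

def build_final_messages(state: dict) -> list:
    # Stand-in for the real prompt reconstruction.
    return [("system", "Answer using the retrieved context."),
            ("user", state["question"])]

def fake_llm_stream(messages: list) -> Iterator[str]:
    # Stand-in for llm.stream(): yields one token chunk at a time.
    for tok in ["Hello", ", ", "world"]:
        yield tok

def ask_agent_stream(question: str) -> Iterator[AgentResponse]:
    # Phase 1: synchronous, graph-managed preparation.
    state = prepare_invoke({"question": question})
    # Phase 2: manual token streaming, outside the graph.
    messages = build_final_messages(state)
    for chunk in fake_llm_stream(messages):
        yield AgentResponse(token=chunk)
    yield AgentResponse(token="", done=True)
```

The gRPC handler would iterate this generator and forward each AgentResponse to the client as it is produced.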
Rationale
Why not use graph.stream()?
LangGraph's stream() yields state snapshots at node boundaries, not LLM tokens. When using llm.invoke() inside a graph node, the invocation is atomic — there are no intermediate yields. To get per-token streaming from llm.stream(), the call must happen outside the graph.
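The granularity gap can be illustrated with two mocked generators (neither is real LangGraph or Ollama API; both are stand-ins built for this comparison):

```python
def graph_stream(state: dict):
    # LangGraph-style streaming: one state snapshot per completed node.
    state = {**state, "query": "reformulated query"}
    yield {"reformulate": state}
    # llm.invoke() inside a node is atomic: the full answer appears at once.
    state = {**state, "answer": "full answer text"}
    yield {"generate": state}

def llm_stream(prompt: str):
    # Direct llm.stream()-style call: one chunk per token.
    for tok in ["full ", "answer ", "text"]:
        yield tok

snapshots = list(graph_stream({"query": "raw query"}))
tokens = list(llm_stream("prompt"))
```

Only two events come out of the graph-level stream, and the answer text is visible only in the final snapshot, fully formed. The token-level stream is available only by calling the LLM directly, outside the graph.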
Why not inline the streaming call inside a graph node?
Yielding from inside a LangGraph node to an outer generator is architecturally complex and not idiomatic to LangGraph. It requires either a callback mechanism or breaking the node abstraction.
Trade-offs
| Concern | Two-phase design | Alternative (streaming inside graph) |
|---|---|---|
| Code duplication | Medium — routing logic exists in both graphs | Low |
| Architectural clarity | High — phases are clearly separated | Low |
| LangGraph compatibility | High — standard usage | Low — requires framework internals |
| Maintainability | Requires keeping build_prepare_graph and build_final_messages in sync with build_graph | Single source of routing truth |
The duplication risk is accepted because: (1) the routing logic is simple (3 branches), (2) the prepare graph is strictly a subset of the full graph, and (3) both are tested via the same integration test queries.
Consequences
- graph.py now exports three functions: build_graph, build_prepare_graph, build_final_messages.
- Any change to query routing logic in build_graph must be mirrored in build_prepare_graph.
- Any change to prompt selection in generate/generate_code/respond_conversational must be mirrored in build_final_messages.
- Session history persistence happens after the stream ends, not mid-stream. If a client disconnects early, history for that turn is not saved.
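The early-disconnect behavior follows from how Python generators work: code after the streaming loop runs only if the generator is fully consumed. A minimal sketch (plain generator, not the service code):

```python
from typing import Iterator, Iterable

def stream_with_history(tokens: Iterable[str], history: list) -> Iterator[str]:
    collected = []
    for tok in tokens:
        collected.append(tok)
        yield tok
    # Reached only when the stream completes. If the client disconnects
    # early, the generator is closed mid-stream (GeneratorExit is raised
    # at the yield) and this persistence step never runs.
    history.append("".join(collected))

# Full consumption: history is saved.
saved = []
list(stream_with_history(["a", "b", "c"], saved))

# Early disconnect: client reads one token, then the stream is closed.
lost = []
gen = stream_with_history(["a", "b", "c"], lost)
next(gen)
gen.close()  # history for this turn is never persisted
```

Moving persistence into a `finally` block would save a partial turn on disconnect instead, which is a deliberate trade-off this design does not make.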