ADR-0002: Two-Phase Streaming Design for AskAgentStream
Date: 2026-03-05
Status: Accepted
Deciders: Rafael Ruiz (CTO)
Context
The initial AskAgent implementation calls graph.invoke() — LangGraph's synchronous execution — and returns the complete answer as a single gRPC message. This blocks the gRPC connection for the full generation time (typically 3–15 seconds) with no intermediate feedback to the client.
A streaming variant is required that forwards Ollama's token output to the client as tokens are produced, enabling real-time rendering in client UIs.
The straightforward approach would be to use LangGraph's own graph.stream() method.
Decision
Implement AskAgentStream using a two-phase design:
Phase 1 — Graph-managed preparation:
Run build_prepare_graph() (classify → reformulate → retrieve) via prepare_graph.invoke(). This phase runs synchronously and produces the classified and reformulated query together with the retrieved context. It does not call the LLM for generation.
Phase 2 — Manual LLM streaming:
Call build_final_messages() to reconstruct the exact prompt that the full graph would have used, then call llm.stream(final_messages) directly. Each token chunk is yielded immediately as an AgentResponse.
A separate build_prepare_graph() function mirrors the routing logic of build_graph() but terminates at END before any generation node. A build_final_messages() function replicates the prompt-building logic of generate, generate_code, and respond_conversational.
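A condensed sketch of the two-phase flow. The graph invocation, prompt builder, and LLM are replaced here by stand-in functions so the example is self-contained; all internal names and state keys are illustrative assumptions, not the actual implementation:

```python
from dataclasses import dataclass
from typing import Iterator

@dataclass
class AgentResponse:
    token: str
    done: bool = False

def prepare_invoke(state: dict) -> dict:
    # Stand-in for prepare_graph.invoke(): classify -> reformulate -> retrieve,
    # no generation. Keys here are hypothetical.
    return {**state, "query_type": "qa", "context": ["retrieved chunk"]}

def build_final_messages(state: dict) -> list:
    # Stand-in for the real prompt reconstruction.
    return [("system", "Answer using the retrieved context."),
            ("user", state["question"])]

def fake_llm_stream(messages: list) -> Iterator[str]:
    # Stand-in for llm.stream(): yields one token chunk at a time.
    for tok in ["Hello", ", ", "world"]:
        yield tok

def ask_agent_stream(question: str) -> Iterator[AgentResponse]:
    # Phase 1: synchronous, graph-managed preparation.
    state = prepare_invoke({"question": question})
    # Phase 2: manual token streaming, outside the graph.
    messages = build_final_messages(state)
    for chunk in fake_llm_stream(messages):
        yield AgentResponse(token=chunk)
    yield AgentResponse(token="", done=True)
```

The gRPC handler would iterate this generator and forward each AgentResponse to the client as it is produced.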
Rationale
Why not use graph.stream()?
LangGraph's stream() yields state snapshots at node boundaries, not LLM tokens. When using llm.invoke() inside a graph node, the invocation is atomic — there are no intermediate yields. To get per-token streaming from llm.stream(), the call must happen outside the graph.
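The granularity gap can be illustrated with two mocked generators (neither is real LangGraph or Ollama API; both are stand-ins built for this comparison):

```python
def graph_stream(state: dict):
    # LangGraph-style streaming: one state snapshot per completed node.
    state = {**state, "query": "reformulated query"}
    yield {"reformulate": state}
    # llm.invoke() inside a node is atomic: the full answer appears at once.
    state = {**state, "answer": "full answer text"}
    yield {"generate": state}

def llm_stream(prompt: str):
    # Direct llm.stream()-style call: one chunk per token.
    for tok in ["full ", "answer ", "text"]:
        yield tok

snapshots = list(graph_stream({"query": "raw query"}))
tokens = list(llm_stream("prompt"))
```

Only two events come out of the graph-level stream, and the answer text is visible only in the final snapshot, fully formed. The token-level stream is available only by calling the LLM directly, outside the graph.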
Why not inline the streaming call inside a graph node?
Yielding from inside a LangGraph node to an outer generator is architecturally complex and not idiomatic to LangGraph. It requires either a callback mechanism or breaking the node abstraction.
Trade-offs
| Concern | Two-phase design | Alternative (streaming inside graph) |
|---|---|---|
| Code duplication | Medium — routing logic exists in both graphs | Low |
| Architectural clarity | High — phases are clearly separated | Low |
| LangGraph compatibility | High — standard usage | Low — requires framework internals |
| Maintainability | Requires keeping build_prepare_graph and build_final_messages in sync with build_graph | Single source of routing truth |
The duplication risk is accepted because: (1) the routing logic is simple (3 branches), (2) the prepare graph is strictly a subset of the full graph, and (3) both are tested via the same integration test queries.
Consequences
- graph.py now exports three functions: build_graph, build_prepare_graph, build_final_messages.
- Any change to query routing logic in build_graph must be mirrored in build_prepare_graph.
- Any change to prompt selection in generate/generate_code/respond_conversational must be mirrored in build_final_messages.
- Session history persistence happens after the stream ends, not mid-stream. If a client disconnects early, history for that turn is not saved.
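The early-disconnect behavior follows from how Python generators work: code after the streaming loop runs only if the generator is fully consumed. A minimal sketch (plain generator, not the service code):

```python
from typing import Iterator, Iterable

def stream_with_history(tokens: Iterable[str], history: list) -> Iterator[str]:
    collected = []
    for tok in tokens:
        collected.append(tok)
        yield tok
    # Reached only when the stream completes. If the client disconnects
    # early, the generator is closed mid-stream (GeneratorExit is raised
    # at the yield) and this persistence step never runs.
    history.append("".join(collected))

# Full consumption: history is saved.
saved = []
list(stream_with_history(["a", "b", "c"], saved))

# Early disconnect: client reads one token, then the stream is closed.
lost = []
gen = stream_with_history(["a", "b", "c"], lost)
next(gen)
gen.close()  # history for this turn is never persisted
```

Moving persistence into a `finally` block would save a partial turn on disconnect instead, which is a deliberate trade-off this design does not make.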