
ADR-0002: Two-Phase Streaming Design for AskAgentStream

Date: 2026-03-05
Status: Accepted
Deciders: Rafael Ruiz (CTO)


Context

The initial AskAgent implementation calls graph.invoke() — LangGraph's synchronous execution — and returns the complete answer as a single gRPC message. This blocks the gRPC connection for the full generation time (typically 3–15 seconds) with no intermediate feedback to the client.

A streaming variant is required that forwards Ollama's token output to the client as tokens are produced, enabling real-time rendering in client UIs.

The straightforward approach would be to use LangGraph's own graph.stream() method.


Decision

Implement AskAgentStream using a two-phase design:

Phase 1 — Graph-managed preparation:
Run build_prepare_graph() (classify → reformulate → retrieve) via prepare_graph.invoke(). This phase runs synchronously and produces the full classified, reformulated query and retrieved context. It does not call the LLM for generation.

Phase 2 — Manual LLM streaming:
Call build_final_messages() to reconstruct the exact prompt that the full graph would have used, then call llm.stream(final_messages) directly. Each token chunk is yielded immediately as an AgentResponse.

A separate build_prepare_graph() function mirrors the routing logic of build_graph() but terminates at END before any generation node. A build_final_messages() function replicates the prompt-building logic of generate, generate_code, and respond_conversational.
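The two-phase flow can be sketched with stand-in stubs. This is a minimal illustration of the pattern, not the real servicer: build_prepare_graph, build_final_messages, llm.stream, and the gRPC AgentResponse are replaced by simplified stand-ins, and the real signatures may differ.

```python
from dataclasses import dataclass
from typing import Iterator

@dataclass
class AgentResponse:
    # Stand-in for the gRPC AgentResponse message.
    token: str

def prepare(query: str) -> dict:
    # Phase 1 stand-in: classify -> reformulate -> retrieve.
    # Runs synchronously and never calls the LLM for generation.
    return {"query": query, "context": ["doc-1", "doc-2"]}

def build_final_messages(state: dict) -> list[str]:
    # Stand-in for build_final_messages(): rebuilds the exact prompt
    # the full graph would have used for generation.
    return [f"Context: {c}" for c in state["context"]] + [state["query"]]

def llm_stream(messages: list[str]) -> Iterator[str]:
    # Stand-in for llm.stream(): yields token chunks one at a time.
    yield from ["Hello", ", ", "world"]

def ask_agent_stream(query: str) -> Iterator[AgentResponse]:
    state = prepare(query)                      # Phase 1 (blocking)
    messages = build_final_messages(state)
    for chunk in llm_stream(messages):          # Phase 2 (streaming)
        yield AgentResponse(token=chunk)        # forwarded immediately

tokens = [r.token for r in ask_agent_stream("hi")]
print("".join(tokens))  # -> Hello, world
```

The key property is that the generator yields each chunk as soon as the LLM produces it, while all graph-managed work completes before the first token is sent.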


Rationale

Why not use graph.stream()?

LangGraph's stream() yields state snapshots at node boundaries, not LLM tokens. When using llm.invoke() inside a graph node, the invocation is atomic — there are no intermediate yields. To get per-token streaming from llm.stream(), the call must happen outside the graph.
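The granularity difference can be shown with a toy stand-in for node-boundary streaming (an assumption-laden simplification, not LangGraph itself): one update is emitted per completed node, so an atomic llm.invoke() inside the node surfaces the whole answer in a single snapshot.

```python
def generate_node(state: dict) -> dict:
    # llm.invoke() inside a node is atomic: the complete answer
    # materializes at once, with no intermediate token yields.
    return {"answer": "Hello, world"}

def graph_stream(state: dict):
    # Mimics node-boundary streaming: one snapshot per finished node.
    for name, node in [("generate", generate_node)]:
        state = {**state, **node(state)}
        yield {name: dict(state)}

updates = list(graph_stream({"query": "hi"}))
# One coarse update for the whole generate node, never per-token chunks:
print(len(updates))  # -> 1
```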

Why not inline the streaming call inside a graph node?

Yielding from inside a LangGraph node to an outer generator is architecturally complex and not idiomatic to LangGraph. It requires either a callback mechanism or breaking the node abstraction.

Trade-offs

| Concern | Two-phase design | Alternative (streaming inside graph) |
| --- | --- | --- |
| Code duplication | Medium — routing logic exists in both graphs | Low |
| Architectural clarity | High — phases are clearly separated | Low |
| LangGraph compatibility | High — standard usage | Low — requires framework internals |
| Maintainability | Requires keeping build_prepare_graph and build_final_messages in sync with build_graph | Single source of routing truth |

The duplication risk is accepted because: (1) the routing logic is simple (3 branches), (2) the prepare graph is strictly a subset of the full graph, and (3) both are tested via the same integration test queries.


Consequences

  • graph.py now exports three functions: build_graph, build_prepare_graph, build_final_messages.
  • Any change to query routing logic in build_graph must be mirrored in build_prepare_graph.
  • Any change to prompt selection in generate / generate_code / respond_conversational must be mirrored in build_final_messages.
  • Session history persistence happens after the stream ends, not mid-stream. If a client disconnects early, that turn's history is never saved.
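The last consequence follows from how Python generators are torn down. A hedged sketch with stand-in stubs (save-to-history shape and names are assumptions, not the real API): when the client stops consuming the stream, the generator is closed at its current yield and the persistence code after the loop never runs.

```python
def llm_stream():
    # Stand-in for llm.stream(): yields token chunks.
    yield from ["Hello", ", ", "world"]

def stream_and_persist(query: str, history: list):
    parts = []
    for chunk in llm_stream():
        parts.append(chunk)
        yield chunk   # an early disconnect closes the generator here
    # Reached only when the client consumed the entire stream; on close,
    # GeneratorExit is raised at the yield above and this line is skipped.
    history.append((query, "".join(parts)))

history = []
list(stream_and_persist("hi", history))  # full consumption
print(history)  # -> [('hi', 'Hello, world')]

history = []
gen = stream_and_persist("hi", history)
next(gen)     # client receives one token...
gen.close()   # ...then disconnects: history stays empty
print(history)  # -> []
```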