
# ADR-0002: Two-Phase Streaming Design for `AskAgentStream`
**Date:** 2026-03-05
**Status:** Accepted
**Deciders:** Rafael Ruiz (CTO), MrHouston Engineering
---
## Context
The initial `AskAgent` implementation calls `graph.invoke()` — LangGraph's synchronous execution — and returns the complete answer as a single gRPC message. This blocks the gRPC call for the full generation time (typically 3–15 seconds) with no intermediate feedback to the client.
A streaming variant is required that forwards Ollama's token output to the client as tokens are produced, enabling real-time rendering in client UIs.
The straightforward approach would be to use LangGraph's own `graph.stream()` method.
---
## Decision
Implement `AskAgentStream` using a **two-phase design**:
**Phase 1 — Graph-managed preparation:**
Run `build_prepare_graph()` (classify → reformulate → retrieve) via `prepare_graph.invoke()`. This phase runs synchronously and produces the full classified, reformulated query and retrieved context. It does **not** call the LLM for generation.
**Phase 2 — Manual LLM streaming:**
Call `build_final_messages()` to reconstruct the exact prompt that the full graph would have used, then call `llm.stream(final_messages)` directly. Each token chunk is yielded immediately as an `AgentResponse`.
A separate `build_prepare_graph()` function mirrors the routing logic of `build_graph()` but terminates at `END` before any generation node. A `build_final_messages()` function replicates the prompt-building logic of `generate`, `generate_code`, and `respond_conversational`.
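The two phases compose into a single generator. The sketch below illustrates the shape of that generator; the function bodies are hypothetical stubs (the real `prepare_graph.invoke()`, `build_final_messages()`, and `llm.stream()` live in `graph.py` and the LLM client), and the real service wraps each chunk in an `AgentResponse` message rather than yielding raw strings:

```python
from typing import Iterator

# Hypothetical stand-ins for the real components. Names follow the ADR,
# but the bodies are stubs for illustration only.
def prepare_invoke(question: str) -> dict:
    """Phase 1: classify -> reformulate -> retrieve. No LLM generation."""
    return {"route": "generate", "question": question, "context": ["doc-1"]}

def build_final_messages(state: dict) -> list[str]:
    """Rebuild the exact prompt the full graph would have used."""
    return [f"Context: {c}" for c in state["context"]] + [state["question"]]

def llm_stream(messages: list[str]) -> Iterator[str]:
    """Stub for llm.stream(); yields token chunks as they are produced."""
    for token in ["The ", "answer", "."]:
        yield token

def ask_agent_stream(question: str) -> Iterator[str]:
    # Phase 1: synchronous graph run, terminates before any generation node.
    state = prepare_invoke(question)
    # Phase 2: stream the LLM call directly, outside the graph.
    final_messages = build_final_messages(state)
    for chunk in llm_stream(final_messages):
        yield chunk  # the real service wraps this in an AgentResponse

tokens = list(ask_agent_stream("How do I reset my password?"))
```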
---
## Rationale
### Why not use `graph.stream()`?
LangGraph's `stream()` yields **state snapshots** at node boundaries, not LLM tokens. When using `llm.invoke()` inside a graph node, the invocation is atomic — there are no intermediate yields. To get per-token streaming from `llm.stream()`, the call must happen outside the graph.
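The distinction can be seen with a minimal stub (a hypothetical stand-in, not the real Ollama client): `invoke()` is atomic and returns only when the full answer exists, while `stream()` yields each chunk as it is produced, which is the behavior the client UI needs.

```python
from typing import Iterator

class StubLLM:
    """Hypothetical stand-in for the LLM client, for illustration only."""

    def __init__(self, answer: str):
        self._answer = answer

    def invoke(self, messages: list) -> str:
        # Atomic: the caller sees nothing until the whole answer is ready.
        return self._answer

    def stream(self, messages: list) -> Iterator[str]:
        # Incremental: each chunk is yielded as soon as it is "produced".
        for word in self._answer.split(" "):
            yield word + " "

llm = StubLLM("streamed token output")
whole = llm.invoke([])          # one blocking call, one result
chunks = list(llm.stream([]))   # many small results, yielded over time
```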
### Why not inline the streaming call inside a graph node?
Yielding from inside a LangGraph node to an outer generator is architecturally complex and not idiomatic to LangGraph. It requires either a callback mechanism or breaking the node abstraction.
### Trade-offs
| Concern | Two-phase design | Alternative (streaming inside graph) |
|---|---|---|
| Code duplication | Medium — routing logic exists in both graphs | Low |
| Architectural clarity | High — phases are clearly separated | Low |
| LangGraph compatibility | High — standard usage | Low — requires framework internals |
| Maintainability | Medium — `build_prepare_graph` and `build_final_messages` must be kept in sync with `build_graph` | High — single source of routing truth |
The duplication risk is accepted because: (1) the routing logic is simple (3 branches), (2) the prepare graph is strictly a subset of the full graph, and (3) both are tested via the same integration test queries.
---
## Consequences
- `graph.py` now exports three functions: `build_graph`, `build_prepare_graph`, `build_final_messages`.
- Any change to query routing logic in `build_graph` must be mirrored in `build_prepare_graph`.
- Any change to prompt selection in `generate` / `generate_code` / `respond_conversational` must be mirrored in `build_final_messages`.
- Session history persistence happens **after the stream ends**, not mid-stream. If a client disconnects early, the history for that turn is not saved.
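The early-disconnect behavior follows directly from where the persistence call sits relative to the `yield` loop. A minimal sketch (the `history` store and function names are hypothetical, not the real session layer): code placed after the loop runs only if the consumer drains the generator; closing the generator mid-stream raises `GeneratorExit` at the `yield`, so the persistence line is never reached.

```python
from typing import Iterator

history: list[tuple[str, str]] = []  # stand-in for the session store

def stream_and_persist(question: str, chunks: Iterator[str]) -> Iterator[str]:
    """Yield chunks to the client; persist history only once the stream ends."""
    parts: list[str] = []
    for chunk in chunks:
        parts.append(chunk)
        yield chunk
    # Reached only if the consumer drained the whole stream.
    history.append((question, "".join(parts)))

# Full consumption: the turn is persisted.
list(stream_and_persist("q1", iter(["a", "b"])))

# Early disconnect: closing the generator mid-stream skips the code
# after the loop, so this turn is lost.
gen = stream_and_persist("q2", iter(["a", "b"]))
next(gen)
gen.close()
```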