# ADR-0009: Per-Type Response Validation Layer **Date:** 2026-04-10 **Status:** Accepted **Deciders:** Rafael Ruiz (CTO) **Related ADRs:** ADR-0007 (MSVL for RAG Evaluation), ADR-0008 (Adaptive Query Routing), ADR-0003 (Hybrid Retrieval RRF) --- ## Context ### Problem 1 — Syntactically invalid code reaches users ADR-0007 documents that 10–16% of `CODE_GENERATION` responses contain syntactically invalid AVAP code — foreign language injection (Go, Python, JavaScript) or hallucinated commands (`getSHA256`, `readParam`, `returnResult`). The LLM judge used in `EvaluateRAG` does not detect these failures because it evaluates semantic coherence, not syntactic validity. This problem exists in production today. A user receiving a `CODE_GENERATION` response has no indication that the generated code would fail on the PLATON kernel. The AVAP Parser gRPC service — established in ADR-0007 as a hard dependency of the evaluation pipeline — is already available in the stack. It returns not just `VALID / INVALID` but a **complete line-by-line execution trace** on failure: ``` Line 3: unknown command 'getSHA256' — expected known identifier Line 7: unexpected construct 'for i in range(...)' — AVAP loop syntax required Line 12: 'returnResult' not defined — did you mean 'addResult'? ``` This trace is structured, specific, and directly actionable by the LLM. A retry informed by the parser trace is fundamentally different from a blind retry — the model knows exactly what failed and where. ### Problem 2 — Context relevance is not evaluated pre-generation The engine retrieves 8 chunks from Elasticsearch for every `RETRIEVAL` and `CODE_GENERATION` query without checking whether those chunks actually answer the question. `CONFIDENCE_PROMPT_TEMPLATE` has been scaffolded in `prompts.py` since the initial implementation but is not wired into the graph. 
Undetected low-relevance retrieval produces responses that are semantically fluent but factually ungrounded — the model generates plausible-sounding AVAP explanations or code not supported by the retrieved documentation. ### Why these are one ADR Both problems share the same architectural response: adding validation nodes to the production graph with type-specific logic and feedback-informed retry. The decision is one — **add a per-type validation layer** — and the implementation shares the same graph positions (post-retrieve, post-generate), the same retry contract (maximum 1 retry per request), and the same rationale (the engine must not silently return responses it has evidence to question). Splitting into separate ADRs would produce two documents that cannot be understood independently. --- ## Decision Add a **Per-Type Response Validation Layer (PTVL)** to the production LangGraph pipeline. Each query type has a distinct validation strategy matching its failure modes. ### Validation contract by type | Type | When | What | Mechanism | |---|---|---|---| | `CODE_GENERATION` | Post-generation | Syntactic validity of generated AVAP code | AVAP Parser gRPC — deterministic | | `RETRIEVAL` | Pre-generation | Relevance of retrieved context to the query | LLM relevance check — `CONFIDENCE_PROMPT_TEMPLATE` | | `CONVERSATIONAL` | None | — | No retrieval, no code generated | | `PLATFORM` | None | — | No retrieval, no code generated | --- ### Decision 1 — CODE_GENERATION: parser validation with trace-guided retry #### Flow ``` generate_code node │ ▼ [V1] AVAP Parser gRPC │ ├── VALID ──────────────────────────────► return response │ └── INVALID + line-by-line trace │ ▼ [inject trace into retry prompt] │ ▼ generate_code_retry node (1 attempt only) │ ▼ [V2] AVAP Parser gRPC │ ├── VALID ──────────────────────► return response │ └── INVALID ────────────────────► return response + validation_status flag ``` #### Trace-guided retry The parser trace is injected into the generation 
prompt as a structured correction context: ``` The previous attempt produced invalid AVAP code. Specific failures: Line 3: unknown command 'getSHA256' — expected known identifier Line 7: unexpected construct 'for i in range(...)' — AVAP loop syntax required Correct these errors. Do not repeat the same constructs. ``` This is not a blind retry. The LLM receives the exact failure points and can target its corrections. ADR-0007 documented the mapping between common hallucinated commands and their valid AVAP equivalents (`getSHA256` → `encodeSHA256`, `returnResult` → `addResult`, etc.) — the trace makes these corrections automatic without hardcoding the mapping. #### Parser SLA Inherited from ADR-0007: ≤2 seconds per call. **Silent fallback is permitted in production** (unlike evaluation, where ADR-0007 mandates abort). The distinction is that evaluation scores must be trustworthy; production responses degrade gracefully. #### Parser availability — circuit breaker A single timeout or connection error does not mean the parser is down. A sustained outage does. Hammering an unavailable gRPC service on every `CODE_GENERATION` request adds latency to every user request with zero benefit. The engine implements a **circuit breaker** with three states: ``` CLOSED ──[N consecutive failures]──► OPEN OPEN ──[cooldown expires]────────► HALF-OPEN HALF-OPEN ──[probe succeeds]────────► CLOSED HALF-OPEN ──[probe fails]───────────► OPEN ``` | Parameter | Default | Env var | |---|---|---| | Failure threshold to open | 3 consecutive failures | `PARSER_CB_THRESHOLD` | | Cooldown before half-open | 30 seconds | `PARSER_CB_COOLDOWN` | | Timeout per call | 2 seconds | `AVAP_PARSER_TIMEOUT` | While the circuit is **OPEN**, `CODE_GENERATION` responses are returned immediately with `validation_status = PARSER_UNAVAILABLE` — no gRPC call is attempted. The cooldown prevents thundering-herd reconnection attempts. 
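The three-state machine above can be sketched as a small class. This is a minimal illustration of the described behaviour, not the engine's actual implementation; the defaults mirror the `PARSER_CB_THRESHOLD` / `PARSER_CB_COOLDOWN` table, and the injectable `clock` parameter is an assumption added for testability:

```python
import time

class CircuitBreaker:
    """Minimal CLOSED / OPEN / HALF-OPEN breaker for the parser gRPC client."""

    def __init__(self, threshold=3, cooldown=30.0, clock=time.monotonic):
        self.threshold = threshold  # consecutive failures before opening
        self.cooldown = cooldown    # seconds before a half-open probe
        self.clock = clock          # injectable time source (for tests)
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = None

    def allow_call(self):
        """True if a parser call may be attempted now."""
        if self.state == "OPEN":
            if self.clock() - self.opened_at >= self.cooldown:
                self.state = "HALF_OPEN"  # let exactly one probe through
                return True
            return False  # circuit open: skip the gRPC call entirely
        return True

    def record_success(self):
        self.failures = 0
        self.state = "CLOSED"

    def record_failure(self):
        self.failures += 1
        if self.state == "HALF_OPEN" or self.failures >= self.threshold:
            self.state = "OPEN"
            self.opened_at = self.clock()
```

The calling node would check `allow_call()` before dialing the parser; on `False` it returns immediately with `validation_status = PARSER_UNAVAILABLE`, and the success/failure of each real call feeds back via `record_success()` / `record_failure()`.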
``` [ptvl] circuit OPEN — skipping parser, returning unvalidated [ptvl] circuit HALF-OPEN — probing parser availability [ptvl] circuit CLOSED — parser reachable ``` Setting `AVAP_PARSER_TIMEOUT=0` permanently opens the circuit — disables parser validation entirely. Useful during development when the parser service is not deployed. #### New environment variables ``` AVAP_PARSER_URL=grpc://... # URL of AVAP Parser gRPC service AVAP_PARSER_TIMEOUT=2 # seconds per call; 0 = disable validation PARSER_CB_THRESHOLD=3 # consecutive failures before circuit opens PARSER_CB_COOLDOWN=30 # seconds before circuit attempts half-open probe ``` --- ### Decision 2 — RETRIEVAL: context relevance check with reformulation retry #### Flow ``` retrieve node │ ▼ [V1] CONFIDENCE_PROMPT_TEMPLATE YES / NO │ ├── YES ────────────────────────────────► generate node (normal path) │ └── NO │ ▼ reformulate_with_hint node [reformulate query signalling context was insufficient] │ ▼ retrieve (retry) │ ▼ generate node (regardless of second retrieval result) ``` On second retrieval, generation proceeds unconditionally. The retry is a single best-effort improvement, not a gate. The model generates with whatever context is available — the user receives a response, not a refusal. #### Relevance check `CONFIDENCE_PROMPT_TEMPLATE` evaluates whether the retrieved context contains at least one passage relevant to the question. It returns `YES` or `NO` — no graduated score. The relevance check is only applied to `RETRIEVAL` queries here. `CODE_GENERATION` has the parser as a stronger post-generation signal; the pre-generation relevance check would add latency with lower signal value than the parser provides. #### Reformulation hint When context is insufficient, the reformulation node receives a hint that standard context was not found. This produces a semantically different reformulation — broader synonyms, alternative phrasings — rather than a near-duplicate of the original query. 
``` [CONTEXT_INSUFFICIENT] The previous retrieval did not return relevant context for: "{original_query}" Reformulate this query using broader terms or alternative phrasing. ``` --- ## Graph changes ### New nodes | Node | Graph | Trigger | |---|---|---| | `validate_code` | `build_graph` | After `generate_code` | | `generate_code_retry` | `build_graph` | After `validate_code` when INVALID | | `check_context_relevance` | `build_graph` + `build_prepare_graph` | After `retrieve`, before `generate` (RETRIEVAL only) | | `reformulate_with_hint` | `build_graph` + `build_prepare_graph` | After `check_context_relevance` when NO | ### Updated flow — `build_graph` ```mermaid flowchart TD START([start]) --> CL[classify] CL -->|RETRIEVAL| RF[reformulate] CL -->|CODE_GENERATION| RF CL -->|CONVERSATIONAL| RC[respond_conversational] CL -->|PLATFORM| RP[respond_platform] RF --> RT[retrieve] RT -->|CODE_GENERATION| GC[generate_code] RT -->|RETRIEVAL| CR{check_context\nrelevance} CR -->|YES| GE[generate] CR -->|NO| RH[reformulate_with_hint] RH --> RT2[retrieve retry] RT2 --> GE GC --> VC{validate_code\nParser gRPC} VC -->|VALID| END([end]) VC -->|INVALID + trace| GCR[generate_code_retry\ntrace-guided] GCR --> VC2{validate_code\nParser gRPC} VC2 -->|VALID| END VC2 -->|INVALID| END GE --> END RC --> END RP --> END ``` --- ## New AgentState fields ```python class AgentState(TypedDict): ... 
# PTVL fields parser_trace: str # raw parser trace from first validation attempt (empty if valid) validation_status: str # see validation status values below context_relevant: bool # result of CONFIDENCE_PROMPT check (RETRIEVAL only) ``` ### Validation status values | Value | Meaning | When set | |---|---|---| | `""` (empty) | Valid — no issues detected | Parser returned VALID on first or second attempt | | `INVALID_UNRESOLVED` | Parser ran, code failed both attempts | Two parser calls made, both returned INVALID | | `PARSER_UNAVAILABLE` | Parser was unreachable or circuit is open | No parser call was made or all calls timed out | These are semantically distinct signals. `INVALID_UNRESOLVED` means the engine has evidence the code is wrong. `PARSER_UNAVAILABLE` means the engine has no evidence either way — the code may be correct. Clients must not treat them equivalently. `validation_status` is surfaced to the client via `AgentResponse`: ```protobuf message AgentResponse { string text = 1; string avap_code = 2; bool is_final = 3; string validation_status = 4; // "" | "INVALID_UNRESOLVED" | "PARSER_UNAVAILABLE" } ``` Clients that do not read `validation_status` are unaffected — the field defaults to empty string. --- ## Routing contract additions (RC-07, RC-08) These rules extend the contract defined in ADR-0008. ### RC-07 — Parser validation gate (priority: high) Every `CODE_GENERATION` response **MUST** be submitted to the AVAP Parser before delivery to the client, unless `AVAP_PARSER_TIMEOUT=0` or the parser service is unreachable. ``` route(q) = CODE_GENERATION → parser_validate(response) before yield ``` A `CODE_GENERATION` response returned without parser validation due to parser unavailability **MUST** be logged as `[ptvl] parser unavailable — returning unvalidated`. ### RC-08 — Retry budget (priority: medium) Each request has a maximum of **1 retry** regardless of type. 
A `CODE_GENERATION` request that fails parser validation twice returns the second attempt with `validation_status=INVALID_UNRESOLVED`. A `RETRIEVAL` request whose context is insufficient reformulates once and generates unconditionally on the second retrieval. No request may enter more than one retry cycle.

---

## Consequences

### Positive

- Syntactically invalid AVAP code no longer reaches users silently. `validation_status` gives the client a typed signal: `INVALID_UNRESOLVED` (evidence of bad code) vs `PARSER_UNAVAILABLE` (no evidence either way) — clients can respond differently to each.
- The parser trace makes retries targeted rather than blind — the LLM corrects specific lines, not the whole response.
- Circuit breaker prevents parser outages from adding latency to every `CODE_GENERATION` request. After 3 consecutive failures the engine stops trying for 30 seconds.
- Context relevance check catches retrievals that return topically adjacent but non-answering chunks, reducing fluent-but-ungrounded responses.
- `AVAP_PARSER_TIMEOUT=0` allows development without the parser service — no hard dependency at startup.

### Negative / Trade-offs

- **`CODE_GENERATION` latency**: +1 parser gRPC call per request (~50–200ms for valid code). +1 LLM generation call + 1 parser call on invalid code (~1–2s additional).
- **`RETRIEVAL` latency**: +1 LLM call (relevance check) on every request. At `qwen3:1.7b` local inference, this adds ~300–500ms to every RETRIEVAL request — not negligible.
- The parser becomes a **soft production dependency** for CODE_GENERATION. Parser outages degrade validation silently; monitoring must alert on sustained `parser unavailable` log volume.
- The context relevance check is a **generative model doing a binary classification task** — the same architectural mismatch noted in ADR-0008 for the classifier. It is the correct interim solution while no discriminative relevance model exists.

### Open questions

1. **`RETRIEVAL` latency budget**: The +300–500ms from the relevance LLM call may be unacceptable for the VS Code extension use case, where streaming latency is user-visible. A discriminative relevance model (embedding similarity between query vector and context vector, cosine threshold) would run in ~1ms and eliminate this cost entirely. Deferred to a future amendment.
2. **`validation_status` UX**: The proto field is defined, but the client behavior is not specified. What should the VS Code extension or AVS Platform display when `validation_status` is non-empty (`INVALID_UNRESOLVED` or `PARSER_UNAVAILABLE`)? Requires a product decision outside this ADR's scope.
3. **Parser version pinning**: Inherited from ADR-0007 open question 2. Parser upgrades may alter what is considered valid AVAP. A policy for handling parser version changes in the production pipeline has not been defined.

---

## Future Path

The context relevance check for `RETRIEVAL` (Decision 2) uses a generative LLM for a discriminative task — the same pattern that ADR-0008 identified as tactical debt for the classifier. The correct steady-state implementation is a cosine similarity threshold between the query embedding vector and the average context embedding vector:

```
relevance_score = cosine(embed(query), mean(embed(chunks)))
if relevance_score < RELEVANCE_THRESHOLD:
    reformulate_with_hint()
```

This runs in microseconds using the `bge-m3` embeddings already computed during retrieval. It replaces the `CONFIDENCE_PROMPT_TEMPLATE` LLM call entirely and eliminates the +300–500ms latency penalty on every RETRIEVAL request.

**Trigger for this upgrade:** once the RETRIEVAL validation LLM call appears as a measurable latency contribution in Langfuse traces.
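The steady-state check sketched in the pseudocode above can be made concrete in a few lines of dependency-free Python. The embeddings are assumed to be the `bge-m3` vectors already produced during retrieval; the threshold value is an assumption to be tuned against real traces, not a measured figure:

```python
import math

# Assumption: placeholder value — tune against labeled retrievals / Langfuse traces.
RELEVANCE_THRESHOLD = 0.45

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def context_is_relevant(query_vec, chunk_vecs, threshold=RELEVANCE_THRESHOLD):
    """Discriminative replacement for the CONFIDENCE_PROMPT_TEMPLATE LLM call.

    query_vec: embedding of the user query (already computed at retrieval time).
    chunk_vecs: embeddings of the retrieved chunks.
    """
    return cosine(query_vec, mean_vector(chunk_vecs)) >= threshold
```

Wired into the `check_context_relevance` node, a `False` result would route to `reformulate_with_hint` exactly as the `NO` branch does today, with the YES/NO LLM call removed.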