ADR-0009: Per-Type Response Validation Layer
Date: 2026-04-10 Status: Accepted Deciders: Rafael Ruiz (CTO) Related ADRs: ADR-0007 (MSVL for RAG Evaluation), ADR-0008 (Adaptive Query Routing), ADR-0003 (Hybrid Retrieval RRF)
Context
Problem 1 — Syntactically invalid code reaches users
ADR-0007 documents that 10–16% of CODE_GENERATION responses contain syntactically invalid AVAP code — foreign language injection (Go, Python, JavaScript) or hallucinated commands (getSHA256, readParam, returnResult). The LLM judge used in EvaluateRAG does not detect these failures because it evaluates semantic coherence, not syntactic validity.
This problem exists in production today. A user receiving a CODE_GENERATION response has no indication that the generated code would fail on the PLATON kernel.
The AVAP Parser gRPC service — established in ADR-0007 as a hard dependency of the evaluation pipeline — is already available in the stack. It returns not just VALID / INVALID but a complete line-by-line execution trace on failure:
Line 3: unknown command 'getSHA256' — expected known identifier
Line 7: unexpected construct 'for i in range(...)' — AVAP loop syntax required
Line 12: 'returnResult' not defined — did you mean 'addResult'?
This trace is structured, specific, and directly actionable by the LLM. A retry informed by the parser trace is fundamentally different from a blind retry — the model knows exactly what failed and where.
Problem 2 — Context relevance is not evaluated pre-generation
The engine retrieves 8 chunks from Elasticsearch for every RETRIEVAL and CODE_GENERATION query without checking whether those chunks actually answer the question. CONFIDENCE_PROMPT_TEMPLATE has been scaffolded in prompts.py since the initial implementation but is not wired into the graph.
Undetected low-relevance retrieval produces responses that are semantically fluent but factually ungrounded — the model generates plausible-sounding AVAP explanations or code not supported by the retrieved documentation.
Why these are one ADR
Both problems share the same architectural response: adding validation nodes to the production graph with type-specific logic and feedback-informed retry. The decision is one — add a per-type validation layer — and the implementation shares the same graph positions (post-retrieve, post-generate), the same retry contract (maximum 1 retry per request), and the same rationale (the engine must not silently return responses it has evidence to question).
Splitting into separate ADRs would produce two documents that cannot be understood independently.
Decision
Add a Per-Type Response Validation Layer (PTVL) to the production LangGraph pipeline. Each query type has a distinct validation strategy matching its failure modes.
Validation contract by type
| Type | When | What | Mechanism |
|---|---|---|---|
| CODE_GENERATION | Post-generation | Syntactic validity of generated AVAP code | AVAP Parser gRPC — deterministic |
| RETRIEVAL | Pre-generation | Relevance of retrieved context to the query | LLM relevance check — CONFIDENCE_PROMPT_TEMPLATE |
| CONVERSATIONAL | None | — | No retrieval, no code generated |
| PLATFORM | None | — | No retrieval, no code generated |
Decision 1 — CODE_GENERATION: parser validation with trace-guided retry
Flow
generate_code node
│
▼
[V1] AVAP Parser gRPC
│
├── VALID ──────────────────────────────► return response
│
└── INVALID + line-by-line trace
│
▼
[inject trace into retry prompt]
│
▼
generate_code_retry node (1 attempt only)
│
▼
[V2] AVAP Parser gRPC
│
├── VALID ──────────────────────► return response
│
└── INVALID ────────────────────► return response + validation_status flag
Trace-guided retry
The parser trace is injected into the generation prompt as a structured correction context:
<parser_feedback>
The previous attempt produced invalid AVAP code. Specific failures:
Line 3: unknown command 'getSHA256' — expected known identifier
Line 7: unexpected construct 'for i in range(...)' — AVAP loop syntax required
Correct these errors. Do not repeat the same constructs.
</parser_feedback>
This is not a blind retry. The LLM receives the exact failure points and can target its corrections. ADR-0007 documented the mapping between common hallucinated commands and their valid AVAP equivalents (getSHA256 → encodeSHA256, returnResult → addResult, etc.) — the trace makes these corrections automatic without hardcoding the mapping.
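Mechanically, the injection is a string wrap around the trace. A minimal sketch — the function name and the way the base prompt is passed in are illustrative, not the engine's actual code:

```python
def build_retry_prompt(base_prompt: str, parser_trace: str) -> str:
    """Append the <parser_feedback> block to the generation prompt.

    parser_trace is the raw line-by-line trace returned by the AVAP
    Parser gRPC service on an INVALID verdict.
    """
    return (
        f"{base_prompt}\n\n"
        "<parser_feedback>\n"
        "The previous attempt produced invalid AVAP code. Specific failures:\n"
        f"{parser_trace}\n"
        "Correct these errors. Do not repeat the same constructs.\n"
        "</parser_feedback>"
    )
```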
Parser SLA
Inherited from ADR-0007: ≤2 seconds per call. Silent fallback is permitted in production (unlike evaluation, where ADR-0007 mandates abort). The distinction is that evaluation scores must be trustworthy; production responses degrade gracefully.
Parser availability — circuit breaker
A single timeout or connection error does not mean the parser is down. A sustained outage does. Hammering an unavailable gRPC service on every CODE_GENERATION request adds latency to every user request with zero benefit.
The engine implements a circuit breaker with three states:
CLOSED ──[N consecutive failures]──► OPEN
OPEN ──[cooldown expires]────────► HALF-OPEN
HALF-OPEN ──[probe succeeds]────────► CLOSED
HALF-OPEN ──[probe fails]───────────► OPEN
| Parameter | Default | Env var |
|---|---|---|
| Failure threshold to open | 3 consecutive failures | PARSER_CB_THRESHOLD |
| Cooldown before half-open | 30 seconds | PARSER_CB_COOLDOWN |
| Timeout per call | 2 seconds | AVAP_PARSER_TIMEOUT |
While the circuit is OPEN, CODE_GENERATION responses are returned immediately with validation_status = PARSER_UNAVAILABLE — no gRPC call is attempted. The cooldown prevents thundering-herd reconnection attempts.
[ptvl] circuit OPEN — skipping parser, returning unvalidated
[ptvl] circuit HALF-OPEN — probing parser availability
[ptvl] circuit CLOSED — parser reachable
Setting AVAP_PARSER_TIMEOUT=0 permanently opens the circuit — disables parser validation entirely. Useful during development when the parser service is not deployed.
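The state machine above fits in a few dozen lines. A sketch of the breaker logic, assuming the defaults from the table — class and method names are illustrative, and the clock is injectable so the cooldown can be tested without sleeping:

```python
import time

CLOSED, OPEN, HALF_OPEN = "CLOSED", "OPEN", "HALF-OPEN"

class CircuitBreaker:
    """Three-state breaker guarding the AVAP Parser gRPC calls (sketch)."""

    def __init__(self, threshold=3, cooldown=30.0, clock=time.monotonic):
        self.threshold = threshold   # PARSER_CB_THRESHOLD
        self.cooldown = cooldown     # PARSER_CB_COOLDOWN, seconds
        self.clock = clock
        self.state = CLOSED
        self.failures = 0
        self.opened_at = 0.0

    def allow_call(self) -> bool:
        """True if a parser call may be attempted right now."""
        if self.state == OPEN and self.clock() - self.opened_at >= self.cooldown:
            self.state = HALF_OPEN   # cooldown expired: allow one probe
        return self.state != OPEN

    def record_success(self):
        self.state = CLOSED
        self.failures = 0

    def record_failure(self):
        if self.state == HALF_OPEN:
            self.state = OPEN        # probe failed: re-open immediately
            self.opened_at = self.clock()
            return
        self.failures += 1
        if self.failures >= self.threshold:
            self.state = OPEN
            self.opened_at = self.clock()
```

While `allow_call()` returns False the engine skips the gRPC call entirely and sets `validation_status = PARSER_UNAVAILABLE`, matching the OPEN-circuit behavior described above.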
New environment variables
AVAP_PARSER_URL=grpc://... # URL of AVAP Parser gRPC service
AVAP_PARSER_TIMEOUT=2 # seconds per call; 0 = disable validation
PARSER_CB_THRESHOLD=3 # consecutive failures before circuit opens
PARSER_CB_COOLDOWN=30 # seconds before circuit attempts half-open probe
Decision 2 — RETRIEVAL: context relevance check with reformulation retry
Flow
retrieve node
│
▼
[V1] CONFIDENCE_PROMPT_TEMPLATE
YES / NO
│
├── YES ────────────────────────────────► generate node (normal path)
│
└── NO
│
▼
reformulate_with_hint node
[reformulate query signalling context was insufficient]
│
▼
retrieve (retry)
│
▼
generate node (regardless of second retrieval result)
On second retrieval, generation proceeds unconditionally. The retry is a single best-effort improvement, not a gate. The model generates with whatever context is available — the user receives a response, not a refusal.
Relevance check
CONFIDENCE_PROMPT_TEMPLATE evaluates whether the retrieved context contains at least one passage relevant to the question. It returns YES or NO — no graduated score.
The relevance check is only applied to RETRIEVAL queries here. CODE_GENERATION has the parser as a stronger post-generation signal; the pre-generation relevance check would add latency with lower signal value than the parser provides.
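Since the template returns a bare YES / NO, the graph has to normalize the raw LLM output defensively. A sketch — the function name and the fail-open default are assumptions, not the engine's actual code:

```python
def parse_relevance_verdict(raw: str) -> bool:
    """Map the raw CONFIDENCE_PROMPT_TEMPLATE output to a bool.

    Fails open: anything that does not clearly say NO is treated as
    relevant, so a malformed verdict degrades to the normal generation
    path instead of triggering a spurious reformulation retry.
    """
    verdict = raw.strip().upper()
    return not verdict.startswith("NO")
```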
Reformulation hint
When context is insufficient, the reformulation node receives a hint that standard context was not found. This produces a semantically different reformulation — broader synonyms, alternative phrasings — rather than a near-duplicate of the original query.
[CONTEXT_INSUFFICIENT]
The previous retrieval did not return relevant context for: "{original_query}"
Reformulate this query using broader terms or alternative phrasing.
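Building the hint is a mechanical template fill. A sketch, assuming the template text shown above (the function name is illustrative):

```python
def build_reformulation_hint(original_query: str) -> str:
    """Build the [CONTEXT_INSUFFICIENT] hint for reformulate_with_hint."""
    return (
        "[CONTEXT_INSUFFICIENT]\n"
        "The previous retrieval did not return relevant context for: "
        f'"{original_query}"\n'
        "Reformulate this query using broader terms or alternative phrasing."
    )
```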
Graph changes
New nodes
| Node | Graph | Trigger |
|---|---|---|
| validate_code | build_graph | After generate_code |
| generate_code_retry | build_graph | After validate_code when INVALID |
| check_context_relevance | build_graph + build_prepare_graph | After retrieve, before generate (RETRIEVAL only) |
| reformulate_with_hint | build_graph + build_prepare_graph | After check_context_relevance when NO |
Updated flow — build_graph
flowchart TD
START([start]) --> CL[classify]
CL -->|RETRIEVAL| RF[reformulate]
CL -->|CODE_GENERATION| RF
CL -->|CONVERSATIONAL| RC[respond_conversational]
CL -->|PLATFORM| RP[respond_platform]
RF --> RT[retrieve]
RT -->|CODE_GENERATION| GC[generate_code]
RT -->|RETRIEVAL| CR{check_context\nrelevance}
CR -->|YES| GE[generate]
CR -->|NO| RH[reformulate_with_hint]
RH --> RT2[retrieve retry]
RT2 --> GE
GC --> VC{validate_code\nParser gRPC}
VC -->|VALID| END([end])
VC -->|INVALID + trace| GCR[generate_code_retry\ntrace-guided]
GCR --> VC2{validate_code\nParser gRPC}
VC2 -->|VALID| END
VC2 -->|INVALID| END
GE --> END
RC --> END
RP --> END
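The VC / VC2 decision diamonds in the flow above reduce to one conditional-edge router. A sketch, with the state keys (parser_valid, retry_attempted) assumed for illustration rather than taken from the actual AgentState:

```python
def route_after_validate_code(state: dict) -> str:
    """Decide the next node after a parser validation call.

    VALID ends the graph; INVALID triggers the trace-guided retry
    once, then ends with validation_status set (RC-08: max 1 retry).
    """
    if state.get("parser_valid"):
        return "end"
    if state.get("retry_attempted"):
        state["validation_status"] = "INVALID_UNRESOLVED"
        return "end"
    return "generate_code_retry"
```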
New AgentState fields
class AgentState(TypedDict):
...
# PTVL fields
parser_trace: str # raw parser trace from first validation attempt (empty if valid)
validation_status: str # see validation status values below
context_relevant: bool # result of CONFIDENCE_PROMPT check (RETRIEVAL only)
Validation status values
| Value | Meaning | When set |
|---|---|---|
"" (empty) |
Valid — no issues detected | Parser returned VALID on first or second attempt |
INVALID_UNRESOLVED |
Parser ran, code failed both attempts | Two parser calls made, both returned INVALID |
PARSER_UNAVAILABLE |
Parser was unreachable or circuit is open | No parser call was made or all calls timed out |
These are semantically distinct signals. INVALID_UNRESOLVED means the engine has evidence the code is wrong. PARSER_UNAVAILABLE means the engine has no evidence either way — the code may be correct. Clients must not treat them equivalently.
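The three-way distinction collapses cleanly from the outcomes of the two parser attempts. A sketch — function and argument names are illustrative, with None meaning a call was never made or timed out:

```python
from typing import Optional

def resolve_validation_status(first: Optional[bool],
                              second: Optional[bool]) -> str:
    """Collapse up to two parser verdicts into a validation_status value.

    True/False are parser VALID/INVALID verdicts; None means the call
    was skipped (circuit open) or timed out.
    """
    if first is None and second is None:
        return "PARSER_UNAVAILABLE"
    if first or second:
        return ""                      # VALID on first or second attempt
    if first is False and second is False:
        return "INVALID_UNRESOLVED"    # both attempts ran, both INVALID
    return "PARSER_UNAVAILABLE"        # one verdict missing: no full evidence
```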
validation_status is surfaced to the client via AgentResponse:
message AgentResponse {
string text = 1;
string avap_code = 2;
bool is_final = 3;
string validation_status = 4; // "" | "INVALID_UNRESOLVED" | "PARSER_UNAVAILABLE"
}
Clients that do not read validation_status are unaffected — the field defaults to empty string.
Routing contract additions (RC-07, RC-08)
These rules extend the contract defined in ADR-0008.
RC-07 — Parser validation gate (priority: high)
Every CODE_GENERATION response MUST be submitted to the AVAP Parser before delivery to the client, unless AVAP_PARSER_TIMEOUT=0 or the parser service is unreachable.
route(q) = CODE_GENERATION → parser_validate(response) before yield
A CODE_GENERATION response returned without parser validation due to parser unavailability MUST be logged as [ptvl] parser unavailable — returning unvalidated.
RC-08 — Retry budget (priority: medium)
Each request has a maximum of 1 retry regardless of type. A CODE_GENERATION request that fails parser validation twice returns the second attempt with validation_status=INVALID_UNRESOLVED. A RETRIEVAL request whose context is insufficient reformulates once and generates unconditionally on the second retrieval.
No request may enter more than one retry cycle.
Consequences
Positive
- Syntactically invalid AVAP code no longer reaches users silently.
- validation_status gives the client a typed signal: INVALID_UNRESOLVED (evidence of bad code) vs PARSER_UNAVAILABLE (no evidence either way) — clients can respond differently to each.
- The parser trace makes retries targeted rather than blind — the LLM corrects specific lines, not the whole response.
- Circuit breaker prevents parser outages from adding latency to every CODE_GENERATION request. After 3 consecutive failures the engine stops trying for 30 seconds.
- Context relevance check catches retrievals that return topically adjacent but non-answering chunks, reducing fluent-but-ungrounded responses.
- AVAP_PARSER_TIMEOUT=0 allows development without the parser service — no hard dependency at startup.
Negative / Trade-offs
- CODE_GENERATION latency: +1 parser gRPC call per request (~50–200 ms for valid code); +1 LLM generation call plus 1 parser call on invalid code (~1–2 s additional).
- RETRIEVAL latency: +1 LLM call (relevance check) on every request. At qwen3:1.7b local inference, this adds ~300–500 ms to every RETRIEVAL request — not negligible.
- The parser becomes a soft production dependency for CODE_GENERATION. Parser outages degrade validation silently; monitoring must alert on sustained parser unavailable log volume.
- The context relevance check is a generative model doing a binary classification task — the same architectural mismatch noted in ADR-0008 for the classifier. It is the correct interim solution while no discriminative relevance model exists.
Open questions
- RETRIEVAL latency budget: The +300–500 ms from the relevance LLM call may be unacceptable for the VS Code extension use case, where streaming latency is user-visible. A discriminative relevance model (embedding similarity between query vector and context vector, cosine threshold) would be ~1 ms and eliminate this cost entirely. Deferred to a future amendment.
- validation_status UX: The proto field is defined but the client behavior is not specified. What should the VS Code extension or AVS Platform display when validation_status is non-empty? Requires a product decision outside this ADR's scope.
- Parser version pinning: Inherited from ADR-0007 open question 2. Parser upgrades may alter what is considered valid AVAP. A policy for handling parser version changes in the production pipeline has not been defined.
Future Path
The context relevance check for RETRIEVAL (Decision 2) uses a generative LLM for a discriminative task — the same pattern that ADR-0008 identified as tactical debt for the classifier. The correct steady-state implementation is a cosine similarity threshold between the query embedding vector and the average context embedding vector:
relevance_score = cosine(embed(query), mean(embed(chunks)))
if relevance_score < RELEVANCE_THRESHOLD:
reformulate_with_hint()
This runs in microseconds using the bge-m3 embeddings already computed during retrieval. It replaces the CONFIDENCE_PROMPT_TEMPLATE LLM call entirely and eliminates the +300–500ms latency penalty on every RETRIEVAL request.
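A dependency-free sketch of that check. RELEVANCE_THRESHOLD is an illustrative value here — it would need tuning against real bge-m3 embeddings before replacing the LLM call:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mean_vector(vectors: list[list[float]]) -> list[float]:
    """Component-wise mean of the retrieved chunk embeddings."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

RELEVANCE_THRESHOLD = 0.5  # illustrative; tune against Langfuse traces

def context_is_relevant(query_vec: list[float],
                        chunk_vecs: list[list[float]],
                        threshold: float = RELEVANCE_THRESHOLD) -> bool:
    """relevance_score = cosine(embed(query), mean(embed(chunks)))."""
    return cosine(query_vec, mean_vector(chunk_vecs)) >= threshold
```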
Trigger for this upgrade: once the RETRIEVAL validation LLM call appears as a measurable latency contribution in Langfuse traces.