
ADR-0009: Per-Type Response Validation Layer

Date: 2026-04-10 Status: Accepted Deciders: Rafael Ruiz (CTO) Related ADRs: ADR-0007 (MSVL for RAG Evaluation), ADR-0008 (Adaptive Query Routing), ADR-0003 (Hybrid Retrieval RRF)


Context

Problem 1 — Syntactically invalid code reaches users

ADR-0007 documents that 10–16% of CODE_GENERATION responses contain syntactically invalid AVAP code — foreign language injection (Go, Python, JavaScript) or hallucinated commands (getSHA256, readParam, returnResult). The LLM judge used in EvaluateRAG does not detect these failures because it evaluates semantic coherence, not syntactic validity.

This problem exists in production today. A user receiving a CODE_GENERATION response has no indication that the generated code would fail on the PLATON kernel.

The AVAP Parser gRPC service — established in ADR-0007 as a hard dependency of the evaluation pipeline — is already available in the stack. It returns not just VALID / INVALID but a complete line-by-line execution trace on failure:

Line 3: unknown command 'getSHA256' — expected known identifier
Line 7: unexpected construct 'for i in range(...)' — AVAP loop syntax required
Line 12: 'returnResult' not defined — did you mean 'addResult'?

This trace is structured, specific, and directly actionable by the LLM. A retry informed by the parser trace is fundamentally different from a blind retry — the model knows exactly what failed and where.

Problem 2 — Context relevance is not evaluated pre-generation

The engine retrieves 8 chunks from Elasticsearch for every RETRIEVAL and CODE_GENERATION query without checking whether those chunks actually answer the question. CONFIDENCE_PROMPT_TEMPLATE has been scaffolded in prompts.py since the initial implementation but is not wired into the graph.

Undetected low-relevance retrieval produces responses that are semantically fluent but factually ungrounded — the model generates plausible-sounding AVAP explanations or code not supported by the retrieved documentation.

Why these are one ADR

Both problems share the same architectural response: adding validation nodes to the production graph with type-specific logic and feedback-informed retry. The decision is one — add a per-type validation layer — and the implementation shares the same graph positions (post-retrieve, post-generate), the same retry contract (maximum 1 retry per request), and the same rationale (the engine must not silently return responses it has evidence to question).

Splitting into separate ADRs would produce two documents that cannot be understood independently.


Decision

Add a Per-Type Response Validation Layer (PTVL) to the production LangGraph pipeline. Each query type has a distinct validation strategy matching its failure modes.

Validation contract by type

| Type | When | What | Mechanism |
| --- | --- | --- | --- |
| CODE_GENERATION | Post-generation | Syntactic validity of generated AVAP code | AVAP Parser gRPC — deterministic |
| RETRIEVAL | Pre-generation | Relevance of retrieved context to the query | LLM relevance check — CONFIDENCE_PROMPT_TEMPLATE |
| CONVERSATIONAL | None | n/a | No retrieval, no code generated |
| PLATFORM | None | n/a | No retrieval, no code generated |

Decision 1 — CODE_GENERATION: parser validation with trace-guided retry

Flow

generate_code node
    │
    ▼
[V1] AVAP Parser gRPC
    │
    ├── VALID ──────────────────────────────► return response
    │
    └── INVALID + line-by-line trace
            │
            ▼
        [inject trace into retry prompt]
            │
            ▼
        generate_code_retry node  (1 attempt only)
            │
            ▼
        [V2] AVAP Parser gRPC
            │
            ├── VALID ──────────────────────► return response
            │
            └── INVALID ────────────────────► return response + validation_status flag
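
The two-pass flow above can be sketched as a driver function. `generate` and `parse` are hypothetical callables standing in for the generate_code node and the AVAP Parser gRPC stub — the real node and stub signatures may differ:

```python
def generate_validated_code(generate, parse):
    """Run the [V1] -> retry -> [V2] flow with at most one trace-guided retry.

    generate(feedback) returns AVAP code; parse(code) returns (is_valid, trace).
    Returns (code, validation_status), where "" means the code validated.
    """
    code = generate(feedback=None)
    valid, trace = parse(code)
    if valid:
        return code, ""                          # [V1] VALID: return immediately
    code = generate(feedback=trace)              # inject trace into retry prompt
    valid, _ = parse(code)                       # [V2] second and final check
    return code, ("" if valid else "INVALID_UNRESOLVED")
```

Note the asymmetry: a second INVALID does not trigger a third attempt — the response is returned with the status flag set, per the retry budget.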

Trace-guided retry

The parser trace is injected into the generation prompt as a structured correction context:

<parser_feedback>
The previous attempt produced invalid AVAP code. Specific failures:

Line 3: unknown command 'getSHA256' — expected known identifier
Line 7: unexpected construct 'for i in range(...)' — AVAP loop syntax required

Correct these errors. Do not repeat the same constructs.
</parser_feedback>

This is not a blind retry. The LLM receives the exact failure points and can target its corrections. ADR-0007 documented the mapping between common hallucinated commands and their valid AVAP equivalents (getSHA256 → encodeSHA256, returnResult → addResult, etc.) — the trace makes these corrections automatic without hardcoding the mapping.
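
As a sketch, the injection can be a simple wrapper that joins the raw trace lines into the block shown above (the helper name is illustrative, not the engine's actual API):

```python
def build_parser_feedback(trace_lines):
    """Wrap raw parser trace lines in the <parser_feedback> correction block."""
    joined = "\n".join(trace_lines)
    return (
        "<parser_feedback>\n"
        "The previous attempt produced invalid AVAP code. Specific failures:\n"
        "\n"
        f"{joined}\n"
        "\n"
        "Correct these errors. Do not repeat the same constructs.\n"
        "</parser_feedback>"
    )
```

The resulting block is prepended to the retry prompt so the model sees the failures before the original request.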

Parser SLA

Inherited from ADR-0007: ≤2 seconds per call. Silent fallback is permitted in production (unlike evaluation, where ADR-0007 mandates abort). The distinction is that evaluation scores must be trustworthy; production responses degrade gracefully.

Parser availability — circuit breaker

A single timeout or connection error does not mean the parser is down. A sustained outage does. Hammering an unavailable gRPC service on every CODE_GENERATION request adds latency to every user request with zero benefit.

The engine implements a circuit breaker with three states:

CLOSED  ──[N consecutive failures]──► OPEN
OPEN    ──[cooldown expires]────────► HALF-OPEN
HALF-OPEN ──[probe succeeds]────────► CLOSED
HALF-OPEN ──[probe fails]───────────► OPEN

| Parameter | Default | Env var |
| --- | --- | --- |
| Failure threshold to open | 3 consecutive failures | PARSER_CB_THRESHOLD |
| Cooldown before half-open | 30 seconds | PARSER_CB_COOLDOWN |
| Timeout per call | 2 seconds | AVAP_PARSER_TIMEOUT |

While the circuit is OPEN, CODE_GENERATION responses are returned immediately with validation_status = PARSER_UNAVAILABLE — no gRPC call is attempted. The cooldown prevents thundering-herd reconnection attempts.

[ptvl] circuit OPEN — skipping parser, returning unvalidated
[ptvl] circuit HALF-OPEN — probing parser availability
[ptvl] circuit CLOSED — parser reachable

Setting AVAP_PARSER_TIMEOUT=0 permanently opens the circuit, disabling parser validation entirely. This is useful during development when the parser service is not deployed.
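
A minimal sketch of such a breaker, with defaults matching the table above (class and attribute names are illustrative, not the engine's actual implementation):

```python
import time

class ParserCircuitBreaker:
    """Three-state breaker: CLOSED -> OPEN -> HALF-OPEN -> CLOSED/OPEN."""

    def __init__(self, threshold=3, cooldown=30.0, clock=time.monotonic):
        self.threshold = threshold   # PARSER_CB_THRESHOLD
        self.cooldown = cooldown     # PARSER_CB_COOLDOWN
        self.clock = clock           # injectable for testing
        self.failures = 0
        self.opened_at = None        # None while CLOSED

    @property
    def state(self):
        if self.opened_at is None:
            return "CLOSED"
        if self.clock() - self.opened_at >= self.cooldown:
            return "HALF-OPEN"       # cooldown expired: allow one probe
        return "OPEN"

    def allow_call(self):
        return self.state != "OPEN"  # OPEN: skip gRPC entirely

    def record_success(self):
        self.failures = 0
        self.opened_at = None        # probe succeeded: back to CLOSED

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()  # open, or re-open after a failed probe
```

While `allow_call()` is false, the validate_code node sets validation_status = PARSER_UNAVAILABLE without touching the network.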

New environment variables

AVAP_PARSER_URL=grpc://...    # URL of AVAP Parser gRPC service
AVAP_PARSER_TIMEOUT=2         # seconds per call; 0 = disable validation
PARSER_CB_THRESHOLD=3         # consecutive failures before circuit opens
PARSER_CB_COOLDOWN=30         # seconds before circuit attempts half-open probe

Decision 2 — RETRIEVAL: context relevance check with reformulation retry

Flow

retrieve node
    │
    ▼
[V1] CONFIDENCE_PROMPT_TEMPLATE
    YES / NO
    │
    ├── YES ────────────────────────────────► generate node (normal path)
    │
    └── NO
            │
            ▼
        reformulate_with_hint node
        [reformulate query signalling context was insufficient]
            │
            ▼
        retrieve (retry)
            │
            ▼
        generate node  (regardless of second retrieval result)

On second retrieval, generation proceeds unconditionally. The retry is a single best-effort improvement, not a gate. The model generates with whatever context is available — the user receives a response, not a refusal.
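
This best-effort contract can be expressed as a conditional-edge function. `context_relevant` is the AgentState field defined later in this ADR; `retrieval_retry_used` is a hypothetical flag marking that the single retry has been spent:

```python
def route_after_relevance_check(state):
    """Conditional-edge sketch: relevant context goes straight to generate;
    otherwise one hinted reformulation, after which generation proceeds
    unconditionally (the retry is best-effort, not a gate)."""
    if state["context_relevant"]:
        return "generate"
    if state.get("retrieval_retry_used"):  # hypothetical flag: budget spent
        return "generate"                  # generate with whatever context exists
    return "reformulate_with_hint"
```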

Relevance check

CONFIDENCE_PROMPT_TEMPLATE evaluates whether the retrieved context contains at least one passage relevant to the question. It returns YES or NO — no graduated score.

The relevance check is only applied to RETRIEVAL queries here. CODE_GENERATION has the parser as a stronger post-generation signal; the pre-generation relevance check would add latency with lower signal value than the parser provides.

Reformulation hint

When context is insufficient, the reformulation node receives a hint that standard context was not found. This produces a semantically different reformulation — broader synonyms, alternative phrasings — rather than a near-duplicate of the original query.

[CONTEXT_INSUFFICIENT]
The previous retrieval did not return relevant context for: "{original_query}"
Reformulate this query using broader terms or alternative phrasing.

Graph changes

New nodes

| Node | Graph | Trigger |
| --- | --- | --- |
| validate_code | build_graph | After generate_code |
| generate_code_retry | build_graph | After validate_code when INVALID |
| check_context_relevance | build_graph + build_prepare_graph | After retrieve, before generate (RETRIEVAL only) |
| reformulate_with_hint | build_graph + build_prepare_graph | After check_context_relevance when NO |

Updated flow — build_graph

flowchart TD
    START([start]) --> CL[classify]

    CL -->|RETRIEVAL| RF[reformulate]
    CL -->|CODE_GENERATION| RF
    CL -->|CONVERSATIONAL| RC[respond_conversational]
    CL -->|PLATFORM| RP[respond_platform]

    RF --> RT[retrieve]

    RT -->|CODE_GENERATION| GC[generate_code]
    RT -->|RETRIEVAL| CR{check_context\nrelevance}

    CR -->|YES| GE[generate]
    CR -->|NO| RH[reformulate_with_hint]
    RH --> RT2[retrieve retry]
    RT2 --> GE

    GC --> VC{validate_code\nParser gRPC}
    VC -->|VALID| END([end])
    VC -->|INVALID + trace| GCR[generate_code_retry\ntrace-guided]
    GCR --> VC2{validate_code\nParser gRPC}
    VC2 -->|VALID| END
    VC2 -->|INVALID| END

    GE --> END
    RC --> END
    RP --> END

New AgentState fields

class AgentState(TypedDict):
    ...
    # PTVL fields
    parser_trace:       str   # raw parser trace from first validation attempt (empty if valid)
    validation_status:  str   # see validation status values below
    context_relevant:   bool  # result of CONFIDENCE_PROMPT check (RETRIEVAL only)

Validation status values

| Value | Meaning | When set |
| --- | --- | --- |
| "" (empty) | Valid — no issues detected | Parser returned VALID on first or second attempt |
| INVALID_UNRESOLVED | Parser ran, code failed both attempts | Two parser calls made, both returned INVALID |
| PARSER_UNAVAILABLE | Parser was unreachable or circuit is open | No parser call was made or all calls timed out |

These are semantically distinct signals. INVALID_UNRESOLVED means the engine has evidence the code is wrong. PARSER_UNAVAILABLE means the engine has no evidence either way — the code may be correct. Clients must not treat them equivalently.

validation_status is surfaced to the client via AgentResponse:

message AgentResponse {
  string text              = 1;
  string avap_code         = 2;
  bool   is_final          = 3;
  string validation_status = 4;  // "" | "INVALID_UNRESOLVED" | "PARSER_UNAVAILABLE"
}

Clients that do not read validation_status are unaffected — the field defaults to empty string.
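
A client that does read the field might branch on the three values like this (the handling decisions are illustrative; actual client UX is an open question below):

```python
def describe_validation(status):
    """Map validation_status to a client-side handling decision."""
    if status == "":
        return "ok"               # parser confirmed the code is valid
    if status == "INVALID_UNRESOLVED":
        return "warn-invalid"     # evidence the code is wrong: show a warning
    if status == "PARSER_UNAVAILABLE":
        return "warn-unvalidated" # no evidence either way: softer notice
    raise ValueError(f"unknown validation_status: {status!r}")
```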


Routing contract additions (RC-07, RC-08)

These rules extend the contract defined in ADR-0008.

RC-07 — Parser validation gate (priority: high)

Every CODE_GENERATION response MUST be submitted to the AVAP Parser before delivery to the client, unless AVAP_PARSER_TIMEOUT=0 or the parser service is unreachable.

route(q) = CODE_GENERATION → parser_validate(response) before yield

A CODE_GENERATION response returned without parser validation due to parser unavailability MUST be logged as [ptvl] parser unavailable — returning unvalidated.

RC-08 — Retry budget (priority: medium)

Each request has a maximum of 1 retry regardless of type. A CODE_GENERATION request that fails parser validation twice returns the second attempt with validation_status=INVALID_UNRESOLVED. A RETRIEVAL request whose context is insufficient reformulates once and generates unconditionally on the second retrieval.

No request may enter more than one retry cycle.


Consequences

Positive

  • Syntactically invalid AVAP code no longer reaches users silently. validation_status gives the client a typed signal: INVALID_UNRESOLVED (evidence of bad code) vs PARSER_UNAVAILABLE (no evidence either way) — clients can respond differently to each.
  • The parser trace makes retries targeted rather than blind — the LLM corrects specific lines, not the whole response.
  • Circuit breaker prevents parser outages from adding latency to every CODE_GENERATION request. After 3 consecutive failures the engine stops trying for 30 seconds.
  • Context relevance check catches retrievals that return topically adjacent but non-answering chunks, reducing fluent-but-ungrounded responses.
  • AVAP_PARSER_TIMEOUT=0 allows development without the parser service — no hard dependency at startup.

Negative / Trade-offs

  • CODE_GENERATION latency: +1 parser gRPC call per request (~50–200ms for valid code). +1 LLM generation call + 1 parser call on invalid code (~1–2s additional).
  • RETRIEVAL latency: +1 LLM call (relevance check) on every request. At qwen3:1.7b local inference, this adds ~300–500ms to every RETRIEVAL request — not negligible.
  • The parser becomes a soft production dependency for CODE_GENERATION. Parser outages degrade validation silently; monitoring must alert on sustained parser unavailable log volume.
  • The context relevance check is a generative model doing a binary classification task — the same architectural mismatch noted in ADR-0008 for the classifier. It is the correct interim solution while no discriminative relevance model exists.

Open questions

  1. RETRIEVAL latency budget: The +300–500ms from the relevance LLM call may be unacceptable for the VS Code extension use case where streaming latency is user-visible. A discriminative relevance model (embedding similarity between query vector and context vector, cosine threshold) would be ~1ms and eliminate this cost entirely. Deferred to a future amendment.

  2. validation_status UX: The proto field is defined but the client behavior is not specified. What should the VS Code extension or AVS Platform display when validation_status is INVALID_UNRESOLVED or PARSER_UNAVAILABLE? Requires a product decision outside this ADR's scope.

  3. Parser version pinning: Inherited from ADR-0007 open question 2. Parser upgrades may alter what is considered valid AVAP. A policy for handling parser version changes in the production pipeline has not been defined.


Future Path

The context relevance check for RETRIEVAL (Decision 2) uses a generative LLM for a discriminative task — the same pattern that ADR-0008 identified as tactical debt for the classifier. The correct steady-state implementation is a cosine similarity threshold between the query embedding vector and the average context embedding vector:

relevance_score = cosine(embed(query), mean(embed(chunks)))
if relevance_score < RELEVANCE_THRESHOLD:
    reformulate_with_hint()

This runs in microseconds using the bge-m3 embeddings already computed during retrieval. It replaces the CONFIDENCE_PROMPT_TEMPLATE LLM call entirely and eliminates the +300–500ms latency penalty on every RETRIEVAL request.
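
A runnable sketch of the check, using plain Python lists in place of the stored bge-m3 vectors (the threshold value is an assumption that would need tuning on real query/chunk pairs):

```python
import math

RELEVANCE_THRESHOLD = 0.5  # assumption: must be tuned on real retrieval data

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_vector(vectors):
    """Element-wise mean of the chunk embedding vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def context_is_relevant(query_vec, chunk_vecs, threshold=RELEVANCE_THRESHOLD):
    """True when the query embedding is close enough to the mean chunk embedding."""
    return cosine(query_vec, mean_vector(chunk_vecs)) >= threshold
```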

Trigger for this upgrade: once the RETRIEVAL validation LLM call appears as a measurable latency contribution in Langfuse traces.