ADR0008 Finished and ADR0009 finalized

This commit is contained in:
rafa-ruiz 2026-04-13 20:37:43 -07:00
parent 6118a77078
commit 8db95e50f4
2 changed files with 60 additions and 11 deletions


@ -3,6 +3,26 @@
All notable changes to the **Brunix Assistance Engine** will be documented in this file. This project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
---
## [1.7.1] - 2026-04-13
### Added
- ENGINE: `_run_generate_tests(user_request, generated_code, llm)` — module-level helper in `graph.py` that encapsulates the full test generation logic (LLM call, thinking stripping, JSON extraction, 15s timeout). Both the `generate_tests` LangGraph node and `AskAgentStream` delegate to it.
- ENGINE: `AskAgentStream` — test generation now runs before `_call_parser` in the streaming state machine. After a code block is buffered, `_run_generate_tests` is called to produce `test_inputs`/`test_list`; both are passed to `_call_parser` (first attempt and retry). `AskAgentStream` now has full execution + assertion validation parity with `AskAgent`.
- ENGINE: `_call_parser` — assertion KeyError guard. When the parser returns `{'success': False, 'error': "'variable_name'"}` (Python `KeyError` from evaluating a non-existent variable in an assertion), the engine drops the offending assertion and retries instead of invoking the LLM code-fix path. Prevents correct code from being incorrectly rewritten due to a test generation defect.
### Changed
- ENGINE: `generate_tests` LangGraph node refactored to delegate to `_run_generate_tests` (no behaviour change).
- ENGINE: `TEST_GENERATION_PROMPT` — strengthened `<avap_variables_rule>`: explicit rule that `addResult(x)` sends `x` to the HTTP response and does **not** create a variable; assertions must never reference names that only appear inside `addResult()`. Updated example to reinforce the correct pattern.
### Fixed
- ENGINE: `AskAgentStream` validation was execution-only with null inputs — `id_interno` resolved to `None` because no `test_inputs` were injected. Fixed by running `_run_generate_tests` before each `_call_parser` call in the streaming path.
- ENGINE: `generate_tests` was producing assertions on `str(result)` when code used `addResult(id_interno)` — `result` is not a variable, causing parser `KeyError` and incorrect LLM code-fix retry. Fixed by prompt improvement and the KeyError guard.
### Docs
- `docs/ADR/ADR-0009` updated: streaming state machine flow now shows `_run_generate_tests` step; difference table updated with test generation and assertion validation rows; `generate_tests` node section documents variable naming contract and KeyError guard; `AskAgentStream` graph section notes full parity.
---
## [1.7.0] - 2026-04-13
### Added


@ -1,7 +1,7 @@
# ADR-0009: Per-Type Response Validation Layer
**Date:** 2026-04-10
**Last updated:** 2026-04-13
**Last updated:** 2026-04-13 (rev 2)
**Status:** Implemented
**Deciders:** Rafael Ruiz (CTO)
**Related ADRs:** ADR-0007 (MSVL for RAG Evaluation), ADR-0008 (Adaptive Query Routing), ADR-0003 (Hybrid Retrieval RRF)
@ -16,7 +16,7 @@ ADR-0007 documents that 10–16% of `CODE_GENERATION` responses contain syntacti…
This problem exists in production today. A user receiving a `CODE_GENERATION` response has no indication that the generated code would fail on the PLATON kernel.
The AVAP Parser gRPC service — established in ADR-0007 as a hard dependency of the evaluation pipeline — is already available in the stack. It returns not just `VALID / INVALID` but a **complete line-by-line execution trace** on failure:
The AVAP Parser REST HTTP service (Tornado, port 8888) — established in ADR-0007 as a hard dependency of the evaluation pipeline — is already available in the stack. It returns not just `VALID / INVALID` but a **complete line-by-line execution trace** on failure:
```
Line 3: unknown command 'getSHA256' — expected known identifier
@ -113,14 +113,33 @@ generated_code: "addParam(\"client_id\", id_interno)\naddResult(id_interno)"
# Output
{
"test_inputs": {"client_id": "12345"},
"test_list": ["re.match(r'^\\d{5}$', str(id_interno))"]
"test_list": ["re.match(r'^\\d+$', str(id_interno))"]
}
```
`test_inputs` are injected as request variables when the parser executes the code. `test_list` items are regex assertions evaluated against the output variables after execution.
**Variable naming contract** (enforced by `TEST_GENERATION_PROMPT`):
- `addParam("request_param_name", avap_var)` → `test_inputs` key is `"request_param_name"`; assertion variable is `avap_var`.
- `addResult(x)` — adds `x` to the HTTP response body. **Does NOT create a variable.** Assertions must never reference a name that only appears inside `addResult()`.
- Only variables on the left side of `=`, the second arg of `addParam`, or the first arg of `addVar` are valid assertion targets.
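The contract above can be sketched as a small extractor. `valid_assertion_targets` and its regexes are hypothetical illustrations of the rule, not the engine's implementation:

```python
import re

# Hypothetical sketch of the variable naming contract: only assignment
# targets, the second arg of addParam, and the first arg of addVar are
# valid assertion targets. Names seen only inside addResult() are not
# variables. Regexes are illustrative, not the engine's actual code.
def valid_assertion_targets(avap_code: str) -> set:
    targets = set()
    for line in avap_code.splitlines():
        m = re.match(r"\s*([A-Za-z_]\w*)\s*=", line)  # left side of `=`
        if m:
            targets.add(m.group(1))
        m = re.search(r'addParam\(\s*"[^"]*"\s*,\s*([A-Za-z_]\w*)\s*\)', line)
        if m:
            targets.add(m.group(1))  # second arg of addParam
        m = re.search(r"addVar\(\s*([A-Za-z_]\w*)", line)
        if m:
            targets.add(m.group(1))  # first arg of addVar
    return targets

code = 'addParam("client_id", id_interno)\naddResult(id_interno)'
print(valid_assertion_targets(code))  # {'id_interno'}
```

Under this sketch, an assertion on any name outside the returned set would be a test generation defect of the kind the KeyError guard below the fold catches.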
If the LLM call fails or times out (15s hard limit), `generate_tests` returns empty `test_inputs` and `test_list` — validation continues in execution-only mode (no assertions). This keeps the node non-blocking.
The core generation logic is implemented in `_run_generate_tests(user_request, generated_code, llm)` — a module-level function in `graph.py`. Both the `generate_tests` LangGraph node and the `AskAgentStream` streaming state machine delegate to it.
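The non-blocking contract can be sketched as follows; `run_generate_tests` and `llm_call` are stand-in names, and only the 15s hard limit and the empty-fallback behaviour come from the text above:

```python
import concurrent.futures
import json

# Sketch of the non-blocking contract: run the LLM call under a hard
# timeout and fall back to empty tests on any failure. `llm_call` stands
# in for the real LLM client. Illustrative only, not the engine's code.
def run_generate_tests(user_request, generated_code, llm_call, timeout=15.0):
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        future = pool.submit(llm_call, user_request, generated_code)
        data = json.loads(future.result(timeout=timeout))
        return data.get("test_inputs", {}), data.get("test_list", [])
    except Exception:
        # Timeout, malformed JSON, or LLM error: degrade to execution-only
        # validation instead of blocking the response.
        return {}, []
    finally:
        pool.shutdown(wait=False)  # don't wait on a hung LLM call
```

Returning `({}, [])` rather than raising is what keeps the node non-blocking: the caller proceeds to `_call_parser` with no assertions.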
#### Assertion KeyError guard in `_call_parser`
If the parser returns `{'success': False, 'error': "'variable_name'"}` — a Python `KeyError` pattern — it means an assertion referenced a variable that does not exist in the execution scope (a test generation defect, not a code defect). In this case `_call_parser`:
1. Logs the event as `[ptvl] assertion KeyError: variable '...' not in scope — dropping affected assertions`.
2. Filters out any assertion in `test_list` that references `str(bad_var)`.
3. Retries `_call_parser` with the filtered list.
4. If all assertions were filtered out, returns `(True, "")` — the code itself executed successfully.
This prevents the LLM fix path from being triggered on a test generation error, which would incorrectly modify correct code.
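The guard can be sketched as a wrapper around the parser call. `call_with_keyerror_guard` and `call_parser` are illustrative names; only the error shape, the filter-and-retry steps, and the `(True, "")` fallback come from the steps above:

```python
import re

# Sketch of the assertion KeyError guard. `call_parser` stands in for the
# real parser client and returns a dict like {'success': bool, 'error': str}.
def call_with_keyerror_guard(call_parser, code, test_inputs, test_list):
    result = call_parser(code, test_inputs, test_list)
    if not result.get("success"):
        # A KeyError from an assertion surfaces as error == "'var_name'"
        m = re.fullmatch(r"'([A-Za-z_]\w*)'", result.get("error", ""))
        if m:
            bad = m.group(1)
            # Drop every assertion that references the missing variable
            kept = [a for a in test_list if bad not in a]
            if not kept:
                # All assertions were defective; the code itself ran fine.
                return True, ""
            result = call_parser(code, test_inputs, kept)
    return result.get("success", False), result.get("error", "")
```

The key design point is that the retry re-invokes the parser, never the LLM: a missing assertion variable is evidence against the tests, not against the code.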
#### Trace-guided retry
The parser error is injected into the retry prompt as a structured correction context:
@ -207,7 +226,7 @@ Code is extracted from the LLM response by scanning for the first markdown code
#### Parser availability — circuit breaker
A single timeout or connection error does not mean the parser is down. A sustained outage does. Hammering an unavailable gRPC service on every `CODE_GENERATION` request adds latency to every user request with zero benefit.
A single timeout or connection error does not mean the parser is down. A sustained outage does. Hammering an unavailable REST service on every `CODE_GENERATION` request adds latency to every user request with zero benefit.
The engine implements a **circuit breaker** with three states:
@ -224,7 +243,7 @@ HALF-OPEN ──[probe fails]───────────► OPEN
| Cooldown before half-open | 30 seconds | `PARSER_CB_COOLDOWN` |
| Timeout per call | 2 seconds | `AVAP_PARSER_TIMEOUT` |
While the circuit is **OPEN**, `CODE_GENERATION` responses are returned immediately with `validation_status = PARSER_UNAVAILABLE` — no gRPC call is attempted. The cooldown prevents thundering-herd reconnection attempts.
While the circuit is **OPEN**, `CODE_GENERATION` responses are returned immediately with `validation_status = PARSER_UNAVAILABLE` — no HTTP call is attempted. The cooldown prevents thundering-herd reconnection attempts.
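The three states can be sketched with a minimal breaker class. The 30-second cooldown follows the config table above; the failure threshold of 3, the injectable clock, and all names are assumptions for illustration:

```python
import time

# Toy three-state circuit breaker matching the states described above.
# The 30 s cooldown mirrors PARSER_CB_COOLDOWN; the threshold of 3
# consecutive failures is an assumed value. Not the engine's actual class.
class CircuitBreaker:
    def __init__(self, threshold=3, cooldown=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None while CLOSED

    def state(self):
        if self.opened_at is None:
            return "CLOSED"
        if self.clock() - self.opened_at >= self.cooldown:
            return "HALF-OPEN"  # cooldown elapsed: allow one probe call
        return "OPEN"

    def allow_call(self):
        return self.state() != "OPEN"

    def record_success(self):
        self.failures, self.opened_at = 0, None  # probe succeeded: close

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold or self.opened_at is not None:
            self.opened_at = self.clock()  # open, or re-open after failed probe
```

An injectable clock keeps the cooldown testable without sleeping, which also avoids the thundering-herd pattern the text warns about: while `allow_call()` is false, no network attempt is made at all.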
```
[ptvl] circuit OPEN — skipping parser, returning unvalidated
@ -291,8 +310,6 @@ Reformulate this query using broader terms or alternative phrasing.
---
### Decision 3 — AskAgentStream: streaming state machine for CODE_GENERATION
The `AskAgentStream` path streams tokens directly to the client. Post-generation validation (Decision 1) cannot run here because tokens are already yielded before the response is complete.
@ -315,7 +332,14 @@ STATE: CODE (buffering — nothing yielded to client)
→ detect closing ``` after first newline
│ closing ``` detected
_call_parser(complete_block)
_extract_avap_code(complete_block)
_run_generate_tests(query, avap_code, llm) ← same helper as build_graph path
[produces test_inputs + test_list, 15s timeout]
_call_parser(complete_block, test_inputs, test_list)
├── VALID ──────────────────────────────► yield code block → back to TEXT
@ -326,7 +350,7 @@ _call_parser(complete_block)
│ "Fix this AVAP code: {trace}"
│ │
│ ▼
│ _call_parser(fixed_block)
│ _call_parser(fixed_block, test_inputs, test_list) ← same tests reused
│ ├── VALID ──────────────────────► yield fixed block → back to TEXT
│ ├── INVALID ────────────────────► yield fixed block + INVALID_UNRESOLVED → back to TEXT
│ └── UNAVAILABLE ────────────────► yield fixed block + PARSER_UNAVAILABLE → back to TEXT
@ -337,7 +361,7 @@ _call_parser(complete_block)
#### Key properties
- Text before and after the code block streams to the client without delay.
- Only the code block itself introduces latency (one parser call, optionally one LLM fix call + one parser call).
- Only the code block itself introduces latency (one `_run_generate_tests` LLM call + one parser call, optionally one LLM fix call + one additional parser call).
- The fix call asks the LLM to correct **only the code block**, not the full response — the text already streamed to the client remains valid.
- If the response contains multiple code blocks, each block is processed independently in sequence. `validation_status` in the final `is_final=True` message reflects the last block validated.
- If the stream ends while still in CODE mode (malformed response without closing fence), the buffer is flushed as-is.
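The TEXT/CODE buffering above can be sketched as a generator. This assumes line-aligned chunks and uses `validate` as a stand-in for the `_run_generate_tests` + `_call_parser` sequence; it is illustrative, not the server code:

```python
# Sketch of the streaming state machine: text outside fenced blocks is
# yielded immediately; a code block is buffered until its closing fence,
# handed to `validate`, then yielded. A dangling block is flushed as-is.
def stream_with_validation(chunks, validate):
    mode, buf = "TEXT", []
    for chunk in chunks:
        for line in chunk.splitlines(keepends=True):
            if mode == "TEXT":
                if line.lstrip().startswith("```"):
                    mode, buf = "CODE", [line]  # start buffering
                else:
                    yield line  # text streams through without delay
            else:  # CODE: buffer until the closing fence
                buf.append(line)
                if line.lstrip().startswith("```"):
                    block = "".join(buf)
                    validate(block)  # real path may rewrite/annotate here
                    yield block
                    mode, buf = "TEXT", []
    if buf:  # stream ended mid-block (no closing fence): flush as-is
        yield "".join(buf)
```

Note how the latency property falls out of the structure: only the `yield block` after `validate` waits on the parser, while every TEXT-mode line is emitted the moment it arrives.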
@ -347,7 +371,10 @@ _call_parser(complete_block)
| Property | AskAgent | AskAgentStream |
|---|---|---|
| Validation point | Post-generation, pre-delivery | Inline, at code block boundary |
| Test generation | `generate_tests` LangGraph node | `_run_generate_tests()` called directly before `_call_parser` |
| Execution + assertion validation | `validate_code` node → `/api/v1/execute` with `test_inputs`/`test_list` | `_call_parser()` with `test_inputs`/`test_list` |
| Retry mechanism | Full `generate_code_retry` LangGraph node | Targeted LLM fix call (code block only) |
| Retry reuses same tests | Yes (state carries `test_inputs`/`test_list`) | Yes (`_ti`/`_tl` captured before first call) |
| Text streaming | N/A (non-streaming) | Uninterrupted |
| `validation_status` delivery | In `AgentResponse` (only response) | In final `AgentResponse` (`is_final=True`) |
@ -371,6 +398,8 @@ _call_parser(complete_block)
No new LangGraph nodes. The validation logic runs as a **streaming state machine** inline in `server.py:AskAgentStream`. See Decision 3.
Test generation and execution-level validation now have full parity with the `AskAgent` path: `_run_generate_tests` is called after each code block is buffered, and `_call_parser` receives the resulting `test_inputs`/`test_list`. Both paths share the same module-level `_run_generate_tests` implementation.
### Updated flow — `build_graph`
```mermaid
@ -442,7 +471,7 @@ message AgentResponse {
}
```
**Implementation status: complete (2026-04-12).** Field 4 added to `brunix.proto`. Populated in both `AskAgent` (from `final_state`) and `AskAgentStream` (in the `is_final=True` message at end of stream).
**Implementation status: complete (2026-04-13).** Field 4 added to `brunix.proto`. Populated in both `AskAgent` (from `final_state`) and `AskAgentStream` (in the `is_final=True` message at end of stream). Full test generation + assertion validation parity between both paths achieved 2026-04-13.
Clients that do not read `validation_status` are unaffected — the field defaults to empty string.