# AVAP Chunker — Language Configuration Reference

> **File:** `scripts/pipelines/ingestion/avap_config.json`
> **Used by:** `avap_chunker.py` (Pipeline B)
> **Last updated:** 2026-03-18

This file is the **grammar definition** for the AVAP language chunker. It tells `avap_chunker.py` how to tokenize, parse, and semantically classify `.avap` source files before they are embedded and ingested into Elasticsearch. Modifying this file changes what the chunker recognises as a block, a statement, or a semantic feature — and therefore what metadata every chunk in the knowledge base carries.

---

## Table of Contents

1. [Top-Level Fields](#1-top-level-fields)
2. [Lexer](#2-lexer)
3. [Blocks](#3-blocks)
4. [Statements](#4-statements)
5. [Semantic Tags](#5-semantic-tags)
6. [How They Work Together](#6-how-they-work-together)
7. [Adding New Constructs](#7-adding-new-constructs)
8. [Full Annotated Example](#8-full-annotated-example)

---

## 1. Top-Level Fields

```json
{
  "language": "avap",
  "version": "1.0",
  "file_extensions": [".avap"]
}
```

| Field | Type | Description |
|---|---|---|
| `language` | string | Human-readable language name. Used in chunker progress reports. |
| `version` | string | Config schema version. Increment when making breaking changes. |
| `file_extensions` | array of strings | File extensions the chunker will process. `.md` files are always processed regardless of this setting. |
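
As a rough illustration of how `file_extensions` and the `.md` rule combine, the sketch below decides whether a path would be picked up. The helper name `should_process` and the case-insensitive comparison are assumptions made for this example; they are not taken from `avap_chunker.py`.

```python
from pathlib import Path

# Hypothetical helper, not part of avap_chunker.py.
def should_process(path, config):
    """Return True if the file would be chunked under the rules described above."""
    suffix = Path(path).suffix.lower()  # assumption: extensions compare case-insensitively
    if suffix == ".md":
        return True  # Markdown is always processed, whatever the config says
    return suffix in {ext.lower() for ext in config["file_extensions"]}

config = {"language": "avap", "version": "1.0", "file_extensions": [".avap"]}
print(should_process("docs/samples/login.avap", config))  # True
print(should_process("docs/LRM/avap.md", config))         # True
print(should_process("scripts/tool.py", config))          # False
```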

---

## 2. Lexer

The lexer section controls how raw source lines are stripped of comments and string literals before pattern matching is applied.

```json
"lexer": {
  "string_delimiters": ["\"", "'"],
  "escape_char": "\\",
  "comment_line":  ["///", "//"],
  "comment_block": { "open": "/*", "close": "*/" },
  "line_oriented": true
}
```

| Field | Type | Description |
|---|---|---|
| `string_delimiters` | array of strings | Characters that open and close string literals. Content inside strings is ignored during pattern matching. |
| `escape_char` | string | Character used to escape the next character inside a string. Prevents `\"` from closing the string. |
| `comment_line` | array of strings | Line comment prefixes, evaluated longest-first. Everything after the matched prefix is stripped. AVAP supports both `///` (documentation comments) and `//` (inline comments). |
| `comment_block.open` | string | Block comment opening delimiter. |
| `comment_block.close` | string | Block comment closing delimiter. Content between `/*` and `*/` is stripped before pattern matching. |
| `line_oriented` | bool | When `true`, the lexer processes one line at a time. Should always be `true` for AVAP. |

**Important:** Comment stripping and string boundary detection happen before any block or statement pattern is evaluated. A keyword inside a string literal or a comment will never trigger a block or statement match.
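
The stripping step can be pictured with the following minimal sketch. It is an illustration under assumptions, not the chunker's actual code: the function name `clean_line` is hypothetical, string contents are simply dropped (the delimiters are kept), and block-comment state is passed in and out explicitly because the lexer is line-oriented.

```python
# Lexer settings copied from the config above (as Python literals).
LEXER = {
    "string_delimiters": ['"', "'"],
    "escape_char": "\\",
    "comment_line": ["///", "//"],
    "comment_block": {"open": "/*", "close": "*/"},
}

def clean_line(line, lexer=LEXER, in_block_comment=False):
    """Return (clean_text, still_in_block_comment) for one source line."""
    open_tok = lexer["comment_block"]["open"]
    close_tok = lexer["comment_block"]["close"]
    out = []
    quote = None  # the delimiter of the string we are inside, if any
    i = 0
    while i < len(line):
        ch = line[i]
        if in_block_comment:
            if line.startswith(close_tok, i):
                in_block_comment = False
                i += len(close_tok)
            else:
                i += 1
        elif quote is not None:
            if ch == lexer["escape_char"]:
                i += 2  # skip the escape char and the character it protects
            else:
                if ch == quote:
                    quote = None
                    out.append(ch)  # keep the closing delimiter
                i += 1
        elif ch in lexer["string_delimiters"]:
            quote = ch
            out.append(ch)  # keep the opening delimiter, drop the contents
            i += 1
        elif line.startswith(open_tok, i):
            in_block_comment = True
            i += len(open_tok)
        elif any(line.startswith(prefix, i) for prefix in lexer["comment_line"]):
            break  # the rest of the line is a comment
        else:
            out.append(ch)
            i += 1
    return "".join(out), in_block_comment

# The keyword inside the string and the trailing comment both disappear.
print(clean_line('addVar(msg = "try() // no") // trailing note'))
```

In this sketch, `try()` inside the string literal never reaches the block patterns, which is the behaviour the note above describes.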

---

## 3. Blocks

Blocks are **multi-line constructs** with a defined opener and closer. The chunker tracks nesting depth — each opener increments depth, each closer decrements it, and the block ends when depth returns to zero. This correctly handles nested `if()` inside `function{}` and similar cases.

Each block definition produces a chunk with `doc_type` as specified and `block_type` equal to the block `name`.

```json
"blocks": [
  {
    "name": "function",
    "doc_type": "code",
    "opener_pattern": "^\\s*function\\s+(\\w+)\\s*\\(([^)]*)",
    "closer_pattern": "^\\s*\\}\\s*$",
    "extract_signature": true,
    "signature_template": "function {group1}({group2})"
  },
  ...
]
```

### Block fields

| Field | Type | Required | Description |
|---|---|---|---|
| `name` | string | Yes | Identifier for this block type. Used as `block_type` in the chunk metadata and in the `semantic_overlap` context header. |
| `doc_type` | string | Yes | Elasticsearch `doc_type` field value for chunks from this block. |
| `opener_pattern` | regex string | Yes | Pattern matched against the clean (comment-stripped) line to detect the start of this block. Must be anchored at the start (`^`). |
| `closer_pattern` | regex string | Yes | Pattern matched to detect the end of this block. Checked at every line after the opener. |
| `extract_signature` | bool | No (default: `false`) | When `true`, the chunker extracts a compact signature string from the opener line using capture groups, and creates an additional `function_signature` chunk alongside the full block chunk. |
| `signature_template` | string | No | Template for the signature string. Uses `{group1}`, `{group2}`, etc. as placeholders for the regex capture groups from `opener_pattern`. |
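
To make the relationship between `opener_pattern` and `signature_template` concrete, here is a small sketch of the placeholder substitution. The function name `render_signature` is hypothetical, and the real chunker may normalise whitespace or handle missing groups differently.

```python
import re

# Hypothetical helper illustrating {groupN} substitution.
def render_signature(opener_pattern, signature_template, line):
    """Fill {group1}, {group2}, ... from the opener's capture groups."""
    match = re.search(opener_pattern, line)
    if match is None:
        return None  # the line is not an opener for this block type
    signature = signature_template
    for index, value in enumerate(match.groups(), start=1):
        signature = signature.replace("{group%d}" % index, value or "")
    return signature

# The pattern is the unescaped form of the JSON string shown above.
opener = r"^\s*function\s+(\w+)\s*\(([^)]*)"
template = "function {group1}({group2})"
print(render_signature(opener, template, "function validateAccess(userId, token) {"))
# function validateAccess(userId, token)
```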

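### Current block definitions

Patterns in this section are shown JSON-escaped, exactly as they appear in `avap_config.json`. Once the JSON is parsed, each `\\` becomes a single `\` in the effective regex.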

#### `function`

```
opener:  ^\\s*function\\s+(\\w+)\\s*\\(([^)]*)
closer:  ^\\s*\\}\\s*$
```

Matches any top-level or nested AVAP function declaration. The two capture groups extract the function name (`group1`) and parameter list (`group2`), which are combined into the signature template `function {group1}({group2})`.

Because `extract_signature: true`, every function produces **two chunks**:
1. A `doc_type: "code"`, `block_type: "function"` chunk containing the full function body.
2. A `doc_type: "function_signature"`, `block_type: "function_signature"` chunk containing only the signature string (e.g. `function validateAccess(userId, token)`). This lightweight chunk is indexed separately to enable fast function-name lookup without retrieving the entire body.

Additionally, the function signature is registered in the `SemanticOverlapBuffer`. Subsequent non-function chunks in the same file will receive the current function signature prepended as a context comment (`// contexto: function validateAccess(userId, token)`), keeping the surrounding code semantically grounded.

#### `if`

```
opener:  ^\\s*if\\s*\\(
closer:  ^\\s*end\\s*\\(\\s*\\)
```

Matches AVAP conditional blocks. Note: AVAP uses `end()` as the closer, not `}`.

#### `startLoop`

```
opener:  ^\\s*startLoop\\s*\\(
closer:  ^\\s*endLoop\\s*\\(\\s*\\)
```

Matches AVAP iteration blocks. The closer is `endLoop()`.

#### `try`

```
opener:  ^\\s*try\\s*\\(\\s*\\)
closer:  ^\\s*end\\s*\\(\\s*\\)
```

Matches AVAP error-handling blocks (`try()` … `end()`).
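
Nesting with the four definitions above can be sketched with a stack, where the depth is the stack length. This is one plausible reading of the depth rule, not the chunker's actual implementation; in particular, pairing each closer with the innermost open block is an assumption made here so that `end()` closes the right construct when `if` and `try` are nested. The name `find_outer_blocks` is hypothetical.

```python
import re

# Block definitions from this section, with regexes in unescaped form.
BLOCKS = [
    {"name": "function", "opener": r"^\s*function\s+(\w+)\s*\(([^)]*)", "closer": r"^\s*\}\s*$"},
    {"name": "if", "opener": r"^\s*if\s*\(", "closer": r"^\s*end\s*\(\s*\)"},
    {"name": "startLoop", "opener": r"^\s*startLoop\s*\(", "closer": r"^\s*endLoop\s*\(\s*\)"},
    {"name": "try", "opener": r"^\s*try\s*\(\s*\)", "closer": r"^\s*end\s*\(\s*\)"},
]

def find_outer_blocks(clean_lines):
    """Return (block_name, first_index, last_index) for each outermost block."""
    stack = []  # currently open blocks, innermost last; depth == len(stack)
    start = 0
    found = []
    for i, line in enumerate(clean_lines):
        opened = next((b for b in BLOCKS if re.search(b["opener"], line)), None)
        if opened is not None:
            if not stack:
                start = i  # depth goes 0 -> 1: an outermost block begins
            stack.append(opened)
        elif stack and re.search(stack[-1]["closer"], line):
            closed = stack.pop()
            if not stack:  # depth is back to zero: the block is complete
                found.append((closed["name"], start, i))
    return found

sample = [
    "function validateAccess(userId, token) {",
    "    addVar(isValid = false)",
    "    try()",
    "        ormAccessSelect(users, id = userId)",
    "    end()",
    "    addResult(isValid)",
    "}",
    "registerEndpoint(POST, /validate)",
]
print(find_outer_blocks(sample))  # [('function', 0, 6)]
```

The inner `try()` … `end()` pair raises and lowers the depth without ending the function, which only closes at the final `}`.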

---

## 4. Statements

Statements are **single-line constructs**. Lines outside any block that match no block opener or closer are classified against the statement patterns in order, and the first match wins. If no pattern matches, the line is classified as `"statement"` (the fallback).

Consecutive lines with the same statement type are **grouped into a single chunk**, keeping semantically related statements together. When the statement type changes, the current group is flushed as a chunk.

```json
"statements": [
  { "name": "registerEndpoint", "pattern": "^\\s*registerEndpoint\\s*\\(" },
  { "name": "addVar",           "pattern": "^\\s*addVar\\s*\\(" },
  ...
]
```

### Statement fields

| Field | Type | Description |
|---|---|---|
| `name` | string | Used as `block_type` in the chunk metadata. |
| `pattern` | regex string | Matched against the clean line. First match wins — order matters. |

### Current statement definitions

| Name | Matches | AVAP commands |
|---|---|---|
| `registerEndpoint` | API route registration | `registerEndpoint(...)` |
| `addVar` | Variable declaration | `addVar(...)` |
| `io_command` | Input/output operations | `addParam`, `getListLen`, `addResult`, `getQueryParamList` |
| `http_command` | HTTP client calls | `RequestPost`, `RequestGet` |
| `orm_command` | Database ORM operations | `ormDirect`, `ormCheckTable`, `ormCreateTable`, `ormAccessSelect`, `ormAccessInsert`, `ormAccessUpdate` |
| `util_command` | Utility and helper functions | `variableToList`, `itemFromList`, `variableFromJSON`, `AddVariableToJSON`, `encodeSHA256`, `encodeMD5`, `getRegex`, `getDateTime`, `stampToDatetime`, `getTimeStamp`, `randomString`, `replace` |
| `async_command` | Concurrency primitives | `x = go funcName(`, `gather(` |
| `connector` | External service connector | `x = avapConnector(` |
| `modularity` | Module imports | `import`, `include` |
| `assignment` | Variable assignment (catch-all before fallback) | `x = ...` |

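**Ordering note:** `async_command` and `connector` must be listed before `assignment`. Both start with `x = ...`, so the `assignment` pattern (`^\s*\w+\s*=\s*`) also matches them, and because the first match wins they would be misclassified if `assignment` came first. The other specific patterns (`registerEndpoint`, `addVar`, and the command categories) start with a command name followed by `(`, so `assignment` does not match them; keeping them ahead of the catch-all is still the safest ordering.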
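
First-match-wins classification and the grouping of consecutive lines can be sketched as follows. The patterns are the ones from the config (unescaped); the function names `classify` and `group_statements` and the sample lines, including the `worker` function, are hypothetical.

```python
import re
from itertools import groupby

# (name, pattern) pairs in config order; order matters because the first match wins.
STATEMENTS = [
    ("registerEndpoint", r"^\s*registerEndpoint\s*\("),
    ("addVar", r"^\s*addVar\s*\("),
    ("io_command", r"^\s*(addParam|getListLen|addResult|getQueryParamList)\s*\("),
    ("http_command", r"^\s*(RequestPost|RequestGet)\s*\("),
    ("orm_command", r"^\s*(ormDirect|ormCheckTable|ormCreateTable|ormAccessSelect|ormAccessInsert|ormAccessUpdate)\s*\("),
    ("util_command", r"^\s*(variableToList|itemFromList|variableFromJSON|AddVariableToJSON|encodeSHA256|encodeMD5|getRegex|getDateTime|stampToDatetime|getTimeStamp|randomString|replace)\s*\("),
    ("async_command", r"^\s*(\w+\s*=\s*go\s+|gather\s*\()"),
    ("connector", r"^\s*\w+\s*=\s*avapConnector\s*\("),
    ("modularity", r"^\s*(import|include)\s+"),
    ("assignment", r"^\s*\w+\s*=\s*"),
]

def classify(line):
    """Return the name of the first matching statement pattern."""
    for name, pattern in STATEMENTS:
        if re.search(pattern, line):
            return name
    return "statement"  # fallback when nothing matches

def group_statements(lines):
    """Group consecutive lines that share a statement type."""
    return [(kind, list(group)) for kind, group in groupby(lines, key=classify)]

lines = [
    "addVar(a = 1)",
    "addVar(b = 2)",
    "task = go worker(a)",
    "total = a",
    "addResult(total)",
]
for kind, group in group_statements(lines):
    print(kind, group)
```

The two `addVar` lines end up in one group, and `task = go worker(a)` is classified as `async_command` only because that pattern is tried before `assignment`.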

---

## 5. Semantic Tags

Semantic tags are **boolean metadata flags** applied to every chunk (both blocks and statements) by scanning the entire chunk content with a regex. A chunk can have multiple tags simultaneously.

The `complexity` field is automatically computed as the count of `true` tags in a chunk's metadata, providing a rough signal of how much AVAP functionality a given chunk exercises.

```json
"semantic_tags": [
  { "tag": "uses_orm",   "pattern": "\\b(ormDirect|ormAccessSelect|...)\\s*\\(" },
  ...
]
```

### Tag fields

| Field | Description |
|---|---|
| `tag` | Key name in the `metadata` object stored in Elasticsearch. Value is always `true` when present. |
| `pattern` | Regex searched (not matched) across the full chunk text. Uses `\b` word boundaries to avoid false positives. |

### Current semantic tags

| Tag | Detected when chunk contains |
|---|---|
| `uses_orm` | Any ORM command: `ormDirect`, `ormCheckTable`, `ormCreateTable`, `ormAccessSelect`, `ormAccessInsert`, `ormAccessUpdate` |
| `uses_http` | HTTP client calls: `RequestPost`, `RequestGet` |
| `uses_connector` | External connector: `avapConnector(` |
| `uses_async` | Concurrency: `go funcName(` or `gather(` |
| `uses_crypto` | Hashing/encoding: `encodeSHA256(`, `encodeMD5(` |
| `uses_auth` | Auth-related commands: `addParam`, `_status` |
| `uses_error_handling` | Error handling block: `try()` |
| `uses_loop` | Loop construct: `startLoop(` |
| `uses_json` | JSON operations: `variableFromJSON(`, `AddVariableToJSON(` |
| `uses_list` | List operations: `variableToList(`, `itemFromList(`, `getListLen(` |
| `uses_regex` | Regular expressions: `getRegex(` |
| `uses_datetime` | Date/time operations: `getDateTime(`, `getTimeStamp(`, `stampToDatetime(` |
| `returns_result` | Returns data to the API caller: `addResult(` |
| `registers_endpoint` | Defines an API route: `registerEndpoint(` |
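
The tagging rules in the table above, together with the `complexity` count, can be sketched like this. The patterns are the ones from the config (unescaped); the function name `tag_chunk` and the flat shape of the returned dictionary are assumptions for illustration.

```python
import re

# (tag, pattern) pairs from the "semantic_tags" section of the config.
SEMANTIC_TAGS = [
    ("uses_orm", r"\b(ormDirect|ormCheckTable|ormCreateTable|ormAccessSelect|ormAccessInsert|ormAccessUpdate)\s*\("),
    ("uses_http", r"\b(RequestPost|RequestGet)\s*\("),
    ("uses_connector", r"\bavapConnector\s*\("),
    ("uses_async", r"\bgo\s+\w+\s*\(|\bgather\s*\("),
    ("uses_crypto", r"\b(encodeSHA256|encodeMD5)\s*\("),
    ("uses_auth", r"\b(addParam|_status)\b"),
    ("uses_error_handling", r"\btry\s*\(\s*\)"),
    ("uses_loop", r"\bstartLoop\s*\("),
    ("uses_json", r"\b(variableFromJSON|AddVariableToJSON)\s*\("),
    ("uses_list", r"\b(variableToList|itemFromList|getListLen)\s*\("),
    ("uses_regex", r"\bgetRegex\s*\("),
    ("uses_datetime", r"\b(getDateTime|getTimeStamp|stampToDatetime)\s*\("),
    ("returns_result", r"\baddResult\s*\("),
    ("registers_endpoint", r"\bregisterEndpoint\s*\("),
]

def tag_chunk(text):
    """Search every tag pattern across the whole chunk and count the hits."""
    tags = {tag: True for tag, pattern in SEMANTIC_TAGS if re.search(pattern, text)}
    return {**tags, "complexity": len(tags)}

chunk = """function validateAccess(userId, token) {
    addVar(isValid = false)
    addParam(userId)
    try()
        ormAccessSelect(users, id = userId)
        addVar(isValid = true)
    end()
    addResult(isValid)
}"""
print(tag_chunk(chunk))
# {'uses_orm': True, 'uses_auth': True, 'uses_error_handling': True,
#  'returns_result': True, 'complexity': 4}
```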

**How tags are used at retrieval time:** The Elasticsearch mapping stores each tag as a `boolean` field under the `metadata` object, so tags can be used in query-time filters or boosts. For example, a future retrieval enhancement could boost chunks with `metadata.uses_orm: true` when a query contains ORM-related keywords, improving precision for database-related questions.
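
A boost of this kind might look like the following query body, built here as a plain Python dictionary. This is only a sketch of the idea: the text field name `content`, the helper `build_boosted_query`, and the boost value are assumptions, and no such query is described as existing in the current pipeline.

```python
import json

# Hypothetical helper: builds an Elasticsearch bool query that prefers tagged chunks.
def build_boosted_query(question, boost_tags, boost=2.0):
    """Match on text, and score chunks higher when they carry the given tags."""
    should = [
        {"term": {f"metadata.{tag}": {"value": True, "boost": boost}}}
        for tag in boost_tags
    ]
    return {
        "query": {
            "bool": {
                "must": [{"match": {"content": question}}],  # "content" is an assumed field name
                "should": should,  # optional clauses: they raise the score but do not filter
            }
        }
    }

body = build_boosted_query("how do I select rows from a table", ["uses_orm"])
print(json.dumps(body, indent=2))
```

Moving the `term` clauses from `should` to `filter` would turn the boost into a hard filter that excludes untagged chunks.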

---

## 6. How They Work Together

The following example shows how `avap_chunker.py` processes a real `.avap` file using this config:

```avap
// Validate user session
function validateAccess(userId, token) {
    addVar(isValid = false)
    addParam(userId)
    try()
        ormAccessSelect(users, id = userId)
        addVar(isValid = true)
    end()
    addResult(isValid)
}

registerEndpoint(POST, /validate)
```

**Chunks produced:**

| # | `doc_type` | `block_type` | Content | Tags |
|---|---|---|---|---|
| 1 | `code` | `function` | Full function body (lines 2–10) | `uses_auth`, `uses_orm`, `uses_error_handling`, `returns_result` · `complexity: 4` |
| 2 | `function_signature` | `function_signature` | `function validateAccess(userId, token)` | — |
| 3 | `code` | `registerEndpoint` | `registerEndpoint(POST, /validate)` | `registers_endpoint` · `complexity: 1` |

Chunk 3 also receives the function signature as a semantic overlap header: the `SemanticOverlapBuffer` tracks `validateAccess` and injects its signature as context into any subsequent non-function chunks in the same file. Chunk 1 is the function itself and does not receive the header.
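
The overlap behaviour can be pictured with the sketch below. It is not the real `SemanticOverlapBuffer`: the class name `OverlapBufferSketch`, its methods, and the chunk dictionary shape are hypothetical, and treating both `function` and `function_signature` chunks as exempt is an assumption.

```python
# Hypothetical stand-in for the overlap mechanism described in this document.
class OverlapBufferSketch:
    def __init__(self):
        self.current_signature = None  # most recent function signature in this file

    def register(self, signature):
        """Remember the signature of the function that was just chunked."""
        self.current_signature = signature

    def apply(self, chunk):
        """Prepend the context comment to non-function chunks."""
        is_function = chunk["block_type"] in ("function", "function_signature")
        if is_function or self.current_signature is None:
            return chunk
        header = f"// contexto: {self.current_signature}"
        return {**chunk, "content": header + "\n" + chunk["content"]}

buffer = OverlapBufferSketch()
buffer.register("function validateAccess(userId, token)")
endpoint = {"block_type": "registerEndpoint", "content": "registerEndpoint(POST, /validate)"}
print(buffer.apply(endpoint)["content"])
# // contexto: function validateAccess(userId, token)
# registerEndpoint(POST, /validate)
```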

---

## 7. Adding New Constructs

### Adding a new block type

1. Identify the opener and closer patterns from the AVAP LRM (`docs/LRM/avap.md`).
2. Add an entry to `"blocks"` in `avap_config.json`.
3. If the block introduces a named construct worth indexing independently (like functions), set `"extract_signature": true` and define a `"signature_template"`.
4. Run a smoke test on a representative `.avap` file:
   ```bash
   python scripts/pipelines/ingestion/avap_chunker.py \
     --lang-config scripts/pipelines/ingestion/avap_config.json \
     --docs-path docs/samples \
     --output /tmp/test_chunks.jsonl \
     --no-dedup
   ```
5. Inspect `/tmp/test_chunks.jsonl` and verify the new `block_type` appears with the expected content.
6. Re-run the ingestion pipeline to rebuild the index.
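
For step 5, a short script can summarise which `block_type` values appear in the output. This sketch assumes each line of the JSONL file is a JSON object with a top-level `block_type` key; if the chunker nests it elsewhere, adjust the lookup. The function name `count_block_types` is hypothetical.

```python
import json
from collections import Counter

def count_block_types(jsonl_lines):
    """Count chunks per block_type across an iterable of JSONL lines."""
    counts = Counter()
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        counts[record.get("block_type", "<missing>")] += 1
    return counts

# Usage against the smoke-test output from step 4:
#     with open("/tmp/test_chunks.jsonl", encoding="utf-8") as handle:
#         for block_type, n in count_block_types(handle).most_common():
#             print(f"{n:5d}  {block_type}")
```

If the new block type is missing from the summary, the opener pattern is the first thing to check.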

### Adding a new statement category

1. Add an entry to `"statements"` **before** the `assignment` catch-all.
2. Use `^\\s*` to anchor the pattern at the start of the line.
3. Test as above — verify the new `block_type` appears in the JSONL output.

### Adding a new semantic tag

1. Add an entry to `"semantic_tags"`.
2. Use `\\b` word boundaries to prevent false positives on substrings.
3. Add the new tag as a `boolean` field to the Elasticsearch index mapping in `avap_ingestor.py` (`build_index_mapping()`).
4. **Re-index from scratch** — existing documents will not have the new tag unless the index is rebuilt (`--delete` flag).
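
Before re-running ingestion, a quick sanity check of the edited config can catch the most common mistakes covered in this section. The sketch below is not part of the pipeline; `validate_config` is a hypothetical helper that checks that every regex compiles, that block openers are anchored with `^`, and that `assignment` is the last statement.

```python
import re

def _regex_problems(label, pattern):
    """Return a one-item list describing a compile error, or an empty list."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return [f"{label}: invalid regex ({exc})"]
    return []

def validate_config(config):
    """Return a list of human-readable problems; empty means no issue found."""
    problems = []
    for block in config.get("blocks", []):
        for key in ("opener_pattern", "closer_pattern"):
            problems += _regex_problems(f"block '{block['name']}' {key}", block[key])
        if not block["opener_pattern"].startswith("^"):
            problems.append(f"block '{block['name']}' opener_pattern is not anchored with ^")
    statements = config.get("statements", [])
    for stmt in statements:
        problems += _regex_problems(f"statement '{stmt['name']}'", stmt["pattern"])
    names = [stmt["name"] for stmt in statements]
    if "assignment" in names and names[-1] != "assignment":
        problems.append("statement 'assignment' must be the last entry")
    for tag in config.get("semantic_tags", []):
        problems += _regex_problems(f"tag '{tag['tag']}'", tag["pattern"])
    return problems

# Usage:
#     import json
#     with open("scripts/pipelines/ingestion/avap_config.json", encoding="utf-8") as handle:
#         for problem in validate_config(json.load(handle)):
#             print(problem)
```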

---

## 8. Full Annotated Example

```jsonc
{
  // Identifies this config as the AVAP v1.0 grammar
  "language": "avap",
  "version": "1.0",
  "file_extensions": [".avap"],     // Only .avap files; .md is always included

  "lexer": {
    "string_delimiters": ["\"", "'"], // Both quote styles used in AVAP
    "escape_char": "\\",
    "comment_line": ["///", "//"],  // /// first — longest match wins
    "comment_block": { "open": "/*", "close": "*/" },
    "line_oriented": true
  },

  "blocks": [
    {
      "name": "function",
      "doc_type": "code",
      // Captures: group1=name, group2=params
      "opener_pattern": "^\\s*function\\s+(\\w+)\\s*\\(([^)]*)",
      "closer_pattern": "^\\s*\\}\\s*$",    // AVAP functions close with }
      "extract_signature": true,
      "signature_template": "function {group1}({group2})"
    },
    {
      "name": "if",
      "doc_type": "code",
      "opener_pattern": "^\\s*if\\s*\\(",
      "closer_pattern": "^\\s*end\\s*\\(\\s*\\)"   // AVAP if closes with end()
    },
    {
      "name": "startLoop",
      "doc_type": "code",
      "opener_pattern": "^\\s*startLoop\\s*\\(",
      "closer_pattern": "^\\s*endLoop\\s*\\(\\s*\\)"
    },
    {
      "name": "try",
      "doc_type": "code",
      "opener_pattern": "^\\s*try\\s*\\(\\s*\\)",
      "closer_pattern": "^\\s*end\\s*\\(\\s*\\)"   // try also closes with end()
    }
  ],

  "statements": [
    // Specific patterns first — must come before the generic "assignment" catch-all
    { "name": "registerEndpoint", "pattern": "^\\s*registerEndpoint\\s*\\(" },
    { "name": "addVar",           "pattern": "^\\s*addVar\\s*\\(" },
    { "name": "io_command",       "pattern": "^\\s*(addParam|getListLen|addResult|getQueryParamList)\\s*\\(" },
    { "name": "http_command",     "pattern": "^\\s*(RequestPost|RequestGet)\\s*\\(" },
    { "name": "orm_command",      "pattern": "^\\s*(ormDirect|ormCheckTable|ormCreateTable|ormAccessSelect|ormAccessInsert|ormAccessUpdate)\\s*\\(" },
    { "name": "util_command",     "pattern": "^\\s*(variableToList|itemFromList|variableFromJSON|AddVariableToJSON|encodeSHA256|encodeMD5|getRegex|getDateTime|stampToDatetime|getTimeStamp|randomString|replace)\\s*\\(" },
    { "name": "async_command",    "pattern": "^\\s*(\\w+\\s*=\\s*go\\s+|gather\\s*\\()" },
    { "name": "connector",        "pattern": "^\\s*\\w+\\s*=\\s*avapConnector\\s*\\(" },
    { "name": "modularity",       "pattern": "^\\s*(import|include)\\s+" },
    { "name": "assignment",       "pattern": "^\\s*\\w+\\s*=\\s*" }  // catch-all
  ],

  "semantic_tags": [
    // Applied to every chunk by full-content regex search (not line-by-line)
    { "tag": "uses_orm",            "pattern": "\\b(ormDirect|ormCheckTable|ormCreateTable|ormAccessSelect|ormAccessInsert|ormAccessUpdate)\\s*\\(" },
    { "tag": "uses_http",           "pattern": "\\b(RequestPost|RequestGet)\\s*\\(" },
    { "tag": "uses_connector",      "pattern": "\\bavapConnector\\s*\\(" },
    { "tag": "uses_async",          "pattern": "\\bgo\\s+\\w+\\s*\\(|\\bgather\\s*\\(" },
    { "tag": "uses_crypto",         "pattern": "\\b(encodeSHA256|encodeMD5)\\s*\\(" },
    { "tag": "uses_auth",           "pattern": "\\b(addParam|_status)\\b" },
    { "tag": "uses_error_handling", "pattern": "\\btry\\s*\\(\\s*\\)" },
    { "tag": "uses_loop",           "pattern": "\\bstartLoop\\s*\\(" },
    { "tag": "uses_json",           "pattern": "\\b(variableFromJSON|AddVariableToJSON)\\s*\\(" },
    { "tag": "uses_list",           "pattern": "\\b(variableToList|itemFromList|getListLen)\\s*\\(" },
    { "tag": "uses_regex",          "pattern": "\\bgetRegex\\s*\\(" },
    { "tag": "uses_datetime",       "pattern": "\\b(getDateTime|getTimeStamp|stampToDatetime)\\s*\\(" },
    { "tag": "returns_result",      "pattern": "\\baddResult\\s*\\(" },
    { "tag": "registers_endpoint",  "pattern": "\\bregisterEndpoint\\s*\\(" }
  ]
}
```