AVAP Chunker — Language Configuration Reference
File: scripts/pipelines/ingestion/avap_config.json
Used by: avap_chunker.py (Pipeline B)
Last updated: 2026-03-18
This file is the grammar definition for the AVAP language chunker. It tells avap_chunker.py how to tokenize, parse, and semantically classify .avap source files before they are embedded and ingested into Elasticsearch. Modifying this file changes what the chunker recognises as a block, a statement, or a semantic feature — and therefore what metadata every chunk in the knowledge base carries.
Table of Contents
- Top-Level Fields
- Lexer
- Blocks
- Statements
- Semantic Tags
- How They Work Together
- Adding New Constructs
- Full Annotated Example
1. Top-Level Fields
{
"language": "avap",
"version": "1.0",
"file_extensions": [".avap"]
}
| Field | Type | Description |
|---|---|---|
| language | string | Human-readable language name. Used in chunker progress reports. |
| version | string | Config schema version. Increment when making breaking changes. |
| file_extensions | array of strings | File extensions the chunker will process. .md files are always processed regardless of this setting. |
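The file-selection rule described in the table can be sketched as follows. This is a minimal illustration, not the code in avap_chunker.py; the function name `should_process` is hypothetical, and exact (case-sensitive) suffix comparison is an assumption.

```python
from pathlib import Path


def should_process(path: str, file_extensions: list[str]) -> bool:
    """Return True if a file would be picked up under the documented rule.

    Hypothetical helper: the real chunker may normalise case or apply
    additional filters that are not described in this reference.
    """
    suffix = Path(path).suffix
    # .md files are always processed, whatever file_extensions contains.
    return suffix == ".md" or suffix in file_extensions
```

With the shipped value `[".avap"]`, this accepts `routes/login.avap` and `docs/intro.md` and rejects `tools/build.py`.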
2. Lexer
The lexer section controls how raw source lines are stripped of comments and string literals before pattern matching is applied.
"lexer": {
"string_delimiters": ["\"", "'"],
"escape_char": "\\",
"comment_line": ["///", "//"],
"comment_block": { "open": "/*", "close": "*/" },
"line_oriented": true
}
| Field | Type | Description |
|---|---|---|
| string_delimiters | array of strings | Characters that open and close string literals. Content inside strings is ignored during pattern matching. |
| escape_char | string | Character used to escape the next character inside a string. Prevents \" from closing the string. |
| comment_line | array of strings | Line comment prefixes, evaluated longest-first. Everything after the matched prefix is stripped. AVAP supports both /// (documentation comments) and // (inline comments). |
| comment_block.open | string | Block comment opening delimiter. |
| comment_block.close | string | Block comment closing delimiter. Content between /* and */ is stripped before pattern matching. |
| line_oriented | bool | When true, the lexer processes one line at a time. Should always be true for AVAP. |
Important: Comment stripping and string boundary detection happen before any block or statement pattern is evaluated. A keyword inside a string literal or a comment will never trigger a block or statement match.
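To make the cleaning step concrete, here is a minimal sketch of a line-oriented cleaner driven by the lexer section. It is not the chunker's actual implementation: the name `clean_line` is hypothetical, and the choice to keep string delimiters while dropping string contents is an assumption.

```python
def clean_line(line, lexer, in_block=False):
    """Drop comments and string contents from one source line.

    Returns (clean_text, still_inside_block_comment). Sketch only; the
    real lexer in avap_chunker.py may differ in detail.
    """
    delims = lexer["string_delimiters"]
    esc = lexer["escape_char"]
    line_prefixes = tuple(lexer["comment_line"])
    blk_open = lexer["comment_block"]["open"]
    blk_close = lexer["comment_block"]["close"]

    out = []
    quote = None  # delimiter of the string we are currently inside, if any
    i = 0
    while i < len(line):
        if in_block:
            end = line.find(blk_close, i)
            if end == -1:
                return "".join(out), True  # comment continues on the next line
            i = end + len(blk_close)
            in_block = False
        elif quote:
            if line[i] == esc:
                i += 2                      # skip the escaped character
            elif line[i] == quote:
                out.append(quote)           # keep the closing delimiter
                quote = None
                i += 1
            else:
                i += 1                      # drop string content
        elif line.startswith(blk_open, i):
            in_block = True
            i += len(blk_open)
        elif line.startswith(line_prefixes, i):
            break                           # rest of the line is a comment
        elif line[i] in delims:
            quote = line[i]
            out.append(quote)               # keep the opening delimiter
            i += 1
        else:
            out.append(line[i])
            i += 1
    return "".join(out), in_block
```

The second return value is carried into the next call, so a `/* ... */` comment spanning several lines stays stripped even though each line is processed on its own.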
3. Blocks
Blocks are multi-line constructs with a defined opener and closer. The chunker tracks nesting depth — each opener increments depth, each closer decrements it, and the block ends when depth returns to zero. This correctly handles nested if() inside function{} and similar cases.
Each block definition produces a chunk with doc_type as specified and block_type equal to the block name.
"blocks": [
{
"name": "function",
"doc_type": "code",
"opener_pattern": "^\\s*function\\s+(\\w+)\\s*\\(([^)]*)",
"closer_pattern": "^\\s*\\}\\s*$",
"extract_signature": true,
"signature_template": "function {group1}({group2})"
},
...
]
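The depth tracking described above can be sketched with a stack, where depth is simply the stack length. This is one plausible reading of the behaviour, not the chunker's actual code; `find_block_end` is a hypothetical name, and the lines passed in are assumed to be already comment-stripped.

```python
import re


def find_block_end(lines, start, blocks):
    """Return the index of the line that closes the block opened at lines[start].

    `blocks` is the "blocks" list from the config. Assumption: every opener
    pushes its own closer, and only the innermost open block can be closed.
    """
    compiled = [
        (re.compile(b["opener_pattern"]), re.compile(b["closer_pattern"]))
        for b in blocks
    ]
    stack = []  # closer regexes of the blocks that are currently open
    for i in range(start, len(lines)):
        line = lines[i]
        closer = next((c for o, c in compiled if o.search(line)), None)
        if closer is not None:
            stack.append(closer)       # depth + 1
        elif stack and stack[-1].search(line):
            stack.pop()                # depth - 1
            if not stack:
                return i               # depth is back to zero
    return None  # the block was never closed
```

A stack is used here because if and try share the closer end(), while function closes with }; a single counter per pattern could not tell which block a given end() belongs to.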
Block fields
| Field | Type | Required | Description |
|---|---|---|---|
| name | string | Yes | Identifier for this block type. Used as block_type in the chunk metadata and in the semantic_overlap context header. |
| doc_type | string | Yes | Elasticsearch doc_type field value for chunks from this block. |
| opener_pattern | regex string | Yes | Pattern matched against the clean (comment-stripped) line to detect the start of this block. Must be anchored at the start (^). |
| closer_pattern | regex string | Yes | Pattern matched to detect the end of this block. Checked at every line after the opener. |
| extract_signature | bool | No (default: false) | When true, the chunker extracts a compact signature string from the opener line using capture groups, and creates an additional function_signature chunk alongside the full block chunk. |
| signature_template | string | No | Template for the signature string. Uses {group1}, {group2}, etc. as placeholders for the regex capture groups from opener_pattern. |
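As an illustration of how opener_pattern and signature_template interact, the sketch below fills the {groupN} placeholders from the regex capture groups. The helper name `render_signature` is hypothetical, and the real chunker may handle unmatched groups differently.

```python
import re


def render_signature(line, opener_pattern, signature_template):
    """Build a signature string from an opener line, or return None if it does not match."""
    match = re.search(opener_pattern, line)
    if match is None:
        return None
    signature = signature_template
    # {group1} is the first capture group, {group2} the second, and so on.
    for number, value in enumerate(match.groups(), start=1):
        signature = signature.replace("{group%d}" % number, value or "")
    return signature
```

For the opener line `function validateAccess(userId, token) {`, the function block's pattern and template yield `function validateAccess(userId, token)`.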
Current block definitions
function
opener: ^\\s*function\\s+(\\w+)\\s*\\(([^)]*)
closer: ^\\s*\\}\\s*$
Matches any top-level or nested AVAP function declaration. The two capture groups extract the function name (group1) and parameter list (group2), which are combined into the signature template function {group1}({group2}).
Because extract_signature: true, every function produces two chunks:
- A doc_type: "code", block_type: "function" chunk containing the full function body.
- A doc_type: "function_signature", block_type: "function_signature" chunk containing only the signature string (e.g. function validateAccess(userId, token)). This lightweight chunk is indexed separately to enable fast function-name lookup without retrieving the entire body.
Additionally, the function signature is registered in the SemanticOverlapBuffer. Subsequent non-function chunks in the same file will receive the current function signature prepended as a context comment (// contexto: function validateAccess(userId, token)), keeping the surrounding code semantically grounded.
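The context-header behaviour can be sketched as below. This is a simplified stand-in for SemanticOverlapBuffer, not its real interface: the function name `add_context` and the `content` key are assumptions made for illustration.

```python
def add_context(chunks):
    """Prepend the most recent function signature to later non-function chunks.

    Each chunk is a dict with "block_type" and "content" keys (assumed shape).
    """
    signature = None
    result = []
    for chunk in chunks:
        if chunk["block_type"] == "function_signature":
            signature = chunk["content"]  # remember the latest signature
        elif chunk["block_type"] != "function" and signature:
            header = "// contexto: " + signature
            chunk = {**chunk, "content": header + "\n" + chunk["content"]}
        result.append(chunk)
    return result
```

Function chunks and signature chunks pass through unchanged; only the chunks that follow them in the same file gain the header line.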
if
opener: ^\\s*if\\s*\\(
closer: ^\\s*end\\s*\\(\\s*\\)
Matches AVAP conditional blocks. Note: AVAP uses end() as the closer, not }.
startLoop
opener: ^\\s*startLoop\\s*\\(
closer: ^\\s*endLoop\\s*\\(\\s*\\)
Matches AVAP iteration blocks. The closer is endLoop().
try
opener: ^\\s*try\\s*\\(\\s*\\)
closer: ^\\s*end\\s*\\(\\s*\\)
Matches AVAP error-handling blocks (try() … end()).
4. Statements
Statements are single-line constructs. Lines that are not part of any block opener or closer are classified against the statement patterns in order. The first match wins. If no pattern matches, the statement is classified as "statement" (the fallback).
Consecutive lines with the same statement type are grouped into a single chunk, keeping semantically related statements together. When the statement type changes, the current group is flushed as a chunk.
"statements": [
{ "name": "registerEndpoint", "pattern": "^\\s*registerEndpoint\\s*\\(" },
{ "name": "addVar", "pattern": "^\\s*addVar\\s*\\(" },
...
]
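The grouping rule (consecutive lines of the same statement type form one chunk) maps naturally onto `itertools.groupby`. The sketch below assumes the lines have already been classified into (statement_type, line) pairs; the function name and the chunk dict shape are illustrative only.

```python
from itertools import groupby


def group_statements(classified):
    """Merge consecutive (statement_type, line) pairs of the same type into chunks."""
    chunks = []
    for statement_type, run in groupby(classified, key=lambda pair: pair[0]):
        # A new group starts whenever the statement type changes.
        content = "\n".join(line for _, line in run)
        chunks.append({"block_type": statement_type, "content": content})
    return chunks
```

Note that grouping is by adjacency, not by type overall: two addVar runs separated by another statement produce two separate chunks.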
Statement fields
| Field | Type | Description |
|---|---|---|
| name | string | Used as block_type in the chunk metadata. |
| pattern | regex string | Matched against the clean line. First match wins — order matters. |
Current statement definitions
| Name | Matches | AVAP commands |
|---|---|---|
| registerEndpoint | API route registration | registerEndpoint(...) |
| addVar | Variable declaration | addVar(...) |
| io_command | Input/output operations | addParam, getListLen, addResult, getQueryParamList |
| http_command | HTTP client calls | RequestPost, RequestGet |
| orm_command | Database ORM operations | ormDirect, ormCheckTable, ormCreateTable, ormAccessSelect, ormAccessInsert, ormAccessUpdate |
| util_command | Utility and helper functions | variableToList, itemFromList, variableFromJSON, AddVariableToJSON, encodeSHA256, encodeMD5, getRegex, getDateTime, stampToDatetime, getTimeStamp, randomString, replace |
| async_command | Concurrency primitives | x = go funcName(, gather( |
| connector | External service connector | x = avapConnector( |
| modularity | Module imports | import, include |
| assignment | Variable assignment (catch-all before fallback) | x = ... |
Ordering note: registerEndpoint, addVar, and the specific command categories are listed before assignment intentionally. The assignment pattern matches any line that starts with an identifier followed by =, which includes the x = go funcName( form of async_command and every connector line, so the more specific patterns must come first.
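The effect of ordering is easy to demonstrate with a first-match-wins classifier. The sketch below uses two of the patterns from this config; the `classify` function is a hypothetical reduction of what the chunker does, and `someUnknownCommand` is an invented line used only to show the fallback.

```python
import re

# Two entries taken from the config, in their documented order.
STATEMENTS = [
    {"name": "connector", "pattern": r"^\s*\w+\s*=\s*avapConnector\s*\("},
    {"name": "assignment", "pattern": r"^\s*\w+\s*=\s*"},
]


def classify(line, statements):
    """Return the name of the first statement pattern that matches the line."""
    for statement in statements:
        if re.search(statement["pattern"], line):
            return statement["name"]
    return "statement"  # documented fallback when nothing matches


line = "db = avapConnector(token)"
in_order = classify(line, STATEMENTS)                   # "connector"
reversed_order = classify(line, STATEMENTS[::-1])       # "assignment"
fallback = classify("someUnknownCommand(x)", STATEMENTS)  # "statement"
```

Reversing the list makes the catch-all shadow the connector pattern, which is exactly the failure mode the ordering is meant to prevent.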
5. Semantic Tags
Semantic tags are boolean metadata flags applied to every chunk (both blocks and statements) by scanning the entire chunk content with a regex. A chunk can have multiple tags simultaneously.
The complexity field is automatically computed as the count of true tags in a chunk's metadata, providing a rough signal of how much AVAP functionality a given chunk exercises.
"semantic_tags": [
{ "tag": "uses_orm", "pattern": "\\b(ormDirect|ormAccessSelect|...)\\s*\\(" },
...
]
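Tagging and the complexity count can be sketched in a few lines. This assumes tags are stored as `True` only when present and that complexity is the number of such tags, as described above; the function name `tag_chunk` is hypothetical.

```python
import re


def tag_chunk(content, semantic_tags):
    """Return chunk metadata: one True flag per matching tag, plus complexity."""
    # re.search scans the whole chunk text, not just the start of a line.
    tags = {
        entry["tag"]: True
        for entry in semantic_tags
        if re.search(entry["pattern"], content)
    }
    complexity = len(tags)  # count the tags before adding the counter itself
    return {**tags, "complexity": complexity}
```

Tags that do not match are simply absent from the result rather than stored as false.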
Tag fields
| Field | Description |
|---|---|
| tag | Key name in the metadata object stored in Elasticsearch. Value is always true when present. |
| pattern | Regex searched (not matched) across the full chunk text. Uses \b word boundaries to avoid false positives. |
Current semantic tags
| Tag | Detected when chunk contains |
|---|---|
| uses_orm | Any ORM command: ormDirect, ormCheckTable, ormCreateTable, ormAccessSelect, ormAccessInsert, ormAccessUpdate |
| uses_http | HTTP client calls: RequestPost, RequestGet |
| uses_connector | External connector: avapConnector( |
| uses_async | Concurrency: go funcName( or gather( |
| uses_crypto | Hashing/encoding: encodeSHA256(, encodeMD5( |
| uses_auth | Auth-related commands: addParam, _status |
| uses_error_handling | Error handling block: try() |
| uses_loop | Loop construct: startLoop( |
| uses_json | JSON operations: variableFromJSON(, AddVariableToJSON( |
| uses_list | List operations: variableToList(, itemFromList(, getListLen( |
| uses_regex | Regular expressions: getRegex( |
| uses_datetime | Date/time operations: getDateTime(, getTimeStamp(, stampToDatetime( |
| returns_result | Returns data to the API caller: addResult( |
| registers_endpoint | Defines an API route: registerEndpoint( |
How tags are used at retrieval time: The Elasticsearch mapping stores each tag as a boolean field under the metadata object. This enables filtered retrieval — for example, a future retrieval enhancement could boost chunks with metadata.uses_orm: true for queries that contain ORM-related keywords, improving precision for database-related questions.
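A boosted query of the kind described could look like the sketch below, expressed as an Elasticsearch query DSL body built in Python. Nothing here exists in the pipeline today: the function name, the `content` text field, and the boost value are all assumptions; only the `metadata.<tag>` field path comes from this document.

```python
def build_query(question, boost_tag=None, boost=2.0):
    """Build a search body that optionally favours chunks carrying a given tag."""
    bool_query = {
        # "content" is an assumed name for the indexed chunk text field.
        "must": [{"match": {"content": question}}],
    }
    if boost_tag:
        # A should clause next to must does not filter; it only raises the score.
        bool_query["should"] = [
            {"term": {f"metadata.{boost_tag}": {"value": True, "boost": boost}}}
        ]
    return {"query": {"bool": bool_query}}
```

Moving the term clause from `should` to `filter` would turn the boost into a hard restriction, which is the filtered-retrieval variant mentioned above.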
6. How They Work Together
The following example shows how avap_chunker.py processes a real .avap file using this config:
// Validate user session
function validateAccess(userId, token) {
addVar(isValid = false)
addParam(userId)
try()
ormAccessSelect(users, id = userId)
addVar(isValid = true)
end()
addResult(isValid)
}
registerEndpoint(POST, /validate)
Chunks produced:
| # | doc_type | block_type | Content | Tags |
|---|---|---|---|---|
| 1 | code | function | Full function body (lines 2–10) | uses_auth, uses_orm, uses_error_handling, returns_result · complexity: 4 |
| 2 | function_signature | function_signature | function validateAccess(userId, token) | — |
| 3 | code | registerEndpoint | registerEndpoint(POST, /validate) | registers_endpoint · complexity: 1 |
Chunk 3 also receives the function signature as a semantic overlap header, because the SemanticOverlapBuffer tracks validateAccess and injects it as context into any subsequent non-function chunks in the same file.
7. Adding New Constructs
Adding a new block type
- Identify the opener and closer patterns from the AVAP LRM (docs/LRM/avap.md).
- Add an entry to "blocks" in avap_config.json.
- If the block introduces a named construct worth indexing independently (like functions), set "extract_signature": true and define a "signature_template".
- Run a smoke test on a representative .avap file:

      python scripts/pipelines/ingestion/avap_chunker.py \
        --lang-config scripts/pipelines/ingestion/avap_config.json \
        --docs-path docs/samples \
        --output /tmp/test_chunks.jsonl \
        --no-dedup

- Inspect /tmp/test_chunks.jsonl and verify the new block_type appears with the expected content.
- Re-run the ingestion pipeline to rebuild the index.
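For the inspection step, a short script that tallies block types is usually quicker than reading the JSONL by eye. The sketch below assumes one JSON object per line and looks for block_type either at the top level or under a metadata object, since the exact output layout is not specified here; `count_block_types` is a hypothetical helper.

```python
import json
from collections import Counter


def count_block_types(jsonl_path):
    """Count chunks per block_type in a JSONL file produced by the smoke test."""
    counts = Counter()
    with open(jsonl_path, encoding="utf-8") as handle:
        for raw in handle:
            if not raw.strip():
                continue  # tolerate blank lines
            record = json.loads(raw)
            # Assumed layout: block_type at the top level or inside "metadata".
            block_type = record.get("block_type") or record.get("metadata", {}).get("block_type")
            counts[block_type] += 1
    return counts
```

Calling `count_block_types("/tmp/test_chunks.jsonl")` and checking that the new block name has a non-zero count confirms that the opener pattern fired at least once.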
Adding a new statement category
- Add an entry to "statements" before the assignment catch-all.
- Use ^\\s* to anchor the pattern at the start of the line.
- Test as above — verify the new block_type appears in the JSONL output.
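The first two rules in this checklist can be checked mechanically before running the chunker. The sketch below is a hypothetical lint helper, not part of the pipeline; it reports entries that are unanchored, fail to compile, or follow the assignment catch-all.

```python
import re


def lint_statements(statements):
    """Return a list of human-readable problems found in the "statements" list."""
    problems = []
    names = [entry["name"] for entry in statements]
    if "assignment" in names and names[-1] != "assignment":
        problems.append("assignment must be the last entry")
    for entry in statements:
        if not entry["pattern"].startswith("^"):
            problems.append(f"{entry['name']}: pattern is not anchored with ^")
        try:
            re.compile(entry["pattern"])
        except re.error as exc:
            problems.append(f"{entry['name']}: invalid regex ({exc})")
    return problems
```

An empty list means the entries pass these checks; it does not guarantee that an earlier specific pattern cannot shadow the new one.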
Adding a new semantic tag
- Add an entry to "semantic_tags".
- Use \\b word boundaries to prevent false positives on substrings.
- Add the new tag as a boolean field to the Elasticsearch index mapping in avap_ingestor.py (build_index_mapping()).
- Re-index from scratch; existing documents will not have the new tag unless the index is rebuilt (--delete flag).
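One way to keep the mapping and the config from drifting apart is to derive the boolean fields from the semantic_tags list. The sketch below shows only that fragment; how build_index_mapping() in avap_ingestor.py is actually structured is not documented here, so the function name `tag_mapping` and the idea of generating the fields are assumptions.

```python
def tag_mapping(semantic_tags):
    """Return an Elasticsearch mapping fragment with one boolean field per tag."""
    tag_fields = {entry["tag"]: {"type": "boolean"} for entry in semantic_tags}
    # Tags live under the "metadata" object, as described in the Semantic Tags section.
    return {"properties": {"metadata": {"properties": tag_fields}}}
```

The returned fragment would still have to be merged into the rest of the index mapping by hand.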
8. Full Annotated Example
{
// Identifies this config as the AVAP v1.0 grammar
"language": "avap",
"version": "1.0",
"file_extensions": [".avap"], // Only .avap files; .md is always included
"lexer": {
"string_delimiters": ["\"", "'"], // Both quote styles used in AVAP
"escape_char": "\\",
"comment_line": ["///", "//"], // /// first — longest match wins
"comment_block": { "open": "/*", "close": "*/" },
"line_oriented": true
},
"blocks": [
{
"name": "function",
"doc_type": "code",
// Captures: group1=name, group2=params
"opener_pattern": "^\\s*function\\s+(\\w+)\\s*\\(([^)]*)",
"closer_pattern": "^\\s*\\}\\s*$", // AVAP functions close with }
"extract_signature": true,
"signature_template": "function {group1}({group2})"
},
{
"name": "if",
"doc_type": "code",
"opener_pattern": "^\\s*if\\s*\\(",
"closer_pattern": "^\\s*end\\s*\\(\\s*\\)" // AVAP if closes with end()
},
{
"name": "startLoop",
"doc_type": "code",
"opener_pattern": "^\\s*startLoop\\s*\\(",
"closer_pattern": "^\\s*endLoop\\s*\\(\\s*\\)"
},
{
"name": "try",
"doc_type": "code",
"opener_pattern": "^\\s*try\\s*\\(\\s*\\)",
"closer_pattern": "^\\s*end\\s*\\(\\s*\\)" // try also closes with end()
}
],
"statements": [
// Specific patterns first — must come before the generic "assignment" catch-all
{ "name": "registerEndpoint", "pattern": "^\\s*registerEndpoint\\s*\\(" },
{ "name": "addVar", "pattern": "^\\s*addVar\\s*\\(" },
{ "name": "io_command", "pattern": "^\\s*(addParam|getListLen|addResult|getQueryParamList)\\s*\\(" },
{ "name": "http_command", "pattern": "^\\s*(RequestPost|RequestGet)\\s*\\(" },
{ "name": "orm_command", "pattern": "^\\s*(ormDirect|ormCheckTable|ormCreateTable|ormAccessSelect|ormAccessInsert|ormAccessUpdate)\\s*\\(" },
{ "name": "util_command", "pattern": "^\\s*(variableToList|itemFromList|variableFromJSON|AddVariableToJSON|encodeSHA256|encodeMD5|getRegex|getDateTime|stampToDatetime|getTimeStamp|randomString|replace)\\s*\\(" },
{ "name": "async_command", "pattern": "^\\s*(\\w+\\s*=\\s*go\\s+|gather\\s*\\()" },
{ "name": "connector", "pattern": "^\\s*\\w+\\s*=\\s*avapConnector\\s*\\(" },
{ "name": "modularity", "pattern": "^\\s*(import|include)\\s+" },
{ "name": "assignment", "pattern": "^\\s*\\w+\\s*=\\s*" } // catch-all
],
"semantic_tags": [
// Applied to every chunk by full-content regex search (not line-by-line)
{ "tag": "uses_orm", "pattern": "\\b(ormDirect|ormCheckTable|ormCreateTable|ormAccessSelect|ormAccessInsert|ormAccessUpdate)\\s*\\(" },
{ "tag": "uses_http", "pattern": "\\b(RequestPost|RequestGet)\\s*\\(" },
{ "tag": "uses_connector", "pattern": "\\bavapConnector\\s*\\(" },
{ "tag": "uses_async", "pattern": "\\bgo\\s+\\w+\\s*\\(|\\bgather\\s*\\(" },
{ "tag": "uses_crypto", "pattern": "\\b(encodeSHA256|encodeMD5)\\s*\\(" },
{ "tag": "uses_auth", "pattern": "\\b(addParam|_status)\\b" },
{ "tag": "uses_error_handling", "pattern": "\\btry\\s*\\(\\s*\\)" },
{ "tag": "uses_loop", "pattern": "\\bstartLoop\\s*\\(" },
{ "tag": "uses_json", "pattern": "\\b(variableFromJSON|AddVariableToJSON)\\s*\\(" },
{ "tag": "uses_list", "pattern": "\\b(variableToList|itemFromList|getListLen)\\s*\\(" },
{ "tag": "uses_regex", "pattern": "\\bgetRegex\\s*\\(" },
{ "tag": "uses_datetime", "pattern": "\\b(getDateTime|getTimeStamp|stampToDatetime)\\s*\\(" },
{ "tag": "returns_result", "pattern": "\\baddResult\\s*\\(" },
{ "tag": "registers_endpoint", "pattern": "\\bregisterEndpoint\\s*\\(" }
]
}
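The // annotations in the example above are for readers only: Python's `json.loads` rejects comments, so the annotated form cannot be loaded as-is. If you want to experiment with it directly, a string-aware comment stripper along these lines is one option. It is a sketch for this document's example, not something the chunker is known to do, and it deliberately leaves // sequences inside string values (such as the comment_line entries) untouched.

```python
import json


def strip_json_comments(text):
    """Remove // comments that appear outside JSON string literals."""
    out = []
    in_string = False
    i = 0
    while i < len(text):
        ch = text[i]
        if in_string:
            out.append(ch)
            if ch == "\\" and i + 1 < len(text):
                out.append(text[i + 1])  # keep the escaped character verbatim
                i += 1
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
            out.append(ch)
        elif text.startswith("//", i):
            # Skip to the end of the line but keep the newline itself.
            while i < len(text) and text[i] != "\n":
                i += 1
            continue
        else:
            out.append(ch)
        i += 1
    return "".join(out)


def load_annotated_config(text):
    """Parse an annotated config string into a dict."""
    return json.loads(strip_json_comments(text))
```

Tracking whether the scanner is inside a string is what keeps the "///" and "//" values in the lexer section intact.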