AVAP Chunker — Language Configuration Reference

File: scripts/pipelines/ingestion/avap_config.json
Used by: avap_chunker.py (Pipeline B)
Last updated: 2026-03-18

This file is the grammar definition for the AVAP language chunker. It tells avap_chunker.py how to tokenize, parse, and semantically classify .avap source files before they are embedded and ingested into Elasticsearch. Modifying this file changes what the chunker recognises as a block, a statement, or a semantic feature — and therefore what metadata every chunk in the knowledge base carries.

1. Top-Level Fields
2. Lexer
3. Blocks
4. Statements
5. Semantic Tags
6. How They Work Together
7. Adding New Constructs
8. Full Annotated Example

1. Top-Level Fields

{
  "language": "avap",
  "version": "1.0",
  "file_extensions": [".avap"]
}

Field	Type	Description
`language`	string	Human-readable language name. Used in chunker progress reports.
`version`	string	Config schema version. Increment when making breaking changes.
`file_extensions`	array of strings	File extensions the chunker will process. `.md` files are always processed regardless of this setting.

2. Lexer

The lexer section controls how raw source lines are stripped of comments and string literals before pattern matching is applied.

"lexer": {
  "string_delimiters": ["\"", "'"],
  "escape_char": "\\",
  "comment_line":  ["///", "//"],
  "comment_block": { "open": "/*", "close": "*/" },
  "line_oriented": true
}

Field	Type	Description
`string_delimiters`	array of strings	Characters that open and close string literals. Content inside strings is ignored during pattern matching.
`escape_char`	string	Character used to escape the next character inside a string. Prevents `\"` from closing the string.
`comment_line`	array of strings	Line comment prefixes, evaluated longest-first. Everything after the matched prefix is stripped. AVAP supports both `///` (documentation comments) and `//` (inline comments).
`comment_block.open`	string	Block comment opening delimiter.
`comment_block.close`	string	Block comment closing delimiter. Content between `/*` and `*/` is stripped before pattern matching.
`line_oriented`	bool	When `true`, the lexer processes one line at a time. Should always be `true` for AVAP.

Important: Comment stripping and string boundary detection happen before any block or statement pattern is evaluated. A keyword inside a string literal or a comment will never trigger a block or statement match.
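The stripping step can be pictured with a minimal sketch. It is not the code in avap_chunker.py: the function name clean_line and its return shape are assumptions made for illustration. Only the lexer keys come from the config above.

```python
LEXER = {
    "string_delimiters": ["\"", "'"],
    "escape_char": "\\",
    "comment_line": ["///", "//"],
    "comment_block": {"open": "/*", "close": "*/"},
}


def clean_line(line, lexer=LEXER, in_block_comment=False):
    """Return (clean_text, still_in_block_comment) for one source line.

    String contents and comments are dropped so that later patterns
    cannot match keywords inside them. Hypothetical helper.
    """
    delimiters = set(lexer["string_delimiters"])
    escape = lexer["escape_char"]
    line_prefixes = tuple(lexer["comment_line"])
    block_open = lexer["comment_block"]["open"]
    block_close = lexer["comment_block"]["close"]

    out = []
    i = 0
    quote = None  # delimiter of the string we are currently inside, if any
    while i < len(line):
        if in_block_comment:
            end = line.find(block_close, i)
            if end == -1:
                return "".join(out), True  # comment continues on the next line
            i = end + len(block_close)
            in_block_comment = False
            continue
        ch = line[i]
        if quote is not None:
            if ch == escape:
                i += 2  # skip the escape char and the escaped character
                continue
            if ch == quote:
                out.append(ch)  # keep the closing delimiter, drop the content
                quote = None
            i += 1
            continue
        if ch in delimiters:
            quote = ch
            out.append(ch)
            i += 1
            continue
        if line.startswith(block_open, i):
            in_block_comment = True
            i += len(block_open)
            continue
        if line.startswith(line_prefixes, i):
            break  # the rest of the line is a line comment
        out.append(ch)
        i += 1
    return "".join(out), in_block_comment
```

With this sketch, the if inside the string literal and the trailing comment in addVar(msg = "if (x) // no") // note both disappear, leaving only addVar(msg = "") for the block and statement patterns to inspect.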

3. Blocks

Blocks are multi-line constructs with a defined opener and closer. The chunker tracks nesting depth — each opener increments depth, each closer decrements it, and the block ends when depth returns to zero. This correctly handles nested if() inside function{} and similar cases.

Each block definition produces a chunk with doc_type as specified and block_type equal to the block name.
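One way to realise the depth tracking described above is a stack of pending closers, where the stack length is the nesting depth. The sketch below is an assumption about the approach, not the chunker's actual code. The helper name find_block_end is hypothetical, and the patterns are copied from the config with JSON escaping removed.

```python
import re

BLOCKS = [
    {"name": "function",
     "opener_pattern": r"^\s*function\s+(\w+)\s*\(([^)]*)",
     "closer_pattern": r"^\s*\}\s*$"},
    {"name": "if",
     "opener_pattern": r"^\s*if\s*\(",
     "closer_pattern": r"^\s*end\s*\(\s*\)"},
    {"name": "try",
     "opener_pattern": r"^\s*try\s*\(\s*\)",
     "closer_pattern": r"^\s*end\s*\(\s*\)"},
]


def find_block_end(lines, start, blocks=BLOCKS):
    """Return the index of the line closing the block opened at lines[start].

    Assumes lines[start] matches some opener_pattern and that the lines are
    already comment-stripped. Returns None if the block is never closed.
    """
    compiled = [
        (re.compile(b["opener_pattern"]), re.compile(b["closer_pattern"]))
        for b in blocks
    ]
    stack = []  # closers of the blocks currently open; len(stack) is the depth
    for i in range(start, len(lines)):
        line = lines[i]
        closer_for_new_block = next(
            (closer for opener, closer in compiled if opener.match(line)), None
        )
        if closer_for_new_block is not None:
            stack.append(closer_for_new_block)  # opener: depth + 1
        elif stack and stack[-1].match(line):
            stack.pop()  # closer of the innermost block: depth - 1
            if not stack:
                return i  # depth is back to zero, the block is complete
    return None
```

Keeping the closer of each open block on the stack lets a try() nested inside a function consume its own end() without ending the function, which only closes on the matching }.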

"blocks": [
  {
    "name": "function",
    "doc_type": "code",
    "opener_pattern": "^\\s*function\\s+(\\w+)\\s*\\(([^)]*)",
    "closer_pattern": "^\\s*\\}\\s*$",
    "extract_signature": true,
    "signature_template": "function {group1}({group2})"
  },
  ...
]

Block fields

Field	Type	Required	Description
`name`	string	Yes	Identifier for this block type. Used as `block_type` in the chunk metadata and in the `semantic_overlap` context header.
`doc_type`	string	Yes	Elasticsearch `doc_type` field value for chunks from this block.
`opener_pattern`	regex string	Yes	Pattern matched against the clean (comment-stripped) line to detect the start of this block. Must be anchored at the start (`^`).
`closer_pattern`	regex string	Yes	Pattern matched to detect the end of this block. Checked at every line after the opener.
`extract_signature`	bool	No (default: `false`)	When `true`, the chunker extracts a compact signature string from the opener line using capture groups, and creates an additional `function_signature` chunk alongside the full block chunk.
`signature_template`	string	No	Template for the signature string. Uses `{group1}`, `{group2}`, etc. as placeholders for the regex capture groups from `opener_pattern`.
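As an illustration of how signature_template and the capture groups of opener_pattern fit together, here is a small sketch. The function name build_signature is hypothetical; the real chunker may fill the placeholders differently.

```python
import re


def build_signature(line, opener_pattern, signature_template):
    """Fill {group1}, {group2}, ... with the opener's capture groups.

    Returns None when the line is not an opener for this block.
    """
    match = re.match(opener_pattern, line)
    if match is None:
        return None
    signature = signature_template
    for index, value in enumerate(match.groups(), start=1):
        # Unmatched optional groups are None; render them as empty strings.
        signature = signature.replace("{group%d}" % index, value or "")
    return signature


signature = build_signature(
    "function validateAccess(userId, token) {",
    r"^\s*function\s+(\w+)\s*\(([^)]*)",
    "function {group1}({group2})",
)
# signature == "function validateAccess(userId, token)"
```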

Current block definitions

`function`

opener:  ^\\s*function\\s+(\\w+)\\s*\\(([^)]*)
closer:  ^\\s*\\}\\s*$

Matches any top-level or nested AVAP function declaration. The two capture groups extract the function name (group1) and parameter list (group2), which are combined into the signature template function {group1}({group2}).

Because extract_signature: true, every function produces two chunks:

A doc_type: "code", block_type: "function" chunk containing the full function body.
A doc_type: "function_signature", block_type: "function_signature" chunk containing only the signature string (e.g. function validateAccess(userId, token)). This lightweight chunk is indexed separately to enable fast function-name lookup without retrieving the entire body.

Additionally, the function signature is registered in the SemanticOverlapBuffer. Subsequent non-function chunks in the same file will receive the current function signature prepended as a context comment (// contexto: function validateAccess(userId, token)), keeping the surrounding code semantically grounded.
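The behaviour described for the SemanticOverlapBuffer can be approximated as follows. This is a sketch under stated assumptions: the class name OverlapBufferSketch and its methods are invented for illustration, and only the // contexto: header format is taken from the description above.

```python
class OverlapBufferSketch:
    """Remembers the latest function signature seen in the current file."""

    def __init__(self):
        self.current_signature = None

    def reset(self):
        # Assumed to be called when the chunker moves on to a new file.
        self.current_signature = None

    def register(self, signature):
        self.current_signature = signature

    def apply(self, chunk_text, block_type):
        """Prepend the context header to non-function chunks."""
        is_function_chunk = block_type in ("function", "function_signature")
        if is_function_chunk or self.current_signature is None:
            return chunk_text
        return f"// contexto: {self.current_signature}\n{chunk_text}"
```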

`if`

opener:  ^\\s*if\\s*\\(
closer:  ^\\s*end\\s*\\(\\s*\\)

Matches AVAP conditional blocks. Note: AVAP uses end() as the closer, not }.

`startLoop`

opener:  ^\\s*startLoop\\s*\\(
closer:  ^\\s*endLoop\\s*\\(\\s*\\)

Matches AVAP iteration blocks. The closer is endLoop().

`try`

opener:  ^\\s*try\\s*\\(\\s*\\)
closer:  ^\\s*end\\s*\\(\\s*\\)

Matches AVAP error-handling blocks (try() … end()).

4. Statements

Statements are single-line constructs. Lines that are not part of any block opener or closer are classified against the statement patterns in order. The first match wins. If no pattern matches, the statement is classified as "statement" (the fallback).

Consecutive lines with the same statement type are grouped into a single chunk, keeping semantically related statements together. When the statement type changes, the current group is flushed as a chunk.
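The classification and grouping rules can be sketched in a few lines. The names classify and group_statements are hypothetical, and only three of the configured patterns are included to keep the example short.

```python
import re
from itertools import groupby

STATEMENTS = [
    {"name": "registerEndpoint", "pattern": r"^\s*registerEndpoint\s*\("},
    {"name": "addVar", "pattern": r"^\s*addVar\s*\("},
    {"name": "assignment", "pattern": r"^\s*\w+\s*=\s*"},
]


def classify(line, statements=STATEMENTS):
    """Return the name of the first matching pattern, or the fallback."""
    for statement in statements:
        if re.match(statement["pattern"], line):
            return statement["name"]
    return "statement"


def group_statements(lines, statements=STATEMENTS):
    """Group consecutive lines that share a statement type into one chunk."""
    return [
        (kind, list(group))
        for kind, group in groupby(lines, key=lambda l: classify(l, statements))
    ]


chunks = group_statements([
    "addVar(a = 1)",
    "addVar(b = 2)",
    "total = a",
    "doSomething()",
])
# Three groups: two addVar lines together, then assignment, then the fallback.
```

itertools.groupby only merges adjacent items with the same key, which matches the rule that a group is flushed as soon as the statement type changes.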

"statements": [
  { "name": "registerEndpoint", "pattern": "^\\s*registerEndpoint\\s*\\(" },
  { "name": "addVar",           "pattern": "^\\s*addVar\\s*\\(" },
  ...
]

Statement fields

Field	Type	Description
`name`	string	Used as `block_type` in the chunk metadata.
`pattern`	regex string	Matched against the clean line. First match wins — order matters.

Current statement definitions

Name	Matches	AVAP commands
`registerEndpoint`	API route registration	`registerEndpoint(...)`
`addVar`	Variable declaration	`addVar(...)`
`io_command`	Input/output operations	`addParam`, `getListLen`, `addResult`, `getQueryParamList`
`http_command`	HTTP client calls	`RequestPost`, `RequestGet`
`orm_command`	Database ORM operations	`ormDirect`, `ormCheckTable`, `ormCreateTable`, `ormAccessSelect`, `ormAccessInsert`, `ormAccessUpdate`
`util_command`	Utility and helper functions	`variableToList`, `itemFromList`, `variableFromJSON`, `AddVariableToJSON`, `encodeSHA256`, `encodeMD5`, `getRegex`, `getDateTime`, `stampToDatetime`, `getTimeStamp`, `randomString`, `replace`
`async_command`	Concurrency primitives	`x = go funcName(`, `gather(`
`connector`	External service connector	`x = avapConnector(`
`modularity`	Module imports	`import`, `include`
`assignment`	Variable assignment (catch-all before fallback)	`x = ...`

Ordering note: registerEndpoint, addVar, and the specific command categories are listed before assignment intentionally. The assignment pattern matches any line of the form x = ..., which includes async_command lines (x = go funcName(...)) and connector lines (x = avapConnector(...)). Because the first match wins, the more specific patterns must come first.
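The effect of ordering is easy to check in isolation. In the sketch below, first_match is a hypothetical stand-in for the chunker's first-match-wins loop. A connector line satisfies both patterns, so the list order alone decides the label.

```python
import re

CONNECTOR = {"name": "connector", "pattern": r"^\s*\w+\s*=\s*avapConnector\s*\("}
ASSIGNMENT = {"name": "assignment", "pattern": r"^\s*\w+\s*=\s*"}


def first_match(line, statements):
    for statement in statements:
        if re.match(statement["pattern"], line):
            return statement["name"]
    return "statement"


line = 'client = avapConnector("token")'
specific_first = first_match(line, [CONNECTOR, ASSIGNMENT])  # "connector"
catch_all_first = first_match(line, [ASSIGNMENT, CONNECTOR])  # "assignment"
```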

5. Semantic Tags

Semantic tags are boolean metadata flags applied to every chunk (both blocks and statements) by scanning the entire chunk content with a regex. A chunk can have multiple tags simultaneously.

The complexity field is automatically computed as the count of true tags in a chunk's metadata, providing a rough signal of how much AVAP functionality a given chunk exercises.
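A minimal sketch of tagging and the derived complexity count is shown below. The helper name tag_chunk is an assumption; the four patterns are taken from the config with JSON escaping removed.

```python
import re

SEMANTIC_TAGS = [
    {"tag": "uses_orm",
     "pattern": r"\b(ormDirect|ormCheckTable|ormCreateTable|ormAccessSelect|ormAccessInsert|ormAccessUpdate)\s*\("},
    {"tag": "uses_auth", "pattern": r"\b(addParam|_status)\b"},
    {"tag": "uses_error_handling", "pattern": r"\btry\s*\(\s*\)"},
    {"tag": "returns_result", "pattern": r"\baddResult\s*\("},
]


def tag_chunk(text, semantic_tags=SEMANTIC_TAGS):
    """Return chunk metadata: one True flag per matching tag, plus complexity."""
    # re.search scans the whole chunk, so a tag fires wherever the pattern occurs.
    flags = {
        entry["tag"]: True
        for entry in semantic_tags
        if re.search(entry["pattern"], text)
    }
    metadata = dict(flags)
    metadata["complexity"] = len(flags)  # number of tags that are true
    return metadata


body = """function validateAccess(userId, token) {
    addVar(isValid = false)
    addParam(userId)
    try()
        ormAccessSelect(users, id = userId)
        addVar(isValid = true)
    end()
    addResult(isValid)
}"""
metadata = tag_chunk(body)
```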

"semantic_tags": [
  { "tag": "uses_orm",   "pattern": "\\b(ormDirect|ormAccessSelect|...)\\s*\\(" },
  ...
]

Tag fields

Field	Description
`tag`	Key name in the `metadata` object stored in Elasticsearch. Value is always `true` when present.
`pattern`	Regex searched (not matched) across the full chunk text. Uses `\b` word boundaries to avoid false positives.

Current semantic tags

Tag	Detected when chunk contains
`uses_orm`	Any ORM command: `ormDirect`, `ormCheckTable`, `ormCreateTable`, `ormAccessSelect`, `ormAccessInsert`, `ormAccessUpdate`
`uses_http`	HTTP client calls: `RequestPost`, `RequestGet`
`uses_connector`	External connector: `avapConnector(`
`uses_async`	Concurrency: `go funcName(` or `gather(`
`uses_crypto`	Hashing/encoding: `encodeSHA256(`, `encodeMD5(`
`uses_auth`	Auth-related commands: `addParam`, `_status`
`uses_error_handling`	Error handling block: `try()`
`uses_loop`	Loop construct: `startLoop(`
`uses_json`	JSON operations: `variableFromJSON(`, `AddVariableToJSON(`
`uses_list`	List operations: `variableToList(`, `itemFromList(`, `getListLen(`
`uses_regex`	Regular expressions: `getRegex(`
`uses_datetime`	Date/time operations: `getDateTime(`, `getTimeStamp(`, `stampToDatetime(`
`returns_result`	Returns data to the API caller: `addResult(`
`registers_endpoint`	Defines an API route: `registerEndpoint(`

How tags are used at retrieval time: The Elasticsearch mapping stores each tag as a boolean field under the metadata object. This enables filtered retrieval — for example, a future retrieval enhancement could boost chunks with metadata.uses_orm: true for queries that contain ORM-related keywords, improving precision for database-related questions.
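A boosted query of that kind could look like the sketch below, expressed as a Python dict in Elasticsearch Query DSL. The function build_boosted_query, the keyword table, and the text field name content are all assumptions. Only the metadata.<tag> boolean fields come from this document.

```python
def build_boosted_query(user_query, tag_keywords, text_field="content", boost=2.0):
    """Build a bool query that boosts chunks whose tags fit the question.

    tag_keywords maps a tag name to keywords that suggest it is relevant.
    """
    lowered = user_query.lower()
    should = [
        {"term": {f"metadata.{tag}": {"value": True, "boost": boost}}}
        for tag, keywords in tag_keywords.items()
        if any(keyword.lower() in lowered for keyword in keywords)
    ]
    return {
        "query": {
            "bool": {
                # The text match is mandatory; tag clauses only raise the score.
                "must": [{"match": {text_field: user_query}}],
                "should": should,
            }
        }
    }


query = build_boosted_query(
    "How do I query a table with ormAccessSelect?",
    {"uses_orm": ["orm"], "uses_http": ["RequestGet", "RequestPost"]},
)
```

Putting the tag clauses under should rather than filter keeps untagged chunks retrievable while still ranking tagged ones higher.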

6. How They Work Together

The following example shows how avap_chunker.py processes a real .avap file using this config:

// Validate user session
function validateAccess(userId, token) {
    addVar(isValid = false)
    addParam(userId)
    try()
        ormAccessSelect(users, id = userId)
        addVar(isValid = true)
    end()
    addResult(isValid)
}

registerEndpoint(POST, /validate)

Chunks produced:

#	`doc_type`	`block_type`	Content	Tags
1	`code`	`function`	Full function body (lines 2–10)	`uses_auth`, `uses_orm`, `uses_error_handling`, `returns_result` · `complexity: 4`
2	`function_signature`	`function_signature`	`function validateAccess(userId, token)`	—
3	`code`	`registerEndpoint`	`registerEndpoint(POST, /validate)`	`registers_endpoint` · `complexity: 1`

Chunk 3 also receives the function signature as a semantic overlap header: the SemanticOverlapBuffer tracks validateAccess and injects it as context into any subsequent non-function chunks in the same file, and registerEndpoint(POST, /validate) is such a chunk.

7. Adding New Constructs

Adding a new block type

Identify the opener and closer patterns from the AVAP LRM (docs/LRM/avap.md).
Add an entry to "blocks" in avap_config.json.
If the block introduces a named construct worth indexing independently (like functions), set "extract_signature": true and define a "signature_template".

Run a smoke test on a representative .avap file:

python scripts/pipelines/ingestion/avap_chunker.py \
  --lang-config scripts/pipelines/ingestion/avap_config.json \
  --docs-path docs/samples \
  --output /tmp/test_chunks.jsonl \
  --no-dedup

Inspect /tmp/test_chunks.jsonl and verify the new block_type appears with the expected content.
Re-run the ingestion pipeline to rebuild the index.
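For the inspection step, a short script that counts chunks per block_type is usually enough to confirm the new construct was picked up. The sketch below assumes each JSONL record stores block_type either at the top level or under a metadata object. The real output layout is not specified here and may differ.

```python
import json
from collections import Counter


def count_block_types(jsonl_lines):
    """Count records per block_type in an iterable of JSONL lines."""
    counts = Counter()
    for raw in jsonl_lines:
        raw = raw.strip()
        if not raw:
            continue
        record = json.loads(raw)
        # Assumption: block_type is a top-level key or nested under "metadata".
        block_type = record.get("block_type")
        if block_type is None:
            block_type = record.get("metadata", {}).get("block_type", "<missing>")
        counts[block_type] += 1
    return counts


# Usage against the smoke-test output:
#   with open("/tmp/test_chunks.jsonl", encoding="utf-8") as handle:
#       print(count_block_types(handle))
```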

Adding a new statement category

Add an entry to "statements" before the assignment catch-all.
Use ^\\s* to anchor the pattern at the start of the line.
Test as above — verify the new block_type appears in the JSONL output.

Adding a new semantic tag

Add an entry to "semantic_tags".
Use \\b word boundaries to prevent false positives on substrings.
Add the new tag as a boolean field to the Elasticsearch index mapping in avap_ingestor.py (build_index_mapping()).
Re-index from scratch — existing documents will not have the new tag unless the index is rebuilt (--delete flag).
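To keep the mapping and the config in step, the boolean fields can be derived from semantic_tags instead of being listed by hand. The sketch below is not the content of build_index_mapping(). The helper name tag_mapping_properties is hypothetical and the surrounding mapping is reduced to the metadata object only.

```python
def tag_mapping_properties(config):
    """Return one boolean field definition per configured semantic tag."""
    return {entry["tag"]: {"type": "boolean"} for entry in config["semantic_tags"]}


config = {
    "semantic_tags": [
        {"tag": "uses_orm", "pattern": r"\bormDirect\s*\("},
        {"tag": "uses_loop", "pattern": r"\bstartLoop\s*\("},
    ]
}

# Fragment of an index mapping; other fields are omitted on purpose.
mapping_fragment = {
    "properties": {
        "metadata": {"properties": tag_mapping_properties(config)}
    }
}
```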

8. Full Annotated Example

The // annotations in the listing below are explanatory only. Standard JSON has no comment syntax, so they need to be removed if avap_config.json is read with a strict JSON parser.

{
  // Identifies this config as the AVAP v1.0 grammar
  "language": "avap",
  "version": "1.0",
  "file_extensions": [".avap"],     // Only .avap files; .md is always included

  "lexer": {
    "string_delimiters": ["\"", "'"], // Both quote styles used in AVAP
    "escape_char": "\\",
    "comment_line": ["///", "//"],  // /// first — longest match wins
    "comment_block": { "open": "/*", "close": "*/" },
    "line_oriented": true
  },

  "blocks": [
    {
      "name": "function",
      "doc_type": "code",
      // Captures: group1=name, group2=params
      "opener_pattern": "^\\s*function\\s+(\\w+)\\s*\\(([^)]*)",
      "closer_pattern": "^\\s*\\}\\s*$",    // AVAP functions close with }
      "extract_signature": true,
      "signature_template": "function {group1}({group2})"
    },
    {
      "name": "if",
      "doc_type": "code",
      "opener_pattern": "^\\s*if\\s*\\(",
      "closer_pattern": "^\\s*end\\s*\\(\\s*\\)"   // AVAP if closes with end()
    },
    {
      "name": "startLoop",
      "doc_type": "code",
      "opener_pattern": "^\\s*startLoop\\s*\\(",
      "closer_pattern": "^\\s*endLoop\\s*\\(\\s*\\)"
    },
    {
      "name": "try",
      "doc_type": "code",
      "opener_pattern": "^\\s*try\\s*\\(\\s*\\)",
      "closer_pattern": "^\\s*end\\s*\\(\\s*\\)"   // try also closes with end()
    }
  ],

  "statements": [
    // Specific patterns first — must come before the generic "assignment" catch-all
    { "name": "registerEndpoint", "pattern": "^\\s*registerEndpoint\\s*\\(" },
    { "name": "addVar",           "pattern": "^\\s*addVar\\s*\\(" },
    { "name": "io_command",       "pattern": "^\\s*(addParam|getListLen|addResult|getQueryParamList)\\s*\\(" },
    { "name": "http_command",     "pattern": "^\\s*(RequestPost|RequestGet)\\s*\\(" },
    { "name": "orm_command",      "pattern": "^\\s*(ormDirect|ormCheckTable|ormCreateTable|ormAccessSelect|ormAccessInsert|ormAccessUpdate)\\s*\\(" },
    { "name": "util_command",     "pattern": "^\\s*(variableToList|itemFromList|variableFromJSON|AddVariableToJSON|encodeSHA256|encodeMD5|getRegex|getDateTime|stampToDatetime|getTimeStamp|randomString|replace)\\s*\\(" },
    { "name": "async_command",    "pattern": "^\\s*(\\w+\\s*=\\s*go\\s+|gather\\s*\\()" },
    { "name": "connector",        "pattern": "^\\s*\\w+\\s*=\\s*avapConnector\\s*\\(" },
    { "name": "modularity",       "pattern": "^\\s*(import|include)\\s+" },
    { "name": "assignment",       "pattern": "^\\s*\\w+\\s*=\\s*" }  // catch-all
  ],

  "semantic_tags": [
    // Applied to every chunk by full-content regex search (not line-by-line)
    { "tag": "uses_orm",            "pattern": "\\b(ormDirect|ormCheckTable|ormCreateTable|ormAccessSelect|ormAccessInsert|ormAccessUpdate)\\s*\\(" },
    { "tag": "uses_http",           "pattern": "\\b(RequestPost|RequestGet)\\s*\\(" },
    { "tag": "uses_connector",      "pattern": "\\bavapConnector\\s*\\(" },
    { "tag": "uses_async",          "pattern": "\\bgo\\s+\\w+\\s*\\(|\\bgather\\s*\\(" },
    { "tag": "uses_crypto",         "pattern": "\\b(encodeSHA256|encodeMD5)\\s*\\(" },
    { "tag": "uses_auth",           "pattern": "\\b(addParam|_status)\\b" },
    { "tag": "uses_error_handling", "pattern": "\\btry\\s*\\(\\s*\\)" },
    { "tag": "uses_loop",           "pattern": "\\bstartLoop\\s*\\(" },
    { "tag": "uses_json",           "pattern": "\\b(variableFromJSON|AddVariableToJSON)\\s*\\(" },
    { "tag": "uses_list",           "pattern": "\\b(variableToList|itemFromList|getListLen)\\s*\\(" },
    { "tag": "uses_regex",          "pattern": "\\bgetRegex\\s*\\(" },
    { "tag": "uses_datetime",       "pattern": "\\b(getDateTime|getTimeStamp|stampToDatetime)\\s*\\(" },
    { "tag": "returns_result",      "pattern": "\\baddResult\\s*\\(" },
    { "tag": "registers_endpoint",  "pattern": "\\bregisterEndpoint\\s*\\(" }
  ]
}
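
Before running the chunker on an edited config, a quick sanity check can catch the most common mistakes. The sketch below is a hypothetical helper, not part of the pipeline. It checks only rules stated in this document: required block fields, openers anchored with ^, regexes that compile, and assignment placed last.

```python
import re


def validate_config(config):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    patterns = []  # (label, regex source) pairs to compile at the end

    for block in config.get("blocks", []):
        name = block.get("name", "?")
        for key in ("name", "doc_type", "opener_pattern", "closer_pattern"):
            if key not in block:
                problems.append(f"block {name}: missing required field {key}")
        opener = block.get("opener_pattern")
        if opener is not None and not opener.startswith("^"):
            problems.append(f"block {name}: opener_pattern is not anchored with ^")
        for key in ("opener_pattern", "closer_pattern"):
            if key in block:
                patterns.append((f"block {name}.{key}", block[key]))

    statements = config.get("statements", [])
    for statement in statements:
        label = f"statement {statement.get('name', '?')}"
        patterns.append((label, statement.get("pattern", "")))
    for entry in config.get("semantic_tags", []):
        patterns.append((f"tag {entry.get('tag', '?')}", entry.get("pattern", "")))

    for label, pattern in patterns:
        try:
            re.compile(pattern)
        except re.error as exc:
            problems.append(f"{label}: invalid regex ({exc})")

    names = [statement.get("name") for statement in statements]
    if "assignment" in names and names[-1] != "assignment":
        problems.append("statements: assignment must be the last entry")
    return problems
```

The config would be loaded with json.load first, so the patterns reach this function with their JSON escaping already resolved.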
