# ADR-0006: Code Indexing Improvements — Comparative Evaluation of Code Chunking Strategies

**Date:** 2026-03-24
**Status:** Proposed
**Deciders:** Rafael Ruiz (CTO), MrHouston Engineering

---

## Context

Efficient code indexing is a critical component for enabling high-quality code search, retrieval-augmented generation (RAG), and semantic understanding in developer tooling. The main challenge lies in representing source code in a way that preserves its syntactic and semantic structure while remaining suitable for embedding-based retrieval systems.

In this context, we explored different strategies for improving the indexing of `.avap` code files, starting from a naïve approach and progressively moving toward more structured representations based on parsing techniques.

### Alternatives

- **File-level chunking (baseline):** Each `.avap` file is treated as a single chunk and indexed directly. This approach is simple and fast but ignores internal structure (functions, classes, blocks).
- **EBNF chunking as metadata:** Each `.avap` file is still treated as a single chunk and indexed directly; however, using the AVAP EBNF syntax, we extract the AST structure and inject it into the chunk metadata.
- **Full EBNF chunking:** Each `.avap` file still produces a single chunk. The difference from the previous two approaches is that the extracted AST is indexed instead of the raw code.
- **Grammar definition chunking:** Code is segmented using a language-specific configuration (`avap_config.json`) instead of one-file chunks. The chunker applies a lexer pass (comments/strings), identifies multi-line blocks (`function`, `if`, `startLoop`, `try`), classifies single-line statements (`registerEndpoint`, `orm_command`, `http_command`, etc.), and enriches every chunk with semantic tags (`uses_orm`, `uses_http`, `uses_async`, `returns_result`, among others).
This strategy also extracts function signatures as dedicated lightweight chunks and propagates local context between nearby chunks (semantic overlap), improving retrieval precision for both API-level and implementation-level queries.

### Indexed docs

For each strategy, we created a separate Elasticsearch index with its own characteristics. The first three approaches produce 33 chunks (one chunk per file), whereas the last approach produces 89 chunks.

### How can we evaluate each strategy?

**Evaluation Protocol:**

1. **Golden Dataset**
   - Generate a set of natural-language queries paired with their ground-truth context (filename).
   - Each query should be answerable by examining one or more code samples.
   - Example: Query = "How do you handle errors in AVAP?" → Context = `try_catch_request.avap`
2. **Test Each Strategy**
   - For each of the 4 chunking strategies, run the same set of queries against the respective Elasticsearch index.
   - Record the top-10 retrieved chunks for each query.
3. **Metrics**
   - `NDCG@10`: normalized discounted cumulative gain at rank 10 (measures ranking quality).
   - `Recall@10`: fraction of relevant chunks retrieved in the top 10.
   - `MRR@10`: mean reciprocal rank (position of the first relevant result).
4. **Relevance Judgment**
   - A chunk is considered relevant if it contains code directly answering the query.
   - For file-level strategies: the entire file is relevant or irrelevant.
   - For grammar-definition chunking: specific block/statement chunks are relevant even if the full file is not.
5. **Acceptance Criteria**
   - **Grammar definition chunking must achieve at least a 10% improvement in NDCG@10 over the file-level baseline.**
   - **Recall@10 must not drop by more than 5 absolute percentage points vs. file-level.**
   - **Index size increase must remain below 50% of baseline.**

## Decision

## Rationale

## Consequences
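---

### Appendix: Grammar-definition chunker sketch

To make the grammar-definition strategy concrete, the sketch below shows a drastically simplified chunker. The `CONFIG` dict is a stand-in for `avap_config.json`, whose real schema is not reproduced in this ADR, and the block keywords, tag regexes, and function names are illustrative assumptions. The production chunker additionally runs a full lexer pass over comments/strings, emits dedicated signature chunks, and applies semantic overlap, none of which is shown here.

```python
import re

# Assumed, simplified stand-in for avap_config.json (real schema not shown here).
CONFIG = {
    "block_starters": ["function", "if", "startLoop", "try"],
    "tag_rules": {
        "uses_orm": r"^orm_",          # line invokes an ORM command
        "uses_http": r"^http_",        # line invokes an HTTP command
        "uses_async": r"\basync\b",    # line mentions async execution
        "returns_result": r"^return\b" # line returns a value
    },
}

def chunk_avap(source: str) -> list[dict]:
    """Split AVAP source into block-level chunks and tag each one.

    Hypothetical sketch: a new chunk starts whenever a configured block
    starter keyword opens a new top-level construct.
    """
    chunks, current = [], []
    for line in source.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        # First token before any '(' decides whether a new block starts.
        first_word = (stripped.split("(")[0].split() or [""])[0]
        if first_word in CONFIG["block_starters"] and current:
            chunks.append(current)
            current = []
        current.append(stripped)
    if current:
        chunks.append(current)

    out = []
    for lines in chunks:
        tags = [tag for tag, pattern in CONFIG["tag_rules"].items()
                if any(re.search(pattern, l) for l in lines)]
        out.append({"text": "\n".join(lines), "tags": tags})
    return out
```

Each resulting dict maps directly onto one Elasticsearch document, with `tags` stored as a keyword field so that queries can filter on semantics (e.g. only chunks with `uses_http`) before vector scoring.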
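---

### Appendix: Metric computation sketch

The three metrics in the evaluation protocol can be computed with a short script. This is a minimal sketch assuming binary relevance (a retrieved chunk id is either in the query's relevant set or not); the function names are illustrative, not part of any existing tooling, and per-strategy scores would be averaged over all golden-dataset queries.

```python
import math

def dcg_at_k(gains, k):
    """DCG over the top-k positions: sum of rel_i / log2(i + 1), i 1-based."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(retrieved, relevant, k=10):
    """NDCG@k with binary relevance labels."""
    gains = [1.0 if doc in relevant else 0.0 for doc in retrieved]
    ideal = [1.0] * min(len(relevant), k)  # best achievable ranking
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0

def recall_at_k(retrieved, relevant, k=10):
    """Fraction of the relevant set found in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def mrr_at_k(retrieved, relevant, k=10):
    """Reciprocal rank of the first relevant result in the top k (0 if none)."""
    for rank, doc in enumerate(retrieved[:k], start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0
```

For example, with `retrieved = ["file_a.avap", "try_catch_request.avap", "file_b.avap"]` and `relevant = {"try_catch_request.avap"}`, Recall@10 is 1.0 and MRR@10 is 0.5, since the single relevant chunk appears at rank 2.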