# ADR-0006: Code Indexing Improvements — Comparative Evaluation of Code Chunking Strategies

**Date:** 2026-03-24
**Status:** Proposed
**Deciders:** Rafael Ruiz (CTO), MrHouston Engineering

---

## Context

Efficient code indexing is a critical component for enabling high-quality code search, retrieval-augmented generation (RAG), and semantic understanding in developer tooling. The main challenge lies in representing source code in a way that preserves its syntactic and semantic structure while remaining suitable for embedding-based retrieval systems.

In this context, we explored different strategies for improving the indexing of `.avap` code files, starting from a naïve approach and progressively moving toward more structured representations based on parsing techniques.

### Alternatives

- **File-level chunking (baseline):** Each `.avap` file is treated as a single chunk and indexed directly. This approach is simple and fast but ignores internal structure (functions, classes, blocks).
- **EBNF chunking as metadata:** Each `.avap` file is still treated as a single chunk and indexed directly; however, using the AVAP EBNF syntax, we extract the AST structure and inject it into the chunk metadata.
- **Full EBNF chunking:** Each `.avap` file still produces a single chunk. The difference from the previous two approaches is that the extracted AST is indexed instead of the raw code.
- **Grammar definition chunking:** Code is segmented using a language-specific configuration (`avap_config.json`) instead of one-file chunks. The chunker applies a lexer pass (comments/strings), identifies multi-line blocks (`function`, `if`, `startLoop`, `try`), classifies single-line statements (`registerEndpoint`, `orm_command`, `http_command`, etc.), and enriches every chunk with semantic tags (`uses_orm`, `uses_http`, `uses_async`, `returns_result`, among others).
This strategy also extracts function signatures as dedicated lightweight chunks and propagates local context between nearby chunks (semantic overlap), improving retrieval precision for both API-level and implementation-level queries.

### Indexed docs

For each strategy, we created a separate Elasticsearch index with its own characteristics. The first three approaches produce 33 chunks (one chunk per file), whereas the last approach produces 89 chunks.

### How can we evaluate each strategy?

**Evaluation Protocol:**

1. **Golden Dataset**
   - Generate a set of natural-language queries paired with their ground-truth context (filename).
   - Each query should be answerable by examining one or more code samples.
   - Example: Query = "How do you handle errors in AVAP?" → Context = `try_catch_request.avap`
2. **Test Each Strategy**
   - For each of the 4 chunking strategies, run the same set of queries against the respective Elasticsearch index.
   - Record the top-10 retrieved chunks for each query.
3. **Metrics**
   - `NDCG@10`: normalized discounted cumulative gain at rank 10 (measures ranking quality).
   - `Recall@10`: fraction of relevant chunks retrieved in the top 10.
   - `MRR@10`: mean reciprocal rank (position of the first relevant result).
4. **Relevance Judgment**
   - A chunk is considered relevant if it contains code directly answering the query.
   - For file-level strategies: the entire file is relevant or irrelevant.
   - For grammar-definition chunking: specific block/statement chunks are relevant even if the full file is not.
5. **Acceptance Criteria**
   - **Grammar definition chunking must achieve at least a 10% improvement in NDCG@10 over the file-level baseline.**
   - **Recall@10 must not drop by more than 5 absolute percentage points vs. file-level.**
   - **Index size increase must remain below 50% of baseline.**

## Decision

## Rationale

## Consequences
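---

### Appendix: Grammar-definition chunker sketch

To make the grammar-definition strategy concrete, the sketch below shows a drastically simplified chunker. The `CONFIG` dict is a stand-in for `avap_config.json`, whose real schema is not reproduced in this ADR, and the block keywords, tag regexes, and function names are illustrative assumptions. The production chunker additionally runs a full lexer pass over comments/strings, emits dedicated signature chunks, and applies semantic overlap, none of which is shown here.

```python
import re

# Assumed, simplified stand-in for avap_config.json (real schema not shown here).
CONFIG = {
    "block_starters": ["function", "if", "startLoop", "try"],
    "tag_rules": {
        "uses_orm": r"^orm_",          # line invokes an ORM command
        "uses_http": r"^http_",        # line invokes an HTTP command
        "uses_async": r"\basync\b",    # line mentions async execution
        "returns_result": r"^return\b" # line returns a value
    },
}

def chunk_avap(source: str) -> list[dict]:
    """Split AVAP source into block-level chunks and tag each one.

    Hypothetical sketch: a new chunk starts whenever a configured block
    starter keyword opens a new top-level construct.
    """
    chunks, current = [], []
    for line in source.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        # First token before any '(' decides whether a new block starts.
        first_word = (stripped.split("(")[0].split() or [""])[0]
        if first_word in CONFIG["block_starters"] and current:
            chunks.append(current)
            current = []
        current.append(stripped)
    if current:
        chunks.append(current)

    out = []
    for lines in chunks:
        tags = [tag for tag, pattern in CONFIG["tag_rules"].items()
                if any(re.search(pattern, l) for l in lines)]
        out.append({"text": "\n".join(lines), "tags": tags})
    return out
```

Each resulting dict maps directly onto one Elasticsearch document, with `tags` stored as a keyword field so that queries can filter on semantics (e.g. only chunks with `uses_http`) before vector scoring.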
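---

### Appendix: Metric computation sketch

The three metrics in the evaluation protocol can be computed with a short script. This is a minimal sketch assuming binary relevance (a retrieved chunk id is either in the query's relevant set or not); the function names are illustrative, not part of any existing tooling, and per-strategy scores would be averaged over all golden-dataset queries.

```python
import math

def dcg_at_k(gains, k):
    """DCG over the top-k positions: sum of rel_i / log2(i + 1), i 1-based."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(retrieved, relevant, k=10):
    """NDCG@k with binary relevance labels."""
    gains = [1.0 if doc in relevant else 0.0 for doc in retrieved]
    ideal = [1.0] * min(len(relevant), k)  # best achievable ranking
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0

def recall_at_k(retrieved, relevant, k=10):
    """Fraction of the relevant set found in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def mrr_at_k(retrieved, relevant, k=10):
    """Reciprocal rank of the first relevant result in the top k (0 if none)."""
    for rank, doc in enumerate(retrieved[:k], start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0
```

For example, with `retrieved = ["file_a.avap", "try_catch_request.avap", "file_b.avap"]` and `relevant = {"try_catch_request.avap"}`, Recall@10 is 1.0 and MRR@10 is 0.5, since the single relevant chunk appears at rank 2.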