ADR-0006: Code Indexing Improvements — Comparative Evaluation of Code Chunking Strategies

Date: 2026-03-24
Status: Proposed
Deciders: Rafael Ruiz (CTO), MrHouston Engineering


Context

Efficient code indexing is a critical component for enabling high-quality code search, retrieval-augmented generation (RAG), and semantic understanding in developer tooling. The main challenge lies in representing source code in a way that preserves its syntactic and semantic structure while remaining suitable for embedding-based retrieval systems.

In this context, we explored different strategies to improve the indexing of .avap code files, starting from a naïve approach and progressively moving toward more structured representations based on parsing techniques.

Alternatives

  • File-level chunking (baseline):

    Each .avap file is treated as a single chunk and indexed directly. This approach is simple and fast but ignores internal structure (functions, classes, blocks).

  • EBNF chunking as metadata:

    Each .avap file is still treated as a single chunk and indexed directly. However, using the AVAP EBNF grammar, we extract the AST structure and inject it into the chunk metadata.

  • Full EBNF chunking:

    Each .avap file is still treated as a single chunk and indexed directly. The difference from the previous two approaches is that the AST is indexed instead of the code.

  • Grammar definition chunking:

    Code is segmented using a language-specific configuration (avap_config.json) instead of one-file chunks. The chunker applies a lexer (comments/strings), identifies multi-line blocks (function, if, startLoop, try), classifies single-line statements (registerEndpoint, orm_command, http_command, etc.), and enriches every chunk with semantic tags (uses_orm, uses_http, uses_async, returns_result, among others).

    This strategy also extracts function signatures as dedicated lightweight chunks and propagates local context between nearby chunks (semantic overlap), improving retrieval precision for both API-level and implementation-level queries.
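As a rough illustration of the grammar-definition strategy, the sketch below splits AVAP source into block chunks and enriches each with semantic tags. It is a minimal, hypothetical approximation: the real chunker is driven by avap_config.json and a proper lexer, and the keyword lists and tag rules here are assumptions, not the actual configuration.

```python
# Hypothetical sketch of grammar-definition chunking. Block starters and
# statement-to-tag rules are illustrative; the real values live in
# avap_config.json.
BLOCK_STARTERS = ("function", "if", "startLoop", "try")
STATEMENT_TAGS = {
    "orm_command": "uses_orm",
    "http_command": "uses_http",
    "registerEndpoint": "registers_endpoint",
}


def chunk_avap(source: str) -> list[dict]:
    """Split AVAP source into block chunks, tagging each with semantic hints."""
    chunks, current, tags = [], [], set()

    def flush():
        if current:
            chunks.append({"code": "\n".join(current), "tags": sorted(tags)})
            current.clear()
            tags.clear()

    for line in source.splitlines():
        stripped = line.strip()
        if stripped.startswith(BLOCK_STARTERS):
            flush()  # a new multi-line block begins here
        for keyword, tag in STATEMENT_TAGS.items():
            if keyword in stripped:
                tags.add(tag)
        if "async" in stripped:
            tags.add("uses_async")
        current.append(line)
    flush()
    return chunks
```

A production version would also emit the signature-only chunks and the semantic overlap between neighbors described above; this sketch only shows the block segmentation and tag enrichment.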

Indexed docs

For each strategy, we created a separate Elasticsearch index with its own characteristics. The first three approaches produce 33 chunks (1 chunk per file), whereas the last approach produces 89 chunks.
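The per-strategy setup can be summarized as one mapping shared across four indices. The index names and mapping fields below are illustrative assumptions, not the experiment's actual configuration; against a live cluster each entry would be created with `es.indices.create(index=name, mappings=MAPPING)` from the official Python client.

```python
# Assumed mapping shared by all four strategy indices (illustrative only).
MAPPING = {
    "properties": {
        "code": {"type": "text"},      # chunk body (source code or AST text)
        "tags": {"type": "keyword"},   # semantic tags, grammar-definition only
    },
}

# Hypothetical index names, with the chunk counts reported above.
INDICES = {
    "avap-file-level": 33,            # 1 chunk per file
    "avap-ebnf-metadata": 33,
    "avap-ebnf-full": 33,
    "avap-grammar-definition": 89,    # block/statement/signature chunks
}
```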

How can we evaluate each strategy?

Evaluation Protocol:

  1. Golden Dataset

    • Generate a set of natural language queries paired with their ground-truth context (filename).
    • Each query should be answerable by examining one or more code samples.
    • Example: Query="How do you handle errors in AVAP?" → Context="try_catch_request.avap"
  2. Test Each Strategy

    • For each of the 4 chunking strategies, run the same set of queries against the respective Elasticsearch index.
    • Record the top-10 retrieved chunks for each query.
  3. Metrics

    • NDCG@10: Normalized discounted cumulative gain at rank 10 (measures ranking quality).
    • Recall@10: Fraction of relevant chunks retrieved in top 10.
    • MRR@10: Mean reciprocal rank (position of first relevant result).
  4. Relevance Judgment

    • A chunk is considered relevant if it contains code directly answering the query.
    • For file-level strategies: entire file is relevant or irrelevant.
    • For grammar-definition: specific block/statement chunks are relevant even if the full file is not.
  5. Acceptance Criteria

    • Grammar definition must achieve at least a 10% improvement in NDCG@10 over file-level baseline.
    • Recall@10 must not drop by more than 5 absolute percentage points vs file-level.
    • Index size increase must remain below 50% of baseline.

Decision

Rationale

Consequences