# ADR-0006: Code Indexing Improvements — Comparative Evaluation of Code Chunking Strategies
**Date:** 2026-03-24
**Status:** Proposed
**Deciders:** Rafael Ruiz (CTO), MrHouston Engineering
---
## Context
Efficient code indexing is a critical component for enabling high-quality code search, retrieval-augmented generation (RAG), and semantic understanding in developer tooling. The main challenge lies in representing source code in a way that preserves its syntactic and semantic structure while remaining suitable for embedding-based retrieval systems.
In this context, we explored different strategies to improve the indexing of .avap code files, starting from a naïve approach and progressively moving toward more structured representations based on parsing techniques.
### Alternatives
- File-level chunking (baseline):
Each .avap file is treated as a single chunk and indexed directly. This approach is simple and fast but ignores internal structure (functions, classes, blocks).
- EBNF chunking as metadata:
Each .avap file is still treated as a single chunk and indexed directly. However, using the AVAP EBNF grammar, we extract the AST and inject it into the chunk metadata.
- Full EBNF chunking:
Each .avap file is still treated as a single chunk. The difference from the previous two approaches is that the AST is indexed instead of the raw code.
- Grammar definition chunking:
Code is segmented using a language-specific configuration (`avap_config.json`) instead of one-file chunks. The chunker applies a lexer (comments/strings), identifies multi-line blocks (`function`, `if`, `startLoop`, `try`), classifies single-line statements (`registerEndpoint`, `orm_command`, `http_command`, etc.), and enriches every chunk with semantic tags (`uses_orm`, `uses_http`, `uses_async`, `returns_result`, among others).
This strategy also extracts function signatures as dedicated lightweight chunks and propagates local context between nearby chunks (semantic overlap), improving retrieval precision for both API-level and implementation-level queries.
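The grammar-definition chunker described above can be sketched as follows. This is a simplified illustration, not the production implementation: the config schema, keyword lists, and tag rules are assumptions loosely mirroring `avap_config.json`, and signature extraction and semantic overlap are omitted for brevity.

```python
import re

# Hypothetical minimal config mirroring avap_config.json. The block starters,
# statement kinds, and semantic tags come from the ADR; the exact schema of the
# real config file is an assumption.
CONFIG = {
    "block_starters": ["function", "if", "startLoop", "try"],
    "tag_rules": {
        "orm_command": "uses_orm",
        "http_command": "uses_http",
        "async": "uses_async",
        "return": "returns_result",
    },
}


def _make_chunk(lines: list[str], config: dict) -> dict:
    """Build a chunk dict and attach semantic tags found in its text."""
    text = "\n".join(lines)
    tags = sorted({tag for kw, tag in config["tag_rules"].items() if kw in text})
    return {"text": text, "tags": tags}


def chunk_avap(source: str, config: dict = CONFIG) -> list[dict]:
    """Split .avap source into block-level chunks enriched with semantic tags."""
    chunks: list[dict] = []
    current: list[str] = []
    for line in source.splitlines():
        stripped = line.strip()
        # Crude lexer stand-in: skip blank lines and line comments.
        if not stripped or stripped.startswith("//"):
            continue
        first_word = re.split(r"[\s(]", stripped, maxsplit=1)[0]
        if first_word in config["block_starters"]:
            # A block starter opens a new chunk; flush the previous one.
            if current:
                chunks.append(_make_chunk(current, config))
            current = [stripped]
        else:
            current.append(stripped)
    if current:
        chunks.append(_make_chunk(current, config))
    return chunks
```

A real chunker would additionally classify single-line statements into their own chunks and emit lightweight signature chunks, but the tagging principle is the same.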
### Indexed docs
For each strategy, we created a separate Elasticsearch index with its own mapping. The first three approaches produce 33 chunks (one chunk per file), whereas the grammar-definition approach produces 89 chunks.
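To make the difference between the four indices concrete, the document shapes below illustrate what each strategy embeds and stores. The field names and mappings are assumptions for illustration; only the contrast between them (what text gets embedded, what rides along as metadata) reflects the strategies described above.

```python
# Hypothetical document shapes, one per Elasticsearch index. Field names are
# illustrative assumptions, not the real mappings.
file_level_doc = {
    "content": "<raw .avap source>",         # embedded text = entire file
    "filename": "try_catch_request.avap",
}
ebnf_metadata_doc = {
    "content": "<raw .avap source>",         # embedded text = entire file
    "filename": "try_catch_request.avap",
    "ast": {"type": "Program", "body": []},  # AST attached as metadata only
}
full_ebnf_doc = {
    "content": "<serialized AST>",           # embedded text = AST, not code
    "filename": "try_catch_request.avap",
}
grammar_chunk_doc = {
    "content": "<single block or statement>",  # embedded text = one chunk
    "filename": "try_catch_request.avap",
    "tags": ["uses_http", "returns_result"],   # semantic tags from the chunker
}
```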
### How can we evaluate each strategy?
**Evaluation Protocol:**
1. **Golden Dataset**
- Generate a set of natural language queries paired with their ground-truth context (filename).
- Each query should be answerable by examining one or more code samples.
- Example: Query="How do you handle errors in AVAP?" → Context="try_catch_request.avap"
2. **Test Each Strategy**
- For each of the 4 chunking strategies, run the same set of queries against the respective Elasticsearch index.
- Record the top-10 retrieved chunks for each query.
3. **Metrics**
- `NDCG@10`: Normalized discounted cumulative gain at rank 10 (measures ranking quality).
- `Recall@10`: Fraction of relevant chunks retrieved in top 10.
- `MRR@10`: Mean reciprocal rank (position of first relevant result).
4. **Relevance Judgment**
- A chunk is considered relevant if it contains code directly answering the query.
- For file-level strategies: the entire file is either relevant or irrelevant.
- For grammar-definition: specific block/statement chunks are relevant even if the full file is not.
5. **Acceptance Criteria**
- **Grammar definition must achieve at least a 10% improvement in NDCG@10 over file-level baseline.**
- **Recall@10 must not drop by more than 5 absolute percentage points vs file-level.**
- **Index size increase must remain below 50% of baseline.**
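The three retrieval metrics in step 3 can be computed from a ranked list of retrieved chunk identifiers and the golden set of relevant ones. The sketch below assumes binary relevance (a chunk is relevant or not, per step 4); identifiers are whatever key the golden dataset uses, e.g. filenames.

```python
import math


def ndcg_at_k(ranked: list[str], relevant: set[str], k: int = 10) -> float:
    """Binary-relevance NDCG@k: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, doc in enumerate(ranked[:k])
        if doc in relevant
    )
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal else 0.0


def recall_at_k(ranked: list[str], relevant: set[str], k: int = 10) -> float:
    """Fraction of relevant chunks that appear in the top k results."""
    return len(set(ranked[:k]) & relevant) / len(relevant) if relevant else 0.0


def mrr_at_k(ranked: list[str], relevant: set[str], k: int = 10) -> float:
    """Reciprocal rank of the first relevant result within the top k, else 0."""
    for i, doc in enumerate(ranked[:k]):
        if doc in relevant:
            return 1.0 / (i + 1)
    return 0.0
```

Averaging these per-query values over the golden dataset, per index, yields the strategy-level numbers to check against the acceptance criteria (e.g. NDCG@10 of grammar definition vs. the file-level baseline).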
## Decision
## Rationale
## Consequences