Merge branch 'mrh-online-dev' of github.com:BRUNIX-AI/assistance-engine into mrh-online-dev

commit 71eb85cc89

@@ -17,6 +17,8 @@ services:
      OLLAMA_URL: ${OLLAMA_URL}
      OLLAMA_MODEL_NAME: ${OLLAMA_MODEL_NAME}
      OLLAMA_EMB_MODEL_NAME: ${OLLAMA_EMB_MODEL_NAME}
      ANTHROPIC_API_KEY: ${ANTHROPIC_API_KEY}
      ANTHROPIC_MODEL: ${ANTHROPIC_MODEL}
      PROXY_THREAD_WORKERS: 10

    extra_hosts:
@@ -89,19 +89,19 @@ Benchmark confirmation (BEIR evaluation, three datasets):

Qwen2.5-1.5B is eliminated. **Qwen3-Embedding-0.6B is the validated baseline.**

### Why a comparative evaluation was required before adopting Qwen3

Qwen3-Embedding-0.6B's benchmark results were obtained on English-only datasets. They eliminated Qwen2.5-1.5B decisively but did not characterise Qwen3's behaviour on the multilingual mixed corpus that AVAP represents. A second candidate — **BGE-M3** — presented theoretical advantages for this specific corpus that could not be assessed without empirical comparison.

The index rebuild required to adopt any model is destructive and must be done once. Given that the embedding model directly determines the quality of all RAG retrieval in production, adopting a model without a direct comparison between the two viable candidates would not have met the due diligence required for a decision of this impact.

---

## Decision

A **head-to-head comparative evaluation** of BGE-M3 and Qwen3-Embedding-0.6B is being conducted under identical conditions before either is adopted as the production embedding model.

The model that demonstrates superior performance under the evaluation criteria defined below is adopted. This ADR moves to Accepted upon completion of that evaluation, with the selected model documented as the outcome.

---
@@ -130,11 +130,11 @@ The model that demonstrates superior performance under the evaluation criteria d

- Higher MTEB retrieval score than Qwen3-Embedding-0.6B in the programming domain

**Limitations:**

- Not yet benchmarked on CodeXGLUE, CoSQA or SciFact at the time of candidate selection — no prior empirical results for this corpus
- 8,192-token context window — sufficient for the current corpus (max chunk: 833 tokens, 10.2% utilization) but lower headroom for future corpus growth
- Requires tokenizer alignment: `HF_EMB_MODEL_NAME` must be updated to `BAAI/bge-m3` alongside `OLLAMA_EMB_MODEL_NAME` to keep chunk token counting consistent

**Corpus fit assessment:** The intra-chunk multilingual mixing (18.9% of chunks) and the Spanish prose component (79 narrative chunks) are the corpus characteristics most likely to differentiate BGE-M3 from Qwen3. The BEIR and EvaluateRAG evaluations determine whether this theoretical advantage translates to measurable retrieval improvement.

### VRAM
@@ -148,15 +148,15 @@ Both candidates output 1024-dimensional vectors. The Elasticsearch index mapping

## Evaluation Protocol

Both models are evaluated under identical conditions. All results are documented in `research/embeddings/`.

**Step 1 — BEIR benchmarks**

CodeXGLUE, CoSQA and SciFact were run with **BGE-M3** using the same BEIR evaluation scripts and configuration used for Qwen3-Embedding-0.6B. Qwen3-Embedding-0.6B results already existed in `research/embeddings/` and served as the baseline. Reported metrics: NDCG@k, MAP@k, Recall@k and Precision@k at k = 1, 3, 5, 10, 100.

**Step 2 — EvaluateRAG on AVAP corpus**

The Elasticsearch index is rebuilt twice — once with each model — and `EvaluateRAG` is run against the production AVAP golden dataset for both. Reported RAGAS scores: faithfulness, answer_relevancy, context_recall, context_precision, and global score with verdict.

**Selection criterion**
@@ -170,16 +170,101 @@ All margin comparisons use **absolute percentage points** in NDCG@10 (e.g., 0.39

If the EvaluateRAG global scores are within 5 absolute percentage points of each other, the BEIR results determine the outcome under the following conditions:

- BGE-M3 exceeds Qwen3-Embedding-0.6B by more than 2 absolute percentage points on mean NDCG@10 across all three BEIR datasets, AND
- BGE-M3 does not underperform Qwen3-Embedding-0.6B by more than 2 absolute percentage points on CoSQA NDCG@10 specifically.

If neither condition is met — that is, if EvaluateRAG scores are within 5 points and BGE-M3 does not clear both BEIR thresholds — Qwen3-Embedding-0.6B is adopted. It carries lower integration risk, its benchmarks are already documented, and it is the validated baseline for the system.

---

## Rationale

### Step 1 results — BEIR head-to-head comparison

BGE-M3 benchmarks were completed on the same three BEIR datasets using identical evaluation scripts and configuration. Full results are stored in `research/embeddings/embedding_eval_results/emb_models_result.json`. The following tables compare both candidates side by side.

**CodeXGLUE** (code retrieval from GitHub repositories):

| Metric | k | BGE-M3 | Qwen3-Emb-0.6B | Delta (BGE-M3 − Qwen3) |
|---|---|---|---|---|
| NDCG | 1 | **0.9520** | 0.9497 | +0.23 pp |
| NDCG | 5 | **0.9738** | 0.9717 | +0.21 pp |
| NDCG | 10 | **0.9749** | 0.9734 | +0.15 pp |
| NDCG | 100 | **0.9763** | 0.9745 | +0.18 pp |
| Recall | 1 | **0.9520** | 0.9497 | +0.23 pp |
| Recall | 5 | **0.9892** | 0.9876 | +0.16 pp |
| Recall | 10 | 0.9928 | **0.9930** | −0.02 pp |
| Recall | 100 | **0.9989** | 0.9981 | +0.08 pp |

Both models perform near-identically on CodeXGLUE. All deltas are below 0.25 absolute percentage points. This dataset does not differentiate the candidates.

**CoSQA** (natural language queries over code — most representative proxy for AVAP retrieval):

| Metric | k | BGE-M3 | Qwen3-Emb-0.6B | Delta (BGE-M3 − Qwen3) |
|---|---|---|---|---|
| NDCG | 1 | 0.1160 | **0.1740** | −5.80 pp |
| NDCG | 5 | 0.2383 | **0.3351** | −9.68 pp |
| NDCG | 10 | 0.2878 | **0.3909** | −10.31 pp |
| NDCG | 100 | 0.3631 | **0.4510** | −8.79 pp |
| Recall | 1 | 0.1160 | **0.1740** | −5.80 pp |
| Recall | 5 | 0.3660 | **0.5020** | −13.60 pp |
| Recall | 10 | 0.5160 | **0.6700** | −15.40 pp |
| Recall | 100 | 0.8740 | **0.9520** | −7.80 pp |

Qwen3-Embedding-0.6B outperforms BGE-M3 on CoSQA by a wide margin at every k. The NDCG@10 gap is 10.31 absolute percentage points. CoSQA is the most representative proxy for the AVAP retrieval use case — it pairs natural language queries with code snippets — making this the most significant BEIR result.

**SciFact** (scientific prose — out-of-domain control):

| Metric | k | BGE-M3 | Qwen3-Emb-0.6B | Delta (BGE-M3 − Qwen3) |
|---|---|---|---|---|
| NDCG | 1 | 0.5100 | **0.5533** | −4.33 pp |
| NDCG | 5 | 0.6190 | **0.6593** | −4.03 pp |
| NDCG | 10 | 0.6431 | **0.6785** | −3.54 pp |
| NDCG | 100 | 0.6705 | **0.7056** | −3.51 pp |
| Recall | 1 | 0.4818 | **0.5243** | −4.25 pp |
| Recall | 5 | 0.7149 | **0.7587** | −4.38 pp |
| Recall | 10 | 0.7834 | **0.8144** | −3.10 pp |
| Recall | 100 | 0.9037 | **0.9367** | −3.30 pp |

Qwen3-Embedding-0.6B leads BGE-M3 on SciFact by 3–4 absolute percentage points across all metrics. The gap is consistent but narrower than on CoSQA.

### BEIR summary — NDCG@10 comparison

| Dataset | BGE-M3 | Qwen3-Emb-0.6B | Delta | Leader |
|---|---|---|---|---|
| CodeXGLUE | 0.9749 | 0.9734 | +0.15 pp | BGE-M3 (marginal) |
| CoSQA | 0.2878 | **0.3909** | −10.31 pp | **Qwen3** |
| SciFact | 0.6431 | **0.6785** | −3.54 pp | **Qwen3** |
| **Mean** | **0.6353** | **0.6809** | **−4.56 pp** | **Qwen3** |

Qwen3-Embedding-0.6B leads on mean NDCG@10 by 4.56 absolute percentage points, driven primarily by a 10.31 pp advantage on CoSQA.

### Application of tiebreaker criteria to BEIR results

Per the evaluation protocol, if EvaluateRAG global scores are within 5 absolute percentage points, the BEIR tiebreaker applies. The tiebreaker requires BGE-M3 to meet **both** conditions:

1. **BGE-M3 must exceed Qwen3 by more than 2 pp on mean NDCG@10.** Result: BGE-M3 trails by 4.56 pp. **Condition not met.**
2. **BGE-M3 must not underperform Qwen3 by more than 2 pp on CoSQA NDCG@10.** Result: BGE-M3 trails by 10.31 pp. **Condition not met.**

Neither tiebreaker condition is satisfied. Under the defined protocol, if the EvaluateRAG evaluation results in a tie (within 5 pp), the BEIR tiebreaker defaults to Qwen3-Embedding-0.6B.
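The tiebreaker logic is simple enough to state as code. A minimal sketch, reading the NDCG@10 values from the summary table above (thresholds expressed as absolute fractions, e.g. 0.02 = 2 pp):

```python
# NDCG@10 values from the BEIR summary table.
NDCG_AT_10 = {
    "CodeXGLUE": {"bge-m3": 0.9749, "qwen3": 0.9734},
    "CoSQA":     {"bge-m3": 0.2878, "qwen3": 0.3909},
    "SciFact":   {"bge-m3": 0.6431, "qwen3": 0.6785},
}


def tiebreaker_selects_bge_m3(scores: dict) -> bool:
    """Apply both tiebreaker conditions; BGE-M3 wins only if both hold."""
    mean_delta = sum(d["bge-m3"] - d["qwen3"] for d in scores.values()) / len(scores)
    cosqa_delta = scores["CoSQA"]["bge-m3"] - scores["CoSQA"]["qwen3"]
    exceeds_mean = mean_delta > 0.02    # condition 1: > +2 pp on mean NDCG@10
    holds_cosqa = cosqa_delta >= -0.02  # condition 2: not > 2 pp behind on CoSQA
    return exceeds_mean and holds_cosqa


winner = "BGE-M3" if tiebreaker_selects_bge_m3(NDCG_AT_10) else "Qwen3-Embedding-0.6B"
```

With the reported values, both conditions fail and the default candidate is selected.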
### Step 2 results — EvaluateRAG on AVAP corpus

At this moment we are not in possession of the golden dataset, so Step 2 cannot proceed.

_Pending. Results will be documented here upon completion of the EvaluateRAG evaluation for both models._

### Preliminary assessment

The BEIR benchmarks — the secondary decision signal — favour Qwen3-Embedding-0.6B across both the most representative dataset (CoSQA, −10.31 pp) and the out-of-domain control (SciFact, −3.54 pp), with CodeXGLUE effectively tied. BGE-M3's theoretical advantage from multilingual contrastive training does not translate to superior performance on these English-only benchmarks.

The EvaluateRAG evaluation — the primary decision signal — remains pending. It is the only evaluation that directly measures retrieval quality on the actual AVAP corpus with its intra-chunk multilingual mixing. BGE-M3's architectural fit for multilingual content could still produce a measurable advantage on the production corpus that the English-only BEIR benchmarks cannot capture. No final model selection will be made until EvaluateRAG results are available for both candidates.

---
## Consequences

- **Index rebuild required** regardless of which model is adopted. Vectors from Qwen2.5-1.5B are incompatible with either candidate. The existing index is deleted before re-ingestion.
- **Two index rebuilds required for the evaluation.** One per candidate for the EvaluateRAG step. Given the current corpus size (190 chunks, 11,498 tokens), rebuild time is not a meaningful constraint.
- **Tokenizer alignment for BGE-M3.** If BGE-M3 is selected, both `OLLAMA_EMB_MODEL_NAME` and `HF_EMB_MODEL_NAME` are updated. Updating only `OLLAMA_EMB_MODEL_NAME` causes the chunker to estimate token counts using the wrong vocabulary — a silent bug that produces inconsistent chunk sizes without raising any error.
- **Future model changes.** Any future replacement of the embedding model follows the same evaluation protocol — BEIR benchmarks on the same three datasets plus EvaluateRAG — before an ADR update is accepted. Results are documented in `research/embeddings/`.
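The tokenizer-alignment consequence can be guarded with a fail-fast startup check. A minimal sketch, assuming both variables are read from the environment; the BGE-M3 pair comes from this ADR, while the Qwen3 tag in the mapping is an assumption:

```python
import os

# Known-consistent pairs of Ollama model tags and HF tokenizer repos.
# The bge-m3 entry is from this ADR; the qwen3 entry is an illustrative assumption.
ALIGNED_PAIRS = {
    "bge-m3": "BAAI/bge-m3",
    "qwen3-embedding-0.6b": "Qwen/Qwen3-Embedding-0.6B",
}


def check_tokenizer_alignment(ollama_name: str, hf_name: str) -> None:
    """Fail fast if the chunker tokenizer does not match the embedding model."""
    expected = ALIGNED_PAIRS.get(ollama_name.lower())
    if expected is None:
        raise ValueError(f"Unknown embedding model: {ollama_name!r}")
    if hf_name != expected:
        raise ValueError(
            f"HF_EMB_MODEL_NAME={hf_name!r} does not match "
            f"OLLAMA_EMB_MODEL_NAME={ollama_name!r}; expected {expected!r}"
        )


check_tokenizer_alignment(
    os.environ.get("OLLAMA_EMB_MODEL_NAME", "bge-m3"),
    os.environ.get("HF_EMB_MODEL_NAME", "BAAI/bge-m3"),
)
```

Running this at ingestion start turns the silent chunk-size bug into an immediate, explicit error.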
@@ -0,0 +1,78 @@
# ADR-0006: Code Indexing Improvements — Comparative Evaluation of Code Chunking Strategies

**Date:** 2026-03-24
**Status:** Proposed
**Deciders:** Rafael Ruiz (CTO), MrHouston Engineering

---

## Context

Efficient code indexing is a critical component for enabling high-quality code search, retrieval-augmented generation (RAG), and semantic understanding in developer tooling. The main challenge lies in representing source code in a way that preserves its syntactic and semantic structure while remaining suitable for embedding-based retrieval systems.

In this context, we explored different strategies to improve the indexing of .avap code files, starting from a naïve approach and progressively moving toward more structured representations based on parsing techniques.

### Alternatives

- File-level chunking (baseline):

  Each .avap file is treated as a single chunk and indexed directly. This approach is simple and fast but ignores internal structure (functions, classes, blocks).

- EBNF chunking as metadata:

  Each .avap file is still treated as a single chunk and indexed directly. However, using the AVAP EBNF syntax, we extract the AST structure and inject it into the chunk metadata.

- Full EBNF chunking:

  Each .avap file is still treated as a single chunk and indexed directly. The difference from the previous two approaches is that the AST is indexed instead of the raw code.

- Grammar definition chunking:

  Code is segmented using a language-specific configuration (`avap_config.json`) instead of one-file chunks. The chunker applies a lexer (comments/strings), identifies multi-line blocks (`function`, `if`, `startLoop`, `try`), classifies single-line statements (`registerEndpoint`, `orm_command`, `http_command`, etc.), and enriches every chunk with semantic tags (`uses_orm`, `uses_http`, `uses_async`, `returns_result`, among others).

  This strategy also extracts function signatures as dedicated lightweight chunks and propagates local context between nearby chunks (semantic overlap), improving retrieval precision for both API-level and implementation-level queries.
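The semantic-tag enrichment step can be sketched as keyword matching over chunk text. The rules below are illustrative, not the production `avap_config.json`:

```python
# Illustrative keyword-to-tag mapping; the production rules live in avap_config.json.
TAG_RULES = {
    "uses_orm": ("ormDirect", "ormAccessSelect", "ormAccessInsert", "ormAccessUpdate"),
    "uses_http": ("RequestGet", "RequestPost"),
    "uses_async": ("go ", "gather("),
    "returns_result": ("addResult(",),
}


def enrich_chunk(text: str) -> dict:
    """Attach boolean semantic tags to a code chunk based on the commands it contains."""
    tags = {
        tag: any(marker in text for marker in markers)
        for tag, markers in TAG_RULES.items()
    }
    return {"text": text, **tags}


chunk = enrich_chunk('try()\n    ormDirect("UPDATE t SET a=1", res)\n    addResult(res)\nend()')
```

A chunk containing `ormDirect` and `addResult` would be tagged `uses_orm` and `returns_result` but not `uses_http`.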
### Indexed docs

For each strategy, we created a separate Elasticsearch index with its own characteristics. The first three approaches each have 33 chunks (1 chunk per file), whereas the last approach has 89 chunks.

### How can we evaluate each strategy?

**Evaluation Protocol:**

1. **Golden Dataset**
   - Generate a set of natural language queries paired with their ground-truth context (filename).
   - Each query should be answerable by examining one or more code samples.
   - Example: Query="How do you handle errors in AVAP?" → Context="try_catch_request.avap"

2. **Test Each Strategy**
   - For each of the 4 chunking strategies, run the same set of queries against the respective Elasticsearch index.
   - Record the top-10 retrieved chunks for each query.

3. **Metrics**
   - `NDCG@10`: Normalized discounted cumulative gain at rank 10 (measures ranking quality).
   - `Recall@10`: Fraction of relevant chunks retrieved in top 10.
   - `MRR@10`: Mean reciprocal rank (position of first relevant result).

4. **Relevance Judgment**
   - A chunk is considered relevant if it contains code directly answering the query.
   - For file-level strategies: the entire file is relevant or irrelevant.
   - For grammar-definition: specific block/statement chunks are relevant even if the full file is not.

5. **Acceptance Criteria**
   - **Grammar definition must achieve at least a 10% improvement in NDCG@10 over the file-level baseline.**
   - **Recall@10 must not drop by more than 5 absolute percentage points vs file-level.**
   - **Index size increase must remain below 50% of baseline.**
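The three metrics reduce to a few lines for binary relevance. A minimal sketch, using a hypothetical ranking where the relevant file from the golden-dataset example sits at rank 2:

```python
import math


def ndcg_at_10(ranked: list[str], relevant: set[str]) -> float:
    """Binary-relevance NDCG@10: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(i + 2) for i, doc in enumerate(ranked[:10]) if doc in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), 10)))
    return dcg / ideal if ideal else 0.0


def recall_at_10(ranked: list[str], relevant: set[str]) -> float:
    """Fraction of relevant chunks retrieved in the top 10."""
    return len(set(ranked[:10]) & relevant) / len(relevant)


def mrr_at_10(ranked: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant result within the top 10."""
    for i, doc in enumerate(ranked[:10]):
        if doc in relevant:
            return 1.0 / (i + 1)
    return 0.0


# Hypothetical retrieval result: the relevant file appears at rank 2.
ranked = ["addvar_demo.avap", "try_catch_request.avap", "loop_demo.avap"]
relevant = {"try_catch_request.avap"}
```

With this ranking, MRR@10 is 0.5 and Recall@10 is 1.0; NDCG@10 is 1/log2(3).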
## Decision

## Rationale

## Consequences
@@ -1,3 +1,3 @@
addParam("emails", emails)
getQueryParamList("lista_correos", lista_correos)
addResult(lista_correos)

@@ -1,4 +1,4 @@
addParam("sal_par", saldo)
if(saldo, 0, ">")
permitir = True
else()

@@ -1,5 +1,5 @@
addParam("userrype", user_type)
addParam("sells", compras)
if(None, None, " user_type == 'VIP' or compras > 100")
addVar(descuento, 0.20)
end()

@@ -1,4 +1,4 @@
addParam("password", pass_nueva)
pass_antigua = "password"
if(pass_nueva, pass_antigua, "!=")
addVar(cambio, "Contraseña actualizada")

@@ -1,5 +1,7 @@
try()
ormDirect("UPDATE table_inexistente SET a=1", res)
exception(e)
addVar(_status, 500)
addVar(error_msg, "Error de base de datos")
addResult(error_msg)
end()

@@ -1,5 +1,6 @@
try()
RequestGet("https://api.test.com/data", 0, 0, respuesta, None)
exception(e)
addVar(error_trace, e)
addResult(error_trace)
end()

@@ -1,5 +1,8 @@
addParam("rol", r)
acceso = False

if(None, None, "r == 'admin' or r == 'editor' or r == 'root'")
acceso = True
end()

addResult(acceso)
@@ -0,0 +1,228 @@
start: program

program: separator* line_or_comment (separator+ line_or_comment)* separator*

?line_or_comment: simple_stmt comment?
    | compound_stmt
    | comment
    | BLOCK_COMMENT

?separator: EOL+

comment: DOC_COMMENT | LINE_COMMENT

EOL: /\r?\n/

DOC_COMMENT.2: /\/\/\/[^\r\n]*/
LINE_COMMENT.1: /\/\/[^\r\n]*/
BLOCK_COMMENT: /\/\*[\s\S]*?\*\//

?simple_stmt: assignment
    | return_stmt
    | system_command
    | io_command
    | async_command
    | connector_cmd
    | db_command
    | http_command
    | util_command
    | modularity_cmd
    | call_stmt

?compound_stmt: function_decl
    | if_stmt
    | loop_stmt
    | try_stmt

assignment: identifier "=" expression

call_stmt: identifier "(" argument_list? ")"
    | identifier "=" identifier "." identifier "(" argument_list? ")"
    | identifier "." identifier "(" argument_list? ")"

system_command: register_cmd
    | addvar_cmd

register_cmd: "registerEndpoint" "(" stringliteral "," stringliteral "," list_display "," stringliteral "," identifier "," identifier ")"

addvar_cmd: "addVar" "(" addvar_arg "," addvar_arg ")"

addvar_arg: identifier
    | literal
    | "$" identifier

identifier: IDENTIFIER

system_variable: "_status"

io_command: addparam_cmd
    | getlistlen_cmd
    | addresult_cmd
    | getparamlist_cmd

addparam_cmd: "addParam" "(" stringliteral "," identifier ")"
getlistlen_cmd: "getListLen" "(" identifier "," identifier ")"
getparamlist_cmd: "getQueryParamList" "(" stringliteral "," identifier ")"
addresult_cmd: "addResult" "(" identifier ")"

if_stmt: "if" "(" if_condition ")" separator block ("else" "(" ")" separator block)? "end" "(" ")"

if_condition: if_atom "," if_atom "," stringliteral
    | "None" "," "None" "," stringliteral

if_atom: identifier
    | literal

loop_stmt: "startLoop" "(" identifier "," expression "," expression ")" separator block "endLoop" "(" ")"

try_stmt: "try" "(" ")" separator block "exception" "(" identifier ")" separator block "end" "(" ")"

block: separator* line_or_comment (separator+ line_or_comment)* separator*

async_command: go_stmt
    | gather_stmt

go_stmt: identifier "=" "go" identifier "(" argument_list? ")"
gather_stmt: identifier "=" "gather" "(" identifier ("," expression)? ")"

connector_cmd: connector_instantiation

connector_instantiation: identifier "=" "avapConnector" "(" stringliteral ")"

http_command: req_post_cmd
    | req_get_cmd

req_post_cmd: "RequestPost" "(" expression "," expression "," expression "," expression "," identifier "," expression ")"
req_get_cmd: "RequestGet" "(" expression "," expression "," expression "," identifier "," expression ")"

db_command: orm_direct
    | orm_check
    | orm_create
    | orm_select
    | orm_insert
    | orm_update

orm_direct: "ormDirect" "(" expression "," identifier ")"
orm_check: "ormCheckTable" "(" expression "," identifier ")"
orm_create: "ormCreateTable" "(" expression "," expression "," expression "," identifier ")"

orm_select: "ormAccessSelect" "(" orm_fields "," expression ("," expression)? "," identifier ")"

orm_fields: "*"
    | expression

orm_insert: "ormAccessInsert" "(" expression "," expression "," identifier ")"
orm_update: "ormAccessUpdate" "(" expression "," expression "," expression "," expression "," identifier ")"

util_command: json_list_cmd
    | crypto_cmd
    | regex_cmd
    | datetime_cmd
    | stamp_cmd
    | string_cmd
    | replace_cmd

json_list_cmd: "variableToList" "(" expression "," identifier ")"
    | "itemFromList" "(" identifier "," expression "," identifier ")"
    | "variableFromJSON" "(" identifier "," expression "," identifier ")"
    | "AddVariableToJSON" "(" expression "," expression "," identifier ")"

crypto_cmd: "encodeSHA256" "(" identifier_or_string "," identifier ")"
    | "encodeMD5" "(" identifier_or_string "," identifier ")"

regex_cmd: "getRegex" "(" identifier "," stringliteral "," identifier ")"

datetime_cmd: "getDateTime" "(" stringliteral "," expression "," stringliteral "," identifier ")"

stamp_cmd: "stampToDatetime" "(" expression "," stringliteral "," expression "," identifier ")"
    | "getTimeStamp" "(" stringliteral "," stringliteral "," expression "," identifier ")"

string_cmd: "randomString" "(" expression "," expression "," identifier ")"

replace_cmd: "replace" "(" identifier_or_string "," stringliteral "," stringliteral "," identifier ")"

function_decl: "function" identifier "(" param_list? ")" "{" separator block "}"

param_list: identifier ("," identifier)*

return_stmt: "return" "(" expression? ")"

modularity_cmd: include_stmt
    | import_stmt

include_stmt: "include" stringliteral
import_stmt: "import" ("<" identifier ">" | stringliteral)

?expression: logical_or

?logical_or: logical_and ("or" logical_and)*
?logical_and: logical_not ("and" logical_not)*

?logical_not: "not" logical_not
    | comparison

?comparison: arithmetic (comp_op arithmetic)*

comp_op: "==" | "!=" | "<" | ">" | "<=" | ">=" | "in" | "is"

?arithmetic: term (("+" | "-") term)*
?term: factor (("*" | "/" | "%") factor)*

?factor: ("+" | "-") factor
    | power

?power: primary ("**" factor)?

?primary: atom postfix*

postfix: "." identifier
    | "[" expression "]"
    | "[" expression? ":" expression? (":" expression?)? "]"
    | "(" argument_list? ")"

?atom: identifier
    | "$" identifier
    | literal
    | "(" expression ")"
    | list_display
    | dict_display

list_display: "[" argument_list? "]"
    | "[" expression "for" identifier "in" expression if_clause? "]"

if_clause: "if" expression

dict_display: "{" key_datum_list? "}"

key_datum_list: key_datum ("," key_datum)*
key_datum: expression ":" expression

argument_list: expression ("," expression)*

number: FLOATNUMBER
    | INTEGER

literal: stringliteral
    | number
    | boolean
    | "None"

boolean: "True" | "False"

INTEGER: /[0-9]+/
FLOATNUMBER: /(?:[0-9]+\.[0-9]*|\.[0-9]+)/

stringliteral: STRING_DOUBLE
    | STRING_SINGLE

# STRING_DOUBLE: /"([^"\\]|\\["'\\ntr0])*"/
# STRING_SINGLE: /'([^'\\]|\\["'\\ntr0])*'/
STRING_DOUBLE: /"([^"\\]|\\.)*"/
STRING_SINGLE: /'([^'\\]|\\.)*'/

identifier_or_string: identifier
    | stringliteral

IDENTIFIER: /[A-Za-z_][A-Za-z0-9_]*/

%ignore /[ \t]+/
@ -0,0 +1,371 @@
|
|||
import json
|
||||
from copy import deepcopy
|
||||
from dataclasses import replace
|
||||
from pathlib import Path
|
||||
from typing import Any, Union
|
||||
|
||||
from lark import Lark, Tree
|
||||
from chonkie import (
|
||||
Chunk,
|
||||
ElasticHandshake,
|
||||
FileFetcher,
|
||||
MarkdownChef,
|
||||
TextChef,
|
||||
TokenChunker,
|
||||
MarkdownDocument
|
||||
)
|
||||
from elasticsearch import Elasticsearch
|
||||
from loguru import logger
|
||||
from transformers import AutoTokenizer
|
||||
|
||||
from scripts.pipelines.tasks.embeddings import OllamaEmbeddings
|
||||
from src.config import settings
|
||||
|
||||
COMMAND_METADATA_NAMES = {
|
||||
# system
|
||||
"register_cmd": "registerEndpoint",
|
||||
"addvar_cmd": "addVar",
|
||||
"addparam_cmd": "addParam",
|
||||
"getlistlen_cmd": "getListLen",
|
||||
"getparamlist_cmd": "getQueryParamList",
|
||||
"addresult_cmd": "addResult",
|
||||
|
||||
# async
|
||||
"go_stmt": "go",
|
||||
"gather_stmt": "gather",
|
||||
|
||||
# connector
|
||||
"connector_instantiation": "avapConnector",
|
||||
|
||||
# http
|
||||
"req_post_cmd": "RequestPost",
|
||||
"req_get_cmd": "RequestGet",
|
||||
|
||||
# db
|
||||
"orm_direct": "ormDirect",
|
||||
"orm_check": "ormCheckTable",
|
||||
"orm_create": "ormCreateTable",
|
||||
"orm_select": "ormAccessSelect",
|
||||
"orm_insert": "ormAccessInsert",
|
||||
"orm_update": "ormAccessUpdate",
|
||||
|
||||
# util
|
||||
"json_list_cmd": "json_list_ops",
|
||||
"crypto_cmd": "crypto_ops",
|
||||
"regex_cmd": "getRegex",
|
||||
"datetime_cmd": "getDateTime",
|
||||
"stamp_cmd": "timestamp_ops",
|
||||
"string_cmd": "randomString",
|
||||
"replace_cmd": "replace",
|
||||
|
||||
# modularity
|
||||
"include_stmt": "include",
|
||||
"import_stmt": "import",
|
||||
|
||||
# generic statements
|
||||
"assignment": "assignment",
|
||||
"call_stmt": "call",
|
||||
"return_stmt": "return",
|
||||
"if_stmt": "if",
|
||||
"loop_stmt": "startLoop",
|
||||
"try_stmt": "try",
|
||||
"function_decl": "function",
|
||||
}
|
||||
|
||||
|
||||
def _extract_command_metadata(ast: Tree | None) -> dict[str, bool]:
|
||||
if ast is None:
|
||||
return {}
|
||||
|
||||
used_commands: set[str] = set()
|
||||
|
||||
for subtree in ast.iter_subtrees():
|
||||
if subtree.data in COMMAND_METADATA_NAMES:
|
||||
used_commands.add(COMMAND_METADATA_NAMES[subtree.data])
|
||||
|
||||
return {command_name: True for command_name in sorted(used_commands)}
|
||||
|
||||
|
||||
def _get_text(element) -> str:
|
||||
for attr in ("text", "content", "markdown"):
|
||||
value = getattr(element, attr, None)
|
||||
if isinstance(value, str):
|
||||
return value
|
||||
raise AttributeError(
|
||||
f"Could not extract text from element of type {type(element).__name__}"
|
||||
)
|
||||
|
||||
|
||||
def _merge_markdown_document(processed_doc: MarkdownDocument) -> MarkdownDocument:
    """Fuse code blocks and tables into the chunk that precedes them, so each merged chunk carries its full content."""
    elements = []

    for chunk in processed_doc.chunks:
        elements.append(("chunk", chunk.start_index, chunk.end_index, chunk))

    for code in processed_doc.code:
        elements.append(("code", code.start_index, code.end_index, code))

    for table in processed_doc.tables:
        elements.append(("table", table.start_index, table.end_index, table))

    elements.sort(key=lambda item: (item[1], item[2]))

    merged_chunks = []
    current_chunk = None
    current_parts = []
    current_end_index = None
    current_token_count = None

    def flush():
        nonlocal current_chunk, current_parts, current_end_index, current_token_count

        if current_chunk is None:
            return

        merged_text = "\n\n".join(part for part in current_parts if part)

        merged_chunks.append(
            replace(
                current_chunk,
                text=merged_text,
                end_index=current_end_index,
                token_count=current_token_count,
            )
        )

        current_chunk = None
        current_parts = []
        current_end_index = None
        current_token_count = None

    for kind, _, _, element in elements:
        if kind == "chunk":
            flush()
            current_chunk = element
            current_parts = [_get_text(element)]
            current_end_index = element.end_index
            current_token_count = element.token_count
            continue

        if current_chunk is None:
            continue

        current_parts.append(_get_text(element))
        current_end_index = max(current_end_index, element.end_index)
        current_token_count += getattr(element, "token_count", 0)

    flush()

    fused_processed_doc = deepcopy(processed_doc)
    fused_processed_doc.chunks = merged_chunks
    fused_processed_doc.code = processed_doc.code
    fused_processed_doc.tables = processed_doc.tables

    return fused_processed_doc


class ElasticHandshakeWithMetadata(ElasticHandshake):
    """Extended ElasticHandshake that preserves chunk metadata in Elasticsearch."""

    def _create_bulk_actions(self, chunks: list[dict]) -> list[dict[str, Any]]:
        """Generate bulk actions including metadata."""
        actions = []
        embeddings = self.embedding_model.embed_batch([chunk["chunk"].text for chunk in chunks])

        for i, chunk in enumerate(chunks):
            source = {
                "text": chunk["chunk"].text,
                "embedding": embeddings[i],
                "start_index": chunk["chunk"].start_index,
                "end_index": chunk["chunk"].end_index,
                "token_count": chunk["chunk"].token_count,
            }

            # Include metadata if it exists
            if chunk.get("extra_metadata"):
                source.update(chunk["extra_metadata"])

            actions.append({
                "_index": self.index_name,
                "_id": self._generate_id(i, chunk["chunk"]),
                "_source": source,
            })

        return actions

    def write(self, chunks: Union[dict[str, Any], list[dict[str, Any]]]) -> list[dict[str, Any]]:
        """Write the chunks to the Elasticsearch index using the bulk API."""
        if isinstance(chunks, dict):  # a single chunk dict is wrapped into a list
            chunks = [chunks]

        actions = self._create_bulk_actions(chunks)

        # Use the bulk helper to efficiently write the documents
        from elasticsearch.helpers import bulk

        success, errors = bulk(self.client, actions, raise_on_error=False)

        if errors:
            logger.warning(f"Encountered {len(errors)} errors during bulk indexing.")  # type: ignore
            # Log the first few errors for debugging
            for i, error in enumerate(errors[:5]):  # type: ignore
                logger.error(f"Error {i + 1}: {error}")

        logger.info(f"Chonkie wrote {success} chunks to Elasticsearch index: {self.index_name}")

        return actions


def fetch_documents(docs_folder_path: str, docs_extension: list[str]) -> list[Path]:
    """
    Fetch files from a folder that match the specified extensions.

    Args:
        docs_folder_path (str): Path to the folder containing documents
        docs_extension (list[str]): List of file extensions to filter by (e.g., [".md", ".avap"])

    Returns:
        List of Paths to the fetched documents
    """
    fetcher = FileFetcher()
    docs_path = fetcher.fetch(dir=f"{settings.proj_root}/{docs_folder_path}", ext=docs_extension)
    return docs_path


def process_documents(docs_path: list[Path]) -> list[dict[str, Any]]:
    """
    Process documents by applying appropriate chefs and chunking strategies based on file type.

    Args:
        docs_path: List of Paths to the documents to be processed.

    Returns:
        List of dicts with "chunk" (Chunk object) and "extra_metadata" (dict with file info).
    """
    processed_docs = []
    custom_tokenizer = AutoTokenizer.from_pretrained(settings.hf_emb_model_name)

    chef_md = MarkdownChef(tokenizer=custom_tokenizer)
    chef_txt = TextChef()
    chunker = TokenChunker(tokenizer=custom_tokenizer)

    with open(settings.proj_root / "research/code_indexing/BNF/avap.lark", encoding="utf-8") as grammar:
        lark_parser = Lark(
            grammar.read(),
            parser="lalr",
            propagate_positions=True,
            start="program",
        )

    for doc_path in docs_path:
        doc_extension = doc_path.suffix.lower()

        if doc_extension == ".md":
            processed_doc = chef_md.process(doc_path)
            fused_doc = _merge_markdown_document(processed_doc)
            chunked_doc = fused_doc.chunks
            specific_metadata = {
                "file_type": "avap_docs",
                "filename": doc_path.name,
            }

        elif doc_extension == ".avap":
            processed_doc = chef_txt.process(doc_path)

            try:
                ast = lark_parser.parse(processed_doc.content)
            except Exception as e:
                logger.error(f"Error parsing AVAP code in {doc_path.name}: {e}")
                ast = None

            chunked_doc = chunker.chunk(processed_doc.content)

            specific_metadata = {
                "file_type": "avap_code",
                "filename": doc_path.name,
                **_extract_command_metadata(ast),
            }

        else:
            continue

        for chunk in chunked_doc:
            processed_docs.append(
                {
                    "chunk": chunk,
                    "extra_metadata": {**specific_metadata},
                }
            )

    return processed_docs


def ingest_documents(
    chunked_docs: list[dict[str, Chunk | dict[str, Any]]],
    es_index: str,
    es_request_timeout: int,
    es_max_retries: int,
    es_retry_on_timeout: bool,
    delete_es_index: bool,
) -> list[dict[str, Any]]:
    """
    Ingest processed documents into an Elasticsearch index.

    Args:
        chunked_docs (list[dict[str, Any]]): List of dicts with "chunk" and "extra_metadata" keys
        es_index (str): Name of the Elasticsearch index to ingest into
        es_request_timeout (int): Timeout for Elasticsearch requests in seconds
        es_max_retries (int): Maximum number of retries for Elasticsearch requests
        es_retry_on_timeout (bool): Whether to retry on Elasticsearch request timeouts
        delete_es_index (bool): Whether to delete the existing Elasticsearch index before ingestion

    Returns:
        List of dicts with Elasticsearch response for each chunk
    """
    logger.info(
        f"Instantiating Elasticsearch client with URL: {settings.elasticsearch_local_url}..."
    )
    es = Elasticsearch(
        hosts=settings.elasticsearch_local_url,
        request_timeout=es_request_timeout,
        max_retries=es_max_retries,
        retry_on_timeout=es_retry_on_timeout,
    )

    if delete_es_index and es.indices.exists(index=es_index):
        logger.info(f"Deleting existing Elasticsearch index: {es_index}...")
        es.indices.delete(index=es_index)

    handshake = ElasticHandshakeWithMetadata(
        client=es,
        index_name=es_index,
        embedding_model=OllamaEmbeddings(model=settings.ollama_emb_model_name),
    )

    logger.info(
        f"Ingesting {len(chunked_docs)} chunks into Elasticsearch index: {es_index}..."
    )
    elasticsearch_chunks = handshake.write(chunked_docs)

    return elasticsearch_chunks


def export_documents(elasticsearch_chunks: list[dict[str, Any]], output_path: str) -> None:
    """
    Export processed documents to a JSON file at the specified output path.

    Args:
        elasticsearch_chunks (list[dict[str, Any]]): List of dicts with Elasticsearch response for each chunk
        output_path (str): Path to the file where the JSON will be saved

    Returns:
        None
    """
    output_path = settings.proj_root / output_path

    for chunk in elasticsearch_chunks:
        chunk["_source"]["embedding"] = chunk["_source"]["embedding"].tolist()  # For JSON serialization

    with output_path.open("w", encoding="utf-8") as f:
        json.dump(elasticsearch_chunks, f, ensure_ascii=False, indent=4)

    logger.info(f"Exported processed documents to {output_path}")

File diff suppressed because it is too large
File diff suppressed because it is too large
File diff suppressed because it is too large
@@ -0,0 +1,89 @@
{"chunk_id": "5208d7435c0286ab", "source_file": "docs/samples/hash_SHA256_para_integridad.avap", "doc_type": "code", "block_type": "encodeSHA256", "section": "", "start_line": 1, "end_line": 1, "content": "encodeSHA256(\"payload_data\", checksum)", "metadata": {"uses_crypto": true, "uses_string_ops": true, "complexity": 2}, "token_estimate": 9}
{"chunk_id": "e5e9b70428937778", "source_file": "docs/samples/hash_SHA256_para_integridad.avap", "doc_type": "code", "block_type": "addResult", "section": "", "start_line": 2, "end_line": 2, "content": "encodeSHA256(\"payload_data\", checksum)\naddResult(checksum)", "metadata": {"returns_result": true, "complexity": 1, "has_overlap": true, "overlap_type": "line_tail"}, "token_estimate": 14}
{"chunk_id": "49d6b31967a1db93", "source_file": "docs/samples/hello_world.avap", "doc_type": "code", "block_type": "registerEndpoint", "section": "", "start_line": 1, "end_line": 1, "content": "registerEndpoint(\"/hello_world\",\"GET\",[],\"HELLO_WORLD\",main,result)", "metadata": {"registers_endpoint": true, "complexity": 1}, "token_estimate": 17}
{"chunk_id": "e7ececd11823d42a", "source_file": "docs/samples/hello_world.avap", "doc_type": "code", "block_type": "addVar", "section": "", "start_line": 2, "end_line": 2, "content": "registerEndpoint(\"/hello_world\",\"GET\",[],\"HELLO_WORLD\",main,result)\naddVar(name,\"Alberto\")", "metadata": {"complexity": 0, "has_overlap": true, "overlap_type": "line_tail"}, "token_estimate": 24}
{"chunk_id": "f103d7719754088f", "source_file": "docs/samples/hello_world.avap", "doc_type": "code", "block_type": "assignment", "section": "", "start_line": 3, "end_line": 3, "content": "addVar(name,\"Alberto\")\nresult = \"Hello,\" + name", "metadata": {"complexity": 0, "has_overlap": true, "overlap_type": "line_tail"}, "token_estimate": 14}
{"chunk_id": "4b1ab59c1acb224c", "source_file": "docs/samples/hello_world.avap", "doc_type": "code", "block_type": "addResult", "section": "", "start_line": 4, "end_line": 4, "content": "result = \"Hello,\" + name\naddResult(result)", "metadata": {"returns_result": true, "complexity": 1, "has_overlap": true, "overlap_type": "line_tail"}, "token_estimate": 12}
{"chunk_id": "682adaeeb528f778", "source_file": "docs/samples/hola_mundo.avap", "doc_type": "code", "block_type": "addVar", "section": "", "start_line": 1, "end_line": 1, "content": "addVar(mensaje, \"Hola mundo desde AVAP\")", "metadata": {"complexity": 0}, "token_estimate": 12}
{"chunk_id": "9bb665ca8d7590f7", "source_file": "docs/samples/hola_mundo.avap", "doc_type": "code", "block_type": "addResult", "section": "", "start_line": 2, "end_line": 2, "content": "addVar(mensaje, \"Hola mundo desde AVAP\")\naddResult(mensaje)", "metadata": {"returns_result": true, "complexity": 1, "has_overlap": true, "overlap_type": "line_tail"}, "token_estimate": 17}
{"chunk_id": "ed0136ad03a51e7e", "source_file": "docs/samples/captura_de_listas_multiples.avap", "doc_type": "code", "block_type": "addParam", "section": "", "start_line": 1, "end_line": 1, "content": "addParam(\"emails\", emails)", "metadata": {"uses_auth": true, "complexity": 1}, "token_estimate": 7}
{"chunk_id": "899291ac8959ae3e", "source_file": "docs/samples/captura_de_listas_multiples.avap", "doc_type": "code", "block_type": "getQueryParamList", "section": "", "start_line": 2, "end_line": 2, "content": "addParam(\"emails\", emails)\ngetQueryParamList(\"lista_correos\", lista_correos)", "metadata": {"complexity": 0, "has_overlap": true, "overlap_type": "line_tail"}, "token_estimate": 21}
{"chunk_id": "0eeff974dcd74729", "source_file": "docs/samples/captura_de_listas_multiples.avap", "doc_type": "code", "block_type": "addResult", "section": "", "start_line": 3, "end_line": 3, "content": "getQueryParamList(\"lista_correos\", lista_correos)\naddResult(lista_correos)", "metadata": {"returns_result": true, "complexity": 1, "has_overlap": true, "overlap_type": "line_tail"}, "token_estimate": 21}
{"chunk_id": "b2e95857d059d99d", "source_file": "docs/samples/comparacion_simple.avap", "doc_type": "code", "block_type": "addParam", "section": "", "start_line": 1, "end_line": 1, "content": "addParam(\"lang\", l)", "metadata": {"uses_auth": true, "complexity": 1}, "token_estimate": 7}
{"chunk_id": "db2fab8dfbe7d460", "source_file": "docs/samples/comparacion_simple.avap", "doc_type": "code", "block_type": "if", "section": "", "start_line": 2, "end_line": 4, "content": "addParam(\"lang\", l)\nif(l, \"es\", \"=\")\n addVar(msg, \"Hola\")\nend()", "metadata": {"uses_conditional": true, "complexity": 1, "has_overlap": true, "overlap_type": "line_tail"}, "token_estimate": 25}
{"chunk_id": "2628fa886650658a", "source_file": "docs/samples/comparacion_simple.avap", "doc_type": "code", "block_type": "addResult", "section": "", "start_line": 5, "end_line": 5, "content": "if(l, \"es\", \"=\")\n addVar(msg, \"Hola\")\nend()\naddResult(msg)", "metadata": {"returns_result": true, "complexity": 1, "has_overlap": true, "overlap_type": "line_tail"}, "token_estimate": 22}
{"chunk_id": "89bddd6830b6a8af", "source_file": "docs/samples/concatenacion_dinamica.avap", "doc_type": "code", "block_type": "assignment", "section": "", "start_line": 1, "end_line": 2, "content": "nombre = \"Sistema\"\nlog = \"Evento registrado por: %s\" % nombre", "metadata": {"complexity": 0}, "token_estimate": 18}
{"chunk_id": "6797d36c2eb0e38a", "source_file": "docs/samples/concatenacion_dinamica.avap", "doc_type": "code", "block_type": "addResult", "section": "", "start_line": 3, "end_line": 3, "content": "nombre = \"Sistema\"\nlog = \"Evento registrado por: %s\" % nombre\naddResult(log)", "metadata": {"returns_result": true, "complexity": 1, "has_overlap": true, "overlap_type": "line_tail"}, "token_estimate": 23}
{"chunk_id": "93008a3bed0ea808", "source_file": "docs/samples/if_desigualdad.avap", "doc_type": "code", "block_type": "addParam", "section": "", "start_line": 1, "end_line": 1, "content": "addParam(\"password\",pass_nueva)", "metadata": {"uses_auth": true, "complexity": 1}, "token_estimate": 9}
{"chunk_id": "142b2aef2f05fae7", "source_file": "docs/samples/if_desigualdad.avap", "doc_type": "code", "block_type": "assignment", "section": "", "start_line": 2, "end_line": 2, "content": "addParam(\"password\",pass_nueva)\npass_antigua = \"password\"", "metadata": {"complexity": 0, "has_overlap": true, "overlap_type": "line_tail"}, "token_estimate": 16}
{"chunk_id": "b03b67f3aab35d7a", "source_file": "docs/samples/if_desigualdad.avap", "doc_type": "code", "block_type": "if", "section": "", "start_line": 3, "end_line": 5, "content": "pass_antigua = \"password\"\nif(pass_nueva, pass_antigua, \"!=\")\n addVar(cambio, \"Contraseña actualizada\")\nend()", "metadata": {"uses_conditional": true, "complexity": 1, "has_overlap": true, "overlap_type": "line_tail"}, "token_estimate": 33}
{"chunk_id": "99549cab6c8617d8", "source_file": "docs/samples/if_desigualdad.avap", "doc_type": "code", "block_type": "addResult", "section": "", "start_line": 6, "end_line": 6, "content": "if(pass_nueva, pass_antigua, \"!=\")\n addVar(cambio, \"Contraseña actualizada\")\nend()\naddResult(cambio)", "metadata": {"returns_result": true, "complexity": 1, "has_overlap": true, "overlap_type": "line_tail"}, "token_estimate": 31}
{"chunk_id": "123dfdacd4160b0d", "source_file": "docs/samples/limpieza_de_strings.avap", "doc_type": "code", "block_type": "replace", "section": "", "start_line": 1, "end_line": 1, "content": "replace(\"REF_1234_OLD\",\"OLD\", \"NEW\", ref_actualizada)", "metadata": {"uses_string_ops": true, "complexity": 1}, "token_estimate": 17}
{"chunk_id": "c65655393175720a", "source_file": "docs/samples/limpieza_de_strings.avap", "doc_type": "code", "block_type": "addResult", "section": "", "start_line": 2, "end_line": 2, "content": "replace(\"REF_1234_OLD\",\"OLD\", \"NEW\", ref_actualizada)\naddResult(ref_actualizada)", "metadata": {"returns_result": true, "complexity": 1, "has_overlap": true, "overlap_type": "line_tail"}, "token_estimate": 23}
{"chunk_id": "3edbf12e560e22b1", "source_file": "docs/samples/manejo_error_sql_critico.avap", "doc_type": "code", "block_type": "try", "section": "", "start_line": 1, "end_line": 7, "content": "try()\n ormDirect(\"UPDATE table_inexistente SET a=1\", res)\nexception(e)\n addVar(_status, 500)\n addVar(error_msg, \"Error de base de datos\")\n addResult(error_msg)\nend()", "metadata": {"uses_orm": true, "uses_auth": true, "uses_error_handling": true, "uses_exception": true, "returns_result": true, "complexity": 5}, "token_estimate": 51}
{"chunk_id": "75bcc1f794c8527f", "source_file": "docs/samples/else_estandar.avap", "doc_type": "code", "block_type": "addParam", "section": "", "start_line": 1, "end_line": 1, "content": "addParam(\"sal_par\",saldo)", "metadata": {"uses_auth": true, "complexity": 1}, "token_estimate": 8}
{"chunk_id": "99462a4539651e84", "source_file": "docs/samples/else_estandar.avap", "doc_type": "code", "block_type": "if", "section": "", "start_line": 2, "end_line": 6, "content": "addParam(\"sal_par\",saldo)\nif(saldo, 0, \">\")\n permitir = True\nelse()\n permitir = False\nend()", "metadata": {"uses_conditional": true, "complexity": 1, "has_overlap": true, "overlap_type": "line_tail"}, "token_estimate": 33}
{"chunk_id": "c9134748119a6401", "source_file": "docs/samples/else_estandar.avap", "doc_type": "code", "block_type": "addResult", "section": "", "start_line": 7, "end_line": 7, "content": "else()\n permitir = False\nend()\naddResult(permitir)", "metadata": {"returns_result": true, "complexity": 1, "has_overlap": true, "overlap_type": "line_tail"}, "token_estimate": 16}
{"chunk_id": "da88ce6ec35e309a", "source_file": "docs/samples/expresion_compleja.avap", "doc_type": "code", "block_type": "addParam", "section": "", "start_line": 1, "end_line": 2, "content": "addParam(\"userrype\", user_type)\naddParam(\"sells\", compras)", "metadata": {"uses_auth": true, "complexity": 1}, "token_estimate": 19}
{"chunk_id": "ef826cb80ab05a8c", "source_file": "docs/samples/expresion_compleja.avap", "doc_type": "code", "block_type": "if", "section": "", "start_line": 3, "end_line": 5, "content": "addParam(\"userrype\", user_type)\naddParam(\"sells\", compras)\nif(None, None, \" user_type == 'VIP' or compras > 100\")\n addVar(descuento, 0.20)\nend()", "metadata": {"uses_conditional": true, "complexity": 1, "has_overlap": true, "overlap_type": "line_tail"}, "token_estimate": 51}
{"chunk_id": "117c5396b3e2f3bd", "source_file": "docs/samples/expresion_compleja.avap", "doc_type": "code", "block_type": "addResult", "section": "", "start_line": 6, "end_line": 6, "content": "if(None, None, \" user_type == 'VIP' or compras > 100\")\n addVar(descuento, 0.20)\nend()\naddResult(descuento)", "metadata": {"returns_result": true, "complexity": 1, "has_overlap": true, "overlap_type": "line_tail"}, "token_estimate": 37}
{"chunk_id": "559f8f61eda7ff75", "source_file": "docs/samples/fecha_para_base_de_datos.avap", "doc_type": "code", "block_type": "getDateTime", "section": "", "start_line": 1, "end_line": 1, "content": "getDateTime(\"%Y-%m-%d %H:%M:%S\", 0, \"Europe/Madrid\", sql_date)", "metadata": {"uses_datetime": true, "complexity": 1}, "token_estimate": 27}
{"chunk_id": "b40f10f126c22c01", "source_file": "docs/samples/fecha_para_base_de_datos.avap", "doc_type": "code", "block_type": "addResult", "section": "", "start_line": 2, "end_line": 2, "content": "getDateTime(\"%Y-%m-%d %H:%M:%S\", 0, \"Europe/Madrid\", sql_date)\naddResult(sql_date)", "metadata": {"returns_result": true, "complexity": 1, "has_overlap": true, "overlap_type": "line_tail"}, "token_estimate": 32}
{"chunk_id": "717f75fe4eb08ecf", "source_file": "docs/samples/bucle_longitud_de_datos.avap", "doc_type": "code", "block_type": "assignment", "section": "", "start_line": 1, "end_line": 1, "content": "registros = ['1','2','3']", "metadata": {"complexity": 0}, "token_estimate": 10}
{"chunk_id": "8a695ac320884362", "source_file": "docs/samples/bucle_longitud_de_datos.avap", "doc_type": "code", "block_type": "getListLen", "section": "", "start_line": 2, "end_line": 2, "content": "registros = ['1','2','3']\ngetListLen(registros, total)", "metadata": {"uses_list": true, "complexity": 1, "has_overlap": true, "overlap_type": "line_tail"}, "token_estimate": 17}
{"chunk_id": "9530c2cad477b991", "source_file": "docs/samples/bucle_longitud_de_datos.avap", "doc_type": "code", "block_type": "assignment", "section": "", "start_line": 3, "end_line": 3, "content": "getListLen(registros, total)\ncontador = 0", "metadata": {"complexity": 0, "has_overlap": true, "overlap_type": "line_tail"}, "token_estimate": 11}
{"chunk_id": "c4acc74c9b001703", "source_file": "docs/samples/bucle_longitud_de_datos.avap", "doc_type": "code", "block_type": "startLoop", "section": "", "start_line": 4, "end_line": 6, "content": "contador = 0\nstartLoop(idx, 0, 2)\n actual = registros[int(idx)]\nendLoop()", "metadata": {"uses_loop": true, "complexity": 1, "has_overlap": true, "overlap_type": "line_tail"}, "token_estimate": 25}
{"chunk_id": "80e935fcd6c7a232", "source_file": "docs/samples/bucle_longitud_de_datos.avap", "doc_type": "code", "block_type": "addResult", "section": "", "start_line": 7, "end_line": 7, "content": "startLoop(idx, 0, 2)\n actual = registros[int(idx)]\nendLoop()\naddResult(actual)", "metadata": {"returns_result": true, "complexity": 1, "has_overlap": true, "overlap_type": "line_tail"}, "token_estimate": 24}
{"chunk_id": "576b1bc85805eef0", "source_file": "docs/samples/calculo_de_expiracion.avap", "doc_type": "code", "block_type": "getDateTime", "section": "", "start_line": 1, "end_line": 1, "content": "getDateTime(\"\", 86400, \"UTC\", expira)", "metadata": {"uses_datetime": true, "complexity": 1}, "token_estimate": 13}
{"chunk_id": "686f254e071d6280", "source_file": "docs/samples/calculo_de_expiracion.avap", "doc_type": "code", "block_type": "addResult", "section": "", "start_line": 2, "end_line": 2, "content": "getDateTime(\"\", 86400, \"UTC\", expira)\naddResult(expira)", "metadata": {"returns_result": true, "complexity": 1, "has_overlap": true, "overlap_type": "line_tail"}, "token_estimate": 18}
{"chunk_id": "79fd8fee120921e7", "source_file": "docs/samples/captura_de_id.avap", "doc_type": "code", "block_type": "addParam", "section": "", "start_line": 1, "end_line": 1, "content": "addParam(\"client_id\", id_interno)", "metadata": {"uses_auth": true, "complexity": 1}, "token_estimate": 10}
{"chunk_id": "03697091447c57d4", "source_file": "docs/samples/captura_de_id.avap", "doc_type": "code", "block_type": "addResult", "section": "", "start_line": 2, "end_line": 2, "content": "addParam(\"client_id\", id_interno)\naddResult(id_interno)", "metadata": {"returns_result": true, "complexity": 1, "has_overlap": true, "overlap_type": "line_tail"}, "token_estimate": 16}
{"chunk_id": "2c64510b9ac6042b", "source_file": "docs/samples/try_catch_request.avap", "doc_type": "code", "block_type": "try", "section": "", "start_line": 1, "end_line": 6, "content": "try()\n RequestGet(\"https://api.test.com/data\", 0, 0, respuesta, None)\nexception(e)\n addVar(error_trace, e)\n addResult(error_trace)\nend()", "metadata": {"uses_http": true, "uses_error_handling": true, "uses_exception": true, "returns_result": true, "complexity": 4}, "token_estimate": 42}
{"chunk_id": "4d9f72fb03ba6d2b", "source_file": "docs/samples/validacion_de_nulo.avap", "doc_type": "code", "block_type": "addParam", "section": "", "start_line": 1, "end_line": 1, "content": "addParam(\"api_key\", key)", "metadata": {"uses_auth": true, "complexity": 1}, "token_estimate": 8}
{"chunk_id": "19fa0a3950612c1e", "source_file": "docs/samples/validacion_de_nulo.avap", "doc_type": "code", "block_type": "if", "section": "", "start_line": 2, "end_line": 6, "content": "addParam(\"api_key\", key)\nif(key, None, \"==\")\n addVar(_status, 403)\n addVar(error, \"Acceso denegado: falta API KEY\")\n addResult(error)\nend()", "metadata": {"uses_auth": true, "uses_conditional": true, "returns_result": true, "complexity": 3, "has_overlap": true, "overlap_type": "line_tail"}, "token_estimate": 47}
{"chunk_id": "e06fe329097212dd", "source_file": "docs/samples/validacion_in_pertenece_a_lista.avap", "doc_type": "code", "block_type": "addParam", "section": "", "start_line": 1, "end_line": 1, "content": "addParam(\"rol\", r)", "metadata": {"uses_auth": true, "complexity": 1}, "token_estimate": 7}
{"chunk_id": "285aeb7e911a5075", "source_file": "docs/samples/validacion_in_pertenece_a_lista.avap", "doc_type": "code", "block_type": "assignment", "section": "", "start_line": 2, "end_line": 2, "content": "addParam(\"rol\", r)\nacceso = False", "metadata": {"complexity": 0, "has_overlap": true, "overlap_type": "line_tail"}, "token_estimate": 11}
{"chunk_id": "f8ed75075b7b1b13", "source_file": "docs/samples/validacion_in_pertenece_a_lista.avap", "doc_type": "code", "block_type": "if", "section": "", "start_line": 4, "end_line": 6, "content": "acceso = False\nif(None, None, \"r == 'admin' or r == 'editor' or r == 'root'\")\n acceso = True\nend()", "metadata": {"uses_conditional": true, "complexity": 1, "has_overlap": true, "overlap_type": "line_tail"}, "token_estimate": 35}
{"chunk_id": "b323dedebcbd9036", "source_file": "docs/samples/validacion_in_pertenece_a_lista.avap", "doc_type": "code", "block_type": "addResult", "section": "", "start_line": 8, "end_line": 8, "content": "if(None, None, \"r == 'admin' or r == 'editor' or r == 'root'\")\n acceso = True\nend()\naddResult(acceso)", "metadata": {"returns_result": true, "complexity": 1, "has_overlap": true, "overlap_type": "line_tail"}, "token_estimate": 35}
{"chunk_id": "d02cc7019c314251", "source_file": "docs/samples/construccion_dinamica_de_objeto.avap", "doc_type": "code", "block_type": "assignment", "section": "", "start_line": 1, "end_line": 1, "content": "datos_cliente = \"datos\"", "metadata": {"complexity": 0}, "token_estimate": 6}
{"chunk_id": "c1528242fcd85a68", "source_file": "docs/samples/construccion_dinamica_de_objeto.avap", "doc_type": "code", "block_type": "addVar", "section": "", "start_line": 2, "end_line": 2, "content": "datos_cliente = \"datos\"\naddVar(clave, \"cliente_vip\")", "metadata": {"complexity": 0, "has_overlap": true, "overlap_type": "line_tail"}, "token_estimate": 16}
{"chunk_id": "d335da8caf95ac8d", "source_file": "docs/samples/construccion_dinamica_de_objeto.avap", "doc_type": "code", "block_type": "addVariableToJSON", "section": "", "start_line": 3, "end_line": 3, "content": "addVar(clave, \"cliente_vip\")\nAddvariableToJSON(clave, datos_cliente, mi_json_final)", "metadata": {"uses_json": true, "complexity": 1, "has_overlap": true, "overlap_type": "line_tail"}, "token_estimate": 24}
{"chunk_id": "27067ebe43e3b05d", "source_file": "docs/samples/construccion_dinamica_de_objeto.avap", "doc_type": "code", "block_type": "addResult", "section": "", "start_line": 4, "end_line": 4, "content": "AddvariableToJSON(clave, datos_cliente, mi_json_final)\naddResult(mi_json_final)", "metadata": {"returns_result": true, "complexity": 1, "has_overlap": true, "overlap_type": "line_tail"}, "token_estimate": 20}
{"chunk_id": "a25dfc3b319135d3", "source_file": "docs/samples/contador_de_parametros.avap", "doc_type": "code", "block_type": "addParam", "section": "", "start_line": 1, "end_line": 1, "content": "addParam(\"data_list\", mi_lista)", "metadata": {"uses_auth": true, "complexity": 1}, "token_estimate": 9}
{"chunk_id": "d96fd663666733fe", "source_file": "docs/samples/contador_de_parametros.avap", "doc_type": "code", "block_type": "getListLen", "section": "", "start_line": 2, "end_line": 2, "content": "addParam(\"data_list\", mi_lista)\ngetListLen(mi_lista, cantidad)", "metadata": {"uses_list": true, "complexity": 1, "has_overlap": true, "overlap_type": "line_tail"}, "token_estimate": 16}
{"chunk_id": "9905db6de1ea3067", "source_file": "docs/samples/contador_de_parametros.avap", "doc_type": "code", "block_type": "addResult", "section": "", "start_line": 3, "end_line": 3, "content": "getListLen(mi_lista, cantidad)\naddResult(cantidad)", "metadata": {"returns_result": true, "complexity": 1, "has_overlap": true, "overlap_type": "line_tail"}, "token_estimate": 12}
{"chunk_id": "7c239ad53392d63d", "source_file": "docs/samples/conversion_timestamp_legible.avap", "doc_type": "code", "block_type": "stampToDatetime", "section": "", "start_line": 1, "end_line": 1, "content": "stampToDatetime(1708726162, \"%d/%m/%Y\", 0, fecha_human)", "metadata": {"uses_datetime": true, "complexity": 1}, "token_estimate": 22}
{"chunk_id": "c4dc5d3c081101a5", "source_file": "docs/samples/conversion_timestamp_legible.avap", "doc_type": "code", "block_type": "addResult", "section": "", "start_line": 2, "end_line": 2, "content": "stampToDatetime(1708726162, \"%d/%m/%Y\", 0, fecha_human)\naddResult(fecha_human)", "metadata": {"returns_result": true, "complexity": 1, "has_overlap": true, "overlap_type": "line_tail"}, "token_estimate": 28}
{"chunk_id": "2905488dffcbd7ba", "source_file": "docs/samples/referencia_por_valor.avap", "doc_type": "code", "block_type": "addVar", "section": "", "start_line": 1, "end_line": 2, "content": "addVar(base, 1000)\naddVar(copia, $base)", "metadata": {"complexity": 0}, "token_estimate": 16}
{"chunk_id": "82e05ef62a72de87", "source_file": "docs/samples/referencia_por_valor.avap", "doc_type": "code", "block_type": "addResult", "section": "", "start_line": 3, "end_line": 3, "content": "addVar(base, 1000)\naddVar(copia, $base)\naddResult(copia)", "metadata": {"returns_result": true, "complexity": 1, "has_overlap": true, "overlap_type": "line_tail"}, "token_estimate": 21}
{"chunk_id": "a6727546f328e768", "source_file": "docs/samples/respuesta_multiple.avap", "doc_type": "code", "block_type": "addVar", "section": "", "start_line": 1, "end_line": 2, "content": "addVar(code, 200)\naddVar(status, \"Success\")", "metadata": {"complexity": 0}, "token_estimate": 14}
{"chunk_id": "ce12abd61c278bec", "source_file": "docs/samples/respuesta_multiple.avap", "doc_type": "code", "block_type": "addResult", "section": "", "start_line": 3, "end_line": 4, "content": "addVar(code, 200)\naddVar(status, \"Success\")\naddResult(code)\naddResult(status)", "metadata": {"returns_result": true, "complexity": 1, "has_overlap": true, "overlap_type": "line_tail"}, "token_estimate": 22}
{"chunk_id": "45b0086b13784a7d", "source_file": "docs/samples/salida_bucle_correcta.avap", "doc_type": "code", "block_type": "assignment", "section": "", "start_line": 1, "end_line": 1, "content": "encontrado = False", "metadata": {"complexity": 0}, "token_estimate": 5}
{"chunk_id": "c6df33b0e7eac0ff", "source_file": "docs/samples/salida_bucle_correcta.avap", "doc_type": "code", "block_type": "startLoop", "section": "", "start_line": 2, "end_line": 7, "content": "encontrado = False\nstartLoop(i, 1, 10)\n if(i, 5, \"==\")\n encontrado = True\n i = 11 \n end()\nendLoop()", "metadata": {"uses_loop": true, "uses_conditional": true, "complexity": 2, "has_overlap": true, "overlap_type": "line_tail"}, "token_estimate": 42}
|
||||
{"chunk_id": "02edc488f13b7367", "source_file": "docs/samples/salida_bucle_correcta.avap", "doc_type": "code", "block_type": "addResult", "section": "", "start_line": 8, "end_line": 8, "content": "i = 11 \n end()\nendLoop()\naddResult(encontrado)", "metadata": {"returns_result": true, "complexity": 1, "has_overlap": true, "overlap_type": "line_tail"}, "token_estimate": 17}
|
||||
{"chunk_id": "c8dbbbf6cb64c10d", "source_file": "docs/samples/funcion_de_suma.avap", "doc_type": "code", "block_type": "function", "section": "", "start_line": 1, "end_line": 4, "content": "function suma(a, b){\n total = a + b\n return(total)\n }", "metadata": {"complexity": 0}, "token_estimate": 19}
|
||||
{"chunk_id": "1065800a57207e04", "source_file": "docs/samples/funcion_de_suma.avap", "doc_type": "function_signature", "block_type": "function_signature", "section": "", "start_line": 1, "end_line": 1, "content": "function suma(a, b)", "metadata": {"complexity": 0, "full_block_start": 1, "full_block_end": 4}, "token_estimate": 6}
|
||||
{"chunk_id": "1ef5fa8a4a980012", "source_file": "docs/samples/funcion_de_suma.avap", "doc_type": "code", "block_type": "assignment", "section": "", "start_line": 5, "end_line": 5, "content": "// contexto: function suma(a, b)\nresultado = suma(10, 20)", "metadata": {"complexity": 0, "has_overlap": true, "overlap_type": "function_sig"}, "token_estimate": 18}
|
||||
{"chunk_id": "ff7df988add5bbef", "source_file": "docs/samples/funcion_de_suma.avap", "doc_type": "code", "block_type": "addResult", "section": "", "start_line": 6, "end_line": 6, "content": "// contexto: function suma(a, b)\naddResult(resultado)", "metadata": {"returns_result": true, "complexity": 1, "has_overlap": true, "overlap_type": "function_sig"}, "token_estimate": 13}
|
||||
{"chunk_id": "b8682e4f71d9d7c3", "source_file": "docs/samples/funcion_validacion_acceso.avap", "doc_type": "code", "block_type": "function", "section": "", "start_line": 1, "end_line": 7, "content": "function es_valido(token){\n response = False\n if(token, \"SECRET\", \"=\")\n response = True\n end()\n return(response)\n }", "metadata": {"uses_conditional": true, "complexity": 1}, "token_estimate": 34}
|
||||
{"chunk_id": "a1cfc36abdf661a0", "source_file": "docs/samples/funcion_validacion_acceso.avap", "doc_type": "function_signature", "block_type": "function_signature", "section": "", "start_line": 1, "end_line": 1, "content": "function es_valido(token)", "metadata": {"complexity": 0, "full_block_start": 1, "full_block_end": 7}, "token_estimate": 6}
|
||||
{"chunk_id": "66706bf4b7d3aede", "source_file": "docs/samples/funcion_validacion_acceso.avap", "doc_type": "code", "block_type": "assignment", "section": "", "start_line": 8, "end_line": 8, "content": "// contexto: function es_valido(token)\nautorizado = es_valido(\"SECRET\")", "metadata": {"complexity": 0, "has_overlap": true, "overlap_type": "function_sig"}, "token_estimate": 18}
|
||||
{"chunk_id": "5932e6b75c40b7db", "source_file": "docs/samples/funcion_validacion_acceso.avap", "doc_type": "code", "block_type": "addResult", "section": "", "start_line": 9, "end_line": 9, "content": "// contexto: function es_valido(token)\naddResult(autorizado)", "metadata": {"returns_result": true, "complexity": 1, "has_overlap": true, "overlap_type": "function_sig"}, "token_estimate": 15}
|
||||
{"chunk_id": "4be60a16d7cc7c4d", "source_file": "docs/samples/generador_de_tokens_aleatorios.avap", "doc_type": "code", "block_type": "randomString", "section": "", "start_line": 1, "end_line": 1, "content": "randomString(\"[A-Z]\\d\", 32, token_seguridad)", "metadata": {"uses_string_ops": true, "complexity": 1}, "token_estimate": 15}
|
||||
{"chunk_id": "1810ca839b071a65", "source_file": "docs/samples/generador_de_tokens_aleatorios.avap", "doc_type": "code", "block_type": "addResult", "section": "", "start_line": 2, "end_line": 2, "content": "randomString(\"[A-Z]\\d\", 32, token_seguridad)\naddResult(token_seguridad)", "metadata": {"returns_result": true, "complexity": 1, "has_overlap": true, "overlap_type": "line_tail"}, "token_estimate": 21}
|
||||
{"chunk_id": "ed8b4a4e75a71762", "source_file": "docs/samples/obtencion_timestamp.avap", "doc_type": "code", "block_type": "getDateTime", "section": "", "start_line": 1, "end_line": 1, "content": "getDateTime(\"\", 0, \"UTC\", ahora)", "metadata": {"uses_datetime": true, "complexity": 1}, "token_estimate": 11}
|
||||
{"chunk_id": "05d2d0c8e6266861", "source_file": "docs/samples/obtencion_timestamp.avap", "doc_type": "code", "block_type": "addResult", "section": "", "start_line": 2, "end_line": 2, "content": "getDateTime(\"\", 0, \"UTC\", ahora)\naddResult(ahora)", "metadata": {"returns_result": true, "complexity": 1, "has_overlap": true, "overlap_type": "line_tail"}, "token_estimate": 17}
|
||||
{"chunk_id": "02d7b0e4a1e1f09c", "source_file": "docs/samples/ormAccessCreate.avap", "doc_type": "code", "block_type": "orm_command", "section": "", "start_line": 1, "end_line": 1, "content": "ormCheckTable(tabla_pruebas,resultado_comprobacion)", "metadata": {"uses_orm": true, "complexity": 1}, "token_estimate": 13}
|
||||
{"chunk_id": "6daea421c5a1d565", "source_file": "docs/samples/ormAccessCreate.avap", "doc_type": "code", "block_type": "if", "section": "", "start_line": 2, "end_line": 4, "content": "ormCheckTable(tabla_pruebas,resultado_comprobacion)\nif(resultado_comprobacion,False,'==')\n ormCreateTable(\"username,age\",'VARCHAR,INTEGER',tabla_pruebas,resultado_creacion)\nend()", "metadata": {"uses_orm": true, "uses_conditional": true, "complexity": 2, "has_overlap": true, "overlap_type": "line_tail"}, "token_estimate": 45}
|
||||
{"chunk_id": "47d660e6c1f124d1", "source_file": "docs/samples/ormAccessCreate.avap", "doc_type": "code", "block_type": "addResult", "section": "", "start_line": 5, "end_line": 6, "content": "if(resultado_comprobacion,False,'==')\n ormCreateTable(\"username,age\",'VARCHAR,INTEGER',tabla_pruebas,resultado_creacion)\nend()\naddResult(resultado_comprobacion)\naddResult(resultado_creacion)", "metadata": {"returns_result": true, "complexity": 1, "has_overlap": true, "overlap_type": "line_tail"}, "token_estimate": 45}
|
||||
{"chunk_id": "b15daff2028a2136", "source_file": "docs/samples/paginacion_dinamica_recursos.avap", "doc_type": "code", "block_type": "addParam", "section": "", "start_line": 1, "end_line": 2, "content": "addParam(\"page\", p)\naddParam(\"size\", s)", "metadata": {"uses_auth": true, "complexity": 1}, "token_estimate": 14}
|
||||
{"chunk_id": "8f1fa0e84c981765", "source_file": "docs/samples/paginacion_dinamica_recursos.avap", "doc_type": "code", "block_type": "assignment", "section": "", "start_line": 3, "end_line": 6, "content": "addParam(\"page\", p)\naddParam(\"size\", s)\nregistros = [\"u1\", \"u2\", \"u3\", \"u4\", \"u5\", \"u6\"]\noffset = int(p) * int(s)\nlimite = offset + int(s)\ncontador = 0", "metadata": {"complexity": 0, "has_overlap": true, "overlap_type": "line_tail"}, "token_estimate": 62}
|
||||
{"chunk_id": "e27ce4178666239b", "source_file": "docs/samples/paginacion_dinamica_recursos.avap", "doc_type": "code", "block_type": "addResult", "section": "", "start_line": 7, "end_line": 8, "content": "offset = int(p) * int(s)\nlimite = offset + int(s)\ncontador = 0\naddResult(offset)\naddResult(limite)", "metadata": {"returns_result": true, "complexity": 1, "has_overlap": true, "overlap_type": "line_tail"}, "token_estimate": 32}
|
||||
{"chunk_id": "9a66c0e4c49bbbcb", "source_file": "docs/samples/paginacion_dinamica_recursos.avap", "doc_type": "code", "block_type": "startLoop", "section": "", "start_line": 9, "end_line": 13, "content": "addResult(offset)\naddResult(limite)\nstartLoop(i, 2, limite)\n actual = registros[int(i)]\n titulo = \"reg_%s\" % i\n AddvariableToJSON(titulo, actual, pagina_json)\nendLoop()", "metadata": {"uses_loop": true, "uses_json": true, "complexity": 2, "has_overlap": true, "overlap_type": "line_tail"}, "token_estimate": 53}
|
||||
{"chunk_id": "77c985068f6f9269", "source_file": "docs/samples/paginacion_dinamica_recursos.avap", "doc_type": "code", "block_type": "addResult", "section": "", "start_line": 14, "end_line": 14, "content": "titulo = \"reg_%s\" % i\n AddvariableToJSON(titulo, actual, pagina_json)\nendLoop()\naddResult(pagina_json)", "metadata": {"returns_result": true, "complexity": 1, "has_overlap": true, "overlap_type": "line_tail"}, "token_estimate": 32}
|
||||
{"chunk_id": "aeb4f87681bdc8b4", "source_file": "docs/samples/asignacion_booleana.avap", "doc_type": "code", "block_type": "assignment", "section": "", "start_line": 1, "end_line": 2, "content": "nivel = 5\nes_admin = nivel >= 10", "metadata": {"complexity": 0}, "token_estimate": 12}
|
||||
{"chunk_id": "5f0f938196d5e573", "source_file": "docs/samples/asignacion_booleana.avap", "doc_type": "code", "block_type": "addResult", "section": "", "start_line": 3, "end_line": 3, "content": "nivel = 5\nes_admin = nivel >= 10\naddResult(es_admin)", "metadata": {"returns_result": true, "complexity": 1, "has_overlap": true, "overlap_type": "line_tail"}, "token_estimate": 18}
|
||||
{"chunk_id": "42fb50109876864c", "source_file": "docs/samples/asignacion_matematica.avap", "doc_type": "code", "block_type": "assignment", "section": "", "start_line": 1, "end_line": 3, "content": "subtotal = 150.50\niva = subtotal * 0.21\ntotal = subtotal + iva", "metadata": {"complexity": 0}, "token_estimate": 22}
|
||||
{"chunk_id": "6019c2adc7750c04", "source_file": "docs/samples/asignacion_matematica.avap", "doc_type": "code", "block_type": "addResult", "section": "", "start_line": 4, "end_line": 4, "content": "subtotal = 150.50\niva = subtotal * 0.21\ntotal = subtotal + iva\naddResult(total)", "metadata": {"returns_result": true, "complexity": 1, "has_overlap": true, "overlap_type": "line_tail"}, "token_estimate": 27}
|
||||
{"chunk_id": "e2f6a0de7e7f9dc1", "source_file": "docs/samples/bucle_1_10.avap", "doc_type": "code", "block_type": "startLoop", "section": "", "start_line": 1, "end_line": 4, "content": "startLoop(i,1,10)\n item = \"item_%s\" % i\n AddvariableToJSON(item,'valor_generado',mi_json)\nendLoop()", "metadata": {"uses_loop": true, "uses_json": true, "complexity": 2}, "token_estimate": 36}
|
||||
{"chunk_id": "ce1f2fab7c807537", "source_file": "docs/samples/bucle_1_10.avap", "doc_type": "code", "block_type": "addResult", "section": "", "start_line": 5, "end_line": 5, "content": "item = \"item_%s\" % i\n AddvariableToJSON(item,'valor_generado',mi_json)\nendLoop()\naddResult(mi_json)", "metadata": {"returns_result": true, "complexity": 1, "has_overlap": true, "overlap_type": "line_tail"}, "token_estimate": 32}
|
||||
|
|
@ -0,0 +1,64 @@
|
|||
import typer
|
||||
|
||||
from loguru import logger
|
||||
|
||||
from scripts.pipelines.tasks.chunk import (
|
||||
fetch_documents,
|
||||
process_documents,
|
||||
export_documents,
|
||||
ingest_documents
|
||||
)
|
||||
|
||||
app = typer.Typer()
|
||||
|
||||
|
||||
@app.command()
|
||||
def elasticsearch_ingestion(
|
||||
docs_folder_path: str = "docs/samples",
|
||||
output_path: str = "research/code_indexing/chunks/chunks_EBNF_metadata.json",
|
||||
docs_extension: list[str] = [".avap"],
|
||||
es_index: str = "avap-code-indexing-ebnf-metadata",
|
||||
es_request_timeout: int = 120,
|
||||
es_max_retries: int = 5,
|
||||
es_retry_on_timeout: bool = True,
|
||||
delete_es_index: bool = True
|
||||
) -> None:
|
||||
"""
|
||||
Pipeline to ingest documents into an Elasticsearch index.
|
||||
The pipeline includes fetching documents from a specified folder, processing them into chunks, and then ingesting those chunks into the specified Elasticsearch index.
|
||||
|
||||
Args:
|
||||
docs_folder_path (str): Path to the folder containing documents to be ingested. Default is "docs/samples".
output_path (str): Path where the processed chunks are exported as JSON. Default is "research/code_indexing/chunks/chunks_EBNF_metadata.json".
|
||||
docs_extension (list[str]): List of file extensions to filter by (e.g., [".md", ".avap"]). Default is [".avap"].
|
||||
es_index (str): Name of the Elasticsearch index to ingest documents into. Default is "avap-code-indexing-ebnf-metadata".
|
||||
es_request_timeout (int): Timeout in seconds for Elasticsearch requests. Default is 120 seconds.
|
||||
es_max_retries (int): Maximum number of retries for Elasticsearch requests in case of failure. Default is 5 retries.
|
||||
es_retry_on_timeout (bool): Whether to retry Elasticsearch requests on timeout. Default is True.
|
||||
delete_es_index (bool): Whether to delete the existing Elasticsearch index before ingestion. Default is True.
|
||||
|
||||
Returns:
|
||||
None
|
||||
"""
|
||||
logger.info("Starting Elasticsearch ingestion pipeline...")
|
||||
logger.info(f"Fetching files from {docs_folder_path}...")
|
||||
docs_path = fetch_documents(docs_folder_path, docs_extension)
|
||||
|
||||
logger.info("Processing docs...")
|
||||
chunked_docs = process_documents(docs_path)
|
||||
|
||||
logger.info(f"Ingesting chunks in Elasticsearch index: {es_index}...")
|
||||
elasticsearch_docs = ingest_documents(chunked_docs, es_index, es_request_timeout, es_max_retries,
|
||||
es_retry_on_timeout, delete_es_index)
|
||||
|
||||
logger.info(f"Exporting processed documents to {output_path}...")
|
||||
export_documents(elasticsearch_docs, output_path)
|
||||
|
||||
logger.info(f"Finished ingesting in {es_index}.")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
try:
|
||||
app()
|
||||
except Exception as exc:
|
||||
logger.exception(exc)
|
||||
raise
|
||||
|
|
@ -0,0 +1,198 @@
|
|||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"id": "d520f6c3",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import json\n",
|
||||
"from datasets import load_dataset\n",
|
||||
"\n",
|
||||
"import boto3\n",
|
||||
"from botocore.config import Config\n",
|
||||
"from langchain_core.messages import SystemMessage, HumanMessage\n",
|
||||
"\n",
|
||||
"from src.utils.llm_factory import create_chat_model\n",
|
||||
"from src.config import settings"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "e08b9060",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Create LLM instance"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 2,
|
||||
"id": "81111a86",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"config = Config(\n",
|
||||
" region_name=\"us-east-1\",\n",
|
||||
" connect_timeout=10, \n",
|
||||
" read_timeout=600, \n",
|
||||
")\n",
|
||||
"\n",
|
||||
"client = boto3.client(\"bedrock-runtime\", config=config)\n",
|
||||
"\n",
|
||||
"llm = create_chat_model(\n",
|
||||
" provider=\"bedrock\",\n",
|
||||
" client=client,\n",
|
||||
" model=\"global.anthropic.claude-sonnet-4-6\",\n",
|
||||
" temperature=0,\n",
|
||||
")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "045f8e81",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Load AVAP data"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 3,
|
||||
"id": "07dea3fe",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"with open(settings.proj_root / \"docs/LRM/avap.md\", \"r\") as f:\n",
|
||||
" avap_docs = f.read()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 5,
|
||||
"id": "adbbe8b6",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Loaded 33 AVAP samples\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"samples_dir = settings.proj_root / \"docs/samples\"\n",
|
||||
"avap_samples = []\n",
|
||||
"\n",
|
||||
"for avap_file in sorted(samples_dir.glob(\"*.avap\")):\n",
|
||||
" with open(avap_file, \"r\") as f:\n",
|
||||
" code = f.read()\n",
|
||||
" \n",
|
||||
" avap_samples.append({\n",
|
||||
" \"file\": avap_file.name,\n",
|
||||
" \"code\": code\n",
|
||||
" })\n",
|
||||
"\n",
|
||||
"# Serialize the samples to a JSON string for injection into the prompt\n",
|
||||
"avap_samples_json = json.dumps(avap_samples, indent=2, ensure_ascii=False)\n",
|
||||
"print(f\"Loaded {len(avap_samples)} AVAP samples\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "7a15e09a",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Prompt"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "895a170f",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"GOLDEN_DATASET_PROMPT = SystemMessage(\n",
|
||||
" content=f\"\"\"\n",
|
||||
" You are an AI agent responsible for generating a golden dataset of queries for AVAP code retrieval and understanding.\n",
|
||||
"\n",
|
||||
" You will receive a JSON array of AVAP code samples, each with a 'file' name and 'code' content.\n",
|
||||
"\n",
|
||||
" Your task is to:\n",
|
||||
" 1. Analyze each AVAP code sample.\n",
|
||||
" 2. Generate 2-3 natural language queries that can be answered by examining that specific code.\n",
|
||||
" 3. Output a JSON array where each element has:\n",
|
||||
" - \"query\": A natural language question about AVAP code implementation, best practices, or specific constructs.\n",
|
||||
" - \"context\": The filename of the code sample that provides the context/answer for this query.\n",
|
||||
"\n",
|
||||
" Requirements:\n",
|
||||
" - Queries should be diverse: ask about functions, control flow, API operations, error handling, etc.\n",
|
||||
" - Queries must be answerable using ONLY the provided code samples.\n",
|
||||
" - Queries should be framed as natural developer questions (e.g., \"How do you handle errors in AVAP?\" or \"Show me an example of looping over a list\").\n",
|
||||
" - Use natural English (or Spanish if context is Spanish-language code).\n",
|
||||
" - Do not reference exact variable names unless necessary; focus on the patterns and constructs used.\n",
|
||||
" - Output MUST be valid JSON array format.\n",
|
||||
"\n",
|
||||
" AVAP Code Samples:\n",
|
||||
" {avap_samples_json}\n",
|
||||
"\n",
|
||||
" Output format (JSON array):\n",
|
||||
" [\n",
|
||||
" {{\"query\": \"...\", \"context\": \"filename.avap\"}},\n",
|
||||
" {{\"query\": \"...\", \"context\": \"filename.avap\"}},\n",
|
||||
" ...\n",
|
||||
" ]\n",
|
||||
" \"\"\"\n",
|
||||
")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "a3123199",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "98c4f93c",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "723352ee",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "assistance-engine",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.11.13"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 5
|
||||
}
|
||||
|
|
@ -0,0 +1,462 @@
|
|||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "b15c29f3",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Loaded 30 tasks. 'code' fields cleared.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"{'task_id': 1,\n",
|
||||
" 'text': \"Captura el parámetro 'username' de la petición HTTP y devuélvelo como resultado. Si no existe, la variable será None.\",\n",
|
||||
" 'code': '',\n",
|
||||
" 'test_inputs': {'username': 'alice'},\n",
|
||||
" 'test_list': [\"re.match(r'^alice$', str(username))\"]}"
|
||||
]
|
||||
},
|
||||
"execution_count": 2,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"import json\n",
|
||||
"import copy\n",
|
||||
"\n",
|
||||
"from src.config import settings\n",
|
||||
"\n",
|
||||
"INPUT_PATH = settings.proj_root / \"synthetic_datasets/synthetic_data_generated_bedrock.json\"\n",
|
||||
"OUTPUT_PATH = settings.proj_root / \"synthetic_datasets/multipl_e_synthetic_dataset.json\"\n",
|
||||
"\n",
|
||||
"with open(INPUT_PATH) as f:\n",
|
||||
" dataset = json.load(f)\n",
|
||||
"\n",
|
||||
"# Deep copy with code emptied\n",
|
||||
"tasks = copy.deepcopy(dataset)\n",
|
||||
"for task in tasks:\n",
|
||||
" task[\"code\"] = \"\"\n",
|
||||
"\n",
|
||||
"print(f\"Loaded {len(tasks)} tasks. 'code' fields cleared.\")\n",
|
||||
"tasks[0]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 6,
|
||||
"id": "d469eaa5",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import subprocess\n",
|
||||
"import time\n",
|
||||
"import re\n",
|
||||
"\n",
|
||||
"GRPC_HOST = \"localhost:50052\"\n",
|
||||
"SERVICE = \"brunix.AssistanceEngine/AskAgent\"\n",
|
||||
"SESSION_ID = \"dev-test-123\"\n",
|
||||
"\n",
|
||||
"AVAP_BLOCK_RE = re.compile(r\"```avap\\s*\\n(.*?)```\", re.DOTALL)\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"def ask_agent(query: str) -> str:\n",
|
||||
" \"\"\"Call gRPC AskAgent and extract code from ```avap``` blocks in the response.\"\"\"\n",
|
||||
" payload = json.dumps({\"query\": query, \"session_id\": SESSION_ID})\n",
|
||||
" cmd = [\n",
|
||||
" \"grpcurl\", \"-plaintext\",\n",
|
||||
" \"-d\", payload,\n",
|
||||
" GRPC_HOST,\n",
|
||||
" SERVICE,\n",
|
||||
" ]\n",
|
||||
" result = subprocess.run(cmd, capture_output=True, text=True, timeout=120)\n",
|
||||
" if result.returncode != 0:\n",
|
||||
" raise RuntimeError(f\"grpcurl failed: {result.stderr}\")\n",
|
||||
"\n",
|
||||
" # Collect all text fragments from the streamed responses\n",
|
||||
" raw = result.stdout.strip()\n",
|
||||
" full_text = \"\"\n",
|
||||
" for block in raw.split(\"\\n}\\n\"):\n",
|
||||
" block = block.strip()\n",
|
||||
" if not block:\n",
|
||||
" continue\n",
|
||||
" if not block.endswith(\"}\"):\n",
|
||||
" block += \"}\"\n",
|
||||
" try:\n",
|
||||
" msg = json.loads(block)\n",
|
||||
" full_text += msg.get(\"text\", \"\")\n",
|
||||
" except json.JSONDecodeError:\n",
|
||||
" continue\n",
|
||||
"\n",
|
||||
" # Extract code from ```avap ... ``` blocks\n",
|
||||
" matches = AVAP_BLOCK_RE.findall(full_text)\n",
|
||||
" return \"\\n\".join(m.strip() for m in matches) if matches else \"\""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 7,
|
||||
"id": "9d2dc8c1",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"[1/30] Task 1: Captura el parámetro 'username' de la petición HTTP y devuélvelo como resultado....\n",
|
||||
" -> Got 188 chars of code\n",
|
||||
"[2/30] Task 2: Recibe el parámetro 'email' y establece el código de estado HTTP en 200. Devuelv...\n",
|
||||
" -> Got 79 chars of code\n",
|
||||
"[3/30] Task 3: Recibe el parámetro 'password', genera su hash SHA-256 y devuelve el hash como r...\n",
|
||||
" -> Got 73 chars of code\n",
|
||||
"[4/30] Task 4: Recibe el parámetro 'text', reemplaza todos los espacios por guiones bajos y dev...\n",
|
||||
" -> Got 63 chars of code\n",
|
||||
"[5/30] Task 5: Genera un token aleatorio de 32 caracteres alfanuméricos y devuélvelo como resul...\n",
|
||||
" -> Got 90 chars of code\n",
|
||||
"[6/30] Task 6: Recibe el parámetro 'age'. Si age es mayor que 18, devuelve 'adulto'; de lo cont...\n",
|
||||
" -> Got 131 chars of code\n",
|
||||
"[7/30] Task 7: Recibe el parámetro 'score'. Si score es igual a 100, establece _status en 200 y...\n",
|
||||
" -> Got 134 chars of code\n",
|
||||
"[8/30] Task 8: Crea una lista con el elemento 'item1', obtén su longitud y devuelve la longitud...\n",
|
||||
" -> Got 78 chars of code\n",
|
||||
"[9/30] Task 9: Recibe el parámetro 'items' como lista de query params, obtén su longitud y devu...\n",
|
||||
" -> Got 85 chars of code\n",
|
||||
"[10/30] Task 10: Recibe el parámetro 'data' como JSON, extrae el campo 'name' y devuélvelo como r...\n",
|
||||
" -> Got 66 chars of code\n",
|
||||
"[11/30] Task 11: Crea un objeto JSON vacío, agrega el campo 'status' con valor 'ok' y devuelve el...\n",
|
||||
" -> Got 61 chars of code\n",
|
||||
"[12/30] Task 12: Recibe el parámetro 'password', genera su hash MD5 y devuelve el hash como resul...\n",
|
||||
" -> Got 44 chars of code\n",
|
||||
"[13/30] Task 13: Obtén la fecha y hora actual en formato 'YYYY-MM-DD' en la zona horaria 'UTC' y ...\n",
|
||||
" -> Got 85 chars of code\n",
|
||||
"[14/30] Task 14: Recibe el parámetro 'epoch', conviértelo a string de fecha en formato 'YYYY-MM-D...\n",
|
||||
" -> Got 94 chars of code\n",
|
||||
"[15/30] Task 15: Recibe el parámetro 'date_str' en formato 'YYYY-MM-DD', conviértelo a epoch y de...\n",
|
||||
" -> Got 102 chars of code\n",
|
||||
"[16/30] Task 16: Define una función que recibe un número y devuelve su cuadrado. Llama a la funci...\n",
|
||||
" -> Got 89 chars of code\n",
|
||||
"[17/30] Task 17: Define una función que recibe dos números y devuelve su suma. Llama a la función...\n",
|
||||
" -> Got 89 chars of code\n",
|
||||
"[18/30] Task 18: Usa un bloque try/exception para intentar dividir el parámetro 'num' entre 0. Si...\n",
|
||||
" -> Got 116 chars of code\n",
|
||||
"[19/30] Task 19: Recibe el parámetro 'url', realiza una petición GET a esa URL con timeout de 500...\n",
|
||||
" -> Got 86 chars of code\n",
|
||||
"[20/30] Task 20: Recibe los parámetros 'url' y 'body', realiza una petición POST con timeout de 3...\n",
|
||||
" -> Got 115 chars of code\n",
|
||||
"[21/30] Task 21: Instancia un conector externo con UUID '20908e93260147acb2636967021fbf5d', llama...\n",
|
||||
" -> Got 131 chars of code\n",
|
||||
"[22/30] Task 22: Lanza una función 'fetchData' de forma asíncrona con go, espera el resultado con...\n",
|
||||
" -> Got 81 chars of code\n",
|
||||
"[23/30] Task 23: Recibe el parámetro 'n', itera desde 0 hasta n acumulando la suma y devuelve la ...\n",
|
||||
" -> Got 126 chars of code\n",
|
||||
"[24/30] Task 24: Recibe el parámetro 'value'. Usando if Modo 2, si value es mayor que 0 y menor q...\n",
|
||||
" -> Got 180 chars of code\n",
|
||||
"[25/30] Task 25: Realiza una consulta ORM a la tabla 'users' seleccionando todos los campos sin f...\n",
|
||||
" -> Got 69 chars of code\n",
|
||||
"[26/30] Task 26: Recibe los parámetros 'username' y 'email', inserta un registro en la tabla 'use...\n",
|
||||
" -> Got 74 chars of code\n",
|
||||
"[27/30] Task 27: Recibe el parámetro 'user_id', actualiza el campo 'active' a 1 en la tabla 'user...\n",
|
||||
" -> Got 82 chars of code\n",
|
||||
"[28/30] Task 28: Importa la librería nativa 'math', calcula el cuadrado de 9 usando una función y...\n",
|
||||
" -> Got 48 chars of code\n",
|
||||
"[29/30] Task 29: Recibe el parámetro 'items_json' como JSON con una lista bajo la clave 'items'. ...\n",
|
||||
" -> Got 81 chars of code\n",
|
||||
"[30/30] Task 30: Recibe el parámetro 'token'. Si el token tiene exactamente 32 caracteres (usando...\n",
|
||||
" -> Got 145 chars of code\n",
|
||||
"\n",
|
||||
"Done. 30 succeeded, 0 errors.\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# Process all tasks – call the agent for each one\n",
|
||||
"errors = []\n",
|
||||
"\n",
|
||||
"for i, task in enumerate(tasks):\n",
|
||||
" query = task[\"text\"]\n",
|
||||
" task_id = task[\"task_id\"]\n",
|
||||
" print(f\"[{i + 1}/{len(tasks)}] Task {task_id}: {query[:80]}...\")\n",
|
||||
"\n",
|
||||
" try:\n",
|
||||
" code = ask_agent(query)\n",
|
||||
" task[\"code\"] = code\n",
|
||||
" print(f\" -> Got {len(code)} chars of code\")\n",
|
||||
" except Exception as e:\n",
|
||||
" errors.append({\"task_id\": task_id, \"error\": str(e)})\n",
|
||||
" print(f\" -> ERROR: {e}\")\n",
|
||||
"\n",
|
||||
" time.sleep(0.5) # small delay between requests\n",
|
||||
"\n",
|
||||
"print(f\"\\nDone. {len(tasks) - len(errors)} succeeded, {len(errors)} errors.\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 9,
|
||||
"id": "3ce3ef4a",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Task 1:\n",
|
||||
" text: Captura el parámetro 'username' de la petición HTTP y devuélvelo como resultado.\n",
|
||||
" code: addParam(\"username\", targetUsername) # or: targetUsername = \"username\"\n",
|
||||
"targetUsername = addVar(targetUsername, \"None\")\n",
|
||||
"\n",
|
||||
"Task 2:\n",
|
||||
" text: Recibe el parámetro 'email' y establece el código de estado HTTP en 200. Devuelv\n",
|
||||
" code: addVar(_status, 200) # OK\n",
|
||||
"addParam(\"email\", targetEmail)\n",
|
||||
"addResult(targetEmail)\n",
|
||||
"\n",
|
||||
"Task 3:\n",
|
||||
" text: Recibe el parámetro 'password', genera su hash SHA-256 y devuelve el hash como r\n",
|
||||
" code: hash = generateSHA256(password)\n",
|
||||
"addVar(_status, 200) # OK\n",
|
||||
"addResult(hash)\n",
|
||||
"\n",
|
||||
"Task 4:\n",
|
||||
" text: Recibe el parámetro 'text', reemplaza todos los espacios por guiones bajos y dev\n",
|
||||
" code: replaceSpacesWithDashes(text)\n",
|
||||
"addResult(text.replace(\" \", \"-\"))\n",
|
||||
"\n",
|
||||
"Task 5:\n",
|
||||
" text: Genera un token aleatorio de 32 caracteres alfanuméricos y devuélvelo como resul\n",
|
||||
" code: randomString(\"[0-9a-zA-Z]\", 32, \"dash\")\n",
|
||||
"addResult(randomString(\"[0-9a-zA-Z]\", 32, \"dash\"))\n",
|
||||
"\n",
|
||||
"Task 6:\n",
|
||||
" text: Recibe el parámetro 'age'. Si age es mayor que 18, devuelve 'adulto'; de lo cont\n",
|
||||
" code: if(age > 18):\n",
|
||||
" addVar(_status, 200) # OK\n",
|
||||
"else:\n",
|
||||
" addVar(_status, 403) # Forbidden\n",
|
||||
"addResult(\"adulto\" if age > 18 el\n",
|
||||
"\n",
|
||||
"Task 7:\n",
|
||||
" text: Recibe el parámetro 'score'. Si score es igual a 100, establece _status en 200 y\n",
|
||||
" code: if(score == 100):\n",
|
||||
" addVar(_status, 200) # OK\n",
|
||||
"else:\n",
|
||||
" addVar(_status, 400)\n",
|
||||
"addResult(\"perfecto\" if score == 100 else\n",
|
||||
"\n",
|
||||
"Task 8:\n",
|
||||
" text: Crea una lista con el elemento 'item1', obtén su longitud y devuelve la longitud\n",
|
||||
" code: variableToList(\"item1\", targetList)\n",
|
||||
"getListLen(targetList, len)\n",
|
||||
"addResult(len)\n",
|
||||
"\n",
|
||||
"Task 9:\n",
|
||||
" text: Recibe el parámetro 'items' como lista de query params, obtén su longitud y devu\n",
|
||||
" code: getQueryParamList(\"paramName\", targetList)\n",
|
||||
"getListLen(targetList, len)\n",
|
||||
"addResult(len)\n",
|
||||
"\n",
|
||||
"Task 10:\n",
|
||||
" text: Recibe el parámetro 'data' como JSON, extrae el campo 'name' y devuélvelo como r\n",
|
||||
" code: variableFromJSON(\"data\", \"name\", targetName)\n",
|
||||
"addResult(targetName)\n",
|
||||
"\n",
|
||||
"Task 11:\n",
|
||||
" text: Crea un objeto JSON vacío, agrega el campo 'status' con valor 'ok' y devuelve el\n",
|
||||
" code: emptyObject()\n",
|
||||
"addVar(\"status\", \"ok\")\n",
|
||||
"addResult(emptyObject())\n",
|
||||
"\n",
|
||||
"Task 12:\n",
|
||||
" text: Recibe el parámetro 'password', genera su hash MD5 y devuelve el hash como resul\n",
|
||||
" code: hash = generateMD5(password)\n",
|
||||
"addResult(hash)\n",
|
||||
"\n",
|
||||
"Task 13:\n",
|
||||
" text: Obtén la fecha y hora actual en formato 'YYYY-MM-DD' en la zona horaria 'UTC' y \n",
|
||||
" code: getDateTime(\"UTC\", \"local\", 0, targetDate)\n",
|
||||
"addResult(targetDate.strftime(\"%Y-%m-%d\"))\n",
|
||||
"\n",
|
||||
"Task 14:\n",
|
||||
" text: Recibe el parámetro 'epoch', conviértelo a string de fecha en formato 'YYYY-MM-D\n",
|
||||
" code: getDateTime(\"UTC\", \"local\", 0, targetDate)\n",
|
||||
"addResult(targetDate.strftime(\"%Y-%m-%d %H:%M:%S\"))\n",
|
||||
"\n",
|
||||
"Task 15:\n",
|
||||
" text: Recibe el parámetro 'date_str' en formato 'YYYY-MM-DD', conviértelo a epoch y de\n",
|
||||
" code: getDateTime(\"UTC\", \"local\", 0, targetDate)\n",
|
||||
"addResult(targetDate.strftime(\"%Y-%m-%d\").replace(\"-\", \"\"))\n",
|
||||
"\n",
|
||||
"Task 16:\n",
|
||||
" text: Define una función que recibe un número y devuelve su cuadrado. Llama a la funci\n",
|
||||
" code: def square(n):\n",
|
||||
" result = n * n\n",
|
||||
" return result\n",
|
||||
"\n",
|
||||
"result = square(5)\n",
|
||||
"addResult(result)\n",
|
||||
"\n",
|
||||
"Task 17:\n",
|
||||
" text: Define una función que recibe dos números y devuelve su suma. Llama a la función\n",
|
||||
" code: def add(a, b):\n",
|
||||
" result = a + b\n",
|
||||
" return result\n",
|
||||
"\n",
|
||||
"result = add(5, 3)\n",
|
||||
"addResult(result)\n",
|
||||
"\n",
|
||||
"Task 18:\n",
|
||||
" text: Usa un bloque try/exception para intentar dividir el parámetro 'num' entre 0. Si\n",
|
||||
" code: try:\n",
|
||||
" result = num / 0\n",
|
||||
"except ZeroDivisionError:\n",
|
||||
" addVar(_status, 403) # Forbidden\n",
|
||||
"addResult(\"error_division\")\n",
|
||||
"\n",
|
||||
"Task 19:\n",
|
||||
" text: Recibe el parámetro 'url', realiza una petición GET a esa URL con timeout de 500\n",
|
||||
" code: addVar(_status, 200) # OK\n",
|
||||
"addParam(\"url\", targetUrl)\n",
|
||||
"addResult(getResponse(targetUrl))\n",
|
||||
"\n",
|
||||
"Task 20:\n",
|
||||
" text: Recibe los parámetros 'url' y 'body', realiza una petición POST con timeout de 3\n",
|
||||
" code: addVar(_status, 200) # OK\n",
|
||||
"addParam(\"url\", targetUrl)\n",
|
||||
"addParam(\"body\", targetBody)\n",
|
||||
"addResult(getResponse(targetUrl))\n",
|
||||
"\n",
|
||||
"Task 21:\n",
|
||||
" text: Instancia un conector externo con UUID '20908e93260147acb2636967021fbf5d', llama\n",
|
||||
" code: belvo_connector = avapConnector(\"20908e93260147acb2636967021fbf5d\")\n",
|
||||
"addVar(_status, 200) # OK\n",
|
||||
"addResult(getStatus(belvo_\n",
|
||||
"\n",
|
||||
"Task 22:\n",
|
||||
" text: Lanza una función 'fetchData' de forma asíncrona con go, espera el resultado con\n",
|
||||
" code: go fetchData()\n",
|
||||
"resultado = gather(\"fetchData\", timeout=2000)\n",
|
||||
"addResult(resultado)\n",
|
||||
"\n",
|
||||
"Task 23:\n",
|
||||
" text: Recibe el parámetro 'n', itera desde 0 hasta n acumulando la suma y devuelve la \n",
|
||||
" code: def sum(n):\n",
|
||||
" result = 0\n",
|
||||
" for i in range(n + 1):\n",
|
||||
" result += i\n",
|
||||
" return result\n",
|
||||
"\n",
|
||||
"result = sum(5)\n",
|
||||
"addResult(r\n",
|
||||
"\n",
|
||||
"Task 24:\n",
|
||||
" text: Recibe el parámetro 'value'. Usando if Modo 2, si value es mayor que 0 y menor q\n",
|
||||
" code: if(value > 0 and value < 100):\n",
|
||||
" addVar(_status, 200) # OK\n",
|
||||
"else:\n",
|
||||
" addVar(_status, 403) # Forbidden\n",
|
||||
"addResult(\"rango\n",
|
||||
"\n",
|
||||
"Task 25:\n",
|
||||
" text: Realiza una consulta ORM a la tabla 'users' seleccionando todos los campos sin f\n",
|
||||
" code: ormAccessSelect(\"*\", \"users\", \"\", targetUsers)\n",
|
||||
"addResult(targetUsers)\n",
|
||||
"\n",
|
||||
"Task 26:\n",
|
||||
" text: Recibe los parámetros 'username' y 'email', inserta un registro en la tabla 'use\n",
|
||||
" code: ormAccessInsert(\"username\", \"email\", \"\", targetUser)\n",
|
||||
"addResult(targetUser)\n",
|
||||
"\n",
|
||||
"Task 27:\n",
|
||||
" text: Recibe el parámetro 'user_id', actualiza el campo 'active' a 1 en la tabla 'user\n",
|
||||
" code: ormAccessUpdate(\"active = 1\", \"users\", \"id = ?\", targetUser)\n",
|
||||
"addResult(targetUser)\n",
|
||||
"\n",
|
||||
"Task 28:\n",
|
||||
" text: Importa la librería nativa 'math', calcula el cuadrado de 9 usando una función y\n",
|
||||
" code: import math\n",
|
||||
"result = square(9)\n",
|
||||
"addResult(result)\n",
|
||||
"\n",
|
||||
"Task 29:\n",
|
||||
" text: Recibe el parámetro 'items_json' como JSON con una lista bajo la clave 'items'. \n",
|
||||
" code: getQueryParamList(\"items\", targetList)\n",
|
||||
"getListLen(targetList, len)\n",
|
||||
"addResult(len)\n",
|
||||
"\n",
|
||||
"Task 30:\n",
|
||||
" text: Recibe el parámetro 'token'. Si el token tiene exactamente 32 caracteres (usando\n",
|
||||
" code: if(len(token) == 32):\n",
|
||||
" addVar(_status, 200)\n",
|
||||
"else:\n",
|
||||
" addVar(_status, 401)\n",
|
||||
"addResult(\"token_valido\" if len(token) == \n",
|
||||
"\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# Preview a few results\n",
|
||||
"for task in tasks:\n",
|
||||
" print(f\"Task {task['task_id']}:\")\n",
|
||||
" print(f\" text: {task['text'][:80]}\")\n",
|
||||
" print(f\" code: {task['code'][:120]}\")\n",
|
||||
" print()\n",
|
||||
"\n",
|
||||
"if errors:\n",
|
||||
" print(\"Errors:\")\n",
|
||||
" for e in errors:\n",
|
||||
" print(f\" Task {e['task_id']}: {e['error']}\")"
|
||||
]
|
||||
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "d19a6325",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Saved to /home/acano/PycharmProjects/assistance-engine/synthetic_datasets/multipl_e_synthetic_dataset.json\n"
     ]
    }
   ],
   "source": [
    "# Save the completed dataset\n",
    "with open(OUTPUT_PATH, \"w\", encoding=\"utf-8\") as f:\n",
    "    json.dump(tasks, f, ensure_ascii=False, indent=2)\n",
    "\n",
    "print(f\"Saved to {OUTPUT_PATH.resolve()}\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "assistance-engine",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
@@ -4,7 +4,7 @@ from dataclasses import replace
 from pathlib import Path
 from typing import Any, Union
 
-from lark import Lark
+from lark import Lark, Tree
 from chonkie import (
     Chunk,
     ElasticHandshake,
@@ -21,6 +21,70 @@ from transformers import AutoTokenizer
 from scripts.pipelines.tasks.embeddings import OllamaEmbeddings
 from src.config import settings
 
+COMMAND_METADATA_NAMES = {
+    # system
+    "register_cmd": "registerEndpoint",
+    "addvar_cmd": "addVar",
+    "addparam_cmd": "addParam",
+    "getlistlen_cmd": "getListLen",
+    "getparamlist_cmd": "getQueryParamList",
+    "addresult_cmd": "addResult",
+
+    # async
+    "go_stmt": "go",
+    "gather_stmt": "gather",
+
+    # connector
+    "connector_instantiation": "avapConnector",
+
+    # http
+    "req_post_cmd": "RequestPost",
+    "req_get_cmd": "RequestGet",
+
+    # db
+    "orm_direct": "ormDirect",
+    "orm_check": "ormCheckTable",
+    "orm_create": "ormCreateTable",
+    "orm_select": "ormAccessSelect",
+    "orm_insert": "ormAccessInsert",
+    "orm_update": "ormAccessUpdate",
+
+    # util
+    "json_list_cmd": "json_list_ops",
+    "crypto_cmd": "crypto_ops",
+    "regex_cmd": "getRegex",
+    "datetime_cmd": "getDateTime",
+    "stamp_cmd": "timestamp_ops",
+    "string_cmd": "randomString",
+    "replace_cmd": "replace",
+
+    # modularity
+    "include_stmt": "include",
+    "import_stmt": "import",
+
+    # generic statements
+    "assignment": "assignment",
+    "call_stmt": "call",
+    "return_stmt": "return",
+    "if_stmt": "if",
+    "loop_stmt": "startLoop",
+    "try_stmt": "try",
+    "function_decl": "function",
+}
+
+
+def _extract_command_metadata(ast: Tree | None) -> dict[str, bool]:
+    if ast is None:
+        return {}
+
+    used_commands: set[str] = set()
+
+    for subtree in ast.iter_subtrees():
+        if subtree.data in COMMAND_METADATA_NAMES:
+            used_commands.add(COMMAND_METADATA_NAMES[subtree.data])
+
+    return {command_name: True for command_name in sorted(used_commands)}
+
 
 def _get_text(element) -> str:
     for attr in ("text", "content", "markdown"):
@@ -168,25 +232,30 @@ def fetch_documents(docs_folder_path: str, docs_extension: list[str]) -> list[Pa
     return docs_path
 
 
-def process_documents(docs_path: list[Path]) -> list[dict[str, Chunk | dict[str, Any]]]:
+def process_documents(docs_path: list[Path]) -> list[dict[str, Any]]:
     """
     Process documents by applying appropriate chefs and chunking strategies based on file type.
 
     Args:
-        docs_path (list[Path]): List of Paths to the documents to be processed
+        docs_path: List of Paths to the documents to be processed.
 
     Returns:
-        List of dicts with "chunk" (Chunk object) and "metadata" (dict with file info)
+        List of dicts with "chunk" (Chunk object) and "extra_metadata" (dict with file info).
     """
     processed_docs = []
     specific_metadata = {}
     custom_tokenizer = AutoTokenizer.from_pretrained(settings.hf_emb_model_name)
 
     chef_md = MarkdownChef(tokenizer=custom_tokenizer)
     chef_txt = TextChef()
     chunker = TokenChunker(tokenizer=custom_tokenizer)
-    with open(settings.proj_root / "docs/BNF/avap.lark") as grammar:
-        lark_parser = Lark(grammar=grammar, parser="lalr", propagate_positions=True, start="program")
+    with open(settings.proj_root / "research/code_indexing/BNF/avap.lark", encoding="utf-8") as grammar:
+        lark_parser = Lark(
+            grammar.read(),
+            parser="lalr",
+            propagate_positions=True,
+            start="program",
+        )
 
     for doc_path in docs_path:
         doc_extension = doc_path.suffix.lower()
@@ -197,31 +266,36 @@ def process_documents(docs_path: list[Path]) -> list[dict[str, Chunk | dict[str,
             chunked_doc = fused_doc.chunks
             specific_metadata = {
                 "file_type": "avap_docs",
-                "filename": doc_path.name
+                "filename": doc_path.name,
             }
 
         elif doc_extension == ".avap":
             processed_doc = chef_txt.process(doc_path)
-            chunked_doc = chunker.chunk(processed_doc.content)
 
             try:
                 ast = lark_parser.parse(processed_doc.content)
             except Exception as e:
                 logger.error(f"Error parsing AVAP code in {doc_path.name}: {e}")
                 ast = None
 
+            chunked_doc = chunker.chunk(processed_doc.content)
+
             specific_metadata = {
                 "file_type": "avap_code",
                 "filename": doc_path.name,
-                "AST": str(ast)
+                **_extract_command_metadata(ast),
             }
 
         else:
             continue
 
         for chunk in chunked_doc:
-            processed_docs.append({
-                "chunk": chunk,
-                "extra_metadata": {**specific_metadata}
-            })
+            processed_docs.append(
+                {
+                    "chunk": chunk,
+                    "extra_metadata": {**specific_metadata},
+                }
+            )
 
     return processed_docs
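For context on the change above: the new `_extract_command_metadata` helper walks the Lark-parsed AVAP AST and emits one boolean flag per command family found, which then lands in each chunk's `extra_metadata`. A minimal standalone sketch of the same pattern — using a hypothetical stand-in `Tree` class instead of `lark.Tree` (the real parser needs the AVAP grammar file) and only a subset of the rule-name mapping from the diff:

```python
from dataclasses import dataclass, field


# Stand-in for lark.Tree: only the pieces the extraction logic touches
# (`data` holds the rule name, `children` may hold nested subtrees).
@dataclass
class Tree:
    data: str
    children: list = field(default_factory=list)

    def iter_subtrees(self):
        # Yield this node and every nested Tree, depth-first.
        yield self
        for child in self.children:
            if isinstance(child, Tree):
                yield from child.iter_subtrees()


# Subset of the rule-name -> command-name mapping from the diff.
COMMAND_METADATA_NAMES = {
    "addvar_cmd": "addVar",
    "addresult_cmd": "addResult",
    "if_stmt": "if",
}


def extract_command_metadata(ast):
    # Mirrors _extract_command_metadata: collect each mapped command
    # rule seen anywhere in the tree, then emit sorted boolean flags.
    if ast is None:
        return {}
    used = {
        COMMAND_METADATA_NAMES[t.data]
        for t in ast.iter_subtrees()
        if t.data in COMMAND_METADATA_NAMES
    }
    return {name: True for name in sorted(used)}


# Example AST shape for: if(...): addVar(...) / addResult(...)
ast = Tree("program", [Tree("if_stmt", [Tree("addvar_cmd"), Tree("addresult_cmd")])])
print(extract_command_metadata(ast))  # {'addResult': True, 'addVar': True, 'if': True}
```

Because the flags are flat booleans rather than a stringified AST, they can be stored as filterable metadata fields alongside each chunk.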