5 Commits

Author SHA1 Message Date
acano 5f21544e0b Refactor Elasticsearch ingestion pipeline and add MBPP generation script
- Updated `elasticsearch_ingestion.py` to streamline document processing and ingestion into Elasticsearch.
- Introduced `generate_mbap.py` for generating benchmark problems in the AVAP language from a provided LRM.
- Created `prompts.py` to define prompts for converting Python problems to AVAP.
- Enhanced chunk processing in `chunk.py` to support markdown and AVAP documents.
- Added an `OllamaEmbeddings` class in `embeddings.py` for handling embeddings with an Ollama model.
- Updated dependencies in `uv.lock` to include new packages and versions.
2026-03-11 17:17:44 +01:00
acano 2ad09cc77f feat: Update dependencies and enhance Elasticsearch ingestion pipeline
- Added new dependencies including chonkie and markdown-it-py to requirements.txt.
- Refactored the Elasticsearch ingestion script to read and concatenate documents from specified folders.
- Implemented semantic chunking for documents using the chonkie library.
- Removed the old elasticsearch_ingestion_from_docs.py script as its functionality has been integrated into the main ingestion pipeline.
- Updated README.md to reflect new project structure and environment variables.
- Added a new changelog entry for version 1.4.0 detailing recent changes and enhancements.
2026-03-11 09:50:51 +01:00
acano bf3c7f36d8 feat(chunk): enhance file reading and processing logic
- Updated `read_files` function to return a list of dictionaries containing 'content' and 'title' keys.
- Added logic to handle concatenation of file contents and improved handling of file prefixes.
- Introduced `get_chunk_docs` function to chunk document contents using `SemanticChunker`.
- Added `convert_chunks_to_document` function to convert chunked content into `Document` objects.
- Integrated logging for chunking process.
- Updated dependencies in `uv.lock` to include `chonkie` and other related packages.
2026-03-10 14:36:09 +01:00
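The read, chunk, and convert flow described in commit bf3c7f36d8 can be sketched roughly as follows. This is a minimal illustration, not the repository's code: the function names come from the commit message, but their signatures, the `Document` dataclass, and the metadata keys are assumptions. The commit uses `SemanticChunker` from chonkie; the sketch swaps in a naive blank-line splitter so that it runs with the standard library only.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Document:
    """Hypothetical stand-in for the project's Document type."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def read_files(folder: str, suffix: str = ".md") -> list[dict]:
    """Return one {'content', 'title'} dict per matching file, sorted by name."""
    docs = []
    for path in sorted(Path(folder).glob(f"*{suffix}")):
        docs.append({"content": path.read_text(encoding="utf-8"), "title": path.stem})
    return docs


def get_chunk_docs(docs: list[dict]) -> list[dict]:
    """Split each document on blank lines.

    Placeholder for semantic chunking: the real pipeline is said to use
    chonkie's SemanticChunker, which is not reproduced here.
    """
    chunked = []
    for doc in docs:
        parts = [p.strip() for p in doc["content"].split("\n\n") if p.strip()]
        chunked.append({"title": doc["title"], "chunks": parts})
    return chunked


def convert_chunks_to_document(chunked: list[dict]) -> list[Document]:
    """Flatten chunked documents into Document objects, one per chunk."""
    out = []
    for doc in chunked:
        for i, text in enumerate(doc["chunks"]):
            # Metadata keys are illustrative assumptions.
            out.append(Document(page_content=text, metadata={"title": doc["title"], "chunk_index": i}))
    return out
```

Keeping the title alongside each chunk means every resulting `Document` can be traced back to its source file after ingestion.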
acano d951868200 refactor: Simplify Elasticsearch ingestion by removing chunk management module and integrating document building directly
2026-03-05 16:23:27 +01:00
acano 1549069f5a feat: Add Elasticsearch ingestion pipeline and document chunking functionality
- Implemented `elasticsearch_ingestion` function to handle document ingestion into Elasticsearch.
- Created `build_chunks_from_folder` function to read and clean text files, generating document chunks.
- Added logging for better traceability during the ingestion process.
- Updated `uv.lock` to include `boto3` as a new dependency.
2026-03-04 18:21:01 +01:00
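As a rough illustration of what commit 1549069f5a describes, the sketch below reads and cleans text files, splits them into chunks, and shapes them as bulk actions. Only the name `build_chunks_from_folder` comes from the commit message; its signature, the helper functions, the fixed-size splitting, the field names, and the index name are all assumptions. The Elasticsearch call itself is left out; action dicts of this shape are what `elasticsearch.helpers.bulk(client, actions)` accepts.

```python
import re
from pathlib import Path


def clean_text(text: str) -> str:
    """Collapse runs of whitespace into single spaces (assumed cleaning step)."""
    return re.sub(r"\s+", " ", text).strip()


def split_fixed(text: str, size: int = 500) -> list[str]:
    """Split text into consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def build_chunks_from_folder(folder: str, size: int = 500) -> list[dict]:
    """Read every .txt file in `folder`, clean it, and emit chunk records."""
    chunks = []
    for path in sorted(Path(folder).glob("*.txt")):
        cleaned = clean_text(path.read_text(encoding="utf-8"))
        for i, piece in enumerate(split_fixed(cleaned, size)):
            chunks.append({"source": path.name, "chunk_index": i, "text": piece})
    return chunks


def to_bulk_actions(chunks: list[dict], index_name: str) -> list[dict]:
    """Wrap chunk records as bulk actions; the _id scheme is hypothetical."""
    return [
        {"_index": index_name, "_id": f"{c['source']}-{c['chunk_index']}", "_source": c}
        for c in chunks
    ]
```

Deriving `_id` from the file name and chunk index makes a re-run overwrite existing documents instead of duplicating them, which is one possible design choice rather than something the commit states.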