313 lines
8.4 KiB
Plaintext
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b657efd2",
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "from datasets import load_dataset\n",
    "\n",
    "import boto3\n",
    "from botocore.config import Config\n",
    "from langchain_core.messages import SystemMessage, HumanMessage\n",
    "\n",
    "from src.utils.llm_factory import create_chat_model\n",
    "from src.config import RAW_DIR, INTERIM_DIR"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e6e90339",
   "metadata": {},
   "source": [
    "### Create LLM instance"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "20eecc53",
   "metadata": {},
   "outputs": [],
   "source": [
    "config = Config(\n",
    "    region_name=\"us-east-1\",\n",
    "    connect_timeout=10,\n",
    "    read_timeout=600,\n",
    ")\n",
    "\n",
    "client = boto3.client(\"bedrock-runtime\", config=config)\n",
    "\n",
    "llm = create_chat_model(\n",
    "    provider=\"bedrock\",\n",
    "    client=client,\n",
    "    model=\"global.anthropic.claude-sonnet-4-6\",\n",
    "    temperature=0,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "96f12a22",
   "metadata": {},
   "source": [
    "### Load MBPP data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "78e29dc2",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "DatasetDict({\n",
       "    train: Dataset({\n",
       "        features: ['task_id', 'text', 'code', 'test_list', 'test_setup_code', 'challenge_test_list'],\n",
       "        num_rows: 374\n",
       "    })\n",
       "    test: Dataset({\n",
       "        features: ['task_id', 'text', 'code', 'test_list', 'test_setup_code', 'challenge_test_list'],\n",
       "        num_rows: 500\n",
       "    })\n",
       "    validation: Dataset({\n",
       "        features: ['task_id', 'text', 'code', 'test_list', 'test_setup_code', 'challenge_test_list'],\n",
       "        num_rows: 90\n",
       "    })\n",
       "    prompt: Dataset({\n",
       "        features: ['task_id', 'text', 'code', 'test_list', 'test_setup_code', 'challenge_test_list'],\n",
       "        num_rows: 10\n",
       "    })\n",
       "})"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dataset_full = load_dataset(\"mbpp\")\n",
    "dataset_full"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3e7544bb",
   "metadata": {},
   "source": [
    "### Load AVAP data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8e88b2d3",
   "metadata": {},
   "outputs": [],
   "source": [
    "with open(RAW_DIR / \"avap.txt\", \"r\", encoding=\"utf-8\") as f:\n",
    "    avap_docs = f.read()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c039d79f",
   "metadata": {},
   "source": [
    "### Random sample of MBPP"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "b186af30",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sample 50 random examples from the test split\n",
    "random_test_samples = dataset_full[\"test\"].shuffle(seed=42).select(range(50))\n",
    "\n",
    "# Save text and code in a dictionary keyed by sample index\n",
    "test_samples_dict = {\n",
    "    str(i): {\n",
    "        \"text\": sample[\"text\"],\n",
    "        \"code\": sample[\"code\"],\n",
    "    }\n",
    "    for i, sample in enumerate(random_test_samples)\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "21a0422d",
   "metadata": {},
   "source": [
    "### Prompts"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2c075e29",
   "metadata": {},
   "outputs": [],
   "source": [
    "PROMPT_MBPP = SystemMessage(\n",
    "    content=f\"\"\"\n",
    "    You are an AI agent responsible for generating a synthetic dataset for a programming language called AVAP (Advanced Virtual API Programming).\n",
    "\n",
    "    You will receive a JSON object containing programming problems and their solutions written in Python.\n",
    "\n",
    "    Your task is to:\n",
    "    1. Convert each Python programming problem into an equivalent problem framed for the AVAP language.\n",
    "    2. Replace the Python solution with a correct implementation written in AVAP.\n",
    "\n",
    "    Requirements:\n",
    "    - The AVAP solution must follow AVAP syntax.\n",
    "    - Use AVAP libraries and constructs when appropriate.\n",
    "    - Functions in AVAP are declared using the `function` keyword.\n",
    "    - Preserve the original problem intent and difficulty.\n",
    "    - Do not produce Python code in the final solution.\n",
    "\n",
    "    Documentation usage rules:\n",
    "    - You MUST rely exclusively on the AVAP documentation provided below.\n",
    "    - Do not invent syntax, functions, or libraries that are not described in the documentation.\n",
    "    - If the documentation does not contain enough information to solve the problem, return exactly:\n",
    "    \"I don't know how to answer this question based on the provided documentation.\"\n",
    "\n",
    "    AVAP Documentation:\n",
    "    {avap_docs}\n",
    "    \"\"\"\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cbb459f8",
   "metadata": {},
   "outputs": [],
   "source": [
    "PROMPT_NO_MBPP = HumanMessage(\n",
    "    content=\"\"\"\n",
    "    I gave you all the context for a new programming language called AVAP (Advanced Virtual API Programming).\n",
    "    With that context, I'd like you to generate a \"golden dataset\" of 100 question-and-answer pairs in which the user query asks for a code snippet and the response is the full code.\n",
    "    The questions and answers should cover the entire context provided and span every level of difficulty, from simple code snippets to more complex code.\n",
    "\n",
    "    You must return JSON with a user_query and a response for each question-and-answer pair.\n",
    "    \"\"\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ea1e824e",
   "metadata": {},
   "source": [
    "### Generate dataset using the MBPP dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8b7dcf2f",
   "metadata": {},
   "outputs": [],
   "source": [
    "llm_response = llm.invoke([PROMPT_MBPP, HumanMessage(content=str(test_samples_dict))])\n",
    "print(llm_response.content)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "c6f5872e",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Strip surrounding whitespace first so the ```json fence is matched as a prefix\n",
    "json_str = llm_response.content.strip().removeprefix(\"```json\").removesuffix(\"```\").strip()\n",
    "synthetic_data = json.loads(json_str)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d26cbba7",
   "metadata": {},
   "outputs": [],
   "source": [
    "with open(INTERIM_DIR / 'synthetic_datasets/synthetic_data.json', 'w') as f:\n",
    "    json.dump(synthetic_data, f)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fc52b327",
   "metadata": {},
   "source": [
    "### Generate dataset without the MBPP dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b16137cb",
   "metadata": {},
   "outputs": [],
   "source": [
    "llm_response = llm.invoke([SystemMessage(content=avap_docs), PROMPT_NO_MBPP])\n",
    "print(llm_response.content)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "80d207fa",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Strip surrounding whitespace first so the ```json fence is matched as a prefix\n",
    "json_str = llm_response.content.strip().removeprefix(\"```json\").removesuffix(\"```\").strip()\n",
    "synthetic_data = json.loads(json_str)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "13e53200",
   "metadata": {},
   "outputs": [],
   "source": [
    "with open(INTERIM_DIR / 'synthetic_datasets/synthetic_data_no_mbpp.json', 'w') as f:\n",
    "    json.dump(synthetic_data, f)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "assistance-engine",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}