diff --git a/README.md b/README.md
index 1916ee3..40f3dfc 100644
--- a/README.md
+++ b/README.md
@@ -1,8 +1,11 @@
-# python-package-template
+
+
+
+SynthToT
+Generate a synthetic dataset for your data through deliberate problem-solving
+

-This is a template repository for Python package projects.
-
-## In this README :point_down:
+##### Table of Contents

 - [Features](#features)
 - [Usage](#usage)
@@ -12,80 +15,130 @@ This is a template repository for Python package projects.
 - [FAQ](#faq)
 - [Contributing](#contributing)
+
+## Introduction
+
+SynthToT is a simple AI agent system developed by the [Mathematics and AI Institute](https://www.matyz.org/en/) and built with the LangChain framework. It automates the generation of synthetic datasets for training large language models. The SynthToT agent uses the Tree-of-Thoughts prompting strategy from [Tree of Thoughts: Deliberate Problem Solving with Large Language Models](https://arxiv.org/abs/2305.10601) (Shunyu Yao, Dian Yu, et al.), *"which generalizes over the popular Chain of Thought approach to prompting language models, and enables exploration over coherent units of text (thoughts) that serve as intermediate steps toward problem solving. ToT allows LMs to perform deliberate decision making by considering multiple different reasoning paths and self-evaluating choices to decide the next course of action, as well as looking ahead or backtracking when necessary to make global choices."*
+
+Building on this strategy, the SynthToT agent offers a CLI that generates JSON datasets by applying Tree-of-Thoughts (ToT) reasoning to the seed input content. This approach produces datasets suited to training language models; the output adheres to the following JSON schema:
+
+**Output JSON Schema**
+
+```json
+[
+  {
+    "input": "",
+    "perfect_consideration": "",
+    "number": 10,
+    "perspective": "",
+    "example": "",
+    "solutions": "",
+    "thought_process": "",
+    "sorted_solutions": "",
+    "data_out": ""
+  },
+  // n items of seed input list
+]
+```
+
+**Input JSON Schema**
+```json
+[
+  {
+    "input": "",
+    "perfect_consideration": "",
+    "number": 10,
+    "perspective": "",
+    "example": "",
+
+    // Role-based dataset Q&A pairs output format
+    //
+    // "example": "{\"messages\": [{\"role\": \"system\", \"content\": \"EcoBot is an eco-friendly AI assistant passionate about sustainability and environmental awareness.\"}, {\"role\": \"user\", \"content\": \"What are some benefits of using solar panels?\"}, {\"role\": \"assistant\", \"content\": \"Solar panels can significantly reduce electricity bills, decrease your carbon footprint, and even increase the value of your property. Plus, they're a renewable energy source, helping to combat climate change.\"}]}"
+  },
+  // n items of seed input list
+]
+```
+
+Example `input_data` and `output_data` files are available in the `examples` folder.
+
 ## Features

-This template repository comes with all of the boilerplate needed for:
+#### Chaining

-⚙️ Robust (and free) CI with [GitHub Actions](https://github.com/features/actions):
-  - Unit tests ran with [PyTest](https://docs.pytest.org) against multiple Python versions and operating systems.
-  - Type checking with [mypy](https://github.com/python/mypy).
-  - Linting with [ruff](https://astral.sh/ruff).
-  - Formatting with [isort](https://pycqa.github.io/isort/) and [black](https://black.readthedocs.io/en/stable/).
+- **Initialization:** Customizable parameters such as maximum tokens per response, model type, and sampling temperature.
-🤖 [Dependabot](https://github.blog/2020-06-01-keep-all-your-packages-up-to-date-with-dependabot/) configuration to keep your dependencies up-to-date.
+  - `max_tokens`: Limits the number of tokens generated per response.
+  - `model`: Specifies the language model to use (default is "gpt-3.5-turbo").
+  - `temperature`: Controls the randomness of the output (default is 0).

-📄 Great looking API documentation built using [Sphinx](https://www.sphinx-doc.org/en/master/) (run `make docs` to preview).
+- **Template Management:** Uses a set of predefined templates (`template_step1` to `template_step5`) and corresponding output keys for structured data generation.

-🚀 Automatic GitHub and PyPI releases. Just follow the steps in [`RELEASE_PROCESS.md`](./RELEASE_PROCESS.md) to trigger a new release.
+- **LLMChain:** Initializes an LLMChain with a specified prompt template and output key, using the selected language model and parameters.
+
+- **Chain Assembly:** Builds a list of LLMChain instances from the predefined templates and output keys.

 ## Usage

 ### Initial setup

-1. [Create a new repository](https://github.com/allenai/python-package-template/generate) from this template with the desired name of your project.
-
-    *Your project name (i.e. the name of the repository) and the name of the corresponding Python package don't necessarily need to match, but you might want to check on [PyPI](https://pypi.org/) first to see if the package name you want is already taken.*
+**I. Create a new Conda or virtual environment with Python 3.10:**

-2. Create a Python 3.8 or newer virtual environment.
+```bash
+conda create -n synthtotenv python=3.10.11
+```

-    *If you're not sure how to create a suitable Python environment, the easiest way is using [Miniconda](https://docs.conda.io/en/latest/miniconda.html). On a Mac, for example, you can install Miniconda using [Homebrew](https://brew.sh/):*
+```bash
+conda activate synthtotenv
+```

-    ```
-    brew install miniconda
-    ```
+Alternatively, create a virtual environment with `venv`.

-    *Then you can create and activate a new Python environment by running:*
+**II. Clone the repository, enter its folder, and install the requirements:**

-    ```
-    conda create -n synthtot python=3.9
-    conda activate synthtot
-    ```
+Clone from the remote:

-3. Now that you have a suitable Python environment, you're ready to personalize this repository. Just run:
+```bash
+git clone https://github.com/dtunai/SynthToT/
+```

-    ```
-    pip install -r setup-requirements.txt
-    python scripts/personalize.py
-    ```
+Switch to the package folder:

-    And then follow the prompts.
+```bash
+cd SynthToT
+```

-    :pencil: *NOTE: This script will overwrite the README in your repository.*
+Install the requirements:

-4. Commit and push your changes, then make sure all GitHub Actions jobs pass.
+```bash
+pip install -r setup-requirements.txt
+```

-5. (Optional) If you plan on publishing your package to PyPI, add repository secrets for `PYPI_USERNAME` and `PYPI_PASSWORD`. To add these, go to "Settings" > "Secrets" > "Actions", and then click "New repository secret".
+Install the package in editable mode:

-    *If you don't have PyPI account yet, you can [create one for free](https://pypi.org/account/register/).*
+```bash
+pip install -e .
+```

-6. (Optional) If you want to deploy your API docs to [readthedocs.org](https://readthedocs.org), go to the [readthedocs dashboard](https://readthedocs.org/dashboard/import/?) and import your new project.
+### Preparing Input Data List

-    Then click on the "Admin" button, navigate to "Automation Rules" in the sidebar, click "Add Rule", and then enter the following fields:
+Next, create your input data list. The agent chains use this list as the seed for generating synthetic output and candidate solutions with the Tree-of-Thoughts approach. See the `examples` folder for sample input data.

-    - **Description:** Publish new versions from tags
-    - **Match:** Custom Match
-    - **Custom match:** v[vV]
-    - **Version:** Tag
-    - **Action:** Activate version
+### Using the Tool

-    Then hit "Save".
+After creating your input list, pass it to SynthToT through the CLI:

-    *After your first release, the docs will automatically be published to [your-project-name.readthedocs.io](https://your-project-name.readthedocs.io/).*
+```bash
+python synthtot/synthtot.py \
+    --input-file \
+    --output-file \
+    --model \
+    --max-tokens \
+    --temperature
+```

 ### Creating releases

-Creating new GitHub and PyPI releases is easy. The GitHub Actions workflow that comes with this repository will handle all of that for you.
-All you need to do is follow the instructions in [RELEASE_PROCESS.md](./RELEASE_PROCESS.md).
+Creating new GitHub and PyPI releases is easy. The GitHub Actions workflow that comes with this repository will handle all of that for you. All you need to do is follow the instructions in [RELEASE_PROCESS.md](./RELEASE_PROCESS.md).

 ## Projects using this template

@@ -103,9 +156,8 @@ Here is an incomplete list of some projects that started off with this template:

 #### Should I use this template even if I don't want to publish my package?

-Absolutely! If you don't want to publish your package, just delete the `docs/` directory and the `release` job in [`.github/workflows/main.yml`](https://github.com/allenai/python-package-template/blob/main/.github/workflows/main.yml).
+Absolutely! If you don't want to publish your package, just delete the `docs/` directory and the `release` job in [`.github/workflows/main.yml`](https://github.com/dtunai/SynthToT/blob/main/.github/workflows/main.yml).

 ## Contributing

-If you find a bug :bug:, please open a [bug report](https://github.com/allenai/python-package-template/issues/new?assignees=&labels=bug&template=bug_report.md&title=).
-If you have an idea for an improvement or new feature :rocket:, please open a [feature request](https://github.com/allenai/python-package-template/issues/new?assignees=&labels=Feature+request&template=feature_request.md&title=).
+If you find a bug, please open a bug report. If you have an idea for an improvement or new feature :rocket:, please open a [feature request](https://github.com/dtunai/SynthToT/issues/new?assignees=&labels=Feature+request&template=feature_request.md&title=).
diff --git a/assets/Synth-ToT-Logo.png b/assets/Synth-ToT-Logo.png
new file mode 100644
index 0000000..ea21ed2
Binary files /dev/null and b/assets/Synth-ToT-Logo.png differ
diff --git a/setup.py b/setup.py
new file mode 100644
index 0000000..2f641f2
--- /dev/null
+++ b/setup.py
@@ -0,0 +1,26 @@
+from setuptools import setup, find_packages
+
+with open("README.md", "r") as fh:
+    long_description = fh.read()
+
+with open("setup-requirements.txt", "r") as req_file:
+    install_requires = req_file.read().splitlines()
+
+setup(
+    name="synthtot",
+    version="0.1.0",
+    author="Dogukan Uraz Tuna",
+    author_email="dogukanutuna@gmail.com",
+    description=" Generate a synthetic dataset for your data through deliberate problem-solving",
+    long_description=long_description,
+    long_description_content_type="text/markdown",
+    url="https://github.com/dtunai/synthtot",
+    packages=find_packages(),
+    install_requires=install_requires,
+    classifiers=[
+        "Programming Language :: Python :: 3",
+        "License :: OSI Approved :: MIT License",
+        "Operating System :: OS Independent",
+    ],
+    python_requires=">=3.10",
+)
\ No newline at end of file
diff --git a/synthtot/data/input_data.json b/synthtot/data/input_data.json
index d10fe71..1cb5a43 100644
--- a/synthtot/data/input_data.json
+++ b/synthtot/data/input_data.json
@@ -4,6 +4,6 @@
     "perfect_consideration": "",
     "number": 3,
     "perspective": "Mathematician",
-    "example": "{\"messages\": [{\"role\": \"system\", \"content\": \"GitMaxd is a helpful AI assistant with a penchant for direct and casual communication.\"}, {\"role\": \"user\", \"content\": \"How big are Sulcata tortoises?\"}, {\"role\": \"assistant\", \"content\": \"Sulcata tortoises can easily weigh over 120 lb and have a body that extends over 30 inches in length. That's bigger than some people!\"}]}"
+    "example": "{\"messages\": [{\"role\": \"system\", \"content\": \"EcoBot is an eco-friendly AI assistant passionate about sustainability and environmental awareness.\"}, {\"role\": \"user\", \"content\": \"What are some benefits of using solar panels?\"}, {\"role\": \"assistant\", \"content\": \"Solar panels can significantly reduce electricity bills, decrease your carbon footprint, and even increase the value of your property. Plus, they're a renewable energy source, helping to combat climate change.\"}]}"
   }
 ]
\ No newline at end of file
diff --git a/synthtot/synthtot.py b/synthtot/synthtot.py
index ff5798f..203db12 100644
--- a/synthtot/synthtot.py
+++ b/synthtot/synthtot.py
@@ -1,8 +1,13 @@
 import os
 import argparse
-
+import logging
 from generator.generator import entrypoint

+# Set up logging
+logging.basicConfig(level=logging.INFO)
+logger = logging.getLogger(__name__)
+
+# Set up argument parser
 parser = argparse.ArgumentParser(
     description="Generate synthetic training data with your input data by deliberate problem solving."
 )
@@ -37,4 +42,5 @@
 )

 args = parser.parse_args()
+# Call the entry point function with the parsed arguments
 entrypoint(args.input_file, args.output_file, args.model, args.max_tokens, args.temperature)
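
Review note: the Input JSON Schema added to the README lists five keys per seed item (`input`, `perfect_consideration`, `number`, `perspective`, `example`). A minimal sketch of how a seed list could be built and checked before it is passed to the CLI might look like the following. The helper names `make_seed_item` and `validate_seed_list` are hypothetical and not part of this change, and the integer check on `number` is an assumption based on the example values (`10` and `3`) shown in the diff.

```python
import json

# Keys taken from the "Input JSON Schema" block in the README hunk.
REQUIRED_KEYS = ("input", "perfect_consideration", "number", "perspective", "example")


def make_seed_item(input_text, perspective, number=3, perfect_consideration="", example=""):
    """Build one seed item (hypothetical helper; defaults are illustrative only)."""
    return {
        "input": input_text,
        "perfect_consideration": perfect_consideration,
        "number": number,
        "perspective": perspective,
        "example": example,
    }


def validate_seed_list(items):
    """Return a list of problems; an empty list means the seed list looks well-formed."""
    if not isinstance(items, list):
        return ["top-level value must be a JSON array"]
    problems = []
    for index, item in enumerate(items):
        if not isinstance(item, dict):
            problems.append(f"item {index}: expected an object")
            continue
        for key in REQUIRED_KEYS:
            if key not in item:
                problems.append(f"item {index}: missing key '{key}'")
        # Assumption: "number" is an integer, as in the example values in the diff.
        if "number" in item and not isinstance(item["number"], int):
            problems.append(f"item {index}: 'number' must be an integer")
    return problems


if __name__ == "__main__":
    seeds = [make_seed_item("What is a prime number?", "Mathematician")]
    print(validate_seed_list(seeds))
    print(json.dumps(seeds, indent=2))
```

A check like this could run before `entrypoint` is called, so that a malformed seed file fails early rather than partway through the chain; whether the project wants that behavior is a design decision for the author.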