Skip to content

Commit

Permalink
init
Browse files Browse the repository at this point in the history
  • Loading branch information
dtunai committed May 25, 2024
1 parent 2f203f2 commit a8b1280
Show file tree
Hide file tree
Showing 5 changed files with 136 additions and 52 deletions.
152 changes: 102 additions & 50 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,8 +1,11 @@
# python-package-template
<p align="center">
<img src="./assets/Synth-ToT-Logo.png" width="250"/>
</p>
<h2 align="center"><b>SynthToT</b>
<br>Generate a synthetic dataset for your data through deliberate problem-solving
</h2>

This is a template repository for Python package projects.

## In this README :point_down:
##### Table of Content

- [Features](#features)
- [Usage](#usage)
Expand All @@ -12,80 +15,130 @@ This is a template repository for Python package projects.
- [FAQ](#faq)
- [Contributing](#contributing)


## Introduction

SynthToT is an simple AI agent system crafted using the Langchain framework developed by [Mathematics and AI Institute.](https://www.matyz.org/en/) It is specifically designed to facilitate the automated generation of synthetic datasets, which are crucial for the training of large language models. SynthToT Agent utilize the renowned [Tree of Thoughts: Deliberate Problem Solving with Large Language Models](https://arxiv.org/abs/2305.10601) et al. Shunyu Yao, Dian Yu. Tree-of-Thoughts prompting strategy, *"which generalizes over the popular Chain of Thought approach to prompting language models, and enables exploration over coherent units of text (thoughts) that serve as intermediate steps toward problem solving. ToT allows LMs to perform deliberate decision making by considering multiple different reasoning paths and self-evaluating choices to decide the next course of action, as well as looking ahead or backtracking when necessary to make global choices."*

By implementing this strategy, the SynthToT Agent offers a CLI interface for generating JSON dataset outputs using Tree-of-Thoughts (ToT) reasoning applied to the seed input content. This approach provides a distinctive foundation for creating datasets that are ideal for training state-of-the-art language models, adhering to the following JSON schema:

**Output JSON Schema**

```json
[
{
"input": "",
"perfect_consideration": "",
"number": 10,
"perspective": "",
"example": "",
"solutions": "",
"thought_process": "",
"sorted_solutions": "",
"data_out": ""
},
// n items of seed input list
]
```

**Input JSON Schema**
```json
[
{
"input": "",
"perfect_consideration": "",
"number": 10,
"perspective": "",
"example": "",

// Role-based dataset Q&A pairs output format
//
// "example": "{\"messages\": [{\"role\": \"system\", \"content\": \"EcoBot is an eco-friendly AI assistant passionate about sustainability and environmental awareness.\"}, {\"role\": \"user\", \"content\": \"What are some benefits of using solar panels?\"}, {\"role\": \"assistant\", \"content\": \"Solar panels can significantly reduce electricity bills, decrease your carbon footprint, and even increase the value of your property. Plus, they're a renewable energy source, helping to combat climate change.\"}]}"
},
// n items of seed input list
]
```

You can view example input_data and output_data from under the `examples` folder.

## Features

This template repository comes with all of the boilerplate needed for:
#### Chaining

⚙️ Robust (and free) CI with [GitHub Actions](https://github.com/features/actions):
- Unit tests ran with [PyTest](https://docs.pytest.org) against multiple Python versions and operating systems.
- Type checking with [mypy](https://github.com/python/mypy).
- Linting with [ruff](https://astral.sh/ruff).
- Formatting with [isort](https://pycqa.github.io/isort/) and [black](https://black.readthedocs.io/en/stable/).
- **Initialization:** Customizable parameters such as maximum tokens per response, model type, and sampling temperature.

🤖 [Dependabot](https://github.blog/2020-06-01-keep-all-your-packages-up-to-date-with-dependabot/) configuration to keep your dependencies up-to-date.
- max_tokens: Limits the number of tokens generated per response.
- model: Specifies the language model to use (default is "gpt-3.5-turbo").
- temperature: Controls the randomness of the output (default is 0).

📄 Great looking API documentation built using [Sphinx](https://www.sphinx-doc.org/en/master/) (run `make docs` to preview).
- **Template Management:** Utilizes a set of predefined templates (template_step1 to template_step5) and corresponding output keys for structured data generation.

🚀 Automatic GitHub and PyPI releases. Just follow the steps in [`RELEASE_PROCESS.md`](./RELEASE_PROCESS.md) to trigger a new release.
- **LLMChain:** Initializes an LLMChain with a specified prompt template and output key, using the selected language model and parameters.

- **Chain Assembly:** Generates a list of LLM chained instances based on the predefined templates and output keys.

## Usage

### Initial setup

1. [Create a new repository](https://github.com/allenai/python-package-template/generate) from this template with the desired name of your project.

*Your project name (i.e. the name of the repository) and the name of the corresponding Python package don't necessarily need to match, but you might want to check on [PyPI](https://pypi.org/) first to see if the package name you want is already taken.*
**I. Create a new Conda or virtual environment with the Python version 3.10:**

2. Create a Python 3.8 or newer virtual environment.
```bash
conda create -n synthtotenv python=3.10.11
```

*If you're not sure how to create a suitable Python environment, the easiest way is using [Miniconda](https://docs.conda.io/en/latest/miniconda.html). On a Mac, for example, you can install Miniconda using [Homebrew](https://brew.sh/):*
```bash
conda activate synthtotenv
```

```
brew install miniconda
```
or create virtual environment with venv.

*Then you can create and activate a new Python environment by running:*
**II. Clone the repository, go into folder, and install requirements:**

```
conda create -n synthtot python=3.9
conda activate synthtot
```
Clone from the remote:

3. Now that you have a suitable Python environment, you're ready to personalize this repository. Just run:
```bash
git clone https://github.com/dtunai/SynthToT/
```

```
pip install -r setup-requirements.txt
python scripts/personalize.py
```
Switch to package folder:

And then follow the prompts.
```bash
cd SynthToT
```

:pencil: *NOTE: This script will overwrite the README in your repository.*
Install requirements:

4. Commit and push your changes, then make sure all GitHub Actions jobs pass.
```bash
pip install -r setup-requirements.txt
```

5. (Optional) If you plan on publishing your package to PyPI, add repository secrets for `PYPI_USERNAME` and `PYPI_PASSWORD`. To add these, go to "Settings" > "Secrets" > "Actions", and then click "New repository secret".
Build the package:

*If you don't have PyPI account yet, you can [create one for free](https://pypi.org/account/register/).*
```bash
pip install -e .
```

6. (Optional) If you want to deploy your API docs to [readthedocs.org](https://readthedocs.org), go to the [readthedocs dashboard](https://readthedocs.org/dashboard/import/?) and import your new project.
### Preparing Input Data List

Then click on the "Admin" button, navigate to "Automation Rules" in the sidebar, click "Add Rule", and then enter the following fields:
Now, you're tasked with creating your input data list. This list will serve as the foundation for generating synthetic output and potential solutions using the Tree-of-Thoughts approach by agents chains. Please take a look at `examples` folder for input data examples.

- **Description:** Publish new versions from tags
- **Match:** Custom Match
- **Custom match:** v[vV]
- **Version:** Tag
- **Action:** Activate version
### Using Tool

Then hit "Save".
After creating your input list, now you can seed the list to the SynthToT via a simple CLI interface:

*After your first release, the docs will automatically be published to [your-project-name.readthedocs.io](https://your-project-name.readthedocs.io/).*
```bash
python synthtot/synthtot.py \
--input-file <INPUT_FILE_PATH> \
--output-file <OUTPUT_FILE_PATH> \
--model <OPENAI_MODEL_NAME> \
--max-tokens <MAX_TOKEN_NUMBER> \
--temperature <TEMPERATURE_FLOAT>
```

### Creating releases

Creating new GitHub and PyPI releases is easy. The GitHub Actions workflow that comes with this repository will handle all of that for you.
All you need to do is follow the instructions in [RELEASE_PROCESS.md](./RELEASE_PROCESS.md).
Creating new GitHub and PyPI releases is easy. The GitHub Actions workflow that comes with this repository will handle all of that for you. All you need to do is follow the instructions in [RELEASE_PROCESS.md](./RELEASE_PROCESS.md).

## Projects using this template

Expand All @@ -103,9 +156,8 @@ Here is an incomplete list of some projects that started off with this template:

#### Should I use this template even if I don't want to publish my package?

Absolutely! If you don't want to publish your package, just delete the `docs/` directory and the `release` job in [`.github/workflows/main.yml`](https://github.com/allenai/python-package-template/blob/main/.github/workflows/main.yml).
Absolutely! If you don't want to publish your package, just delete the `docs/` directory and the `release` job in [`.github/workflows/main.yml`](https://github.com/dtunai/SynthToT/blob/main/.github/workflows/main.yml).

## Contributing

If you find a bug :bug:, please open a [bug report](https://github.com/allenai/python-package-template/issues/new?assignees=&labels=bug&template=bug_report.md&title=).
If you have an idea for an improvement or new feature :rocket:, please open a [feature request](https://github.com/allenai/python-package-template/issues/new?assignees=&labels=Feature+request&template=feature_request.md&title=).
If you find a bug, please open a bug report. If you have an idea for an improvement or new feature :rocket:, please open a [feature request](https://github.com/dtunai/SynthToT/issues/new?assignees=&labels=Feature+request&template=feature_request.md&title=).
Binary file added assets/Synth-ToT-Logo.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
26 changes: 26 additions & 0 deletions setup.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,26 @@
from setuptools import setup, find_packages

with open("README.md", "r") as fh:
long_description = fh.read()

with open("setup-requirements.txt", "r") as req_file:
install_requires = req_file.read().splitlines()

setup(
name="synthtot",
version="0.1.0",
author="Dogukan Uraz Tuna",
author_email="[email protected]",
description=" Generate a synthetic dataset for your data through deliberate problem-solving",
long_description=long_description,
long_description_content_type="text/markdown",
url="https://github.com/dtunai/synthtot",
packages=find_packages(),
install_requires=install_requires,
classifiers=[
"Programming Language :: Python :: 3",
"License :: OSI Approved :: MIT License",
"Operating System :: OS Independent",
],
python_requires=">=3.10",
)
2 changes: 1 addition & 1 deletion synthtot/data/input_data.json
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,6 @@
"perfect_consideration": "",
"number": 3,
"perspective": "Mathematician",
"example": "{\"messages\": [{\"role\": \"system\", \"content\": \"GitMaxd is a helpful AI assistant with a penchant for direct and casual communication.\"}, {\"role\": \"user\", \"content\": \"How big are Sulcata tortoises?\"}, {\"role\": \"assistant\", \"content\": \"Sulcata tortoises can easily weigh over 120 lb and have a body that extends over 30 inches in length. That's bigger than some people!\"}]}"
"example": "{\"messages\": [{\"role\": \"system\", \"content\": \"EcoBot is an eco-friendly AI assistant passionate about sustainability and environmental awareness.\"}, {\"role\": \"user\", \"content\": \"What are some benefits of using solar panels?\"}, {\"role\": \"assistant\", \"content\": \"Solar panels can significantly reduce electricity bills, decrease your carbon footprint, and even increase the value of your property. Plus, they're a renewable energy source, helping to combat climate change.\"}]}"
}
]
8 changes: 7 additions & 1 deletion synthtot/synthtot.py
Original file line number Diff line number Diff line change
@@ -1,8 +1,13 @@
import os
import argparse

import logging
from generator.generator import entrypoint

# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Set up argument parser
parser = argparse.ArgumentParser(
description="Generate synthetic training data with your input data by deliberate problem solving."
)
Expand Down Expand Up @@ -37,4 +42,5 @@
)
args = parser.parse_args()

# Call the entry point function with the parsed arguments
entrypoint(args.input_file, args.output_file, args.model, args.max_tokens, args.temperature)

0 comments on commit a8b1280

Please sign in to comment.