Skip to content
generated from dtunai/Python-Template

SynthToT: Generate synthetic dataset for your training dataset through deliberate problem-solving et al S Yao, 2023.

License

Notifications You must be signed in to change notification settings

dtunai/SynthToT

Repository files navigation

SynthToT

Generate synthetic dataset for your training dataset through deliberate problem-solving


Table of Content

Introduction

SynthToT is an simple AI agent system powered by Langchain. It is specifically designed to facilitate the automated generation of synthetic datasets, which are crucial for the training of large language models. SynthToT Agent utilize the renowned Tree of Thoughts: Deliberate Problem Solving with Large Language Models et al. Shunyu Yao, Dian Yu. Tree-of-Thoughts prompting strategy, "which generalizes over the popular Chain of Thought approach to prompting language models, and enables exploration over coherent units of text (thoughts) that serve as intermediate steps toward problem solving. ToT allows LMs to perform deliberate decision making by considering multiple different reasoning paths and self-evaluating choices to decide the next course of action, as well as looking ahead or backtracking when necessary to make global choices."


By implementing this strategy, the SynthToT Agent offers a CLI interface for generating JSON dataset outputs using Tree-of-Thoughts (ToT) reasoning applied to the seed input content. This approach provides a distinctive foundation for creating datasets that are ideal for training state-of-the-art language models, adhering to the following JSON schema:

Output JSON Schema

[
    {
        "input": "",
        "perfect_consideration": "",
        "number": 10,
        "perspective": "",
        "prompt_template": "",
        "solutions": "",
        "thought_process": "",
        "sorted_solutions": "",
        "data_out": ""
    }
    // N items of seed input list
]

Input JSON Schema

[
    {
        "input": "",
        "perfect_consideration": "",
        "number": 10,
        "perspective": "",
    },
    // N items of seed input list
]

You can view example input_data and output_data from under the examples folder.

Features

Chaining

  • Initialization: Customizable parameters such as maximum tokens per response, model type, and sampling temperature.

    • max_tokens: Limits the number of tokens generated per response.
    • model: Specifies the language model to use (default is "gpt-4").
    • temperature: Controls the randomness of the output (default is 0).
  • Template Management: Utilizes a set of predefined Tree-of-Thoughts templates and corresponding output keys for structured data generation.

  • LLMChain: Initializes an LLMChain with a specified prompt template and output key, using the selected language model and parameters.

  • Chain Assembly: Generates a list of LLM chained instances based on the predefined templates and output keys.

Usage

Initial setup

I. Create a new Conda or virtual environment with the Python version 3.10:

conda create -n synthtotenv python=3.10.11
conda activate synthtotenv

or create virtual environment with venv.

II. Clone the repository, go into folder, and install requirements:

Clone from the remote:

git clone https://github.com/dtunai/SynthToT/

Switch to package folder:

cd SynthToT

Install requirements:

pip install -r setup-requirements.txt

Build the package:

pip install -e .

Preparing Input Data List

Now, you're tasked with creating your input data list. This list will serve as the foundation for generating synthetic output and potential solutions using the Tree-of-Thoughts approach by agents chains. Please take a look at examples folder for input data examples.

Using Tool

After creating your input list, now you can seed the list to the SynthToT via a simple CLI interface:

python synthtot/synthtot.py \
    --input-file <INPUT_FILE_PATH> \
    --output-file <OUTPUT_FILE_PATH> \
    --prompt-pairs-format <system-user-assistant || instruct>
    --model <OPENAI_MODEL_NAME> \
    --max-tokens <MAX_TOKEN_NUMBER> \
    --temperature <TEMPERATURE_FLOAT>

Contributing

If you find a bug, please open a bug report. If you have an idea for an improvement or new feature 🚀, please open a feature request.

About

SynthToT: Generate synthetic dataset for your training dataset through deliberate problem-solving et al S Yao, 2023.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published