First example dataset for instruct datasets has no _component #2215

johnowhitaker · 2024-12-30T22:08:22Z

https://pytorch.org/torchtune/stable/basics/instruct_datasets.html

I based my work (my first ever torchtune attempt) on the first example on this page, and got an error until I added _component_: torchtune.datasets.instruct_dataset, which appears in the later examples but not in the first one.

Also, as a general note, it took a surprising amount of searching, starting from the "first fine-tune" tutorial (https://pytorch.org/torchtune/stable/tutorials/first_finetune_tutorial.html#) and the how-to guide (https://www.llama.com/docs/how-to-guides/fine-tuning/), to start answering the (IMO important!) question of: "How do I specify what data to train on?!?"

RdoubleA · 2025-01-01T01:50:46Z

You are correct, the first example is missing the _component_: torchtune.datasets.instruct_dataset field. Good catch. We welcome a PR to quickly patch that, or we can flag it for now and have it patched soon.
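For anyone hitting the same error before the docs are patched, here is a minimal sketch of the dataset section with the missing field added. The source, data_files, and split values are placeholders for a hypothetical local JSON file, not a copy of the docs example; adjust them to your own data.

```yaml
# Sketch only: dataset section of a torchtune config with the missing field added.
dataset:
  # The field the first docs example omits; it names the builder to instantiate.
  _component_: torchtune.datasets.instruct_dataset
  # Placeholder values, assuming a local JSON file (hypothetical path).
  source: json
  data_files: data/my_data.json
  split: train
```

The remaining keys are passed as arguments to the builder named by _component_, which is why leaving that field out produces an error.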

> to start answering the (IMO important!) question of: "How do I specify what data to train on?!?"

I agree this is the most important question when fine-tuning an LLM. I've been meaning to make this more visible in the documentation by adding a custom data page under "Basics" or "Tutorials". This is still on our todo list.

RdoubleA mentioned this issue Jan 1, 2025

Add a page explaining quickly setting up with custom data in live docs #2221

Open
