
Add examples for HF datasets #1371

Merged: 24 commits into main from hf_nodes on Dec 12, 2024
Conversation

ramanishsingh
Contributor

@ramanishsingh ramanishsingh commented Nov 22, 2024

This PR adds examples showing how to use torchdata.nodes for dataloading and model training.
The examples cover:

  1. Loading and processing datasets from HuggingFace.
  2. Using the core functionality of nodes: batching, loading, and mapping.
  3. Getting batches and training an ML model.
  4. Specific examples include image recognition (MNIST digit recognition), NLP (movie review sentiment classification using BERT), and MultiNodeWeightedSampler, to show the tools available in nodes.

Fixes #1352
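For readers skimming this thread, here is a minimal sketch of the kind of pipeline these notebooks build. It assumes the torchdata.nodes constructors discussed in the review (MapStyleWrapper, Mapper, Batcher, Loader) and the HuggingFace datasets library; the dataset id, MNIST preprocessing, and batch size are illustrative, not the notebook code verbatim.

```python
# Sketch of an HF-dataset -> torchdata.nodes -> batches pipeline (illustrative only).
import numpy as np
import torch
from datasets import load_dataset
from torch.utils.data import RandomSampler, default_collate
from torchdata.nodes import Batcher, Loader, Mapper, MapStyleWrapper

# 1. Load a HuggingFace dataset (map-style) and select the train split.
dataset = load_dataset("mnist", split="train")

def to_tensors(sample):
    # Convert the HF sample (PIL image + int label) into flat float features and a label tensor.
    x = torch.tensor(np.array(sample["image"]), dtype=torch.float32).flatten() / 255.0
    y = torch.tensor(sample["label"])
    return x, y

# 2. Build the node pipeline: shuffle via a sampler, map, batch, collate.
sampler = RandomSampler(dataset)
node = MapStyleWrapper(map_dataset=dataset, sampler=sampler)  # yields raw samples in sampler order
node = Mapper(node, map_fn=to_tensors)                        # per-sample preprocessing
node = Batcher(node, batch_size=32)                           # yields lists of 32 samples
node = Mapper(node, map_fn=default_collate)                   # stack each list into batched tensors

# 3. Wrap the root node in Loader to get a conventional Iterable for training loops.
loader = Loader(node)
x, y = next(iter(loader))
print(x.shape, y.shape)  # torch.Size([32, 784]) torch.Size([32])
```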

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Nov 22, 2024

pytorch-bot bot commented Nov 22, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/data/1371

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit c82b5fc with merge base aebad0c:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@divyanshk
Contributor

@ramanishsingh Can you add pointers to what all these three notebooks aim to cover in the PR top comment?

@ramanishsingh
Contributor Author

> @ramanishsingh Can you add pointers to what all these three notebooks aim to cover in the PR top comment?

Thanks, done.

@ramanishsingh ramanishsingh marked this pull request as ready for review December 3, 2024 00:37
examples/nodes/utils.py (outdated)
return data


class MDS_Net(nn.Module):
Contributor

The model looks like a simple MLP. What does MDS stand for? Multidataset?

Contributor Author

Yes, but if it is confusing I can modify the name to something simpler.

Contributor

Just call this MLP
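For context on this thread, a minimal sketch of the model under discussion, renamed to MLP as suggested. The layer sizes (flattened 28x28 MNIST inputs, 10 classes) are assumptions for illustration rather than the exact notebook code.

```python
import torch.nn as nn

class MLP(nn.Module):
    """Small multi-layer perceptron for flattened 28x28 MNIST digits (sketch)."""

    def __init__(self, in_dim: int = 784, hidden_dim: int = 128, num_classes: int = 10):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, x):
        return self.layers(x)
```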

@@ -0,0 +1,2001 @@
x,y
Contributor

Can you just call this folder "data" or "example_data"?

return x


def train_mds_model(datasets, weights, batch_size=512):
Contributor

can we have a single "train" function that works for all 3 examples? They all look somewhat similar
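Along the lines of this suggestion, a single training loop could be shared across the examples. Below is a sketch under a few assumptions: batches arrive as (inputs, labels) tensor pairs, each dataset passed to the multi-dataset helper is an iterable of such samples, and the torchdata.nodes names used (IterableWrapper, MultiNodeWeightedSampler, Batcher, Mapper, Loader) match the API referenced in this PR. The helper names are illustrative, not the notebook's.

```python
import torch
import torch.nn as nn
from torch.utils.data import default_collate
from torchdata.nodes import Batcher, IterableWrapper, Loader, Mapper, MultiNodeWeightedSampler


def train(model, loader, epochs=2, lr=1e-3):
    """One training loop shared by all examples (sketch); expects (x, y) batches."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:        # Loader resets the underlying node between epochs
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()


def make_weighted_loader(datasets, weights, batch_size=512):
    """Combine several sample sources with MultiNodeWeightedSampler, as the MDS example does."""
    source_nodes = {name: IterableWrapper(ds) for name, ds in datasets.items()}
    node = MultiNodeWeightedSampler(source_nodes, weights)  # draw from each source by weight
    node = Batcher(node, batch_size=batch_size)
    node = Mapper(node, map_fn=default_collate)
    return Loader(node)
```

With this split, each notebook would only need to build its own loader and call train(model, loader).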

@andrewkho
Copy link
Contributor

Not sure how to comment on notebooks so will add comments here:
[screenshot of the notebook cell]
Do you need a 4-tuple here, or can it just be a 2-tuple?

@andrewkho
Copy link
Contributor

nit: can you rename multiple_datasets to multidatasets?

@@ -0,0 +1,172 @@
{
"cells": [
Contributor

Add a title

"\n",
"\n",
"# Hyperparameters\n",
"batch_size = 2 #batch size is kept low so that we can easily see the bacthes when we print them in the later cells\n",
Contributor

Drop the comment it's unnecessary

"batch_size = 2 #batch size is kept low so that we can easily see the bacthes when we print them in the later cells\n",
"\n",
"# Next we batch the inputs, and then apply a collate_fn with another Mapper\n",
"# to stack the tensors between. We use torch.utils.data.default_collate for this\n",
Contributor

Suggested change
"# to stack the tensors between. We use torch.utils.data.default_collate for this\n",
"# to stack the tensor. We use torch.utils.data.default_collate for this\n",

" node = PinMemory(node)\n",
"\n",
"# Since nodes are iterators, they need to be manually .reset() between epochs.\n",
"# We can wrap the root node in Loader to convert it to a more conventional Iterable.\n",
Contributor

Suggested change
"# We can wrap the root node in Loader to convert it to a more conventional Iterable.\n",
"# Instead, we can wrap the root node in Loader to convert it to a more conventional Iterable.\n",

}
],
"source": [
"# Once we have the loader, we can get batches from it over multiple epochs, to train the ML model\n",
Contributor

Suggested change
"# Once we have the loader, we can get batches from it over multiple epochs, to train the ML model\n",
"# Once we have the loader, we can get batches from it over multiple epochs, to train the model\n",

@@ -0,0 +1,148 @@
{
Contributor

Add a title

"# Load IMDB dataset from huggingface datasets and select the \"train\" split\n",
"dataset = load_dataset(\"imdb\", streaming=False)\n",
"dataset = dataset[\"train\"]\n",
"# Since dataset is a Map-style dataset, we can setup a sampler to shuffle the data\n",
Contributor

Let's link to the docs similar to other example
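For context, the cell under review does roughly the following. This is a sketch assuming MapStyleWrapper from torchdata.nodes (the migration guide linked later in this thread shows the same pattern), not the notebook verbatim:

```python
from datasets import load_dataset
from torch.utils.data import RandomSampler
from torchdata.nodes import Loader, MapStyleWrapper

# Load IMDB from HuggingFace datasets and select the "train" split.
dataset = load_dataset("imdb", streaming=False)
dataset = dataset["train"]

# dataset is map-style, so a sampler provides the (shuffled) index order.
sampler = RandomSampler(dataset)
node = MapStyleWrapper(map_dataset=dataset, sampler=sampler)

sample = next(iter(Loader(node)))
print(sample.keys())  # dict_keys(['text', 'label'])
```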

"# a custom map_fn to perform this. Using ParallelMapper allows us to use multiple\n",
"# threads (or processes) to parallelize this work and have it run in the background\n",
"max_len = 512\n",
"batch_size = 2 # Keeping batch size smaller to easily inspect the outputs of a batch\n",
Contributor

Suggested change
"batch_size = 2 # Keeping batch size smaller to easily inspect the outputs of a batch\n",
"batch_size = 2\n",

Contributor

@andrewkho andrewkho left a comment

Few nits but otherwise looks good

"dataset = load_dataset(\"imdb\", streaming=False)\n",
"dataset = dataset[\"train\"]\n",
"# Since dataset is a Map-style dataset, we can setup a sampler to shuffle the data\n",
"# Please refer to the migration guide here https://pytorch.org/data/docs/build/html/migrate_to_nodes_from_utils.html\n",
Contributor

I gave you the wrong link, after nightly run this link is live now: https://pytorch.org/data/main/migrate_to_nodes_from_utils.html

Contributor

@andrewkho andrewkho left a comment

update URL and then good to go

@ramanishsingh ramanishsingh merged commit 23289fc into main Dec 12, 2024
39 checks passed
andrewkho pushed a commit that referenced this pull request Dec 12, 2024
* intial_commit

* more examples

* run precommit

* hf mnist

* add complete MNIST example

* remove old py file

* add imdb bert

* update mnist example

* update hf_bert example

* add mds example

* update bert example

* update function name

* run precommit

* update bert example

* update mnist notebook

* update mds

* delete ipynb ckpts

* remove mds and simplify examples

* fix some typos

* simplify and remove test train mentiond

* remove headings

* add titles

* fix typo

* update url
@ramanishsingh ramanishsingh deleted the hf_nodes branch December 12, 2024 23:05
Closes #1352: [WIP] Examples for demonstrating the usage and incremental value of TorchData Nodes