Add examples for HF datasets #1371
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/data/1371
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit c82b5fc with merge base aebad0c.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@ramanishsingh Can you add pointers to what these three notebooks aim to cover in the PR top comment?
Thanks, done.
examples/nodes/utils.py (Outdated)

    return data


class MDS_Net(nn.Module):
The model looks like a simple MLP. What does MDS stand for? Multi-dataset?
Yes, but if it is confusing I can modify the name to something simpler.
Just call this MLP
@@ -0,0 +1,2001 @@
x,y
Can you just call this folder "data" or "example_data"?
examples/nodes/utils.py (Outdated)

    return x


def train_mds_model(datasets, weights, batch_size=512):
Can we have a single "train" function that works for all three examples? They all look somewhat similar.
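The reviewer's suggestion of one shared train function for all three examples could look like the following torch-free sketch. The names `train`, `loader_factory`, and `step_fn` are illustrative stand-ins, not code from the PR:

```python
# Hypothetical sketch of a single shared train() helper, as the review
# suggests. Plain Python lists stand in for tensors and batches.

def train(loader_factory, step_fn, epochs=2):
    """Run step_fn on every batch for the given number of epochs.

    loader_factory: zero-arg callable returning a fresh iterable of batches,
                    so every epoch starts from the beginning.
    step_fn:        callable(batch) -> loss-like value.
    """
    history = []
    for _ in range(epochs):
        for batch in loader_factory():
            history.append(step_fn(batch))
    return history

batches = [[1, 2], [3, 4]]
losses = train(lambda: iter(batches), lambda b: sum(b), epochs=2)
# Two epochs over two batches: [3, 7, 3, 7]
```

Each example would only need to supply its own loader factory and per-batch step, which keeps the three notebooks structurally identical.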
nit: can you rename
@@ -0,0 +1,172 @@
{
  "cells": [
Add a title
"\n",
"\n",
"# Hyperparameters\n",
"batch_size = 2 #batch size is kept low so that we can easily see the bacthes when we print them in the later cells\n",
Drop the comment; it's unnecessary.
"batch_size = 2 #batch size is kept low so that we can easily see the bacthes when we print them in the later cells\n",
"\n",
"# Next we batch the inputs, and then apply a collate_fn with another Mapper\n",
"# to stack the tensors between. We use torch.utils.data.default_collate for this\n",
Suggested change:
- "# to stack the tensors between. We use torch.utils.data.default_collate for this\n",
+ "# to stack the tensor. We use torch.utils.data.default_collate for this\n",
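The batch-then-collate pattern discussed here can be illustrated without torch: the batcher yields a list of samples, and a collate step turns it into one column-wise batch, which is the role `torch.utils.data.default_collate` plays for tensors. `batcher` and `collate` below are hypothetical stand-ins, not the torchdata.nodes API:

```python
# Sketch of the Batcher + collate-Mapper pattern using plain tuples in
# place of tensors.

def batcher(samples, batch_size):
    # Yield consecutive chunks of `batch_size` samples.
    for i in range(0, len(samples), batch_size):
        yield samples[i:i + batch_size]

def collate(batch):
    # Turn a list of (x, y) pairs into (list of x, list of y),
    # analogous to stacking per-field tensors.
    xs, ys = zip(*batch)
    return list(xs), list(ys)

samples = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
batches = [collate(b) for b in batcher(samples, batch_size=2)]
# batches == [([1, 2], ["a", "b"]), ([3, 4], ["c", "d"])]
```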
" node = PinMemory(node)\n",
"\n",
"# Since nodes are iterators, they need to be manually .reset() between epochs.\n",
"# We can wrap the root node in Loader to convert it to a more conventional Iterable.\n",
Suggested change:
- "# We can wrap the root node in Loader to convert it to a more conventional Iterable.\n",
+ "# Instead, we can wrap the root node in Loader to convert it to a more conventional Iterable.\n",
}
],
"source": [
"# Once we have the loader, we can get batches from it over multiple epochs, to train the ML model\n",
Suggested change:
- "# Once we have the loader, we can get batches from it over multiple epochs, to train the ML model\n",
+ "# Once we have the loader, we can get batches from it over multiple epochs, to train the model\n",
@@ -0,0 +1,148 @@
{
Add a title
"# Load IMDB dataset from huggingface datasets and select the \"train\" split\n",
"dataset = load_dataset(\"imdb\", streaming=False)\n",
"dataset = dataset[\"train\"]\n",
"# Since dataset is a Map-style dataset, we can setup a sampler to shuffle the data\n",
Let's link to the docs, similar to the other example.
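The sampler-based shuffling being reviewed above (a sampler producing indices into a map-style dataset) can be sketched with the stdlib; a seeded index permutation stands in here for the torch sampler used in the notebook:

```python
import random

# Sketch of sampler-driven shuffling for a map-style dataset: the
# "sampler" is just a shuffled list of indices, and samples are fetched
# by index. The real notebook uses a torch sampler for this role.

dataset = ["neg review 0", "pos review 0", "neg review 1", "pos review 1"]

rng = random.Random(0)  # fixed seed so the order is reproducible
indices = list(range(len(dataset)))
rng.shuffle(indices)

shuffled = [dataset[i] for i in indices]
# Same items, new order:
assert sorted(shuffled) == sorted(dataset)
```

Because the dataset is map-style (index-addressable), shuffling only the index stream is enough; the data itself is never copied or reordered in place.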
examples/nodes/hf_imdb_bert.ipynb (Outdated)

"# a custom map_fn to perform this. Using ParallelMapper allows us to use multiple\n",
"# threads (or processes) to parallelize this work and have it run in the background\n",
"max_len = 512\n",
"batch_size = 2 # Keeping batch size smaller to easily inspect the outputs of a batch\n",
Suggested change:
- "batch_size = 2 # Keeping batch size smaller to easily inspect the outputs of a batch\n",
+ "batch_size = 2\n",
A few nits, but otherwise looks good.
examples/nodes/hf_imdb_bert.ipynb (Outdated)

"dataset = load_dataset(\"imdb\", streaming=False)\n",
"dataset = dataset[\"train\"]\n",
"# Since dataset is a Map-style dataset, we can setup a sampler to shuffle the data\n",
"# Please refer to the migration guide here https://pytorch.org/data/docs/build/html/migrate_to_nodes_from_utils.html\n",
I gave you the wrong link, after nightly run this link is live now: https://pytorch.org/data/main/migrate_to_nodes_from_utils.html
update URL and then good to go
* intial_commit
* more examples
* run precommit
* hf mnist
* add complete MNIST example
* remove old py file
* add imdb bert
* update mnist example
* update hf_bert example
* add mds example
* update bert example
* update function name
* run precommit
* update bert example
* update mnist notebook
* update mds
* delete ipynb ckpts
* remove mds and simplify examples
* fix some typos
* simplify and remove test train mentiond
* remove headings
* add titles
* fix typo
* update url
This PR adds examples showing how to use torchdata.nodes for data loading and model training.
These examples aim to cover:

* `nodes` basics: batching, loading, mapping
* `MultiNodeWeightedSampler` in `nodes`

to show the tools available in `nodes`.

Fixes #1352
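The weighted multi-dataset sampling that `MultiNodeWeightedSampler` provides can be sketched with stdlib `random.choices`; the real torchdata.nodes class has a different API, and the names below are only illustrative:

```python
import random

# Conceptual sketch of weighted sampling across multiple source datasets,
# the idea behind MultiNodeWeightedSampler. Not the real torchdata API.

datasets = {
    "imdb":  [f"imdb_{i}" for i in range(100)],
    "mnist": [f"mnist_{i}" for i in range(100)],
}
weights = {"imdb": 0.9, "mnist": 0.1}

rng = random.Random(0)
iters = {name: iter(ds) for name, ds in datasets.items()}

def sample(n):
    # Pick a source per step according to the weights, then pull the
    # next item from that source's iterator.
    names = list(datasets)
    probs = [weights[name] for name in names]
    picks = rng.choices(names, weights=probs, k=n)
    return [next(iters[name]) for name in picks]

batch = sample(10)
# With weight 0.9 on imdb, most picks usually come from that source.
```

This is the mixing behavior the notebooks demonstrate: each sample is drawn from one of several source nodes with a configurable probability, so rare datasets can be up- or down-weighted during training.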