
Add examples for HF datasets #1371

Merged: 24 commits into main from hf_nodes on Dec 12, 2024
Conversation

ramanishsingh
Contributor

@ramanishsingh ramanishsingh commented Nov 22, 2024

This PR adds examples showing how to use torchdata.nodes for dataloading and model training.
The examples cover:

  1. Loading and processing datasets from HuggingFace.
  2. Using the core functionality of nodes: batching, loading, and mapping.
  3. Getting batches and training an ML model.
  4. Specific examples include image recognition (MNIST digit recognition), NLP (movie review sentiment classification using BERT), and MultiNodeWeightedSampler, to show the tools available in nodes.

Fixes #1352
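For readers skimming this thread, here is a minimal sketch of the kind of pipeline these notebooks build. It assumes the torchdata.nodes constructors discussed in the review (MapStyleWrapper, Mapper, Batcher, Loader) and the HuggingFace datasets library; the dataset id, MNIST preprocessing, and batch size are illustrative, not the notebook code verbatim.

```python
# Sketch of an HF-dataset -> torchdata.nodes -> batches pipeline (illustrative only).
import numpy as np
import torch
from datasets import load_dataset
from torch.utils.data import RandomSampler, default_collate
from torchdata.nodes import Batcher, Loader, Mapper, MapStyleWrapper

# 1. Load a HuggingFace dataset (map-style) and select the train split.
dataset = load_dataset("mnist", split="train")

def to_tensors(sample):
    # Convert the HF sample (PIL image + int label) into flat float features and a label tensor.
    x = torch.tensor(np.array(sample["image"]), dtype=torch.float32).flatten() / 255.0
    y = torch.tensor(sample["label"])
    return x, y

# 2. Build the node pipeline: shuffle via a sampler, map, batch, collate.
sampler = RandomSampler(dataset)
node = MapStyleWrapper(map_dataset=dataset, sampler=sampler)  # yields raw samples in sampler order
node = Mapper(node, map_fn=to_tensors)                        # per-sample preprocessing
node = Batcher(node, batch_size=32)                           # yields lists of 32 samples
node = Mapper(node, map_fn=default_collate)                   # stack each list into batched tensors

# 3. Wrap the root node in Loader to get a conventional Iterable for training loops.
loader = Loader(node)
x, y = next(iter(loader))
print(x.shape, y.shape)  # torch.Size([32, 784]) torch.Size([32])
```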

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Nov 22, 2024

pytorch-bot bot commented Nov 22, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/data/1371

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit c82b5fc with merge base aebad0c:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@divyanshk
Contributor

@ramanishsingh Can you add pointers to what all these three notebooks aim to cover in the PR top comment?

@ramanishsingh
Contributor Author

> @ramanishsingh Can you add pointers to what all these three notebooks aim to cover in the PR top comment?

Thanks, done.

@ramanishsingh ramanishsingh marked this pull request as ready for review December 3, 2024 00:37
examples/nodes/utils.py (outdated)
return data


class MDS_Net(nn.Module):
Contributor

The model looks like a simple MLP. What does MDS stand for? Multidataset?

Contributor Author

Yes, but if it is confusing I can modify the name to something simpler.

Contributor

Just call this MLP
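For context on this thread, a minimal sketch of the model under discussion, renamed to MLP as suggested. The layer sizes (flattened 28x28 MNIST inputs, 10 classes) are assumptions for illustration rather than the exact notebook code.

```python
import torch.nn as nn

class MLP(nn.Module):
    """Small multi-layer perceptron for flattened 28x28 MNIST digits (sketch)."""

    def __init__(self, in_dim: int = 784, hidden_dim: int = 128, num_classes: int = 10):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, x):
        return self.layers(x)
```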

@@ -0,0 +1,2001 @@
x,y
Contributor

Can you just call this folder "data" or "example_data"?

return x


def train_mds_model(datasets, weights, batch_size=512):
Contributor

can we have a single "train" function that works for all 3 examples? They all look somewhat similar
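Along the lines of this suggestion, a single training loop could be shared across the examples. Below is a sketch under a few assumptions: batches arrive as (inputs, labels) tensor pairs, each dataset passed to the multi-dataset helper is an iterable of such samples, and the torchdata.nodes names used (IterableWrapper, MultiNodeWeightedSampler, Batcher, Mapper, Loader) match the API referenced in this PR. The helper names are illustrative, not the notebook's.

```python
import torch
import torch.nn as nn
from torch.utils.data import default_collate
from torchdata.nodes import Batcher, IterableWrapper, Loader, Mapper, MultiNodeWeightedSampler


def train(model, loader, epochs=2, lr=1e-3):
    """One training loop shared by all examples (sketch); expects (x, y) batches."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:        # Loader resets the underlying node between epochs
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()


def make_weighted_loader(datasets, weights, batch_size=512):
    """Combine several sample sources with MultiNodeWeightedSampler, as the MDS example does."""
    source_nodes = {name: IterableWrapper(ds) for name, ds in datasets.items()}
    node = MultiNodeWeightedSampler(source_nodes, weights)  # draw from each source by weight
    node = Batcher(node, batch_size=batch_size)
    node = Mapper(node, map_fn=default_collate)
    return Loader(node)
```

With this split, each notebook would only need to build its own loader and call train(model, loader).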

@andrewkho
Copy link
Contributor

Not sure how to comment on notebooks so will add comments here:
[screenshot of the notebook cell]
Do you need a 4-tuple here, or can it just be a 2-tuple?

@andrewkho
Copy link
Contributor

nit: can you rename multiple_datasets to multidatasets?

@@ -0,0 +1,172 @@
{
"cells": [
Contributor

Add a title

"\n",
"\n",
"# Hyperparameters\n",
"batch_size = 2 #batch size is kept low so that we can easily see the bacthes when we print them in the later cells\n",
Contributor

Drop the comment it's unnecessary

"batch_size = 2 #batch size is kept low so that we can easily see the bacthes when we print them in the later cells\n",
"\n",
"# Next we batch the inputs, and then apply a collate_fn with another Mapper\n",
"# to stack the tensors between. We use torch.utils.data.default_collate for this\n",
Contributor

Suggested change
"# to stack the tensors between. We use torch.utils.data.default_collate for this\n",
"# to stack the tensor. We use torch.utils.data.default_collate for this\n",

" node = PinMemory(node)\n",
"\n",
"# Since nodes are iterators, they need to be manually .reset() between epochs.\n",
"# We can wrap the root node in Loader to convert it to a more conventional Iterable.\n",
Contributor

Suggested change
"# We can wrap the root node in Loader to convert it to a more conventional Iterable.\n",
"# Instead, we can wrap the root node in Loader to convert it to a more conventional Iterable.\n",

}
],
"source": [
"# Once we have the loader, we can get batches from it over multiple epochs, to train the ML model\n",
Contributor

Suggested change
"# Once we have the loader, we can get batches from it over multiple epochs, to train the ML model\n",
"# Once we have the loader, we can get batches from it over multiple epochs, to train the model\n",

@@ -0,0 +1,148 @@
{
Contributor

Add a title

"# Load IMDB dataset from huggingface datasets and select the \"train\" split\n",
"dataset = load_dataset(\"imdb\", streaming=False)\n",
"dataset = dataset[\"train\"]\n",
"# Since dataset is a Map-style dataset, we can setup a sampler to shuffle the data\n",
Contributor

Let's link to the docs similar to other example
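For context, the cell under review does roughly the following. This is a sketch assuming MapStyleWrapper from torchdata.nodes (the migration guide linked later in this thread shows the same pattern), not the notebook verbatim:

```python
from datasets import load_dataset
from torch.utils.data import RandomSampler
from torchdata.nodes import Loader, MapStyleWrapper

# Load IMDB from HuggingFace datasets and select the "train" split.
dataset = load_dataset("imdb", streaming=False)
dataset = dataset["train"]

# dataset is map-style, so a sampler provides the (shuffled) index order.
sampler = RandomSampler(dataset)
node = MapStyleWrapper(map_dataset=dataset, sampler=sampler)

sample = next(iter(Loader(node)))
print(sample.keys())  # dict_keys(['text', 'label'])
```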

"# a custom map_fn to perform this. Using ParallelMapper allows us to use multiple\n",
"# threads (or processes) to parallelize this work and have it run in the background\n",
"max_len = 512\n",
"batch_size = 2 # Keeping batch size smaller to easily inspect the outputs of a batch\n",
Contributor

Suggested change
"batch_size = 2 # Keeping batch size smaller to easily inspect the outputs of a batch\n",
"batch_size = 2\n",

Contributor

@andrewkho andrewkho left a comment

Few nits but otherwise looks good

"dataset = load_dataset(\"imdb\", streaming=False)\n",
"dataset = dataset[\"train\"]\n",
"# Since dataset is a Map-style dataset, we can setup a sampler to shuffle the data\n",
"# Please refer to the migration guide here https://pytorch.org/data/docs/build/html/migrate_to_nodes_from_utils.html\n",
Contributor

I gave you the wrong link, after nightly run this link is live now: https://pytorch.org/data/main/migrate_to_nodes_from_utils.html

Contributor

@andrewkho andrewkho left a comment

update URL and then good to go

@ramanishsingh ramanishsingh merged commit 23289fc into main Dec 12, 2024
39 checks passed
andrewkho pushed a commit that referenced this pull request Dec 12, 2024
* intial_commit

* more examples

* run precommit

* hf mnist

* add complete MNIST example

* remove old py file

* add imdb bert

* update mnist example

* update hf_bert example

* add mds example

* update bert example

* update function name

* run precommit

* update bert example

* update mnist notebook

* update mds

* delete ipynb ckpts

* remove mds and simplify examples

* fix some typos

* simplify and remove test train mentiond

* remove headings

* add titles

* fix typo

* update url
@ramanishsingh ramanishsingh deleted the hf_nodes branch December 12, 2024 23:05
Closes #1352: [WIP] Examples for demonstrating the usage and incremental value of TorchData Nodes