TorchData 0.5.0 Release Notes

@ejguan ejguan released this 27 Oct 17:08
· 300 commits to main since this release

  • Highlights
  • Backwards Incompatible Change
  • Deprecations
  • New Features
  • Improvements
  • Bug Fixes
  • Performance
  • Documentation
  • Future Plans
  • Beta Usage Note

Highlights

We are excited to announce the release of TorchData 0.5.0. This release is composed of about 236 commits since 0.4.1, including ones from PyTorch Core since 1.12.1, made by more than 35 contributors. We want to sincerely thank our community for continuously improving TorchData.

TorchData 0.5.0 updates are focused on consolidating the DataLoader2 and ReadingService APIs and benchmarking. Highlights include:

  • Added support to load data from more cloud storage providers, now covering AWS, Google Cloud Storage, and Azure. A detailed tutorial is available in the documentation.
  • Consolidated the API for DataLoader2 and provided a few ReadingServices, with detailed documentation now available.
  • Provided more comprehensive DataPipe operations, e.g., random_split, repeat, set_length, and prefetch.
  • Provided pre-compiled torchdata binaries for arm64 Apple Silicon

Backwards Incompatible Change

DataPipe

Changed the returned value of MapDataPipe.shuffle to an IterDataPipe (pytorch/pytorch#83202)

IterDataPipe is used to preserve data order

MapDataPipe.shuffle

0.4.1:
>>> from torch.utils.data import IterDataPipe, MapDataPipe
>>> from torch.utils.data.datapipes.map import SequenceWrapper
>>> dp = SequenceWrapper(list(range(10))).shuffle()
>>> isinstance(dp, MapDataPipe)
True
>>> isinstance(dp, IterDataPipe)
False

0.5.0:
>>> from torch.utils.data import IterDataPipe, MapDataPipe
>>> from torch.utils.data.datapipes.map import SequenceWrapper
>>> dp = SequenceWrapper(list(range(10))).shuffle()
>>> isinstance(dp, MapDataPipe)
False
>>> isinstance(dp, IterDataPipe)
True
      

on_disk_cache no longer accepts generator functions for the filepath_fn argument (#810)

on_disk_cache

0.4.1:
>>> from torchdata.datapipes.iter import IterableWrapper
>>> url_dp = IterableWrapper(["https://path/to/filename", ])
>>> def filepath_gen_fn(url):
...     yield from [url + f"/{i}" for i in range(3)]
>>> cache_dp = url_dp.on_disk_cache(filepath_fn=filepath_gen_fn)

0.5.0:
>>> from torchdata.datapipes.iter import IterableWrapper
>>> url_dp = IterableWrapper(["https://path/to/filename", ])
>>> def filepath_gen_fn(url):
...     yield from [url + f"/{i}" for i in range(3)]
>>> cache_dp = url_dp.on_disk_cache(filepath_fn=filepath_gen_fn)
# AssertionError
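
Before upgrading, it may help to check whether an existing filepath_fn would be rejected. The sketch below uses only the standard library and assumes the new assertion amounts to an inspect.isgeneratorfunction check, which these notes do not confirm. The helper is_generator_fn and the replacement filepath_fn are hypothetical names introduced for illustration.

```python
import inspect


def filepath_gen_fn(url):
    # Generator function, as in the example above.
    yield from [url + f"/{i}" for i in range(3)]


def filepath_fn(url):
    # Hypothetical replacement: a plain function that returns one path string.
    return url + "/0"


def is_generator_fn(fn):
    # True for functions whose body contains `yield` (assumed to match the new check).
    return inspect.isgeneratorfunction(fn)


print(is_generator_fn(filepath_gen_fn))  # True
print(is_generator_fn(filepath_fn))      # False
```

A function that returns a value instead of yielding passes this check; whether its return shape suits a given pipeline still has to be verified against the on_disk_cache documentation.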
      

DataLoader2

Imposed single iterator constraint on DataLoader2 (#700)

DataLoader2 with a single iterator

0.4.1:
>>> from torchdata.dataloader2 import DataLoader2
>>> from torchdata.datapipes.iter import IterableWrapper
>>> dl = DataLoader2(IterableWrapper(range(10)))
>>> it1 = iter(dl)
>>> print(next(it1))
0
>>> it2 = iter(dl)  # No reset here
>>> print(next(it2))
1
>>> print(next(it1))
2

0.5.0:
>>> from torchdata.dataloader2 import DataLoader2
>>> from torchdata.datapipes.iter import IterableWrapper
>>> dl = DataLoader2(IterableWrapper(range(10)))
>>> it1 = iter(dl)
>>> print(next(it1))
0
>>> it2 = iter(dl)  # DataLoader2 resets with the creation of a new iterator
>>> print(next(it2))
0
>>> print(next(it1))
# Raises exception, since it1 is no longer valid
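
The constraint can be pictured with a toy loader written in plain Python. This is a minimal sketch of the idea only, not TorchData's implementation: the class SingleIteratorLoader and its counter are hypothetical, and the real error type and message may differ.

```python
class SingleIteratorLoader:
    """Toy loader that allows only one live iterator at a time (not TorchData code)."""

    def __init__(self, data):
        self._data = list(data)
        self._valid_id = 0  # id of the only iterator allowed to produce items

    def __iter__(self):
        # Creating a new iterator invalidates earlier ones and restarts from the top.
        self._valid_id += 1
        return self._generate(self._valid_id)

    def _generate(self, my_id):
        for item in self._data:
            if my_id != self._valid_id:
                raise RuntimeError("This iterator has been invalidated by a newer one.")
            yield item


dl = SingleIteratorLoader(range(10))
it1 = iter(dl)
first = next(it1)    # 0
it2 = iter(dl)       # resets the loader; it1 is now stale
second = next(it2)   # 0
try:
    next(it1)
    stale_error = None
except RuntimeError as exc:
    stale_error = str(exc)
print(first, second, stale_error)
```

Code that relied on two iterators sharing one position, as in the 0.4.1 column, would need to keep a single iterator and pass it around instead.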
      

Deep copy DataPipe during DataLoader2 initialization or restoration (#786, #833)

Previously, if a DataPipe was passed to multiple DataLoaders, its state could be altered by any of them. In some cases this raised an exception due to the single iterator constraint; in others, behavior changed because of the adapters (e.g. shuffling) of another DataLoader.

Deep copy DataPipe during DataLoader2 constructor

0.4.1:
>>> from torchdata.dataloader2 import DataLoader2
>>> from torchdata.datapipes.iter import IterableWrapper
>>> dp = IterableWrapper([0, 1, 2, 3, 4])
>>> dl1 = DataLoader2(dp)
>>> dl2 = DataLoader2(dp)
>>> for x, y in zip(dl1, dl2):
...     print(x, y)
# RuntimeError: This iterator has been invalidated because another iterator has been created from the same IterDataPipe...

0.5.0:
>>> from torchdata.dataloader2 import DataLoader2
>>> from torchdata.datapipes.iter import IterableWrapper
>>> dp = IterableWrapper([0, 1, 2, 3, 4])
>>> dl1 = DataLoader2(dp)
>>> dl2 = DataLoader2(dp)
>>> for x, y in zip(dl1, dl2):
...     print(x, y)
0 0
1 1
2 2
3 3
4 4
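
The effect of copying at construction time can be sketched without TorchData. The classes CountingSource and CopyingLoader below are hypothetical stand-ins for a DataPipe and a loader; the sketch only shows that copy.deepcopy gives each loader independent state.

```python
import copy


class CountingSource:
    """Stateful iterable standing in for a DataPipe (hypothetical)."""

    def __init__(self, items):
        self.items = list(items)
        self.reads = 0  # mutable state that a shared reference would expose

    def __iter__(self):
        for item in self.items:
            self.reads += 1
            yield item


class CopyingLoader:
    """Toy loader that deep-copies its source in the constructor."""

    def __init__(self, source):
        self.source = copy.deepcopy(source)

    def __iter__(self):
        return iter(self.source)


src = CountingSource([0, 1, 2, 3, 4])
dl1, dl2 = CopyingLoader(src), CopyingLoader(src)
pairs = list(zip(dl1, dl2))
print(pairs)      # [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]
print(src.reads)  # 0, because the original object was never iterated
```

One consequence worth keeping in mind, if the real copy also happens once in the constructor: changes made to the original DataPipe afterwards would not reach a loader that was already built.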
      

Deprecations

DataLoader2

Deprecated traverse function and only_datapipe argument (pytorch/pytorch#85667)

Please use traverse_dps instead; it behaves the same as traverse with only_datapipe=True. (#793)

DataPipe traverse function

0.4.1:
>>> dp_graph = torch.utils.data.graph.traverse(datapipe, only_datapipe=False)

0.5.0:
>>> dp_graph = torch.utils.data.graph.traverse(datapipe, only_datapipe=False)
FutureWarning: `traverse` function and only_datapipe argument will be removed after 1.13.
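
While migrating, it can be useful to make any leftover deprecated call fail loudly in a test run. The sketch below promotes FutureWarning to an exception using the standard warnings module. The traverse function here is a hypothetical stand-in that only emits the same warning text, and call_strict is an illustrative helper, not part of PyTorch.

```python
import warnings


def traverse(datapipe, only_datapipe=False):
    # Hypothetical stand-in that warns the way the deprecated function does.
    warnings.warn(
        "`traverse` function and only_datapipe argument will be removed after 1.13.",
        FutureWarning,
    )
    return {}


def call_strict(fn, *args, **kwargs):
    # Run fn with FutureWarning promoted to an exception, then restore the filters.
    with warnings.catch_warnings():
        warnings.simplefilter("error", FutureWarning)
        return fn(*args, **kwargs)


try:
    call_strict(traverse, object(), only_datapipe=False)
    caught = None
except FutureWarning as exc:
    caught = str(exc)

print(caught)
```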
      

New Features

DataPipe

  • Added AIStore DataPipe (#545, #667)
  • Added support for IterDataPipe to trace DataFrames operations (pytorch/pytorch#71931,
  • Added support for DataFrameMakerIterDataPipe to accept dtype_generator to solve unserializable dtype (#537)
  • Added graph snapshotting by counting number of successful yields for IterDataPipe (pytorch/pytorch#79479, pytorch/pytorch#79657)
  • Implemented drop operation for IterDataPipe to drop column(s) (#725)
  • Implemented FullSyncIterDataPipe to synchronize distributed shards (#713)
  • Implemented slice and flatten operations for IterDataPipe (#730)
  • Implemented repeat operation for IterDataPipe (#748)
  • Added LengthSetterIterDataPipe (#747)
  • Added RandomSplitter (without buffer) (#724)
  • Added padded_tokens to max_token_bucketize to bucketize samples based on total padded token length (#789)
  • Implemented thread based PrefetcherIterDataPipe (#770, #818, #826, #842)
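
The idea behind thread-based prefetching can be sketched with the standard library: a worker thread reads ahead into a bounded queue while the consumer processes earlier items. This is an illustration under stated assumptions, not the PrefetcherIterDataPipe implementation; the function name prefetch and the buffer_size parameter are chosen here for the sketch.

```python
import queue
import threading

_END = object()  # sentinel marking the end of the source


def prefetch(source, buffer_size=10):
    """Yield items from `source` while a worker thread reads ahead."""
    buf = queue.Queue(maxsize=buffer_size)

    def worker():
        try:
            for item in source:
                buf.put(item)  # blocks while the buffer is full
        finally:
            # Always signal the end; errors from the source are not forwarded here.
            buf.put(_END)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = buf.get()
        if item is _END:
            return
        yield item


result = list(prefetch(range(5), buffer_size=2))
print(result)  # [0, 1, 2, 3, 4]
```

The bounded queue caps memory use at buffer_size items. A production version would also need to forward exceptions and stop the worker when the consumer exits early, both of which this sketch leaves out.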

DataLoader2

  • Added CacheTimeout Adapter to redefine cache timeout of the DataPipe graph (#571)
  • Added DistributedReadingService to support uneven data sharding (#727)
  • Added PrototypeMultiProcessingReadingService
    • Added prefetching (#826)
    • Fixed process termination (#837)
    • Enabled deterministic training in distributed/non-distributed environment (#827)
    • Handled empty queue exception properly (#785)
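
Uneven sharding arises whenever the number of samples is not divisible by the number of ranks. The sketch below assumes round-robin sharding and shows one simple remedy, truncating every rank to the shortest shard. It is meant to illustrate the problem, not to describe how DistributedReadingService or FullSyncIterDataPipe work internally; the shard helper is hypothetical.

```python
from itertools import islice


def shard(data, rank, world_size):
    # Round-robin: rank r takes items r, r + world_size, r + 2 * world_size, ...
    return list(islice(data, rank, None, world_size))


world_size = 3
shards = [shard(range(10), rank, world_size) for rank in range(world_size)]
print(shards)  # [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]

# Stop every rank once the shortest shard is exhausted, so that no rank waits
# on a collective operation that its peers will never join.
steps = min(len(s) for s in shards)
synced = [s[:steps] for s in shards]
print(synced)  # [[0, 3, 6], [1, 4, 7], [2, 5, 8]]
```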

Releng

  • Provided pre-compiled torchdata binaries for arm64 Apple Silicon (#692)

Improvements

DataPipe

  • Fixed error message coming from single iterator constraint (pytorch/pytorch#79547)
  • Enabled profiler record context in __next__ for IterDataPipe (pytorch/pytorch#79757)
  • Raised warning for unpicklable local function (pytorch/pytorch#80232, #547)
  • Cleaned up opened streams on a best-effort basis (#560, pytorch/pytorch#78952)
  • Used streaming reading mode for unseekable streams in TarArchiveLoader (#653)
  • Improved GDrive 'content-disposition' error message (#654)
  • Added as_tuple argument for CSVParserIterDataPipe to convert output from list to tuple (#646)
  • Raised Error when HTTPReader gets a 404 response (#160) (#569)
  • Added default no-op behavior for flatmap (#749)
  • Added support to validate input_col with the provided map function for DataPipe (pytorch/pytorch#80267, #755, pytorch/pytorch#84279)
  • Made ShufflerIterDataPipe support snapshotting (#83535)
  • Unified the implementations of in_batch_shuffle and shuffle for IterDataPipe (#745)
  • Made IterDataPipe.to_map_datapipe load data lazily (#765)
  • Added kwargs to open files for FSSpecFileLister and FSSpecSaver (#804)
  • Added missing functional name for FileLister (#86497)
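
Lazy loading, as mentioned for to_map_datapipe above, means the source is not read until the first lookup. A minimal plain-Python sketch of that pattern follows; LazyMap and the pairs generator are hypothetical and do not mirror the library's code.

```python
class LazyMap:
    """Toy map-style view over (key, value) pairs, built on first access."""

    def __init__(self, pairs):
        self._pairs = pairs
        self._map = None  # nothing is read at construction time

    def _load(self):
        if self._map is None:
            self._map = dict(self._pairs)

    def __getitem__(self, key):
        self._load()
        return self._map[key]

    def __len__(self):
        self._load()
        return len(self._map)


consumed = []


def pairs():
    for i in range(3):
        consumed.append(i)  # record when the source is actually read
        yield i, i * 10


m = LazyMap(pairs())
before = list(consumed)  # [] because construction did not touch the source
value = m[2]             # the first access materializes the whole mapping
after = list(consumed)   # [0, 1, 2]
print(before, value, after)
```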

DataLoader

DataLoader2

Releng

  • Enabled conda release to support GLIBC_2.27 (#859)

Bug Fixes

DataPipe

Performance

DataLoader2

  • Added benchmarking for DataLoader2
    • Added AWS cloud configurations (#680)
    • Added benchmark from torchvision training references (#714)

Documentation

DataPipe

  • Added examples for data loading with DataPipe
    • Read Criteo TSV and Parquet files and apply TorchArrow operations (#561)
    • Read caltech256 and coco with AIStoreDataPipe (#582)
    • Read from tigergraph database (#783)
  • Improved docstring for DataPipe
  • Added tutorial to load from Cloud Storage Provider including AWS S3, Google Cloud Platform and Azure Blob Storage (#812, #836)
  • Improved tutorial
    • Fixed tutorial for newline on Windows in generate_csv (#675)
    • Improved note on shuffling behavior (#688)
    • Fixed tutorial about shuffling before sharding (#715)
    • Added random_split example (#843)
  • Simplified long type names for online doc (#838)

DataLoader2

  • Improved docstring for DataLoader2 (#581, #817)
  • Added training examples using DataLoader2, ReadingService and DataPipe (#563, #664, #670, #787)

Releng

  • Added contribution guide for third-party library (#663)

Future Plans

We will continue benchmarking over datasets on local disk and cloud storage using TorchData. We will also continue making DataLoader2 and the related ReadingServices more stable, and provide more features such as snapshotting the data pipeline and restoring it from the serialized state. Stay tuned; any feedback is welcome.

Beta Usage Note

This library is currently in the Beta stage and does not yet have a stable release. The API may change based on user feedback or performance. We are committed to bringing this library to a stable release, but future changes may not be completely backward compatible. If you install from source or use the nightly version of this library, use it along with the PyTorch nightly binaries. If you have suggestions on the API or use cases you'd like to be covered, please open a GitHub issue. We'd love to hear thoughts and feedback.