
How to use PyTorch dataloader #2222

Open
QiJune opened this issue Jul 31, 2020 · 11 comments

@QiJune
Collaborator

QiJune commented Jul 31, 2020

Background

The definition of PyTorch's Dataset class

We can see that PyTorch requires a Dataset to provide the __len__ and __getitem__ interfaces. This means the dataset must have a known length and support random access.

This differs from TensorFlow, whose Dataset can be created from a generator: the generator only requires the user to implement the __next__ interface, not __len__ or __getitem__.

Therefore, we need a new approach.

A simple approach

  1. The worker gets a task from the master.
  2. Using the interface provided by recordio_reader, the worker reads all records contained in the task into memory.
  3. records is an array with a known length that supports random access, so we can create a RecordDataset from it.
  4. Each record in the RecordDataset is a string, and the user needs to provide a feed function that converts the string into numeric types. This feed function is effectively a Transform in PyTorch.
  5. Finally, we create a DataLoader from the TransformedDataset, apply batching, shuffling, etc., and start training.

Pseudocode

while True:
    task = get_task()
    # Read all records in the task into memory.
    records = list(reader.read_records(task))
    record_dataset = create_dataset(records)
    transformed_dataset = dataset_fn(record_dataset)
    dataloader = DataLoader(transformed_dataset, shuffle=True, batch_size=32)
    for batch in dataloader:
        self.ps_client.pull_dense_parameters()
        loss = forward(batch)
        loss.backward()
        with torch.no_grad():
            grads = [param.grad.numpy() for param in model.parameters()]
            self.ps_client.push_gradients(grads)
@brightcoder01
Collaborator

brightcoder01 commented Jul 31, 2020

record_dataset = create_dataset(records)
transformed_dataset = dataset_fn(record_dataset)

Do the calls above mean that all the data in the dataset has already been transformed at this point?

@QiJune
Collaborator Author

QiJune commented Jul 31, 2020

@brightcoder01 No. The transform is applied on the fly, at the time the data is read.
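For illustration, a minimal sketch of that on-the-fly behavior, assuming a user-provided feed function as described above (not actual ElasticDL code):

from torch.utils.data import Dataset

class TransformedDataset(Dataset):
    """Wraps in-memory records and applies `feed` lazily in __getitem__."""

    def __init__(self, records, feed):
        self._records = records  # list of serialized (string) records
        self._feed = feed        # user transform: string -> numeric tensor(s)

    def __len__(self):
        return len(self._records)

    def __getitem__(self, idx):
        # The transform runs here, at read time, not at construction time.
        return self._feed(self._records[idx])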

@skydoorkai
Collaborator

records = list(reader.read_records(task))

This reads all the data of a task into memory up front, so data-reading IO cannot overlap with computation.
Instead, we could design a Dataset with an internal buffer that reads data asynchronously: __getitem__ takes data from the buffer, returning a record immediately if one is available and waiting otherwise. That way IO and computation can run in parallel.
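A rough sketch of that idea with a background thread and a bounded queue; `reader`, `task`, and the known record count are assumptions carried over from the pseudocode above. Because records are served in arrival order, __getitem__ ignores its index:

import queue
import threading
from torch.utils.data import Dataset

class BufferedRecordDataset(Dataset):
    """Reads records asynchronously into a bounded buffer so that IO
    overlaps with the training computation."""

    def __init__(self, reader, task, size, buffer_size=1024):
        self._size = size  # number of records in the task, known up front
        self._buffer = queue.Queue(maxsize=buffer_size)
        self._thread = threading.Thread(
            target=self._fill, args=(reader, task), daemon=True)
        self._thread.start()

    def _fill(self, reader, task):
        for record in reader.read_records(task):
            self._buffer.put(record)  # blocks when the buffer is full

    def __len__(self):
        return self._size

    def __getitem__(self, idx):
        # Records are served in arrival order; idx is ignored.
        return self._buffer.get()  # blocks until a record is ready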

@skydoorkai
Collaborator

https://pytorch.org/docs/stable/data.html#torch.utils.data.IterableDataset

IterableDataset looks just like a tf generator; it only needs to provide an __iter__() interface?
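A minimal sketch of an IterableDataset wrapping a generator; the generator here is a stand-in for reader.read_records(task) from the pseudocode above:

import torch
from torch.utils.data import IterableDataset, DataLoader

class RecordIterableDataset(IterableDataset):
    """Streams records from a generator; no __len__ or __getitem__ required."""

    def __init__(self, gen_fn):
        self._gen_fn = gen_fn  # zero-argument callable returning an iterator

    def __iter__(self):
        return self._gen_fn()

def gen():
    # Stand-in for iterating reader.read_records(task).
    for i in range(100):
        yield torch.tensor([float(i)])

loader = DataLoader(RecordIterableDataset(gen), batch_size=32)  # shuffle must stay False
for batch in loader:
    pass  # each batch has shape (32, 1); the last one is smaller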

@Kelang-Tian
Collaborator

Each record in the RecordDataset is a string, and the user needs to provide a feed function that converts the string into numeric types. This feed function is effectively a Transform in PyTorch.

Adding a transformed dataset gives us a dataset with composed transforms that reads data on the fly. The DataLoader then produces a multi-worker iterator. The key piece is collate_fn: it receives a list in which each element is self.data[i] from the underlying Dataset, i.e. the result of the __getitem__ we defined. The list's length is the batch size, and its elements are the __getitem__ results. Redefining collate_fn lets us customize how a batch is assembled.

collate_fn is a parameter of DataLoader.
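A small illustration of that contract; the (features, label) sample layout and the toy dataset are assumptions for the example:

import torch
from torch.utils.data import DataLoader

def collate_records(samples):
    # `samples` is a list of length batch_size; each element is one
    # __getitem__ result, here a (features, label) pair.
    features = torch.stack([f for f, _ in samples])
    labels = torch.tensor([l for _, l in samples])
    return features, labels

dataset = [(torch.randn(4), i % 2) for i in range(100)]  # toy map-style dataset
loader = DataLoader(dataset, batch_size=32, collate_fn=collate_records)
features, labels = next(iter(loader))  # features: (32, 4); labels: (32,)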

@linkerzhang
Member

linkerzhang commented Aug 1, 2020

Quick question: for TF and PyTorch users, do they need to follow different APIs to feed data in EDL? @QiJune

@Kelang-Tian
Collaborator

Quick question: for TF and PyTorch users, do they need to follow different APIs to feed data in EDL? @QiJune

In my opinion, feed() is a function that takes a dataset (in RecordIO format) as input, pre-processes the data as needed, and returns a dataset containing (model_inputs, labels) pairs. So yes, they do need to follow different APIs.

@QiJune
Collaborator Author

QiJune commented Aug 3, 2020

@linkerzhang

As @Kelang-Tian writes, the feed() function transforms a training sample in RecordIO format into a tf.Tensor or torch.Tensor. Users could also add their own preprocessing logic, for example, adding 1.0 to the training example value.

I believe that TensorFlow users would like to use tf operators, and PyTorch users would like to use torch functions. It's hard to unify them.
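As a concrete sketch, a PyTorch-flavored feed() might look like the following; the record layout (float32 features followed by a float32 label) is purely a hypothetical assumption:

import numpy as np
import torch

def feed(record):
    # Hypothetical parsing: the serialized record is a float32 buffer
    # whose last value is the label.
    values = np.frombuffer(record, dtype=np.float32).copy()
    features = torch.from_numpy(values[:-1]) + 1.0  # user preprocessing step
    label = torch.tensor(values[-1])
    return features, label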

@workingloong
Collaborator

Can we use TensorFlow Dataset APIs to read data and feed the data into PyTorch models?
For example:

for features, labels in dataset:
    features = torch.from_numpy(features.numpy())
    labels = torch.from_numpy(labels.numpy())
    loss = forward(features, labels)
    loss.backward()

@Kelang-Tian
Collaborator

@workingloong
Yes, we can. Actually, I can think of three methods for loading data into PyTorch:

  1. As you suggest, reuse the TensorFlow dataset: convert each batch from a tf eager
     tensor to NumPy before sending it to the PyTorch training loop (see the sketch
     after this list).
  2. Read all the data from a task and save it into a list, then create a dataset
     from that list. In this case, a dataset and dataloader need to be created for
     each task.
  3. Provide a gen function that yields data. We can create an IterableDataset from
     this gen function, which then sends the data to the training loop.
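A hedged end-to-end sketch of method 1, with a toy tf.data pipeline standing in for the RecordIO-backed dataset and a toy torch model; everything below is illustrative:

import tensorflow as tf
import torch

# Toy tf.data pipeline standing in for the RecordIO-backed dataset.
features = tf.random.normal([128, 4])
labels = tf.random.uniform([128], maxval=2, dtype=tf.int64)
dataset = tf.data.Dataset.from_tensor_slices((features, labels)).batch(32)

model = torch.nn.Linear(4, 2)
loss_fn = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for tf_features, tf_labels in dataset:
    # Cross the framework boundary through NumPy.
    x = torch.from_numpy(tf_features.numpy())
    y = torch.from_numpy(tf_labels.numpy())
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()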
