
How to use PyTorch dataloader #2222

Open
QiJune opened this issue Jul 31, 2020 · 11 comments

@QiJune
Collaborator

QiJune commented Jul 31, 2020

Background

The definition of PyTorch's Dataset class

We can see that PyTorch requires a Dataset to provide the __len__ and __getitem__ interfaces. This means the dataset must have a known length and support random access.

This differs from TensorFlow, whose Dataset can be created from a generator: the generator only requires the user to implement the __next__ interface, not __len__ or __getitem__.

Therefore, we need a new approach.

A simple approach

  1. The worker gets a task from the master.
  2. Using the interface provided by recordio_reader, the worker reads all records contained in the task into memory.
  3. records is an array with a known length that supports random access, so we can create a RecordDataset from it.
  4. Each record in the RecordDataset is a string, and the user needs to provide a feed function that converts the string into numeric types. This feed function is effectively a Transform in PyTorch.
  5. Finally, we create a DataLoader from the TransformedDataset, apply batching, shuffling, etc., and start training.

Pseudocode

while True:
    task = get_task()
    # Read all records in the task into memory.
    records = list(reader.read_records(task))
    record_dataset = create_dataset(records)
    transformed_dataset = dataset_fn(record_dataset)
    dataloader = DataLoader(transformed_dataset, shuffle=True, batch_size=32)
    for batch in dataloader:
        self.ps_client.pull_dense_parameters()
        loss = forward(batch)
        loss.backward()
        with torch.no_grad():
            grads = [param.grad.numpy() for param in model.parameters()]
            self.ps_client.push_gradients(grads)
@brightcoder01
Collaborator

brightcoder01 commented Jul 31, 2020

record_dataset = create_dataset(records)
transformed_dataset = dataset_fn(record_dataset)

Do the calls above mean that all the data in the dataset has already been transformed at this point?

@QiJune
Collaborator Author

QiJune commented Jul 31, 2020

@brightcoder01 No. The transform is applied on the fly, at the time the data is read.
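For illustration, a minimal sketch of that on-the-fly behavior, assuming a user-provided feed function as described above (not actual ElasticDL code):

from torch.utils.data import Dataset

class TransformedDataset(Dataset):
    """Wraps in-memory records and applies `feed` lazily in __getitem__."""

    def __init__(self, records, feed):
        self._records = records  # list of serialized (string) records
        self._feed = feed        # user transform: string -> numeric tensor(s)

    def __len__(self):
        return len(self._records)

    def __getitem__(self, idx):
        # The transform runs here, at read time, not at construction time.
        return self._feed(self._records[idx])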

@skydoorkai
Collaborator

records = list(reader.read_records(task))

This reads all the data of a task into memory up front, so data-reading IO cannot overlap with computation.
Instead, we could design a Dataset with an internal buffer that reads data asynchronously: __getitem__ takes data from the buffer, returning a record immediately if one is available and waiting otherwise. That way IO and computation can run in parallel.
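A rough sketch of that idea with a background thread and a bounded queue; `reader`, `task`, and the known record count are assumptions carried over from the pseudocode above. Because records are served in arrival order, __getitem__ ignores its index:

import queue
import threading
from torch.utils.data import Dataset

class BufferedRecordDataset(Dataset):
    """Reads records asynchronously into a bounded buffer so that IO
    overlaps with the training computation."""

    def __init__(self, reader, task, size, buffer_size=1024):
        self._size = size  # number of records in the task, known up front
        self._buffer = queue.Queue(maxsize=buffer_size)
        self._thread = threading.Thread(
            target=self._fill, args=(reader, task), daemon=True)
        self._thread.start()

    def _fill(self, reader, task):
        for record in reader.read_records(task):
            self._buffer.put(record)  # blocks when the buffer is full

    def __len__(self):
        return self._size

    def __getitem__(self, idx):
        # Records are served in arrival order; idx is ignored.
        return self._buffer.get()  # blocks until a record is ready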

@skydoorkai
Collaborator

https://pytorch.org/docs/stable/data.html#torch.utils.data.IterableDataset

IterableDataset looks just like a tf generator; it only needs to provide an __iter__() interface?
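A minimal sketch of an IterableDataset wrapping a generator; the generator here is a stand-in for reader.read_records(task) from the pseudocode above:

import torch
from torch.utils.data import IterableDataset, DataLoader

class RecordIterableDataset(IterableDataset):
    """Streams records from a generator; no __len__ or __getitem__ required."""

    def __init__(self, gen_fn):
        self._gen_fn = gen_fn  # zero-argument callable returning an iterator

    def __iter__(self):
        return self._gen_fn()

def gen():
    # Stand-in for iterating reader.read_records(task).
    for i in range(100):
        yield torch.tensor([float(i)])

loader = DataLoader(RecordIterableDataset(gen), batch_size=32)  # shuffle must stay False
for batch in loader:
    pass  # each batch has shape (32, 1); the last one is smaller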

@Kelang-Tian
Collaborator

Each record in the RecordDataset is a string, and the user needs to provide a feed function that converts the string into numeric types. This feed function is effectively a Transform in PyTorch.

Adding a transformed dataset gives us a dataset with composed transforms that reads data on the fly. The DataLoader then produces a multi-worker iterator. The key piece is collate_fn: it receives a list in which each element is self.data[i] from the underlying Dataset, i.e. the result of the __getitem__ we defined. The list's length is the batch size, and its elements are the __getitem__ results. Redefining collate_fn lets us customize how a batch is assembled.

collate_fn is a parameter of DataLoader.
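A small illustration of that contract; the (features, label) sample layout and the toy dataset are assumptions for the example:

import torch
from torch.utils.data import DataLoader

def collate_records(samples):
    # `samples` is a list of length batch_size; each element is one
    # __getitem__ result, here a (features, label) pair.
    features = torch.stack([f for f, _ in samples])
    labels = torch.tensor([l for _, l in samples])
    return features, labels

dataset = [(torch.randn(4), i % 2) for i in range(100)]  # toy map-style dataset
loader = DataLoader(dataset, batch_size=32, collate_fn=collate_records)
features, labels = next(iter(loader))  # features: (32, 4); labels: (32,)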

@linkerzhang
Member

linkerzhang commented Aug 1, 2020

Quick question: for TF and PyTorch users, do they need to follow different APIs to feed data in EDL? @QiJune

@Kelang-Tian
Collaborator

Quick question: for TF and PyTorch users, do they need to follow different APIs to feed data in EDL? @QiJune

In my opinion, feed() is a function that takes a dataset (in RecordIO format) as input, pre-processes the data as needed, and returns a dataset containing (model_inputs, labels) pairs. So yes, they do need to follow different APIs.

@QiJune
Collaborator Author

QiJune commented Aug 3, 2020

@linkerzhang

As @Kelang-Tian writes, the feed() function transforms a training sample in RecordIO format into a tf.Tensor or torch.Tensor. Users could also add their own preprocessing logic, for example, adding 1.0 to the training example value.

I believe that TensorFlow users would like to use tf operators, and PyTorch users would like to use torch functions. It's hard to unify them.
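As a concrete sketch, a PyTorch-flavored feed() might look like the following; the record layout (float32 features followed by a float32 label) is purely a hypothetical assumption:

import numpy as np
import torch

def feed(record):
    # Hypothetical parsing: the serialized record is a float32 buffer
    # whose last value is the label.
    values = np.frombuffer(record, dtype=np.float32).copy()
    features = torch.from_numpy(values[:-1]) + 1.0  # user preprocessing step
    label = torch.tensor(values[-1])
    return features, label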

@workingloong
Collaborator

Can we use TensorFlow Dataset APIs to read data and feed the data into PyTorch models?
For example:

for features, labels in dataset:
    features = torch.from_numpy(features.numpy())
    labels = torch.from_numpy(labels.numpy())
    loss = forward(features, labels)
    loss.backward()

@Kelang-Tian
Collaborator

@workingloong
Yes, we can. Actually, I can think of three methods for loading data into PyTorch:

  1. As you suggest, reuse the TensorFlow dataset: convert each batch from a tf eager
     tensor to NumPy before sending it to the PyTorch training loop (see the sketch
     after this list).
  2. Read all the data from a task and save it into a list, then create a dataset
     from that list. In this case, a dataset and dataloader need to be created for
     each task.
  3. Provide a gen function that yields data. We can create an IterableDataset from
     this gen function, which then sends the data to the training loop.
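A hedged end-to-end sketch of method 1, with a toy tf.data pipeline standing in for the RecordIO-backed dataset and a toy torch model; everything below is illustrative:

import tensorflow as tf
import torch

# Toy tf.data pipeline standing in for the RecordIO-backed dataset.
features = tf.random.normal([128, 4])
labels = tf.random.uniform([128], maxval=2, dtype=tf.int64)
dataset = tf.data.Dataset.from_tensor_slices((features, labels)).batch(32)

model = torch.nn.Linear(4, 2)
loss_fn = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for tf_features, tf_labels in dataset:
    # Cross the framework boundary through NumPy.
    x = torch.from_numpy(tf_features.numpy())
    y = torch.from_numpy(tf_labels.numpy())
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()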
