It takes too long to load frames #1

Open
kawakaming11 opened this issue Jul 6, 2024 · 19 comments

@kawakaming11

kawakaming11 commented Jul 6, 2024

Thank you for sharing your great project!

I have one problem when running the code: loading frames takes too long. I printed how long it took to load each batch, and the results are below. I set --num_workers 10, and the data loading time is very large at the 0th and 10th batches. Consequently, it took about 8 days to train one epoch using two A100 GPUs (80 GB).

Data loading time (0th batch): 1198.1397542953491
Data loading time (1th batch): 6.466662645339966
Data loading time (2th batch): 2.6589112281799316
Data loading time (3th batch): 0.0001480579376220703
...
Data loading time (9th batch): 0.00015687942504882812
Data loading time (10th batch): 1183.9874858856201
...

I downloaded venice.mp4 and wildlife.mp4 as specified in the readme and trained with those videos, but the same problem occurred with both. Do you have any solutions? Or could you share the training log? Thank you very much!
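
For anyone who wants to reproduce this kind of measurement, here is a minimal sketch of a per-batch timer. It wraps any iterable, so in practice it could wrap a PyTorch `DataLoader`. The names `timed_batches` and `slow_loader` are hypothetical and are not part of the DoRA code; `slow_loader` is only a stand-in that stalls periodically. A stall that repeats every `num_workers` batches, as in the log above, is what one would expect if each worker prepares one whole batch at a time and every batch is slow to build.

```python
import time
from typing import Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")


def timed_batches(loader: Iterable[T]) -> Iterator[Tuple[int, T, float]]:
    """Yield (index, batch, seconds spent waiting for that batch).

    Only the time spent inside next() is measured, so the time the caller
    spends on the training step is not counted as data loading.
    """
    it = iter(loader)
    index = 0
    while True:
        start = time.perf_counter()
        try:
            batch = next(it)
        except StopIteration:
            return
        waited = time.perf_counter() - start
        yield index, batch, waited
        index += 1


def slow_loader(num_batches: int, stall_every: int, stall_s: float):
    """Hypothetical stand-in for a loader that stalls every `stall_every` batches."""
    for i in range(num_batches):
        if i % stall_every == 0:
            time.sleep(stall_s)
        yield i


# Demo: batches 0 and 3 should show a visible wait, the others almost none.
for i, _batch, waited in timed_batches(slow_loader(6, 3, 0.05)):
    print(f"Data loading time (batch {i}): {waited:.4f}s")
```

Replacing `slow_loader(...)` with the real data loader leaves the training loop otherwise unchanged.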

@fengjk12138

I think there may be an issue with the settings in the code, because the number of iterations per epoch is too high. With the original DINO, one video (320,000 frames) corresponds to 772 iterations per epoch, but with the author's DoRA code it comes to more than 8000 iterations.

I am using 32 GPUs to reproduce the experiment, and it still requires 500 hours. I think there might be an issue with the author's code. Could the author please respond?

DINO log: [image: QQ_1721027061095]

DORA log: [image: QQ_1721027216438]

@spriya20

I think this is just caused by the batch size. DINO's code uses a default batch size of 64, while for DoRA the authors mention using a batch size of 16 per GPU on A100 GPUs with 80 GB of memory. So you get about 4x as many iterations per epoch with DoRA.

Regardless, data loading does take a lot of time, and it would be nice if the authors gave more information about the hyperparameters and how long it took them to train the model.
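
The batch-size argument can be checked with a few lines of arithmetic. This is only a sketch: the GPU count of 8 is a hypothetical value, since the thread does not say how many GPUs produced either log, and `iterations_per_epoch` is an illustrative helper, not a function from DINO or DoRA.

```python
import math


def iterations_per_epoch(num_samples: int, batch_per_gpu: int, num_gpus: int) -> int:
    """Iterations needed to see every sample once (last partial batch kept)."""
    return math.ceil(num_samples / (batch_per_gpu * num_gpus))


NUM_FRAMES = 320_000  # frame count quoted in this thread
NUM_GPUS = 8          # hypothetical; not stated in the thread

dino_iters = iterations_per_epoch(NUM_FRAMES, batch_per_gpu=64, num_gpus=NUM_GPUS)
dora_iters = iterations_per_epoch(NUM_FRAMES, batch_per_gpu=16, num_gpus=NUM_GPUS)
ratio = dora_iters / dino_iters

print(dino_iters, dora_iters, ratio)  # 625 2500 4.0
```

With the same number of GPUs, the per-GPU batch size alone gives a factor of 4. The reported jump from 772 to more than 8000 iterations is a factor above 10, so if both logs came from the same hardware, batch size would explain only part of the difference.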

@fengjk12138

The batch size is indeed an important factor, but there is also a problem with how the official code obtains the length of an epoch. The length is too large, far exceeding that of DINO trained on images.

@ggbondcxl

Have any of you reproduced the results successfully? My results are well below the numbers reported in the paper after 100 epochs on ImageNet-1k, using NVIDIA DALI to accelerate data processing.

@fengjk12138

On my cluster, I used a total batch size of 256, 400 iterations per epoch, and 100 epochs in total. The data is the video from the author's paper. The final model performance is far worse than what the paper shows; my final loss is 8.504165. Even so, training still took 45 hours on my 4-node cluster.

Bro, his paper uses videos for training. Why are you using ImageNet-1k for training?

@ggbondcxl

Why are you using 400 iterations? Are you not traversing all the data? If you train for 100 epochs on the video, you then need to train the classification head on ImageNet-1k to test its accuracy. Can you tell me how much memory your 4-node cluster has?

@fengjk12138

I can't iterate over the entire video in each epoch; that would take too long (1250 iterations per epoch), and I can't afford 8 days of training. I have 32 GPUs, each with 32 GB of memory.
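
As a quick check, these figures are consistent with the numbers quoted earlier in the thread. The sketch below assumes each sample is one frame and uses the 320,000-frame count and the total batch size of 256 mentioned above; `epoch_coverage` is a hypothetical helper name.

```python
def epoch_coverage(iters_per_epoch: int, global_batch: int, num_samples: int) -> float:
    """Fraction of the dataset drawn in one epoch, capped at 1.0."""
    return min(1.0, iters_per_epoch * global_batch / num_samples)


NUM_FRAMES = 320_000   # frame count quoted in this thread
GLOBAL_BATCH = 256     # total batch size quoted in this thread

# Iterations needed for one full pass over the frames.
full_pass_iters = NUM_FRAMES // GLOBAL_BATCH

# Share of the frames drawn per epoch when stopping at 400 iterations.
coverage = epoch_coverage(400, GLOBAL_BATCH, NUM_FRAMES)

print(full_pass_iters, coverage)
```

Under these assumptions a full pass is 1250 iterations, and a 400-iteration epoch draws roughly a third of the frames.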

@fengjk12138

How long did you train? Did you train on a single video? To speed up training, I pre-extracted the video into image frames and used those as the training dataset (at load time, the required frames are read as consecutive frames), which saved a lot of data loading time.
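
A minimal sketch of the indexing side of this approach follows. It is not the code from the attachment later in the thread. It assumes the frames were already extracted to a directory with zero-padded names (for example `frame_000001.jpg`) so that sorting by name gives temporal order; the class name `FrameClipIndex` and its parameters are hypothetical. Decoding the images and applying the transforms is left out.

```python
import tempfile
from pathlib import Path
from typing import List, Optional


class FrameClipIndex:
    """Map a clip index to the paths of `clip_len` consecutive frames."""

    def __init__(self, frame_dir: str, clip_len: int,
                 step: Optional[int] = None, pattern: str = "*.jpg"):
        if clip_len < 1:
            raise ValueError("clip_len must be >= 1")
        self.clip_len = clip_len
        # By default clips do not overlap: clip i starts at frame i * clip_len.
        self.step = clip_len if step is None else step
        if self.step < 1:
            raise ValueError("step must be >= 1")
        # Zero-padded file names make lexicographic order equal temporal order.
        self.frames: List[Path] = sorted(Path(frame_dir).glob(pattern))

    def __len__(self) -> int:
        usable = len(self.frames) - self.clip_len
        return 0 if usable < 0 else usable // self.step + 1

    def __getitem__(self, i: int) -> List[Path]:
        if not 0 <= i < len(self):
            raise IndexError(i)
        start = i * self.step
        return self.frames[start:start + self.clip_len]


# Demo with empty placeholder files standing in for extracted frames.
with tempfile.TemporaryDirectory() as tmp:
    for n in range(10):
        (Path(tmp) / f"frame_{n:06d}.jpg").touch()
    index = FrameClipIndex(tmp, clip_len=4)
    num_clips = len(index)
    second_clip = [p.name for p in index[1]]

print(num_clips, second_clip)
```

Inside a PyTorch `Dataset`, `__getitem__` would open the returned paths and apply the training transforms. Reading a handful of image files per sample avoids seeking and decoding inside a long video file for every clip.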

@ggbondcxl

I have also processed the video into image frames, which indeed sped things up quite a bit. However, when I used DALI for acceleration with a global batch size of 128, I found that if I choose a configuration of 16*8, it doesn’t fully utilize the entire A800 GPU. Is it the same for you? How is your memory usage? With this setup, it took me just under 4 days, but the results were not ideal. I wonder if it’s because DALI is not aligned with the original transforms. Perhaps we can discuss this further.

@fengjk12138

My per-GPU batch size is 8, which uses less than 20 GB of memory. Maybe the author used the 40 GB version of the A100 with a batch size of 16.

@ggbondcxl

But he is using an 80 GB GPU, which I find very strange. Perhaps a batch size of 128 is crucial for DoRA. When I was testing the original code, both CPU and memory usage were extremely high, and memory would often get maxed out. Also, would it be possible to see your code? It might help us communicate better. If I succeed in reproducing the results using DALI, I will post it in the comments. I still need to confirm a few things for now. Thank you very much.


@fengjk12138

dataloader.txt (attachment)
This is the code from my dataset folder. It is the only part I modified; the rest is the same as the original. Because of GitHub's attachment restrictions, I changed the file extension to .txt.

@fengjk12138

Any follow-up progress, bro?

@ggbondcxl

Not yet, due to special circumstances, but I am curious about your loss curve.

@ggbondcxl

I used a custom DALI pipeline and transformed the video into 4K video frames, then applied the provided transform to the frames. This resulted in the same curves. The DALI pipeline used a global batch size of 32, while the image dataset used a global batch size of 64, producing nearly identical curves. Due to time constraints, I only trained for 10 epochs, and by the tenth epoch, the loss was already 0.6 and 0.5, respectively. Therefore, I am very curious about what your curves look like.

@fengjk12138

That is indeed good progress. After debugging, my model's loss only gets down to 6.402631 after 100 epochs of pre-training.

@ggbondcxl

Perhaps you could provide your complete code and detailed parameters. I will try it on my GPU cluster to see exactly where our differences lie.

@ggbondcxl

Do you have any new developments? My reproduction ended up failing: my DALI version has the same loss curve as the original code, but the accuracy on ImageNet-1k is much lower than the numbers in the paper.

@fengjk12138

I gave up on it and decided to take a different technical route. Judging from the current results, the performance of the checkpoint released with this paper is not outstanding.

4 participants