Fix dataloader hang at the end of an epoch #741
Merged
Description of changes:
This bug fix addresses a common issue where GPU utilization drops significantly during training when the batch size is large (typically 1024 and above). The root cause is resource contention between the dataloader's threads. With large batch sizes, the main iterator thread and the ready_thread consume most of the available resources: the main iterator thread waits on the ready_thread, leaving the prepare_thread too starved to keep up. At the end of an epoch, the prepare_thread then spends a long time catching up to the total number of iterations, causing a noticeable drop in GPU utilization that can last from a few seconds to several minutes, depending on how many iterations remain. This PR resolves the issue by making the prepare_thread exit promptly once the ready_thread completes its execution.
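For illustration, below is a minimal, self-contained sketch of the early-exit pattern described above, using a plain `threading.Event` as the completion signal. The names (`ready_worker`, `prepare_worker`, `ready_done`) and the simulated work are hypothetical stand-ins and do not reflect the dataloader's actual internals; only the exit condition is the point.

```python
import threading
import time

# Hypothetical signal set by the ready thread when the epoch's work is done.
ready_done = threading.Event()

def ready_worker(total_iters: int) -> None:
    """Stand-in for the ready thread handing finished batches to the main iterator."""
    for _ in range(total_iters):
        time.sleep(0.001)  # pretend to serve one batch
    ready_done.set()       # epoch is over; prepare work is no longer needed

def prepare_worker(total_iters: int) -> None:
    """Stand-in for the prepare thread, which lags behind under contention."""
    done = 0
    while done < total_iters:
        if ready_done.is_set():
            # Early exit: without this check the thread keeps "catching up" on
            # leftover iterations after the epoch has already finished, which
            # is what produced the end-of-epoch hang.
            break
        time.sleep(0.002)  # pretend to prepare one sample (slower than ready)
        done += 1

if __name__ == "__main__":
    iters = 1000
    t_ready = threading.Thread(target=ready_worker, args=(iters,))
    t_prepare = threading.Thread(target=prepare_worker, args=(iters,))
    t_ready.start(); t_prepare.start()
    t_ready.join(); t_prepare.join()  # returns promptly instead of draining the backlog
```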
We conducted extensive regression testing and observed no adverse effects on training large LLMs. In addition, we saw a 40% reduction in overall training time, and the throughput drops across epochs were significantly mitigated.
Attached are some example plots illustrating the improvements:
- Streaming regression testing suite
- MLflow test suite for the ResNet model
Additionally, several GitHub issues reported earlier appear to be related to this bug. We encourage users experiencing similar problems to try this fix and provide feedback. Relevant issues include:
- Issue 643
- Issue 686
Issue #, if available:
We hypothesize that this issue and this issue are also relevant, as in both cases several users observed a throughput drop at the end of an epoch.
Merge Checklist:
Put an `x` without space in the boxes that apply. If you are unsure about any checklist, please don't hesitate to ask. We are here to help! This is simply a reminder of what we are going to look for before merging your pull request.

General

Tests
- [ ] I have run `pre-commit` on my change. (Check out the `pre-commit` section of prerequisites.)