
Fix dataloader hang at the end of an epoch #741

Merged
merged 4 commits into from
Aug 22, 2024

Conversation


@XiaohanZhangCMU XiaohanZhangCMU commented Aug 2, 2024

Description of changes:

This bug fix addresses a common issue where GPU utilization drops significantly during training when the batch size is large (typically 1024 or above). The root cause was identified as resource contention between the dataloader's threads. With large batch sizes, the main iterator thread and the ready_thread consume most of the available resources: the main iterator thread blocks waiting on the ready_thread, which leaves the prepare_thread too starved to keep up. At the end of an epoch, the prepare_thread then takes a long time to catch up to the total number of iterations. This causes a noticeable drop in GPU utilization that can last from a few seconds to several minutes, depending on how many iterations remain to be caught up. This PR resolves the issue by ensuring that the prepare_thread exits promptly once the ready_thread completes its execution.
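The fix described above can be sketched as follows. This is a minimal illustration of the exit-promptly pattern, not the actual code in streaming/base/dataset.py; the EpochIterator class, the sample count, and the timings are invented for demonstration. The key idea is that the prepare thread checks an event set by the ready thread on every iteration and returns immediately, instead of looping until it has processed the full iteration count.

```python
import threading
import time

class EpochIterator:
    """Toy model of the dataloader's epoch threads (hypothetical names)."""

    def __init__(self, total_samples: int):
        self.total = total_samples
        self.prepared = 0
        # Set by the ready thread once the epoch's batches have all been served.
        self.epoch_done = threading.Event()

    def _prepare_thread(self):
        # Before the fix: this loop would run until self.prepared == self.total,
        # even after the epoch had already ended, stalling the next epoch.
        while self.prepared < self.total:
            if self.epoch_done.is_set():
                return  # the fix: exit promptly once the ready thread is done
            self.prepared += 1
            time.sleep(0)  # yield the GIL, simulating contended work

    def _ready_thread(self):
        # Simulate the ready thread finishing the epoch early.
        time.sleep(0.01)
        self.epoch_done.set()

    def run_epoch(self) -> int:
        threads = [
            threading.Thread(target=self._prepare_thread),
            threading.Thread(target=self._ready_thread),
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return self.prepared
```

With a large sample count, the prepare thread in this sketch stops within one iteration of the event being set, rather than grinding through the remaining iterations while the GPU sits idle.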

We conducted extensive regression testing and observed no adverse effects when training large LLMs. Better yet, we saw a 40% reduction in overall training time, and the throughput drops across epochs were significantly mitigated.

Attached are some example plots illustrating the improvements.

streaming regression testing suite:

  • streaming-regression-tests-runner-0rMyg4
  • streaming-regression-tests-runner-aVhR9e

mlflow test suite for the resnet model

Additionally, several GitHub issues reported earlier seem to be related to this bug. We encourage users experiencing similar problems to try this fix and provide feedback. Relevant issues include:
Issue 643
Issue 686

[Two screenshots attached (Aug 12, 2024) illustrating the improvements]

Issue #, if available:

We hypothesize that the issues linked above are also related, since in both cases several users observed a throughput drop at the end of an epoch.

Merge Checklist:

Put an x (without spaces) in the boxes that apply. If you are unsure about any checklist item, please don't hesitate to ask. We are here to help! This is simply a reminder of what we will look for before merging your pull request.

General

  • I have read the contributor guidelines
  • This is a documentation change or typo fix. If so, skip the rest of this checklist.
  • I certify that the changes I am introducing will be backward compatible, and I have discussed concerns about this, if any, with the MosaicML team.
  • I have updated any necessary documentation, including README and API docs (if appropriate).

Tests

  • I ran pre-commit on my change. (check out the pre-commit section of prerequisites)
  • I have added tests that prove my fix is effective or that my feature works (if appropriate).
  • I ran the tests locally to make sure they pass. (check out testing)
  • I have added unit and/or integration tests as appropriate to ensure backward compatibility of the changes.


@snarayan21 snarayan21 left a comment


Thanks for the extensive testing! Can you add the graphs / run names / etc to the PR description, and also add any new relevant findings to the PR description, and make sure it's accurate? This is probably going to be a PR we reference later, so just want to make sure it's properly documented.

There are also a few outstanding github issues that will be addressed by this, can you link those too?

Then I'm good w merging!

@XiaohanZhangCMU XiaohanZhangCMU force-pushed the xiaohan/multithread_bug_fix branch from d046485 to 1b760da Compare August 21, 2024 18:54

@snarayan21 snarayan21 left a comment


lgtm!! just a minor nit

Review comment on streaming/base/dataset.py (outdated, resolved)
@snarayan21 snarayan21 enabled auto-merge (squash) August 22, 2024 01:47
@snarayan21 snarayan21 merged commit fac1852 into main Aug 22, 2024
7 checks passed
@snarayan21 snarayan21 deleted the xiaohan/multithread_bug_fix branch August 22, 2024 01:53
@elbamos

elbamos commented Aug 22, 2024

Thank you for this - I saw an immediate throughput improvement of 20%, and it no longer hangs at the end of each epoch.
