Fix dataloader hang at the end of an epoch #741
Merged
Description of changes:
This bug fix addresses a common issue where GPU utilization drops significantly during training when the batch size is large (typically 1024 and above). The root cause is resource contention between the dataloader's threads. With large batch sizes, the main iterator thread and the ready_thread consume most of the available resources: the main iterator thread waits on the ready_thread, leaving the prepare_thread too starved to keep up. At the end of an epoch, the prepare_thread then spends a long time catching up to the total number of iterations, causing a noticeable drop in GPU utilization that can last from a few seconds to several minutes, depending on how many iterations remain. This PR resolves the issue by making the prepare_thread exit promptly once the ready_thread completes its execution.
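For illustration, below is a minimal, self-contained sketch of the early-exit pattern described above, using a plain `threading.Event` as the completion signal. The names (`ready_worker`, `prepare_worker`, `ready_done`) and the simulated work are hypothetical stand-ins and do not reflect the dataloader's actual internals; only the exit condition is the point.

```python
import threading
import time

# Hypothetical signal set by the ready thread when the epoch's work is done.
ready_done = threading.Event()

def ready_worker(total_iters: int) -> None:
    """Stand-in for the ready thread handing finished batches to the main iterator."""
    for _ in range(total_iters):
        time.sleep(0.001)  # pretend to serve one batch
    ready_done.set()       # epoch is over; prepare work is no longer needed

def prepare_worker(total_iters: int) -> None:
    """Stand-in for the prepare thread, which lags behind under contention."""
    done = 0
    while done < total_iters:
        if ready_done.is_set():
            # Early exit: without this check the thread keeps "catching up" on
            # leftover iterations after the epoch has already finished, which
            # is what produced the end-of-epoch hang.
            break
        time.sleep(0.002)  # pretend to prepare one sample (slower than ready)
        done += 1

if __name__ == "__main__":
    iters = 1000
    t_ready = threading.Thread(target=ready_worker, args=(iters,))
    t_prepare = threading.Thread(target=prepare_worker, args=(iters,))
    t_ready.start(); t_prepare.start()
    t_ready.join(); t_prepare.join()  # returns promptly instead of draining the backlog
```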
We conducted extensive regression testing and observed no adverse effects on training large LLMs. In addition, we saw a 40% reduction in overall training time, and the throughput drops across epochs were significantly mitigated.
Attached are some example plots illustrating the improvements:
- Streaming regression testing suite
- MLflow test suite for the ResNet model
Additionally, several GitHub issues reported earlier appear to be related to this bug. We encourage users experiencing similar problems to try this fix and provide feedback. Relevant issues include:
- Issue 643
- Issue 686
Issue #, if available:
We hypothesize that this issue and this issue are also relevant, as in both cases several users observed a throughput drop at the end of an epoch.
Merge Checklist:
Put an `x` without space in the boxes that apply. If you are unsure about any checklist, please don't hesitate to ask. We are here to help! This is simply a reminder of what we are going to look for before merging your pull request.

General

Tests
- [ ] I have run `pre-commit` on my change. (Check out the `pre-commit` section of prerequisites.)