Suboptimal usage of 8xH100 GPUs - Streaming dataloader speed significantly fluctuates across batches #686
Comments
Thanks for raising this issue @VSehwag! We have seen drops in throughput between epochs, but given that your data resides locally, the time between epochs shouldn't be very long. Could you make sure that persistent_workers=True and drop_last=True are set on your dataloader? As for throughput drops during training, what is the memory usage of your GPUs during training? PyTorch has a known issue (a probable memory leak), so disabling automatic garbage collection and only performing it once every N steps has resolved this sort of throughput issue in the past. You can also increase the StreamingDataset predownload value. Hope this helps.
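For concreteness, here is a minimal sketch of the "disable automatic GC, collect every N steps" idea combined with persistent workers and drop_last. The dataset, batch size, and interval are placeholders for illustration, not the poster's actual configuration:

```python
import gc

import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset/loader; the real run would use the StreamingDataset-backed
# dataloader with persistent_workers=True and drop_last=True as discussed above.
dataset = TensorDataset(torch.randn(1024, 3, 64, 64))
train_dataloader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=2,
    persistent_workers=True,  # keep workers alive across epochs
    drop_last=True,
)

# Turn off automatic garbage collection and collect manually every N steps.
gc.disable()
GC_INTERVAL = 100

for step, (batch,) in enumerate(train_dataloader):
    ...  # forward / backward / optimizer step would go here
    if step > 0 and step % GC_INTERVAL == 0:
        gc.collect()

gc.enable()
```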
Thanks for taking a look into it. I am currently using persistent workers and dropping the last batch, so both settings are already in place. The memory usage of each GPU is ~35/80 GB. I am not sure if any memory leak is happening, as we don't see GPU memory increasing over steps. Just to clarify, the data fully resides locally (remote is set to None), so the predownload setting, cache limits, and the other bottlenecks surfaced by the simulator aren't applicable in this case.
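For reference, a fully local Streaming setup like the one described here roughly looks as follows; paths, batch size, and worker counts are illustrative assumptions, not the poster's actual values:

```python
from streaming import StreamingDataset
from torch.utils.data import DataLoader

# Fully local setup: remote=None, so Streaming reads shards straight from disk
# and never downloads anything.
dataset = StreamingDataset(
    local='/data/my_mds_dataset',  # directory containing index.json and the shards
    remote=None,
    shuffle=True,
    batch_size=256,  # per-device batch size, used for sample partitioning
)

dataloader = DataLoader(
    dataset,
    batch_size=256,
    num_workers=8,
    persistent_workers=True,
    drop_last=True,
)
```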
@VSehwag Ah right, I forgot your data was local. Since you're using Composer, you can refer to the scheduled GC callback we have in LLM Foundry, which should work for your training setup as well. Here's the documentation with info on how to enable callbacks. Even though GPU memory may not seem to increase, we have seen that performing garbage collection only at specified intervals resolves throughput fluctuation/degradation over the course of training. To narrow down whether this issue is with Streaming, could you try training without Streaming and check the throughput?
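A roughly equivalent scheduled-GC callback can also be written directly against Composer's Callback API. This is only a sketch mimicking the idea; the class name and interval are made up here and this is not LLM Foundry's actual implementation:

```python
import gc

from composer.core import Callback, State
from composer.loggers import Logger


class PeriodicGC(Callback):
    """Disable automatic GC and collect only every `interval` batches."""

    def __init__(self, interval: int = 100):
        self.interval = interval

    def fit_start(self, state: State, logger: Logger) -> None:
        gc.disable()

    def batch_end(self, state: State, logger: Logger) -> None:
        if state.timestamp.batch.value % self.interval == 0:
            gc.collect()

    def fit_end(self, state: State, logger: Logger) -> None:
        gc.enable()


# Passed to the Trainer alongside the rest of the config, e.g.:
# trainer = Trainer(model=..., train_dataloader=..., callbacks=[PeriodicGC(interval=100)])
```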
@VSehwag Wondering if you have gotten a chance to try @snarayan21's suggestion?
Hey @VSehwag -- @XiaohanZhangCMU recently merged the above fix that addresses inter-epoch hangs. I'm wondering if it might also be affecting your runs. We're going to be cutting a new release soon, but in the meantime, feel free to retry from main!
Thanks a lot for following up on the thread. Let me pull the latest changes and check if it resolves the issue.
Hey @VSehwag, wanted to follow up here to see if the fix from the recent release works for your workloads.
Setup
This issue is related to #643 but concerns a more subtle problem with streaming datasets. Over the course of training, we observe that the streaming dataloader speed randomly dips. This happens predominantly between epochs but also at random steps. It holds for all model sizes we tested (from a single-layer model to a 1B-parameter model).
Our current setup includes:
Our overall setup is the same as the diffusion training setup from https://github.com/mosaicml/diffusion, i.e., we launch the script using:
composer run.py --config-path yamls/hydra-yamls --config-name SD-2-base-256.yaml
The epoch size is 652 batches, which is where the dataloader gets stuck and takes a long time. However, the drop in throughput also appears at random steps in the middle of an epoch. This test was done on a relatively small dataset with 1.3M (652x2048) samples.
We also tested two other datasets with on-disk sizes of 1 TB and 2 TB (and corresponding epoch sizes of 2k and 5k batches) and observed the same drops in throughput. The plot on the right shows that the dataloader often hangs for more than a minute. Two subtle issues are happening here:
We have tried a prefetch factor of up to 8 (with 8 workers) and 2 (with 2 workers) for both datasets and didn't observe any resolution of the drops. We also ablated the FSDP mode between full_shard and no_shard, but the dips in throughput persist. We used a 140M-parameter model, but the dips are also present with another 1B model. We created the MDS dataset by processing our raw data with 8 processes and then merging the index.json files.
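For context, a per-process MDS conversion like this typically looks roughly as sketched below; the column names, encodings, and paths are illustrative assumptions, not our actual pipeline:

```python
from streaming import MDSWriter

# Each conversion process writes its own shard directory; the per-directory
# index.json files are merged into a single index afterwards (e.g. with
# streaming's merge_index utility), as described above.
columns = {'image': 'jpeg', 'caption': 'str'}  # illustrative schema


def write_partition(rank: int, samples) -> None:
    out_dir = f'/data/my_mds_dataset/part-{rank}'
    with MDSWriter(out=out_dir, columns=columns, compression='zstd') as writer:
        for sample in samples:
            writer.write(sample)  # sample: dict with keys matching `columns`
```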
Somehow the issue is less severe in single-GPU training, where we only observe the drop in throughput at the end of an epoch (2k batches). In all previous tests we had used 8 GPUs with DDP (with the grad_sharding FSDP config). Note that we don't observe a perfectly linear speedup (8.5 batch/s on 8 GPUs vs 1.4 batch/s on 1 GPU, i.e., roughly 6x rather than 8x), which indicates an I/O bound and could contribute to the worse dips in the multi-GPU setting.
Overall there are three puzzling questions: