Download optimal for device_per_stream batching method. #726

Open
huxuan opened this issue Jul 16, 2024 · 4 comments

Comments

@huxuan
Contributor

huxuan commented Jul 16, 2024

Background:

Our data is quite large and varies in size. With a size limit of 100 MB, there will only be 8 or 9 samples per shard. I have noticed that many duplicate shards are downloaded on different nodes even with shuffle disabled. I would like your suggestions on how to avoid duplicate shards.

Additional Information that may be related:

batch_size: 4
shuffle: False
sampling_granularity: 1
num_canonical_nodes: Defaults to the number of physical nodes, which is 4 in our current case
batching_method: device_per_stream

Thoughts

When shuffle is disabled, I assume the shards can be evenly divided among different nodes. Perhaps we could implement something like sample_limit instead of size_limit and achieve that with proper configuration?
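
To make the sample_limit idea concrete, here is a minimal sketch in plain Python. It does not use Streaming's API: `shard_by_size` and `shard_by_count` are hypothetical helpers, and the sample sizes are made up. It only illustrates that a byte limit gives an uneven number of samples per shard when sample sizes vary, while a count limit gives a fixed number that could be chosen as a multiple of batch_size.

```python
from typing import List


def shard_by_size(sample_sizes: List[int], size_limit: int) -> List[List[int]]:
    """Group sample indices into shards, closing a shard before it would exceed size_limit."""
    shards: List[List[int]] = []
    current: List[int] = []
    current_total = 0
    for idx, size in enumerate(sample_sizes):
        if current and current_total + size > size_limit:
            shards.append(current)
            current, current_total = [], 0
        current.append(idx)
        current_total += size
    if current:
        shards.append(current)
    return shards


def shard_by_count(num_samples: int, sample_limit: int) -> List[List[int]]:
    """Group sample indices into shards holding at most sample_limit samples each."""
    indices = list(range(num_samples))
    return [indices[i:i + sample_limit] for i in range(0, num_samples, sample_limit)]


# Made-up sample sizes in MB (assumption, not measured data).
sizes_mb = [12] * 8 + [11] * 9 + [12] * 7

by_size = shard_by_size(sizes_mb, size_limit=100)
by_count = shard_by_count(len(sizes_mb), sample_limit=8)

print("samples per shard with a size limit: ", [len(s) for s in by_size])
print("samples per shard with a sample limit:", [len(s) for s in by_count])
```

With a fixed sample count per shard, shard boundaries can be made to coincide with batch boundaries. Whether that alone removes the duplicate downloads depends on how the batching method assigns samples to nodes.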

@snarayan21
Collaborator

Couple things:

  • How are you verifying that duplicate shards are being downloaded between nodes? Streaming explicitly partitions shard files between nodes so the degree of duplication should be pretty small
  • Is there a reason why you're using device_per_stream batching? Are all your samples homogeneous?
  • Related to the question above, are all your shards the same / similar size?
  • To clarify, do you see duplication both with shuffle = True and shuffle = False? A high number of duplicate shard downloads should not happen, regardless of the shuffle setting.

While it's possible that device_per_stream batching is causing more duplicated shard downloads than necessary (it's a newly added batching method and may have some stuff to iron out), the rest of your settings seem pretty standard. Since one of Streaming's main features is that it partitions up your shard downloads among your nodes, I'd be very surprised if there were indeed many duplicate shards being downloaded.

@huxuan
Contributor Author

huxuan commented Jul 16, 2024

  • How are you verifying that duplicate shards are being downloaded between nodes? Streaming explicitly partitions shard files between nodes so the degree of duplication should be pretty small

Yes, I can confirm that not all the shards are downloaded on each node, but there are many duplicates. I even checked the shard sizes (in bytes).
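
A check like this can be made reproducible with a small script. The sketch below is an assumption about how one might do it, not a Streaming utility: `find_duplicate_shards` is a hypothetical helper, and the node names and shard file names are invented. It takes the set of shard files found in each node's local cache and reports every shard that appears on more than one node.

```python
from collections import defaultdict
from typing import Dict, List, Set


def find_duplicate_shards(shards_by_node: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Map each shard found on more than one node to the sorted list of nodes holding it."""
    nodes_by_shard: Dict[str, List[str]] = defaultdict(list)
    for node, shards in shards_by_node.items():
        for shard in shards:
            nodes_by_shard[shard].append(node)
    return {
        shard: sorted(nodes)
        for shard, nodes in nodes_by_shard.items()
        if len(nodes) > 1
    }


# Invented listings, e.g. collected by listing each node's local cache directory.
listing = {
    "node0": {"shard.00000.mds", "shard.00001.mds", "shard.00002.mds"},
    "node1": {"shard.00002.mds", "shard.00003.mds"},
    "node2": {"shard.00003.mds", "shard.00004.mds", "shard.00005.mds"},
}

duplicates = find_duplicate_shards(listing)
for shard in sorted(duplicates):
    print(shard, "is present on", duplicates[shard])
```

Comparing file sizes or checksums on top of the file names would rule out partially downloaded files being counted as duplicates.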

  • Is there a reason why you're using device_per_stream batching? Are all your samples homogeneous?

Not all of our data is homogeneous, so we pack each kind of data into its own stream. For the current test, though, only one stream is configured.

  • Related to the question above, are all your shards the same / similar size?

Yes, all the shards are similar in size, with the size limit configured to 100 MB.

  • To clarify, do you see duplication both with shuffle = True and shuffle = False? A high number of duplicate shard downloads should not happen, regardless of the shuffle setting.

We currently only use shuffle=False, for maximum performance.

While it's possible that device_per_stream batching is causing more duplicated shard downloads than necessary (it's a newly added batching method and may have some stuff to iron out), the rest of your settings seem pretty standard. Since one of Streaming's main features is that it partitions up your shard downloads among your nodes, I'd be very surprised if there were indeed many duplicate shards being downloaded.

I suspect it might be related to the number of samples in each shard (8 or 9), which is not always a multiple of the batch_size (4). Or I may have misconfigured something else, but I have not been able to find it. I will try the default batching_method next. Please let me know if there is anything else I can try.
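
The effect of shard boundaries on duplicate downloads can be explored with a toy model. The sketch below is not Streaming's partitioning algorithm, and the two assignment rules (`contiguous` and `round_robin_batches`) are assumptions chosen only for illustration. Given shard sizes of 8 or 9 samples and a rule mapping each sample to a node, it counts how many shard downloads exceed one download per shard.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import Callable, Dict, List, Set


def count_extra_downloads(
    shard_sizes: List[int],
    num_nodes: int,
    node_of_sample: Callable[[int], int],
) -> int:
    """Count shard downloads beyond one per shard, given a sample-to-node assignment."""
    ends = list(accumulate(shard_sizes))  # ends[i] is the first sample index after shard i
    needed: Dict[int, Set[int]] = {node: set() for node in range(num_nodes)}
    for sample in range(ends[-1]):
        shard = bisect_right(ends, sample)  # index of the shard containing this sample
        needed[node_of_sample(sample)].add(shard)
    total_downloads = sum(len(shards) for shards in needed.values())
    return total_downloads - len(shard_sizes)


num_nodes = 4
batch_size = 4
shard_sizes = [8, 9] * 8  # 16 shards with 8 or 9 samples each (made-up layout)
num_samples = sum(shard_sizes)


def contiguous(sample: int) -> int:
    """Assumed rule 1: each node reads one contiguous block of samples."""
    return sample * num_nodes // num_samples


def round_robin_batches(sample: int) -> int:
    """Assumed rule 2: consecutive batches are dealt to nodes in turn."""
    return (sample // batch_size) % num_nodes


extra_contiguous = count_extra_downloads(shard_sizes, num_nodes, contiguous)
extra_round_robin = count_extra_downloads(shard_sizes, num_nodes, round_robin_batches)
print("extra downloads, contiguous blocks:  ", extra_contiguous)
print("extra downloads, round-robin batches:", extra_round_robin)
```

In this toy model the duplication depends on how samples are mapped to nodes rather than on the shard size alone, which is one reason comparing batching methods is a useful next step.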

@huxuan
Contributor Author

huxuan commented Jul 20, 2024

A quick update: after switching to the default batching method, there are no obvious duplicate shards anymore, so the duplication seems to be caused by the device_per_stream batching method. I may follow up when there is further progress on the investigation.

@snarayan21
Collaborator

That makes sense! Thanks for investigating. device_per_stream is a newer batching method and so is not completely download-optimal. Some download optimization has been implemented to prevent massive levels of duplication, but as you're observing, shard downloads are not completely de-duplicated.

Will keep this issue open as we improve this in the future.

@huxuan changed the title from "[Question] Need suggestions on avoid duplicate shards" to "Download optimal for device_per_stream batching method." on Jul 27, 2024