Optimize downloads for the device_per_stream batching method #726
Comments
A couple of things:
While it's possible that …
Yes, I can confirm that not all the shards are downloaded on each node, but there are many duplicate ones. I even checked the shard sizes (in bytes).
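(For illustration, here is a minimal sketch of that kind of cross-node check, assuming each node's local shard cache has been copied to one machine for inspection; the paths and cache layout are hypothetical.)

```python
import os

# Hypothetical locations where each node's local shard cache was gathered.
node_caches = ['/inspect/node0-cache', '/inspect/node1-cache']

def list_shards(root):
    """Map each .mds shard (relative path) under root to its size in bytes."""
    shards = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if name.endswith('.mds'):
                path = os.path.join(dirpath, name)
                shards[os.path.relpath(path, root)] = os.path.getsize(path)
    return shards

per_node = [list_shards(root) for root in node_caches]
duplicates = set(per_node[0]) & set(per_node[1])
print(f'{len(duplicates)} shards were downloaded on both nodes')

# Comparing byte sizes as well (as done above) guards against distinct
# shards that merely happen to share a file name.
for rel in sorted(duplicates):
    assert per_node[0][rel] == per_node[1][rel], f'size mismatch for {rel}'
```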
Not all of our data is homogeneous, so we pack the same kind of data into separate streams. For the current test, however, only one stream is configured.
Yes, all the shards are of similar size, with size_limit configured to 100 MB.
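(For reference, a minimal sketch of how such a shard size limit is typically configured with streaming's MDSWriter; the output path, schema, and samples below are made up for illustration.)

```python
from streaming import MDSWriter

# Hypothetical schema and samples, only to show the writer configuration.
columns = {'id': 'str', 'data': 'bytes'}
samples = [{'id': str(i), 'data': bytes(1024)} for i in range(100)]

# size_limit caps each shard at ~100 MB; with large, variable-size samples
# this works out to only 8 or 9 samples per shard in our case.
with MDSWriter(out='/tmp/mds-dataset', columns=columns, size_limit='100mb') as writer:
    for sample in samples:
        writer.write(sample)
```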
We only use the device_per_stream batching method.
I suspect it might be related to the number of samples per shard (8 or 9) not being evenly divisible by the batch_size (4). Or perhaps I misconfigured something else, but I failed to find it. I will try the default batching method.
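(For context, a minimal sketch of the two dataset configurations being compared, assuming a streaming version that supports batching_method='device_per_stream'; the remote and local paths are placeholders.)

```python
from torch.utils.data import DataLoader
from streaming import StreamingDataset

# Placeholder paths; substitute your actual remote store and local cache.
remote = 's3://my-bucket/mds-dataset'
local = '/tmp/mds-cache'

# Configuration under test: device_per_stream batching with shuffling off.
dataset = StreamingDataset(
    remote=remote,
    local=local,
    shuffle=False,
    batch_size=4,
    batching_method='device_per_stream',
)

# The default batching method ('random') for comparison:
# dataset = StreamingDataset(remote=remote, local=local, shuffle=False,
#                            batch_size=4, batching_method='random')

loader = DataLoader(dataset, batch_size=4)
```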
A quick update: after switching to the default batching method, there are no obvious duplicate shards anymore, so it seems this was caused by the device_per_stream batching method.
That makes sense! Thanks for investigating. We'll keep this issue open as we improve this in the future.
Background:
Our data is quite large and varies in size. With a size limit of 100 MB, there are only 8 or 9 samples per shard. I have noticed that many duplicate shards are downloaded on different nodes even with `shuffle` disabled. I would like your suggestions on how to avoid downloading duplicate shards.
Additional information that may be related:
Thoughts:
When `shuffle` is disabled, I assume the shards can be evenly divided among the nodes. Perhaps we could implement something like a sample_limit instead of size_limit and achieve that with proper configuration?
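(To make the sample_limit idea concrete, here is a rough sketch of one way to approximate it with the existing writer: give each fixed-size chunk of samples its own writer, so each writer emits exactly one shard with exactly that many samples. The paths and schema are hypothetical, and the per-part index files would still need to be merged, or each part passed as its own stream, before training.)

```python
import os
from streaming import MDSWriter

# Hypothetical schema and samples, only to illustrate the chunking idea.
columns = {'id': 'str', 'data': 'bytes'}
samples = [{'id': str(i), 'data': bytes(1024)} for i in range(64)]

samples_per_shard = 8  # chosen to be evenly divisible by batch_size=4
out_root = '/tmp/mds-fixed-shards'

for start in range(0, len(samples), samples_per_shard):
    chunk = samples[start:start + samples_per_shard]
    part_dir = os.path.join(out_root, f'part-{start // samples_per_shard:05d}')
    # size_limit=None disables size-based splitting, so this writer emits a
    # single shard holding exactly len(chunk) samples.
    with MDSWriter(out=part_dir, columns=columns, size_limit=None) as writer:
        for sample in chunk:
            writer.write(sample)
```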