Estimate total shards at the beginning of data conversion #742

Open
abhijithneilabraham opened this issue Aug 3, 2024 · 1 comment
Labels
enhancement New feature or request

Comments

@abhijithneilabraham

🚀 Feature Request

The number of shards that will be created, estimated from size_limit and the total data size, would be a useful metric to report at the start of data conversion.
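
As a rough illustration of the estimate, here is a minimal sketch that assumes the total encoded size of the dataset is already known in bytes. The function name `estimate_num_shards` and both parameters are hypothetical and not part of the library's API. The real shard count may differ, since samples are not split across shards and per-shard overhead and compression are ignored here.

```python
def estimate_num_shards(total_bytes: int, size_limit: int) -> int:
    """Estimate how many shards a conversion would produce.

    Hypothetical helper, not part of any library API.
    total_bytes: assumed total size of the encoded samples, in bytes.
    size_limit: maximum shard size in bytes.
    """
    if size_limit <= 0:
        raise ValueError("size_limit must be a positive number of bytes")
    if total_bytes <= 0:
        return 0
    # Ceiling division: a partially filled final shard still counts as one.
    return (total_bytes + size_limit - 1) // size_limit


if __name__ == "__main__":
    # Example: about 1 GiB of data with a 64 MiB shard limit.
    print(estimate_num_shards(1 << 30, 1 << 26))
```

The estimate is only as good as the `total_bytes` figure, which would have to come from the caller because the writer sees samples one at a time.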

Motivation

If other features, such as resuming data conversion, are implemented in the future, they could build on this estimate.

[Optional] Implementation

Additional context

@abhijithneilabraham abhijithneilabraham added the enhancement New feature or request label Aug 3, 2024
@snarayan21
Collaborator

Hey @abhijithneilabraham thanks for this issue! How would you propose finding the dataset size ahead of time? MDSWriter currently has no knowledge of how large your raw dataset files are or how it is being used to iterate over your original dataset...

2 participants