
Error in Streaming Dataset Decompression in Distributed Setting #863

Open
jasonkrone opened this issue Jan 15, 2025 · 2 comments
Labels
bug Something isn't working

Comments


jasonkrone commented Jan 15, 2025

Environment

Enroot image built from the nvcr.io/nvidia/pytorch:24.11-py3 Docker image.

  • OS: Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.5.1 (Ubuntu 22.04) 20241208
  • Hardware (GPU, or instance type): Two nodes with 8xH100 each

Issue

In the os.rename(tmp_filename, raw_filename) call inside the _decompress_shard_part function of the Stream class, I'm getting the error:

FileNotFoundError: [Errno 2] No such file or directory: '/data/open-web-math/dev/shard-1-of-1-part-2-of-3/shard.00000.mds.tmp' -> '/data/open-web-math/dev/shard-1-of-1-part-2-of-3/shard.00000.mds'

Further Details

My data is stored on FSx and loaded into the streaming dataset via the local option. When I check, /data/open-web-math/dev/shard-1-of-1-part-2-of-3/shard.00000.mds exists and /data/open-web-math/dev/shard-1-of-1-part-2-of-3/shard.00000.mds.tmp does not.

The issue is non-deterministic and only occurs occasionally (e.g., on a recent run it happened three times at the start, for different .mds files, and then disappeared).
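
To illustrate the kind of race that would leave exactly this state behind (final .mds present, .tmp gone), here is a hypothetical two-process sketch. This is not Streaming's code, and being a race it only fails some of the time:

```python
# Hypothetical sketch of the suspected race: two workers decompress the same
# shard part to the same .tmp path and both try to rename it into place.
import multiprocessing as mp
import os
import tempfile


def decompress_and_rename(workdir: str) -> None:
    tmp = os.path.join(workdir, 'shard.00000.mds.tmp')
    raw = os.path.join(workdir, 'shard.00000.mds')
    # Stand-in for the real decompression step.
    with open(tmp, 'wb') as f:
        f.write(b'decompressed bytes')
    # If the other worker renames (and thereby removes) the .tmp file between
    # our write above and this call, we get FileNotFoundError -- the same
    # error as in this issue, with the final file left in place.
    os.rename(tmp, raw)


if __name__ == '__main__':
    workdir = tempfile.mkdtemp()
    procs = [mp.Process(target=decompress_and_rename, args=(workdir,)) for _ in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```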

Attempted Fix

I tried increasing the retry count from 7 to 20, but that didn't solve it.
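
A more targeted local patch might be to tolerate losing the race at the rename itself, treating "final file already in place" as success. Rough sketch of that idea (hypothetical helper, not the library's code):

```python
import os


def rename_if_needed(tmp_filename: str, raw_filename: str) -> None:
    """Move the decompressed shard into place, tolerating another process
    having already done so (hypothetical workaround)."""
    try:
        os.rename(tmp_filename, raw_filename)
    except FileNotFoundError:
        # The .tmp file is gone. If the final file exists, another process
        # finished this shard part first and there is nothing left to do;
        # otherwise something genuinely went wrong, so re-raise.
        if not os.path.exists(raw_filename):
            raise
```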

To reproduce

Working on a repro script; it may take a bit, since my setup is pretty involved.

Expected behavior

Data should be decompressed without any error.

Ideas on cause

Initially, I thought the error was due to a race condition, but looking into StreamingDataset I see there are file locks meant to prevent exactly that. So now I'm stumped about what's causing the problem.
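
For context, the kind of serialization I'd expect those locks to enforce looks roughly like this (a sketch using the third-party filelock package; not necessarily how StreamingDataset implements it, just the pattern I assumed rules out the race):

```python
import os

from filelock import FileLock  # third-party package: pip install filelock


def decompress_shard_part_with_lock(tmp_filename: str, raw_filename: str) -> None:
    """Hypothetical sketch: hold a per-shard lock across decompress + rename,
    so at most one process ever touches the .tmp file for a given shard."""
    with FileLock(raw_filename + '.lock'):
        if os.path.exists(raw_filename):
            # Another process already finished this shard part.
            return
        # Stand-in for the real decompression into the .tmp file.
        with open(tmp_filename, 'wb') as f:
            f.write(b'decompressed bytes')
        os.rename(tmp_filename, raw_filename)
```

If the rename is already guarded like this, I don't see how two processes could race on the same .tmp file.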

@jasonkrone jasonkrone added the bug Something isn't working label Jan 15, 2025
jasonkrone (Author) commented:

@snarayan21 - hoping you may have a suggestion on how to fix this!

ethantang-db (Contributor) commented:

I think this might be related to the same issue as #824, FYI.
