FileNotFoundError: [Errno 2] No such file or directory: '/data/open-web-math/dev/shard-1-of-1-part-2-of-3/shard.00000.mds.tmp' -> '/data/open-web-math/dev/shard-1-of-1-part-2-of-3/shard.00000.mds'
Environment
Enroot image built off the nvcr.io/nvidia/pytorch:24.11-py3 docker image.
Issue
The os.rename(tmp_filename, raw_filename) call inside the _decompress_shard_part function of the Stream class raises the FileNotFoundError shown above.
Further Details
My data is stored on FSx and loaded into the streaming dataset via the local option. When I check, /data/open-web-math/dev/shard-1-of-1-part-2-of-3/shard.00000.mds exists and /data/open-web-math/dev/shard-1-of-1-part-2-of-3/shard.00000.mds.tmp does not.
The issue appears to be non-deterministic and only occurs sometimes (e.g., on a recent run it happened 3x at the start for different .mds files and then disappeared).
Attempted Fix
I tried increasing retry from 7 to 20, but that didn't solve it.
To reproduce
Working on a repro script; it may take a while since my setup is pretty involved.
Expected behavior
Data should be decompressed without any error.
Ideas on cause
Initially, I thought the error was due to a race condition, but looking into StreamingDataset I see there are file locks to prevent that. So now I’m totally stumped on what’s causing the problem.
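The observed end state (the .mds file present, the .mds.tmp file gone) is what you would see if something else had already completed the same rename. Under that assumption, one possible workaround is a rename that tolerates losing such a race. The sketch below is only an illustration: `rename_if_needed` is a hypothetical helper, not part of the streaming library, and it hides the symptom without explaining why the locks did not prevent it.

```python
import os


def rename_if_needed(tmp_filename: str, raw_filename: str) -> bool:
    """Rename tmp_filename to raw_filename, tolerating a lost race.

    Hypothetical helper (not a streaming API). Returns True if this call
    performed the rename, False if the destination was already in place.
    """
    try:
        os.rename(tmp_filename, raw_filename)
        return True
    except FileNotFoundError:
        # The source is missing. If the destination already exists, assume
        # another process finished the rename first and treat it as success.
        if os.path.exists(raw_filename):
            return False
        # Neither file exists: this is a real failure, so re-raise it.
        raise
```

Swapping this in for the bare os.rename call would be a local patch for testing the race hypothesis. Logging each time it returns False would show how often the rename had already been done by the time this process attempted it.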