file corruption on Windows #83

Open
bryanlarsen opened this issue Jun 18, 2024 · 8 comments
Labels
bug Something isn't working

Comments

@bryanlarsen

A significant number of users are reporting file corruption on Windows. It's a write-heavy workload, and investigation reveals that the content file is filled with NULs and the last line of the index file has a run of NULs appended to it. On read, an IntegrityError is, unsurprisingly, returned.

This occurs even when writing content that already exists in the cache.
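For anyone trying to confirm the same symptom on their own cache files, here is a minimal sketch of the check described above. The helper names `is_all_nul` and `trailing_nuls` are hypothetical and not part of cacache; they only inspect a byte buffer you have already read from disk.

```rust
/// Returns true when the buffer is non-empty and every byte is NUL (0x00).
/// Hypothetical helper for inspecting a content file's bytes.
pub fn is_all_nul(bytes: &[u8]) -> bool {
    !bytes.is_empty() && bytes.iter().all(|&b| b == 0)
}

/// Counts the NUL bytes at the very end of the buffer.
/// Hypothetical helper for inspecting the last line of an index file.
pub fn trailing_nuls(bytes: &[u8]) -> usize {
    bytes.iter().rev().take_while(|&&b| b == 0).count()
}
```

A content file for which `is_all_nul` returns true, or an index file whose final line has a non-zero `trailing_nuls` count, would match the pattern reported here.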

@zkat zkat added the bug Something isn't working label Jun 19, 2024
@zkat
Owner

zkat commented Jun 19, 2024

This can definitely happen in any kind of environment with high writes. I'm assuming you're properly awaiting/calling .close() on your Writers.

It might also have to do with mmap, so you could try building without the mmap feature.

As far as what to do: I don't really expect to be able to fix all cases of cache corruption. What cacache is designed for isn't completely preventing corruption (since I don't consider that possible), but preventing you from reading bad data. My recommendation, and what I do in my own applications, is that if I read data and get an IntegrityError, the correct action is to delete the bad content and redownload the cached data. This should only happen occasionally, of course.
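The recovery pattern described above can be sketched roughly as follows. This is a model under stated assumptions, not cacache's API: the `Cache` trait, `ReadError`, `ToyCache`, and `read_or_refetch` are all hypothetical names, and the byte-sum checksum stands in for a real integrity hash.

```rust
use std::collections::HashMap;

/// Hypothetical error type; `Integrity` plays the role of an IntegrityError.
#[derive(Debug, PartialEq)]
pub enum ReadError {
    NotFound,
    Integrity,
}

/// Hypothetical minimal cache interface used only for this sketch.
pub trait Cache {
    fn read(&self, key: &str) -> Result<Vec<u8>, ReadError>;
    fn remove(&mut self, key: &str);
    fn write(&mut self, key: &str, data: &[u8]);
}

/// Read from the cache; on an integrity failure, delete the bad entry,
/// fetch the data again, and store the fresh copy.
pub fn read_or_refetch<C: Cache>(
    cache: &mut C,
    key: &str,
    fetch: impl FnOnce() -> Vec<u8>,
) -> Vec<u8> {
    match cache.read(key) {
        Ok(data) => data,
        Err(err) => {
            if err == ReadError::Integrity {
                cache.remove(key); // drop the corrupted entry first
            }
            let fresh = fetch(); // e.g. redownload
            cache.write(key, &fresh);
            fresh
        }
    }
}

/// Toy checksum (sum of bytes); a stand-in for a real content hash.
fn checksum(data: &[u8]) -> u64 {
    data.iter().map(|&b| u64::from(b)).sum()
}

/// In-memory cache that records a checksum next to each entry.
#[derive(Default)]
pub struct ToyCache {
    entries: HashMap<String, (Vec<u8>, u64)>,
}

impl ToyCache {
    /// Simulates corruption by overwriting an entry's bytes with NULs.
    pub fn corrupt(&mut self, key: &str) {
        if let Some((bytes, _)) = self.entries.get_mut(key) {
            bytes.iter_mut().for_each(|b| *b = 0);
        }
    }
}

impl Cache for ToyCache {
    fn read(&self, key: &str) -> Result<Vec<u8>, ReadError> {
        match self.entries.get(key) {
            None => Err(ReadError::NotFound),
            Some((bytes, sum)) if checksum(bytes) == *sum => Ok(bytes.clone()),
            Some(_) => Err(ReadError::Integrity),
        }
    }

    fn remove(&mut self, key: &str) {
        self.entries.remove(key);
    }

    fn write(&mut self, key: &str, data: &[u8]) {
        self.entries
            .insert(key.to_string(), (data.to_vec(), checksum(data)));
    }
}
```

The point of the design is that a failed verification is treated like a cache miss: the caller never sees the bad bytes, and the entry heals itself on the next read.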

@bryanlarsen
Author

The big surprise was the corruption of existing data. I did not anticipate that; I expected that writing identical content to a different key would not rewrite and corrupt the content for the old key.

@zkat
Owner

zkat commented Jun 19, 2024

I would not expect that either. That's very strange, considering there are separate files and filenames for everything. The only way I can see this happening is if you're trying to write a bunch of empty files (which would explain the NULs), and they all end up writing to the same empty content file, because identical content shares a filename.
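To illustrate why identical content ends up in one file, here is a minimal sketch of content addressing. It is an illustration only: `content_path` is a hypothetical function, the `content` directory name is made up, and `DefaultHasher` is a stand-in for the cryptographic hash a real content-addressed cache would use.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;
use std::path::PathBuf;

/// Derives a file path from the bytes alone, so the key plays no part in it.
/// Hypothetical layout: <root>/content/<hex digest>.
pub fn content_path(root: &str, data: &[u8]) -> PathBuf {
    let mut hasher = DefaultHasher::new();
    hasher.write(data);
    PathBuf::from(root)
        .join("content")
        .join(format!("{:016x}", hasher.finish()))
}
```

Under this scheme, every write of empty content, whatever its key, resolves to the same path, while any other content resolves to a different one.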

@bryanlarsen
Author

I wrote a wrapper around my calls to cacache::write that first calls cacache::exists and only writes if the content doesn't already exist. I assumed that it would significantly cut down on the amount of file corruption we're seeing in the wild. If your code already does something like that, my wrapper will be ineffective and I need to do something more drastic.

@zkat
Owner

zkat commented Jun 19, 2024

@bryanlarsen I wouldn't do that: cacache takes care of doing things as atomically as possible, which is how it can operate completely lockless. A two-step operation like this might introduce a race condition that will almost certainly be hit in high-concurrency environments where you're hitting the same data.

That said!

I have run into some weird cache corruption stuff when I was using cacache in orogene (which is VERY high throughput), but it only happened occasionally and I couldn't really track down why.
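For readers unfamiliar with what "as atomically as possible" usually means for a lockless file cache, a common pattern is to write to a temporary file and then rename it into place. The sketch below shows that general pattern using only the standard library. It is an assumption that cacache works along these lines, not a description of its actual implementation, and `write_atomically` is a hypothetical name.

```rust
use std::fs;
use std::io::{self, Write};
use std::path::Path;
use std::sync::atomic::{AtomicU64, Ordering};

/// Per-process counter so concurrent writers get distinct temp names.
static TMP_COUNTER: AtomicU64 = AtomicU64::new(0);

/// Writes `data` to a temp file next to `dest`, flushes it to disk,
/// then renames it over `dest` so readers see the old or new file,
/// never a partially written one.
pub fn write_atomically(dest: &Path, data: &[u8]) -> io::Result<()> {
    let n = TMP_COUNTER.fetch_add(1, Ordering::Relaxed);
    let tmp = dest.with_file_name(format!(".tmp-{}-{}", std::process::id(), n));
    {
        let mut file = fs::File::create(&tmp)?;
        file.write_all(data)?;
        file.sync_all()?; // ask the OS to persist the bytes before the rename
    }
    fs::rename(&tmp, dest)
}
```

Keeping the temp file in the same directory as the destination matters, because a rename is only a single filesystem operation when both paths are on the same volume.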

@bryanlarsen
Author

Why? If I run two threads of if !exists() { write() } simultaneously and they don't race, it does what I want; if they do race, it degenerates into my old code, which would just be two threads of write() simultaneously. And if it's not possible to call write() simultaneously, cacache is broken.

@zkat
Owner

zkat commented Jun 19, 2024

Because if something happens after the exists() that corrupts or removes the package, you then fail to rewrite the package.
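The failure mode described above can be shown with a small in-memory model. Everything here is hypothetical: `ContentStore` is not cacache, FNV-1a stands in for a real content hash, and the model assumes that `exists` only checks for presence without verifying the stored bytes, and that an unconditional `write` always rewrites the content.

```rust
use std::collections::HashMap;

/// FNV-1a, used here only as a stand-in for a real content hash.
fn content_id(data: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325;
    for &byte in data {
        hash ^= u64::from(byte);
        hash = hash.wrapping_mul(0x100000001b3);
    }
    hash
}

/// Hypothetical in-memory model of a content-addressed store.
pub struct ContentStore {
    content: HashMap<u64, Vec<u8>>,
}

impl ContentStore {
    pub fn new() -> Self {
        ContentStore { content: HashMap::new() }
    }

    /// Unconditional write: always (re)stores the bytes under their id.
    pub fn write(&mut self, data: &[u8]) -> u64 {
        let id = content_id(data);
        self.content.insert(id, data.to_vec());
        id
    }

    /// Presence check only; it does not verify the stored bytes.
    pub fn exists(&self, id: u64) -> bool {
        self.content.contains_key(&id)
    }

    /// Verified read: fails when the stored bytes no longer match their id.
    pub fn read(&self, id: u64) -> Result<Vec<u8>, &'static str> {
        match self.content.get(&id) {
            Some(bytes) if content_id(bytes) == id => Ok(bytes.clone()),
            Some(_) => Err("integrity error"),
            None => Err("not found"),
        }
    }

    /// Simulates the reported failure: the entry's bytes become NULs.
    pub fn corrupt(&mut self, id: u64) {
        if let Some(bytes) = self.content.get_mut(&id) {
            bytes.iter_mut().for_each(|b| *b = 0);
        }
    }
}

/// The wrapper under discussion: skip the write when the content "exists".
pub fn write_if_missing(store: &mut ContentStore, data: &[u8]) -> u64 {
    let id = content_id(data);
    if !store.exists(id) {
        store.write(data);
    }
    id
}
```

In this model, once an entry has been corrupted, `write_if_missing` leaves it corrupted because the presence check still passes, whereas a plain `write` of the same bytes repairs it.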

@paulie-g

> What cacache is designed for isn't completely preventing corruption (since I don't consider that possible)

Could you briefly elaborate on what specifically in the design of cacache makes corruption inevitable? I ask because a) one of the feature bullet points in the README advertises "Fault tolerance (immune to corruption, partial writes, process races, etc)" and b) caches don't typically make trade-offs with corruption, and it's unclear to me what the other side of that trade-off would be. There must be something I'm missing (and there doesn't appear to be an architecture/design doc).
