-
Notifications
You must be signed in to change notification settings - Fork 68
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Potential Race Condition Fix - OS Rename & Chmod - PermissionError #115
Conversation
After further investigation - my assumption was correct. After the There's multiple ways to fix it.
I think solution 1 is a little cleaner |
cef5bd1
to
ee9a79c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for tackling this.
It's strange that GitPython is still using this backend for storing files, I thought it's all done by the Git (CLI) based backend.
The python implementation is pretty slow and I definitely wouldn't want to trust it, so it makes me sad it's still in use and gets to meddle with your work.
ee9a79c
to
4b84090
Compare
@Byron yeah I'm not so sure TBH. Do you mind publishing a I'm honestly not sure why they use Changes requested applied. Sorry I'm not able to produce a repro case. It's actually very difficult to reproduce (as most race conditions are) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! There are a couple of nits that should probably be fixed before merging.
It comes to mind that in gitoxide
tempfiles are created with the correct mode right away. This is also how Git does it, and a way to completely avoid the issue here. In Python that doesn't seem quite as straightforward, but I wanted to mention it in case you know how.
4b84090
to
74a0eab
Compare
Yeap honestly I don't think I know of a way to do this. Question 1: Do you actually need to set the permissions for This case: obj_path = self.db_path(self.object_path(hexsha))
obj_dir = dirname(obj_path)
os.makedirs(obj_dir, exist_ok=True)
# END handle destination directory
# rename onto existing doesn't work on NTFS
if isfile(obj_path):
remove(tmp_path) If not then setting the permission only in the "rename" case and before the actual rename will entirely fix the problem. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, great catch, that's true! It would be better to do that only in the rename case.
Besides that, and beyond this PR, the logic there could use a cleanup, as it really does object writes in the slowest possible way. In order to not slow to a crawl objects need to be written and hashed to memory. Then one won't write the object at all if it already exists in the database.
Co-authored-by: Sebastian Thiel <[email protected]>
@Byron Happy new year ;) I'm not 100% sure of your comment. Do you confirm the So far the implementation we have is pretty "risk-free" it's just a retry loop. If I only do it in the rename case, we are now changing the logic and I dont have the knowledge/expertize to know if it's safe. So you tell me |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Independently of what might be right, this change seems to be the smallest possible change to fix this particular issue you have been encountering, so I think it's good to merge.
I will see to create a new release, too.
Amazing ! Thanks a lot @Byron |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(This is only posted as a review because that lets me include a comment on a specific part of the changed code. This pull request has already been merged, and rightly so. You are under no obligation to reply!)
After the
rename(tmp_path, obj_path)
operation, the file atobj_path
might not be fully committed to disk immediately. If chmod is called before the system fully registers obj_path as available, a PermissionError may occur.
I'm curious if this is specific to bind mounts to the outside filesystem in Docker Desktop on macOS, or if it is broader. Is this related to docker/for-mac#6812? I ask because:
-
Not being able to operate on the target of
os.rename
immediately after it returns successfully is unintuitive, and if this is not a bug in Docker, then maybe it should be documented somewhere. If it is in practice specific toos.rename
, then it could be documented for that. If it is already documented, we could add a link to the comment here. -
Maybe the delays could be made specific to that situation, so as to avoid waiting before reporting other
PermissionError
s in most circumstances. -
If creating a file with the desired permissions initially, as suggested in #115 (review), would work, then I think it would be worthwhile. But I worry that some ways of doing this might also be limited by whatever limitations apply to the filesystem when using containers on macOS.
-
If the file cannot be created with the desired permissions initially, I wonder if it might still be possible to create it in a way that would allow the desired permissions to be set without delay. The operation that was not permitted includes access increases:
msktemp
creates files with 0600 permissions, but they are then usually changed to 0444. If permissions are changed in a way that only decreases access, by usually creating the file with 0644 permissions and then changing them to 0444, would the change still have to wait? It seems odd to think those operations might have different constraints, but docker/for-mac#6812 makes me think that might be so.Even if we need to avoid path collisions before the file is moved into place as robustly as
mkstemp
does, we can still create the file initially with less restrictive permissions, by usingmkdtemp
orTemporaryDirectory
to make a temporary directory, and then creating the file with a predictable name inside that directory.
My motivation in bringing up (2), (3), and (4) is that I worry that, for some callers, the delays could accumulate. Callers should probably not call store
in a loop without stopping on errors, so ideally the added delay shouldn't exceed one second. But if a program makes independent attempts to store many objects and then reports errors, then there could be a long delay if PermissionError
is being raised for other reasons such as the user not having enough access to the repository. In principle this can already happen on Windows because of the retry logic for remove
, but as far as I know it has not been expected on other systems.
# Ensure rename is actually done and file is stable | ||
# Retry up to 14 times - exponential wait & retry in ms. | ||
# The total maximum wait time is 1000ms, which should be vastly enough for the | ||
# OS to return and commit the file to disk. | ||
for exp_backoff_ms in [1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 181]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These timings seem reasonable, but I believe this is quadratic backoff rather than exponential backoff. (In exponential backoff, delays are a fixed base, often 2, raised to the power of a number that increases by one with each attempt. Here, the number that increases by one with each attempt is the base, raised to a fixed power of 2.) I mention this only in case true exponential backoff is intended or required.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is very true! Since I copied these values from gitoxide
where I implemented that based on what Git does. So most definitely it was me misnaming this, as Git correctly calls it quadratic.
A PR would definitely be welcome that fixes the name in gitoxide
.
This wold indeed be very interesting to find out! @DEKHTIARJonathan would be the one uniquely positioned to do that if time permits. |
There is a really flaky and hard to debug bug when using
gitdb
at least on MacOS (M-Processor) using Docker (docker.io/python:3.13
image).I have only been able to reproduce this error within
pytest
on my laptop (Apple M4 laptop).I completely randomly get:
When you inspect a little bit that
obj_path
- everything looks absolutely fine as far as I can tell. I have no idea what could trigger thisPermissionError
. The script is executing under the userdev-user:dev-user
of UID/GID:1000:1000
So clearly something is not going well.
Looking a little bit more in detail, turns out the error is always happening in scenario B - but not systematically in scenario B:
There might be an interference between
tempfile.mkstemp
,os.rename
andos.chmod
. Not sure which one.It seems like a retry loop with a small temporization seems to mitigate the problem.
One supposition is that
os.rename
might not entirely complete by the time it returns. Or some file descriptors might be corrupted. Unfortunately, it's really really hard for me to debug and I'm not sure even what to look for.