Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Potential Race Condition Fix - OS Rename & Chmod - PermissionError #115

Merged
merged 2 commits into from
Jan 2, 2025

Conversation

DEKHTIARJonathan
Copy link
Contributor

@DEKHTIARJonathan DEKHTIARJonathan commented Dec 23, 2024

There is a really flaky and hard to debug bug when using gitdb at least on MacOS (M-Processor) using Docker (docker.io/python:3.13 image).

I have only been able to reproduce this error within pytest on my laptop (Apple M4 laptop).

I completely randomly get:

repo = Repo(........)  # Repo Created using `GitPython`
[...]
repo.index.add("*")

    ~~~~~~~~~~~~~~^^^^^
  File "/usr/local/lib/python3.13/site-packages/git/index/base.py", line 885, in add
    entries_added.extend(self._entries_for_paths(paths, path_rewriter, fprogress, entries))
                         ~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.13/site-packages/git/util.py", line 176, in wrapper
    return func(self, *args, **kwargs)
  File "/usr/local/lib/python3.13/site-packages/git/index/util.py", line 111, in set_git_working_dir
    return func(self, *args, **kwargs)
  File "/usr/local/lib/python3.13/site-packages/git/index/base.py", line 745, in _entries_for_paths
    entries_added.append(self._store_path(filepath, fprogress))
                         ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.13/site-packages/git/index/base.py", line 698, in _store_path
    istream = self.repo.odb.store(IStream(Blob.type, st.st_size, stream))
  File "/usr/local/lib/python3.13/site-packages/gitdb/db/loose.py", line 233, in store
    chmod(obj_path, self.new_objects_mode)
    ~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
PermissionError: [Errno 1] Operation not permitted: '/git_projects/<GIT_PROJECT_NAME>/.git/objects/9d/961f54d04e15cef0e525d1cb3d937bd7b82b04'

When you inspect a little bit that obj_path - everything looks absolutely fine as far as I can tell. I have no idea what could trigger this PermissionError. The script is executing under the user dev-user:dev-user of UID/GID: 1000:1000

os.getuid()=1000
pwd.getpwuid(os.getuid())=pwd.struct_passwd(pw_name='dev-user', pw_passwd='x', pw_uid=1000, pw_gid=1000, pw_gecos='', pw_dir='/home/dev-user', pw_shell='/bin/bash')
os.stat(obj_path)=os.stat_result(st_mode=33152, st_ino=3713822, st_dev=65025, st_nlink=1, st_uid=1000, st_gid=1000, st_size=473, st_atime=1734989354, st_mtime=1734989354, st_ctime=1734989354)
Path(obj_path).exists()=True
Path(obj_path).owner()='dev-user'
Path(obj_path).group()='dev-user'

So clearly something is not going well.

Looking a little bit more in detail, turns out the error is always happening in scenario B - but not systematically in scenario B:

if isfile(obj_path):
    remove(tmp_path)
    print("SCENARIO A !!!!!!!!!!!!!!!")
else:
    rename(tmp_path, obj_path)
    print("SCENARIO B !!!!!!!!!!!!!!!")

There might be an interference between tempfile.mkstemp, os.rename and os.chmod. Not sure which one.

It seems like a retry loop with a small temporization seems to mitigate the problem.

One supposition is that os.rename might not entirely complete by the time it returns. Or some file descriptors might be corrupted. Unfortunately, it's really really hard for me to debug and I'm not sure even what to look for.

@DEKHTIARJonathan
Copy link
Contributor Author

DEKHTIARJonathan commented Dec 24, 2024

After further investigation - my assumption was correct.

After the rename(tmp_path, obj_path) operation, the file at obj_path might not be fully committed to disk immediately. If chmod is called before the system fully registers obj_path as available, a PermissionError may occur.

There's multiple ways to fix it.

  1. A small loop as I did with a sleep() around the chmod.
  2. A small loop calling on os.stat(obj_path) which will trigger a OSError if not yet ready => sleep and re-loop.

I think solution 1 is a little cleaner

@DEKHTIARJonathan DEKHTIARJonathan changed the title [Mac OS] PermissionError Fix Potential Race Condition Fix - OS Rename & Chmod - PermissionError Dec 24, 2024
Copy link
Member

@Byron Byron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for tackling this.

It's strange that GitPython is still using this backend for storing files, I thought it's all done by the Git (CLI) based backend.

The python implementation is pretty slow and I definitely wouldn't want to trust it, so it makes me sad it's still in use and gets to meddle with your work.

gitdb/db/loose.py Show resolved Hide resolved
gitdb/db/loose.py Outdated Show resolved Hide resolved
@DEKHTIARJonathan
Copy link
Contributor Author

DEKHTIARJonathan commented Dec 24, 2024

@Byron yeah I'm not so sure TBH.

Do you mind publishing a patch level release of gitdb once you merge this PR ? That would be helpful so that I don't need to instal from github separately.

I'm honestly not sure why they use gitdb, this bug has been quite difficult to identify (fix was trivial though).

Changes requested applied. Sorry I'm not able to produce a repro case. It's actually very difficult to reproduce (as most race conditions are)

Copy link
Member

@Byron Byron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! There are a couple of nits that should probably be fixed before merging.

It comes to mind that in gitoxide tempfiles are created with the correct mode right away. This is also how Git does it, and a way to completely avoid the issue here. In Python that doesn't seem quite as straightforward, but I wanted to mention it in case you know how.

gitdb/db/loose.py Outdated Show resolved Hide resolved
@DEKHTIARJonathan
Copy link
Contributor Author

DEKHTIARJonathan commented Dec 26, 2024

Thanks! There are a couple of nits that should probably be fixed before merging.

It comes to mind that in gitoxide tempfiles are created with the correct mode right away. This is also how Git does it, and a way to completely avoid the issue here. In Python that doesn't seem quite as straightforward, but I wanted to mention it in case you know how.

Yeap honestly I don't think I know of a way to do this.

Question 1: Do you actually need to set the permissions for obj_path in the case it's indeed a file ?

This case:

            obj_path = self.db_path(self.object_path(hexsha))
            obj_dir = dirname(obj_path)
            os.makedirs(obj_dir, exist_ok=True)
            # END handle destination directory
            # rename onto existing doesn't work on NTFS
            if isfile(obj_path):
                remove(tmp_path)

If not then setting the permission only in the "rename" case and before the actual rename will entirely fix the problem.

Copy link
Member

@Byron Byron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, great catch, that's true! It would be better to do that only in the rename case.

Besides that, and beyond this PR, the logic there could use a cleanup, as it really does object writes in the slowest possible way. In order to not slow to a crawl objects need to be written and hashed to memory. Then one won't write the object at all if it already exists in the database.

gitdb/db/loose.py Outdated Show resolved Hide resolved
Co-authored-by: Sebastian Thiel <[email protected]>
@DEKHTIARJonathan
Copy link
Contributor Author

@Byron Happy new year ;)

I'm not 100% sure of your comment. Do you confirm the chmod is only necessary in the rename case. Are you 100% sure of it ?

So far the implementation we have is pretty "risk-free" it's just a retry loop. If I only do it in the rename case, we are now changing the logic and I dont have the knowledge/expertize to know if it's safe. So you tell me

Copy link
Member

@Byron Byron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Independently of what might be right, this change seems to be the smallest possible change to fix this particular issue you have been encountering, so I think it's good to merge.

I will see to create a new release, too.

@Byron Byron merged commit 38b5c38 into gitpython-developers:master Jan 2, 2025
7 checks passed
@DEKHTIARJonathan
Copy link
Contributor Author

Amazing ! Thanks a lot @Byron

Copy link
Contributor

@EliahKagan EliahKagan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(This is only posted as a review because that lets me include a comment on a specific part of the changed code. This pull request has already been merged, and rightly so. You are under no obligation to reply!)

After the rename(tmp_path, obj_path) operation, the file at obj_path might not be fully committed to disk immediately. If chmod is called before the system fully registers obj_path as available, a PermissionError may occur.

I'm curious if this is specific to bind mounts to the outside filesystem in Docker Desktop on macOS, or if it is broader. Is this related to docker/for-mac#6812? I ask because:

  1. Not being able to operate on the target of os.rename immediately after it returns successfully is unintuitive, and if this is not a bug in Docker, then maybe it should be documented somewhere. If it is in practice specific to os.rename, then it could be documented for that. If it is already documented, we could add a link to the comment here.

  2. Maybe the delays could be made specific to that situation, so as to avoid waiting before reporting other PermissionErrors in most circumstances.

  3. If creating a file with the desired permissions initially, as suggested in #115 (review), would work, then I think it would be worthwhile. But I worry that some ways of doing this might also be limited by whatever limitations apply to the filesystem when using containers on macOS.

  4. If the file cannot be created with the desired permissions initially, I wonder if it might still be possible to create it in a way that would allow the desired permissions to be set without delay. The operation that was not permitted includes access increases: msktemp creates files with 0600 permissions, but they are then usually changed to 0444. If permissions are changed in a way that only decreases access, by usually creating the file with 0644 permissions and then changing them to 0444, would the change still have to wait? It seems odd to think those operations might have different constraints, but docker/for-mac#6812 makes me think that might be so.

    Even if we need to avoid path collisions before the file is moved into place as robustly as mkstemp does, we can still create the file initially with less restrictive permissions, by using mkdtemp or TemporaryDirectory to make a temporary directory, and then creating the file with a predictable name inside that directory.

My motivation in bringing up (2), (3), and (4) is that I worry that, for some callers, the delays could accumulate. Callers should probably not call store in a loop without stopping on errors, so ideally the added delay shouldn't exceed one second. But if a program makes independent attempts to store many objects and then reports errors, then there could be a long delay if PermissionError is being raised for other reasons such as the user not having enough access to the repository. In principle this can already happen on Windows because of the retry logic for remove, but as far as I know it has not been expected on other systems.

Comment on lines +232 to +236
# Ensure rename is actually done and file is stable
# Retry up to 14 times - exponential wait & retry in ms.
# The total maximum wait time is 1000ms, which should be vastly enough for the
# OS to return and commit the file to disk.
for exp_backoff_ms in [1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 181]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These timings seem reasonable, but I believe this is quadratic backoff rather than exponential backoff. (In exponential backoff, delays are a fixed base, often 2, raised to the power of a number that increases by one with each attempt. Here, the number that increases by one with each attempt is the base, raised to a fixed power of 2.) I mention this only in case true exponential backoff is intended or required.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is very true! Since I copied these values from gitoxide where I implemented that based on what Git does. So most definitely it was me misnaming this, as Git correctly calls it quadratic.

A PR would definitely be welcome that fixes the name in gitoxide.

@Byron
Copy link
Member

Byron commented Jan 5, 2025

4. Even if we need to avoid path collisions before the file is moved into place as robustly as mkstemp does, we can still create the file initially with less restrictive permissions, by using mkdtemp or TemporaryDirectory to make a temporary directory, then and creating the file with a predictable name inside that directory.

This wold indeed be very interesting to find out! @DEKHTIARJonathan would be the one uniquely positioned to do that if time permits.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants