Fix get protocol and path #4409
base: main
Signed-off-by: Elena Khaustova <[email protected]>
I'm not sure if this fix works in all cases, because in many filesystem URLs Another, less pathological example:
A few clarifications:
>>> _parse_filepath("s3://some#dummy?file#name##?")
>>> SplitResult(scheme='s3', netloc='some', path='', query='', fragment='dummy?file#name##?')
>>> {'protocol': 's3', 'path': 'some#dummy?file#name##?'}
>>> _parse_filepath("s3://some/dummy?file#name")
>>> SplitResult(scheme='s3', netloc='some', path='/dummy', query='file', fragment='name')
>>> {'protocol': 's3', 'path': 'some/dummy?file#name'}
@jasonmhite Thank you for sharing your concerns. The fix still looks relevant to me. Feel free to share other examples in case I'm missing anything.
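For readers who want to check the `SplitResult` values quoted in the clarifications, here is a minimal sketch that uses only `urllib.parse.urlsplit` from the standard library. It does not call Kedro's `_parse_filepath`, and the variable names are illustrative.

```python
from urllib.parse import urlsplit

# urlsplit cuts the fragment at the first "#" and only then looks for "?"
# in what is left, so a "?" that follows a "#" stays inside the fragment.
pathological = urlsplit("s3://some#dummy?file#name##?")
ordinary = urlsplit("s3://some/dummy?file#name")

print(pathological)
print(ordinary)
```

In the first URL everything after the first `#` ends up in `fragment`, which is why `query` is empty there.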
@@ -901,6 +901,11 @@ def _parse_filepath(filepath: str) -> dict[str, str]:
        if windows_path:
            path = ":".join(windows_path.groups())

    if parsed_path.query:
Is this essentially
?
If so, are we ready to dump this function and use fsspec.utils.infer_storage_options?
I feel like we should try to converge with fsspec ASAP, bug by bug if need be, and contribute the fixes upstream.
I'm not sure I fully understand the first question, but as far as I can tell it's not the same: we output a different format, and the difference lies in how we build the return value.
It looks like our _parse_filepath is based on fsspec.utils.infer_storage_options, but it does a different job: we only split the protocol from the rest of the path, whereas the latter splits out all the components and applies some extra logic on top.
We cannot swap one for the other as is: fsspec.utils.infer_storage_options would need to be updated to comply with Kedro's downstream logic. That might work for us, but it's out of scope for this ticket and can be discussed separately.
So my suggestion is to fix the bug first, then decide whether to replace the existing logic with an fsspec-based one.
@ElenaKhaustova OK, I wasn't able to test my examples and was just going on what I saw, but it seems you already handle repeated # and ? correctly. Does the second example I gave work?
It looked to me like the way the code appends the query and fragment portions back on would cause this case to get reversed. If that works too, then the proposed solution looks good to me.
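The ordering concern can be checked without Kedro. The sketch below assumes the fix re-attaches the components as `?query` followed by `#fragment`; the helper `rejoin` is hypothetical and is not the PR's code. It rebuilds everything after the `scheme://` prefix and prints the result for both orderings of `#` and `?`.

```python
from urllib.parse import urlsplit


def rejoin(url: str) -> str:
    """Hypothetical helper: rebuild everything after 'scheme://' from urlsplit parts."""
    parts = urlsplit(url)
    rebuilt = parts.netloc + parts.path
    if parts.query:
        rebuilt = f"{rebuilt}?{parts.query}"
    if parts.fragment:
        rebuilt = f"{rebuilt}#{parts.fragment}"
    return rebuilt


# "#" before "?": the whole tail lands in the fragment, so nothing is reordered.
print(rejoin("s3://bucket/dir#frag?query"))
# "?" before "#": query and fragment are split, then re-attached in the same order.
print(rejoin("s3://bucket/dir?query#frag"))
```

One limitation of this approach, under the same assumption: a trailing empty `?` or `#` is reported by `urlsplit` as an empty string, so it would not be re-attached.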
Description
Solves #3196
Explanation: #3196 (comment)
Development notes
Fixed the _parse_filepath method to extract the query and fragment URL components from the SplitResult returned by urllib.parse.urlsplit and append them to the result path. Previously, query and fragment were omitted.
Example:
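As an illustration of the behaviour described in these notes, here is a simplified stand-in written for this page, not the actual Kedro implementation. The name `parse_filepath_sketch`, the `"file"` default, and keeping `netloc` as part of the path are assumptions; the real function also handles Windows paths and other cases that are left out here.

```python
from urllib.parse import urlsplit


def parse_filepath_sketch(filepath: str) -> dict[str, str]:
    """Simplified stand-in for _parse_filepath: split the protocol from the rest."""
    parsed = urlsplit(filepath)
    protocol = parsed.scheme or "file"  # assumption: no scheme means a local file
    # Assumption: for URLs with a scheme, the bucket/host (netloc) stays in the path.
    path = parsed.netloc + parsed.path if parsed.scheme else parsed.path
    # The behaviour the notes describe: keep query and fragment instead of dropping them.
    if parsed.query:
        path = f"{path}?{parsed.query}"
    if parsed.fragment:
        path = f"{path}#{parsed.fragment}"
    return {"protocol": protocol, "path": path}


print(parse_filepath_sketch("s3://some/dummy?file#name"))
```

With the two URLs discussed in the conversation, this sketch yields the same `protocol` and `path` values that were quoted there.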
Developer Certificate of Origin
We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.
If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.
Checklist
RELEASE.md
file