Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix get protocol and path #4409

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

ElenaKhaustova
Copy link
Contributor

@ElenaKhaustova ElenaKhaustova commented Jan 9, 2025

Description

Solves #3196

Explanation: #3196 (comment)

Development notes

Fixed _parse_filepath method to extract query and fragment URL components based on urllib.parse.urlsplit return - SplitResult and append them to the result path. Previously query and fragment were omitted.

Example:

>>> _parse_filepath("s3://some/dummy#filename")
>>> SplitResult(scheme='s3', netloc='some', path='/dummy', query='', fragment='filename') <-- urlsplit return

>>> {'protocol': 's3', 'path': 'some/dummy'} <-- _parse_filepath return before fix

>>> {'protocol': 's3', 'path': 'some/dummy#filename'} <-- _parse_filepath return after fix

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Checklist

  • Read the contributing guidelines
  • Signed off each commit with a Developer Certificate of Origin (DCO)
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes
  • Checked if this change will affect Kedro-Viz, and if so, communicated that with the Viz team

@jasonmhite
Copy link
Contributor

jasonmhite commented Jan 9, 2025

I'm not sure if this fix works in all cases, because in many filesystem URLs # and ? are valid characters in the filename and may appear multiple times and in different orders. E.g. I think s3://some#dummy?file#name##? would still be valid, and this fix would garble that. The fix assumes a specific order for the query and fragment, and it won't handle ##? right either. I think the only real fix is to not do the splitting in the first place, but only do this on path types where # and ? are valid file names (e.g. s3:, file: etc but not http:). This is kind of a nightmare problem :|. Or perhaps there is some way to escape these characters when they are valid names?

Another, less pathological example:

s3://some/dummy?file#name would get parsed to some/dummy#name?file

@ElenaKhaustova ElenaKhaustova marked this pull request as ready for review January 9, 2025 17:17
@ElenaKhaustova ElenaKhaustova marked this pull request as draft January 9, 2025 18:04
@ElenaKhaustova
Copy link
Contributor Author

ElenaKhaustova commented Jan 10, 2025

I'm not sure if this fix works in all cases, because in many filesystem URLs # and ? are valid characters in the filename and may appear multiple times and in different orders. E.g. I think s3://some#dummy?file#name##? would still be valid, and this fix would garble that. The fix assumes a specific order for the query and fragment, and it won't handle ##? right either. I think the only real fix is to not do the splitting in the first place, but only do this on path types where # and ? are valid file names (e.g. s3:, file: etc but not http:). This is kind of a nightmare problem :|. Or perhaps there is some way to escape these characters when they are valid names?

Another, less pathological example:

s3://some/dummy?file#name would get parsed to some/dummy#name?file

A few clarifications:

  1. As mentioned above we rely on urllib.parse.urlsplit which parses a URL into 5 components: <scheme>://<netloc>/<path>?<query>#<fragment>. From these components we only use scheme and netloc - the rest components are just joined back to the result as they were. So if scheme and netloc are parsed correctly we do not care about the rest components. Even if they were not parsed correctly into this format <scheme>://<netloc>/<path>?<query>#<fragment> they will be joined back and the resulting string will be correct anyway.
  2. The fix only handles the cases when query and fargment components exist (previously they were omitted).
  3. HHTP protocols are processed separately and they're not parsed with urllib.parse.urlsplit
  4. Output for your example looks alright:
>>> _parse_filepath("s3://some#dummy?file#name##?")
>>> SplitResult(scheme='s3', netloc='some', path='', query='', fragment='dummy?file#name##?')
>>> {'protocol': 's3', 'path': 'some#dummy?file#name##?'}

I think s3://some#dummy?file#name##? would still be valid, and this fix would garble that.

  1. Your second example is already included as the unit test and is parsed as expected
>>> _parse_filepath("s3://some/dummy?file#name"")
>>> SplitResult(scheme='s3', netloc='some', path='/dummy', query='file', fragment='name')
>>> {'protocol': 's3', 'path': 'some/dummy?file#name'}

s3://some/dummy?file#name would get parsed to some/dummy#name?file

@jasonmhite Thank you for sharing your concerns. The fix still looks relevant to me. Feel free to share other examples in case I'm missing anything.

@ElenaKhaustova ElenaKhaustova marked this pull request as ready for review January 10, 2025 12:47
@@ -901,6 +901,11 @@ def _parse_filepath(filepath: str) -> dict[str, str]:
if windows_path:
path = ":".join(windows_path.groups())

if parsed_path.query:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this essentially

https://github.com/fsspec/filesystem_spec/blob/1d34249f0b043907f86064c274a33454ec670ebe/fsspec/utils.py#L112-L115

?

If so, are we ready to dump this function and use fsspec.utils.infer_storage_options?

I feel like we should try to converge with fsspec ASAP - bug by bug, if need be! and contribute the fixes upstream

Copy link
Contributor Author

@ElenaKhaustova ElenaKhaustova Jan 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I fully understand the first question, but from what I got: It's not the same because we output a different format and the difference is in the way we build the return output.

Looks like our _parse_filepath is based on fsspec.utils.infer_storage_options but it does a different thing as we only split protocol from the rest of the path but the last split all of them plus some extra logic on top.

We cannot replace them as is, meaning that fsspec.utils.infer_storage_options needs to be updated to comply with Kerdo downstream logic. This might work for us, but it's out of the scope of this ticket and can be discussed separately.

So my suggestion is to fix the bug first, then decide on replacing the existing logic with fsspec-based.

@jasonmhite
Copy link
Contributor

jasonmhite commented Jan 10, 2025

@ElenaKhaustova OK, I wasn't able to test my examples and was just going on what I saw, but it seems you already handled repeat # and ? correctly.

Does the second example I give work?

s3://some/dummy?file#name would get parsed to some/dummy#name?file

It looked to me like the way the code appends the query and fragment portions back on will cause this case to get reversed. If that works too then the proposed solution looks good to me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants