Fix get protocol and path #4409

ElenaKhaustova · 2025-01-09T16:34:07Z

Description

Development notes

Fixed the _parse_filepath method to extract the query and fragment URL components from the SplitResult returned by urllib.parse.urlsplit and append them to the resulting path. Previously, query and fragment were omitted.

Example:

>>> _parse_filepath("s3://some/dummy#filename")
>>> SplitResult(scheme='s3', netloc='some', path='/dummy', query='', fragment='filename') <-- urlsplit return

>>> {'protocol': 's3', 'path': 'some/dummy'} <-- _parse_filepath return before fix

>>> {'protocol': 's3', 'path': 'some/dummy#filename'} <-- _parse_filepath return after fix
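
For readers who want to try the behaviour described above, here is a minimal sketch of the idea. It is not the actual Kedro implementation: parse_filepath_sketch is a hypothetical name, and the real _parse_filepath also handles Windows paths and HTTP protocols, which are left out here. It only assumes that the path is rebuilt from netloc and path, with query and fragment appended when they are non-empty.

```python
from urllib.parse import urlsplit


def parse_filepath_sketch(filepath: str) -> dict:
    """Hypothetical, simplified stand-in for _parse_filepath (not Kedro's code)."""
    parsed = urlsplit(filepath)
    # Rebuild the path from the bucket (netloc) and the path component.
    path = parsed.netloc + parsed.path
    # Re-append the components that were previously dropped.
    if parsed.query:
        path = f"{path}?{parsed.query}"
    if parsed.fragment:
        path = f"{path}#{parsed.fragment}"
    return {"protocol": parsed.scheme, "path": path}


result = parse_filepath_sketch("s3://some/dummy#filename")
print(result)
```

Under these assumptions, the fragment is kept as part of the path instead of being silently discarded.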

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Checklist

Read the contributing guidelines
Signed off each commit with a Developer Certificate of Origin (DCO)
Opened this PR as a 'Draft Pull Request' if it is work-in-progress
Updated the documentation to reflect the code changes
Added a description of this change in the RELEASE.md file
Added tests to cover my changes
Checked if this change will affect Kedro-Viz, and if so, communicated that with the Viz team

Signed-off-by: Elena Khaustova <[email protected]>

jasonmhite · 2025-01-09T17:11:27Z

I'm not sure if this fix works in all cases, because in many filesystem URLs # and ? are valid characters in the filename and may appear multiple times and in different orders. E.g. I think s3://some#dummy?file#name##? would still be valid, and this fix would garble that. The fix assumes a specific order for the query and fragment, and it won't handle ##? right either. I think the only real fix is to not do the splitting in the first place, but only do this on path types where # and ? are valid file names (e.g. s3:, file: etc but not http:). This is kind of a nightmare problem :|. Or perhaps there is some way to escape these characters when they are valid names?

Another, less pathological example:

s3://some/dummy?file#name would get parsed to some/dummy#name?file

ElenaKhaustova · 2025-01-10T12:47:19Z

A few clarifications:

As mentioned above, we rely on urllib.parse.urlsplit, which parses a URL into five components: <scheme>://<netloc>/<path>?<query>#<fragment>. Of these, we only use scheme and netloc; the remaining components are simply joined back onto the result as they were. So as long as scheme and netloc are parsed correctly, we do not care about the rest. Even if the remainder does not really fit the <path>?<query>#<fragment> format, it is joined back in the same order, so the resulting string is still correct.
The fix only handles the cases where the query and fragment components exist (previously they were omitted).
HTTP protocols are processed separately and are not parsed with urllib.parse.urlsplit.
Output for your example looks alright:

>>> _parse_filepath("s3://some#dummy?file#name##?")
>>> SplitResult(scheme='s3', netloc='some', path='', query='', fragment='dummy?file#name##?')
>>> {'protocol': 's3', 'path': 'some#dummy?file#name##?'}

Your second example is already included as a unit test and is parsed as expected:

>>> _parse_filepath("s3://some/dummy?file#name")
>>> SplitResult(scheme='s3', netloc='some', path='/dummy', query='file', fragment='name')
>>> {'protocol': 's3', 'path': 'some/dummy?file#name'}
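
The ordering argument can be checked with a small round-trip test. This is only a sketch under stated assumptions: rejoin_after_split is a hypothetical helper that mimics the "split, then join back" step described above, not the code in kedro/io/core.py. Because urlsplit cuts the fragment at the first # and only then looks for ? in what is left, the query always precedes the fragment in the original string, so appending them in that order reproduces the input for both examples from this thread.

```python
from urllib.parse import urlsplit


def rejoin_after_split(filepath: str) -> str:
    """Hypothetical helper: split with urlsplit, then join everything after the scheme."""
    parts = urlsplit(filepath)
    path = parts.netloc + parts.path
    if parts.query:
        path += "?" + parts.query
    if parts.fragment:
        path += "#" + parts.fragment
    return path


# The two examples raised in this thread.
examples = [
    "s3://some#dummy?file#name##?",
    "s3://some/dummy?file#name",
]

for url in examples:
    expected = url[len("s3://"):]  # everything after the protocol prefix
    actual = rejoin_after_split(url)
    print(url, "->", actual, "| round-trips:", actual == expected)
```

Both inputs come back unchanged apart from the stripped protocol, which matches the outputs shown above.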

@jasonmhite Thank you for sharing your concerns. The fix still looks relevant to me. Feel free to share other examples in case I'm missing anything.

astrojuanlu · 2025-01-10T13:02:51Z

kedro/io/core.py

@@ -901,6 +901,11 @@ def _parse_filepath(filepath: str) -> dict[str, str]:
        if windows_path:
            path = ":".join(windows_path.groups())

+    if parsed_path.query:


Is this essentially

https://github.com/fsspec/filesystem_spec/blob/1d34249f0b043907f86064c274a33454ec670ebe/fsspec/utils.py#L112-L115

?

If so, are we ready to dump this function and use fsspec.utils.infer_storage_options?

I feel like we should try to converge with fsspec ASAP - bug by bug, if need be! and contribute the fixes upstream

ElenaKhaustova · 2025-01-10

Not sure I fully understand the first question, but from what I got: it's not the same, because we output a different format; the difference is in the way we build the return value.

It looks like our _parse_filepath is based on fsspec.utils.infer_storage_options but does a different thing: we only split the protocol from the rest of the path, whereas the latter splits out all the components and adds some extra logic on top.

We cannot replace one with the other as is, meaning that fsspec.utils.infer_storage_options would need to be updated to comply with Kedro's downstream logic. This might work for us, but it's out of scope for this ticket and can be discussed separately.

So my suggestion is to fix the bug first, then decide on replacing the existing logic with an fsspec-based one.

jasonmhite · 2025-01-10T19:54:41Z

@ElenaKhaustova OK, I wasn't able to test my examples and was just going on what I saw, but it seems you already handle repeated # and ? correctly.

Does the second example I gave work?

s3://some/dummy?file#name would get parsed to some/dummy#name?file

It looked to me like the way the code appends the query and fragment portions back on will cause this case to get reversed. If that works too then the proposed solution looks good to me.

ElenaKhaustova added 4 commits January 9, 2025 14:43

Fixed _parse_filepath

e4e9097

Signed-off-by: Elena Khaustova <[email protected]>

Merge branch 'main' into fix/3196-get-protocol-and-path

ae1a6f8

Updated unit tests

dc84d02

Signed-off-by: Elena Khaustova <[email protected]>

Updated release notes

adc2228

Signed-off-by: Elena Khaustova <[email protected]>

ElenaKhaustova mentioned this pull request Jan 9, 2025

[Spike] Investigate why get_protocol_and_path does not correctly parse paths with # in them #3196

Open

ElenaKhaustova marked this pull request as ready for review January 9, 2025 17:17

ElenaKhaustova requested a review from merelcht as a code owner January 9, 2025 17:17

ElenaKhaustova requested review from noklam, DimedS, ravi-kumar-pilla and ankatiyar January 9, 2025 17:17

ElenaKhaustova marked this pull request as draft January 9, 2025 18:04

ElenaKhaustova marked this pull request as ready for review January 10, 2025 12:47

ElenaKhaustova requested a review from astrojuanlu January 10, 2025 12:48

astrojuanlu reviewed Jan 10, 2025
