Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix get protocol and path #4409

Open
wants to merge 4 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions RELEASE.md
Original file line number Diff line number Diff line change
Expand Up @@ -11,6 +11,7 @@
* Added `node` import to the pipeline template.
* Update error message when executing kedro run without pipeline.
* Safeguard hooks when user incorrectly registers a hook class in settings.py.
* Fixed parsing paths with query and fragment.

## Breaking changes to the API
## Documentation changes
Expand Down
5 changes: 5 additions & 0 deletions kedro/io/core.py
Original file line number Diff line number Diff line change
Expand Up @@ -901,6 +901,11 @@ def _parse_filepath(filepath: str) -> dict[str, str]:
if windows_path:
path = ":".join(windows_path.groups())

if parsed_path.query:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this essentially

https://github.com/fsspec/filesystem_spec/blob/1d34249f0b043907f86064c274a33454ec670ebe/fsspec/utils.py#L112-L115

?

If so, are we ready to dump this function and use fsspec.utils.infer_storage_options?

I feel like we should try to converge with fsspec ASAP - bug by bug, if need be! and contribute the fixes upstream

Copy link
Contributor Author

@ElenaKhaustova ElenaKhaustova Jan 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I fully understand the first question, but from what I got: It's not the same because we output a different format and the difference is in the way we build the return output.

Looks like our _parse_filepath is based on fsspec.utils.infer_storage_options but it does a different thing as we only split protocol from the rest of the path but the last split all of them plus some extra logic on top.

We cannot replace them as is, meaning that fsspec.utils.infer_storage_options needs to be updated to comply with Kerdo downstream logic. This might work for us, but it's out of the scope of this ticket and can be discussed separately.

So my suggestion is to fix the bug first, then decide on replacing the existing logic with fsspec-based.

path = f"{path}?{parsed_path.query}"
if parsed_path.fragment:
path = f"{path}#{parsed_path.fragment}"

options = {"protocol": protocol, "path": path}

if parsed_path.netloc and protocol in CLOUD_PROTOCOLS:
Expand Down
18 changes: 18 additions & 0 deletions tests/io/test_core.py
Original file line number Diff line number Diff line change
Expand Up @@ -271,6 +271,24 @@ def test_get_filepath_str(self):
("file:///C:\\Projects\\file.txt", ("file", "C:\\Projects\\file.txt")),
("https://example.com/file.txt", ("https", "example.com/file.txt")),
("http://example.com/file.txt", ("http", "example.com/file.txt")),
(
"https://example.com/search?query=books&category=fiction#reviews",
("https", "example.com/search?query=books&category=fiction#reviews"),
),
(
"https://example.com/search#reviews",
("https", "example.com/search#reviews"),
),
(
"http://example.com/search?query=books&category=fiction",
("http", "example.com/search?query=books&category=fiction"),
),
(
"s3://some/example?query=query#filename",
("s3", "some/example?query=query#filename"),
),
("s3://some/example#filename", ("s3", "some/example#filename")),
("s3://some/example?query=query", ("s3", "some/example?query=query")),
],
)
def test_get_protocol_and_path(self, filepath, expected_result):
Expand Down
Loading