[WIP] Apply obstore as storage backend #3033
base: master
Conversation
Code Review Agent Run #39883a
Actionable Suggestions: 7
Additional Suggestions: 3
Review Details
Changelist by Bito: This pull request implements the following key changes.
driver_pod=self.driver_pod,
executor_pod=self.executor_pod,
Consider adding driver_pod and executor_pod to the with_overrides method to maintain consistency with the constructor parameters.
Code suggestion
Check the AI-generated fix before applying
@@ -56,6 +56,8 @@ def with_overrides(
     new_spark_conf: Optional[Dict[str, str]] = None,
     new_hadoop_conf: Optional[Dict[str, str]] = None,
     new_databricks_conf: Optional[Dict[str, Dict]] = None,
+    driver_pod: Optional[K8sPod] = None,
+    executor_pod: Optional[K8sPod] = None,
 ) -> "SparkJob":
     if not new_spark_conf:
         new_spark_conf = self.spark_conf
@@ -65,6 +67,12 @@ def with_overrides(
     if not new_databricks_conf:
         new_databricks_conf = self.databricks_conf
+    if not driver_pod:
+        driver_pod = self.driver_pod
+
+    if not executor_pod:
+        executor_pod = self.executor_pod
+
     return SparkJob(
         spark_type=self.spark_type,
         application_file=self.application_file,
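For context, the None-coalescing override pattern at issue generalizes like the following minimal, runnable sketch. Only the field names (spark_conf, driver_pod, executor_pod) come from the diff above; the Job dataclass and the string pod values are illustrative stand-ins:

from dataclasses import dataclass, replace
from typing import Dict, Optional

@dataclass
class Job:
    # Toy stand-in for SparkJob; pods are plain strings instead of K8sPod.
    spark_conf: Dict[str, str]
    driver_pod: Optional[str] = None
    executor_pod: Optional[str] = None

    def with_overrides(
        self,
        new_spark_conf: Optional[Dict[str, str]] = None,
        driver_pod: Optional[str] = None,
        executor_pod: Optional[str] = None,
    ) -> "Job":
        # Fall back to the current value when no override is supplied,
        # mirroring the suggestion's if-not-set pattern.
        return replace(
            self,
            spark_conf=new_spark_conf or self.spark_conf,
            driver_pod=driver_pod or self.driver_pod,
            executor_pod=executor_pod or self.executor_pod,
        )

job = Job(spark_conf={"spark.executor.memory": "2g"})
assert job.with_overrides(driver_pod="custom-pod").driver_pod == "custom-pod"
assert job.with_overrides(driver_pod="custom-pod").spark_conf == job.spark_conf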
Code Review Run #39883a
driverPod=self.driver_pod.to_flyte_idl() if self.driver_pod else None,
executorPod=self.executor_pod.to_flyte_idl() if self.executor_pod else None,
Consider adding null checks for to_flyte_idl() calls on driver_pod and executor_pod to avoid potential NoneType errors.
Code suggestion
Check the AI-generated fix before applying
driverPod=self.driver_pod.to_flyte_idl() if self.driver_pod else None,
executorPod=self.executor_pod.to_flyte_idl() if self.executor_pod else None,

driverPod=self.driver_pod.to_flyte_idl() if self.driver_pod and hasattr(self.driver_pod, 'to_flyte_idl') else None,
executorPod=self.executor_pod.to_flyte_idl() if self.executor_pod and hasattr(self.executor_pod, 'to_flyte_idl') else None,
Code Review Run #39883a
flytekit/core/data_persistence.py
Outdated
if "file" in path: | ||
# no bucket for file | ||
return "", path |
The condition if "file" in path may match paths containing "file" anywhere in the string, not just the protocol. Consider using if get_protocol(path) == "file" for more precise protocol checking.
Code suggestion
Check the AI-generated fix before applying
if "file" in path: | |
# no bucket for file | |
return "", path | |
if get_protocol(path) == "file": | |
# no bucket for file | |
return "", path |
Code Review Run #39883a
flytekit/core/data_persistence.py
Outdated
support_types = ["s3", "gs", "abfs"]
if protocol in support_types:
    file_path = "/".join(path_li[1:])
    return (bucket, file_path)
else:
    return bucket, path
The list of supported storage types, support_types = ["s3", "gs", "abfs"], could be defined as a module-level constant since it's used for validation. Consider moving it outside the function to improve maintainability.
Code suggestion
Check the AI-generated fix before applying
@@ -53,1 +53,2 @@
_ANON = "anon"
+SUPPORTED_STORAGE_TYPES = ["s3", "gs", "abfs"]
@@ -136,2 +136,1 @@
- support_types = ["s3", "gs", "abfs"]
- if protocol in support_types:
+ if protocol in SUPPORTED_STORAGE_TYPES:
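As a self-contained sketch of what split_path does once the constant is hoisted, with semantics inferred from the diffs in this PR (the actual flytekit implementation may differ in details):

from typing import Tuple
from fsspec.utils import get_protocol

SUPPORTED_STORAGE_TYPES = ["s3", "gs", "abfs"]

def split_path(path: str) -> Tuple[str, str]:
    # Split an object-store URI into (bucket, key); pass other paths through.
    protocol = get_protocol(path)
    if protocol not in SUPPORTED_STORAGE_TYPES:
        # Local files and unknown schemes keep their full path and no bucket.
        return "", path
    path_li = path.split("/")  # e.g. ["s3:", "", "bucket", "dir", "f.txt"]
    bucket = path_li[2]
    file_path = "/".join(path_li[3:])
    return bucket, file_path

assert split_path("s3://bucket/dir/f.txt") == ("bucket", "dir/f.txt")
assert split_path("/tmp/f.txt") == ("", "/tmp/f.txt")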
Code Review Run #39883a
flytekit/core/data_persistence.py
Outdated
kwargs["store"] = store | ||
|
||
if anonymous: | ||
kwargs[_ANON] = True |
Consider using kwargs[_ANON] = anonymous instead of hardcoding True to maintain consistency with the input parameter value.
Code suggestion
Check the AI-generated fix before applying
kwargs[_ANON] = True

kwargs[_ANON] = anonymous
Code Review Run #39883a
bucket, to_path_file_only = split_path(to_path)
file_system = await self.get_async_filesystem_for_path(to_path, bucket)
Consider validating the bucket parameter before passing it to get_async_filesystem_for_path(). An empty bucket could cause issues with certain storage backends. Similar issues were also found in:
- flytekit/core/data_persistence.py (line 318)
- flytekit/core/data_persistence.py (line 521)
- flytekit/core/data_persistence.py (line 308)
Code suggestion
Check the AI-generated fix before applying
bucket, to_path_file_only = split_path(to_path)
file_system = await self.get_async_filesystem_for_path(to_path, bucket)

bucket, to_path_file_only = split_path(to_path)
protocol = get_protocol(to_path)
if protocol in ['s3', 'gs', 'abfs'] and not bucket:
    raise ValueError(f'Bucket cannot be empty for {protocol} protocol')
file_system = await self.get_async_filesystem_for_path(to_path, bucket)
Code Review Run #39883a
Code Review Agent Run #8926b7
Actionable Suggestions: 0
Review Details
Successfully ran it locally; not yet tested on remote. Signed-off-by: machichima <[email protected]>
Force-pushed from 7c76cc6 to 17bde4a.
Code Review Agent Run #0b7f4d
Actionable Suggestions: 4
Additional Suggestions: 1
Review Details
bucket, to_path_file_only = split_path(to_path)
file_system = await self.get_async_filesystem_for_path(to_path, bucket)
Consider extracting the bucket and path splitting logic into a separate method to improve code reusability and maintainability. The split_path function is used in multiple places and could be encapsulated better.
Code suggestion
Check the AI-generated fix before applying
bucket, to_path_file_only = split_path(to_path)
file_system = await self.get_async_filesystem_for_path(to_path, bucket)

bucket, path = self._split_and_get_bucket_path(to_path)
file_system = await self.get_async_filesystem_for_path(to_path, bucket)
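Note that _split_and_get_bucket_path is the agent's invention, not an existing flytekit method. A hedged sketch of what such a wrapper could look like, with stubs standing in for the real split_path and filesystem lookup:

import asyncio

def split_path(path: str):
    # Stub of the split_path under review: bucket/key for URIs, pass-through otherwise.
    parts = path.split("/", 3)
    return (parts[2], parts[3]) if "://" in path else ("", path)

class FileAccessSketch:
    async def get_async_filesystem_for_path(self, path: str, bucket: str):
        return f"<filesystem for {bucket or 'local'}>"  # stub

    async def _split_and_get_bucket_path(self, path: str):
        # One helper pairing split_path() with the filesystem lookup,
        # instead of repeating both lines at every call site.
        bucket, file_only = split_path(path)
        fs = await self.get_async_filesystem_for_path(path, bucket)
        return fs, bucket, file_only

fs, bucket, key = asyncio.run(
    FileAccessSketch()._split_and_get_bucket_path("s3://bucket/dir/f.txt")
)
assert (bucket, key) == ("bucket", "dir/f.txt")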
Code Review Run #0b7f4d
bucket, from_path_file_only = split_path(from_path)
file_system = await self.get_async_filesystem_for_path(from_path, bucket)
Consider handling the case where split_path() returns an empty bucket for non-file protocols. Currently, passing an empty bucket to get_async_filesystem_for_path() could cause issues with cloud storage access.
Code suggestion
Check the AI-generated fix before applying
bucket, from_path_file_only = split_path(from_path)
file_system = await self.get_async_filesystem_for_path(from_path, bucket)

bucket, from_path_file_only = split_path(from_path)
protocol = get_protocol(from_path)
if protocol not in ['file'] and not bucket:
    raise ValueError(f'Empty bucket not allowed for protocol {protocol}')
file_system = await self.get_async_filesystem_for_path(from_path, bucket)
Code Review Run #0b7f4d
flytekit/core/data_persistence.py
Outdated
fsspec.register_implementation("s3", AsyncFsspecStore) | ||
fsspec.register_implementation("gs", AsyncFsspecStore) | ||
fsspec.register_implementation("abfs", AsyncFsspecStore) |
Consider moving the fsspec implementation registrations to a more appropriate initialization location, such as a module-level __init__.py or a dedicated setup function. This would improve code organization and make the registrations more discoverable.
Code suggestion
Check the AI-generated fix before applying
fsspec.register_implementation("s3", AsyncFsspecStore) | |
fsspec.register_implementation("gs", AsyncFsspecStore) | |
fsspec.register_implementation("abfs", AsyncFsspecStore) | |
def register_fsspec_implementations(): | |
fsspec.register_implementation("s3", AsyncFsspecStore) | |
fsspec.register_implementation("gs", AsyncFsspecStore) | |
fsspec.register_implementation("abfs", AsyncFsspecStore) | |
# Call during module initialization | |
register_fsspec_implementations() |
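For reference, fsspec.register_implementation is the standard fsspec registry hook: once a class is registered for a scheme, fsspec.filesystem() resolves to it. A minimal sketch (this AsyncFsspecStore is a bare stand-in for the PR's class):

import fsspec
from fsspec.spec import AbstractFileSystem

class AsyncFsspecStore(AbstractFileSystem):
    # Stand-in for the obstore-backed filesystem in this PR.
    protocol = "s3"

# clobber=True replaces whatever implementation fsspec already associates
# with "s3" (e.g. s3fs); without it, re-registering a taken scheme raises.
fsspec.register_implementation("s3", AsyncFsspecStore, clobber=True)

fs = fsspec.filesystem("s3")
assert isinstance(fs, AsyncFsspecStore)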
Code Review Run #0b7f4d
Specify the class properties for each file storage. Signed-off-by: machichima <[email protected]>
Do not remove the protocol from paths other than s3, gs, and abfs. Signed-off-by: machichima <[email protected]>
Codecov Report
Attention: Patch coverage is
Additional details and impacted files:

@@            Coverage Diff            @@
##           master    #3033       +/-   ##
===========================================
- Coverage   82.79%   47.06%    -35.73%
===========================================
  Files           3      202       +199
  Lines         186    21277     +21091
  Branches        0     2740      +2740
===========================================
+ Hits          154    10015      +9861
- Misses         32    10773     +10741
- Partials        0      489       +489

☔ View full report in Codecov by Sentry.
Code Review Agent Run #e101cd
Actionable Suggestions: 3
Review Details
flytekit/core/obstore_filesystem.py
Outdated
connect_timeout = 5
retries = 5
read_timeout = 15
default_block_size = 5 * 2**20
Consider using the DEFAULT_BLOCK_SIZE constant defined on line 9 instead of duplicating the value 5 * 2**20 in ObstoreS3FileSystem. This would improve maintainability and reduce the risk of inconsistencies.
Code suggestion
Check the AI-generated fix before applying
default_block_size = 5 * 2**20

default_block_size = DEFAULT_BLOCK_SIZE
Code Review Run #e101cd
we can remove this and use the one defined at line 9, right?
Sure, sorry, I forgot to use the one on line 9. Just fixed it to default_block_size = DEFAULT_BLOCK_SIZE.
    },
)

kwargs["retries"] = s3_cfg.retries
Consider validating the retries value before assigning it to kwargs. A negative or extremely large value could cause issues.
Code suggestion
Check the AI-generated fix before applying
kwargs["retries"] = s3_cfg.retries | |
if s3_cfg.retries is not None and 0 <= s3_cfg.retries <= 10: | |
kwargs["retries"] = s3_cfg.retries |
Code Review Run #e101cd
support_types = ["s3", "gs", "abfs"]
protocol = get_protocol(path)
if protocol not in support_types:
Consider moving the support_types list to a module-level constant since it represents static configuration data. This would improve maintainability and reusability.
Code suggestion
Check the AI-generated fix before applying
@@ -1,1 +1,3 @@
+SUPPORTED_PROTOCOLS = ["s3", "gs", "abfs"]
+
 def split_path(path: str) -> Tuple[str, str]:
-    support_types = ["s3", "gs", "abfs"]
-    protocol = get_protocol(path)
-    if protocol not in support_types:
+    protocol = get_protocol(path)
+    if protocol not in SUPPORTED_PROTOCOLS:
Code Review Run #e101cd
flytekit/core/data_persistence.py
Outdated
@@ -46,47 +48,128 @@

# Refer to https://github.com/fsspec/s3fs/blob/50bafe4d8766c3b2a4e1fc09669cf02fb2d71454/s3fs/core.py#L198
let's update this link if we're going to change the args.
Sure! I updated the link in the new commit.
store_kwargs["endpoint_url"] = s3_cfg.endpoint | ||
# kwargs["client_kwargs"] = {"endpoint_url": s3_cfg.endpoint} | ||
|
||
store = S3Store.from_env( |
Should we cache these setup-args functions? I think each call to S3Store creates a new client under the hood in the object store library. Let's add lru_cache to this call? @pingsutw
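A sketch of that caching idea, assuming functools.lru_cache over a small factory: S3Store.from_env mirrors the call in this PR's diff, and the config dict must first be frozen into a tuple because lru_cache cannot key on dicts:

from functools import lru_cache

@lru_cache(maxsize=None)
def _cached_s3_store(bucket: str, config_items: tuple):
    # Import deferred so the sketch stays importable without obstore installed.
    from obstore.store import S3Store
    # Re-inflate the hashable tuple back into the config dict obstore expects.
    return S3Store.from_env(bucket, config=dict(config_items))

def get_s3_store(bucket: str, config: dict):
    # Same (bucket, config) pairs now reuse one store/client instance.
    return _cached_s3_store(bucket, tuple(sorted(config.items())))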
Assert that the specific function is called with the provided parameters. Signed-off-by: machichima <[email protected]>
Code Review Agent Run #ffca15
Actionable Suggestions: 6
Additional Suggestions: 4
Review Details
assert (
    "file:///abc/happy/"
), "s3://my-s3-bucket/bucket1/" == local_raw_fp.recursive_paths(
The assertion syntax appears incorrect: the tuple construction and comparison operator placement are malformed, so Python treats the first string as the (always truthy) assert expression and the comparison as the failure message. Consider restructuring the assertion to properly compare the tuple values. A similar issue was also found in tests/flytekit/unit/core/test_data.py (lines 69-71).
Code suggestion
Check the AI-generated fix before applying
assert (
    "file:///abc/happy/"
), "s3://my-s3-bucket/bucket1/" == local_raw_fp.recursive_paths(

assert ("file:///abc/happy/", "s3://my-s3-bucket/bucket1/") == \
    local_raw_fp.recursive_paths(
        "file:///abc/happy/", "s3://my-s3-bucket/bucket1/")
Code Review Run #ffca15
mock_from_env.return_value = mock.Mock()
mock_from_env.assert_called_with(
    "",
    config={
        "aws_allow_http": "true",  # Allow HTTP connections
        "aws_virtual_hosted_style_request": "false",  # Use path-style addressing
    },
)
Consider moving the mock setup before the s3_setup_args call, since mocks should be configured before exercising the code under test.
Code suggestion
Check the AI-generated fix before applying
@@ -242,14 +242,14 @@
 def test_s3_setup_args_env_empty(mock_from_env, mock_os, mock_get_config_file):
     mock_get_config_file.return_value = None
     mock_os.get.return_value = None
+    mock_from_env.return_value = mock.Mock()
     s3c = S3Config.auto()
     kwargs = s3_setup_args(s3c)
-
-    mock_from_env.return_value = mock.Mock()
     mock_from_env.assert_called_with(
         "",
         config={
             "aws_allow_http": "true",  # Allow HTTP connections
             "aws_virtual_hosted_style_request": "false",  # Use path-style addressing
         },
     )
Code Review Run #ffca15
mock_from_env.assert_called_with(
    "",
Consider providing a meaningful value for the empty string parameter in mock_from_env.assert_called_with(). An empty string for what appears to be a path/endpoint parameter may not properly test the intended behavior.
Code suggestion
Check the AI-generated fix before applying
mock_from_env.assert_called_with(
    "",

mock_from_env.assert_called_with(
    "s3://test-bucket",
Code Review Run #ffca15
_ANON = "anon" | ||
_FSSPEC_S3_KEY_ID = "access_key_id" | ||
_FSSPEC_S3_SECRET = "secret_access_key" | ||
_ANON = "skip_signature" |
Consider if changing the _ANON constant from "anon" to "skip_signature" might affect existing code that relies on this value. This appears to be a breaking change in the S3 authentication configuration.
Code suggestion
Check the AI-generated fix before applying
_ANON = "skip_signature" | |
# TODO: Deprecate "anon" in future versions | |
_ANON = "anon" # or support both: _ANON = ("anon", "skip_signature") |
Code Review Run #ffca15
if anonymous:
    store_kwargs[_ANON] = "true"
The _ANON value is being set to the string 'true' in s3_setup_args(), but was previously set to the boolean True. This type inconsistency could cause issues with S3 authentication.
Code suggestion
Check the AI-generated fix before applying
if anonymous:
    store_kwargs[_ANON] = "true"

if anonymous:
    store_kwargs[_ANON] = True
Code Review Run #ffca15
kwargs[_ANON] = anonymous
store_kwargs["tenant_id"] = azure_cfg.tenant_id
if anonymous:
    kwargs[_ANON] = "true"
Consider using a boolean value directly instead of the string 'true' for the _ANON parameter to maintain type consistency. Many systems interpret the string 'true' differently than the boolean True.
Code suggestion
Check the AI-generated fix before applying
kwargs[_ANON] = "true" | |
kwargs[_ANON] = True |
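The type-consistency concern is concrete: every non-empty string is truthy in Python, so a string flag stops behaving like a boolean as soon as anything re-checks it:

# Both strings are truthy, so a naive bool() check cannot tell them apart.
assert bool("true") is True
assert bool("false") is True   # surprising if you expected False

# Booleans round-trip unambiguously.
assert bool(True) is True
assert bool(False) is False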
Code Review Run #ffca15
Tracking issue
Related to flyteorg/flyte#4081
Why are the changes needed?
Use a Rust/PyO3 package, obstore, as the storage backend for cloud storage. This gives a smaller dependency footprint and enables users to use their own versions of s3fs, gsfs, abfs, and so on.
What changes were proposed in this pull request?
Use obstore as the storage backend to replace s3fs, gsfs, and abfs.
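At a high level, the proposed wiring looks roughly like this sketch. S3Store.from_env and the store kwarg mirror code shown elsewhere in this PR; the bucket name and config values are placeholders, and it assumes the PR's AsyncFsspecStore has been registered for the s3 scheme:

import fsspec
from obstore.store import S3Store

# Build an obstore-backed store from environment credentials; config keys
# follow obstore's format (e.g. aws_allow_http), per this PR.
store = S3Store.from_env(
    "my-bucket",
    config={
        "aws_allow_http": "true",
        "aws_virtual_hosted_style_request": "false",
    },
)

# Existing fsspec call sites keep working, now served by obstore underneath.
fs = fsspec.filesystem("s3", store=store)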
How was this patch tested?
Setup process
Screenshots
Performance
Check all the applicable boxes
Related PRs
Docs link
Summary by Bito
Implementation of obstore as the new storage backend for cloud services in Flytekit, replacing direct cloud storage implementations with obstore-based filesystem classes. The changes include enhanced path splitting functionality, S3 retry support, and updated Azure storage configuration parameters. The implementation provides robust bucket handling and async filesystem support while maintaining backward compatibility with existing storage protocols through obstore's configuration format.
Unit tests added: True
Estimated effort to review (1-5, lower is better): 4