Avoid local dir creation, ensure dense array ordering during UMAP save() #823
Conversation
Signed-off-by: Rishi Chandra <[email protected]>
build
python/src/spark_rapids_ml/umap.py
Outdated
```python
pd.DataFrame(
    {
        "row_id": range(array.shape[0]),
        "data": array.tolist(),
```
Is this better than `list(array)`? Pretty sure we create pandas DataFrame array columns from np arrays (and vice versa) elsewhere in our code; it would be good to be consistent and/or use the best approach throughout.
While `list(array)` is more efficient (since `tolist()` does a deep conversion of each row to Python lists), Spark will throw an error with `list(array)` if `spark.sql.execution.arrow.pyspark.enabled=false`, since pyarrow would no longer handle the numpy -> arrow array conversion.
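For illustration, a minimal standalone comparison of the two conversions (not code from the PR):

```python
import numpy as np

array = np.arange(6, dtype=np.float32).reshape(2, 3)

# tolist() recursively converts every element to plain Python types
print(array.tolist())            # [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]]
print(type(array.tolist()[0]))   # <class 'list'>

# list() only unpacks the first axis; each row is still an np.ndarray,
# which Spark's non-Arrow (pickle-based) conversion path does not accept
# for an array<float> schema column
print(list(array))
print(type(list(array)[0]))      # <class 'numpy.ndarray'>
```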
We pretty much require that to be enabled to get good data transfer from the JVM to Python workers.
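For context, the standard Spark conf in question can be set like this (whether a given deployment enables it by default varies):

```python
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
```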
```python
    }
),
schema=schema,
)
```
Below and elsewhere in this class, is it correct to use overwrite when writing? This might run counter to the MLWriter overwrite API: if `overwrite()` is not invoked, a user would not expect overwriting to be allowed.
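For context, a minimal sketch of the two overwrite mechanisms being contrasted (standard PySpark APIs, assuming an existing `df`, `model`, and `path`; not code from this PR):

```python
# DataFrame writer: mode("overwrite") silently replaces existing files
df.write.mode("overwrite").parquet(path)

# MLWriter: overwriting is opt-in for the caller
model.write().save(path)              # raises if path already exists
model.write().overwrite().save(path)  # caller explicitly allows overwrite
```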
Done, thx
python/src/spark_rapids_ml/umap.py
Outdated
```diff
-data_df = spark.read.parquet(df_path)
-return np.array(data_df.collect(), dtype=np.float32)
+data_df = spark.read.parquet(df_path).orderBy("row_id")
+return np.array([row.data for row in data_df.collect()], dtype=np.float32)
```
`toPandas` might be better here, followed by `np.array(list(data_pandas_df.data))`.
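A sketch of what that suggestion might look like in place (names taken from the surrounding diff; treat as illustrative, not the final implementation):

```python
data_pandas_df = spark.read.parquet(df_path).orderBy("row_id").toPandas()
return np.array(list(data_pandas_df.data), dtype=np.float32)
```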
```diff
@@ -1495,8 +1504,6 @@ def write_dense_array(array: np.ndarray, df_path: str) -> None:
 assert model_attributes is not None

-data_path = os.path.join(path, "data")
-if not os.path.exists(data_path):
```
Would be good to have a test that checks for expected files and directories?
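A rough sketch of such a test (the fixture name and the exact saved layout are assumptions, not the repo's actual structure):

```python
import os

def test_umap_save_layout(tmp_path, umap_model):
    # Hypothetical fixture 'umap_model'; 'metadata' and 'data' subdirs assumed.
    path = str(tmp_path / "umap_model")
    umap_model.write().save(path)
    saved = set(os.listdir(path))
    assert "metadata" in saved
    assert "data" in saved
```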
```diff
@@ -1547,8 +1554,8 @@ def read_sparse_array(
 return scipy.sparse.csr_matrix((data, indices, indptr), shape=csr_shape)

 def read_dense_array(df_path: str) -> np.ndarray:
-    data_df = spark.read.parquet(df_path)
-    return np.array(data_df.collect(), dtype=np.float32)
+    data_df = spark.read.parquet(df_path).orderBy("row_id")
```
I wonder if there is a test for the ordering, one that would fail if the orderBy was omitted.
A multi-GPU env (e.g., a DGX) where Spark's default parallelism is >1 would have caught it; I should have tested there with the last PR.
Forcing >1 parallelism would require changing CleanSparkSession to allow a new conf to override the default conf; not sure that's worth it.
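For reference, a sketch of an ordering round-trip test (the helper names come from the diff; exposing them at module level and having a session with parallelism >1 are both assumptions):

```python
import numpy as np

def test_dense_array_roundtrip_order(spark, tmp_path):
    # Only catches the bug when parallelism > 1, so the parquet data spans
    # multiple part files that can be read back in a different order.
    array = np.random.rand(1000, 8).astype(np.float32)
    df_path = str(tmp_path / "dense")
    write_dense_array(array, df_path)     # helper from the diff
    restored = read_dense_array(df_path)
    assert np.array_equal(array, restored)
```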
build
👍