Introduce hybrid (CPU) scan for Parquet read [databricks] #11720

Open · wants to merge 34 commits into base: branch-25.02
Conversation

@res-life res-life commented Nov 13, 2024

Introduce hybrid (CPU) scan for Parquet read
This PR leverages Gluten/Velox to perform the Parquet scan on the CPU.

The hybrid feature comprises:

  • Gluten repo: in the internal GitLab repo gluten-public
  • Hybrid MR: in the internal GitLab repo rapids-hybrid-execution, branch 1.2
  • This Spark-Rapids PR

This PR

Add Shims

Builds for all shims (320-324, 330-334, 340-344, 350-353, CDH, and Databricks) and throws a runtime error if running on a CDH or Databricks runtime.

Checks

  • In the Hybrid MR: the Gluten bundle version
  • Scala version is 2.12
  • Java version is 1.8
  • In the Hybrid MR: the architecture is amd64 and the OS is Ubuntu 20.04 or Ubuntu 22.04
  • Spark is not Databricks or CDH
  • The Hybrid jar is in the classpath if Hybrid is enabled (see the sketch after this list)
  • Scan runs properly when the Hybrid jar is not in the classpath and Hybrid is disabled
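A minimal sketch of what the classpath check could look like. The object name `HybridChecks` and the probe class `com.nvidia.spark.rapids.hybrid.HybridBackend` are illustrative assumptions, not the PR's actual API:

```scala
// Illustrative sketch only: the probe class name below is an assumption.
object HybridChecks {
  private val probeClassName = "com.nvidia.spark.rapids.hybrid.HybridBackend"

  // Returns true if the Hybrid jar appears to be on the classpath.
  def hybridJarLoaded(): Boolean =
    try {
      // Do not initialize the class; we only care whether it can be found.
      Class.forName(probeClassName, false, getClass.getClassLoader)
      true
    } catch {
      case _: ClassNotFoundException => false
    }

  // Fail fast when the feature is enabled but the jar is missing,
  // matching the behavior described in the checklist above.
  def validate(hybridEnabled: Boolean): Unit = {
    if (hybridEnabled && !hybridJarLoaded()) {
      throw new RuntimeException(
        "Hybrid scan is enabled but the rapids-4-spark-hybrid jar is not in the classpath")
    }
  }
}
```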

Calls into the Hybrid JNI to perform the Parquet scan

Limitations

Supports more Spark versions than Gluten officially supports

The official Gluten doc says only Spark 3.2.2, 3.3.1, 3.4.2, and 3.5.1 are supported.

This PR supports Spark 3.2.2, 3.3.1, 3.4.2, and 3.5.1 with all UTs passing (for supported data types).

Hybrid supports 19 Spark versions in total (320-324, 330-334, 340-344, 350-353). The doc for the HYBRID_PARQUET_READER config notes that versions other than the ones Gluten officially supports are not fully tested. A config sketch follows below.
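For reference, enabling the feature from user code might look like the following. The config key `spark.rapids.sql.hybrid.parquet.enabled` is an assumption inferred from the HYBRID_PARQUET_READER constant; check the generated configuration docs for the final name:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hybrid-scan-example")
  .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
  // Assumed key for the hybrid Parquet reader; may differ in the final PR.
  .config("spark.rapids.sql.hybrid.parquet.enabled", "true")
  .getOrCreate()

// Parquet reads now go through the hybrid CPU scan when supported.
spark.read.parquet("/path/to/data").show()
```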

tests

| config | jars exist? | result | comment |
|---|---|---|---|
| Hybrid enabled | Hybrid/Gluten jars exist | pass | |
| Hybrid enabled | Hybrid/Gluten jars do not exist | pass | reports that the jar is not in the classpath |
| Hybrid disabled | Hybrid/Gluten jars exist | pass | no error reported |
| Hybrid disabled | Hybrid/Gluten jars do not exist | pass | no error reported |

(A test sketch follows the table.)
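A hedged sketch of how this matrix could be asserted in a unit test. `HybridChecksStub` is a self-contained stand-in for the check sketched earlier, not the PR's actual entry point:

```scala
import org.scalatest.funsuite.AnyFunSuite

// Stand-in for the classpath check; `jarPresent` replaces real classpath probing.
object HybridChecksStub {
  def validate(hybridEnabled: Boolean, jarPresent: Boolean): Unit =
    if (hybridEnabled && !jarPresent) {
      throw new RuntimeException("Hybrid jar is not in the classpath")
    }
}

class HybridJarCheckSuite extends AnyFunSuite {
  test("Hybrid enabled, jars missing: reports the jar is not in the classpath") {
    val e = intercept[RuntimeException] {
      HybridChecksStub.validate(hybridEnabled = true, jarPresent = false)
    }
    assert(e.getMessage.contains("not in the classpath"))
  }

  test("Hybrid disabled, jars missing: no error reported") {
    HybridChecksStub.validate(hybridEnabled = false, jarPresent = false)
  }
}
```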

Signed-off-by: sperlingxx [email protected]
Signed-off-by: Chong Gao [email protected]

res-life commented Nov 13, 2024

This is a draft and may be missing some code changes; I will double-check later.
This cannot pass the build, because the Gluten backends-velox 1.2.0 jar is not deployed to a public Maven repo by the Gluten community.
The build will pass if the Gluten jars are installed locally via maven install.

@res-life res-life requested review from jlowe and sperlingxx November 14, 2024 01:13

@jlowe jlowe left a comment


Please elaborate in the headline and description what this PR is doing. C2C is not a well-known acronym in the project and is not very descriptive.

@sameerz sameerz added the `performance` (A performance related task/issue) label Nov 16, 2024

@revans2 revans2 left a comment


Just a quick look at the code. Nothing too in depth.

@res-life res-life changed the base branch from branch-24.12 to branch-25.02 November 25, 2024 09:53
@res-life res-life marked this pull request as ready for review November 25, 2024 10:25
@res-life

Passed IT. Tested the conventional Spark-Rapids jar and the regular Spark-Rapids jar.
Passed the NDS test.
Will address review comments later.
Will push commits to make an uber jar for all Spark versions.


@revans2 revans2 left a comment


I need to do some manual testing on my own to try and understand what is happening here and how this is all working. It may take a while.

sql-plugin/pom.xml
case MapType(kt, vt, _) if kt.isInstanceOf[MapType] || vt.isInstanceOf[MapType] => false
// For the time being, BinaryType is not supported yet
case _: BinaryType => false
case _ => true
Collaborator:

facebookincubator/velox#9560 I am not an expert, and I don't even know what version of velox we will end up using. It sounds like it is pluggable. But according to this, even the latest version of velox cannot handle bytes/TINYINT. We are also not checking for spaces in the names of columns, among other issues. I know that other implementations fall back for even more things. Should we be concerned about this?

Collaborator Author:

Gluten uses a different velox repo (code link):

VELOX_REPO=https://github.com/oap-project/velox.git
VELOX_BRANCH=gluten-1.2.1

Collaborator:

This will be something we should remember once we switch to using facebookincubator/velox directly.

Collaborator:

My main concern is that if the gluten/velox version we use is pluggable, then we need to have some clear documentation on exactly which version you need to be based off of.

Collaborator:

Yeah, Chong has added hybrid-execution.md to clarify the 1.2.0 version of Gluten.
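To illustrate the kinds of additional fallbacks discussed in this thread (TINYINT, column names containing spaces), a predicate in the style of the quoted diff might grow cases like the following. This is a sketch under those assumptions, not code from the PR:

```scala
import org.apache.spark.sql.types._

// Sketch: fields the native scan would need to fall back for.
def unsupportedByNativeScan(field: StructField): Boolean = field.dataType match {
  // Velox reportedly cannot handle bytes/TINYINT (facebookincubator/velox#9560).
  case ByteType => true
  // BinaryType is not supported yet, mirroring the quoted diff.
  case BinaryType => true
  // Nested MapType patterns that may return incorrect results.
  case ArrayType(_: MapType, _) => true
  case MapType(kt, vt, _) if kt.isInstanceOf[MapType] || vt.isInstanceOf[MapType] => true
  case _ =>
    // Column names containing spaces are another reported problem area.
    field.name.contains(" ")
}
```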

@res-life res-life marked this pull request as draft November 26, 2024 00:59
@winningsix winningsix changed the title from "Merge C2C code to main" to "Introduce hybrid (CPU) scan for Parquet read" Nov 26, 2024
@res-life

build

@res-life

TODO: the Scala 2.13 build is blocked.
To avoid implementing shim code for both Scala 2.12 and Scala 2.13, we plan to build a hybrid 2.13 jar; the artifact name will be rapids-4-spark-hybrid_2.13.

Signed-off-by: Chong Gao <[email protected]>
@res-life

res-life commented Jan 6, 2025

build

1 similar comment
@res-life

build

@res-life res-life requested a review from a team as a code owner January 13, 2025 06:24
@res-life

build

@res-life res-life changed the title from "Introduce hybrid (CPU) scan for Parquet read" to "Introduce hybrid (CPU) scan for Parquet read [databricks]" Jan 13, 2025
@res-life

res-life commented Jan 13, 2025

The build for Scala 2.13 works now.
Also tested Java 17, and it works.
All comments are addressed; waiting for the premerge to pass.

@res-life

Premerge passed.
Triggering the build again to test Databricks.

@res-life

build

revans2 previously approved these changes Jan 13, 2025

@revans2 revans2 left a comment


At this point my only concerns are with some "nice to have" additions to the documentation and some nits in the code (mostly around comments and naming).

- Only supports V1 Parquet data source.
- Only supports Scala 2.12; does not support Scala 2.13.
- Support Spark 3.2.2, 3.3.1, 3.4.2, and 3.5.1 like [Gluten supports](https://github.com/apache/incubator-gluten/releases/tag/v1.2.0),
other Spark versions 32x, 33x, 34x, 35x also work, but are not fully tested.
Collaborator:

nit: Can we add a few comments about the cases where this appears to be better than the current Parquet scan, so that customers know whether it is worth the effort to try it out?

Do we need/want to mention some of the limitations with different data types? And are there any Gluten-specific configs they need to set to make this work for them?


// release the native instance when upstreaming iterator has been exhausted
val detailedMetrics = c.close()
val tID = TaskContext.get().taskAttemptId()
logInfo(s"task[$tID] CoalesceNativeConverter finished:\n$detailedMetrics")
Collaborator:

nit: does this need to be at the info level?
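For illustration, one way to address this nit while still releasing the native instance unconditionally. This assumes the enclosing class mixes in org.apache.spark.internal.Logging (as the logInfo call suggests) and is a sketch, not the PR's actual resolution:

```scala
// Always release the native instance; only the logging is demoted to debug.
val detailedMetrics = c.close()
if (log.isDebugEnabled) {
  val tID = TaskContext.get().taskAttemptId()
  logDebug(s"task[$tID] CoalesceNativeConverter finished:\n$detailedMetrics")
}
```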


// Currently, under some circumstance, the native backend may return incorrect results
// over MapType nested by nested types. To guarantee the correctness, disable this pattern
// entirely.
// TODO: figure out the root cause and support it
Collaborator:

nit: is there an issue that you can point to here?


case ArrayType(_: MapType, _) => true
case MapType(_: MapType, _, _) | MapType(_, _: MapType, _) => true
case st: StructType if st.exists(_.dataType.isInstanceOf[MapType]) => true
// TODO: support DECIMAL with negative scale
Collaborator:

nit: Is there an issue you can point to here? Just FYI, I think this is super low priority. Spark has disabled this by default, so I don't see it as a big deal.


case st: StructType if st.exists(_.dataType.isInstanceOf[MapType]) => true
// TODO: support DECIMAL with negative scale
case dt: DecimalType if dt.scale < 0 => true
// TODO: support DECIMAL128
Collaborator:

nit: again having an issue to point to is helpful


case dt: DecimalType if dt.scale < 0 => true
// TODO: support DECIMAL128
case dt: DecimalType if dt.precision > DType.DECIMAL64_MAX_PRECISION => true
// TODO: support BinaryType
Collaborator:

nit: again having an issue to point to would be great.


case _ => false
})
}
// TODO: supports BucketedScan
Collaborator:

nit: once more having an issue to point to would be great.


* Check that the Spark distribution is not CDH or Databricks,
* and report an error if it is.
*/
private def checkNotRunningCDHorDatabricks(): Unit = {
Collaborator:

nit: I would prefer to call these kinds of methods assertSomething instead of checkSomething. To me it implies more strongly that an exception would be thrown in the wrong case.
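A minimal sketch of the suggested assert-style naming. The two detection checks are hypothetical placeholders, since the PR's real platform-detection helpers are not shown in this thread:

```scala
// Sketch: assert-style rename of checkNotRunningCDHorDatabricks.
private def assertNotRunningCDHorDatabricks(): Unit = {
  // Placeholder detection: Databricks sets this env var on its clusters.
  val isDatabricks = sys.env.contains("DATABRICKS_RUNTIME_VERSION")
  // Placeholder: a real check would inspect the Spark version string for CDH.
  val isCDH = org.apache.spark.SPARK_VERSION.contains("cdh")
  if (isDatabricks || isCDH) {
    throw new RuntimeException(
      "Hybrid scan is not supported on CDH or Databricks runtimes")
  }
}
```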


@res-life

build

@res-life

build

@res-life

build

@res-life

build

@res-life

Premerge with Databricks passed.

Labels
performance (A performance related task/issue)
Projects
None yet
7 participants