Introduce hybrid (CPU) scan for Parquet read [databricks] #11720

Open · wants to merge 34 commits into base: branch-25.02
Conversation

@res-life res-life commented Nov 13, 2024

Introduce hybrid (CPU) scan for Parquet read
This PR leverages Gluten/Velox to perform the Parquet scan on the CPU.

The hybrid feature comprises:

  • Gluten repo: in the internal GitLab repo gluten-public
  • Hybrid MR: in the internal GitLab repo rapids-hybrid-execution, branch 1.2
  • This Spark-Rapids PR

This PR

Add Shims

Builds for all shims (320-324, 330-334, 340-344, 350-353, CDH, and Databricks) and throws a runtime error if running on a CDH or Databricks runtime.

Checks

  • In the Hybrid MR: the Gluten bundle version
  • Scala version is 2.12
  • Java version is 1.8
  • In the Hybrid MR: the architecture is amd64 and the OS is Ubuntu 20.04 or Ubuntu 22.04
  • Spark is not Databricks or CDH
  • The Hybrid jar is in the classpath if Hybrid is enabled (see the sketch after this list)
  • Scan runs properly when the Hybrid jar is not in the classpath and Hybrid is disabled
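A minimal sketch of what the classpath check could look like. The object name `HybridChecks` and the probe class `com.nvidia.spark.rapids.hybrid.HybridBackend` are illustrative assumptions, not the PR's actual API:

```scala
// Illustrative sketch only: the probe class name below is an assumption.
object HybridChecks {
  private val probeClassName = "com.nvidia.spark.rapids.hybrid.HybridBackend"

  // Returns true if the Hybrid jar appears to be on the classpath.
  def hybridJarLoaded(): Boolean =
    try {
      // Do not initialize the class; we only care whether it can be found.
      Class.forName(probeClassName, false, getClass.getClassLoader)
      true
    } catch {
      case _: ClassNotFoundException => false
    }

  // Fail fast when the feature is enabled but the jar is missing,
  // matching the behavior described in the checklist above.
  def validate(hybridEnabled: Boolean): Unit = {
    if (hybridEnabled && !hybridJarLoaded()) {
      throw new RuntimeException(
        "Hybrid scan is enabled but the rapids-4-spark-hybrid jar is not in the classpath")
    }
  }
}
```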

Calls into the Hybrid JNI to perform the Parquet scan

Limitations

Supports more Spark versions than Gluten officially supports

The official Gluten doc says only Spark 3.2.2, 3.3.1, 3.4.2, and 3.5.1 are supported.

This PR supports Spark 3.2.2, 3.3.1, 3.4.2, and 3.5.1 with all UTs passing (for supported data types).

Hybrid supports 19 Spark versions in total (320-324, 330-334, 340-344, 350-353). The doc for the HYBRID_PARQUET_READER config notes that versions other than the ones Gluten officially supports are not fully tested. A config sketch follows below.
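For reference, enabling the feature from user code might look like the following. The config key `spark.rapids.sql.hybrid.parquet.enabled` is an assumption inferred from the HYBRID_PARQUET_READER constant; check the generated configuration docs for the final name:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hybrid-scan-example")
  .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
  // Assumed key for the hybrid Parquet reader; may differ in the final PR.
  .config("spark.rapids.sql.hybrid.parquet.enabled", "true")
  .getOrCreate()

// Parquet reads now go through the hybrid CPU scan when supported.
spark.read.parquet("/path/to/data").show()
```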

tests

| config | jars exist? | result | comment |
|---|---|---|---|
| Hybrid enabled | Hybrid/Gluten jars exist | pass | |
| Hybrid enabled | Hybrid/Gluten jars do not exist | pass | reports that the jar is not in the classpath |
| Hybrid disabled | Hybrid/Gluten jars exist | pass | no error reported |
| Hybrid disabled | Hybrid/Gluten jars do not exist | pass | no error reported |

(A test sketch follows the table.)
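A hedged sketch of how this matrix could be asserted in a unit test. `HybridChecksStub` is a self-contained stand-in for the check sketched earlier, not the PR's actual entry point:

```scala
import org.scalatest.funsuite.AnyFunSuite

// Stand-in for the classpath check; `jarPresent` replaces real classpath probing.
object HybridChecksStub {
  def validate(hybridEnabled: Boolean, jarPresent: Boolean): Unit =
    if (hybridEnabled && !jarPresent) {
      throw new RuntimeException("Hybrid jar is not in the classpath")
    }
}

class HybridJarCheckSuite extends AnyFunSuite {
  test("Hybrid enabled, jars missing: reports the jar is not in the classpath") {
    val e = intercept[RuntimeException] {
      HybridChecksStub.validate(hybridEnabled = true, jarPresent = false)
    }
    assert(e.getMessage.contains("not in the classpath"))
  }

  test("Hybrid disabled, jars missing: no error reported") {
    HybridChecksStub.validate(hybridEnabled = false, jarPresent = false)
  }
}
```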

Signed-off-by: sperlingxx [email protected]
Signed-off-by: Chong Gao [email protected]

res-life commented Nov 13, 2024

This is a draft and may be missing some code changes; I will double-check later.
This cannot pass the build, because the Gluten backends-velox 1.2.0 jar is not deployed to a public Maven repo by the Gluten community.
The build will pass if the Gluten jars are installed locally via maven install.

@res-life res-life requested review from jlowe and sperlingxx November 14, 2024 01:13

@jlowe jlowe left a comment


Please elaborate in the headline and description what this PR is doing. C2C is not a well-known acronym in the project and is not very descriptive.

@sameerz sameerz added the `performance` (A performance related task/issue) label Nov 16, 2024

@revans2 revans2 left a comment


Just a quick look at the code. Nothing too in depth.

@res-life res-life changed the base branch from branch-24.12 to branch-25.02 November 25, 2024 09:53
@res-life res-life marked this pull request as ready for review November 25, 2024 10:25
@res-life

Passed IT. Tested the conventional Spark-Rapids jar and the regular Spark-Rapids jar.
Passed the NDS test.
Will address review comments later.
Will push commits to make an uber jar for all Spark versions.


@revans2 revans2 left a comment


I need to do some manual testing on my own to try and understand what is happening here and how this is all working. It may take a while.

sql-plugin/pom.xml
case MapType(kt, vt, _) if kt.isInstanceOf[MapType] || vt.isInstanceOf[MapType] => false
// For the time being, BinaryType is not supported yet
case _: BinaryType => false
case _ => true
Collaborator:

facebookincubator/velox#9560 I am not an expert, and I don't even know what version of velox we will end up using. It sounds like it is pluggable. But according to this, even the latest version of velox cannot handle bytes/TINYINT. We are also not checking for spaces in the names of columns, among other issues. I know that other implementations fall back for even more things. Should we be concerned about this?

Collaborator Author:

Gluten uses a different velox repo (code link):

VELOX_REPO=https://github.com/oap-project/velox.git
VELOX_BRANCH=gluten-1.2.1

Collaborator:

This will be something we should remember once we switch to using facebookincubator/velox directly.

Collaborator:

My main concern is that if the gluten/velox version we use is pluggable, then we need to have some clear documentation on exactly which version you need to be based off of.

Collaborator:

Yeah, Chong has added hybrid-execution.md to clarify the 1.2.0 version of Gluten.
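To illustrate the kinds of additional fallbacks discussed in this thread (TINYINT, column names containing spaces), a predicate in the style of the quoted diff might grow cases like the following. This is a sketch under those assumptions, not code from the PR:

```scala
import org.apache.spark.sql.types._

// Sketch: fields the native scan would need to fall back for.
def unsupportedByNativeScan(field: StructField): Boolean = field.dataType match {
  // Velox reportedly cannot handle bytes/TINYINT (facebookincubator/velox#9560).
  case ByteType => true
  // BinaryType is not supported yet, mirroring the quoted diff.
  case BinaryType => true
  // Nested MapType patterns that may return incorrect results.
  case ArrayType(_: MapType, _) => true
  case MapType(kt, vt, _) if kt.isInstanceOf[MapType] || vt.isInstanceOf[MapType] => true
  case _ =>
    // Column names containing spaces are another reported problem area.
    field.name.contains(" ")
}
```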

@res-life res-life marked this pull request as draft November 26, 2024 00:59
@winningsix winningsix changed the title from "Merge C2C code to main" to "Introduce hybrid (CPU) scan for Parquet read" Nov 26, 2024
@res-life

build

@res-life

TODO: the Scala 2.13 build is blocked.
To avoid implementing shim code for both Scala 2.12 and Scala 2.13, we plan to build a hybrid 2.13 jar; the artifact name will be rapids-4-spark-hybrid_2.13.

Signed-off-by: Chong Gao <[email protected]>
@res-life

res-life commented Jan 6, 2025

build

1 similar comment
@res-life

build

@res-life res-life requested a review from a team as a code owner January 13, 2025 06:24
@res-life

build

@res-life res-life changed the title from "Introduce hybrid (CPU) scan for Parquet read" to "Introduce hybrid (CPU) scan for Parquet read [databricks]" Jan 13, 2025
@res-life

res-life commented Jan 13, 2025

The build for Scala 2.13 works now.
Also tested Java 17, and it works.
All comments are addressed; waiting for the premerge to pass.

@res-life

Premerge passed.
Triggering the build again to test Databricks.

@res-life

build

revans2 previously approved these changes Jan 13, 2025

@revans2 revans2 left a comment


At this point my only concerns are with some "nice to have" additions to the documentation and some nits in the code (mostly around comments and naming).

- Only supports V1 Parquet data source.
- Only supports Scala 2.12; does not support Scala 2.13.
- Support Spark 3.2.2, 3.3.1, 3.4.2, and 3.5.1 like [Gluten supports](https://github.com/apache/incubator-gluten/releases/tag/v1.2.0),
other Spark versions 32x, 33x, 34x, 35x also work, but are not fully tested.
Collaborator:

nit: Can we add a few comments about the cases where this appears to be better than the current Parquet scan, so that customers know whether it is worth the effort to try it out?

Do we need/want to mention some of the limitations with different data types? And are there any Gluten-specific configs they need to set to make this work for them?


// release the native instance when upstreaming iterator has been exhausted
val detailedMetrics = c.close()
val tID = TaskContext.get().taskAttemptId()
logInfo(s"task[$tID] CoalesceNativeConverter finished:\n$detailedMetrics")
Collaborator:

nit: does this need to be at the info level?
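For illustration, one way to address this nit while still releasing the native instance unconditionally. This assumes the enclosing class mixes in org.apache.spark.internal.Logging (as the logInfo call suggests) and is a sketch, not the PR's actual resolution:

```scala
// Always release the native instance; only the logging is demoted to debug.
val detailedMetrics = c.close()
if (log.isDebugEnabled) {
  val tID = TaskContext.get().taskAttemptId()
  logDebug(s"task[$tID] CoalesceNativeConverter finished:\n$detailedMetrics")
}
```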


// Currently, under some circumstance, the native backend may return incorrect results
// over MapType nested by nested types. To guarantee the correctness, disable this pattern
// entirely.
// TODO: figure out the root cause and support it
Collaborator:

nit: is there an issue that you can point to here?


case ArrayType(_: MapType, _) => true
case MapType(_: MapType, _, _) | MapType(_, _: MapType, _) => true
case st: StructType if st.exists(_.dataType.isInstanceOf[MapType]) => true
// TODO: support DECIMAL with negative scale
Collaborator:

nit: Is there an issue you can point to here? Just FYI, I think this is super low priority. Spark has disabled this by default, so I don't see it as a big deal.


case st: StructType if st.exists(_.dataType.isInstanceOf[MapType]) => true
// TODO: support DECIMAL with negative scale
case dt: DecimalType if dt.scale < 0 => true
// TODO: support DECIMAL128
Collaborator:

nit: again having an issue to point to is helpful


case dt: DecimalType if dt.scale < 0 => true
// TODO: support DECIMAL128
case dt: DecimalType if dt.precision > DType.DECIMAL64_MAX_PRECISION => true
// TODO: support BinaryType
Collaborator:

nit: again having an issue to point to would be great.


case _ => false
})
}
// TODO: supports BucketedScan
Collaborator:

nit: once more having an issue to point to would be great.


* Check that the Spark distribution is not CDH or Databricks,
* and report an error if it is.
*/
private def checkNotRunningCDHorDatabricks(): Unit = {
Collaborator:

nit: I would prefer to call these kinds of methods assertSomething instead of checkSomething. To me it implies more strongly that an exception would be thrown in the wrong case.
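A minimal sketch of the suggested assert-style naming. The two detection checks are hypothetical placeholders, since the PR's real platform-detection helpers are not shown in this thread:

```scala
// Sketch: assert-style rename of checkNotRunningCDHorDatabricks.
private def assertNotRunningCDHorDatabricks(): Unit = {
  // Placeholder detection: Databricks sets this env var on its clusters.
  val isDatabricks = sys.env.contains("DATABRICKS_RUNTIME_VERSION")
  // Placeholder: a real check would inspect the Spark version string for CDH.
  val isCDH = org.apache.spark.SPARK_VERSION.contains("cdh")
  if (isDatabricks || isCDH) {
    throw new RuntimeException(
      "Hybrid scan is not supported on CDH or Databricks runtimes")
  }
}
```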


@res-life

build

@res-life

build

@res-life

build

@res-life

build

@res-life

Premerge with Databricks passed.

Labels
performance (A performance related task/issue)
Projects
None yet
7 participants