[WIP] Add support for Hyper Log Log PLus Plus(HLL++) #11638

Draft
wants to merge 6 commits into
base: branch-25.02

Conversation

res-life
Collaborator

@res-life res-life commented Oct 21, 2024

closes #5199

depends on

Description

Spark approx_count_distinct description link
Spark accepts one column (which can be a nested column) and a double literal, relativeSD.
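
For context, relativeSD controls the HLL++ precision, which in turn decides how many registers each sketch has. A minimal sketch of that mapping, assuming the formula used in Spark's HyperLogLogPlusPlusHelper (the object and method names below are hypothetical, for illustration only):

```scala
object HllPrecision {
  // Assumption: mirrors the precision formula in Spark's HyperLogLogPlusPlusHelper.
  // A smaller relativeSD (tighter error bound) yields a larger precision.
  def precisionFor(relativeSD: Double): Int =
    math.ceil(2.0 * math.log(1.106 / relativeSD) / math.log(2.0)).toInt

  // Each sketch holds 2^precision registers.
  def numRegisters(precision: Int): Int = 1 << precision
}

// Usage: Spark's default relativeSD is 0.05.
val p = HllPrecision.precisionFor(0.05)  // 9
val m = HllPrecision.numRegisters(p)     // 512
```

So with the default relativeSD, each sketch has 2^9 = 512 registers.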

Depending on JNI PR:
NVIDIA/spark-rapids-jni#2522

TODO

  • The NullType reduction case reports an error:
struct(longs) is not supported for GPU processing yet

Perf test

// group by
import org.apache.spark.sql.functions
spark.range(10000000).repartition(5).withColumn("m", functions.expr("id % 10")).createOrReplaceTempView("tab")
spark.time(spark.sql("select m, APPROX_COUNT_DISTINCT(id) from tab group by m").show())

// reduction
spark.range(10000000).repartition(5).createOrReplaceTempView("tab")
spark.time(spark.sql("select APPROX_COUNT_DISTINCT(id) from tab ").show())

| num_groups | CPU time (hot runs) | GPU time (hot runs) | speedup |
|---|---|---|---|
| 10 | 1106ms, 1020ms, 1059ms | 196ms, 208ms, 188ms | 5.38x |
| 1,000,000 | 5135ms, 5307ms, 5487ms | 1447ms, 1565ms, 1497ms | 3.53x |
| reduction | 942ms, 1041ms, 973ms | 169ms, 165ms, 180ms | 5.75x |
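
The speedup column can be reproduced from the raw timings. A small sketch, assuming the speedup is the mean CPU time divided by the mean GPU time over the three hot runs (the helper names are hypothetical):

```scala
object Speedup {
  // Arithmetic mean of the hot-run timings, in milliseconds.
  def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  // Assumption: speedup = mean CPU time / mean GPU time.
  def speedup(cpuMs: Seq[Double], gpuMs: Seq[Double]): Double =
    mean(cpuMs) / mean(gpuMs)
}

// Usage with the reduction row:
val reduction = Speedup.speedup(Seq(942.0, 1041.0, 973.0), Seq(169.0, 165.0, 180.0))
println(f"$reduction%.2fx")  // 5.75x
```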

correctness

The results are identical between CPU and GPU.

Signed-off-by: Chong Gao [email protected]

@res-life res-life requested a review from ttnghia October 21, 2024 12:46
@res-life res-life force-pushed the hll branch 2 times, most recently from d42d80a to 1945192 Compare October 23, 2024 01:34
}
}

case class GpuHLL(childExpr: Expression, relativeSD: Double)
Collaborator

Let's call it by its full name, GpuHyperLogLogPlusPlus, to better reflect the CPU version.

Collaborator Author

Done

ReductionAggregation.HLL(numRegistersPerSketch), DType.STRUCT)
override lazy val groupByAggregate: GroupByAggregation =
GroupByAggregation.HLL(numRegistersPerSketch)
override val name: String = "CudfHLL"
Collaborator

Not sure if "PlusPlus" is necessary.

Suggested change
override val name: String = "CudfHLL"
override val name: String = "CudfHyperLogLogPlusPlus"

Collaborator Author

Done.

@res-life res-life changed the title [Do not review] Add Hyper Log Log PLus Plus(HLL++) [Do not review] Add support for Hyper Log Log PLus Plus(HLL++) Oct 24, 2024
@res-life res-life force-pushed the hll branch 4 times, most recently from 0a4939f to eb00c2b Compare October 30, 2024 12:37
@res-life res-life changed the title [Do not review] Add support for Hyper Log Log PLus Plus(HLL++) Add support for Hyper Log Log PLus Plus(HLL++) Oct 31, 2024
Signed-off-by: Chong Gao <[email protected]>
@res-life res-life changed the base branch from branch-24.12 to branch-25.02 November 25, 2024 09:53
@res-life
Collaborator Author

Ready for review, except for the test cases.

Collaborator

@revans2 revans2 left a comment

Looks good

expr[HyperLogLogPlusPlus](
"Aggregation approximate count distinct",
ExprChecks.reductionAndGroupByAgg(TypeSig.LONG, TypeSig.LONG,
Seq(ParamCheck("input", TypeSig.cpuAtomics, TypeSig.all))),
Collaborator

nit: Using cpuAtomics for a GPU field gets to be kind of confusing. Could you please create a gpuAtomics instead?

Collaborator Author

Will update to support map, array and list because this is merged: NVIDIA/spark-rapids-jni#2575

@res-life res-life changed the title Add support for Hyper Log Log PLus Plus(HLL++) [WIP] Add support for Hyper Log Log PLus Plus(HLL++) Dec 13, 2024
@res-life
Collaborator Author

Explanation of HLLPP:
In general, an HLLPP sketch is a block of memory used to estimate the number of distinct values; it contains several integer registers. The number of registers is determined by the precision parameter:
num_of_registers_in_a_sketch = pow(2, precision)
e.g.: precision = 9, then num_of_registers_in_a_sketch = 2^9 = 512
Each integer register stores the number of leading zero bits in a hash code.
Because Spark uses xxhash64 to compute the hash code, the hash code is 64 bits.
So the max value of a register is 64. Refer to link:

  /**
   * The number of bits that is required per register.
   *
   * This number is determined by the maximum number of leading binary zeros a hashcode can
   * produce. This is equal to the number of bits the hashcode returns. The current
   * implementation uses a 64-bit hashcode, this means 6-bits are (at most) needed to store the
   * number of leading zeros.
   */
  val REGISTER_SIZE = 6
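
To make the register description concrete, here is a simplified model of how one 64-bit hash updates a sketch. It assumes the usual HLL++ scheme (the top `precision` bits pick the register, the leading zeros of the remaining bits plus one give the candidate value); the object and method names are hypothetical and this is not the code in this PR:

```scala
object HllRegisterUpdate {
  // The top `precision` bits of the hash select the register.
  def registerIndex(hash: Long, precision: Int): Int =
    (hash >>> (64 - precision)).toInt

  // Leading zeros of the remaining bits, plus one. The pad bit caps the
  // result at 64 - precision + 1, so it always fits in 6 bits.
  def registerValue(hash: Long, precision: Int): Int = {
    val padded = (hash << precision) | (1L << (precision - 1))
    java.lang.Long.numberOfLeadingZeros(padded) + 1
  }

  // A register keeps the maximum value seen for its index.
  def update(registers: Array[Int], hash: Long, precision: Int): Unit = {
    val idx = registerIndex(hash, precision)
    registers(idx) = math.max(registers(idx), registerValue(hash, precision))
  }
}

// Usage: one register array per sketch, one update per hashed input value.
val registers = new Array[Int](1 << 9)
HllRegisterUpdate.update(registers, 0L, 9)  // registers(0) becomes 56
```

With precision = 9 the largest possible register value is 64 - 9 + 1 = 56, which is why 6 bits per register are sufficient.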

6 bits are enough to store a register value.
Spark uses long columns to store the HLLPP sketch.
e.g.: precision = 9, num_of_registers_in_a_sketch = 512
Because a register value needs at most 6 bits, a long can hold 64 / 6 = 10 register values.
So Spark uses 512 / 10 + 1 = 52 long columns to store the HLLPP sketch column.
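
The packing arithmetic above can be sketched as follows. This is an illustrative model under the stated layout (10 six-bit registers per long, the top 4 bits of each long unused); the names are hypothetical and are not taken from Spark or from this PR:

```scala
object HllPacking {
  val RegisterSize = 6                              // bits per register
  val RegistersPerWord: Int = 64 / RegisterSize     // 10 registers per long
  val RegisterMask: Long = (1L << RegisterSize) - 1 // 0x3F

  // Number of longs needed for one sketch, e.g. 512 / 10 + 1 = 52.
  def numWords(precision: Int): Int =
    (1 << precision) / RegistersPerWord + 1

  // Read the 6-bit register at position idx.
  def getRegister(words: Array[Long], idx: Int): Int = {
    val shift = RegisterSize * (idx % RegistersPerWord)
    ((words(idx / RegistersPerWord) >>> shift) & RegisterMask).toInt
  }

  // Overwrite the 6-bit register at position idx, leaving its neighbours intact.
  def setRegister(words: Array[Long], idx: Int, value: Int): Unit = {
    val w = idx / RegistersPerWord
    val shift = RegisterSize * (idx % RegistersPerWord)
    words(w) = (words(w) & ~(RegisterMask << shift)) |
      ((value.toLong & RegisterMask) << shift)
  }
}

// Usage: a precision-9 sketch occupies 52 longs.
val words = new Array[Long](HllPacking.numWords(9))
HllPacking.setRegister(words, 511, 56)
HllPacking.getRegister(words, 511)  // 56
```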
This PR handles the following conversions:

cuDF uses a Struct<long, ..., long> column to do the aggregation
Convert long columns to a Struct<long, ..., long> column
Convert a Struct<long, ..., long> column to long columns

TODO:
Add more test cases
Support nested types: after #11859

@revans2 could you have a look first?

@res-life
Collaborator Author

  • [DONE] Add more test cases
  • [DONE] Support nested types
  • [DONE] Check the stack depth GPU will use does not exceed threshold


3 participants