Add support for org.apache.spark.sql.catalyst.expressions.Bin #2760

Open · wants to merge 6 commits into branch-25.02

Conversation

ustcfy (Collaborator) commented Jan 13, 2025

Contribute to NVIDIA/spark-rapids#11648

This PR adds support for org.apache.spark.sql.catalyst.expressions.Bin.
It takes a long and outputs its binary representation.
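For reference, a minimal host-side C++ sketch of those semantics (illustrative only, not the plugin's kernel): Bin emits the two's-complement bits of a 64-bit value, MSB first, with leading zeros suppressed, so 0 maps to "0" and -1 maps to sixty-four '1's.

```cpp
#include <cstdint>
#include <string>

// Host-side reference of Bin's semantics (illustrative, not the plugin
// code): emit the two's-complement bits of a long, MSB first, with
// leading zeros suppressed.
std::string long_to_binary(int64_t v)
{
  uint64_t bits = static_cast<uint64_t>(v);
  if (bits == 0) { return "0"; }
  std::string out;
  bool seen_one = false;
  for (int i = 63; i >= 0; --i) {
    bool bit = (bits >> i) & 1u;
    seen_one = seen_one || bit;
    if (seen_one) { out.push_back(bit ? '1' : '0'); }
  }
  return out;
}
```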

Signed-off-by: Yan Feng <[email protected]>
@thirtiseven thirtiseven self-requested a review January 13, 2025 08:48
Signed-off-by: Yan Feng <[email protected]>
thirtiseven (Collaborator) left a comment:

LGTM, would love to see some Java tests.

#include <cudf/column/column_device_view.cuh>
#include <cudf/column/column_factories.hpp>
#include <cudf/detail/null_mask.hpp>
#include <cudf/detail/nvtx/ranges.hpp>
Collaborator:

Useless include (not sure)?


TEST_F(LongToBinaryStringTests, FromLongToBinary)
{
auto const longs = cudf::test::fixed_width_column_wrapper<int64_t>{0L, 1L, 10L, -0L, -1L};
Collaborator:

Could we add some edge cases like null, LONG_MAX, and LONG_MIN?

Collaborator:

do we have a test case at plugin level to make sure Bin(13.3) returns 1101 ?

Collaborator Author:

do we have a test case at plugin level to make sure Bin(13.3) returns 1101 ?

I tested it locally, and Bin(13.3) indeed returns 1101. I will soon submit the plugin PR.

Collaborator:

This is really a Spark test, not a good test for binary ops. Spark only accepts a long as the input to bin:

https://github.com/apache/spark/blob/3569e768e657d4e28ee7520808ec910cdff2b099/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/mathExpressions.scala#L1010

So any floating-point input gets a cast to long inserted before bin is called, which means that test is really a test that Spark is doing the right thing. Even then, it would probably be something we would want to put in the integration tests if we tested it at all.

Note that you can also pass in a string as an input, and Spark will still try to cast it to a long before calling bin.
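To illustrate the cast-then-bin behavior in plain C++ (the function name here is illustrative, not the plugin's API): the cast Spark inserts truncates 13.3 to the long 13, and bin of 13 is 1101.

```cpp
#include <cstdint>
#include <string>

// Sketch of Spark's cast-then-bin behavior: Bin only ever sees a long,
// so Bin(13.3) is effectively Bin(CAST(13.3 AS LONG)) = Bin(13).
std::string bin_after_cast(double d)
{
  int64_t v = static_cast<int64_t>(d);  // the double->long cast truncates toward zero
  uint64_t bits = static_cast<uint64_t>(v);
  if (bits == 0) { return "0"; }
  std::string out;
  for (int i = 63; i >= 0; --i) {
    bool bit = (bits >> i) & 1u;
    if (!out.empty() || bit) { out.push_back(bit ? '1' : '0'); }
  }
  return out;
}
```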

{
auto const size = 64 - __clzll(value);
// If the value is 0, the size should be 1
return size > 0 ? size : 1;
Collaborator:

nit: we can first check if the value == 0 to save a __clzll call in this case.

Collaborator:

But wouldn't that introduce thread divergence? I understand how that can be an advantage on a CPU, but I don't really see it on a GPU.

ustcfy (Collaborator Author) Jan 14, 2025:

The PTX translated from the code is as follows. First, the original version:

auto const size = 64 - __clzll(value);
return size > 0 ? size : 1;

// PTX
clz.b64         %r1, %rd1;           // %r1: number of leading zeros in %rd1 (value)
mov.u32         %r2, 64;             // %r2: constant value 64
sub.s32         %r3, %r2, %r1;       // %r3: size = 64 - %r1
max.s32         %r4, %r3, 1;         // %r4: return max(size, 1)

And the early-return version:

if (!value) return 1;
return 64 - __clzll(value);

// PTX
setp.eq.s64     %p1, %rd1, 0;        // %p1: set to true if %rd1 (value) is zero
clz.b64         %r1, %rd1;           // %r1: number of leading zeros in %rd1 (value)
mov.u32         %r2, 64;             // %r2: constant value 64
sub.s32         %r3, %r2, %r1;       // %r3: size = 64 - %r1
selp.b32        %r4, 1, %r3, %p1;    // %r4: if %p1 is true, return 1; otherwise, return %r3 (final size)

thirtiseven (Collaborator) Jan 15, 2025:

But wouldn't that introduce thread divergence? I understand how that can be an advantage on a CPU, but I don't really see it on a GPU.

Yes, I was thinking there must be an if/else to check the size == 0 case, so we could put it earlier than the clz to save some calls without introducing a new branch. But it looks like the compiler optimizes size > 0 ? size : 1 to a max.s32, so it's branchless and the original approach looks better anyway.

Collaborator Author:

I have now changed it to this: return max(64 - __clzll(value), 1);
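A host-side analogue of that expression, assuming GCC/Clang's __builtin_clzll in place of CUDA's __clzll (CUDA defines clz of 0 as 64, while the host builtin leaves zero input undefined, hence the explicit guard):

```cpp
#include <algorithm>
#include <cstdint>

// Number of characters needed for the binary string of `value`:
// 64 minus the count of leading zeros, clamped to at least 1 so that
// zero still produces the single character "0".
int binary_string_size(int64_t value)
{
  uint64_t bits = static_cast<uint64_t>(value);
  if (bits == 0) { return 1; }  // __builtin_clzll(0) is undefined on the host
  return std::max(64 - __builtin_clzll(bits), 1);
}
```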

auto results = spark_rapids_jni::long_to_binary_string(longs, cudf::get_default_stream());

auto const expected = cudf::test::strings_column_wrapper{
"0", "1", "1010", "0", "1111111111111111111111111111111111111111111111111111111111111111"};

revans2 previously approved these changes Jan 13, 2025

revans2 (Collaborator) left a comment:

The main thing for me is going to be performance. Do you have any numbers comparing the CPU to the GPU? If you want to wait for a plugin patch that is fine. But we need to do it somewhere.



auto const value = d_longs.element<LongType>(idx);
char* d_buffer = d_chars + d_offsets[idx];
for (auto i = d_sizes[idx] - 1; i >= 0; --i) {
*d_buffer++ = value & (1LL << i) ? '1' : '0';
Collaborator:

*d_buffer++ = '0' + ((value & (1LL << i)) >> i); — perhaps this approach is more efficient, since it avoids branching, which might otherwise degrade performance on GPUs through warp divergence.

Collaborator:

But I am not sure whether this is good practice or actually effective here. I would like to hear your opinions on this @res-life @ttnghia.

Collaborator:

maybe *d_buffer++ = '0' + ((value & (1LL << i)) != 0);? It will be (very slightly) cheaper and easier to read.

Collaborator:

@thirtiseven
Yes, this one is also a branch-free expression, since the compiler will use the setne instruction, which avoids branching by directly setting a register based on the zero flag (ZF):

cmp rax, 0
setne al
and al, 1
add eax, 48

My alternative would translate to:

sar     rax, cl
add     rax, 48

Collaborator:

But I am not sure if it is a good practice which is really effective. I would like to hear your opinions on this issue @res-life @ttnghia .

Yes, I think this approach is more efficient. You may run a benchmark to double-check.


__device__ void operator()(cudf::size_type idx)
{
if (d_longs.is_null(idx)) {
Collaborator:

NIT: just a nice-to-have improvement: use constexpr if instead of if, and add an extra template parameter nullable to this functor, because we already know whether the column_view is nullable.
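A minimal illustration of that NIT (simplified stand-in types, not the cuDF ones): nullability becomes a compile-time template parameter, so the null check is compiled out entirely for non-nullable columns.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Functor with a compile-time `nullable` flag: `if constexpr` removes
// the null check from the non-nullable instantiation altogether.
template <bool nullable>
struct string_size_fn {
  std::vector<std::optional<int64_t>> col;  // stand-in for column_device_view

  int operator()(std::size_t idx) const
  {
    if constexpr (nullable) {
      if (!col[idx].has_value()) { return 0; }  // null row contributes no characters
    }
    // non-null path (real size computation elided; 1 as a placeholder)
    return 1;
  }
};
```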

Signed-off-by: Yan Feng <[email protected]>
Signed-off-by: Yan Feng <[email protected]>
Signed-off-by: Yan Feng <[email protected]>
res-life (Collaborator) commented Jan 15, 2025

About the performance, we may try the following approaches.

  1. Use warp-level parallelism: one warp (32 threads) handles one long value. Currently one thread handles one long value.
  2. Use an unrolling approach:

for (auto i = d_sizes[idx] - 1; i >= 0; --i) {
  *d_buffer++ = value & (1LL << i) ? '1' : '0';
}

==>>

create a buffer on the stack:

char buf[64];

// 64 lines in total, MSB first so the digits to emit end up contiguous at the tail:
buf[0]  = value & (1LL << 63) ? '1' : '0';
buf[1]  = value & (1LL << 62) ? '1' : '0';
buf[2]  = value & (1LL << 61) ? '1' : '0';
...
buf[62] = value & (1LL << 1) ? '1' : '0';
buf[63] = value & (1LL << 0) ? '1' : '0';
// Some of these lines are unnecessary if the long value has leading zeros, but keep them to avoid GPU divergence.

// write out 4 chars at a time
int num_chars = d_sizes[idx];
char const* src = buf + 64 - num_chars;
while (num_chars >= 4) {
  *reinterpret_cast<int*>(d_buffer) = *reinterpret_cast<int const*>(src);
  d_buffer += 4; src += 4; num_chars -= 4;
}
// write out the remaining chars one by one
while (num_chars-- > 0) { *d_buffer++ = *src++; }

// Or use memcpy to copy multiple chars at one time:
memcpy(d_buffer, buf + 64 - d_sizes[idx], d_sizes[idx]);
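A host-side sketch of that buffering idea (simplified; in the kernel the destination would be d_chars + d_offsets[idx]): fill all 64 digits unconditionally with a fixed trip count, MSB first, then copy only the meaningful tail in one memcpy.

```cpp
#include <cstdint>
#include <cstring>
#include <string>

// Materialize all 64 digits into a stack buffer with a fixed trip count
// (no data-dependent branching), then copy the last `num_chars` digits
// to the output in a single memcpy.
std::string binary_via_buffer(int64_t value, int num_chars)
{
  char buf[64];
  uint64_t bits = static_cast<uint64_t>(value);
  for (int i = 0; i < 64; ++i) {
    buf[63 - i] = ((bits >> i) & 1u) ? '1' : '0';  // bit 0 lands in buf[63]
  }
  std::string out(static_cast<size_t>(num_chars), '\0');
  std::memcpy(&out[0], buf + (64 - num_chars), static_cast<size_t>(num_chars));
  return out;
}
```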

res-life (Collaborator):
I found an example, maybe it's useful:

void intToBin(uint64_t i, char* out) {
	assert((reinterpret_cast<uintptr_t>(out) % sizeof(uint64_t)) == 0); // check alignment
	uint64_t zeros   = 0x3030303030303030ULL; // "00000000"
	uint64_t mask    = 0x0101010101010101ULL;
	uint64_t* out_64 = reinterpret_cast<uint64_t*>(out);

	unsigned long bit_index = 0;
	_BitScanForward64(&bit_index, i);
	i = i << bit_index;

	out_64[0] = zeros | _pdep_u64(i >> 56, mask);
	out_64[1] = zeros | _pdep_u64(i >> 48, mask);
	out_64[2] = zeros | _pdep_u64(i >> 40, mask);
	out_64[3] = zeros | _pdep_u64(i >> 32, mask);
	out_64[4] = zeros | _pdep_u64(i >> 24, mask);
	out_64[5] = zeros | _pdep_u64(i >> 16, mask);
	out_64[6] = zeros | _pdep_u64(i >>  8, mask);
	out_64[7] = zeros | _pdep_u64(i >>  0, mask);

	_BitScanReverse64(&bit_index, i);
	out[64 - bit_index] = '\0';
}

CUDA supports _pdep_u64
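For what it's worth, here is a portable C++ sketch of that pdep trick (a software pdep8 stands in for BMI2's _pdep_u64, and byte ordering is handled explicitly instead of via pointer casts): each input byte expands into eight ASCII '0'/'1' characters at once.

```cpp
#include <cstdint>
#include <string>

// Software stand-in for _pdep_u64(x, 0x0101010101010101ULL): deposit
// bit k of the low byte of `x` into byte k of the result.
uint64_t pdep8(uint64_t x)
{
  uint64_t out = 0;
  for (int bit = 0; bit < 8; ++bit) {
    out |= ((x >> bit) & 1u) << (8 * bit);
  }
  return out;
}

// Expand a 64-bit value into 64 ASCII binary digits, MSB first.
std::string binary64(uint64_t v)
{
  std::string out(64, '0');
  for (int b = 0; b < 8; ++b) {
    uint64_t byte  = (v >> (56 - 8 * b)) & 0xffu;          // b-th byte from the top
    uint64_t eight = 0x3030303030303030ULL | pdep8(byte);  // 8 digits, bit k in byte k
    for (int k = 0; k < 8; ++k) {
      // byte (7-k) of `eight` holds bit (7-k): the MSB-first digit at position k
      out[8 * b + k] = static_cast<char>((eight >> (8 * (7 - k))) & 0xffu);
    }
  }
  return out;
}
```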
