[BUG]: block-level InclusiveScan with initial value doesn't compute the block aggregate correctly #3383

Open
1 task done
Beanavil opened this issue Jan 14, 2025 · 0 comments
Labels
bug Something isn't working right.

Beanavil commented Jan 14, 2025

Is this a duplicate?

Type of Bug

Silent Failure

Component

CUB

Describe the bug

The recently added block-level InclusiveScan overload that takes an initial value does not appear to take that initial value into account when computing the block aggregate.

How to Reproduce

  1. Add the following test
__global__ void InclusiveScanKernelAggregate(int* output, int* d_block_aggregate)
{
  // Specialize BlockScan for a 1D block of 64 threads of type int
  using block_scan_t   = cub::BlockScan<int, 64>;
  using temp_storage_t = block_scan_t::TempStorage;

  // Allocate shared memory for BlockScan
  __shared__ temp_storage_t temp_storage;

  int initial_value = 1;
  int thread_data[] = { 1, -1 };
  //  input: {[1, -1], [1, -1],[1, -1], ... [1, -1]}

  // Collectively compute the block-wide inclusive prefix sum, seeded with initial_value
  int block_aggregate;
  block_scan_t(temp_storage).InclusiveScan(thread_data, thread_data, initial_value, cub::Sum(), block_aggregate);

  // output: {[2, 1], [2, 1],[2, 1], ... [2, 1]}
  // block_aggregate = 1;
  // ...
  // example-end inclusive-scan-array-aggregate-init-value

  *d_block_aggregate          = block_aggregate;
  output[threadIdx.x * 2]     = thread_data[0];
  output[threadIdx.x * 2 + 1] = thread_data[1];
}

CUB_TEST("Block array-based inclusive scan with block aggregate works with initial value", "[scan][block]")
{
  thrust::device_vector<int> d_out(block_num_threads * num_items_per_thread);

  c2h::device_vector<int> d_block_aggregate(1);
  InclusiveScanKernelAggregate<<<1, block_num_threads>>>(
    thrust::raw_pointer_cast(d_out.data()), thrust::raw_pointer_cast(d_block_aggregate.data()));
  REQUIRE(cudaSuccess == cudaPeekAtLastError());
  REQUIRE(cudaSuccess == cudaDeviceSynchronize());

  c2h::host_vector<int> expected(d_out.size());
  for (size_t i = 0; i < expected.size() - 1; i += 2)
  {
    expected[i]     = 2;
    expected[i + 1] = 1;
  }

  REQUIRE(d_out == expected);
  REQUIRE(d_block_aggregate[0] == 1);
}

to catch2_test_block_scan_api.cu.

  2. Build and run the test, e.g.:
./ci/build_cub.sh
cmake --build build/cub-cpp17/ --target cub.cpp17.test.block_scan_api
./build/cub-cpp17/cub.cpp17.test.block_scan_api

The test should fail: the computed block aggregate is 0 rather than the expected 1, which is consistent with the initial value being left out of the aggregate.
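For reference, the arithmetic behind the expected values can be checked on the host. The sketch below is not part of the CUB test suite: it models the block as one flat array of 64 × 2 items and uses `std::inclusive_scan` from C++17. The names `ScanResult` and `reference_scan` are made up for illustration.

```cpp
#include <cassert>
#include <functional>
#include <numeric>
#include <vector>

// Hypothetical host-side model of the reproducer (not CUB code).
struct ScanResult
{
  std::vector<int> output;
  int aggregate_with_init;    // value the issue expects
  int aggregate_without_init; // value the issue reports observing
};

// Each "thread" contributes the pair {1, -1}, as in the kernel above.
ScanResult reference_scan(int num_threads, int initial_value)
{
  std::vector<int> input;
  for (int t = 0; t < num_threads; ++t)
  {
    input.push_back(1);
    input.push_back(-1);
  }

  ScanResult r;
  r.output.resize(input.size());
  // Inclusive prefix sum seeded with initial_value.
  std::inclusive_scan(input.begin(), input.end(), r.output.begin(), std::plus<int>{}, initial_value);
  // Sum of the raw items only, i.e. the aggregate if the seed is ignored.
  r.aggregate_without_init = std::accumulate(input.begin(), input.end(), 0);
  r.aggregate_with_init    = initial_value + r.aggregate_without_init;
  return r;
}
```

Calling `reference_scan(64, 1)` yields the alternating 2, 1 output pattern used in the test, an aggregate of 1 when the initial value is included, and 0 when it is not.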

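FYI: a similar test already exists in catch2_test_block_scan_api.cu, but it uses the cub::Max() operator, so it appears to overlook the aggregate miscalculation. With a max operator, the aggregate is the same with or without the initial value whenever the initial value does not exceed the largest item.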
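A small host-side sketch of why a max-based test can miss this while a sum-based one catches it. The input values used with it are illustrative and are not taken from the existing test; `aggregate`, `MaxOp`, and `detects_missing_init` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Folds all items with `op`, optionally seeding the fold with `initial_value`.
template <typename Op>
int aggregate(const std::vector<int>& items, Op op, bool include_init, int initial_value)
{
  assert(!items.empty());
  int acc = include_init ? op(initial_value, items[0]) : items[0];
  for (std::size_t i = 1; i < items.size(); ++i)
  {
    acc = op(acc, items[i]);
  }
  return acc;
}

// Host stand-in for a max operator.
struct MaxOp
{
  int operator()(int a, int b) const
  {
    return std::max(a, b);
  }
};

// True when dropping the initial value changes the aggregate,
// i.e. when a test using `op` and this data would notice the bug.
template <typename Op>
bool detects_missing_init(const std::vector<int>& items, Op op, int initial_value)
{
  return aggregate(items, op, true, initial_value) != aggregate(items, op, false, initial_value);
}
```

With items `{1, -1, 1, -1}` and an initial value of 1, `detects_missing_init` is false for `MaxOp` and true for `std::plus<int>`. A max-based test would only notice the problem if the initial value were larger than every item.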

Expected behavior

Block-level InclusiveScan with an initial value should also fold the initial value into the block aggregate.
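Under that expected behavior, the block aggregate of an inclusive scan should equal the last item of the scan output, since that item already folds in every input and the initial value. A minimal host-side check, assuming the output has been copied to the host (`aggregate_matches_last_output` is a hypothetical helper, not a CUB or c2h function):

```cpp
#include <cassert>
#include <vector>

// For an inclusive scan seeded with an initial value, the final output item
// and the block aggregate are expected to be the same value.
bool aggregate_matches_last_output(const std::vector<int>& scan_output, int block_aggregate)
{
  return !scan_output.empty() && scan_output.back() == block_aggregate;
}
```

For the reproducer's output pattern `{2, 1, ..., 2, 1}`, the check passes for an aggregate of 1 and fails for the reported 0.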

Reproduction link

No response

Operating System

No response

nvidia-smi output

No response

NVCC version

No response

@Beanavil Beanavil added the bug Something isn't working right. label Jan 14, 2025
@github-project-automation github-project-automation bot moved this to Todo in CCCL Jan 14, 2025
@Beanavil Beanavil changed the title [BUG]: block-level InclusiveScan with initial value doesn't compute the aggregates correctly [BUG]: block-level InclusiveScan with initial value doesn't compute the block aggregate correctly Jan 14, 2025