Optimize merge_sort algorithm for largest data sizes #1977

Open
wants to merge 94 commits into
base: main

@SergeyKopienko SergeyKopienko commented Dec 19, 2024

In this PR we extend the approach from #1933 to the merge_sort algorithm.

…introduce new function __find_start_point_in

Signed-off-by: Sergey Kopienko <[email protected]>
…introduce __parallel_merge_submitter_large for merge of biggest data sizes

Signed-off-by: Sergey Kopienko <[email protected]>
…using __parallel_merge_submitter_large for merge data equal or greater than 4M items

Signed-off-by: Sergey Kopienko <[email protected]>
Signed-off-by: Sergey Kopienko <[email protected]>
…rename template parameter names in __parallel_merge_submitter

Signed-off-by: Sergey Kopienko <[email protected]>
Signed-off-by: Sergey Kopienko <[email protected]>
…introduce __starting_size_limit_for_large_submitter into __parallel_merge

Signed-off-by: Sergey Kopienko <[email protected]>
…introduce _split_point_t type

Signed-off-by: Sergey Kopienko <[email protected]>
…remove usages of std::make_pair

Signed-off-by: Sergey Kopienko <[email protected]>
…optimize evaluation of split-points on base diagonals

Signed-off-by: Sergey Kopienko <[email protected]>
…extract eval_split_points_for_groups function

Signed-off-by: Sergey Kopienko <[email protected]>
…extract run_parallel_merge function

Signed-off-by: Sergey Kopienko <[email protected]>
…using SLM bank size to define chunk in the eval_nd_range_params function

Signed-off-by: Sergey Kopienko <[email protected]>
…using SLM bank size to define chunk in the eval_nd_range_params function (16)

Signed-off-by: Sergey Kopienko <[email protected]>
…restore old implementation of __find_start_point

Signed-off-by: Sergey Kopienko <[email protected]>
…rename: base_diag_part -> steps_between_two_base_diags

Signed-off-by: Sergey Kopienko <[email protected]>
…fix an error in __parallel_merge_submitter_large::eval_split_points_for_groups

Signed-off-by: Sergey Kopienko <[email protected]>
…erge_submitter_large` into one `__parallel_merge_submitter` (#1956)
…fix review comment: remove extra condition check from __find_start_point_in

Signed-off-by: Sergey Kopienko <[email protected]>
…fix review comment: fix condition check in __find_start_point_in

Signed-off-by: Sergey Kopienko <[email protected]>
…apply GitHub clang format

Signed-off-by: Sergey Kopienko <[email protected]>
@SergeyKopienko SergeyKopienko marked this pull request as draft December 22, 2024 13:26
@SergeyKopienko SergeyKopienko force-pushed the dev/skopienko/optimize_merge_sort_V1 branch from 4574d07 to efa2649 Compare December 22, 2024 21:29
…or the largest data sizes

Signed-off-by: Sergey Kopienko <[email protected]>
….h -remove unused local variable

Signed-off-by: Sergey Kopienko <[email protected]>
….h - rename __find_or_eval_sp to __lookup_sp

Signed-off-by: Sergey Kopienko <[email protected]>
….h - fix an error in tests

Signed-off-by: Sergey Kopienko <[email protected]>
@SergeyKopienko SergeyKopienko force-pushed the dev/skopienko/optimize_merge_sort_V1 branch from efa2649 to 7906635 Compare December 22, 2024 21:39
…rge_sort.h - fix an error in tests"

This reverts commit 7906635.
….h - fix an error in tests

Signed-off-by: Sergey Kopienko <[email protected]>
….h - refactoring of __merge_sort_global_submitter __lookup_sp

Signed-off-by: Sergey Kopienko <[email protected]>
….h - refactoring of __merge_sort_global_submitter::eval_split_points_for_groups

Signed-off-by: Sergey Kopienko <[email protected]>
… largest data sizes on GPU only

Signed-off-by: Sergey Kopienko <[email protected]>
… largest data sizes on GPU only

Signed-off-by: Sergey Kopienko <[email protected]>
Signed-off-by: Sergey Kopienko <[email protected]>
@SergeyKopienko SergeyKopienko marked this pull request as ready for review December 22, 2024 22:45
….h - additional explanations in the __merge_sort_global_submitter::__lookup_sp function

Signed-off-by: Sergey Kopienko <[email protected]>
….h - fix capture modes in submit() calls

Signed-off-by: Sergey Kopienko <[email protected]>
….h - fix self-review comment: refactoring of __temp_sp_storages creation in the __merge_sort_global_submitter::operator()

Signed-off-by: Sergey Kopienko <[email protected]>
….h - remove extra static_cast in the __leaf_sorter::sort()

Signed-off-by: Sergey Kopienko <[email protected]>
….h - fix self-review comment: refactoring of __temp_sp_storages creation in the __merge_sort_global_submitter::operator()

Signed-off-by: Sergey Kopienko <[email protected]>
Comment on lines +46 to 53
using std::swap;

for (std::uint32_t i = __start; i < __end; ++i)
{
    for (std::uint32_t j = __start + 1; j < __start + __end - i; ++j)
    {
        auto& __first_item = __storage_acc[j - 1];
        auto& __second_item = __storage_acc[j];
        if (__comp(__second_item, __first_item))
        {
            using std::swap;
            swap(__first_item, __second_item);
        }
        __comp(__storage_acc[j], __storage_acc[j - 1]) ? swap(__storage_acc[j - 1], __storage_acc[j]) : void();
    }

I'm not understanding the purpose of this change...

@danhoeflinger danhoeflinger left a comment

Not a full review, but enough feedback to group together. I still need time to digest the algorithm and look into the details.

Comment on lines +501 to +505
template <typename _ExecutionPolicy, typename _Range, typename _TempBuf, typename _Compare, typename _Storage>
sycl::event
run_parallel_merge(const sycl::event& __event_chain, const _IndexT __n_sorted, const bool __data_in_temp,
_ExecutionPolicy&& __exec, _Range&& __rng, _TempBuf& __temp_buf, _Compare __comp,
const nd_range_params& __nd_range_params, _Storage& __base_diagonals_sp_global_storage) const

@danhoeflinger danhoeflinger Jan 2, 2025

For clarity, I would name this differently from the other run_parallel_merge function, perhaps run_parallel_merge_from_diagonals or something like that.

Comment on lines +487 to +495
__data_area.is_i_elem_local_inside_merge_matrix()
    ? (__data_in_temp
           ? __serial_merge_w(
                 __nd_range_params, __data_area, DropViews(__dst, __data_area), __rng,
                 __find_start_point_w(__data_area, DropViews(__dst, __data_area), __comp), __comp)
           : __serial_merge_w(
                 __nd_range_params, __data_area, DropViews(__rng, __data_area), __dst,
                 __find_start_point_w(__data_area, DropViews(__rng, __data_area), __comp), __comp))
    : void();

All these ternary operators are way more confusing than if statements, and I don't believe they provide any advantage over if statements.
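
To illustrate the point, here is a minimal host-only sketch of the same control flow written with if statements. It is not the PR's code: `serial_merge` and `merge_step` are hypothetical stand-ins, `inside` plays the role of `is_i_elem_local_inside_merge_matrix()`, and plain `std::vector<int>` buffers replace the SYCL ranges and `DropViews`.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the per-work-item merge: merges the two sorted
// halves [0, mid) and [mid, size) of `src` into `dst`.
void serial_merge(const std::vector<int>& src, std::size_t mid, std::vector<int>& dst)
{
    const auto split = src.begin() + static_cast<std::ptrdiff_t>(mid);
    dst.resize(src.size());
    std::merge(src.begin(), split, split, src.end(), dst.begin());
}

// Same decisions as the nested ternary, expressed as if statements:
// do nothing outside the merge matrix, otherwise pick the source buffer
// depending on where the data currently lives.
void merge_step(bool inside, bool data_in_temp, std::vector<int>& rng, std::vector<int>& temp, std::size_t mid)
{
    if (!inside)
        return;

    if (data_in_temp)
        serial_merge(temp, mid, rng); // data is in the temporary buffer: merge back into rng
    else
        serial_merge(rng, mid, temp); // data is in rng: merge into the temporary buffer
}
```

Each branch reads as one statement, and the early return replaces the trailing `: void()`.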

Comment on lines +585 to +588
const auto __portions = oneapi::dpl::__internal::__dpl_ceiling_div(__n, 2 * __n_sorted);
nd_range_params __nd_range_params_this = eval_nd_range_params(__exec, std::size_t(2 * __n_sorted));
__nd_range_params_this.steps *= __portions;
__nd_range_params_this.base_diag_count *= __portions;

It seems strange to have a function dedicated to calculating these parameters and then need this extra work to "fix" them at one of its two call sites. Can we add an extra parameter and include this calculation in the function?

This would also allow us to call it at the beginning of each loop rather than separating it to 2 places in the code (before loop and in else).
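
A rough sketch of that shape, under loud assumptions: the `portions` parameter is hypothetical, and the `chunk` and `base_diag_count_per_segment` constants are placeholders standing in for the PR's real tuning logic, which is not reproduced here. The execution-policy argument is omitted.

```cpp
#include <cstddef>

struct nd_range_params
{
    std::size_t base_diag_count = 0;
    std::size_t steps = 0;
};

constexpr std::size_t ceiling_div(std::size_t a, std::size_t b)
{
    return (a + b - 1) / b;
}

// Hypothetical variant: `portions` is the number of independent merge
// segments handled by one kernel. With the default of 1 the function
// describes a single merge; larger values scale both counters inside the
// function, so callers no longer patch the result afterwards.
nd_range_params eval_nd_range_params(std::size_t rng_size, std::size_t portions = 1)
{
    constexpr std::size_t chunk = 4;                        // placeholder, not the PR's value
    constexpr std::size_t base_diag_count_per_segment = 32; // placeholder for the tuning logic

    nd_range_params params;
    params.steps = ceiling_div(rng_size, chunk) * portions;
    params.base_diag_count = base_diag_count_per_segment * portions;
    return params;
}
```

A merge_sort pass could then call `eval_nd_range_params(2 * n_sorted, ceiling_div(n, 2 * n_sorted))` at the top of every loop iteration, while plain merge keeps the one-argument call.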

const _IndexT __steps = oneapi::dpl::__internal::__dpl_ceiling_div(__rng_size, __chunk);

// TODO required to evaluate this value based on available SLM size for each work-group.
_IndexT __base_diag_count = tune_amount_of_base_diagonals(__rng_size, 32 * 1'024); // 32 Kb

I think more explanation of these magic numbers is warranted. Is it possible to consolidate this magic number with the one in tune_amount_of_base_diagonals?
It seems like it should be equivalent to something like __base_diag_count = bit_floor(ceil(__nsorted/8192)).
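
Spelled out as a small host-side sketch, the formula from this comment could look like the following. The names `floor_pow2`, `elements_per_base_diag`, and `base_diag_count_for` are hypothetical, the 8192 comes from the comment itself, and whether this really matches `tune_amount_of_base_diagonals` is not confirmed here.

```cpp
#include <cstddef>

constexpr std::size_t ceiling_div(std::size_t a, std::size_t b)
{
    return (a + b - 1) / b;
}

// Largest power of two that is <= x, or 0 when x == 0.
// Hand-written equivalent of C++20 std::bit_floor, kept here so the sketch
// does not require <bit>.
inline std::size_t floor_pow2(std::size_t x)
{
    std::size_t result = 0;
    for (std::size_t p = 1; p != 0 && p <= x; p <<= 1)
        result = p;
    return result;
}

// Single named constant replacing the scattered magic numbers (assumed value).
constexpr std::size_t elements_per_base_diag = 8192;

// bit_floor(ceil(n_sorted / 8192)), as proposed in the comment.
inline std::size_t base_diag_count_for(std::size_t n_sorted)
{
    return floor_pow2(ceiling_div(n_sorted, elements_per_base_diag));
}
```

With one named constant, the relationship between the range size and the number of base diagonals is visible at the call site instead of being split across two functions.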

Comment on lines +592 to +593
auto __p_base_diagonals_sp_storage =
new __base_diagonals_sp_storage_t(__exec, 0, __nd_range_params_this.base_diag_count);

Can we calculate the maximum diag count once and allocate that once, then re-use the storage, rather than reallocating each time?
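
A host-only sketch of that idea, with everything in it hypothetical: `split_point` stands in for the stored split-point type, `diag_count` is a caller-supplied callable replacing the PR's parameter evaluation, and the question of how the storage lifetime is tied to submitted kernels is set aside entirely.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct split_point
{
    std::size_t first;
    std::size_t second;
};

// Walk the same doubling sequence of sorted-run sizes that the merge passes
// will use and return the largest base-diagonal count any pass needs.
template <typename DiagCountFn>
std::size_t max_base_diag_count(std::size_t n, std::size_t leaf_size, DiagCountFn diag_count)
{
    std::size_t max_count = 0;
    if (leaf_size == 0)
        return max_count; // guard: a zero leaf size would never advance the loop
    for (std::size_t n_sorted = leaf_size; n_sorted < n; n_sorted *= 2)
        max_count = std::max<std::size_t>(max_count, diag_count(n, n_sorted));
    return max_count;
}

// Allocate the split-point storage once, then reuse its prefix in each pass.
// Returns the number of merge passes performed.
template <typename DiagCountFn>
std::size_t run_merge_passes(std::size_t n, std::size_t leaf_size, DiagCountFn diag_count,
                             std::vector<split_point>& storage)
{
    storage.assign(max_base_diag_count(n, leaf_size, diag_count), split_point{0, 0});

    std::size_t passes = 0;
    if (leaf_size == 0)
        return passes;
    for (std::size_t n_sorted = leaf_size; n_sorted < n; n_sorted *= 2)
    {
        const std::size_t count = diag_count(n, n_sorted);
        for (std::size_t i = 0; i < count; ++i)
            storage[i] = split_point{i, n_sorted}; // placeholder for the real split-point evaluation
        ++passes;
    }
    return passes;
}
```

The extra pre-pass over the run sizes is cheap compared with a device allocation per merge pass, which is the trade this comment is asking about.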

}
});
});
if (2 * __n_sorted < __get_starting_size_limit_for_large_submitter<__value_type>())

Does it make sense that the starting size limit is the same for "merge" as it is for segments of merge_sort?
I would expect that it would be a somewhat lower threshold for merge sort, since we can combine calculations of base diagonals across multiple segments of merge within merge sort in a single kernel.

Perhaps it has more to do with the size between base segments than justifying the diagonal kernel, in which case it perhaps makes sense to have the same threshold. I think a discussion within a tech meeting would be helpful to consider this stuff.

// | __sp_left | __sp_right
// | |
// | __linear_id_in_steps_range
// We doesn't save the first diagonal into base diagonal's SP storage !!!

Suggested change
// We doesn't save the first diagonal into base diagonal's SP storage !!!
// We don't save the first diagonal into base diagonal's SP storage !!!
