Optimize merge_sort algorithm for largest data sizes #1977
…introduce new function __find_start_point_in Signed-off-by: Sergey Kopienko <[email protected]>
…introduce __parallel_merge_submitter_large for merge of biggest data sizes Signed-off-by: Sergey Kopienko <[email protected]>
…using __parallel_merge_submitter_large for merge data equal or greater then 4M items Signed-off-by: Sergey Kopienko <[email protected]>
Signed-off-by: Sergey Kopienko <[email protected]>
…fix compile error Signed-off-by: Sergey Kopienko <[email protected]>
…fix Kernel names Signed-off-by: Sergey Kopienko <[email protected]>
…rename template parameter names in __parallel_merge_submitter Signed-off-by: Sergey Kopienko <[email protected]>
Signed-off-by: Sergey Kopienko <[email protected]>
…fix review comment Signed-off-by: Sergey Kopienko <[email protected]>
…fix review comment Signed-off-by: Sergey Kopienko <[email protected]>
…introduce __starting_size_limit_for_large_submitter into __parallel_merge Signed-off-by: Sergey Kopienko <[email protected]>
…renames Signed-off-by: Sergey Kopienko <[email protected]>
…introduce _split_point_t type Signed-off-by: Sergey Kopienko <[email protected]>
…remove usages of std::make_pair Signed-off-by: Sergey Kopienko <[email protected]>
…optimize evaluation of split-points on base diagonals Signed-off-by: Sergey Kopienko <[email protected]>
…renames Signed-off-by: Sergey Kopienko <[email protected]>
…extract eval_split_points_for_groups function Signed-off-by: Sergey Kopienko <[email protected]>
…extract run_parallel_merge function Signed-off-by: Sergey Kopienko <[email protected]>
…using SLM bank size to define chunk in the eval_nd_range_params function Signed-off-by: Sergey Kopienko <[email protected]>
…using SLM bank size to define chunk in the eval_nd_range_params function (16) Signed-off-by: Sergey Kopienko <[email protected]>
…restore old implementation of __find_start_point Signed-off-by: Sergey Kopienko <[email protected]>
…rename: base_diag_part -> steps_between_two_base_diags Signed-off-by: Sergey Kopienko <[email protected]>
…fix review comment Signed-off-by: Sergey Kopienko <[email protected]>
…fix an error in __parallel_merge_submitter_large::eval_split_points_for_groups Signed-off-by: Sergey Kopienko <[email protected]>
…onals is too short Signed-off-by: Sergey Kopienko <[email protected]>
…erge_submitter_large` into one `__parallel_merge_submitter` (#1956)
…fix review comment: remove extra condition check from __find_start_point_in Signed-off-by: Sergey Kopienko <[email protected]>
…fix review comment: fix condition check in __find_start_point_in Signed-off-by: Sergey Kopienko <[email protected]>
…apply GitHUB clang format Signed-off-by: Sergey Kopienko <[email protected]>
Force-pushed from 4574d07 to efa2649
…or the largest data sizes Signed-off-by: Sergey Kopienko <[email protected]>
….h -remove unused local variable Signed-off-by: Sergey Kopienko <[email protected]>
….h - rename __find_or_eval_sp to __lookup_sp Signed-off-by: Sergey Kopienko <[email protected]>
….h - fix an error in tests Signed-off-by: Sergey Kopienko <[email protected]>
Force-pushed from efa2649 to 7906635
…rge_sort.h - fix an error in tests" This reverts commit 7906635.
…nt earlier Signed-off-by: Sergey Kopienko <[email protected]>
….h - fix an error in tests Signed-off-by: Sergey Kopienko <[email protected]>
….h - refactoring of __merge_sort_global_submitter __lookup_sp Signed-off-by: Sergey Kopienko <[email protected]>
….h - refactoring of __merge_sort_global_submitter::eval_split_points_for_groups Signed-off-by: Sergey Kopienko <[email protected]>
… largest data sizes on GPU only Signed-off-by: Sergey Kopienko <[email protected]>
… largest data sizes on GPU only Signed-off-by: Sergey Kopienko <[email protected]>
…nt earlier Signed-off-by: Sergey Kopienko <[email protected]>
Signed-off-by: Sergey Kopienko <[email protected]>
….h - additional explanations in the __merge_sort_global_submitter::__lookup_sp function Signed-off-by: Sergey Kopienko <[email protected]>
….h - fix capture modes in submit() calls Signed-off-by: Sergey Kopienko <[email protected]>
….h - fix self-review comment: refactoring of __temp_sp_storages creation in the __merge_sort_global_submitter::operator() Signed-off-by: Sergey Kopienko <[email protected]>
….h - remove extra static_cast in the __leaf_sorter::sort() Signed-off-by: Sergey Kopienko <[email protected]>
….h - fix self-review comment: refactoring of __temp_sp_storages creation in the __merge_sort_global_submitter::operator() Signed-off-by: Sergey Kopienko <[email protected]>
Signed-off-by: Sergey Kopienko <[email protected]>
Signed-off-by: Sergey Kopienko <[email protected]>
….h - avoid if statement inside Kernel's code Signed-off-by: Sergey Kopienko <[email protected]>
Signed-off-by: Sergey Kopienko <[email protected]>
+ using std::swap;
+
  for (std::uint32_t i = __start; i < __end; ++i)
  {
      for (std::uint32_t j = __start + 1; j < __start + __end - i; ++j)
      {
-         auto& __first_item = __storage_acc[j - 1];
-         auto& __second_item = __storage_acc[j];
-         if (__comp(__second_item, __first_item))
-         {
-             using std::swap;
-             swap(__first_item, __second_item);
-         }
+         __comp(__storage_acc[j], __storage_acc[j - 1]) ? swap(__storage_acc[j - 1], __storage_acc[j]) : void();
      }
I'm not understanding the purpose of this change...
Not a full review, but enough feedback to group together. I still need time to digest the algorithm and look into the details.
template <typename _ExecutionPolicy, typename _Range, typename _TempBuf, typename _Compare, typename _Storage>
sycl::event
run_parallel_merge(const sycl::event& __event_chain, const _IndexT __n_sorted, const bool __data_in_temp,
                   _ExecutionPolicy&& __exec, _Range&& __rng, _TempBuf& __temp_buf, _Compare __comp,
                   const nd_range_params& __nd_range_params, _Storage& __base_diagonals_sp_global_storage) const
For clarity, I would name this differently from the other run_parallel_merge function, perhaps run_parallel_merge_from_diagonals or something like that.
__data_area.is_i_elem_local_inside_merge_matrix()
    ? (__data_in_temp
           ? __serial_merge_w(
                 __nd_range_params, __data_area, DropViews(__dst, __data_area), __rng,
                 __find_start_point_w(__data_area, DropViews(__dst, __data_area), __comp), __comp)
           : __serial_merge_w(
                 __nd_range_params, __data_area, DropViews(__rng, __data_area), __dst,
                 __find_start_point_w(__data_area, DropViews(__rng, __data_area), __comp), __comp))
    : void();
All these ternary operators are way more confusing than if statements, and I don't believe they provide any advantage over if statements.
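As an illustration of this comment, here is a minimal sketch showing that a nested conditional expression ending in `void()` and plain if statements select the same call. The names `merge_from`, `dispatch_ternary` and `dispatch_if` are hypothetical stand-ins and are not part of the PR.

```cpp
#include <string>

// Hypothetical stand-in for the serial merge call: it only records the copy direction.
inline void merge_from(const char* direction, std::string& log) { log += direction; }

// Conditional-expression form, mirroring the style of the hunk above.
// Every branch has to be an expression of type void, hence the trailing void().
inline void dispatch_ternary(bool inside, bool data_in_temp, std::string& log)
{
    inside ? (data_in_temp ? merge_from("temp->rng;", log) : merge_from("rng->temp;", log)) : void();
}

// Equivalent if-statement form.
inline void dispatch_if(bool inside, bool data_in_temp, std::string& log)
{
    if (!inside)
        return;
    if (data_in_temp)
        merge_from("temp->rng;", log);
    else
        merge_from("rng->temp;", log);
}
```

Both functions perform the same calls for every combination of the two flags, so in this sketch the choice between them is about readability only.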
const auto __portions = oneapi::dpl::__internal::__dpl_ceiling_div(__n, 2 * __n_sorted);
nd_range_params __nd_range_params_this = eval_nd_range_params(__exec, std::size_t(2 * __n_sorted));
__nd_range_params_this.steps *= __portions;
__nd_range_params_this.base_diag_count *= __portions;
It seems strange to have a function specifically for calculating these parameters when one of its two call sites needs this extra work to "fix" the result. Can we add an extra parameter and include this calculation in the function?
This would also allow us to call it at the beginning of each loop iteration rather than splitting it across two places in the code (before the loop and in the else branch).
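One possible shape for this suggestion is sketched below, under stated assumptions: `params_sketch`, `eval_params_sketch`, the default `chunk` of 4 and the `diag_stride` of 8192 are hypothetical placeholders, not the PR's `nd_range_params`, `eval_nd_range_params` or tuning values.

```cpp
#include <cstddef>

// Hypothetical, simplified mirror of the parameter set discussed above.
struct params_sketch
{
    std::size_t chunk;
    std::size_t steps;
    std::size_t base_diag_count;
};

inline std::size_t ceiling_div_sketch(std::size_t a, std::size_t b) { return (a + b - 1) / b; }

// `portions` is the number of independent merges handled by one launch.
// With the default of 1 the function behaves like a plain single-merge evaluation.
inline params_sketch eval_params_sketch(std::size_t rng_size, std::size_t portions = 1,
                                        std::size_t chunk = 4, std::size_t diag_stride = 8192)
{
    params_sketch p;
    p.chunk = chunk;
    p.steps = ceiling_div_sketch(rng_size, chunk) * portions;
    p.base_diag_count = ceiling_div_sketch(rng_size, diag_stride) * portions;
    return p;
}
```

A caller inside the merge-sort loop would then write something like `eval_params_sketch(2 * n_sorted, ceiling_div_sketch(n, 2 * n_sorted))` in one place, with no scaling of the fields afterwards.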
const _IndexT __steps = oneapi::dpl::__internal::__dpl_ceiling_div(__rng_size, __chunk);

// TODO required to evaluate this value based on available SLM size for each work-group.
_IndexT __base_diag_count = tune_amount_of_base_diagonals(__rng_size, 32 * 1'024); // 32 Kb
I think more explanation of these magic numbers is warranted. Is it possible to consolidate this magic number with the one in tune_amount_of_base_diagonals?
It seems like it should be equivalent to something like __base_diag_count = bit_floor(ceil(__nsorted/8192)).
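The formula proposed in this comment can be written out as a small sketch. This is only the reviewer's suggested expression, not the behaviour of `tune_amount_of_base_diagonals`, which is not shown on this page. `bit_floor_sketch` and `base_diag_count_sketch` are hypothetical names, and the hand-rolled bit floor is used only so that the example does not require C++20 `std::bit_floor`.

```cpp
#include <cstddef>

// Largest power of two that is not greater than x (returns 0 for x == 0).
inline std::size_t bit_floor_sketch(std::size_t x)
{
    std::size_t result = x ? 1 : 0;
    while (x >>= 1)
        result <<= 1;
    return result;
}

// bit_floor(ceil(n_sorted / elems_per_diag)), with 8192 taken from the comment above.
inline std::size_t base_diag_count_sketch(std::size_t n_sorted, std::size_t elems_per_diag = 8192)
{
    const std::size_t diag_count = (n_sorted + elems_per_diag - 1) / elems_per_diag;
    return bit_floor_sketch(diag_count);
}
```

For example, 100000 elements give ceil(100000 / 8192) = 13, which is rounded down to 8 base diagonals.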
auto __p_base_diagonals_sp_storage =
    new __base_diagonals_sp_storage_t(__exec, 0, __nd_range_params_this.base_diag_count);
Can we calculate the maximum diag count once and allocate that once, then re-use the storage, rather than reallocating each time?
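A rough sketch of how the maximum could be computed up front is given below. It assumes, hypothetically, one base diagonal per `stride` elements of each merged pair and a sorted-run length that doubles on every iteration starting from `leaf_size`; the PR's actual tuning function may give different counts.

```cpp
#include <algorithm>
#include <cstddef>

inline std::size_t ceiling_div_sketch(std::size_t a, std::size_t b) { return (a + b - 1) / b; }

// Hypothetical per-iteration count: diagonals per merged pair times the number of pairs.
inline std::size_t diag_count_for_sketch(std::size_t n, std::size_t n_sorted, std::size_t stride)
{
    const std::size_t portions = ceiling_div_sketch(n, 2 * n_sorted);
    return ceiling_div_sketch(2 * n_sorted, stride) * portions;
}

// Walk over all merge iterations once and keep the largest count,
// so that a single allocation of that size can be reused by every iteration.
inline std::size_t max_diag_count_sketch(std::size_t n, std::size_t leaf_size, std::size_t stride)
{
    std::size_t result = 0;
    for (std::size_t n_sorted = leaf_size; n_sorted < n; n_sorted *= 2)
        result = std::max(result, diag_count_for_sketch(n, n_sorted, stride));
    return result;
}
```

The storage would then be created once with `max_diag_count_sketch(...)` entries before the loop, instead of being allocated with `new` inside it.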
        }
    });
});
if (2 * __n_sorted < __get_starting_size_limit_for_large_submitter<__value_type>())
Does it make sense that the starting size limit is the same for "merge" as it is for segments of merge_sort?
I would expect that it would be a somewhat lower threshold for merge sort, since we can combine calculations of base diagonals across multiple segments of merge within merge sort in a single kernel.
Perhaps it has more to do with the size between base segments than justifying the diagonal kernel, in which case it perhaps makes sense to have the same threshold. I think a discussion within a tech meeting would be helpful to consider this stuff.
// | __sp_left | __sp_right
// | |
// | __linear_id_in_steps_range
// We doesn't save the first diagonal into base diagonal's SP storage !!!
- // We doesn't save the first diagonal into base diagonal's SP storage !!!
+ // We don't save the first diagonal into base diagonal's SP storage !!!
In this PR we extend the approach from #1933 to the merge_sort algorithm.
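The details of #1933 are not reproduced on this page, but the commit messages above refer to split points evaluated on diagonals. As general background only, here is a minimal host-side sketch of the textbook merge-path technique: a binary search along one diagonal finds how many elements of each input precede it, and chunks between consecutive diagonals can then be merged independently. `find_split_point` and `merge_in_chunks` are hypothetical names and this is not the PR's implementation.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Returns (i, j) with i + j == diag such that a[0, i) and b[0, j) together are
// the first `diag` elements of the stable merge of a and b (a wins ties).
// Assumes diag <= a.size() + b.size().
template <typename T, typename Compare>
std::pair<std::size_t, std::size_t>
find_split_point(const std::vector<T>& a, const std::vector<T>& b, std::size_t diag, Compare comp)
{
    std::size_t lo = diag > b.size() ? diag - b.size() : 0;
    std::size_t hi = std::min(diag, a.size());
    while (lo < hi)
    {
        const std::size_t i = lo + (hi - lo) / 2; // candidate number of elements taken from a
        const std::size_t j = diag - i;           // elements taken from b, at least 1 here
        if (comp(b[j - 1], a[i]))
            hi = i;     // b[j - 1] precedes a[i]: taking i elements from a is enough
        else
            lo = i + 1; // a[i] must come before b[j - 1]: take more from a
    }
    return {lo, diag - lo};
}

// Merges a and b in independent chunks of `chunk` output elements (chunk > 0).
// Each loop iteration depends only on its own two split points.
template <typename T, typename Compare>
std::vector<T>
merge_in_chunks(const std::vector<T>& a, const std::vector<T>& b, std::size_t chunk, Compare comp)
{
    std::vector<T> out(a.size() + b.size());
    for (std::size_t diag = 0; diag < out.size(); diag += chunk)
    {
        const auto from = find_split_point(a, b, diag, comp);
        const auto to = find_split_point(a, b, std::min(diag + chunk, out.size()), comp);
        std::merge(a.begin() + from.first, a.begin() + to.first,
                   b.begin() + from.second, b.begin() + to.second,
                   out.begin() + diag, comp);
    }
    return out;
}
```

In this sketch every chunk runs its own two binary searches over the full inputs. Storing split points for a sparse set of diagonals and narrowing the search between neighbouring ones is one way to reduce that cost for large inputs.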