Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Memory issue with sorting_with_executors #17

Open
alexbarev opened this issue Jan 5, 2025 · 5 comments
Open

Memory issue with sorting_with_executors #17

alexbarev opened this issue Jan 5, 2025 · 5 comments
Assignees
Labels
bug Something isn't working

Comments

@alexbarev
Copy link
Contributor

alexbarev commented Jan 5, 2025

  1. Run sorting_with_executors benchmark using the std::execution::par_unseq policy.

  2. Memory consumption quickly exceeds my machine's availability of 60GB after the second test variant finishes: sorting_with_executors/par_unseq/4194304/

Observations:

  1. Memory does not decrease between tests with inputs of different sizes.

  2. I tried moving the policy inside the loop, but it did not resolve the issue:

    for (auto _ : state) {
        auto local_policy = std::execution::par_unseq; // Create policy here
        std::reverse(local_policy, array.begin(), array.end());
        std::sort(local_policy, array.begin(), array.end());
        bm::DoNotOptimize(array.size());
        bm::DoNotOptimize(local_policy); // even tried this just in case
    }
  3. I noticed that only 16 threads are spawned, which matches the number of available CPU's. The issue is puzzling, and I am yet to understand its root cause.

@ashvardanian
Copy link
Owner

ashvardanian commented Jan 5, 2025

Turns out, it’s a known issue.

@ashvardanian
Copy link
Owner

Also relates to this issue.

@alexbarev
Copy link
Contributor Author

alexbarev commented Jan 5, 2025

@ashvardanian I reproduced the issue with a manually built oneTBB.
This code compiled with -DPAR increases memory consumption, reaching 6GB at its peak, after which the script finishes. The increase happens slowly (>1min) with O0 but very quickly (few seconds) with O3.

WSL Ubuntu 22.04, gcc version 13.1.0. Latest oneTBB/master built from source.

NB. Same issue outside of WSL.

#include <vector>
#include <cstdint>
#include <numeric>
#include <execution>
#include <algorithm>
#include <oneapi/tbb.h>

int main() {
    int count = 100000;
    std::vector<std::uint32_t> array(count);
    std::iota(array.begin(), array.end(), 1u);

#ifndef PAR
    auto policy = std::execution::seq;
#else
    auto policy = std::execution::par_unseq;
#endif

    for (int i = 0; i < 100000; ++i) {
        std::sort(policy, array.begin(), array.end());
    }

}

In contrast, using -fopenmp with this code runs without a noticeable increase in memory consumption.

#include <vector>
#include <cstdint>
#include <numeric>
#include <execution>
#include <algorithm>

int main() {
    int count = 100000;
    std::vector<std::uint32_t> array(count);
    std::iota(array.begin(), array.end(), 1u);

#pragma omp parallel for
    for (int i = 0; i < 100000; ++i) {
        std::sort(array.begin(), array.end());
    }

}

@alexbarev
Copy link
Contributor Author

alexbarev commented Jan 6, 2025

@ashvardanian The problem lies with libstdc++.

We are waiting for a fix here:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117276

Currently, one can fix it locally by modifying the libstdc++ files at:
/usr/include/c++/13/pstl/parallel_backend_tbb.h

This works for me

 class __task : public tbb::detail::d1::task
 {
   protected:
@@ -646,10 +646,15 @@
 
         _PSTL_ASSERT(__parent != nullptr);
         _PSTL_ASSERT(__parent->_M_refcount.load(std::memory_order_relaxed) > 0);
-        if (--__parent->_M_refcount == 0)
+        auto __refcount = --__parent->_M_refcount;
+
+        // Placing the deallocation after the refcount decrement allows another thread to proceed with tree
+        // folding concurrently with this task cleanup.
+        __alloc.deallocate(this, *__ed);
+
+        if (__refcount == 0)
         {
             _PSTL_ASSERT(__next == nullptr);
-            __alloc.deallocate(this, *__ed);
             return __parent;
         }

Found solution here uxlfoundation/oneTBB#1533

@ashvardanian
Copy link
Owner

@alexbarev, how about we add a ->Iterations(1) to that benchmark with a link to this issue for now?

alexbarev added a commit to alexbarev/less_slow.cpp that referenced this issue Jan 6, 2025
ashvardanian pushed a commit that referenced this issue Jan 6, 2025
@ashvardanian ashvardanian added the bug Something isn't working label Jan 10, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants