ThreadPool: Spend less time busy waiting. #21545
Conversation
The purpose of the patch is primarily to save power, but it also has nice perf benefits (mostly from allowing the system to better distribute power to cores doing meaningful work).

Changes are twofold:

1) Decrease the WorkerLoop spin count dramatically, from ~10^6 to ~10^4. The reality is that after ~10^4 spins, if there hasn't been any new work added, it's unlikely any new work is imminent, so sleep to preserve power.
2) Use exponential backoff for waiting on memory. This saves a bit more power and, importantly, increases the time between iterations in WorkerLoop to help accommodate the dramatically lowered spin counts.

Since the tuning for both the iteration counts / backoff counts is dramatically different for hybrid/non-hybrid systems, this patch templates the affected functions and dynamically chooses based on `CPUIDInfo::IsHybrid()`. This seemed like the "lightest weight" way of getting the change in, although it's likely we could incur less dynamic overhead if we added the template argument to the entirety of `ThreadPoolTempl`.

Measured performance on an [Intel Raptor Lake CPU](https://www.intel.com/content/www/us/en/products/sku/230496/intel-core-i913900k-processor-36m-cache-up-to-5-80-ghz/specifications.html) across a range of models. Below are the results of 3 runs, with each metric being the value-before-patch / value-after-patch (so for something like inference time, lower is better).

Metric | Before / After ratio
:-----|:-----:
Session creation time cost | 0.9151
First inference time cost | 0.8564
Total inference time cost | 0.9995
Total inference requests | 0.9450
Average inference time cost | 0.9396
Total inference run time | 0.9995
Number of inferences per second | 0.9449
Avg CPU usage | 0.9018
Peak working set size | 0.9876
Runs | 1.0650
Min Latency | 0.9706
Max Latency | 0.8538
P50 Latency | 0.9453
P90 Latency | 0.9051
P95 Latency | 0.8683
P99 Latency | 0.8547
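For reference, here is a minimal C++ sketch of the backoff-waiter idea described above, reconstructed from the snippets quoted in the review discussion below; it is an illustration rather than the exact patch, and the local `SpinPause` shim stands in for `onnxruntime::concurrency::SpinPause`.

```cpp
#if defined(__x86_64__) || defined(_M_X64) || defined(__i386__) || defined(_M_IX86)
#include <immintrin.h>
inline void SpinPause() { _mm_pause(); }  // x86 pause instruction
#else
inline void SpinPause() {}  // nop on non-x86 targets
#endif

// kMaxBackoff == 0 makes wait() a no-op; otherwise each call spins for a
// growing number of pause instructions, doubling (mod kMaxBackoff) per call.
template <unsigned kMaxBackoff>
class ThreadPoolWaiter {
 public:
  void wait() {
    if constexpr (kMaxBackoff != 0) {
      unsigned pause_time = pause_time_ + 1U;
      for (unsigned i = 0; i < pause_time; ++i) {
        SpinPause();
      }
      pause_time_ = (pause_time * 2U) % kMaxBackoff;
    }
  }

 private:
  unsigned pause_time_ = 0;
};
```

With this shape, the worker-loop code can instantiate e.g. `ThreadPoolWaiter<kIsHybrid ? 4 : 0>`, so hybrid systems get a short exponential backoff while the non-hybrid path compiles `wait()` down to nothing.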
/azp run Big Models, Linux Android Emulator QNN CI Pipeline, Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline, Linux OpenVINO CI Pipeline, Linux QNN CI Pipeline, MacOS CI Pipeline
/azp run Windows ARM64 QNN CI Pipeline, Windows CPU CI Pipeline, Windows GPU CI Pipeline, Windows GPU TensorRT CI Pipeline, Windows x64 QNN CI Pipeline, onnxruntime-binary-size-checks-ci-pipeline, orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline, orttraining-ortmodule-distributed
Azure Pipelines successfully started running 8 pipeline(s).
Azure Pipelines successfully started running 9 pipeline(s).
/azp run Windows ARM64 QNN CI Pipeline, Windows CPU CI Pipeline, Windows GPU CI Pipeline, Windows GPU TensorRT CI Pipeline, Windows x64 QNN CI Pipeline, onnxruntime-binary-size-checks-ci-pipeline, orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline, orttraining-ortmodule-distributed
/azp run Big Models, Linux Android Emulator QNN CI Pipeline, Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline, Linux OpenVINO CI Pipeline, Linux QNN CI Pipeline, MacOS CI Pipeline
Azure Pipelines successfully started running 8 pipeline(s).
Azure Pipelines successfully started running 9 pipeline(s).
Looking at the CI failures, I think the failures were spurious (or at least unrelated to this PR). Can it be re-run? Windows Failure Builder #20240730.3
CI looks green :)
ping
@snnn Is there anything else holding back this PR? Thanks
ping
@pranavsharma How can we move forward on this? Thanks
The CI tests seem to have hung. Can they be restarted?
/azp run ONNX Runtime Web CI Pipeline
Commenter does not have sufficient privileges for PR 21545 in repo microsoft/onnxruntime
Hi, could someone restart the not-yet-completed tests? I don't think I have the ability to merge this until they are all green, and I don't have the permissions to run the tests.
/azp run ONNX Runtime Web CI Pipeline, Windows GPU CUDA CI Pipeline, Windows GPU DML CI Pipeline, Windows GPU Doc Gen CI Pipeline
Azure Pipelines successfully started running 4 pipeline(s).
/azp run Windows GPU Doc Gen CI Pipeline
Azure Pipelines successfully started running 1 pipeline(s).
Hi, the failures (warnings) still seem entirely unrelated, i.e.: https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1500618&view=results
Is it possible to ignore those warnings?
/azp run ONNX Runtime Web CI Pipeline, Windows GPU CUDA CI Pipeline, Windows GPU Doc Gen CI Pipeline
Azure Pipelines successfully started running 3 pipeline(s).
<3
This reverts commit 4e15b22. Reason: We are seeing an increase in the number of deadlocks after this PR. We have a release coming up next week and do not have enough time to investigate the root cause, hence reverting this PR temporarily. Moreover, this is causing an increase in the binary size.

### Description
We are seeing an [increase in the number of deadlocks](#22315 (comment)) after this PR. We have a release coming up next week and do not have enough time to investigate the root cause, hence reverting this PR temporarily.

### Motivation and Context
See above.
@@ -1257,9 +1283,10 @@ class ThreadPoolTempl : public onnxruntime::concurrency::ExtendedThreadPoolInter
    // Increase the worker count if needed. Each worker will pick up
    // loops to execute from the current parallel section.
    std::function<void(unsigned)> worker_fn = [&ps](unsigned par_idx) {
      ThreadPoolWaiter<kIsHybrid ? 4 : 0> waiter{};
So, this change means on non-Intel CPUs you removed the SpinPause and replaced it with a busy loop. Why? How is it better?
Not quite, if `SpinPause` is a proper nop, the loop will be optimized out due to no side-effects.
Why was the change made? I understand you wanted to optimize ONNX Runtime's performance for a few kinds of Intel CPUs, but why do you need to remove the spinlock for the other CPUs?
Hmm, this doesn't remove the spin lock. This is just revising how we wait between testing the lock. I.e.:
`while (try_locked) { waiter.wait(); }`
The `waiter.wait()` is just to help improve the perf/power consumption of the spin lock. The spin lock itself is the `while (try_locked) { /* Wait however you like */ }`.
The exact values were picked based on benchmarks.
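To make that concrete, here is a hypothetical acquire loop (the names `AcquireSpinLock` and `Waiter` are illustrative, not from the PR) showing that the waiter only shapes the gap between lock probes:

```cpp
#include <atomic>

// Waiter is any type exposing wait(), e.g. a ThreadPoolWaiter-style backoff helper.
template <typename Waiter>
void AcquireSpinLock(std::atomic<bool>& locked, Waiter& waiter) {
  // The spin lock itself is the outer loop; waiter.wait() only decides how we
  // pause between failed probes (a few pause instructions, not a kernel sleep).
  while (locked.exchange(true, std::memory_order_acquire)) {
    waiter.wait();
  }
}
```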
Here you set the template parameter `kMaxBackoff` to zero, which means `waiter.wait()` is empty, is that right?
But we do not want to tune ONNX Runtime performance for a special kind of CPU, especially when the tuning may hurt performance for the other kinds of CPUs.
I would say the distinction between Hybrid and non-Hybrid is pretty critical in the context of a scheduler. It fundamentally changes the characteristics of the threads.
I mentioned AMD because in my opinion this change is bad for all AMD CPUs. There is no good reason why we should change a spinlock to a busy loop for AMD CPUs. (You can argue it is not just hurting AMD CPUs; it hurts some other CPUs as well.) I am questioning why the "_mm_pause();" function call was removed for most kinds of CPUs, except Intel Hybrid CPUs.
Again, benchmarks indicate it was profitable (including on AMD). Look, this single pause is not the end-all be-all. If, as the maintainer, you say "I want the pause", I will update the PR, but I don't think your reasoning thus far is sound.
I didn't mention ARM because, as you said, you didn't even test it on ARM.
The changes to the wait code have no effect on ARM. It's all a nop.
My job is to find the deadlock. I am committed to finishing it. I am a maintainer of this repo, but I do not want to use that as authority to influence this technical discussion. I am not a specialist in thread scheduling. We will find someone who knows it better to review this code, after the deadlock bug is fixed.
Thanks for your clarification. We'll discuss what you're saying.
Regarding ARM, it is right that a nop was used. There is a draft #22787 to add yield (no perf test has been done yet).
Just a guess, but a true yield would probably best be used far more sparingly than pause. Pause really exists to save a bit of power/prevent aggressive speculation, but it's orders of magnitude less costly than a yield. Especially if it's basically running at 1 thread per core, it seems that if you actually want to go to the kernel, you probably want to just be sleeping to save more power; using yield to bounce in and out of kernel space doesn't really accomplish that... But hey, let's see what the benchmarks say.
      while (ps.tasks_finished < tasks_to_wait_for) {
        onnxruntime::concurrency::SpinPause();
        waiter.wait();
Suggest reverting this spot. Since it is waiting for other tasks, it is better to use a loop of spin pause instead of a no-op (for hybrid mode). It could reduce the power consumed by the thread while it waits, minimizing resource contention.
Maybe? The exact values are entirely based on benchmarks across a variety of machines. This performed the best in our benchmarks, but there is no intrinsic reason it needs to be as it is.
        unsigned pause_time = pause_time_ + 1U;
        for (unsigned i = 0; i < pause_time; ++i) {
          onnxruntime::concurrency::SpinPause();
        }
        pause_time_ = (pause_time * 2U) % kMaxBackoff;
      }
This logic seems odd for kMaxBackoff = 2, since kMaxBackoff = 2 has the same effect as kMaxBackoff = 1:
pause_time_ = 0; pause_time = 0 + 1 = 1; pause_time_ is updated to (1 * 2) % 2 = 0
When kMaxBackoff = 2, there is always one pause: 1, 1, 1, 1, ...
When kMaxBackoff = 4, pause: 1, 3, 3, 3, ...
When kMaxBackoff = 8, pause: 1, 3, 7, 7, ...
Ideally it could be something like:
kMaxBackoff = 2: pause 1, 2, 2, 2, 2, 2, ...
kMaxBackoff = 4: pause: 1, 2, 4, 4, 4, 4, ...
kMaxBackoff = 8: pause: 1, 2, 4, 8, 8, 8, ...
For example, when kMaxBackoff > 1, we could only allow it to be a power of 2; then we can shift a bit each time until it reaches the max.
I understand that we currently only use kMaxBackoff = 1, 4, or 8. Suggest testing the "ideal one" to see whether that could help.
You are right. I think just `<=` would fix that, although we would need to re-benchmark.
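For illustration, here is a rough sketch of the shift-based, saturating variant suggested in the comment above (not code from the PR; it assumes `kMaxBackoff` is a power of two when greater than 1, and it reuses the repo's existing `SpinPause` helper):

```cpp
template <unsigned kMaxBackoff>
class SaturatingWaiter {
 public:
  void wait() {
    // Spin for the current pause count.
    for (unsigned i = 0; i < pause_time_; ++i) {
      onnxruntime::concurrency::SpinPause();
    }
    // Double the pause count until it saturates: 1, 2, 4, ..., kMaxBackoff, kMaxBackoff, ...
    if (pause_time_ < kMaxBackoff) {
      pause_time_ <<= 1;
    }
  }

 private:
  unsigned pause_time_ = 1;
};
```

This produces the 1, 2, 4, ..., kMaxBackoff sequences listed above, but as noted it would need to be re-benchmarked before replacing the modulo-based version.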
I have not looked at the PR in depth. However, here are two scenarios that we want to account for.
The purpose of the patch is primarily to save power, but it also has nice perf benefits (mostly from allowing the system to better distribute power to cores doing meaningful work).

Changes are twofold:

1) Decrease the WorkerLoop spin count dramatically, from ~10^6 to ~10^4. The reality is that after ~10^4 spins, if there hasn't been any new work added, it's unlikely any new work is imminent, so sleep to preserve power. This aligns more closely with upstream EigenV3.
2) Use exponential backoff for waiting on memory. This saves a bit more power and, importantly, increases the time between iterations in WorkerLoop to help accommodate the dramatically lowered spin counts.

Since the tuning for both the iteration counts / backoff counts is dramatically different for hybrid/non-hybrid systems, this patch templates the affected functions and dynamically chooses based on `CPUIDInfo::IsHybrid()`. This seemed like the "lightest weight" way of getting the change in, although it's likely we could incur less dynamic overhead if we added the template argument to the entirety of `ThreadPoolTempl`.

Measured performance on an [Intel Meteor Lake CPU](https://www.intel.com/content/www/us/en/products/sku/237329/intel-core-ultra-7-processor-165u-12m-cache-up-to-4-90-ghz/specifications.html) across a range of models. Below are the results of 3 runs, with each metric being the value-before-patch / value-after-patch (so for something like inference time, lower is better).

Metric | Before / After ratio
:-----|:-----:
Session creation time cost | 0.7179
First inference time cost | 0.7156
Total inference time cost | 1.0146
Total inference requests | 0.8874
Average inference time cost | 0.8800
Total inference run time | 1.0146
Number of inferences per second | 0.8955
Avg CPU usage | 0.9462
Peak working set size | 0.9922
Runs | 1.1552
Min Latency | 0.7283
Max Latency | 0.9258
P50 Latency | 0.9534
P90 Latency | 0.9639
P95 Latency | 0.9659
P99 Latency | 0.9640

So the net result is a 1.16x improvement in throughput and between 1.08-1.37x improvement in latency.
…icrosoft#22350) This reverts commit 4e15b22. Reason: We are seeing an increase in the number of deadlocks after this PR. We have a release coming up next week and do not have enough time to investigate the root cause, hence reverting this PR temporarily. Moreover, this is causing an increase in the binary size.

### Description
We are seeing an [increase in the number of deadlocks](microsoft#22315 (comment)) after this PR. We have a release coming up next week and do not have enough time to investigate the root cause, hence reverting this PR temporarily.

### Motivation and Context
See above.
Agreed. There is no one-size-fits-all here. Wait too long and you burn power/throttle CPU freq on cores potentially doing useful work. Don't spin enough and you end up wasting time going to/from the OS. The primary point of this patch is that the prior threshold of 10^6 seemed WAY too hedged towards spinning. The new value of 10^4, I think (and benchmarks support), strikes a better balance between these competing needs.
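As a rough illustration of that balance (a simplified spin-then-block pattern, not the actual WorkerLoop; `kSpinCount`, `WaitForWork`, and `has_work` are hypothetical names):

```cpp
#include <condition_variable>
#include <mutex>

constexpr int kSpinCount = 10'000;  // ~10^4, the threshold argued for above

// has_work() is any cheap predicate telling the worker a task is ready.
template <typename HasWorkFn>
void WaitForWork(HasWorkFn has_work, std::mutex& m, std::condition_variable& cv) {
  // Probe a bounded number of times while new work is plausibly imminent.
  for (int i = 0; i < kSpinCount; ++i) {
    if (has_work()) return;
    onnxruntime::concurrency::SpinPause();  // existing helper in the thread pool code
  }
  // After ~10^4 unsuccessful probes, fall back to a real (kernel-level) wait
  // so the core can save power instead of continuing to spin.
  std::unique_lock<std::mutex> lk(m);
  cv.wait(lk, has_work);
}
```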
@goldsteinn, please open a new PR (with #22315). Suggest adding some test scripts so that other people can verify it.
got it, see: #23278
The purpose of the patch is primarily to save power, but it also has nice perf benefits (mostly from allowing the system to better distribute power to cores doing meaningful work).
Changes are twofold:
1) Decrease the WorkerLoop spin count dramatically, from ~10^6 to ~10^4. The reality is that after ~10^4 spins, if there hasn't been any new work added, it's unlikely any new work is imminent, so sleep to preserve power. This aligns more closely with upstream EigenV3.
2) Use exponential backoff for waiting on memory. This saves a bit more power and, importantly, increases the time between iterations in WorkerLoop to help accommodate the dramatically lowered spin counts.
Since the tuning for both the iteration counts / backoff counts is dramatically different for hybrid/non-hybrid systems, this patch templates the affected functions and dynamically chooses based on `CPUIDInfo::IsHybrid()`. This seemed like the "lightest weight" way of getting the change in, although it's likely we could incur less dynamic overhead if we added the template argument to the entirety of `ThreadPoolTempl`.

Measured performance on an Intel Meteor Lake CPU across a range of models.
Below are the results of 3 runs, with each metric being the value-before-patch / value-after-patch (so for something like inference time, lower is better).

Metric | Before / After ratio
:-----|:-----:
Session creation time cost | 0.7179
First inference time cost | 0.7156
Total inference time cost | 1.0146
Total inference requests | 0.8874
Average inference time cost | 0.8800
Total inference run time | 1.0146
Number of inferences per second | 0.8955
Avg CPU usage | 0.9462
Peak working set size | 0.9922
Runs | 1.1552
Min Latency | 0.7283
Max Latency | 0.9258
P50 Latency | 0.9534
P90 Latency | 0.9639
P95 Latency | 0.9659
P99 Latency | 0.9640
So the net result is a 1.16x improvement in throughput and between 1.08-1.37x improvement in latency.