
ThreadPool: Spend less time busy waiting. #21545

Merged

Conversation

@goldsteinn (Contributor) commented Jul 29, 2024

The purpose of the patch is primarily to save power, but it also has nice perf benefits (mostly from allowing the system to better distribute power to cores doing meaningful work).

Changes are twofold:

  1. Decrease the WorkerLoop spin count dramatically, from ~10^6 to ~10^4. The
     reality is that after ~10^4 spins, if there hasn't been any new work
     added, it is unlikely any new work is imminent, so sleep to
     preserve power. This aligns more closely with upstream EigenV3.

  2. Use exponential backoff for waiting on memory. This saves a bit
     more power and, importantly, increases the time between iterations
     in WorkerLoop to help accommodate the dramatically lowered spin
     counts.

Since the tuning for both the iteration counts and backoff counts is dramatically different for hybrid vs. non-hybrid systems, this patch templates the affected functions and dynamically chooses based on CPUIDInfo::IsHybrid(). This seemed like the "lightest weight" way of getting the change in, although it's likely we could incur less dynamic overhead if we added the template argument to the entirety of ThreadPoolTempl.

Measured performance on an Intel Meteor Lake CPU across a range of models.

Below are the results of 3 runs, with each metric being the value-before-patch / value-after-patch ratio (so for something like inference time, lower is better).

| Metric | Before / after ratio |
|:--|:--:|
| Session creation time cost | 0.7179 |
| First inference time cost | 0.7156 |
| Total inference time cost | 1.0146 |
| Total inference requests | 0.8874 |
| Average inference time cost | 0.8800 |
| Total inference run time | 1.0146 |
| Number of inferences per second | 0.8955 |
| Avg CPU usage | 0.9462 |
| Peak working set size | 0.9922 |
| Runs | 1.1552 |
| Min Latency | 0.7283 |
| Max Latency | 0.9258 |
| P50 Latency | 0.9534 |
| P90 Latency | 0.9639 |
| P95 Latency | 0.9659 |
| P99 Latency | 0.9640 |

So the net result is a 1.16x improvement in throughput and between 1.08-1.37x improvement in latency.


Performance was also measured on an [Intel Raptor Lake
CPU](https://www.intel.com/content/www/us/en/products/sku/230496/intel-core-i913900k-processor-36m-cache-up-to-5-80-ghz/specifications.html)
across a range of models.

Below are the results of 3 runs, with each metric being the value-before-patch / value-after-patch ratio (so for something like inference time, lower is better).

| Metric | Before / after ratio |
|:--|:--:|
| Session creation time cost | 0.9151 |
| First inference time cost | 0.8564 |
| Total inference time cost | 0.9995 |
| Total inference requests | 0.9450 |
| Average inference time cost | 0.9396 |
| Total inference run time | 0.9995 |
| Number of inferences per second | 0.9449 |
| Avg CPU usage | 0.9018 |
| Peak working set size | 0.9876 |
| Runs | 1.0650 |
| Min Latency | 0.9706 |
| Max Latency | 0.8538 |
| P50 Latency | 0.9453 |
| P90 Latency | 0.9051 |
| P95 Latency | 0.8683 |
| P99 Latency | 0.8547 |
@snnn snnn requested a review from yuslepukhin July 29, 2024 18:37
@snnn snnn requested a review from pranavsharma July 29, 2024 18:37
@snnn snnn added the core runtime issues related to core runtime label Jul 29, 2024
@snnn (Member) commented Jul 29, 2024

/azp run Big Models, Linux Android Emulator QNN CI Pipeline, Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline, Linux OpenVINO CI Pipeline, Linux QNN CI Pipeline, MacOS CI Pipeline

@snnn (Member) commented Jul 29, 2024

/azp run Windows ARM64 QNN CI Pipeline, Windows CPU CI Pipeline, Windows GPU CI Pipeline, Windows GPU TensorRT CI Pipeline, Windows x64 QNN CI Pipeline, onnxruntime-binary-size-checks-ci-pipeline, orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline, orttraining-ortmodule-distributed


Azure Pipelines successfully started running 8 pipeline(s).


Azure Pipelines successfully started running 9 pipeline(s).

@snnn (Member) commented Jul 30, 2024

/azp run Windows ARM64 QNN CI Pipeline, Windows CPU CI Pipeline, Windows GPU CI Pipeline, Windows GPU TensorRT CI Pipeline, Windows x64 QNN CI Pipeline, onnxruntime-binary-size-checks-ci-pipeline, orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline, orttraining-ortmodule-distributed

@snnn (Member) commented Jul 30, 2024

/azp run Big Models, Linux Android Emulator QNN CI Pipeline, Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline, Linux OpenVINO CI Pipeline, Linux QNN CI Pipeline, MacOS CI Pipeline


Azure Pipelines successfully started running 8 pipeline(s).


Azure Pipelines successfully started running 9 pipeline(s).

@goldsteinn (Contributor, Author)

Looking at the CI failures, I think the failures were spurious (or at least unrelated to this PR). Can the CI be re-run?

Windows Failure Builder #20240730.3

node:events:495
      throw er; // Unhandled 'error' event
      ^

Error: read ECONNRESET
    at TLSWrap.onStreamRead (node:internal/stream_base_commons:217:20)
Emitted 'error' event on ClientRequest instance at:
    at TLSSocket.socketErrorListener (node:_http_client:501:9)
    at TLSSocket.emit (node:events:517:28)
    at emitErrorNT (node:internal/streams/destroy:151:8)
    at emitErrorCloseNT (node:internal/streams/destroy:116:3)
    at process.processTicksAndRejections (node:internal/process/task_queues:82:21) {
  errno: -4077,
  code: 'ECONNRESET',
  syscall: 'read'
}

@goldsteinn (Contributor, Author)

CI looks green :)

@goldsteinn (Contributor, Author)

ping

@jeyblu (Contributor) commented Aug 13, 2024

@snnn Is there anything else holding back this PR? Thanks

@yufenglee yufenglee requested a review from snnn August 15, 2024 20:20
@goldsteinn (Contributor, Author)

ping

@jeyblu (Contributor) commented Sep 5, 2024

@pranavsharma How can we move forward on this? Thanks

@goldsteinn (Contributor, Author)

The following CI tests seem to have hung:

ONNX Runtime Web CI Pipeline
Windows GPU CUDA CI Pipeline
Windows GPU DML CI Pipeline
Windows GPU Doc Gen CI Pipeline

Can they be restarted?

@goldsteinn (Contributor, Author)

/azp run ONNX Runtime Web CI Pipeline


Commenter does not have sufficient privileges for PR 21545 in repo microsoft/onnxruntime

@goldsteinn (Contributor, Author)

Hi, could someone restart the not-yet-completed tests? I don't think I can merge this until they are all green, and I don't have the permissions to run the tests myself.

@yufenglee (Member)

/azp run ONNX Runtime Web CI Pipeline, Windows GPU CUDA CI Pipeline, Windows GPU DML CI Pipeline, Windows GPU Doc Gen CI Pipeline


Azure Pipelines successfully started running 4 pipeline(s).

@yufenglee (Member)

/azp run Windows GPU Doc Gen CI Pipeline


Azure Pipelines successfully started running 1 pipeline(s).

@goldsteinn (Contributor, Author)

Hi,

The failures (warnings) still seem entirely unrelated, i.e.:

https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1500618&view=results

You should provide GitHub token if you want to download a python release. Otherwise you may hit the GitHub anonymous download limit.
Build_wasm_Debug • build_WASM • UsePythonVersion

You should provide GitHub token if you want to download a python release. Otherwise you may hit the GitHub anonymous download limit.
Build_wasm_Release • build_WASM • UsePythonVersion

You should provide GitHub token if you want to download a python release. Otherwise you may hit the GitHub anonymous download limit.
Build_wasm_Release_static_library • build_WASM • UsePythonVersion

Is it possible to ignore those warnings?

@pranavsharma (Contributor)

/azp run ONNX Runtime Web CI Pipeline, Windows GPU CUDA CI Pipeline, Windows GPU Doc Gen CI Pipeline


Azure Pipelines successfully started running 3 pipeline(s).

@pranavsharma pranavsharma merged commit 4e15b22 into microsoft:main Oct 2, 2024
73 checks passed
@goldsteinn (Contributor, Author)

<3

pranavsharma added a commit that referenced this pull request Oct 8, 2024
This reverts commit 4e15b22.

Reason: We are seeing an increase in the number of deadlocks after this PR.
We have a release coming up next week and do not have enough time to investigate
the root cause, hence reverting this PR temporarily.
pranavsharma added a commit that referenced this pull request Oct 9, 2024
This reverts commit 4e15b22.

Reason: We are seeing an increase in the number of deadlocks after this
PR. We have a release coming up next week and do not have enough time to
investigate the root cause, hence reverting this PR temporarily.
Moreover, this is causing an increase in the binary size.

### Description
We are seeing an [increase in the number of
deadlocks](#22315 (comment))
after this PR. We have a release coming up next week and do not have
enough time to investigate the root cause, hence reverting this PR
temporarily.

### Motivation and Context
See above.
@@ -1257,9 +1283,10 @@ class ThreadPoolTempl : public onnxruntime::concurrency::ExtendedThreadPoolInter
// Increase the worker count if needed. Each worker will pick up
// loops to execute from the current parallel section.
std::function<void(unsigned)> worker_fn = [&ps](unsigned par_idx) {
ThreadPoolWaiter<kIsHybrid ? 4 : 0> waiter{};
Member:

So, this change means on non-Intel CPUs you removed the SpinPause and replaced it with a busy loop. Why? How is it better?

Contributor Author:

Not quite: if SpinPause is a proper nop, the loop will be optimized out due to having no side effects.

Member:

Why was the change made? I understand you wanted to optimize ONNX Runtime's performance for a few kinds of Intel CPUs, but why do you need to remove the spinlock for the other CPUs?

Contributor Author:

Hmm, this doesn't remove the spin lock. This is just revising how we wait between testing the lock, i.e.:

while (try_locked) { waiter.wait(); }

The waiter.wait() is just there to help improve the perf/power consumption of the spin lock. The spin lock itself is the while (try_locked) { /* wait however you like */ }.

The exact values were picked based on benchmarks.

Member:

Here you set the template parameter kMaxBackoff to zero, which means waiter.wait() is empty, is that right?

Contributor Author:

> But we do not want to tune ONNX Runtime performance for a special kind of CPU, especially when the tuning may hurt performance for the other kinds of CPUs.

I would say the distinction between hybrid and non-hybrid is pretty critical in the context of a scheduler. It fundamentally changes the characteristics of the threads.

> I mentioned AMD because in my opinion this change is bad for all AMD CPUs. There is no good reason why we should change a spinlock to a busy loop for AMD CPUs. (You can argue it is not just hurting AMD CPUs, it hurts some other CPUs as well.) I am questioning why the "_mm_pause();" function call was removed for most kinds of CPUs, except Intel hybrid CPUs.

Again, benchmarks indicate it was profitable (including on AMD). Look, this single pause is not the end-all-be-all. If as the maintainer you say "I want the pause", I will update the PR, but I don't think your reasoning thus far is sound.

> I didn't mention ARM because, as you said, you didn't even test it on ARM.

The changes to the wait code have no effect on ARM. It's all a nop there.

Member:

My job is to find the deadlock. I am committed to finishing it. I am a maintainer of this repo, but I do not want to use that as authority to influence this technical discussion. I am not a specialist in thread scheduling. We will find someone who knows it better to review this code, after the deadlock bug is fixed.

Contributor:

Thanks for your clarification. We'll discuss what you're saying.

Contributor:

Regarding ARM, it is correct that a nop was used. There is a draft #22787 to add a yield (no perf test has been done yet).

Contributor Author:

Just a guess, but a true yield would probably best be used far more sparingly than pause. Pause really exists to save a bit of power and prevent aggressive speculation, but it's orders of magnitude less costly than a yield. Especially if it's basically running at 1 thread per core: if you actually want to go to the kernel, you probably want to just be sleeping to save more power; using yield to bounce in and out of kernel space doesn't really accomplish that... But hey, let's see what the benchmarks say.

while (ps.tasks_finished < tasks_to_wait_for) {
  onnxruntime::concurrency::SpinPause();
  waiter.wait();
@tianleiwu (Contributor) commented Nov 8, 2024:

I suggest reverting this spot. Since it is waiting for other tasks, it is better to use a loop of spin pauses instead of a no-op (for hybrid mode). It could reduce the power consumed by the thread while it waits, minimizing resource contention.

Contributor Author:

Maybe? The exact values are entirely based on benchmarks across a variety of machines. This performed the best in our benchmarks, but there is no intrinsic reason it needs to be as it is.

Comment on lines +784 to +789

  unsigned pause_time = pause_time_ + 1U;
  for (unsigned i = 0; i < pause_time; ++i) {
    onnxruntime::concurrency::SpinPause();
  }
  pause_time_ = (pause_time * 2U) % kMaxBackoff;
}
@tianleiwu (Contributor) commented Nov 16, 2024:

This logic seems odd for kMaxBackoff = 2, since kMaxBackoff = 2 has a similar effect as kMaxBackoff = 1.

pause_time_ = 0; pause_time = 0 + 1 = 1; pause_time_ is updated to (1 * 2) % 2 = 0.
When kMaxBackoff = 2, there is always one pause: 1, 1, 1, 1, ...
When kMaxBackoff = 4, pauses: 1, 3, 3, 3, ...
When kMaxBackoff = 8, pauses: 1, 3, 7, 7, ...

Ideally it could be something like:
kMaxBackoff = 2: pauses 1, 2, 2, 2, 2, 2, ...
kMaxBackoff = 4: pauses 1, 2, 4, 4, 4, 4, ...
kMaxBackoff = 8: pauses 1, 2, 4, 8, 8, 8, ...
For example, when kMaxBackoff > 1, we could only allow it to be a power of 2; then we can shift by one bit each time until it reaches the max.

I understand that we currently only use kMaxBackoff = 1, 4, or 8. I suggest testing the "ideal" one to see whether that could help.

Contributor Author:

You are right. I think just using <= would fix that, although we would need to re-benchmark.

@yuslepukhin (Member)

I have not looked at the PR in depth. However, here are two scenarios that we want to account for.
One is a continuous delivery of requests in a busy online service. This is the scenario where we do not want spinning to stop.
Another scenario we struggle with is real-time speech processing. In this case, the requests are far apart but require near-real-time responses. In this case, spinning stops and threads go to sleep.
However, waking them up takes a good millisecond, which is a lot of time considering requests often take less than half a millisecond when spinning is on. We tend to disable spinning in this case, because the spin window is usually not long enough, yet it still consumes CPU between requests.

ishwar-raut1 pushed a commit to ishwar-raut1/onnxruntime that referenced this pull request Nov 19, 2024
ishwar-raut1 pushed a commit to ishwar-raut1/onnxruntime that referenced this pull request Nov 19, 2024
…icrosoft#22350)

@goldsteinn (Contributor, Author)

> I have not looked at the PR in depth. However, here are two scenarios that we want to account for. One is a continuous delivery of requests in a busy online service. This is the scenario where we do not want spinning to stop. Another scenario we struggle with is real-time speech processing. In this case, the requests are far apart but require near-real-time responses. In this case, spinning stops and threads go to sleep. However, waking them up takes a good millisecond, which is a lot of time considering requests often take less than half a millisecond when spinning is on. We tend to disable spinning in this case, because the spin window is usually not long enough, yet it still consumes CPU between requests.

Agreed. There is no one-size-fits-all here. Wait too long and you burn power / throttle CPU frequency on cores potentially doing useful work. Don't spin enough and you end up wasting time going to and from the OS.

The primary purpose of this patch is that the prior threshold of 10^6 seemed way too heavily skewed towards spinning. The new value of ~10^4, I think (and benchmarks support this), strikes a better balance between these competing needs.

@goldsteinn (Contributor, Author)

Now that #23098 has landed, can this be re-opened and considered? It still merges cleanly. Happy to also merge in #22315 if that is the preferred implementation.

@tianleiwu (Contributor)

@goldsteinn, please open a new PR (with #22315). I suggest adding some test scripts so that other people can verify it.

@goldsteinn (Contributor, Author)

> @goldsteinn, please open a new PR (with #22315). I suggest adding some test scripts so that other people can verify it.

got it, see: #23278
