ThreadPool: Spend less time busy waiting. #21545
Conversation
The purpose of the patch is primarily to save power, but it also has nice perf benefits (mostly from allowing the system to better distribute power to cores doing meaningful work).

Changes are twofold:

1) Decrease the WorkerLoop spin count dramatically, from ~10^6 to ~10^4. The reality is that after ~10^4 spins, if there hasn't been any new work added, it's unlikely any new work is imminent, so sleep to preserve power.
2) Use exponential backoff for waiting on memory. This saves a bit more power and, importantly, increases the time between iterations in WorkerLoop to help accommodate the dramatically lowered spin counts.

Since the tuning for both the iteration counts / backoff counts is dramatically different for hybrid/non-hybrid systems, this patch templates the affected functions and dynamically chooses based on `CPUIDInfo::IsHybrid()`. This seemed like the "lightest weight" way of getting the change in, although it's likely we could incur less dynamic overhead if we added the template argument to the entirety of `ThreadPoolTempl`.

Measured performance on an [Intel Raptor Lake CPU](https://www.intel.com/content/www/us/en/products/sku/230496/intel-core-i913900k-processor-36m-cache-up-to-5-80-ghz/specifications.html) across a range of models. Below are the results of 3 runs, with each metric being the value-before-patch / value-after-patch (so for something like inference time, lower is better).

Metric | Before / After ratio
:-----|:-----:
Session creation time cost | 0.9151
First inference time cost | 0.8564
Total inference time cost | 0.9995
Total inference requests | 0.9450
Average inference time cost | 0.9396
Total inference run time | 0.9995
Number of inferences per second | 0.9449
Avg CPU usage | 0.9018
Peak working set size | 0.9876
Runs | 1.0650
Min Latency | 0.9706
Max Latency | 0.8538
P50 Latency | 0.9453
P90 Latency | 0.9051
P95 Latency | 0.8683
P99 Latency | 0.8547
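For reference, here is a minimal C++ sketch of the backoff-waiter idea described above, reconstructed from the snippets quoted in the review discussion below; it is an illustration rather than the exact patch, and the local `SpinPause` shim stands in for `onnxruntime::concurrency::SpinPause`.

```cpp
#if defined(__x86_64__) || defined(_M_X64) || defined(__i386__) || defined(_M_IX86)
#include <immintrin.h>
inline void SpinPause() { _mm_pause(); }  // x86 pause instruction
#else
inline void SpinPause() {}  // nop on non-x86 targets
#endif

// kMaxBackoff == 0 makes wait() a no-op; otherwise each call spins for a
// growing number of pause instructions, doubling (mod kMaxBackoff) per call.
template <unsigned kMaxBackoff>
class ThreadPoolWaiter {
 public:
  void wait() {
    if constexpr (kMaxBackoff != 0) {
      unsigned pause_time = pause_time_ + 1U;
      for (unsigned i = 0; i < pause_time; ++i) {
        SpinPause();
      }
      pause_time_ = (pause_time * 2U) % kMaxBackoff;
    }
  }

 private:
  unsigned pause_time_ = 0;
};
```

With this shape, the worker-loop code can instantiate e.g. `ThreadPoolWaiter<kIsHybrid ? 4 : 0>`, so hybrid systems get a short exponential backoff while the non-hybrid path compiles `wait()` down to nothing.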
/azp run Big Models, Linux Android Emulator QNN CI Pipeline, Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline, Linux OpenVINO CI Pipeline, Linux QNN CI Pipeline, MacOS CI Pipeline
/azp run Windows ARM64 QNN CI Pipeline, Windows CPU CI Pipeline, Windows GPU CI Pipeline, Windows GPU TensorRT CI Pipeline, Windows x64 QNN CI Pipeline, onnxruntime-binary-size-checks-ci-pipeline, orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline, orttraining-ortmodule-distributed
Azure Pipelines successfully started running 8 pipeline(s).
Azure Pipelines successfully started running 9 pipeline(s).
/azp run Windows ARM64 QNN CI Pipeline, Windows CPU CI Pipeline, Windows GPU CI Pipeline, Windows GPU TensorRT CI Pipeline, Windows x64 QNN CI Pipeline, onnxruntime-binary-size-checks-ci-pipeline, orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline, orttraining-ortmodule-distributed
/azp run Big Models, Linux Android Emulator QNN CI Pipeline, Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline, Linux OpenVINO CI Pipeline, Linux QNN CI Pipeline, MacOS CI Pipeline
Azure Pipelines successfully started running 8 pipeline(s).
Azure Pipelines successfully started running 9 pipeline(s).
Looking at the CI failures, I think the failures were spurious (or at least unrelated to this PR). Can it be re-run? Windows Failure Builder #20240730.3
CI looks green :)
ping
@snnn Is there anything else holding back this PR? Thanks
ping
@pranavsharma How can we move forward on this? Thanks
The CI tests seem to have hung. Can they be restarted?
/azp run ONNX Runtime Web CI Pipeline
Commenter does not have sufficient privileges for PR 21545 in repo microsoft/onnxruntime
Hi, could someone restart the not-yet-completed tests? I don't think I have the ability to merge this until they are all green, and I don't have the permissions to run the tests.
/azp run ONNX Runtime Web CI Pipeline, Windows GPU CUDA CI Pipeline, Windows GPU DML CI Pipeline, Windows GPU Doc Gen CI Pipeline
Azure Pipelines successfully started running 4 pipeline(s).
/azp run Windows GPU Doc Gen CI Pipeline
Azure Pipelines successfully started running 1 pipeline(s).
Hi, the failures (warnings) still seem entirely unrelated, i.e.: https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1500618&view=results
Is it possible to ignore those warnings?
/azp run ONNX Runtime Web CI Pipeline, Windows GPU CUDA CI Pipeline, Windows GPU Doc Gen CI Pipeline
Azure Pipelines successfully started running 3 pipeline(s).
<3
This reverts commit 4e15b22. Reason: We are seeing an increase in the number of deadlocks after this PR. We have a release coming up next week and do not have enough time to investigate the root cause, hence reverting this PR temporarily. Moreover, this is causing an increase in the binary size.

### Description
We are seeing an [increase in the number of deadlocks](#22315 (comment)) after this PR. We have a release coming up next week and do not have enough time to investigate the root cause, hence reverting this PR temporarily.

### Motivation and Context
See above.
@@ -1257,9 +1283,10 @@ class ThreadPoolTempl : public onnxruntime::concurrency::ExtendedThreadPoolInter
    // Increase the worker count if needed. Each worker will pick up
    // loops to execute from the current parallel section.
    std::function<void(unsigned)> worker_fn = [&ps](unsigned par_idx) {
      ThreadPoolWaiter<kIsHybrid ? 4 : 0> waiter{};
So, this change means on non-Intel CPUs you removed the SpinPause and replaced it with a busy loop. Why? How is it better?
Not quite, if `SpinPause` is a proper nop, the loop will be optimized out due to no side-effects.
Why was the change made? I understand you wanted to optimize ONNX Runtime's performance for a few kinds of Intel CPUs, but why do you need to remove the spinlock for the other CPUs?
Hmm, this doesn't remove the spin lock. This is just revising how we wait between testing the lock. I.e.:
`while (try_locked) { waiter.wait(); }`
The `waiter.wait()` is just to help improve the perf/power consumption of the spin lock. The spin lock itself is the `while (try_locked) { /* Wait however you like */ }`.
The exact values were picked based on benchmarks.
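To make that concrete, here is a hypothetical acquire loop (the names `AcquireSpinLock` and `Waiter` are illustrative, not from the PR) showing that the waiter only shapes the gap between lock probes:

```cpp
#include <atomic>

// Waiter is any type exposing wait(), e.g. a ThreadPoolWaiter-style backoff helper.
template <typename Waiter>
void AcquireSpinLock(std::atomic<bool>& locked, Waiter& waiter) {
  // The spin lock itself is the outer loop; waiter.wait() only decides how we
  // pause between failed probes (a few pause instructions, not a kernel sleep).
  while (locked.exchange(true, std::memory_order_acquire)) {
    waiter.wait();
  }
}
```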
Here you set the template parameter `kMaxBackoff` to zero, which means `waiter.wait()` is empty, is that right?
But we do not want to tune ONNX Runtime performance for a special kind of CPU, especially when the tuning may hurt performance for the other kinds of CPUs.
I would say the distinction between Hybrid and non-Hybrid is pretty critical in the context of a scheduler. It fundamentally changes the characteristics of the threads.
I mentioned AMD because in my opinion this change is bad for all AMD CPUs. There is no good reason why we should change a spinlock to a busy loop for AMD CPUs. (You can argue it is not just hurting AMD CPUs; it hurts some other CPUs as well.) I am questioning why the "_mm_pause();" function call was removed for most kinds of CPUs, except Intel Hybrid CPUs.
Again, benchmarks indicate it was profitable (including on AMD). Look, this single pause is not the end-all be-all. If, as the maintainer, you say "I want the pause", I will update the PR, but I don't think your reasoning thus far is sound.
I didn't mention ARM because, as you said, you didn't even test it on ARM.
The changes to the wait code have no effect on ARM. It's all a nop.
My job is to find the deadlock. I am committed to finishing it. I am a maintainer of this repo, but I do not want to use that as authority to influence this technical discussion. I am not a specialist in thread scheduling. We will find someone who knows it better to review this code, after the deadlock bug is fixed.
Thanks for your clarification. We'll discuss what you're saying.
Regarding ARM, it is right that a nop was used. There is a draft #22787 to add yield (no perf test has been done yet).
Just a guess, but a true yield would probably best be used far more sparingly than pause. Pause really exists to save a bit of power/prevent aggressive speculation, but it's orders of magnitude less costly than a yield. Especially if it's basically running at 1 thread per core, it seems that if you actually want to go to the kernel, you probably want to just be sleeping to save more power; using yield to bounce in and out of kernel space doesn't really accomplish that... But hey, let's see what the benchmarks say.
      while (ps.tasks_finished < tasks_to_wait_for) {
        onnxruntime::concurrency::SpinPause();
        waiter.wait();
Suggest reverting this spot. Since it is waiting for other tasks, it is better to use a loop of spin pause instead of a no-op (for hybrid mode). It could reduce the power consumed by the thread while it waits, minimizing resource contention.
Maybe? The exact values are entirely based on benchmarks across a variety of machines. This performed the best in our benchmarks, but there is no intrinsic reason it needs to be as it is.
        unsigned pause_time = pause_time_ + 1U;
        for (unsigned i = 0; i < pause_time; ++i) {
          onnxruntime::concurrency::SpinPause();
        }
        pause_time_ = (pause_time * 2U) % kMaxBackoff;
      }
This logic seems odd for kMaxBackoff = 2, since kMaxBackoff = 2 has the same effect as kMaxBackoff = 1:
pause_time_ = 0; pause_time = 0 + 1 = 1; pause_time_ is updated to (1 * 2) % 2 = 0
When kMaxBackoff = 2, there is always one pause: 1, 1, 1, 1, ...
When kMaxBackoff = 4, pause: 1, 3, 3, 3, ...
When kMaxBackoff = 8, pause: 1, 3, 7, 7, ...
Ideally it could be something like:
kMaxBackoff = 2: pause 1, 2, 2, 2, 2, 2, ...
kMaxBackoff = 4: pause: 1, 2, 4, 4, 4, 4, ...
kMaxBackoff = 8: pause: 1, 2, 4, 8, 8, 8, ...
For example, when kMaxBackoff > 1, we could only allow it to be a power of 2; then we can shift a bit each time until it reaches the max.
I understand that we currently only use kMaxBackoff = 1, 4, or 8. Suggest testing the "ideal one" to see whether that could help.
You are right. I think just `<=` would fix that, although we would need to re-benchmark.
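For illustration, here is a rough sketch of the shift-based, saturating variant suggested in the comment above (not code from the PR; it assumes `kMaxBackoff` is a power of two when greater than 1, and it reuses the repo's existing `SpinPause` helper):

```cpp
template <unsigned kMaxBackoff>
class SaturatingWaiter {
 public:
  void wait() {
    // Spin for the current pause count.
    for (unsigned i = 0; i < pause_time_; ++i) {
      onnxruntime::concurrency::SpinPause();
    }
    // Double the pause count until it saturates: 1, 2, 4, ..., kMaxBackoff, kMaxBackoff, ...
    if (pause_time_ < kMaxBackoff) {
      pause_time_ <<= 1;
    }
  }

 private:
  unsigned pause_time_ = 1;
};
```

This produces the 1, 2, 4, ..., kMaxBackoff sequences listed above, but as noted it would need to be re-benchmarked before replacing the modulo-based version.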
I have not looked at the PR in depth. However, here are two scenarios that we want to account for.
The purpose of the patch is primarily to save power, but it also has nice perf benefits (mostly from allowing the system to better distribute power to cores doing meaningful work).

Changes are twofold:

1) Decrease the WorkerLoop spin count dramatically, from ~10^6 to ~10^4. The reality is that after ~10^4 spins, if there hasn't been any new work added, it's unlikely any new work is imminent, so sleep to preserve power. This aligns more closely with upstream EigenV3.
2) Use exponential backoff for waiting on memory. This saves a bit more power and, importantly, increases the time between iterations in WorkerLoop to help accommodate the dramatically lowered spin counts.

Since the tuning for both the iteration counts / backoff counts is dramatically different for hybrid/non-hybrid systems, this patch templates the affected functions and dynamically chooses based on `CPUIDInfo::IsHybrid()`. This seemed like the "lightest weight" way of getting the change in, although it's likely we could incur less dynamic overhead if we added the template argument to the entirety of `ThreadPoolTempl`.

Measured performance on an [Intel Meteor Lake CPU](https://www.intel.com/content/www/us/en/products/sku/237329/intel-core-ultra-7-processor-165u-12m-cache-up-to-4-90-ghz/specifications.html) across a range of models. Below are the results of 3 runs, with each metric being the value-before-patch / value-after-patch (so for something like inference time, lower is better).

Metric | Before / After ratio
:-----|:-----:
Session creation time cost | 0.7179
First inference time cost | 0.7156
Total inference time cost | 1.0146
Total inference requests | 0.8874
Average inference time cost | 0.8800
Total inference run time | 1.0146
Number of inferences per second | 0.8955
Avg CPU usage | 0.9462
Peak working set size | 0.9922
Runs | 1.1552
Min Latency | 0.7283
Max Latency | 0.9258
P50 Latency | 0.9534
P90 Latency | 0.9639
P95 Latency | 0.9659
P99 Latency | 0.9640

So the net result is a 1.16x improvement in throughput and between 1.08-1.37x improvement in latency.
…icrosoft#22350) This reverts commit 4e15b22. Reason: We are seeing an increase in the number of deadlocks after this PR. We have a release coming up next week and do not have enough time to investigate the root cause, hence reverting this PR temporarily. Moreover, this is causing an increase in the binary size.

### Description
We are seeing an [increase in the number of deadlocks](microsoft#22315 (comment)) after this PR. We have a release coming up next week and do not have enough time to investigate the root cause, hence reverting this PR temporarily.

### Motivation and Context
See above.
Agreed. There is no one-size-fits-all here. Wait too long and you burn power/throttle CPU freq on cores potentially doing useful work. Don't spin enough and you end up wasting time going to/from the OS. The primary point of this patch is that the prior threshold of 10^6 seemed WAY too hedged towards spinning. The new value of 10^4, I think (and benchmarks support), strikes a better balance between these competing needs.
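As a rough illustration of that balance (a simplified spin-then-block pattern, not the actual WorkerLoop; `kSpinCount`, `WaitForWork`, and `has_work` are hypothetical names):

```cpp
#include <condition_variable>
#include <mutex>

constexpr int kSpinCount = 10'000;  // ~10^4, the threshold argued for above

// has_work() is any cheap predicate telling the worker a task is ready.
template <typename HasWorkFn>
void WaitForWork(HasWorkFn has_work, std::mutex& m, std::condition_variable& cv) {
  // Probe a bounded number of times while new work is plausibly imminent.
  for (int i = 0; i < kSpinCount; ++i) {
    if (has_work()) return;
    onnxruntime::concurrency::SpinPause();  // existing helper in the thread pool code
  }
  // After ~10^4 unsuccessful probes, fall back to a real (kernel-level) wait
  // so the core can save power instead of continuing to spin.
  std::unique_lock<std::mutex> lk(m);
  cv.wait(lk, has_work);
}
```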
@goldsteinn, please open a new PR (with #22315). Suggest adding some test scripts so that other people can verify it.
got it, see: #23278
The purpose of the patch is primarily to save power, but it also has nice perf benefits (mostly from allowing the system to better distribute power to cores doing meaningful work).
Changes are twofold:
1) Decrease the WorkerLoop spin count dramatically, from ~10^6 to ~10^4. The reality is that after ~10^4 spins, if there hasn't been any new work added, it's unlikely any new work is imminent, so sleep to preserve power. This aligns more closely with upstream EigenV3.
2) Use exponential backoff for waiting on memory. This saves a bit more power and, importantly, increases the time between iterations in WorkerLoop to help accommodate the dramatically lowered spin counts.
Since the tuning for both the iteration counts / backoff counts is dramatically different for hybrid/non-hybrid systems, this patch templates the affected functions and dynamically chooses based on `CPUIDInfo::IsHybrid()`. This seemed like the "lightest weight" way of getting the change in, although it's likely we could incur less dynamic overhead if we added the template argument to the entirety of `ThreadPoolTempl`.

Measured performance on an Intel Meteor Lake CPU across a range of models.
Below are the results of 3 runs, with each metric being the value-before-patch / value-after-patch (so for something like inference time, lower is better).

Metric | Before / After ratio
:-----|:-----:
Session creation time cost | 0.7179
First inference time cost | 0.7156
Total inference time cost | 1.0146
Total inference requests | 0.8874
Average inference time cost | 0.8800
Total inference run time | 1.0146
Number of inferences per second | 0.8955
Avg CPU usage | 0.9462
Peak working set size | 0.9922
Runs | 1.1552
Min Latency | 0.7283
Max Latency | 0.9258
P50 Latency | 0.9534
P90 Latency | 0.9639
P95 Latency | 0.9659
P99 Latency | 0.9640
So the net result is a 1.16x improvement in throughput and between 1.08-1.37x improvement in latency.