Profile with Kineto and warmup for more accurate benchmarking #3580
Summary
This PR introduces:
- A new warm-up method that ensures the GPU is sufficiently prepared before benchmarking.
- Benchmark time measurement using the Kineto profiler, capturing the execution time and bandwidth of the inference forward kernels.
Motivation
In small benchmark cases, kernel launch and synchronization overheads can be significant compared to the actual kernel runtime. By measuring with the Kineto profiler:
- These overheads are excluded from the measurement.
- Users get a more accurate estimate of the forward kernel's execution time and bandwidth (see the sketch below).
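As a rough illustration of the idea (not the benchmark's actual implementation), a Kineto-based timer built on `torch.profiler` might look like the sketch below; `kineto_time_us` and the copy-kernel usage are hypothetical names chosen for this example:

```python
import torch
from torch.profiler import ProfilerActivity, profile

def kineto_time_us(fn, iters: int = 100) -> float:
    """Average GPU-side time per call of ``fn``, in microseconds.

    Summing the kernels' self device time from the Kineto trace excludes
    host-side launch and synchronization overhead from the measurement.
    """
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        for _ in range(iters):
            fn()
        torch.cuda.synchronize()
    total_us = sum(evt.self_cuda_time_total for evt in prof.key_averages())
    return total_us / iters

# Hypothetical usage: time a copy kernel and derive its achieved bandwidth.
x = torch.randn(1 << 24, device="cuda")
y = torch.empty_like(x)
t_us = kineto_time_us(lambda: y.copy_(x))
bytes_moved = 2 * x.numel() * x.element_size()  # one read + one write per element
print(f"{t_us:.1f} us/iter, {bytes_moved / (t_us * 1e-6) / 1e9:.1f} GB/s")
```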
For small kernels, the iteration-based warm-up might not be sufficient. With the time-based warm-up (sketched below):
- Users can be confident the GPU has done enough warm-up before timing begins.
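A time-based warm-up can be as simple as the following hypothetical sketch, where `fn` is the forward call being benchmarked and `duration_ms` plays the role of the `--warmup_ms` flag (an illustration of the technique, not the PR's exact code):

```python
import time
import torch

def warmup(fn, duration_ms: float = 50.0) -> None:
    """Run ``fn`` repeatedly for at least ``duration_ms`` of wall time so
    clocks, caches, and the allocator settle before measurement starts."""
    start = time.perf_counter()
    while (time.perf_counter() - start) * 1e3 < duration_ms:
        fn()
        torch.cuda.synchronize()  # count completed work, not queued launches
```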
Test instruction
The command below shows how to use these features:
```bash
python bench/split_table_batched_embeddings_benchmark.py nbit-device-with-spec --export-trace --warmup_ms 50
```
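Assuming `--export-trace` emits a Kineto/Chrome trace (the default format produced by `torch.profiler`), the resulting file can be loaded in `chrome://tracing` or Perfetto to inspect the individual forward kernels.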