
Profile with Kineto and warmup for more accurate benchmarking (#3580) #3585

Open
wants to merge 1 commit into base: main

Conversation

@q10 (Contributor) commented Jan 17, 2025

Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/667

**Summary**
This PR introduces:
- A new warm-up method to ensure sufficient GPU preparation before benchmarking.
- Benchmark time calculation using the Kineto profiler for measuring the time and bandwidth of inference forward kernels.

**Motivation**
In small benchmark cases, kernel launch and synchronization overheads can be significant compared to the actual kernel runtime. By leveraging the Kineto profiler:
- These overheads are excluded from the measurement.
- Users get a more accurate estimate of the execution time and bandwidth of the forward kernel (see the sketch after this list).
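For illustration, a minimal sketch of this idea, assuming a callable `fn` that runs the forward kernel under test; the helper name, iteration count, and structure are placeholders rather than the exact code added in this PR:

```python
# Minimal sketch (assumed, not the exact code in this PR): measure device-side
# kernel time with the Kineto profiler via torch.profiler, so CPU-side launch
# and synchronization overheads are excluded from the reported time.
import torch
from torch.profiler import ProfilerActivity, profile

def kineto_kernel_time_us(fn, iters: int = 10) -> float:
    """Average CUDA kernel time per call of fn(), in microseconds."""
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        for _ in range(iters):
            fn()
        torch.cuda.synchronize()
    # Sum self CUDA time over all recorded events (reported in microseconds).
    total_cuda_us = sum(e.self_cuda_time_total for e in prof.key_averages())
    return total_cuda_us / iters
```

Given the bytes read and written per forward call, an effective bandwidth estimate would then be `bytes_per_iter / (kernel_time_us * 1e-6)` bytes per second.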

For small kernels, an iteration-based warm-up might not be sufficient. By using a time-based warm-up, users can be confident the GPU has warmed up enough before measurement begins (see the sketch below).
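A minimal sketch of a time-based warm-up, again assuming a callable `fn`; the function name and default duration are placeholders, not the exact implementation in this PR:

```python
# Minimal sketch (assumed, not the exact code in this PR): keep issuing warm-up
# iterations until `warmup_ms` of wall-clock time has elapsed, rather than
# running a fixed iteration count.
import time
import torch

def warmup(fn, warmup_ms: float = 50.0) -> None:
    start = time.perf_counter()
    while (time.perf_counter() - start) * 1000.0 < warmup_ms:
        fn()
    torch.cuda.synchronize()  # ensure warm-up work finishes before measurement starts
```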

**Test instruction**
The command below shows how to use these features:
python bench/split_table_batched_embeddings_benchmark.py nbit-device-with-spec --export-trace --warmup_ms 50

Reviewed By: leitian

Differential Revision: D68292871

Pulled By: q10

This is a re-export of #3580

@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D68292871

netlify bot commented Jan 17, 2025

Deploy Preview for pytorch-fbgemm-docs ready!
🔨 Latest commit: 9db82e0
🔍 Latest deploy log: https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/6789f140b64b9600086543bd
😎 Deploy Preview: https://deploy-preview-3585--pytorch-fbgemm-docs.netlify.app

@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D68292871

q10 pushed a commit to q10/FBGEMM that referenced this pull request Jan 17, 2025
@q10 force-pushed the export-D68292871 branch from 0256e52 to 9866d2d on January 17, 2025 04:39
@q10 force-pushed the export-D68292871 branch from 9866d2d to 9db82e0 on January 17, 2025 05:57
@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D68292871
