
Profile with kineto and warmup for more accurate benchmarking #3580

Open · amirakb89 wants to merge 12 commits into main from profile_with_kineto

Conversation

amirakb89

Summary
This PR introduces:
A new warm-up method to ensure sufficient GPU preparation before benchmarking.
Benchmark time calculation using the Kineto profiler for measuring the time and bandwidth of inference forward kernels.

Motivation
In small benchmark cases, kernel launch and synchronization overheads can be significant relative to the actual kernel runtime. By leveraging the Kineto profiler:
These overheads are excluded from the measurement.
Users get a more accurate estimate of the forward kernel's execution time and bandwidth, as shown in the sketch below.
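A minimal sketch of this measurement style, using torch.profiler (the Python front end to Kineto). The `forward_fn` and `bytes_per_iter` names are hypothetical stand-ins for illustration, not the PR's actual code:

```python
import torch
from torch.profiler import ProfilerActivity, profile


def kineto_kernel_time_us(forward_fn, iters: int = 100) -> float:
    """Mean GPU kernel time per iteration, in microseconds.

    Summing only device-side kernel events excludes host-side launch and
    synchronization overhead, which can dominate for small kernels.
    """
    with profile(activities=[ProfilerActivity.CUDA]) as prof:
        for _ in range(iters):
            forward_fn()
        torch.cuda.synchronize()
    # Aggregate device-side time across all recorded kernel events.
    total_us = sum(e.self_cuda_time_total for e in prof.key_averages())
    return total_us / iters


def bandwidth_gb_per_s(bytes_per_iter: int, kernel_time_us: float) -> float:
    """Achieved bandwidth in GB/s, given bytes moved by one forward pass."""
    return bytes_per_iter / (kernel_time_us * 1e-6) / 1e9
```

Because only kernel events are summed, host-side gaps between launches do not inflate the reported time, which is the point of profiling-based timing for small kernels.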

For small kernels, iteration-based warm-up might not be sufficient.
By leveraging time-based warm-up (see the sketch after this list):
Users can be confident the GPU has warmed up sufficiently.
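A minimal sketch of time-based warm-up, assuming `forward_fn` is the call being benchmarked (a hypothetical name); the PR's `--warmup_ms` flag expresses the same idea:

```python
import time

import torch


def time_based_warmup(forward_fn, budget_ms: float = 50.0) -> None:
    """Run forward_fn until at least budget_ms of wall time has elapsed,
    rather than for a fixed number of iterations."""
    start = time.perf_counter()
    while (time.perf_counter() - start) * 1e3 < budget_ms:
        forward_fn()
    torch.cuda.synchronize()  # make sure all warm-up kernels have finished
```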

Test instructions
The script below shows how to use these features:
python bench/split_table_batched_embeddings_benchmark.py nbit-device-with-spec --export-trace --warmup_ms 50
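Here `--warmup_ms 50` sets the time-based warm-up budget to 50 ms, and `--export-trace` saves the collected Kineto trace. Kineto traces are Chrome trace JSON files, so the exported file can be inspected in chrome://tracing or Perfetto.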

@facebook-github-bot
Contributor

Hi @amirakb89!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g., your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

netlify bot commented Jan 16, 2025

Deploy Preview for pytorch-fbgemm-docs ready!

🔨 Latest commit: 5308694
🔍 Latest deploy log: https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/67897b281ac2fa000855def6
😎 Deploy Preview: https://deploy-preview-3580--pytorch-fbgemm-docs.netlify.app

@facebook-github-bot
Contributor

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

@amirakb89 force-pushed the profile_with_kineto branch 2 times, most recently from 9b1ec89 to 5bb4730 on January 16, 2025 19:47
@amirakb89 force-pushed the profile_with_kineto branch from e2808f3 to 5308694 on January 16, 2025 21:33
@facebook-github-bot
Contributor

@q10 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

q10 pushed a commit to q10/FBGEMM that referenced this pull request on Jan 17, 2025
…h#3580)

Summary:
X-link: facebookresearch/FBGEMM#667

(The commit message repeats the PR summary above.)

Reviewed By: leitian

Differential Revision: D68292871

Pulled By: q10
q10 pushed a commit to q10/FBGEMM that referenced this pull request on Jan 17, 2025
…h#3585)

Summary:
Pull Request resolved: pytorch#3585

X-link: facebookresearch/FBGEMM#667

(The commit message repeats the PR summary above.)

Pull Request resolved: pytorch#3580

Reviewed By: leitian

Differential Revision: D68292871

Pulled By: q10