
Profile with kineto and warmup for more accurate benchmarking #3580

Open · amirakb89 wants to merge 12 commits into main from profile_with_kineto

Conversation

amirakb89

Summary
This PR introduces:
A new warm-up method to ensure sufficient GPU preparation before benchmarking.
Benchmark time calculation using the Kineto profiler for measuring the time and bandwidth of inference forward kernels.

Motivation
In small benchmark cases, kernel launch and synchronization overheads can be significant relative to the actual kernel runtime. By leveraging the Kineto profiler:
These overheads are excluded from the measurement.
Users get a more accurate estimate of the forward kernel's execution time and bandwidth, as shown in the sketch below.
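A minimal sketch of this measurement style, using torch.profiler (the Python front end to Kineto). The `forward_fn` and `bytes_per_iter` names are hypothetical stand-ins for illustration, not the PR's actual code:

```python
import torch
from torch.profiler import ProfilerActivity, profile


def kineto_kernel_time_us(forward_fn, iters: int = 100) -> float:
    """Mean GPU kernel time per iteration, in microseconds.

    Summing only device-side kernel events excludes host-side launch and
    synchronization overhead, which can dominate for small kernels.
    """
    with profile(activities=[ProfilerActivity.CUDA]) as prof:
        for _ in range(iters):
            forward_fn()
        torch.cuda.synchronize()
    # Aggregate device-side time across all recorded kernel events.
    total_us = sum(e.self_cuda_time_total for e in prof.key_averages())
    return total_us / iters


def bandwidth_gb_per_s(bytes_per_iter: int, kernel_time_us: float) -> float:
    """Achieved bandwidth in GB/s, given bytes moved by one forward pass."""
    return bytes_per_iter / (kernel_time_us * 1e-6) / 1e9
```

Because only kernel events are summed, host-side gaps between launches do not inflate the reported time, which is the point of profiling-based timing for small kernels.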

For small kernels, iteration-based warm-up might not be sufficient.
By leveraging time-based warm-up (see the sketch after this list):
Users can be confident the GPU has warmed up sufficiently.
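A minimal sketch of time-based warm-up, assuming `forward_fn` is the call being benchmarked (a hypothetical name); the PR's `--warmup_ms` flag expresses the same idea:

```python
import time

import torch


def time_based_warmup(forward_fn, budget_ms: float = 50.0) -> None:
    """Run forward_fn until at least budget_ms of wall time has elapsed,
    rather than for a fixed number of iterations."""
    start = time.perf_counter()
    while (time.perf_counter() - start) * 1e3 < budget_ms:
        forward_fn()
    torch.cuda.synchronize()  # make sure all warm-up kernels have finished
```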

Test instructions
The script below shows how to use these features:
python bench/split_table_batched_embeddings_benchmark.py nbit-device-with-spec --export-trace --warmup_ms 50
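Here `--warmup_ms 50` sets the time-based warm-up budget to 50 ms, and `--export-trace` saves the collected Kineto trace. Kineto traces are Chrome trace JSON files, so the exported file can be inspected in chrome://tracing or Perfetto.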

@facebook-github-bot
Contributor

Hi @amirakb89!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g., your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

netlify bot commented Jan 16, 2025

Deploy Preview for pytorch-fbgemm-docs ready!

🔨 Latest commit: 5308694
🔍 Latest deploy log: https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/67897b281ac2fa000855def6
😎 Deploy Preview: https://deploy-preview-3580--pytorch-fbgemm-docs.netlify.app

@facebook-github-bot
Contributor

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

@amirakb89 force-pushed the profile_with_kineto branch 2 times, most recently from 9b1ec89 to 5bb4730 on January 16, 2025 19:47
@amirakb89 force-pushed the profile_with_kineto branch from e2808f3 to 5308694 on January 16, 2025 21:33
@facebook-github-bot
Contributor

@q10 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

q10 pushed a commit to q10/FBGEMM that referenced this pull request on Jan 17, 2025
…h#3580)

Summary:
X-link: facebookresearch/FBGEMM#667

(The commit message repeats the PR summary above.)

Reviewed By: leitian

Differential Revision: D68292871

Pulled By: q10
q10 pushed a commit to q10/FBGEMM that referenced this pull request on Jan 17, 2025
…h#3585)

Summary:
Pull Request resolved: pytorch#3585

X-link: facebookresearch/FBGEMM#667

(The commit message repeats the PR summary above.)

Pull Request resolved: pytorch#3580

Reviewed By: leitian

Differential Revision: D68292871

Pulled By: q10