
Profile with Kineto and warmup for more accurate benchmarking (#3580) #3585

Open
wants to merge 1 commit into base: main

Conversation

@q10 (Contributor) commented Jan 17, 2025

Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/667

**Summary**
This PR introduces:
- A new warm-up method to ensure sufficient GPU preparation before benchmarking.
- Benchmark time calculation using the Kineto profiler for measuring the time and bandwidth of inference forward kernels.

**Motivation**
In small benchmark cases, kernel launch and synchronization overheads can be significant compared to the actual kernel runtime. By leveraging the Kineto profiler:
- These overheads are excluded from the measurement.
- Users get a more accurate estimate of the execution time and bandwidth of the forward kernel (see the sketch after this list).
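For illustration, a minimal sketch of this idea, assuming a callable `fn` that runs the forward kernel under test; the helper name, iteration count, and structure are placeholders rather than the exact code added in this PR:

```python
# Minimal sketch (assumed, not the exact code in this PR): measure device-side
# kernel time with the Kineto profiler via torch.profiler, so CPU-side launch
# and synchronization overheads are excluded from the reported time.
import torch
from torch.profiler import ProfilerActivity, profile

def kineto_kernel_time_us(fn, iters: int = 10) -> float:
    """Average CUDA kernel time per call of fn(), in microseconds."""
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        for _ in range(iters):
            fn()
        torch.cuda.synchronize()
    # Sum self CUDA time over all recorded events (reported in microseconds).
    total_cuda_us = sum(e.self_cuda_time_total for e in prof.key_averages())
    return total_cuda_us / iters
```

Given the bytes read and written per forward call, an effective bandwidth estimate would then be `bytes_per_iter / (kernel_time_us * 1e-6)` bytes per second.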

For small kernels, an iteration-based warm-up might not be sufficient. By using a time-based warm-up, users can be confident the GPU has warmed up enough before measurement begins (see the sketch below).
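A minimal sketch of a time-based warm-up, again assuming a callable `fn`; the function name and default duration are placeholders, not the exact implementation in this PR:

```python
# Minimal sketch (assumed, not the exact code in this PR): keep issuing warm-up
# iterations until `warmup_ms` of wall-clock time has elapsed, rather than
# running a fixed iteration count.
import time
import torch

def warmup(fn, warmup_ms: float = 50.0) -> None:
    start = time.perf_counter()
    while (time.perf_counter() - start) * 1000.0 < warmup_ms:
        fn()
    torch.cuda.synchronize()  # ensure warm-up work finishes before measurement starts
```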

**Test instruction**
The command below shows how to use these features:
python bench/split_table_batched_embeddings_benchmark.py nbit-device-with-spec --export-trace --warmup_ms 50

Reviewed By: leitian

Differential Revision: D68292871

Pulled By: q10

This is a re-export of #3580

@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D68292871

netlify bot commented Jan 17, 2025

Deploy Preview for pytorch-fbgemm-docs ready!
🔨 Latest commit: 9db82e0
🔍 Latest deploy log: https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/6789f140b64b9600086543bd
😎 Deploy Preview: https://deploy-preview-3585--pytorch-fbgemm-docs.netlify.app

@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D68292871

q10 pushed a commit to q10/FBGEMM that referenced this pull request Jan 17, 2025
@q10 force-pushed the export-D68292871 branch from 0256e52 to 9866d2d on January 17, 2025 04:39
@q10 force-pushed the export-D68292871 branch from 9866d2d to 9db82e0 on January 17, 2025 05:57
@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D68292871
