rope_benchmark #3550
Conversation
@pytest.mark.parametrize(
This is the only part that's worth reviewing.
The code above was dumped directly from Kevin's rope example script. (Note that I had to update the script with nv_enable_matmul in thunder.jit; otherwise we were seeing segmentation at the nvFuser definition level.)
I also want to add another toy example where we'll sweep over the batch size, but I'll do that in a separate PR.
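For context, a minimal sketch of the compile call being described; the helper name is hypothetical, and the actual call appears in the diff further down.

```python
import thunder
from thunder.executors.nvfuserex import nvfuserex  # import path may vary across thunder versions


def compile_with_nvfuser_matmul(fwd_fn):
    # Hypothetical helper mirroring the thunder.jit call in this PR: route the
    # forward through the nvFuser executor and let it also claim matmuls, which
    # avoided the segmentation seen at the nvFuser definition level.
    return thunder.jit(
        fwd_fn,
        nv_enable_bookend=False,
        executors=[nvfuserex],
        nv_enable_matmul=True,
    )
```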
@Priya2698 is adding the Thunder backend in #3394. Does that mean we can just have the forward functions?
We will also benchmark the backward pass with the Thunder backend.
Yes, so we don't need to have the backward implementations explicitly, right?
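As an illustration of what that means in practice, a rough sketch (names are illustrative, not code from this PR): the backward pass is exercised through ordinary autograd on the compiled forward.

```python
import torch


def run_fwd_bwd(compiled_fn, *inputs):
    # Assumes the tensor inputs were created with requires_grad=True.
    out = compiled_fn(*inputs)
    # No hand-written backward is needed; autograd derives it from the compiled forward.
    out.backward(torch.ones_like(out))
    return [inp.grad for inp in inputs if isinstance(inp, torch.Tensor)]
```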
Looking at the thunder-nvfuser timing: strangely, the benchmark number doesn't match the benchmark from Kevin's example. But if I run the manual rope_example, I'm getting these:
I'll double-check the measurement script as well as the compile options (i.e. thunder trace options). We need to do the same sanity check for torchcompile later.
Noted. I'll add another executor.
I realized that Kevin's benchmark script has been updated to measure profiler time as well, and I was two commits behind that. The previous discrepancy came from the different measurement.
With the updated manual benchmark, we are making an apples-to-apples comparison now. On H100, manual benchmark:
pytest benchmark:
Forward time mostly matches (except for hf_phi3; that's because we enable matmul in the pytest benchmark, which is not enabled in the manual benchmark).
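As an aside, a hedged sketch of the kind of apples-to-apples timing sanity check being discussed, using CUDA events; neither harness necessarily measures exactly this way, and the helper name is illustrative.

```python
import torch


def time_fwd_ms(fn, inputs, warmup=10, iters=100):
    # Time the forward callable with CUDA events; returns milliseconds per iteration.
    for _ in range(warmup):
        fn(*inputs)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(*inputs)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters
```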
):
    kwargs = {}
    if executor == "thunder":
        kwargs["nv_enable_matmul"] = True
BTW, this is where I'm enabling matmul in nvfuser.
This gives us a single fusion region, which I believe is something we would like. cc'ing @naoyam
        return thunder.jit(
            fwd_fn, nv_enable_bookend=False, executors=[nvfuserex], **kwargs
        )
    if executor == "thunder-torchcompile":
thunder-torchcompile is the config we wanted for the rope comparison. Not sure if this is something we would also like to enable for other benchmarks. cc'ing @Priya2698
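A hedged sketch of how such a branch might construct the callable; the exact executor object and import path for thunder's torch.compile executor are assumptions here and may differ between thunder versions.

```python
import thunder
# Assumption: thunder ships a torch.compile-based executor; the import path and
# name below may not match the version used in CI.
from thunder.executors.torch_compile import torch_compile_ex


def compile_thunder_torchcompile(fwd_fn, **kwargs):
    # Hypothetical helper: same thunder.jit call as the nvFuser branch, but
    # routing supported regions through the torch.compile executor instead.
    return thunder.jit(
        fwd_fn, nv_enable_bookend=False, executors=[torch_compile_ex], **kwargs
    )
```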
We should not enable it for other benchmarks just yet. That would further increase the weekly CI timings, so let's keep it to RoPE for now. We can revisit which executors to run in nightly/weekly.
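For illustration, limiting the extra executor to the RoPE tests could look roughly like this; the real tests take more parameters, and the list names here are only indicative.

```python
import pytest

# Only the RoPE benchmarks opt into thunder-torchcompile; other benchmarks would
# keep the shorter DEFAULT_EXECUTORS list so weekly CI time does not grow.
ROPE_EXECUTORS = ["eager", "torchcompile", "thunder", "thunder-torchcompile"]
DEFAULT_EXECUTORS = ["eager", "torchcompile", "thunder"]


@pytest.mark.parametrize("executor", ROPE_EXECUTORS)
def test_rope_variations_fwd_benchmark(executor):
    ...  # build fwd_fn for the variation and run it through with_executor(executor, fwd_fn)
```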
-def with_executor(executor: str, fwd_fn: Callable) -> Callable:
-    assert executor in ["eager", "torchcompile", "thunder"]
+def with_executor(executor: str, fwd_fn: Callable, **kwargs) -> Callable:
+    assert executor in ["eager", "torchcompile", "thunder", "thunder-torchcompile"]
Suggested change:
-    assert executor in ["eager", "torchcompile", "thunder", "thunder-torchcompile"]
+    assert executor in ["eager", "torchcompile", "thunder-nvfuser", "thunder-torchcompile"]
I take it you are suggesting a renaming. I can do that in a separate PR to make reviewing easier.
@@ -221,9 +226,9 @@ def set_metrics(
     % Peak Bandwidth (SOL): 100 * Bandwidth / PEAK_BANDWIDTH
     """
     if not iobytes:
-        if isinstance(inputs, torch.Tensor):
+        if not isinstance(inputs, Iterable):
Why is this changed?
The code below is:

    for inp in inputs:
        if isinstance(inp, torch.Tensor):
            iobytes += inp.element_size() * inp.numel()

So here we are really checking whether inputs is an Iterable. The same applies to outputs.
For backward checks, we have outputs=None, which would cause an exception when we don't provide iobytes. That no longer applies here, since we are providing iobytes for rope, but I think it's still a legitimate fix.
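To make the intent concrete, a hedged sketch of the normalization being discussed; the helper name and exact checks are illustrative, not the code in this PR.

```python
from collections.abc import Iterable

import torch


def compute_iobytes(inputs, outputs):
    # Count bytes moved for every tensor among inputs/outputs. A lone tensor or
    # non-iterable value is wrapped in a list, and None (e.g. outputs of a
    # backward-only benchmark) simply contributes nothing.
    iobytes = 0
    for group in (inputs, outputs):
        if group is None:
            continue
        if isinstance(group, torch.Tensor) or not isinstance(group, Iterable):
            group = [group]
        for t in group:
            if isinstance(t, torch.Tensor):
                iobytes += t.element_size() * t.numel()
    return iobytes
```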
benchmarks/python/test_rope.py
@pytest.mark.parametrize(
    "executor", ["eager", "torchcompile", "thunder", "thunder-torchcompile"]
)
def test_rope_variations_fwd_benchmark(
IIUC, we also have test_rope_benchmark, which is separate from test_rope_variations_fwd_benchmark and test_rope_variations_bwd_benchmark. What does test_rope_benchmark evaluate?
I think those are coming from @wujingyue's experiment with rope performance earlier last year. We just have a manual definition of the rope module in the llama examples. I'm not sure whether those benchmarks are still relevant? Tagging @wujingyue.
Apparently, the new benchmarks cover more variations and are more realistic (captured from Thunder traces). So I don't see a reason to keep test_rope_benchmark, which is only one variation and is forward only. The without_cat variation could be useful, but since no model implementations have adopted this trick and nvFuser is getting better at cat, I don't see a reason to keep that either.
Some naming nits
    for i in range(1, len(outputs)):
        output += outputs[i]

# NOTE: the iobytes is computed based on how thunder autograd worked. So this is just
I agree with your previous point about the IObytes computation potentially being different between the executors. I discussed this with @kevinstephano, but off-hand I am not sure of a more robust way. I will open an issue to look into it. For this PR, the nvfuser-definition-based computation LGTM since that is consistent with the other benchmarks.
Any benchmark can add validation. For nvfuser (manual fusion definitions), we validate against a torch reference by default (it can be turned off using …).
Minor comments. The benchmarks look good to me overall.
For the variations: are there any good references we can add as comments?
Please also post the final numbers and the numbers from Kevin's script in the PR description for posterity.
"resid_pdrop": 0.0, | ||
"rms_norm_eps": 1e-05, | ||
"rope_scaling": { | ||
"long_factor": [ |
Where do these numbers come from? Can you add a reference if there is one?
Those were just the config dumped from the model that we pulled:

    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-3.5-mini-instruct")
    print(model.config)

So we don't have to pull the real model. Does this sound right to you, @kevinstephano?
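As a side note, if only the config values are needed, pulling just the config avoids downloading the weights entirely; this is a suggestion, not what the PR does.

```python
from transformers import AutoConfig

# Downloads only the model's config.json, not the weights.
config = AutoConfig.from_pretrained("microsoft/Phi-3.5-mini-instruct")
print(config)
```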
We run the benchmarks in eager mode as well. Why can't we just use those results as the reference?
!test
LGTM. Thanks @jjsjann123.
@Priya2698 I still would like to have result validations. Resize indexing in these fusions tends to be complex, and I suspect compute-sanitizer would not be very helpful.
@xwang233, @jjsjann123 It would be really good to have these new benchmark cases showing up in the performance dashboard.
We can add validation here, the same as we do in the nvfuser benchmarks. We do need to compute the eager output for the other executors, though, and cannot reuse it from the eager executor runs.
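A hedged sketch of the kind of validation being discussed: compute an eager reference on the same inputs for each executor and compare with torch.testing.assert_close. The tolerances and helper name are illustrative, not from this PR.

```python
import torch


def validate_against_eager(eager_fn, compiled_fn, inputs, rtol=1e-3, atol=1e-3):
    # Run the eager reference and the executor-under-test on identical inputs,
    # then compare the outputs element-wise within the given tolerances.
    expected = eager_fn(*inputs)
    actual = compiled_fn(*inputs)
    torch.testing.assert_close(actual, expected, rtol=rtol, atol=atol)
```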
Rope benchmark extracted from lightning trace.
TODO: