Lowering vectorized pad #3261
This reverts commit d0addc4.
csrc/codegen.cpp
Outdated
@@ -402,6 +402,52 @@ class CudaKernelGenerator : private kir::ConstIrVisitor {
}
}

void generateVectorizedLdSt(Val* in, Val* out, CacheOp cache_op) {
Mechanical change: this is lifted from `void handle(const LoadStoreOp* ldst) final` so it can be shared with `TernaryOp` handling.
!test
!test
@@ -4041,4 +4041,37 @@ TEST_F(ResizeTest, SliceSliceConcatConcat) {
NVF_CHECK(ref.equal(cg_outputs[0]));
}

// manual scheduling that should have vectorized load on padded inputs.
TEST_F(ResizeTest, VectorizePadLowering) {
Should we have a test for vectorizing `where` without using pad?
Good call. I almost forgot that we have `where` directly 🤕
csrc/codegen.cpp
Outdated
@@ -1001,6 +1051,50 @@ class CudaKernelGenerator : private kir::ConstIrVisitor {
}

void handle(const TernaryOp* top) final {
// Get vectorization information
Can you add some comments about the expectation? IIUC, only `in2` is allowed to be vectorized, but technically speaking, it should be possible to have vectorized loads in both `in2` and `in3`, right? Not sure if it's worthwhile to allow that as well, although the required change seems minimal.
Yes, we can have `in2` / `in3` as TensorViews. I'm trying to add that, since @zasdfgbnm mentioned having a `where` test.
!test
LGTM
Adding **conditional** support of resize in vectorization analysis. This PR allows vectorized load on `PadOp` directly without using cache load, which improves performance of the generated kernel.

What's in this PR:

1. Add a propagation rule for resize in vectorization analysis. The rule works as follows:
   i. For supported resize: a) project the resize op to the frontier and clear `(frontier.begin(), resize_position)`; b) add the projected extent of the new resize op as `gcd(id_from, resize_op->leftExpand(), resize_op->rightExpand())`.
   ii. For unsupported resize: clear `[frontier.begin(), resize_position]`; no behavior change.
2. Update `TensorView::cacheAfter` to opt in a set of uses to cache while leaving other uses unchanged. This is necessary for cases where inputs are used by `PadOp` as well as by other operations that rely on cached load for vectorization.

Follow up to #3261. Work for supporting rope performance. [design doc](https://docs.google.com/document/d/1tafRMNIXMmHlIGAiNlaPkYp6mZAzJ2Rh_NtARHbmYNA/edit?disco=AAABYEnV_ZY)

---------

Co-authored-by: Naoya Maruyama <[email protected]>
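The extent rule for a supported resize can be illustrated with a small standalone sketch. This is not nvFuser code: `projectedResizeExtent` and `maxVectorWidth` are hypothetical helpers, and the integer parameters stand in for already-evaluated values of `id_from`, `leftExpand()` and `rightExpand()`. Taking absolute values for negative expansion (a slice rather than a pad) is an assumption made for this sketch.

```cpp
#include <cstdint>
#include <cstdlib>
#include <numeric>

// Hypothetical helper: combine the source extent with both expansion amounts.
// std::gcd(a, 0) == a, so a side with no padding does not constrain the result.
int64_t projectedResizeExtent(int64_t id_from, int64_t left_expand, int64_t right_expand) {
  const int64_t left = std::abs(left_expand);
  const int64_t right = std::abs(right_expand);
  return std::gcd(std::gcd(id_from, left), right);
}

// Hypothetical helper: largest power-of-two width (up to max_width) that
// evenly divides the projected extent.
int64_t maxVectorWidth(int64_t projected_extent, int64_t max_width) {
  int64_t width = max_width;
  while (width > 1 && projected_extent % width != 0) {
    width /= 2;
  }
  return width;
}
```

For example, a source extent of 16 padded by 4 on the left and 8 on the right gives a projected extent of 4, so under these assumptions a width-4 access remains possible while a width-8 access does not.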
Added support for lowering TernaryOp:where with vectorization factor.
i.e.
Currently this can only be done via manual scheduling. The follow-up PR on vectorization analysis, #3321, will apply this automatically.
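As a rough picture of what a vectorization factor on `where` means, the sketch below is plain host-side C++, not the kernel code this PR generates. `Vec4` and `whereVectorized` are hypothetical names, the factor of 4 is an arbitrary choice, and only one tensor operand is loaded as a vector while the other operand is a scalar fill value, which mirrors the pad case.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical 16-byte-aligned pack of 4 floats, standing in for the vector
// type a generated kernel would move with a single load or store.
struct alignas(16) Vec4 {
  float v[4];
};

// out[i] = cond[i] ? lhs[i] : fill, processed four elements at a time.
// Assumes n is a multiple of 4.
void whereVectorized(const bool* cond, const float* lhs, float fill,
                     float* out, std::size_t n) {
  for (std::size_t i = 0; i < n; i += 4) {
    Vec4 in;
    std::memcpy(&in, lhs + i, sizeof(Vec4));   // one vector-wide load
    Vec4 res;
    for (int j = 0; j < 4; ++j) {
      // The select itself stays element-wise; only the memory access is wide.
      res.v[j] = cond[i + j] ? in.v[j] : fill;
    }
    std::memcpy(out + i, &res, sizeof(Vec4));  // one vector-wide store
  }
}
```

The point of the sketch is only that the predicate is evaluated per element while the load and store cover the whole pack; how the real lowering handles predicates and both tensor operands is described by the PR itself, not by this example.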