[GPU] Enable GEMMs to first attempt LLVMGPUTileAndFuse with intrinsic by default #19520
base: main
Conversation
Force-pushed from 38f5a22 to 7d687d7
There are compiler failures in the regression suite models; converting to draft while I debug.
Force-pushed from 7d687d7 to 7e2cdf8
The problem was missing functionality for GEMMs of the type (f16, f16) -> f16. I filed an issue for it.
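For context, a minimal sketch (shapes and names are made up) of the kind of GEMM this refers to, where both inputs and the result are f16 and the matmul accumulates directly into f16 rather than into a wider f32 accumulator:

```mlir
// Hypothetical (f16, f16) -> f16 GEMM: the matmul accumulates directly
// into an f16 result instead of a wider f32 accumulator.
func.func @matmul_f16_f16_f16(%lhs: tensor<2048x1280xf16>,
                              %rhs: tensor<1280x1280xf16>) -> tensor<2048x1280xf16> {
  %c0 = arith.constant 0.0 : f16
  %empty = tensor.empty() : tensor<2048x1280xf16>
  %fill = linalg.fill ins(%c0 : f16) outs(%empty : tensor<2048x1280xf16>) -> tensor<2048x1280xf16>
  %result = linalg.matmul ins(%lhs, %rhs : tensor<2048x1280xf16>, tensor<1280x1280xf16>)
                          outs(%fill : tensor<2048x1280xf16>) -> tensor<2048x1280xf16>
  return %result : tensor<2048x1280xf16>
}
```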
Force-pushed from e6aa895 to 3bc822c
Found another issue with accumulating GEMMs: #19546
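By an accumulating GEMM I mean a matmul whose output operand is an existing accumulator rather than a fresh zero fill; a hypothetical sketch:

```mlir
// Hypothetical accumulating GEMM: the matmul adds onto an existing
// accumulator %acc instead of a zero-filled tensor.
func.func @matmul_accumulate(%lhs: tensor<2048x1280xf16>,
                             %rhs: tensor<1280x1280xf16>,
                             %acc: tensor<2048x1280xf32>) -> tensor<2048x1280xf32> {
  %result = linalg.matmul ins(%lhs, %rhs : tensor<2048x1280xf16>, tensor<1280x1280xf16>)
                          outs(%acc : tensor<2048x1280xf32>) -> tensor<2048x1280xf32>
  return %result : tensor<2048x1280xf32>
}
```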
Force-pushed from 0bcf683 to ca7c4f3
Force-pushed from ca7c4f3 to d82ec09
Force-pushed from d82ec09 to 2adc85d
Based on comparisons with iree-kernel-benchmark here, the performance of VectorDistribute and TileAndFuse when using intrinsics seems comparable. Note that none of the tests in the sheet used the padding extension available in TileAndFuse after #19484, so it is a fair comparison of the pipelines themselves. TileAndFuse in some cases did show a speedup that seems beyond the noise level, and overall it averages out to 1.25x faster.
However, we will be looking at LLAMA and SDXL numbers before actually considering this PR for merging.
Fixes: #18858
Depends on: #19587, #19597