LLVM-MCA thinks that GCC's vectorization of loop reduction for Zen3 is ~50% faster #121344
Comments
It appears to be a reassociation issue - clang is summing RMSD into the same ymm7 accumulation register:
```asm
vfmadd231pd %ymm2, %ymm2, %ymm7
vfmadd231pd %ymm1, %ymm1, %ymm7
vfmadd231pd %ymm0, %ymm0, %ymm7
```
which means that all the vfmadd231pd instructions are performed serially, while gcc is splitting the FMA accumulation across several registers and then adding the sub-results together, improving IPC:
```asm
vfmadd231pd %ymm0, %ymm0, %ymm1
..
vfmadd132pd %ymm0, %ymm4, %ymm0
..
vfmadd231pd %ymm0, %ymm0, %ymm2
..
vaddpd %ymm2, %ymm0, %ymm0
vaddpd %ymm1, %ymm0, %ymm0
```
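For reference, here is a minimal source-level sketch of what that split-accumulator form corresponds to (the function name is invented for illustration; this is not taken from either compiler's output). Each partial sum owns its own dependency chain, so the multiplies/FMAs feeding different accumulators can overlap, which is roughly what GCC's reassociated code does with separate ymm registers. Note that this changes the floating-point summation order, so it is only equivalent under reassociation-friendly flags such as -ffast-math.

```cpp
#include <cstddef>

// Hypothetical sketch: sum of squared differences with four independent
// partial sums instead of one running total.
double sumSquaredDiffs(const double* a, const double* b, std::size_t n){
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4){
        const double d0 = a[i + 0] - b[i + 0];
        const double d1 = a[i + 1] - b[i + 1];
        const double d2 = a[i + 2] - b[i + 2];
        const double d3 = a[i + 3] - b[i + 3];
        s0 += d0 * d0;   // four independent accumulation chains
        s1 += d1 * d1;
        s2 += d2 * d2;
        s3 += d3 * d3;
    }
    for (; i < n; ++i){              // scalar remainder
        const double d = a[i] - b[i];
        s0 += d * d;
    }
    return (s0 + s1) + (s2 + s3);    // combine partial sums once at the end
}
```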
@llvm/issue-subscribers-tools-llvm-mca Author: None (TiborGY)
I am trying to optimize a few hotspots in a larger body of code, one of which is an RMSD computation implemented as a loop reduction, where the loop bounds are completely known at compile time:
```cpp
#include <cmath>    // std::pow, std::sqrt
#include <cstddef>  // size_t

double RMSD(const double* const dm1, const double* const dm2){
    // number of unique pairwise distances for 12 points: 12*11/2
    const size_t distsPerGeom = (12*11)/2;
    double RMSD = 0.0;
    for (size_t i=0; i<distsPerGeom; i++){
        RMSD += std::pow(dm1[i] - dm2[i], 2);
    }
    return std::sqrt(RMSD / distsPerGeom);
}
```
What I have found is that, according to LLVM-MCA, GCC is far better at vectorizing this function (2823 total cycles vs. 4455); see this link:
https://godbolt.org/z/ov45ebsoM
I have not tested this on real HW yet, but I cannot see how this is not indicative of some issue in an LLVM component. Either MCA's cycle counts are wrong for Zen3, or clang is not as good at vectorizing as GCC is.
FYI, the data might be tainted.
Adding an explicit …
It is true that one almost certainly wants to add an explicit …
In this case it indeed did not make a difference and the underlying architecture is identical. It might be possible that using a …
I had looked at the MCA outputs before I submitted this issue, and both showed that MCA used znver3 by default, so the data were valid.
Assuming this is to be solved in MachineCombiner (though preferably it should be solved in ReassociatePass on the pre-vectorized scalar code), I think it can be fixed by adding new patterns into …
I'd much prefer to see this handled in the middle end rather than making it x86-specific.
This seems to be caused by the fact that ReassociatePass is run before LoopUnroll: it is not until after LoopUnroll that we see the long critical path consisting of fadd + fmul.
This is probably a dumb idea as I am not too familiar with the internals of LLVM, but could the ReassociatePass simply be moved to run after LoopUnroll? Or, if that regresses optimization for something else, run ReassociatePass a second time after LoopUnroll?
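To make the pass-ordering point concrete, here is a hand-written sketch (function names invented for illustration; this is not actual pass output) of the two shapes involved once the loop has been fully unrolled: a single serial chain versus the regrouped form a later reassociation step could produce. The regrouping is only legal under reassoc/fast-math semantics.

```cpp
// Roughly what LoopUnroll alone leaves behind: one serial chain of
// adds/FMAs, each step waiting on the previous running sum.
double chainedSum(double d0, double d1, double d2, double d3){
    double s = 0.0;
    s += d0 * d0;
    s += d1 * d1;   // depends on the previous value of s
    s += d2 * d2;   // depends on the previous value of s
    s += d3 * d3;   // depends on the previous value of s
    return s;
}

// Roughly what reassociating *after* unrolling could produce:
// independent sub-sums that only meet at the end.
double regroupedSum(double d0, double d1, double d2, double d3){
    const double t0 = d0 * d0 + d1 * d1;   // independent of t1
    const double t1 = d2 * d2 + d3 * d3;   // independent of t0
    return t0 + t1;
}
```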
I am trying to optimize a few hotspots in a larger body of code, one of which is an RMSD computation implemented as a loop reduction, where the loop bounds are completely known at compile time.
What I have found is that, according to LLVM-MCA, GCC is far better at vectorizing this function (2823 total cycles vs. 4455); see this link:
https://godbolt.org/z/q85rq3bW4
I have not tested this on real HW yet, but I cannot see how this is not indicative of some issue in an LLVM component. Either MCA's cycle counts are wrong for Zen3, or clang is not as good at vectorizing as GCC is.