
LLVM-MCA thinks that GCC's vectorization of loop reduction for Zen3 is ~50% faster #121344

Open
TiborGY opened this issue Dec 30, 2024 · 11 comments


@TiborGY

TiborGY commented Dec 30, 2024

I am trying to optimize a few hotspots in a larger body of code, one of which is an RMSD computation implemented as a loop reduction whose loop bounds are completely known at compile time:

#include <cmath>   // std::pow, std::sqrt
#include <cstddef> // size_t

double RMSD(const double* const dm1, const double* const dm2){
    const size_t distsPerGeom = (12*11)/2; // 66 pairwise distances, known at compile time
    double RMSD = 0.0;
    for (size_t i=0; i<distsPerGeom; i++){
        RMSD += std::pow(dm1[i] - dm2[i], 2); // squared difference
    }
    return std::sqrt(RMSD / distsPerGeom);
}

What I found is that, according to LLVM-MCA, GCC is far better at vectorizing this function: 2823 total cycles vs. 4455. See this link:
https://godbolt.org/z/q85rq3bW4

I have not tested this on real hardware yet, but I cannot see how this is not indicative of some issue in an LLVM component: either MCA's cycle counts are wrong for Zen3, or clang is not as good at vectorizing as GCC is.

@RKSimon
Collaborator

RKSimon commented Dec 30, 2024

It appears to be a reassociation issue - clang is summing RMSD into the same ymm7 accumulation result:

vfmadd231pd %ymm2, %ymm2, %ymm7
vfmadd231pd %ymm1, %ymm1, %ymm7
vfmadd231pd %ymm0, %ymm0, %ymm7

which means that the vfmadd231pd instructions execute serially, each one depending on the previous one's result, while gcc splits the FMA accumulation across several registers and then adds the sub-results together at the end, improving IPC:

vfmadd231pd %ymm0, %ymm0, %ymm1
..
vfmadd132pd %ymm0, %ymm4, %ymm0
..
vfmadd231pd %ymm0, %ymm0, %ymm2
..
vaddpd %ymm2, %ymm0, %ymm0
vaddpd %ymm1, %ymm0, %ymm0
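
Conceptually, at the source level gcc's strategy looks roughly like the following. This is a hand-written sketch, not a reconstruction of either compiler's output; the function name, the accumulator count, and the unroll factor are illustrative:

#include <cmath>
#include <cstddef>

double RMSD_split(const double* dm1, const double* dm2){
    const size_t n = (12*11)/2; // 66
    // Four independent accumulators -> four independent FMA dependency
    // chains that the CPU can execute in parallel.
    double acc0 = 0.0, acc1 = 0.0, acc2 = 0.0, acc3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4){
        const double d0 = dm1[i+0] - dm2[i+0];
        const double d1 = dm1[i+1] - dm2[i+1];
        const double d2 = dm1[i+2] - dm2[i+2];
        const double d3 = dm1[i+3] - dm2[i+3];
        acc0 += d0*d0;
        acc1 += d1*d1;
        acc2 += d2*d2;
        acc3 += d3*d3;
    }
    for (; i < n; i++){ // scalar remainder (66 is not a multiple of 4)
        const double d = dm1[i] - dm2[i];
        acc0 += d*d;
    }
    // Combine the sub-results only at the end, as gcc does with vaddpd.
    return std::sqrt(((acc0 + acc1) + (acc2 + acc3)) / n);
}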


@firewave

FYI: the data might be tainted. llvm-mca defaults to the underlying CPU, and since Compiler Explorer is a cloud service the host can be any of several platforms, so it is possible to get different results for the same input. If you add -mcpu=generic (or your intended target) to the "Arguments" you will get consistent results.

See compiler-explorer/compiler-explorer#4085.
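
For example, when running llvm-mca locally the same pinning can be done on the command line (the file name is illustrative; -mcpu and -mtriple are standard llvm-mca options):

llvm-mca -mcpu=znver3 -mtriple=x86_64-unknown-linux-gnu reduction.s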

@RKSimon
Collaborator

RKSimon commented Dec 30, 2024

Adding an explicit -mcpu=znver3 to the llvm-mca command lines made very little difference to the estimated instruction cycles, and still shows the 2x ratio in the "Total Cycles" numbers.

@mshockwave
Member

> FYI: the data might be tainted. llvm-mca defaults to the underlying CPU, and since Compiler Explorer is a cloud service the host can be any of several platforms, so it is possible to get different results for the same input. If you add -mcpu=generic (or your intended target) to the "Arguments" you will get consistent results.
>
> See compiler-explorer/compiler-explorer#4085.

It is true that one almost certainly wants to add an explicit -mcpu= option when using llvm-mca, but I don't think -mcpu=generic is a good choice, since IIRC it is effectively an alias of -mcpu=sandybridge -- a processor that is rather old (though of course it depends on which kind of application you are profiling).

@firewave

> Adding an explicit -mcpu=znver3 to the llvm-mca command lines made very little difference to the estimated instruction cycles, and still shows the 2x ratio in the "Total Cycles" numbers.

In this case it indeed did not make a difference, and the underlying architecture is identical. It is possible that specifying -mcpu on the compiler tab is somehow also applied to the llvm-mca invocation (maybe what was suggested in the comment on my ticket was actually implemented). I never used that flag and ended up with Zen vs. Skylake output (as outlined in the ticket).

@TiborGY
Author

TiborGY commented Dec 30, 2024

I looked at the MCA outputs before I submitted this issue, and both showed that MCA used znver3 by default, so the data were valid.
But good catch: if MCA defaults to the host CPU, the results might drift to whatever cloud machine happens to be free at the moment, so I have updated the link with an explicit znver3: https://godbolt.org/z/q85rq3bW4

@mshockwave
Member

> It appears to be a reassociation issue - clang is summing RMSD into the same ymm7 accumulation result [...] while gcc splits the FMA accumulation and then adds the sub-results together, improving IPC.

Assuming this is to be handled in MachineCombiner (though preferably it should be fixed in ReassociatePass on the pre-vectorized scalar code), I think it can be done by adding new patterns to X86InstrInfo::getMachineCombinerPatterns.

@RKSimon
Collaborator

RKSimon commented Dec 30, 2024

I'd much prefer to see this handled in the middle end and not make it x86-specific.

@mshockwave
Member

> I'd much prefer to see this handled in the middle end and not make it x86-specific.

This seems to be caused by the fact that ReassociatePass is run before LoopUnroll: not until LoopUnroll runs do we see the long critical path consisting of fadd + fmul.
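
To make that concrete: before unrolling, the loop body contains only a single fadd feeding the next iteration, so ReassociatePass has nothing to rebalance. After full unrolling the reduction becomes, conceptually,

RMSD = ((((0 + d0*d0) + d1*d1) + d2*d2) + ... + d65*d65)

a single fadd chain 66 deep - but by that point ReassociatePass has already run.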

@TiborGY
Author

TiborGY commented Jan 2, 2025

> This seems to be caused by the fact that ReassociatePass is run before LoopUnroll: not until LoopUnroll runs do we see the long critical path consisting of fadd + fmul.

* Before ReassociatePass: https://godbolt.org/z/feGPb9bqr

* After LoopUnroll: https://godbolt.org/z/hqnqnjvqb

This is probably a dumb idea, as I am not too familiar with the internals of LLVM, but could ReassociatePass simply be moved to run after LoopUnroll? Or, if that regresses optimization for something else, could ReassociatePass be run a second time after LoopUnroll?
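
A sketch of one way to test that idea without rebuilding clang, by re-running the pass over the post-unroll IR with opt (rmsd.ll is an illustrative file name; default<O3> and reassociate are existing new-pass-manager pipeline names):

opt -passes='default<O3>' rmsd.ll -S -o rmsd.O3.ll        # full -O3 pipeline, including LoopUnroll
opt -passes='reassociate' rmsd.O3.ll -S -o rmsd.reassoc.ll # re-run ReassociatePass afterwards

and then compare the reduction chains in the two outputs. Whether a second ReassociatePass run actually splits the accumulator chain is exactly what such an experiment would check.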
