simde_mm256_fmadd_pd as two 128 bit FMA operations? #1196
Comments
When in doubt, check the compiler output, yeah? And then double-check the timings for the 128-bit FMA operations versus the alternatives. Yes, an investigation into this is welcome!
I investigated further and found that fmadd seems to compile reasonably with -O2 on GCC 10, as-is. I haven't checked fmsub, fnmsub, or any of the single-precision variants yet.
I wanted to add a comment here: I've run into some further issues handling FMAs, specifically on the Windows/MSVC platform when compiling AVX2+ code down to SSE. The various fallbacks in fma.h vary, but they mostly try to preserve an FMA op if possible, which makes sense when porting from AVX+ level x86 to NEON/WebAssembly/etc. On MSVC in particular, however, this leads to really bad codegen, where a single simde__m256 gets splayed out into scalars and each scalar is processed individually. When porting from AVX+ (which implies FMA on x86) to SSE (which does not), the primary fallback should crack the FMA apart into two 128-bit FMAs, which should then crack apart into mul+add. I've performed this fixup locally for my purposes, and I'd like to contribute this work back if adding fallbacks like this is kosher for the project.
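To make the fallback chain described above concrete, here is a minimal sketch, not the actual fma.h code: a 256-bit double-precision fmadd is processed as two 128-bit halves, and each half is done as a separate multiply and add with SSE2 intrinsics. The function name fmadd256_pd_no_fma and the array-based interface are hypothetical, chosen only for illustration; a plain scalar loop is used when SSE2 is not available.

```c
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h> /* SSE2: _mm_loadu_pd, _mm_mul_pd, _mm_add_pd, _mm_storeu_pd */
#define SKETCH_HAVE_SSE2 1
#else
#define SKETCH_HAVE_SSE2 0
#endif

/* Hypothetical helper (not a SIMDe function):
 * computes out[i] = a[i] * b[i] + c[i] for four doubles without an FMA op. */
void fmadd256_pd_no_fma(const double a[4], const double b[4],
                        const double c[4], double out[4]) {
#if SKETCH_HAVE_SSE2
  /* Step 1: treat the 256-bit operation as two 128-bit halves. */
  for (int half = 0; half < 2; half++) {
    const int off = half * 2;
    __m128d va = _mm_loadu_pd(a + off);
    __m128d vb = _mm_loadu_pd(b + off);
    __m128d vc = _mm_loadu_pd(c + off);
    /* Step 2: SSE2 has no FMA, so each half becomes mul followed by add. */
    _mm_storeu_pd(out + off, _mm_add_pd(_mm_mul_pd(va, vb), vc));
  }
#else
  /* Scalar fallback for targets without SSE2. */
  for (int i = 0; i < 4; i++) {
    out[i] = a[i] * b[i] + c[i];
  }
#endif
}
```

Note that a separate multiply and add rounds twice, while a true FMA rounds once, so results from this kind of fallback can differ from hardware FMA in the last bit for some inputs.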
@Remnant44, thank you for investigating. Yes, that contribution would be welcome!
simde_mm256_fmadd_pd is defined as follows:

When building for a target that doesn't have native 256-bit FMA support, why not use two 128-bit FMA operations on the two halves of the input?
If that's possible, I would be happy to attempt a patch adding that support. I wanted to check whether there's some behavioral reason that two NEON 128-bit FMA operations wouldn't be appropriate here.
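The idea in the question can be sketched roughly as follows. This is only an illustration under stated assumptions: vec128d, vec256d, fmadd128_pd and fmadd256_pd are hypothetical stand-ins, not SIMDe's real types or functions, and the per-lane a * b + c in fmadd128_pd stands in for whatever native 128-bit FMA the target provides.

```c
#include <stddef.h>

/* Hypothetical stand-in types, not SIMDe's actual definitions. */
typedef struct { double lane[2]; } vec128d;   /* one 128-bit half */
typedef struct { vec128d half[2]; } vec256d;  /* 256 bits as two halves */

/* Stand-in for a native 128-bit FMA; a real port would call the
 * target's own two-lane double-precision FMA here instead. */
vec128d fmadd128_pd(vec128d a, vec128d b, vec128d c) {
  vec128d r;
  for (size_t i = 0; i < 2; i++) {
    r.lane[i] = a.lane[i] * b.lane[i] + c.lane[i];
  }
  return r;
}

/* 256-bit fmadd expressed as two independent 128-bit operations,
 * one on the low half and one on the high half of each input. */
vec256d fmadd256_pd(vec256d a, vec256d b, vec256d c) {
  vec256d r;
  for (size_t h = 0; h < 2; h++) {
    r.half[h] = fmadd128_pd(a.half[h], b.half[h], c.half[h]);
  }
  return r;
}
```

Because fmadd is purely lane-wise, no lane in one half depends on the other half, which is why the split should be straightforward in principle. Whether the compiler already produces this shape on a given target is worth confirming in the generated assembly.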