
'VFT dispatching' to call into SIMD-ISA-specific code #2364

Open
kfjahnke opened this issue Oct 28, 2024 · 22 comments

Comments

@kfjahnke

Hi! This is more of a little excursion than a 'true' issue, but it's about a technique which I've found useful and would like to share. The occasion is that I'm extending my library zimt to use highway's foreach_target mechanism.

My first remark - before I start out on the issue proper - is about this mechanism. I knew it was there, I thought it might be a good idea to use it, but the documentation was thin and I had a working solution already. Before I turned my attention to zimt again this autumn, I did a lot of reading in the SIMD literature, and I also decided to have a closer look at highway's foreach_target mechanism. Lacking extensive documentation, I sat myself down and read the code. Only then did I realize just how well-thought-out and useful it actually is. Yes, using it is slightly intrusive to the client code, but you've really done a good job of hiding the complexity and making it easy to 'suck' code into the SIMD-ISA-specific nested namespaces and dispatch to it. But here I do actually have some criticism: to figure that out, I had to read and understand the code! It would have been much easier had there been some sort of technical outline, paper, or such to explain the concept. This criticism goes beyond this specific topic - I think you'd be well advised to improve the documentation, to address a wider user base.

My first step in introducing highway's multi-ISA capability into zimt was to introduce 'corresponding' nested namespaces in both my library's 'zimt' namespace and in the 'project' namespace (let's use this one for user-side code). I had a hard time initially figuring out the namespace scheme, probably because of the name of the central macro, HWY_NAMESPACE. The naming is unfortunate - of course it's a symbol for a namespace, but a name like HWY_SIMD_ISA would have hinted at its semantics rather than its syntax. With the namespaces set up, I could use foreach_target.h and dynamic dispatch. But I found the way of introducing the ISA-specific code via a free function verbose, so I tried to figure out a way to make this more concise and manageable. I 'bent' some of the code I used in lux to the purpose, and this is where I come to 'VFT dispatching'. The concept is quite simple:

  • Instead of using free functions, I work with virtual member functions in a 'dispatch' object
  • The dispatch object's base class declares these as pure virtual member functions
  • In each ISA-specific nested namespace I inherit from the dispatch base class and provide the implementations
  • I have a 'get_dispatch' routine (using highway's dynamic dispatch) which yields a dispatch base class pointer, which does in fact point to the derived class with the ISA-specific implementations

So this is where the 'VFT' in 'VFT dispatching' comes from: it uses the virtual function table of a class with virtual functions. The language guarantees that the VFTs of all classes derived from the base have the same layout (otherwise the mechanism could not function). What do I gain? The base class pointer is a uniform handle to a - possibly large - set of functions I want to keep ISA-specific versions of. Dispatching is as simple as calling through the dispatcher base class pointer, so once I have obtained it, it serves as a conduit:

// inside code submitted to foreach_target I have:

struct _dispatch
: public dispatch // inherit from dispatch base class
{
   ...
} ;

static const _dispatch local_dispatch ; // instantiate the derived class

const dispatch * const _get_dispatch() // install the ISA-specific fetch routine
{
  return & local_dispatch ; // implicitly converts to the base class pointer, as declared
}

// in the main body of code (in HWY_ONCE) I use highway's dynamic dispatch:

HWY_EXPORT(_get_dispatch);

const dispatch * const get_dispatch()
{
  return HWY_DYNAMIC_DISPATCH(_get_dispatch)() ;
}

// with a member function 'foo' in the dispatcher, I can now dispatch like this:

auto dp = get_dispatch() ;
bar = dp->foo() ; // calls the ISA-specific variant

This is more or less it, with one more little twist which I also found useful. In a first approach, I wrote out the declaration of the pure virtual member function in the base class, and again the declaration (now no longer pure) in the derived, ISA-specific, class. This is error-prone, so I now use an interface header, introducing the member functions via a macro. In a header 'interface.h' I put macro invocations only:

// register int dummy ( float z ) for dispatch:

ZIMT_REGISTER(int, dummy, float z)

Then I can #include this header into the class declarations, #defining the macro differently:

// the dispatch base class is coded like this:
struct dispatch
{
#define ZIMT_REGISTER(RET,NAME,...) virtual RET NAME ( __VA_ARGS__ ) const = 0 ;
#include "interface.h"
#undef ZIMT_REGISTER
} ;
// the ISA-specific derived class (inside code submitted to foreach_target):
struct _dispatch
: public dispatch
{
#define ZIMT_REGISTER(RET,NAME,...) RET NAME ( __VA_ARGS__ ) const ;
#include "interface.h"
#undef ZIMT_REGISTER
} ;
// followed, inside the same nested namespace, by the definition(s)
int _dispatch::dummy (float z) const
{
  return 23 ;
}

This ensures that the declarations are consistent. For the actual implementation, the signature has to be written out once again, but since there is a declaration, providing a definition with a different signature is an error - and when providing the implementation, coding with the signature 'in sight' is advisable anyway, especially when the argument list becomes long. The 'interface.h' header provides a good reference to the set of functions using the dispatcher, and additional dispatch-specific functionality can be coded for the lot. I think it makes a neat addition to VFT dispatching.

To wrap up, I'd like to point out that this mechanism is generic and can be used to good effect for all sorts of dispatches: if appropriate specific derived dispatch classes are coded along with a mechanism to pick a specific one, it can function quite independently of highway's dispatch. It can also be used to 'pull in' code which doesn't even use highway - e.g. code with vector-friendly small loops ('goading') relying on autovectorization, which will still benefit from being re-compiled several times with ISA-specific flags, be it with highway's foreach_target or by setting up separate TUs with externally supplied ISA-specific compiler flags. The latter is what I currently do in lux, but it requires quite some 'scaffolding' code in cmake.
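To illustrate that generic core, here is a minimal, highway-independent sketch (all names here are illustrative, not zimt's or highway's actual API): two variants derive from a common dispatch base, and a selector hands out a base class pointer through which the VFT routes each call to the chosen implementation.

```cpp
#include <cassert>

// abstract dispatch base: one virtual member function per dispatchable routine
struct dispatch
{
  virtual int dummy ( float z ) const = 0 ;
  virtual ~dispatch() = default ;
} ;

// one derived class per 'ISA'; here the variants are only tagged by name
struct dispatch_scalar : public dispatch
{
  int dummy ( float z ) const override { return int ( z ) ; }
} ;

struct dispatch_simd : public dispatch
{
  int dummy ( float z ) const override { return int ( z ) * 2 ; }
} ;

// stand-in for a dynamic dispatch mechanism: pick a variant once, then
// all calls go through the base class pointer's virtual function table
const dispatch * get_dispatch ( bool have_simd )
{
  static const dispatch_scalar d_scalar ;
  static const dispatch_simd d_simd ;
  if ( have_simd )
    return & d_simd ;
  return & d_scalar ;
}
```

Swapping the selection criterion for highway's dynamic dispatch (or a hand-rolled CPUID query) gives the scheme described above.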

Comments welcome! I hope you find this useful - I intended to share useful bits here every now and then and it's been a while since the last one (about goading), but better late than never. If you're interested, you can have a peek into zimt's new multi_isa branch, where I have a first working example using the method (see linspace.cc and driver.cc in the examples section). If you don't approve of my intruding into your issue space, let me know.

@jan-wassenberg
Member

Thanks for sharing your result, which is of course welcome :)

I sat myself down and read the code. Only then I realized just how well-thought-out and useful it actually is. Yes, using it is slightly intrusive to the client code, but you've really done a good job to hide the complexity and make it easy to ..

Thank you! It has actually been a few years since we designed this. You might currently have the best understanding of dispatching and how the pieces fit together.
If you'd be willing to write an overview in g3doc/, in Markdown format, I think this would be useful for others and believe you'd be great at it because you write thorough comments.

a name like HWY_SIMD_ISA would have hinted at it's semantics

Yes, that's fair. Maybe HWY_TARGET_NAME, though we did want to discourage using it for anything other than a namespace.

If I understand your new system correctly, the derived class calls the hn:: implementation directly, so once you have your
base-class pointer, the dispatch is just one load of the vptr, plus one indirect call through it. This may actually
be faster than our dispatch, which looks up the current enabled targets (with the advantage that users can actually
change that at runtime, which does happen).

Very cool, congrats! I'd encourage you to also write up an intro on this system, it looks useful for when there are many functions to dispatch.

FYI some time ago we expanded in an unrelated direction: enabling dispatch of function templates, at least with
one template argument (due to MSVC macro limitations that could be worked around if there is demand for it).
These are HWY_EXPORT_AND_DYNAMIC_DISPATCH_T and related macros.

@kfjahnke
Author

You might currently have the best understanding of dispatching and how the pieces fit together.

Claiming that would be presumptuous - I figured out how to use highway's mechanism, not quite how it actually does what it does. But I did notice that a lot of it works with the preprocessor, rather than relying on C++ language features like templates.

If you'd be willing to write an overview in g3doc/, in Markdown format, I think this would be useful for others and believe you'd be great at it because you write thorough comments.

I have something in the pipeline, I'll post a link to a draft version once I have it online.

If I understand your new system correctly, the derived class calls the hn:: implementation directly, so once you have your
base-class pointer, the dispatch is just one load of the vptr, plus one indirect call through it. This may actually
be faster than our dispatch, which looks up the current enabled targets (with the advantage that users can actually
change that at runtime, which does happen).

Dispatching this way is indeed very efficient, it's really just the load-the-vptr plus indirect call. The level at which this dispatch happens should not be too performance-critical, though - the stuff which really needs to be fast should all be inlined inside the chunks of code which contain performance-critical code, and the function-level dispatch should only happen when you do stuff like call a routine which processes an entire array of data or such - as zimt does when it works over nD arrays of *xel data.

The lookup of the current target is an issue, though. I think we discussed this once already, but my memory fails me. The question is when such a lookup may occur at run-time. If there is a possibility that the available SIMD ISA changes from one call into the SIMD-ISA-specific code to the next, my method is not applicable without a dispatch pointer refetch - the code would crash with an illegal instruction if the current ISA does not provide the instructions. But I think this is a rare exception, and it would be hard to handle anyway: what if you've just figured out you can use AVX3, and when you proceed to call the AVX3-code the system puts your thread on a core which only has AVX2? You'd have to make sure the code proceeds without such switches until it's done, and this may be hard to 'nail down'. I started this thread to also get feedback from you guys on potential stumbling stones, and this is certainly one - but you're in a better position to know whether this is indeed relevant, so I'd be glad to get some advice, beyond 'which does happen'. When and on which platforms does it happen?

Since I intend this level of dispatch for entire blocks of code rather than at the level of individual SIMD operations or small sequences, it would be unproblematic to re-fetch the dispatch pointer before using it (losing the direct VFT call speed advantage), if the ISA switch can happen in mid-run-time. But if the ISA switch can occur between the acquisition of the dispatch pointer and its use immediately afterwards, it's a problem - I doubt, though - even without investigating deeply - that your dispatching code is shielded against such extreme disruption.

@kfjahnke
Author

Here's the text I've written on the topic:

https://github.com/kfjahnke/zimt/blob/multi_isa/examples/multi_isa_example/multi_simd_isa.md

The 'multi_isa_example' folder also has example code, a program using the dispatch mechanism I have proposed in my initial post. The .md file has a lengthy text which starts out by describing highway's dispatch mechanism and the layer of code I have added on top to use 'VFT dispatching'. The code and text in this folder describe the general how-to - zimt's own use of VFT dispatching is more involved (there's the zimt namespace to deal with as well) and it's still only half-finished as of this writing, with only the core of the zimt functionality accessible via VFT dispatching. I intend the multi_isa branch to evolve so that all zimt code can be dispatched this way but keeps the option of using the other SIMD library back-ends (Vc, std::simd, zimt's own 'goading' code) as an alternative. Once that's done, I'll merge it back to master.

If we can settle the open issue about the ISA switching while a program is running (if that ever occurs), I think my method should be generally viable. I think my example code and the text will clarify precisely how VFT dispatching works - it's much more elaborate than my initial post here. Again, comments welcome!

@jan-wassenberg
Member

The level at which this dispatch happens should not be too performance-critical, though

I agree, but it's surprising, nice, and rare that more convenience actually comes with more speed.

If there is a possibility that the available SIMD ISA changes from one call into the SIMD-ISA-specific code to the next, my method is not applicable without a dispatch pointer refetch

I wouldn't worry too much about this. Intel has warned for 10+ years that CPUID info might become heterogeneous, but no one bothers to check. The most common case is where someone disables targets at runtime using a flag. This happens early on in main(), and users can arrange to call get_dispatch after that, so no problem.

Excellent article, thanks for putting this together!
It builds up nicely and makes each step clear.

If you want to go into a bit more detail about the mechanism, we could expand on "deems most suitable at the time".
Under the hood, we have a bitfield of targets, which by convention has lowest=best, so we can use the bit-scan instruction
to find the index of the first/best bit, and use that as an index into a table of function pointers. Minor wrinkle: because
some targets' indices start above zero, we subtract the per-platform starting index so that the tables can be smaller.

Bonus: who initializes this bitfield? We don't want to get into the init order fiasco by setting it in a ctor. Instead we arrange that the first entry in the table is a special version of the user's code, which first sets the bitfield, then calls the user code. Subsequent dispatch will go straight to the user code.
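A much-simplified sketch of that scheme (illustrative names and a fake CPU query, not Highway's actual code): a bitfield where by convention the lowest set bit is the best target, a bit-scan to index a function-pointer table, and slot 0 occupied by an initializer that fills the bitfield on first use and then forwards, sidestepping static initialization order issues.

```cpp
#include <cassert>
#include <cstdint>

// one function pointer per target, all with the same signature
using fn_t = int (*) ( int ) ;

static std::int64_t enabled_targets = 0 ; // 0 means 'not yet initialized'

int impl_best ( int x ) { return x * 2 ; }     // pretend 'best' target (bit 0)
int impl_fallback ( int x ) { return x + 2 ; } // pretend fallback (bit 1)

extern fn_t table [] ;

// index of the best enabled target: position of the lowest set bit
int best_index()
{
  int i = 0 ;
  while ( ! ( ( enabled_targets >> i ) & 1 ) )
    ++i ;
  return i ;
}

// slot 0 holds this initializer: it fills the bitfield, then forwards
// to the chosen implementation; later calls skip it entirely
int init_then_call ( int x )
{
  enabled_targets = 0b10 ; // pretend the CPU only supports the fallback
  return table [ 1 + best_index() ] ( x ) ; // '+ 1' skips the init slot
}

fn_t table [] = { init_then_call , impl_best , impl_fallback } ;

int dispatch_call ( int x )
{
  // before init: slot 0 (the initializer); afterwards: 1 + best bit
  int idx = enabled_targets ? 1 + best_index() : 0 ;
  return table [ idx ] ( x ) ;
}
```

The '+ 1' here plays the role of the per-platform starting-index adjustment mentioned above, just reduced to its simplest form.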

Some typos:
dacades -> decades
get a bit -> see some speedup
sepcific -> specific
witout -> without
RISCV -> RISC-V
i86 -> x86?
hwy::SSE2, hwy::AVX2 -> FWIW they are actually called N_SSE2, N_AVX2.
HWY_NAMEPSACE -> HWY_NAMESPACE (2x)
lat's -> let's
oreach_target -> foreach_target (2x)
paylod -> payload
is it's -> if it's

Personally I'd stop before the SIMD_REGISTER step - some people like to minimize macros, and it might be useful to
see the declarations. The compiler will anyway complain if there is a mismatch.
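For reference, the language itself provides part of that consistency check without any macros (a generic C++ note, not a change to the proposed scheme): marking the derived declarations `override` turns any signature drift from the base class into a hard compile error.

```cpp
#include <cassert>

struct dispatch
{
  virtual int dummy ( float z ) const = 0 ;
  virtual ~dispatch() = default ;
} ;

struct _dispatch : public dispatch
{
  // 'override' makes the compiler verify this matches a base declaration;
  // changing e.g. 'float z' to 'double z' here would fail to compile
  int dummy ( float z ) const override ;
} ;

int _dispatch::dummy ( float z ) const
{
  return 23 ;
}
```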

libraries come and go or change license, so an exit strategy is a good idea

FYI Highway will remain open source. Google has a policy of not deleting open source projects.

and a lot of cmake scaffolding

Want to add a link to the code?

We'd welcome adding this to the Highway g3doc. Or would you prefer a link in the readme?
Consider also posting this to Hacker News and/or Reddit r/simd?

@kfjahnke
Author

I wouldn't worry too much about this. Intel has warned for 10+ years that CPUID info might become heterogeneous, but no one bothers to check. The most common case is where someone disables targets at runtime using a flag. This happens early on in main(), and users can arrange to call get_dispatch after that, so no problem.

That's a relief. I did suspect this was a no-go - it would be just too disruptive and make people even less likely to invest in coding with SIMD.

Some typos:
...

Fixed, thanks @jan-wassenberg!

Excellent article, thanks for putting this together!
It builds up nicely and makes each step clear.

I'm glad you approve of my tedious repetitive technical outpourings :-)

Personally I'd stop before the SIMD_REGISTER step - some people like to minimize macros, and it might be useful to
see the declarations. The compiler will anyway complain if there is a mismatch.

I'll think about it. I used this to good effect in lux. But of course it's an extra frill which isn't strictly necessary - maybe I'll reduce it to a hint that it can be done, rather than 'going all the way' in the example code. You're right that less may be more, and it should really be about transporting the concept, to advance SIMD acceptance.

Want to add a link to the code?

Do you mean to the code in lux? lux' single cmake file is here, and the code about 'flavours' starts at line 318, as of this writing. You can see that's quite a mouthful. I'll be glad to get rid of all this cmake code once I have moved lux to use zimt with automatic ISA dispatching. If you mean the code for the article, it's here.

We'd welcome adding this to the Highway g3doc. Or would you prefer a link in the readme?

Thanks for the offer. I think a link in the README would be more appropriate for now - I feel my text doesn't really qualify as documentation, it's more in style with a blog post. But of course you can quote me if you like.

I do have another concrete question. In my example code, I haven't used HWY_PREFIX, HWY_BEFORE_NAMESPACE or HWY_AFTER_NAMESPACE. While the code seems to run okay on my machine, this may be problematic elsewhere. Can you shed some light on these macros? If they are necessary, I'd like to add them to my code.

copybara-service bot pushed a commit that referenced this issue Nov 1, 2024
PiperOrigin-RevId: 692102588
@jan-wassenberg
Member

maybe I'll reduce it to a hint that it can be done, rather than 'going all the way' in the example code.

Sounds good!

You're right that less may be more, and it should really be about transporting the concept, to the advance of SIMD acceptance.

👍

Do you mean to the code in lux?

Yes, so readers can see the complexity there.

Thanks for the offer. I think a link in the README would be more appropriate for now - I feel my text doesn't really qualify as documentation, it's more in style with a blog post.

We'll add a link for now, but no worries, I would not be shy about calling this documentation, there is certainly a place for an introduction.

I do have another concrete question. In my example code, I haven't used HWY_PREFIX, HWY_BEFORE_NAMESPACE or HWY_AFTER_NAMESPACE. While the code seems to run okay on my machine, this may be problematic elsewhere. Can you shed some light on these macros?

Do you mean HWY_ATTR? These are definitely necessary; they are the mechanism by which pragma target is applied. Without that, you might only get baseline SIMD code. From the README:

Additionally, each function that calls Highway ops (such as `Load`) must either
be prefixed with `HWY_ATTR`, OR reside between `HWY_BEFORE_NAMESPACE()` and
`HWY_AFTER_NAMESPACE()`. Lambda functions currently require `HWY_ATTR` before
their opening brace.

copybara-service bot pushed a commit that referenced this issue Nov 1, 2024
PiperOrigin-RevId: 692107963
@kfjahnke
Author

I've gone over the text some more and just committed an updated version. On this occasion, I had a look at the link you put in your README. Thanks for placing it so prominently! Work on zimt's multi_isa branch is continuing, I have a good example set up and documented in the text. Two observations:

  • the binaries I get with clang++ are (much) faster than what g++ produces with my compiler flags.
  • when the highway 'flavour' dispatches to AVX2, that's slower on my machine than SSE2 or SSSE3.

The latter is puzzling - the code is compute-heavy (repeated 1M evaluations of a 2D b-spline), so I would have thought that using AVX2 should speed things up. I haven't looked at the machine code yet to see if I've maybe made a mistake and my dispatch isn't working properly. Have you seen this happen in your tests?

Running b-spline code with zimt makes for good benchmarking code. Doing stuff like evaluating a b-spline of multi-channel data at 1M 2D coordinates and writing the results to memory is a 'realistic' workload testing memory access, de/interleaving, gather/scatter and raw arithmetic speed due to many additions and multiplications. The addition of b-spline evaluation code to zimt is quite recent and wraps up my porting effort from the vspline library to zimt.

@jan-wassenberg
Member

:)
I agree with your observation that clang code is better than gcc's, have often seen the same.

AVX2 being slower is surprising. Possible causes could be that the memory is only 16-byte aligned, or heavy use of shuffles which are more expensive (3-cycle latency) when crossing 128-bit blocks, whereas they are still single-cycle on SSSE3.
If it's possible to get your code on Godbolt, that's also a good way to quickly look at what's happening.

@kfjahnke
Author

I agree with your observation that clang code is better than gcc's, have often seen the same.

It's a mixed picture I get. At times, with specific compiler flags, back-end and workload, g++ can produce a binary which outperforms everything else. But it doesn't do so consistently, and I usually see clang++ coming out on top. That's why I prefer it - and because its error messages are more human-friendly.

I think I've figured out why the better targets didn't run faster: I optimized with -O2 only and used a large-ish spline degree. I tried this morning with -O3 and cubic splines and got the expected performance increase going from SSE2 to AVX2 (I don't have a machine with AVX3). This also brought the results for tests compiled with clang++ and g++ closer together. With larger spline degrees the 'better' ISAs tend to perform worse, and I don't have a good idea why this would be.

AVX2 being slower is surprising. Possible causes could be that the memory is only 16-byte aligned, or heavy use of shuffles which are more expensive (3-cycle latency) when crossing 128-bit blocks, whereas they are still single-cycle on SSSE3.

That's a good hint - my test code uses splines of three-channel xel data, to mimic a typical image processing workload. Such xel data need to be de/interleaved to/from SoA configuration for SIMD processing, which likely uses shuffles.

There is one thing I notice with my back-ends which might merit a separate 'excursion issue': zimt uses C++ classes to 'harness' the 'naked' vectors. This goes beyond simply wrapping single vectors in objects - the objects contain several vectors (or their equivalent in memory), e.g. two or four. I find that this significantly influences performance, and here I have a good guess at what happens: I think that the resulting machine code, which performs several equal operations in sequence, hides the setup latency of the SIMD instructions and/or makes it possible for the CPU to use pipelining more effectively. Using this technique, I do get performance gains, and I've been using it for years now to good effect in lux. It may help to squeeze even more performance out of a given CPU. Give it a shot - using zimt, or simply by processing small batches of vectors rather than individual ones.
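The effect can be illustrated even without SIMD types (a toy sketch; the speedup is the author's observation, not measured here): issuing the same operation on several independent accumulators per loop iteration gives the CPU independent dependency chains whose latencies can overlap.

```cpp
#include <cassert>
#include <cstddef>

// sum an array using four independent accumulators, mimicking an object
// that 'harnesses' several vectors and applies the same op to each in turn
float sum_batched ( const float * p , std::size_t n )
{
  float a0 = 0 , a1 = 0 , a2 = 0 , a3 = 0 ;
  std::size_t i = 0 ;
  for ( ; i + 4 <= n ; i += 4 )
  {
    a0 += p [ i ] ;     // four independent dependency chains:
    a1 += p [ i + 1 ] ; // no add has to wait for the previous one,
    a2 += p [ i + 2 ] ; // which lets the CPU pipeline the work
    a3 += p [ i + 3 ] ;
  }
  for ( ; i < n ; ++i ) // leftovers
    a0 += p [ i ] ;
  return ( a0 + a1 ) + ( a2 + a3 ) ;
}
```

The same pattern applies with vector types in place of the scalar accumulators; note that the reassociation changes floating-point summation order.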

@jan-wassenberg
Member

de/interleaved to/from SoA configuration for SIMD processing which is likely using shuffles.

Yes, Load/StoreInterleaved3 does involve quite a few shuffles. Might be worth considering using 4 channels just for the faster interleaving :)

I agree unrolling is often helpful. One concern about storing vectors in classes is that it's harder to guarantee alignment.
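On the alignment concern, one common mitigation (a generic C++ sketch, not zimt's actual layout) is to give the wrapper class an explicit alignment request:

```cpp
#include <cassert>
#include <cstdint>

// a wrapper holding two 8-float 'vectors'; alignas(32) requests the
// 32-byte alignment that AVX2 loads and stores prefer. Caveat: heap
// allocation of over-aligned types only works reliably from C++17 on
// (aligned operator new); before that, use an aligned allocator.
struct alignas ( 32 ) vpair
{
  float a [ 8 ] ;
  float b [ 8 ] ;
} ;

static_assert ( alignof ( vpair ) == 32 , "wrapper is 32-byte aligned" ) ;
```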

mazimkhan pushed a commit to cambridgeconsultants/aeroway that referenced this issue Nov 18, 2024
mazimkhan added a commit to cambridgeconsultants/aeroway that referenced this issue Nov 18, 2024
mazimkhan added a commit to cambridgeconsultants/aeroway that referenced this issue Nov 18, 2024
@kfjahnke
Author

Hi again!
I've converted the zimt code base to become compatible with highway's foreach_target mechanism and merged the multi_isa branch back to main. The examples now all work with the modified headers, and there are some modified examples which can use internal multi-SIMD-ISA dispatch:

  • bsp_eval.cc repeatedly evaluates a 1M knot-point b-spline
  • convolve.cc applies a convolution kernel to an image (uses vigra)
  • linspace.cc uses zimt's linspace_t and zimt::process

I modified the examples.sh shell script, which compiles all examples, so that example files which use foreach_target and zimt's dispatch mechanism are compiled in an 'incarnation' using dynamic dispatch (once with clang++ and once with g++), while other examples, which still rely on picking a specific ISA by passing appropriate compiler arguments, are only compiled with the four zimt back-ends. If you have all back-ends installed, you can simply run
./examples.sh *.cc
in the examples folder, and get a large collection of binaries to play with. Some of them time the payload code, and it's interesting to see how the execution times vary with the compiler, back-end and SIMD ISA. I was surprised to see (on my AVX2 machine) that my Vc back-end at times outperforms all other variants, with the highway back-end performing almost as well. I'm not implying that Vc is 'better' than highway - it's probably due to my implementation of the zimt back-ends, which was derived from using Vc::SimdArray, so the Vc back-end needs little extra code for its implementation, whereas employing highway for the task required quite some coding. The std::simd back-end performs quite well for some of the examples, while SIMD emulation with 'goading' usually comes out last.
I'd be interested in investigating why bsp_eval.cc in particular comes out fastest when using the Vc back-end. My implementation of b-spline evaluation uses a recursive function at its core (to make the code dimension-agnostic) which uses (vectorized) data on the stack and return values. This core function seems to be what's slower with my highway back-end, with data gathering, weight generation and data storing coming out roughly equal. I'd be grateful for any hints to tune the code - maybe these hints already ring a bell!

@jan-wassenberg
Member

hm, hard to say - there's a lot of code. It can be that unrolling or branch prediction differs depending on code structure. Performance counters might be useful to narrow down where the difference lies.

@kfjahnke
Author

It turned out that some of my code was not placed correctly into the ISA-specific nested namespaces, which resulted in sub-optimal performance. I've now managed to get the zimt library to fully cooperate with highway's foreach_target mechanism, and dynamically-dispatched versions run just as fast as single-ISA compiles.
With this success, I moved on to try and use the new zimt code with a 'real' application. I have a project going on which relies entirely on zimt for mass data processing: the envutil program. That's a utility to do geometrical reprojections of images, as they are needed e.g. in panorama viewing. I also use it as a test-bed for different interpolation methods, and to compare them to the ready-made methods coming with OpenImageIO - envutil can use either. Following the VFT dispatch method outlined in the examples in zimt I managed to get envutil to use highway's foreach_target mechanism, resulting in a binary with dynamic dispatch. I've put a debian package up for download if anyone is interested in trying it out - this package is for debian12 on x86_64 and may not install on other linuxes. Building envutil from source isn't hard, either - it's a standard cmake build and just requires a few dependencies.
So far, everything seems to work just as expected - I haven't been able to test on ISAs better than AVX2, though, because I don't have the hardware.
Since I use a scheme which can either exploit highway's foreach_target (for both the highway and 'goading' back-ends) or build as a single-ISA build for all zimt back-ends (goading, highway, Vc or std::simd) there's a variety of binary variants to play with - and different compilers do make a difference as well - as we discussed earlier, clang++ seems to come out on top for this kind of code. The cmake build is easy to configure for all of these variants. The debian package is built with the highway back-end.
I managed to switch to foreach_target-managed code with little modification of the code base - one pitfall was #includes happening somewhere inside the code, which had to be moved out of the nested namespaces.
One thing I noticed is that it seems to be problematic to instantiate templates from 'outside' - e.g. originating from headers of other template libraries - inside the foreach_target-managed code. It looks as if the instantiations only use some lowest-common-denominator SIMD ISA and do not compile according to the #pragma directives set up by HWY_BEFORE_NAMESPACE(). I think you mentioned something about dispatching to templated code, but I can't find it just now - it's probably addressing the same issue. I noticed it when I tried to use Imath's quaternion code with simdized data - replacing the use of Imath's template with equivalent 'handwritten' code brought the dynamically-dispatched code up to the same speed as a single-ISA build, where the significant discrepancy had me puzzled for a while.

@jan-wassenberg
Member

Thanks for the updates, and congrats on the result that dynamic == single ISA speed :)
For the template, I think you are referring to HWY_EXPORT_AND_DYNAMIC_DISPATCH_T?

@kfjahnke
Author

I think you are referring to HWY_EXPORT_AND_DYNAMIC_DISPATCH_T

Yes, that was it, thanks for the pointer!

@kfjahnke
Author

Please bear with me for a little longer. In my last post I wrote

One thing I noticed is that it seems to be problematic to instantiate templates from 'outside' - e.g. originating from headers of other template libraries - inside the foreach_target-managed code. It looks as if the instantiations only use some lowest-common-denominator SIMD ISA and do not compile according to the #pragma directives set up by HWY_BEFORE_NAMESPACE().

I think I have tracked the problem down, and I think it might be a stumbling stone which may potentially cost a lot of performance if users aren't aware of it. I set out to write a simpler program to pin the problem down. I used std::arrays of 'vec_t' type SIMD vectors instead of using Imath::Vec3 of zimt vectors and wrote a template for the cross product which I put outside any namespaces. When I tried to instantiate this template in foreach_target-managed code, the code would not compile. I got plenty of errors along the lines of

error: always_inline function 'operator*' requires target feature 'ssse3', but would be inlined
into function 'cross' that is compiled without support for 'ssse3'

Oops... I did not see this when I used the Imath template with my zimt data types. I figured out that my operator functions weren't declared to be always inlined. When I added HWY_INLINE declarations to my operator functions, I got the same compiler errors - I simply couldn't use the 'external' template any more, because it had already been 'fixed' to whatever ISA it was used with the first time around. I actually prefer this behaviour, because it prevents the performance drop you get from using an instantiation of the template made for a low-grade SIMD ISA, but now I faced the problem of how to use external templates with foreach_target-managed code. The only way I have found to do that - so far - is to produce separate SIMD-ISA-specific TUs, one for each SIMD ISA 'in play', and link them to the 'driver' code. Then the template instantiations can happen in the right context and use the given SIMD ISA. I found a relatively painless way to do this which still relies on foreach_target and the ISA-specific #pragmas generated with HWY_BEFORE_NAMESPACE, rather than setting compiler flags externally. In the ISA-specific TUs, I do this:

#include <hwy/detect_targets.h>

// glean the target as 'TG_ISA' from outside (pass it with -DTG_ISA=HWY_... to the compiler)

#undef HWY_TARGET
#define HWY_TARGET TG_ISA

#include <hwy/highway.h>

// define dispatch_base

struct dispatch_base
{
  virtual void payload() const = 0 ;
} ;

// ...

HWY_BEFORE_NAMESPACE() ;

// now include headers which use SIMD data types

#include <xxx.h>

Then I proceed as usual: instantiate a dispatch object inside the nested namespace, provide a function _get_dispatch, use HWY_DYNAMIC_DISPATCH to pick the right variant and dispatch via the gleaned base class pointer. Since the header with the templates (xxx.h) is now #included after HWY_BEFORE_NAMESPACE(), it's in the same ISA environment as the SIMD data types and all is well. Do you think this method is fool-proof? There might be stumbling stones, e.g. with broken targets. I'd welcome suggestions to 'nail this down' - then I could add a section to my dispatching text on how to do it right.

The non-inlining of the operator functions does explain the performance drop in my initial approach. If the operator functions can't be inlined, they have to be called, which produces overhead.

Setting up the source files right is half the job - the other is to set up the build system to handle the variety of targets and produce ISA-specific object files. My approach is to let highway do as much work as possible and to refrain from supplying any ISA-specific arguments beyond setting HWY_TARGET. Here's a tentative bit of cmake code to build a multi-TU program:

# assuming we have detected the architecture, we list ISAs

if ( x86_64 )
  list ( APPEND isa_l SSE2 SSSE3 SSE4 AVX2 AVX3 AVX3_ZEN4 AVX3_SPR )
elseif ( arm64 )
  list ( APPEND isa_l NEON_WITHOUT_AES NEON NEON_BF16
         ALL_SVE SVE SVE2 SVE_256 SVE2_128 )
else()
  # tentative. on my system, this doesn't work.
  list ( APPEND isa_l EMU128 )
endif()

# for the test program composed of several ISA-specific TUs,
# we have the main program disp_to_tu.cc, which does the
# dispatching, and basic.cc, which has ISA-independent code

add_executable(disp_to_tu disp_to_tu.cc basic.cc)

# we need to link with libhwy

target_link_libraries(disp_to_tu hwy)

# the main program needs specific compile options:

set_source_files_properties ( disp_to_tu.cc PROPERTIES COMPILE_FLAGS
                "-DUSE_HWY -O3 -std=gnu++17 -DMULTI_SIMD_ISA -I.." )

# for the ISA-specific object files holding 'payload' code,
# we use cmake 'object libraries'. This places the ISA-specific
# object files in separate directories, for which we use the
# same name as the ISA. For this program, each of the object
# libraries will only contain a single object file made from
# inset.cc with ISA-specific compilation instructions. Since
# we're already running a loop over the ISAs, we add a line
# to tell cmake to link the object file in.

foreach ( isa IN LISTS isa_l )

    add_library ( ${isa} OBJECT inset.cc )

    target_compile_options ( ${isa} PUBLIC -DTG_ISA=HWY_${isa}
                 -DMULTI_SIMD_ISA -DUSE_HWY -O3 -std=gnu++17 )

    target_link_libraries(disp_to_tu $<TARGET_OBJECTS:${isa}>)

endforeach()

Since ISA-specific code is now in separate ISA-specific TUs, the scope of the #pragmas from HWY_BEFORE_NAMESPACE can be extended over headers which contain templates that are to be used with simdized data types. We get highway's 'wisdom' about which flags to use for which ISA without having to extract these flags and feed them to the compiler 'from the outside'.

Comments welcome. I think with the job shared between cmake and highway, most of the handling of several ISAs becomes boilerplate, and users only need to slot their payload code in at the right place.

@kfjahnke
Author

...and there is one added bonus to generating separate ISA-specific TUs: they can all be compiled in parallel, rather than one after the other as in a monolithic TU built with foreach_target.

I've pushed the new code to the envutil repo.

@jan-wassenberg
Member

hm, I acknowledge that multi-ISA build times make it tempting to shard .cc files. Probably better to do it at the level of "smaller files" rather than "one per ISA", because that's a maintenance burden. What happens when we add new targets, such as the upcoming AVX_10_2?

For the original "requires target feature" issue, that is indeed a serious problem. It indicates the code is not inside the pragmas as required. I thought the benefit of your approach was that the code would indeed be compiled inside HWY_BEFORE_NAMESPACE, and you only expose a "function pointer" (or vtable). That still seems like a reasonable approach if you can afford the function call - and the code should anyway be structured so that we dispatch only infrequently/at a high level, so that we can afford it?

@kfjahnke
Author

Probably better to do it at the level of "smaller files" rather than "one per ISA",

Ordering by ISA does not preclude splitting the objects up further - the add_library ( ${isa} OBJECT ... ) syntax can handle several object files. The maintenance effort is rather low:

What happens when we add new targets, such as the upcoming AVX_10_2?

'Old' executables should function as before, new ones won't build unless the new ISA is added to the list in cmake - the 'external' declaration of _get_dispatch in the main program requires the function to exist, and if there is no TU with the new ISA it doesn't, so a linker error occurs. Updating the list in cmake to contain the new ISA is a single-point change, everything else trickles down. Only the object code for the new ISA needs to be produced, the remainder of the object files can be reused.
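To illustrate the point that add_library ( ${isa} OBJECT ... ) can take several source files, here is a hedged sketch of the extended loop - 'more_payload.cc' is a hypothetical second payload file, not part of the actual envutil build:

```cmake
# each ISA's object library can hold any number of payload TUs;
# adding a source file here is the only change needed
foreach ( isa IN LISTS isa_l )

    add_library ( ${isa} OBJECT inset.cc more_payload.cc )

    target_compile_options ( ${isa} PUBLIC -DTG_ISA=HWY_${isa}
                 -DMULTI_SIMD_ISA -DUSE_HWY -O3 -std=gnu++17 )

    target_link_libraries ( disp_to_tu $<TARGET_OBJECTS:${isa}> )

endforeach()
```

All payload TUs in a given object library are compiled for the same target, so within one ISA the split into files is purely a matter of taste.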

serious problem. It indicates the code is not inside the pragmas as required. I thought the benefit of your approach was that the code would indeed be compiled inside HWY_BEFORE_NAMESPACE, and you only expose a "function pointer" (or vtable).

It's a tricky problem. Initially I thought that having the 'payload code proper' after HWY_BEFORE_NAMESPACE should be enough. The payload code 'proper' is indeed compiled after HWY_BEFORE_NAMESPACE, but if it uses a template from a header which was #included earlier, that does not pull the template's own code into the #pragma-defined environment. The invocation of HWY_BEFORE_NAMESPACE in a file managed by foreach_target.h can't stop the compiler from 'fixing' the ISA for the 'host' template to whatever the compile environment is when the template is first encountered (some baseline, typically SSE2). Putting the header with the template between HWY_BEFORE_NAMESPACE and HWY_AFTER_NAMESPACE inside foreach_target-managed code does not help, because the header's own sentinel will stop it from being re-compiled for each ISA - for a good reason: such headers usually populate a single namespace, not a set of nested namespaces with HWY_NAMESPACE as sub-namespace. Headers which populate a set of nested namespaces have to be specifically set up to do so.

If the payload code inside a foreach_target-managed build requires inlining, the build will fail as soon as the second ISA is compiled, because of the initial 'fixation' - you can see that in the error message I quoted: it occurs when the code trying to 'inline itself into the template' is in 'SSSE3 space' - the first pass, when it's in 'SSE2 space', runs just fine. If inlining is not required, the compile succeeds and function calls happen, degrading performance. If the payload code goes into ISA-specific TUs, the header with the template can be placed after HWY_BEFORE_NAMESPACE with the desired effect, and inlining becomes possible, saving the function call and bringing performance up to that of a single-ISA compile, at the cost of having to deal with the separate TUs.

Concerning the vtable approach - that is perfectly general. You can set up any number of classes derived from dispatch_base and provide your own method of providing a dispatch pointer. But things are much easier when following the structure suggested by highway's namespace system, and gleaning the dispatch pointer via a HWY_DYNAMIC_DISPATCH of _get_dispatch exploits highway's functionality fully. Whether the derived dispatch classes reside in the same or in a different TU makes no difference to dispatching.

@jan-wassenberg
Member

Before I head out for Christmas break.. :)
hm, I agree managing this in the build system would be feasible. But in general my experience has been that the more logic there is in the build system, the likelier it is that there will be issues.

The payload code 'proper' is indeed compiled after HWY_BEFORE_NAMESPACE, but if it uses a template from a header which was #included earlier, that does not pull the template's own code into the #pragma-defined environment.

I see. Yes, it is important that the header be pulled in inside/after our HWY_BEFORE_NAMESPACE.
As to the include guard, we handle this with the HWY_TARGET_TOGGLE mechanism - you can see this in action in dot-inl.h and others. That (and adding HWY_NAMESPACE) are likely what you meant by "specifically set up to do so". I'd still advocate for that approach if it's an option?

@kfjahnke
Author

Before I head out for Christmas break.. :)

Happy Holiday! Hope to hear from you after the break.

I agree managing this in the build system would be feasible.

highway already interacts with cmake and, during deployment, deposits code which is consumed on the receiving end to glean project- and deployment-specific information. Including the lists of ISAs there would make the process completely automatic for cmake builds. I agree it's more comfortable without another cook around to spoil the broth, but this looks like a clean solution to me.

we handle this with the HWY_TARGET_TOGGLE mechanism

If the templates/headers in question can be modified, that's a solution (I do so in zimt). But if they come from external sources (library code), that's not an option - the mechanism is intrusive. This is where the problems arise, and 'sharding' into ISA-specific TUs is the only way I have found so far to put such headers after HWY_BEFORE_NAMESPACE with the desired effect - if they are re-included via foreach_target, their sentinel blanks them out.

To wrap up, I've written a minimal program 'template_problem.cc' which demonstrates the issue. Try and compile it with

clang++ -O1 -c -o template_problem.o template_problem.cc -I.

Without optimization, the problem does not arise (no inlining), and the -I. is needed so that foreach_target can find the .cc file in the current folder. Here's the code:

#ifndef SENTINEL
#define SENTINEL

template < typename T >
T tsum ( T a , T b )
{
  return a + b ;
}

#endif // SENTINEL

#undef HWY_TARGET_INCLUDE
#define HWY_TARGET_INCLUDE "template_problem.cc" 

#include <hwy/foreach_target.h> 
#include <hwy/highway.h>

HWY_BEFORE_NAMESPACE() ;

namespace project
{
namespace HWY_NAMESPACE
{
  namespace hn = hwy::HWY_NAMESPACE ;
  typedef hn::ScalableTag < float > D ;
  typedef typename hn::Vec < D > vec_t ;

  vec_t add ( vec_t a , vec_t b )
  {
    return tsum ( a , b ) ;
  }
} ;
} ;

HWY_AFTER_NAMESPACE() ;

The template is guarded by a sentinel, just like templates in an external header would be - without the sentinel, the code won't compile (redefinition of tsum). vec_t has operator+ defined with HWY_INLINE, so the instantiation of the template with T=vec_t tries to inline the addition, which fails:

$ clang++ -O1 -c -o template_problem.o template_problem.cc -I.
template_problem.cc:7:12: error: always_inline function 'operator+' requires target feature 'ssse3',
but would be inlined into function 'tsum' that is compiled without support for 'ssse3'
    7 |   return a + b ;
      |            ^
  ...
6 errors generated.
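For comparison, under the 'sharded' scheme described earlier in this thread the same template can be used, because each TU pins a single target via TG_ISA before highway is included, so the header can go after HWY_BEFORE_NAMESPACE. A hedged sketch of the corresponding build commands - 'template_fixed.cc' is a hypothetical single-target variant of the file above, set up with the -DTG_ISA pattern shown earlier:

```
# one compiler invocation per ISA; highway still supplies the
# target-specific pragmas, so no -m arch flags are needed
clang++ -O1 -c -DTG_ISA=HWY_SSE2 -o template_sse2.o template_fixed.cc -I.
clang++ -O1 -c -DTG_ISA=HWY_AVX2 -o template_avx2.o template_fixed.cc -I.
```

Each invocation sees the template in exactly one ISA environment, so the always_inline clash can't occur.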

@kfjahnke
Author

I have worked with a 'sharded' build for a bit now, and to have a 'real' application to try it out with, I've set up a modified build for envutil, which uses the same code base, so I can directly compare the two builds. The more I use the 'sharded' build, the more I like it. You have addressed the disadvantages, so I'll start out with them:

  • the build system has to be aware of the set of ISAs
  • and, especially, of new targets when there is a new hwy version
  • there are more source and object files to handle and coordinate

Now for the advantages:

  • the modular build makes for faster turn-around
  • memory consumption during the build is lessened considerably
  • cmake can provide infrastructure to generate the variant object files and keep them apart
  • the individual TUs are open to more extensive parameterization
  • the dispatching logic is still highway's and will automatically upgrade with new hwy versions
  • the target environment pragmas are also highway's, so no arch flags needed for the compiler
  • inlining into 'external' templates is possible (with performance gain)
  • more code (like external headers) can be put between HWY_BEFORE- and AFTER_NAMESPACE

I've added a new example program to the zimt repository, with a simple example of a 'multi-tu' build, with documentation to explain how it's done and why. The markdown file is a working draft, and I'd value your opinion on it!
