'VFT dispatching' to call into SIMD-ISA-specific code #2364
Thanks for sharing your result, which is of course welcome :)
Thank you! It has actually been a few years since we designed this. You might currently have the best understanding of dispatching and how the pieces fit together.
Yes, that's fair. Maybe HWY_TARGET_NAME, though we did want to discourage using it for anything other than a namespace. If I understand your new system correctly, the derived class calls the hn:: implementation directly, so once you have your

Very cool, congrats! I'd encourage you to also write up an intro on this system, it looks useful for when there are many functions to dispatch. FYI, some time ago we expanded in an unrelated direction: enabling dispatch of function templates, at least with
Claiming that would be presumptuous - I figured out how to use highway's mechanism, not quite how it actually does what it does. But I did notice that a lot of it works with the preprocessor, rather than relying on C++ language features like templates.
I have something in the pipeline, I'll post a link to a draft version once I have it online.
Dispatching this way is indeed very efficient - it's really just the load-the-vptr plus an indirect call. The level at which this dispatch happens should not be too performance-critical, though: the stuff which really needs to be fast should all be inlined inside the chunks of performance-critical code, and the function-level dispatch should only happen when you call a routine which processes an entire array of data or such - as zimt does when it works over nD arrays of xel data.

The lookup of the current target is an issue, though. I think we discussed this once already, but my memory fails me. The question is when such a lookup may occur at run-time. If there is a possibility that the available SIMD ISA changes from one call into the SIMD-ISA-specific code to the next, my method is not applicable without re-fetching the dispatch pointer - the code would crash with an illegal instruction if the current ISA does not provide the instructions. But I think this is a rare exception, and it would be hard to handle anyway: what if you've just figured out you can use AVX3, and when you proceed to call the AVX3 code the system puts your thread on a core which only has AVX2? You'd have to make sure the code proceeds without such switches until it's done, and this may be hard to nail down.

I started this thread to also get feedback from you on potential stumbling stones, and this is certainly one - but you're in a better position to know whether this is indeed relevant, so I'd be glad to get some advice beyond 'this does happen': when, and on which platforms, does it happen? Since I intend this level of dispatch for entire blocks of code rather than for individual SIMD operations or small sequences, it would be unproblematic to re-fetch the dispatch pointer before using it (losing the direct VFT-call speed advantage) if the ISA switch can happen mid-run. But if the ISA switch can occur between the acquisition of the dispatch pointer and its use immediately afterwards, it's a problem - I doubt, though - even without investigating deeply - that your dispatching code is shielded against such extreme disruption.
Here's the text I've written on the topic: https://github.com/kfjahnke/zimt/blob/multi_isa/examples/multi_isa_example/multi_simd_isa.md

The 'multi_isa_example' folder also has example code: a program using the dispatch mechanism I have proposed in my initial post. The .md file has a lengthy text which starts out by describing highway's dispatch mechanism and the layer of code I have added on top to use 'VFT dispatching'. The code and text in this folder describe the general how-to - zimt's own use of VFT dispatching is more involved (there's the zimt namespace to deal with as well) and it's still only half-finished as of this writing, with only the core of the zimt functionality accessible via VFT dispatching. I intend the multi_isa branch to evolve so that all zimt code can be dispatched this way but keeps the option of using the other SIMD library back-ends (Vc, std::simd, zimt's own 'goading' code) as an alternative. Once that's done, I'll merge it back to master.

If we can settle the open issue about the ISA switching while a program is running (if that ever occurs), I think my method should be generally viable. I think my example code and the text will clarify precisely how VFT dispatching works - it's much more elaborate than my initial post here. Again, comments welcome!
I agree, but it's surprising, nice, and rare that more convenience actually comes with more speed.
I wouldn't worry too much about this. Intel has warned for 10+ years that CPUID info might become heterogeneous, but no one bothers to check. The most common case is where someone disables targets at runtime using a flag. This happens early on in main(), and users can arrange to call get_dispatch after that, so no problem.

Excellent article, thanks for putting this together! If you want to go into a bit more detail about the mechanism, we could expand on "deems most suitable at the time". Bonus: who initializes this bitfield? We don't want to get into the init-order fiasco by setting it in a ctor. Instead we arrange that the first entry in the table is a special version of the user's code, which first sets the bitfield, then calls the user code. Subsequent dispatch will go straight to the user code.

Some typos:

Personally I'd stop before the SIMD_REGISTER step - some people like to minimize macros, and it might be useful to
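To make the table-initialization trick described above concrete, here is a minimal stand-alone sketch - not Highway's actual implementation, just the idea, with made-up names:

```cpp
#include <atomic>
#include <cstddef>
#include <cstdio>

// Sketch only: slot 0 of the dispatch table is an init stub which detects the
// best target, stores its index, then forwards the call. Every later call
// indexes the table directly and skips the detection work.
using Fn = void (*)(std::size_t);

void RunBaseline(std::size_t n) { std::printf("baseline, n=%zu\n", n); }
void RunAVX2(std::size_t n) { std::printf("AVX2, n=%zu\n", n); }

std::size_t DetectBestTarget() { return 2; }  // stand-in for CPUID detection

std::atomic<std::size_t> g_index{0};  // 0 == init stub
void InitThenRun(std::size_t n);
Fn g_table[] = {InitThenRun, RunBaseline, RunAVX2};

void InitThenRun(std::size_t n) {
  g_index.store(DetectBestTarget());  // runs (at most) once per process
  g_table[g_index.load()](n);         // forward to the chosen target
}

void Run(std::size_t n) { g_table[g_index.load()](n); }

int main() {
  Run(8);   // first call goes through the init stub
  Run(16);  // subsequent calls dispatch directly to the chosen target
}
```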
FYI Highway will remain open source. Google has a policy of not deleting open source projects.
Want to add a link to the code? We'd welcome adding this to the Highway g3doc. Or would you prefer a link in the readme?
That's a relief. I did suspect this was a no-go - it would be just too disruptive and make people even less likely to invest in coding with SIMD.
Fixed, thanks @jan-wassenberg!
I'm glad you approve of my tedious repetitive technical outpourings :-)
I'll think about it. I used this to good effect in lux. But of course it's an extra frill which isn't strictly necessary - maybe I'll reduce it to a hint that it can be done, rather than 'going all the way' in the example code. You're right that less may be more, and it should really be about getting the concept across, to advance SIMD acceptance.
Do you mean the code in lux? lux' single cmake file is here, and the code about 'flavours' starts at line 318, as of this writing. You can see that's quite a mouthful. I'll be glad to get rid of all this cmake code once I have moved lux to use zimt with automatic ISA dispatching. If you mean the code for the article, it's here.
Thanks for the offer. I think a link in the README would be more appropriate for now - I feel my text doesn't really qualify as documentation, it's more in the style of a blog post. But of course you can quote me if you like.

I do have another concrete question. In my example code, I haven't used HWY_PREFIX, HWY_BEFORE_NAMESPACE or HWY_AFTER_NAMESPACE. While the code seems to run okay on my machine, this may be problematic elsewhere. Can you shed some light on these macros? If they are necessary, I'd like to add them to my code.
Sounds good!
👍
Yes, so readers can see the complexity there.
We'll add a link for now. But no worries, I would not be shy about calling this documentation - there is certainly a place for an introduction.
Do you mean HWY_ATTR? These are definitely necessary; they are the mechanism by which the target pragma is applied. Without that, you might only get baseline SIMD code. From the README:
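(The README excerpt is not reproduced above. For reference, a rough sketch of the usual structure - following Highway's documented conventions, with placeholder namespace and function names - looks like this:)

```cpp
#include <cstddef>
// When driven by foreach_target.h, a #define HWY_TARGET_INCLUDE "this_file.cc"
// and #include "hwy/foreach_target.h" would precede this include.
#include "hwy/highway.h"

HWY_BEFORE_NAMESPACE();  // applies the per-target #pragma (e.g. target("avx2"))
namespace project {
namespace HWY_NAMESPACE {  // expands to a target-specific name, e.g. N_AVX2

namespace hn = hwy::HWY_NAMESPACE;

// Compiled once per target; n is assumed to be a multiple of Lanes(d).
void MulLoop(const float* HWY_RESTRICT a, const float* HWY_RESTRICT b,
             float* HWY_RESTRICT out, std::size_t n) {
  const hn::ScalableTag<float> d;
  for (std::size_t i = 0; i < n; i += hn::Lanes(d)) {
    hn::Store(hn::Mul(hn::Load(d, a + i), hn::Load(d, b + i)), d, out + i);
  }
}

// Alternative to the BEFORE/AFTER pair: annotate individual functions.
HWY_ATTR void AlsoTargetSpecific(float* p) { /* ... */ }

}  // namespace HWY_NAMESPACE
}  // namespace project
HWY_AFTER_NAMESPACE();
```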
I've gone over the text some more and just committed an updated version. On this occasion, I had a look at the link you put in your README - thanks for placing it so prominently! Work on zimt's multi_isa branch is continuing; I have a good example set up and documented in the text. Two observations:
The latter one is puzzling - the code is compute-heavy (repeated 1M evaluations of a 2D b-spline), so I would have thought that using AVX2 should speed things up. I haven't looked at the machine code yet to see if I've maybe made a mistake and my dispatch isn't working properly. Have you seen this happen in your tests?

B-spline evaluation with zimt makes for a good benchmark: evaluating a b-spline of multi-channel data at 1M 2D coordinates and writing the results to memory is a 'realistic' workload which tests memory access, de/interleaving, gather/scatter and raw arithmetic speed, due to the many additions and multiplications. The addition of b-spline evaluation code to zimt is quite recent and wraps up my porting effort from the vspline library to zimt.
:) AVX2 being slower is surprising. Possible causes could be that the memory is only 16-byte aligned, or heavy use of shuffles, which are more expensive (3-cycle latency) when crossing 128-bit blocks, whereas they are still single-cycle on SSSE3.
It's a mixed picture I get. At times, with specific compiler flags, back-end and workload, g++ can produce a binary which outperforms everything else. But it doesn't do so consistently, and I usually see clang++ coming out on top. That's why I prefer it - and because its error messages are more human-friendly.

I think I've figured out why the better targets didn't run faster: I optimized with -O2 only and used a large-ish spline degree. I tried this morning with -O3 and cubic splines and got the expected performance increase going from SSE2 to AVX2 (I don't have a machine with AVX3). This also brought the results for tests compiled with clang++ and g++ closer together. With larger spline degrees the 'better' ISAs tend to perform worse, and I don't have a good idea why this would be.
That's a good hint - my test code uses splines of three-channel xel data, to mimic a typical image processing workload. Such xel data need to be de/interleaved to/from SoA configuration for SIMD processing, which likely involves shuffles.

There is one thing I notice with my back-ends which might merit a separate 'excursion issue': zimt uses C++ classes to 'harness' the 'naked' vectors. This goes beyond simply wrapping single vectors in objects - the objects contain several vectors (or their equivalent in memory), e.g. two or four. I find that this significantly influences performance, and here I have a good guess at what happens: I think that the resulting machine code, which performs several equal operations in sequence, hides the latency needed to set up the SIMD instructions and/or lets the CPU use pipelining more effectively. Using this technique, I do get performance gains, and I've been using it for years now to good effect in lux. It may help to squeeze even more performance out of a given CPU. Give it a shot - using zimt, or simply by processing small batches of vectors rather than individual ones.
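To illustrate the batching idea in plain Highway terms (this is my own sketch, not zimt's classes; the function and namespace names are made up, and n is assumed to be a multiple of four vectors):

```cpp
#include <cstddef>
#include "hwy/highway.h"

HWY_BEFORE_NAMESPACE();
namespace demo {
namespace HWY_NAMESPACE {
namespace hn = hwy::HWY_NAMESPACE;

// Process four vectors per loop iteration instead of one: the four
// independent multiply-adds can overlap in the pipeline, hiding latency.
void ScaleAdd4(const float* HWY_RESTRICT in, float* HWY_RESTRICT out,
               std::size_t n, float scale, float offset) {
  const hn::ScalableTag<float> d;
  const std::size_t N = hn::Lanes(d);
  const auto vs = hn::Set(d, scale);
  const auto vo = hn::Set(d, offset);
  for (std::size_t i = 0; i < n; i += 4 * N) {  // assumes n % (4 * N) == 0
    const auto v0 = hn::Load(d, in + i + 0 * N);
    const auto v1 = hn::Load(d, in + i + 1 * N);
    const auto v2 = hn::Load(d, in + i + 2 * N);
    const auto v3 = hn::Load(d, in + i + 3 * N);
    hn::Store(hn::MulAdd(v0, vs, vo), d, out + i + 0 * N);
    hn::Store(hn::MulAdd(v1, vs, vo), d, out + i + 1 * N);
    hn::Store(hn::MulAdd(v2, vs, vo), d, out + i + 2 * N);
    hn::Store(hn::MulAdd(v3, vs, vo), d, out + i + 3 * N);
  }
}

}  // namespace HWY_NAMESPACE
}  // namespace demo
HWY_AFTER_NAMESPACE();
```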
Yes, Load/StoreInterleaved3 does involve quite a few shuffles. Might be worth considering using 4 channels just for the faster interleaving :) I agree unrolling is often helpful. One concern about storing vectors in classes is that it's harder to guarantee alignment.
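For readers unfamiliar with these ops, a small sketch of the de/interleave step (again with made-up names; num_pixels is assumed to be a multiple of the lane count):

```cpp
#include <cstddef>
#include "hwy/highway.h"

HWY_BEFORE_NAMESPACE();
namespace demo {
namespace HWY_NAMESPACE {
namespace hn = hwy::HWY_NAMESPACE;

// De-interleave RGB (AoS) to SoA, scale each channel, re-interleave. With a
// fourth (possibly padding) channel, LoadInterleaved4/StoreInterleaved4 can
// need fewer shuffles on some targets.
void ScaleRGB(const float* HWY_RESTRICT rgb, float* HWY_RESTRICT out,
              std::size_t num_pixels, float s) {
  const hn::ScalableTag<float> d;
  const std::size_t N = hn::Lanes(d);
  const auto vs = hn::Set(d, s);
  for (std::size_t i = 0; i < num_pixels; i += N) {
    hn::Vec<decltype(d)> r, g, b;
    hn::LoadInterleaved3(d, rgb + 3 * i, r, g, b);  // AoS -> SoA
    hn::StoreInterleaved3(hn::Mul(r, vs), hn::Mul(g, vs), hn::Mul(b, vs),
                          d, out + 3 * i);          // SoA -> AoS
  }
}

}  // namespace HWY_NAMESPACE
}  // namespace demo
HWY_AFTER_NAMESPACE();
```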
Hi again!
I modified the examples.sh shell script, which compiles all examples, so that example files which use foreach_target and zimt's dispatch mechanism are compiled in an 'incarnation' using dynamic dispatch (once with clang++ and once with g++), while other examples, which still rely on picking a specific ISA by passing appropriate compiler arguments, are only compiled with the four zimt back-ends. If you have all back-ends installed, you can simply run
hm, hard to say - there's a lot of code. It can be that unrolling or branch prediction differs depending on code structure. Performance counters might be useful to narrow down where the difference lies.
It turned out that some of my code was not placed correctly into the ISA-specific nested namespaces, which resulted in sub-optimal performance. I've now managed to get the zimt library to fully cooperate with highway's foreach_target mechanism, and dynamically-dispatched versions run just as fast as single-ISA compiles.
Thanks for the updates, and congrats on the result that dynamic == single ISA speed :)
Yes, that was it, thanks for the pointer!
Please bear with me for a little longer. In my last post I wrote
I think I have tracked the problem down, and it might be a stumbling stone which can cost a lot of performance if users aren't aware of it. I set out to write a simpler program to pin the problem down. I used std::arrays of 'vec_t' type SIMD vectors instead of Imath::Vec3 of zimt vectors and wrote a template for the cross product which I put outside any namespaces. When I tried to instantiate this template in foreach_target-managed code, the code would not compile. I got plenty of errors along the lines of
Oops... I did not see this when I used the Imath template with my zimt data types. I figured out that my operator functions weren't declared to be always inlined. When I added HWY_INLINE declarations to my operator functions, I got the same compiler errors - I simply couldn't use the 'external' template any more, because it had already been 'fixed' to whatever ISA it was used with the first time around. I think I prefer this behaviour, because it prevents the performance drop you get from calling an instantiation of the template made for a low-grade SIMD ISA, but now I faced the problem of how to use external templates with foreach_target-managed code.

The only way I have found to do that - so far - is to produce separate SIMD-ISA-specific TUs, one for each SIMD ISA 'in play', and link them to the 'driver' code. Then the template instantiations can happen in the right context and use the given SIMD ISA. I found a relatively painless way to do this which still relies on foreach_target and ISA-specific #pragmas generated with HWY_BEFORE_NAMESPACE, rather than setting compiler flags externally. In the ISA-specific TUs, I do this:
Then I proceed as usual: instantiate a dispatch object inside the nested namespace, provide a function _get_dispatch, use HWY_DYNAMIC_DISPATCH to pick the right variant, and dispatch via the gleaned base class pointer. Since the header with the templates (xxx.h) is now #included after HWY_BEFORE_NAMESPACE(), it's in the same ISA environment as the SIMD data types and all is well. Do you think this method is fool-proof? There might be stumbling stones, e.g. with broken targets. I'd welcome suggestions to 'nail this down' - then I could add a section to my dispatching text on how to do it right.

The non-inlining of the operator functions does explain the performance drop in my initial approach: if the operator functions can't be inlined, they have to be called, which produces overhead.

Setting up the source files right is half the job - the other half is to set up the build system to handle the variety of targets and produce ISA-specific object files. My approach is to let highway do as much work as possible and to refrain from supplying any ISA-specific arguments beyond setting HWY_TARGET. Here's a tentative bit of cmake code to build a multi-TU program:
Since ISA-specific code is now in separate ISA-specific TUs, the scope of the #pragmas from HWY_BEFORE_NAMESPACE can be extended over headers which contain templates that are to be used with simdized data types. We get highway's 'wisdom' about which flags to use for which ISA without having to somehow extract these flags and feed them to the compiler 'from the outside'. Comments welcome. I think with the job shared between cmake and highway, most of the handling of several ISAs becomes boilerplate, and users only need to slot their payload code in at the right place.
...and there is one added bonus to generating separate ISA-specific TUs: they can all be compiled in parallel, rather than one after the other as when building a monolithic TU with foreach_target. I've pushed the new code to the envutil repo.
hm, I acknowledge that multi-ISA build times make it tempting to shard .cc files. Probably better to do it at the level of "smaller files" rather than "one per ISA", because that's a maintenance burden. What happens when we add new targets, such as the upcoming AVX_10_2?

For the original "requires target feature" issue, that is indeed a serious problem. It indicates the code is not inside the pragmas as required. I thought the benefit of your approach was that the code would indeed be compiled inside
Ordering by ISA does not preclude splitting the objects up further - the
'Old' executables should function as before; new ones won't build unless the new ISA is added to the list in cmake - the 'external' declaration of _get_dispatch in the main program requires the function to exist, and if there is no TU for the new ISA it doesn't, so a linker error occurs. Updating the list in cmake to contain the new ISA is a single-point change; everything else trickles down. Only the object code for the new ISA needs to be produced, the remainder of the object files can be reused.
It's a tricky problem. Initially I thought that having the 'payload code proper' after HWY_BEFORE_NAMESPACE should be enough. The payload code 'proper' is indeed compiled after HWY_BEFORE_NAMESPACE, but if it uses a template from a header which was #included earlier, that does not pull the template's own code into the #pragma-defined environment. The invocation of HWY_BEFORE_NAMESPACE in a file managed by foreach_target.h can't stop the compiler from 'fixing' the ISA for the 'host' template to whatever the compile environment is when the template is first encountered (some baseline, typically SSE2). Putting the header with the template in between HWY_BEFORE_NAMESPACE and HWY_AFTER_NAMESPACE inside foreach_target-managed code does not help, because the header's own sentinel will stop it from being re-compiled for each ISA - for a good reason: such headers usually populate a single namespace, not separate nested namespaces with HWY_NAMESPACE as sub-namespace. Headers which populate a set of nested namespaces have to be specifically set up to do so.

If the payload code inside a foreach_target-managed build requires inlining, the build will fail as soon as the second ISA is compiled, because of the initial 'fixation' - you can see that in the error message I quoted: it occurs when the code trying to 'inline itself into the template' is in 'SSSE3 space' - the first pass, when it's in 'SSE2 space', runs just fine. If inlining is not required, the compile succeeds and function calls happen, degrading performance. If the payload code goes into ISA-specific TUs, the header with the template can be placed after HWY_BEFORE_NAMESPACE with the desired effect, and inlining becomes possible, saving the function call and bringing performance up to that of a single-ISA compile, at the cost of having to deal with the separate TUs.

Concerning the vtable approach - that is perfectly general. You can set up any number of classes derived from dispatch_base and provide your own method of providing a dispatch pointer. But things are much easier when following the structure suggested by highway's namespace system, and gleaning the dispatch pointer via a HWY_DYNAMIC_DISPATCH of _get_dispatch exploits highway's functionality fully. Whether the derived dispatch classes reside in the same or in a different TU makes no difference to dispatching.
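For reference, a header that is 'specifically set up' to be re-compiled once per target can use the toggle guard that Highway's own *-inl.h headers use, instead of a one-shot sentinel. A minimal sketch (the guard name and the Twice helper are just examples):

```cpp
// my_lib_ops-inl.h -- meant to be re-included once per foreach_target pass.
// The guard is toggled together with HWY_TARGET_TOGGLE, so each pass sees the
// body again and the templates land in the current per-target namespace.
#if defined(MY_LIB_OPS_INL_H_) == defined(HWY_TARGET_TOGGLE)
#ifdef MY_LIB_OPS_INL_H_
#undef MY_LIB_OPS_INL_H_
#else
#define MY_LIB_OPS_INL_H_
#endif

#include "hwy/highway.h"

HWY_BEFORE_NAMESPACE();
namespace my_lib {
namespace HWY_NAMESPACE {

// Instantiated per target, so HWY_INLINE works for every ISA.
template <class V>
HWY_INLINE V Twice(V v) { return hwy::HWY_NAMESPACE::Add(v, v); }

}  // namespace HWY_NAMESPACE
}  // namespace my_lib
HWY_AFTER_NAMESPACE();

#endif  // toggle guard
```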
Before I head out for Christmas break.. :)
I see. Yes, it is important that the header be pulled in inside/after our
Happy Holidays! Hope to hear from you after the break.
highway already interacts with cmake and, during deployment, deposits code which is consumed on the receiving end to glean project- and deployment-specific information. Including the lists of ISAs there would make the process completely automatic for cmake builds. I agree it's more comfortable without another cook around to spoil the broth, but this looks like a clean solution to me.
If the templates/headers in question can be modified, that's a solution (I do so in zimt). But if they come from external sources (library code), that's not an option - the mechanism is intrusive. This is where the problems arise, and the 'sharding' into ISA-specific TUs is the only way I have found so far to put such headers after HWY_BEFORE_NAMESPACE with the desired effect - if they are re-included with foreach_target, their sentinel blanks them out. To wrap up, I've written a minimal program 'template_problem.cc' which demonstrates the issue. Try and compile it with
Without optimization, the problem does not arise (no inlining), and the -I. is needed so that foreach_target can find the .cc file in the current folder. Here's the code:
The template is guarded by a sentinel, just like templates in an external header would be - without the sentinel, the code won't compile (redefinition of tsum). vec_t has operator+ defined with HWY_INLINE, so the instantiation of the template with T=vec_t tries to inline the addition, which fails:
I have worked with a 'sharded' build for a bit now, and to have a 'real' application to try it out with, I've set up a modified build for envutil, which uses the same code base, so I can directly compare the two builds. The more I use the 'sharded' build, the more I like it. You have addressed the disadvantages, so I'll start out with them:
Now for the advantages:
I've added a new example program to the zimt repository, with a simple example of a 'multi-tu' build, with documentation to explain how it's done and why. The markdown file is a working draft, and I'd value your opinion on it!
Hi! This is more of a little excursion than a 'true' issue, but it's about a technique which I've found useful and would like to share. The occasion is that I'm extending my library zimt to use highway's foreach_target mechanism.
My first remark - before I start out on the issue proper - is about this mechanism. I knew it was there, and I thought it might be a good idea to use it, but the documentation was thin and I had a working solution already. Before I turned my attention to zimt again this autumn, I did a lot of reading in the SIMD literature, and I also decided to have a closer look at highway's foreach_target mechanism. Lacking extensive documentation, I sat down and read the code. Only then did I realize just how well-thought-out and useful it actually is. Yes, using it is slightly intrusive to the client code, but you've really done a good job of hiding the complexity and making it easy to 'suck' code into the SIMD-ISA-specific nested namespaces and dispatch to it. But here I do have some criticism: to figure all that out, I had to read and understand the code! It would have been much easier had there been some sort of technical outline, paper or such to explain the concept. This criticism goes beyond this specific topic - I think you'd be well advised to improve the documentation, to address a wider user base.
My first step in introducing highway's multi-ISA capability into zimt was to introduce 'corresponding' nested namespaces in both my library's 'zimt' namespace and in the 'project' namespace (let's use this one for user-side code). I initially had a hard time figuring out the namespace scheme, probably because of the name of the central macro, HWY_NAMESPACE. The naming is unfortunate - of course it's a symbol for a namespace, but a name like HWY_SIMD_ISA would have hinted at its semantics, not at its syntax. With the namespaces set up, I could use foreach_target.h and dynamic dispatch. But I found the way of introducing the ISA-specific code via a free function verbose, so I tried to figure out a way to make this more concise and manageable. I 'bent' some of the code I used in lux to the purpose, and this is where I come to 'VFT dispatching'. The concept is quite simple:
So this is where the 'VFT' in 'VFT dispatching' comes from: it uses the virtual function table of a class with virtual functions. The language guarantees that the VFTs of all classes derived from the base have the same layout (otherwise the mechanism could not function). What do I gain? The base class pointer is a uniform handle to a - possibly large - set of functions I want to keep ISA-specific versions of. Dispatching is as simple as calling through the dispatcher base class pointer, so once I have obtained it, it serves as a conduit:
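(The code snippet from the original post is not reproduced above. Here is a stand-alone sketch of the pattern - the names dispatch_base, dispatch and payload are placeholders of mine, not zimt's:)

```cpp
#include <cstddef>
#include <cstdio>

// The ISA-independent interface: one pure virtual per dispatched routine.
struct dispatch_base {
  virtual ~dispatch_base() = default;
  virtual void payload(const float* in, float* out, std::size_t n) const = 0;
};

// In the real setup this derived class lives in the per-target nested
// namespace (project::HWY_NAMESPACE) and its overrides call hn:: code.
struct dispatch : public dispatch_base {
  void payload(const float* in, float* out, std::size_t n) const override {
    for (std::size_t i = 0; i < n; ++i) out[i] = in[i] * 2.0f;  // stand-in
  }
};

int main() {
  // In the real setup the pointer would be gleaned via
  // HWY_DYNAMIC_DISPATCH(_get_dispatch)(); here we fetch it directly.
  static const dispatch d;
  const dispatch_base* dp = &d;
  float in[4] = {1, 2, 3, 4}, out[4];
  dp->payload(in, out, 4);  // one indirect call into ISA-specific code
  std::printf("out[0] = %f\n", out[0]);
}
```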
This is more or less it, with one more little twist which I also found useful. In a first approach, I wrote out the declaration of the pure virtual member function in the base class, and again the declaration (now no longer pure) in the derived, ISA-specific, class. This is error-prone, so I now use an interface header, introducing the member functions via a macro. In a header 'interface.h' I put macro invocations only:
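(The header's contents were not reproduced above; here is a hypothetical sketch of what such an interface header might contain - DISPATCH_FN and the two function names are made up, not zimt's:)

```cpp
// interface.h -- no include guard on purpose: the file is meant to be
// expanded repeatedly with different definitions of
// DISPATCH_FN(return_type, name, argument_list).
DISPATCH_FN(void, payload, (const float* in, float* out, std::size_t n))
DISPATCH_FN(int,  stats,   (const float* in, std::size_t n))
```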
Then I can #include this header into the class declarations, #defining the macro differently:
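(Again a hypothetical sketch, continuing the made-up names from above:)

```cpp
#include <cstddef>

// Base class: every entry in interface.h becomes a pure virtual member.
struct dispatch_base {
  virtual ~dispatch_base() = default;
#define DISPATCH_FN(RET, NAME, ARGS) virtual RET NAME ARGS const = 0;
#include "interface.h"
#undef DISPATCH_FN
};

// ISA-specific derived class (one per target namespace): the same entries,
// now declared as overrides; their definitions call the hn:: code.
struct dispatch : public dispatch_base {
#define DISPATCH_FN(RET, NAME, ARGS) RET NAME ARGS const override;
#include "interface.h"
#undef DISPATCH_FN
};
```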
This ensures that the declarations are consistent. For the actual implementation, the signature has to be written out once again, but since there is a declaration, providing a definition with a different signature is an error - and when providing the implementation, coding with the signature 'in sight' is advisable anyway, especially when the argument list becomes long. The 'interface.h' header provides a good reference to the set of functions using the dispatcher, and additional dispatch-specific functionality can be coded for the lot. I think it makes a neat addition to VFT dispatching.
To wrap up, I'd like to point out that this mechanism is generic and can be used to good effect for all sorts of dispatches - if appropriate specific derived dispatch classes are coded along with a mechanism to pick a specific one, it can function quite independently of highway's dispatch. It can also be used to 'pull in' code which doesn't even use highway - e.g. code with vector-friendly small loops ('goading') relying on autovectorization, which will still benefit from being re-compiled several times with ISA-specific flags, be it with highway's foreach_target or by setting up separate TUs with externally supplied ISA-specific compiler flags - this is what I currently do in lux, but it requires quite some 'scaffolding' code in cmake.
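(To make the 'goading' variant concrete, a small sketch - my own illustration, with made-up names - of a plain loop that carries no explicit SIMD at all, but still yields per-target object code when re-compiled under the per-target pragmas:)

```cpp
#include <cstddef>
#include "hwy/highway.h"

HWY_BEFORE_NAMESPACE();  // the per-target #pragma also helps autovectorization
namespace project {
namespace HWY_NAMESPACE {

// 'Goading': a simple, vector-friendly loop the compiler can autovectorize.
// Re-compiled once per target (by foreach_target.h or in a per-ISA TU), the
// same source yields SSE2/AVX2/... object code, selectable via the dispatcher.
void scale_add(const float* HWY_RESTRICT a, const float* HWY_RESTRICT b,
               float* HWY_RESTRICT out, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) out[i] = a[i] * 2.0f + b[i];
}

}  // namespace HWY_NAMESPACE
}  // namespace project
HWY_AFTER_NAMESPACE();
```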
Comments welcome! I hope you find this useful - I intended to share useful bits here every now and then and it's been a while since the last one (about goading), but better late than never. If you're interested, you can have a peek into zimt's new multi_isa branch, where I have a first working example using the method (see linspace.cc and driver.cc in the examples section). If you don't approve of my intruding into your issue space, let me know.