Per-buffer SIMD kernels - The case of LDPC #611
Comments
@Aang23 first off, a fast LDPC decoder implementation is pretty cool. We have some convolutional and polar decoder code in VOLK. LDPC might fit as well. Speaking of those other implementations, I see a problem: reliable tests. Personally, I started to rewrite the VOLK tests with gtest. This system requires more manual implementation than the old one but does not try to be "one-size-fits-all". I had a look at your AVX implementation. It looks nice. However, it deviates from how things are done in VOLK at the moment. VOLK kernels usually don't have state, i.e. every buffer is passed in as a parameter. Also, VOLK kernels are implemented in C. Only the test system etc. is implemented in C++. The inter-frame parallelism is another challenge. We might have a system where 1, 8, 32, etc. are available options, e.g. if you want to use 8x parallelism, you will use the SSE kernel even if AVX is available. It'd be a good idea to discuss this further. As already pointed out, there are some areas where VOLK would need improvements, and we need to figure out how we would want to fit your implementation into the VOLK structure.
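
To make the "pick the kernel by parallelism" idea above concrete, here is a minimal sketch in C. Everything in it is hypothetical (the `ldpc_impl_t` descriptor, `ldpc_select_impl`, and the implementation names); VOLK has no such selection mechanism today. It only illustrates choosing an implementation by its frames-per-call count rather than by "fastest available".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical descriptor for one LDPC kernel implementation. */
typedef struct {
    const char *name; /* e.g. "generic", "sse", "avx2" */
    size_t frames;    /* frames decoded in parallel per call */
    int available;    /* nonzero if the running CPU supports it */
} ldpc_impl_t;

/* Return the available implementation that decodes exactly `wanted` frames
 * per call. When `wanted` is 0, return the available implementation with the
 * highest frame count instead. Returns NULL if nothing matches. */
const ldpc_impl_t *ldpc_select_impl(const ldpc_impl_t *impls, size_t count, size_t wanted)
{
    const ldpc_impl_t *best = NULL;
    assert(impls != NULL);
    for (size_t i = 0; i < count; i++) {
        if (!impls[i].available)
            continue; /* CPU cannot run this one */
        if (wanted != 0 && impls[i].frames != wanted)
            continue; /* caller asked for a specific parallelism */
        if (best == NULL || impls[i].frames > best->frames)
            best = &impls[i];
    }
    return best;
}
```

With a table of generic (1), SSE (8) and AVX2 (16) entries, asking for 8 would return the SSE entry even when AVX2 is available, and asking for 0 would return the AVX2 entry.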
@jdemel yes, that's part of the reason I've thought of this, as Volk already has similar FEC kernels. As for tests, with everything FEC I'd suggest doing both a synthetic "output test" as was done before and an AWGN channel simulation, perhaps generating a BER curve. Not necessarily for automated tests, but it's a quick way for someone running the tests to tell if the kernels are performing as expected. I was initially thinking only the generic_cn_kernel would be provided as a Volk kernel, ported over to C (it's pretty much C already anyway). It'd take all parameters as initialized by the user (but the parallelism would still need to be known beforehand), leaving preparing / handling buffers to the user. Perhaps another way could be for Volk to provide a "ldpc_decoder_state_t" struct or similar, which could be initialized by a first kernel to then provide a more generic, less SIMD-sensitive "volk_decode_ldpc(ldpc_decoder_state_t *state, int8_t *input, uint8_t *output)" etc. That'd keep it closer to the C++ implementation without as much having to be handled by the user. Yes, if the user wants a specific parallelism... That'd be yet another challenge, as it would have to be selected by the frame count when calling the kernel. That's not something Volk supports either, but perhaps in that case a manual override by the user could be a solution. I'd assume in most cases only the fastest available will be used.
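
A rough sketch of what the state-struct idea above might look like in C. Only the names `ldpc_decoder_state_t` and `volk_decode_ldpc` (and its signature) come from the comment; the struct fields, `ldpc_decoder_state_init`, and the function-pointer dispatch are assumptions made for illustration. The kernel body is a stand-in hard decision, not an LDPC decoder, and the "negative value means bit 1" convention is also an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical API: nothing here exists in VOLK today. */
typedef struct ldpc_decoder_state ldpc_decoder_state_t;

typedef void (*ldpc_decode_fn)(const ldpc_decoder_state_t *state,
                               const int8_t *input, uint8_t *output);

struct ldpc_decoder_state {
    size_t frame_len;      /* soft symbols per frame */
    size_t simd_factor;    /* frames handled per call (1 for generic) */
    ldpc_decode_fn decode; /* implementation chosen once, at init time */
};

/* Placeholder for the generic kernel: a plain hard decision per symbol.
 * A real kernel would run the iterative LDPC decoder here. */
static void decode_generic_placeholder(const ldpc_decoder_state_t *state,
                                       const int8_t *input, uint8_t *output)
{
    size_t n = state->frame_len * state->simd_factor;
    for (size_t i = 0; i < n; i++)
        output[i] = (uint8_t)(input[i] < 0 ? 1 : 0);
}

/* Fill in the state once; this sketch only knows the generic path. */
void ldpc_decoder_state_init(ldpc_decoder_state_t *state, size_t frame_len)
{
    assert(state != NULL && frame_len > 0);
    state->frame_len = frame_len;
    state->simd_factor = 1;
    state->decode = decode_generic_placeholder;
}

/* Single entry point: the caller never names a SIMD variant. */
void volk_decode_ldpc(ldpc_decoder_state_t *state, int8_t *input, uint8_t *output)
{
    assert(state != NULL && state->decode != NULL);
    state->decode(state, input, output);
}
```

The caller would read `simd_factor` after initialization to learn how many frames to buffer per call, which is the piece of information the stateless kernel interface cannot currently convey.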
So which functions do you suggest to port to VOLK? The inner loop with SIMD instructions? The whole At the moment, we'd have multiplicities: 1 (generic), 8 (SSE, NEON), 16 (AVX), 32 (AVX512). Or did I mix something up? Is 1, 16, 32, 64 correct? I started to work on a new test infrastructure: https://github.com/jdemel/volk/tree/newtest Do you suggest a specific way forward?
Yes,
With the current kernels, that is correct (1, 8, 16, 32), using 16-bit values. In the future I'd like to try using only 8-bit values, which could be another variant.
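
The factors quoted above follow from dividing the register width by the value width, assuming one frame per SIMD lane. A small C helper (the name `ldpc_frames_per_register` is made up for this illustration) shows the arithmetic:

```c
#include <assert.h>
#include <stddef.h>

/* Number of frames that fit in one SIMD register when each frame
 * occupies one lane of `value_bits` bits. */
size_t ldpc_frames_per_register(size_t register_bits, size_t value_bits)
{
    assert(value_bits > 0 && register_bits % value_bits == 0);
    return register_bits / value_bits;
}
```

For 16-bit values this gives 8 for 128-bit registers (SSE, NEON), 16 for 256-bit (AVX2) and 32 for 512-bit (AVX512). An 8-bit variant would double each of these to 16, 32 and 64.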
All kernels of the sort will definitely benefit from that!
Personally, from my experience with Volk I'd maybe suggest deviating from the "kernel-only" system for some things. It's great for most DSP functions but it does make Volk a bit more challenging to utilize than other (often slower) solutions. In the case of adding the
This would allow the user to set up a That's probably, in rough lines, how I would suggest doing it, as it'd make it a lot easier for the "average developer" to use. Perhaps something similar could also be provided for other FEC kernels such as the convolutional one. However, this is drastically different from Volk's usual ways :-)
My biggest worry with this approach would be that VOLK currently tries not to have state. At least none of the kernels do. This would change drastically with this kernel. The benefit of VOLK is that it provides a common interface for different implementations. Also, it comes with a system to select the best kernel automatically. I guess these features are the most important things you're looking for. If we kept the state implementation etc. separate, where would you put it? FEC tends to be way more complex than the other kernels that we usually see.
Hello!
First of all, this is not an issue, more of a question / discussion about something I may consider PRing into Volk.
Short version:
I've worked on LDPC decoding to implement in SatDump, and I'm thinking this could be a decent addition to Volk / GNU Radio, but it comes with a few challenges that differ from how things are usually done in Volk.
Long version:
In SatDump and for my own needs, a proper and decently performant LDPC implementation is something I had been in need of for quite a while.
While there are already quite a few projects able to decode generic (and CCSDS) LDPC codes, such as AFF3CT, gr-ccsds' un-merged code and more, all of them are pretty slow, especially for the use case of processing large amounts of satellite data, possibly in real time.
Hence, none of them were suitable as-is. I'll skip the details of the "story" (you can see some of them here daniestevez/ldpc-toolbox#1), but I've ended up porting gr-ccsds' code over and writing SIMD (SSE, AVX2 and NEON) versions.
If anyone wants to see the code:
Generic : https://github.com/altillimity/SatDump/blob/master/src-core/common/codings/ldpc/ldpc_decoder_generic.cpp
SSE : https://github.com/altillimity/SatDump/blob/master/plugins/simd_extensions/simd_sse41/ldpc_decoder/ldpc_decoder_sse.cpp
AVX2 : https://github.com/altillimity/SatDump/blob/master/plugins/simd_extensions/simd_avx2/ldpc_decoder/ldpc_decoder_avx.cpp
NEON : https://github.com/altillimity/SatDump/blob/master/plugins/simd_extensions/simd_neon/ldpc_decoder/ldpc_decoder_neon.cpp
SIMD Selection : https://github.com/altillimity/SatDump/blob/master/src-core/common/codings/ldpc/ldpc_decoder.cpp and https://github.com/altillimity/SatDump/blob/master/src-core/common/cpu_features.cpp
However, handling those SIMD versions of the decoder led me to implement a Volk-like system of my own, which obviously got me thinking about adding those implementations as Volk kernels. That could also benefit other projects and remove the need to handle this on my own as I am currently doing.
But as implied above, this comes with major challenges:
All kernels currently take an aligned buffer of a size which does NOT change depending on the underlying implementation... But in this instance, it would need to not be the case. The way SIMD is done above (and usually is for LDPC) is by processing "x" frames at once instead of a single one using SIMD operators. Hence, a plain C generic implementation will process a single frame at once, my SSE version 8 and AVX2 16. That means the underlying implementation's "simd factor" needs to be known by the user beforehand. (so that "x" frames can be buffered and packed accordingly beforehand)
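
To illustrate the "buffered and packed accordingly" step, here is a minimal C sketch of interleaving several frames so that the same symbol of every frame sits in adjacent SIMD lanes. The function names and the exact lane-major layout are assumptions made for this example; the layout the SatDump decoders actually expect should be checked against the linked sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* in:  frame-major, in[f * frame_len + i] is symbol i of frame f
 * out: lane-major,  out[i * lanes + f]    is symbol i of frame f
 * `lanes` is the SIMD factor (1, 8, 16, ...) of the chosen kernel. */
void ldpc_pack_frames(const int16_t *in, size_t lanes, size_t frame_len, int16_t *out)
{
    assert(in != NULL && out != NULL && in != out);
    for (size_t i = 0; i < frame_len; i++)
        for (size_t f = 0; f < lanes; f++)
            out[i * lanes + f] = in[f * frame_len + i];
}

/* Inverse of ldpc_pack_frames: lane-major back to frame-major. */
void ldpc_unpack_frames(const int16_t *in, size_t lanes, size_t frame_len, int16_t *out)
{
    assert(in != NULL && out != NULL && in != out);
    for (size_t i = 0; i < frame_len; i++)
        for (size_t f = 0; f < lanes; f++)
            out[f * frame_len + i] = in[i * lanes + f];
}
```

Because `lanes` changes with the implementation, the caller has to know it before it can size and fill the buffer, which is exactly the mismatch with fixed-size kernel buffers described above.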
An easy fix could be simply setting all kernels to process the maximum number of frames the largest SIMD can process at once (so, in this instance, taking AVX512, it would be 32), but that's a pretty awful solution in my opinion.
What do you all think? And what about the idea of adding this sort of kernel to Volk?