-
Notifications
You must be signed in to change notification settings - Fork 260
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Elbrus (e2k) architecture support #700
base: wip/e2k
Are you sure you want to change the base?
Conversation
This is awesome, thank you! I'm not familiar with the architecture, but I'd like to support it as best we can. I'm willing to merge more or less as-is, but maintenance is going to be a bit tricky… Is it be possible to add a CI build, even if all we can do is cross-compile the tests from x86? If lcc can be installed on an x86 Linux machine it should be possible to add something to the development container as well. For what it's worth I'm happy to help turn generic installation instructions into a CI job. Even better, of course, would be the ability to actually run the tests. Since no CI providers I'm aware of support e2k, this would likely mean an emulator; I don't see anything about qemu support for e2k, but maybe there is an out-of-tree patch we could use or something? Again, I'm happy to help get this integrated into our CI if possible. |
Thanks! It's nice to see that project maintainer is interested in such a PR. Unfortunately, there are some problems that may cause trouble for CI. First, there is no adequate e2k emulator (it is WIP, but may take a lot of time until it will be ready). There is instruction-precise simulator though, but it is not available publicly, since it is considered an engineering tool, and even if it was available, it is too slow to run any CPU-heavy stuff in it, like building or testing (it's about 1000 times slower than an actual hardware). There is also cross compiler, but it is available on request only (lcc compilers, both native and cross ones, have EDG frontend, which is proprietary and can't be freely distributed AFAIK). But instead we have three publicly available E2K machines, which can be accessed by users if they provide desired username and public SSH key (still no root access though). If it is enough to setup CI, then I'd be happy to provide access (so you may set up CI not just for building, but also for running tests). Regarding access, you may contact me through Discord (makise-homura#8793) or join Telegram group which is a discussion related to mentioned SSH-accessible machines (it is primarily in Russian language, but most of us can speak back in English if someone is asking there for something in English). If either one isn't an option for you, you may suggest another option instead I guess. |
I pushed recently some other changes enabling OpenMP (which is implicitly enabled with LCC without any option like
But they are probably low-level ones raised when assembler code is packed into VLIW word. Don't know what to do with them, and also have no idea if I should take them into account. Looks like kind of overflowed shifts, but I'm not sure. |
Okay, it sounds like requesting access via SSH is the right way to go for now; I'll do that in a bit. That way at least I'll be able to do some periodic builds and debugging. I'm definitely interested in adding a CI job, but that would require either setting up CI servers on e2k hardware or a freely distributable emulator + (cross-)compiler. Hopefully one day :)
That could be bugs in SIMDe. Intel tends to accept any 8-bit value for a lot of functions, so we have to handle things like shifting 16-bit lanes by 16+ bits, but IIRC that is UB. If we need to add some checks to do something like |
Well, culprits are
But, if I use Have you any ideas what to do with that? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also conditional expression like
((imm8 > 15) ? _mm_setzero_si128() : _mm_srli_si128(a, imm8))
is never optimized to exclude unreachable branch
You mean you see it compiled in the output? What compiler? I don't remember ever having a problem with it before on gcc, clang, or even MSVC, assuming the optimizer is configured to be sufficiently aggressive (for SIMDe I usually only worry about -O3).
Or is the problem that you still get an error because of the value passed to the imm8 parameter, even though that path isn't taken? If that's the case, what you can do is pass imm8 & 15
; that will restrict the imm8 param to make the compiler happy, but since the value is going to be known at compile time the AND gets compiled away to nothing too.
Also, I suggest something like (imm8 & ~15)
instead of (imm8 > 15)
because of negatives. That's not just a completely insane thing to pass, either; Arm uses negatives to reverse direction (e.g., right shift -1 == left shift 1).
Yes, it was in LCC's assembler output.
Yes! That worked. That was quite a clever hack. Pushed it in 42ecd54.
Hm, that may be an idea. Rewritten my 42ecd54 into 8b0b8fa, sorry for that previous commit. |
I've been playing around a bit with this, and I have a pretty small test case for the reduced-alignment issue: #include <stdint.h>
#include <stdio.h>
typedef union {
int8_t i8 __attribute__((__vector_size__(32)));
} simde__m256i;
simde__m256i
simde_mm256_set1_epi8(int8_t a) {
simde__m256i r;
for (size_t i = 0 ; i < sizeof(r.i8) / sizeof(r.i8[0]) ; i++) {
r.i8[i] = a;
}
return r;
}
int main(void) {
simde__m256i a = simde_mm256_set1_epi8(42);
return 0;
} This is interesting because we're not actually requesting a specific alignment anywhere; LCC aligns 256-bit vectors to 32-byte boundaries by default. I would definitely classify that as an LCC bug. The other time I ran into this diagnostic today was when I looked into implementing a maximum alignment on LCC (not necessary, it turns out; it was just my first guess for the cause of this). In that case, if you specify the IMHO it's not really appropriate to emit this as GCC's documentation says: "When used on a struct, or struct member, the aligned attribute can only increase the alignment; in order to decrease it, the packed attribute must be specified as well." It's not unreasonable to use it that way, as SIMDe does. However, I can see a reasonable argument for having it as long as it's off-by-default (which it is). The real problem is that AFAICT there is no way to disable it in code; no diagnostic number is provided so I have no idea what to suppress. Of course, since lcc doesn't offer a way to pop the warning stack we would be disabling the warning in any code which uses SIMDe, too, which is obviously not optimal… if there were a way to pop the stack I'd be a lot more comfortable just disabling the warning for SIMDe. I also took a look at the XOP performance. This was a bit more work than I'd hoped since google-benchmark doesn't work on LCC… there are a couple of warnings to suppress (IIRC one in google-benchmark and another in google-test), IIRC about unused functions), but then there is another problem I couldn't find a quick work-around for. However, I tried using Hayai it it worked well. So far I've only tested _mm_permute2_ps, but the "native" version is faster. Here is the code: #include <sys/random.h>
#define SIMDE_NO_NATIVE
#include <simde/x86/xop.h>
#include <x86intrin.h>
class RandomVectorsFixture
: public ::hayai::Fixture
{
public:
virtual void SetUp()
{
getrandom(&a, sizeof(a), GRND_RANDOM);
getrandom(&b, sizeof(b), GRND_RANDOM);
getrandom(&c, sizeof(c), GRND_RANDOM);
}
virtual void TearDown()
{ }
union {
__m128 native;
simde__m128 simde;
} a;
union {
__m128 native;
simde__m128 simde;
} b;
union {
__m128i native;
simde__m128i simde;
} c;
union {
__m128 native;
simde__m128 simde;
} r;
};
BENCHMARK_F(RandomVectorsFixture, test_simde_mm_permute2_ps, 1000, 1) {
r.simde = simde_mm_permute2_ps(a.simde, b.simde, c.simde, 2);
}
#pragma diag_suppress 1444
BENCHMARK_F(RandomVectorsFixture, test_mm_permute2_ps, 1000, 1) {
r.native = _mm_permute2_ps(a.native, b.native, c.native, 2);
} And the results:
I'd like to test each function (shouldn't be hard to modify that code), but I suspect it's going to be better to just ignore that warning. |
0366dab, e38fe50, and ad8c7e0 move this along pretty well. With those patches in place I'm able to get to the point where the compilation fails due to the inefficient implementations. I'll try to get working on testing each of those tomorrow to make sure they are all faster than SIMDe's implementations. e38fe50 is an excellent example of why I don't trust |
f4817b6
to
6d9669a
Compare
Sorry for long wait, I was a bit busy with work this week, so had any time to deal with SIMDe just today. The only thing remaining to do for now, I guess, is to do something with reduced alignment warnings (remove Also I hope you approve the way I avoid deprecation warnings in 9ff51ed (with a tiny fix in 7791129), and give an advice on what to do with OPENMP_SIMD. |
No worries, I didn't have internet (except on my phone) for a while due to a move, so I wouldn't have been able to review this anyways. I think I do like your idea for the deprecated diagnostics. I'm conflicted since I'm worried about someone doing something like #pragma diag_suppress 1215,1444
// ...
#include "path/to/simde/x86/sse2.h"
call_deprecated_function(); But I think it's the best we're going to do for LCC (and a big improvement over just disabling them and leaving them that way), and unfortunately I don't think LCC is going to support push/pop any time soon.
You mean whether to just It seems like this is basically ready to merge, right? I see a few minor things I'd like to change, but nothing major; I'll just tweak those when I merge everything. |
Yes, it is the exact case I'm expecting to be affected by using
Yes, I did like this. So now OpenMP SIMD is enabled by default, and can be disabled if
Yes, now it's really ready to merge I guess. Today's commits fixed the every single thing remaining until E2K support could be considered implemented, so unless you have any change requests for it, I think these changes can land into master. BTW, github still shows me that there's one more unresolved change requested, but I can't see it here. |
Sorry it took so long, but this is almost done. As of a few days ago everything (416c243 and 269db2a) works on e2k, but the tests generate a lot of those -Wreduced-alignment warnings. For now you can get rid of them with -Wno-reduced-alignment and everything will work as expected. I slightly changed how the calls to "deprecated" functions were made, mostly just wrapping them in statement exprs to squash the error, and I moved some stuff around a bit. |
Oops, sorry for long wait, I've totally forgot about this PR, and only few days ago I've been reminded of it. Actually I was like 'it is being merged, so it's all ok with it', and while I've had no further notifications, I forgot about it. |
bump! cc @nemequ @makise-homura |
Sorry, I've been away from SIMDe for a while but I'm trying to get back into it now. I'll take a look at this over the weekend. My memory is a bit foggy on the details, but I seem to remember that everything except for one set of changes for one issue ( |
Another bump :) |
Can someone point to place(s) where the supported instructions in the various e2k-v1/2/3/4/... are listed?
What I see in various other tickets and websites is conflicting, e.g. is AVX supported or not, which models support what, etc. |
This pull request introduces support of Elbrus hardware platform (which is based on Russian Elbrus CPU family) with its native lcc (eLbrus Compiler Collection) compiler.
This is linked to the corresponding PR for obs-studio.
ninja test
has been run after building, all tests passed, both on x86_64 and e2k.There is no ARM instructions support for now, because it looks too hard to make current implementations compile normally on e2k; but it may be introduced in following PRs, if needed.