Support for WebAssembly SIMD #81
Comments
Hi @omnisip, we are not very familiar with web technologies. I searched the internet a bit and found "emscripten". As I understand it, C/C++ code can be compiled into LLVM IR, and emscripten then takes that IR to produce WebAssembly. So NSIMD should work out of the box to produce WebAssembly. But I feel I am missing something. Can you elaborate? Thanks. |
After discussing it with a colleague, I (think I) understand that what you actually want is a new SIMD architecture for NSIMD, namely WebAssembly, which wraps the WebAssembly intrinsics (https://github.com/llvm/llvm-project/blob/master/clang/lib/Headers/wasm_simd128.h). Is that correct? |
Yes basically
|
@omnisip, is there a "simple" way to test an implementation? Is it necessary to have a browser to execute the resulting code? On my testing servers I do not have any GUI installed and would like to continue this way. Can you give us any insight on this? |
Sure. There are at least a few ways to get going. First, if the functions you use were part of recent standardization efforts (presumably the vast majority), support has been backported to Node.js 14 behind the --experimental-wasm-simd flag. See here: https://emscripten.org/docs/porting/simd.html If you need the newer stuff, you can compile v8 from source and run d8, which is like a lightweight Node.js. For your purposes, in case you weren't already planning on using it, I'd use emscripten. I'm CC'ing @tlively here, who works on that as part of the standardization efforts. |
@omnisip, thank you for your answer. I also assume you plan to support AVX and so on in the future, so I should take that into account in NSIMD and prepare for 256-bit WebAssembly and so on. Is that correct? |
The answer is that we expect to add 256 bit, but probably not until after the current standard is finalized. I would first start with 128 before messing with 256. |
Hi @omnisip, I have successfully built and installed emscripten, binaryen and d8 (from v8) and tried a C++ Hello World:
$ cat test.cpp
#include <iostream>
int main() {
  std::cout << "Hello World!\n";
  return 0;
}
$ em++ test.cpp
$ d8 a.out.js
Hello World!
I will try some SIMD now. |
@omnisip, I use |
You might find it easier to install nightly V8 from jsvu than to build it yourself. |
@tlively thanks for the tip. SIMD seems ok.
#include <iostream>
#include <wasm_simd128.h>
int main() {
  float a[128], b[128], c[128];
  // Fill the inputs so that a[i] + b[i] is always 128.
  for (int i = 0; i < sizeof(a) / sizeof(a[0]); i++) {
    a[i] = (float)i;
    b[i] = (float)(sizeof(a) / sizeof(a[0]) - i);
    c[i] = 0.0;
  }
  // Process four floats (one 128-bit vector) per iteration.
  for (int i = 0; i < sizeof(a) / sizeof(a[0]); i += 4) {
    v128_t va = wasm_v128_load((void *)(a + i));
    v128_t vb = wasm_v128_load((void *)(b + i));
    v128_t vc = wasm_f32x4_add(va, vb);
    wasm_v128_store((void *)(c + i), vc);
  }
  for (int i = 0; i < sizeof(a) / sizeof(a[0]); i++) {
    std::cout << a[i] << " + " << b[i] << " = " << c[i] << "\n";
  }
  return 0;
}
It compiles fine and executes fine:
$ em++ -msimd128 test.cpp
$ d8 --experimental-wasm-simd a.out.js
0 + 128 = 128
1 + 127 = 128
2 + 126 = 128
[...]
126 + 2 = 128
127 + 1 = 128
Wrapping the intrinsics into NSIMD should not take long, but I have to find a contiguous block of 3 to 4 hours. |
Note that the intrinsics are not quite stable yet because the proposal is still being finalized. I don't anticipate many changes to existing intrinsics, but there will be new ones added. Hopefully they will be stabilized in the next few weeks. |
I am looking at the emcc defines and found:
#define __wasm 1
#define __wasm32 1
#define __wasm32__ 1
#define __wasm__ 1
Does this mean that WASM bytecode is "32-bit"? If yes, is there a "64-bit" bytecode planned? |
Yes, you can track its progress at https://github.com/WebAssembly/memory64. |
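For readers wiring these defines into a build, here is a minimal sketch of compile-time target detection. Beyond the four macros quoted above, it assumes that clang defines `__wasm64__` for the memory64 target and `__wasm_simd128__` when `-msimd128` is passed. The helper name `wasm_target_sketch` is hypothetical and is not part of NSIMD or emscripten.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: reports which WebAssembly target macros are visible
// to the preprocessor. Returns "native" when none of them is defined.
std::string wasm_target_sketch() {
#if defined(__wasm64__)        // assumption: defined for the memory64 target
  std::string target = "wasm64";
#elif defined(__wasm32__)      // one of the macros quoted in the thread
  std::string target = "wasm32";
#else
  std::string target = "native";
#endif
#if defined(__wasm_simd128__)  // assumption: defined when -msimd128 is used
  target += "+simd128";
#endif
  return target;
}
```

A library could use the same checks to select a WebAssembly backend only when both the target and the SIMD feature macros are present, and fall back to emulation otherwise.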
Hi @tlively, I have compiled NSIMD (emulation version) with emscripten. It worked fine, but there are warnings that I want to get rid of:
em++: warning: LLVM version appears incorrect (seeing "11.0", expected "12.0") [-Wversion-check]
I tried to compile with |
How did you install Emscripten? With the recommended method of installing via the emsdk, you shouldn't see that warning. |
I found out why: I made a mistake in my .emscripten file. I had tried something but it did not work. I am currently wrapping intrinsics. If I am not mistaken the
From what I know only NVCC has builtins for that: https://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH__INTRINSIC__CAST.html Most compilers recognize this pattern and optimize it, but writing:
i32 reinterpret(u32 a) {
  i32 ret;
  // Copy the bits of a into ret (note the address-of operators).
  memcpy((void *)&ret, (void *)&a, sizeof(i32));
  return ret;
}
int main() {
  v128_t vec;
  u32 val;
  vec = wasm_i32x4_replace_lane(vec, 0, reinterpret(val));
  vec = wasm_i32x4_replace_lane(vec, 0, (i32)val); // not correct (except maybe in C++20)
}
is inconvenient and awkward, and a lot of people cannot use C++20. Moreover, reading the code is not straightforward, as one has to determine which base type the code is manipulating (i32 or u32?). So my advice is to provide intrinsics on all types (where it makes sense of course), even if several of them translate into the same assembly opcode. |
I did not find the equivalent of
v128_t div_i32x4(v128_t a, v128_t b) {
  v128_t ret;
  ret = wasm_i32x4_replace_lane(ret, 0, // warning here, but it is intended
      wasm_i32x4_extract_lane(a, 0) / wasm_i32x4_extract_lane(b, 0));
  ret = wasm_i32x4_replace_lane(ret, 1,
      wasm_i32x4_extract_lane(a, 1) / wasm_i32x4_extract_lane(b, 1));
  ret = wasm_i32x4_replace_lane(ret, 2,
      wasm_i32x4_extract_lane(a, 2) / wasm_i32x4_extract_lane(b, 2));
  ret = wasm_i32x4_replace_lane(ret, 3,
      wasm_i32x4_extract_lane(a, 3) / wasm_i32x4_extract_lane(b, 3));
  return ret;
} |
Sorry to make requests like that... |
I think there is an extract_lane and a replace_lane for all intrinsics. See here: https://github.com/WebAssembly/simd/blob/master/proposals/simd/ImplementationStatus.md With respect to the undefined, there are no intrinsics for that, but I'm certain there is a way to avoid the warnings. If x64 doesn't have it, the emulation layer for NEON does, with all of the vreinterpret* |
If you just set it to zero in that case, I'm pretty sure clang will optimize it. Second, you can use the compiler built-in vector support to do those ops too even if we don't provide intrinsics. |
I meant for all base types. Taking
I think you should also provide:
For the undefined, I agree with you but I want to make the life of the compiler as easy as possible. Counting on the compiler to optimize code has often led to disappointing results hence I do not trust the compiler on large codebase with |
Okay. I'm pretty sure the unsigned and signed variants are synonyms for each other, with the exception of 8-bit. The underlying bit representations don't change. If you'd like, I can file a ticket or make a PR to add the synonyms for those two. With respect to undefined, I've been thinking about this today and would suggest using the gcc vector intrinsic syntax for the cases where you need it. That way you can treat each input as part of an array, and when you want to return a vector you can use the same syntax or the corresponding make*. Examples for both of these cases should exist in wasm_simd128.h. |
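The vector-syntax suggestion above can be sketched without any WebAssembly intrinsics. This example assumes a compiler with GCC-style vector extensions (GCC and clang both provide them). The names `i32x4_sketch`, `div_i32x4_sketch` and `div4_sketch` are hypothetical and come from neither NSIMD nor wasm_simd128.h. Because no lane is ever read before being written, there is no uninitialized-variable warning to work around.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Assumption: GCC-style vector extensions are available (GCC, clang).
typedef std::int32_t i32x4_sketch __attribute__((vector_size(16)));

// Hypothetical helper: lane-wise signed division with no intrinsic.
// The extension applies '/' to each of the four lanes.
inline i32x4_sketch div_i32x4_sketch(i32x4_sketch a, i32x4_sketch b) {
  return a / b;
}

// Convenience wrapper over plain arrays; memcpy moves the data in and
// out of the vector type without relying on type punning.
inline void div4_sketch(const std::int32_t a[4], const std::int32_t b[4],
                        std::int32_t out[4]) {
  for (int i = 0; i < 4; ++i) {
    assert(b[i] != 0);  // integer division by zero is undefined
  }
  i32x4_sketch va, vb;
  std::memcpy(&va, a, sizeof(va));
  std::memcpy(&vb, b, sizeof(vb));
  i32x4_sketch vr = div_i32x4_sketch(va, vb);
  std::memcpy(out, &vr, sizeof(vr));
}
```

Whether the compiler lowers this to better code than the extract/replace version is something to measure rather than assume.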
You are right that the signed and unsigned variants "are the same" from the chip's point of view. This also goes for |
This is great feedback, thanks @gquintin! We can definitely add some sort of |
That's because the standard does not define how to do this kind of cast in all cases. More precisely, when casting to/from signed/unsigned/float types the standard basically says that if the number can be represented in the destination type then all goes well, but otherwise it is undefined behavior or implementation-defined, I do not remember which. But in our situation:
int main() {
  v128_t vec;
  u32 val;
  vec = wasm_i32x4_replace_lane(vec, 0, (i32)val);
}
if
But you know, I think that in 99% of cases the C-cast will just work as we expect, but it is not standard C/C++. I know I am a pain and "tatillon" (French for persnickety), but in NSIMD we chose to write code that is as portable as possible, hence we do not use C-casts for this kind of operation as we want to support many compilers/OS/architectures. As an example, here is our u64 to i64 reinterpret function:
NSIMD_INLINE i64 nsimd_scalar_reinterpret_i64_u64(u64 a0) {
#ifdef NSIMD_IS_GCC
  union { u64 from; i64 to; } buf;
  buf.from = a0;
  return buf.to;
#else
  i64 ret;
  memcpy((void *)&ret, (void *)&a0, sizeof(ret));
  return ret;
#endif
} |
I quote the C89 standard (sorry, I am kind of old-fashioned); I guess that one can find the same in all other standards:
|
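The portability argument above can be illustrated with a generic, memcpy-based reinterpret. This is a sketch only: `bit_cast_sketch` is a hypothetical name and not NSIMD's API, and the template form assumes C++11 for `static_assert`. It copies the object representation and never performs a value conversion, so it does not depend on how a given standard treats out-of-range signed conversions.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not NSIMD's API): reinterprets the bits of `from`
// as a value of type To by copying the object representation.
template <typename To, typename From>
To bit_cast_sketch(From from) {
  static_assert(sizeof(To) == sizeof(From), "source and target sizes must match");
  To to;
  std::memcpy(&to, &from, sizeof(to));  // bits are copied, no value conversion
  return to;
}
```

For instance, `bit_cast_sketch<std::int32_t>(val)` with a `std::uint32_t val` could be passed to `wasm_i32x4_replace_lane` in place of the `(i32)val` cast discussed in the thread. Compilers commonly optimize the `memcpy` away, but that is worth checking in the generated code.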
Does the WebAssembly spec sufficiently address your concerns? See here: https://webassembly.github.io/spec/core/exec/numerics.html |
If in some standards it is undefined behavior, no because it specifies how the VM for the WebAssembly works. Like Intel or Arm chips, integers are represented using 2's complement but it does not change the C/C++ standards and compilers can do what they want with For C89 it is implementation-defined, in the "Conversions" paragraph of numerics.html there are "Else the result is undefined." about the "trunc" operators which is the same as undefined behavior for the C/C++ standards. Plus I am not sure that counting on implementation behaviors is wise. |
This is a little confusing because of the structure of the WebAssembly spec, but the actual truncation instructions are defined to trap if the result cannot be represented. WebAssembly does not have any undefined behavior. (In this particular case, the LLVM WebAssembly backend inserts range checks to avoid the trap at runtime.) But given that clang is currently the only C/C++ compiler that compiles to WebAssembly, you should be able to depend on its signed/unsigned integer casting behavior in WebAssembly SIMD code. But what do other architectures do about this problem? Does x86 have a separate signed and unsigned version for their corresponding intrinsics? |
Intel does as you do, intrinsics are not provided for all types. You have On Arm you have both intrinsics:
So I guess both approaches exist. I cannot speak for all programmers, but my colleagues and I prefer to have intrinsics on all types, knowing that they will eventually give the same binary code. We see several advantages to that approach:
Take the example of the NSIMD
__m128d nsimd_gather_linear_sse42_f64(f64 const* a0, int a1) {
  __m128d ret;
  ret = _mm_undefined_pd();
  ret = _mm_castsi128_pd(_mm_insert_epi64(
            _mm_castpd_si128(ret),
            nsimd_scalar_reinterpret_i64_f64(a0[0 * a1]), 0));
  ret = _mm_castsi128_pd(_mm_insert_epi64(
            _mm_castpd_si128(ret),
            nsimd_scalar_reinterpret_i64_f64(a0[1 * a1]), 1));
  return ret;
}
vs
float64x2_t nsimd_gather_linear_aarch64_f64(f64 const* a0, int a1) {
  float64x2_t ret;
  ret = vdupq_n_f64(a0[0]);
  ret = vsetq_lane_f64(a0[1 * a1], ret, 1);
  return ret;
}
There is no Intel intrinsic for inserting doubles (f64) into a SIMD register which would be called
f64 nsimd_scalar_reinterpret_f64_i64(i64 a0) {
#ifdef NSIMD_IS_GCC
  union { i64 from; f64 to; } buf;
  buf.from = a0;
  return buf.to;
#else
  f64 ret;
  memcpy((void *)&ret, (void *)&a0, sizeof(ret));
  return ret;
#endif
}
So I guess one could add to the advantages:
For the "truncation" stuff, I think I did not look at your document carefully enough, but if indeed your spec says something like (or the following can be implied from other statements) "conversion from N-bit unsigned integers to N-bit signed integers leaves the bits as-is", then indeed one can use the C-style cast conversion. In NSIMD we avoid undefined behavior and implementation-defined behavior as much as possible. |
Thanks for the detailed information! This is very useful feedback, and I will take it under consideration as we finalize the WebAssembly SIMD intrinsics over the next couple months 👍 |
That's an interesting sample. Dup is the standard way to initialize a register on Arm, but it isn't the standard on x86/x64. When writing code for WASM SIMD it's helpful to know some of these things, since you want the compiler(s), clang + v8 for instance, to generate optimal code for your target platform. The best solution for this case might be a bit different than you expect. For instance, if you're implementing gather, which always pulls from memory, you might be incentivized to use the load32_zero or load64_zero instructions, which initialize a vector from the first 4 bytes or 8 bytes without exceeding the boundary, and then insert from there. There's even an argument for using load64_zero twice, causing two normal memory loads into vectors, before shuffling the vectors together. That would ensure that you're not getting hit with a cross-boundary penalty, as well as minimizing the number of shuffle ops. For testing and optimization purposes, it would be good to compare this solution against what the wasm simd make ops do, to see if the compiler is smart enough to produce it. As a side note, such optimizations would probably be prudent in your existing x64 and NEON code if the compiler isn't making them already. With respect to the casts, I understand both perspectives. If we came up with a solution using preprocessor macros that didn't increase code file size much and didn't add significant processing time for builds, we could probably add the synonyms with minimal complexity. |
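The "load64_zero twice, then shuffle" idea can be sketched as follows. The WebAssembly path assumes the intrinsic names `wasm_v128_load64_zero`, `wasm_i64x2_shuffle` and `wasm_v128_store` as found in current versions of wasm_simd128.h; the names were still in flux when this thread was written, so treat them as assumptions. The helper `gather_linear_f64x2_sketch` is hypothetical, and the scalar branch exists only so the example also builds on a native host. No claim is made here about which variant is faster.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__wasm_simd128__)
#include <wasm_simd128.h>
#endif

// Hypothetical helper: gathers a[0] and a[step] into out[0] and out[1].
void gather_linear_f64x2_sketch(const double* a, std::ptrdiff_t step,
                                double out[2]) {
#if defined(__wasm_simd128__)
  // Two 64-bit loads, each zeroing the upper lane of its vector.
  v128_t lo = wasm_v128_load64_zero(a);         // {a[0], 0}
  v128_t hi = wasm_v128_load64_zero(a + step);  // {a[step], 0}
  // Lane 0 of lo and lane 0 of hi (index 2 selects from the second operand).
  v128_t r = wasm_i64x2_shuffle(lo, hi, 0, 2);
  wasm_v128_store(out, r);
#else
  // Scalar fallback for non-WebAssembly builds.
  out[0] = a[0];
  out[1] = a[step];
#endif
}
```

Comparing the code generated for this against a version built with the make ops would show whether the compiler already produces the same instruction sequence.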
Let me put here what I have said to @omnisip by mail as a status update for this issue: I have created a "wasm" branch: https://github.com/agenium-scale/nsimd/tree/wasm It is only a matter of filling the
Same for |
Status update: a colleague is currently continuing the implementation of WASM in NSIMD. I'll keep you posted. |
Hi there,
Are you intending on adding support for WebAssembly SIMD?