-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Running FLOPS in Windows #1
Comments
I'm starting to be labeled a stalker, but back to the topic. I've listed all the information below, if you have any questions just let me know :)
> peakperf.exe -b zen2 -w 0
------------------------------------------------------
peakperf (https://github.com/Dr-Noob/peakperf)
------------------------------------------------------
CPU: AMD Ryzen 9 3900X 12-Core Processor
Microarch: Zen 2
Benchmark: Zen 2 (AVX2)
Iterations: 1.00e+09
GFLOP: 0.30
Threads: 24
N┬║ Time(s) GFLOP/s
1 2.35462 0.13
2 2.35085 0.13
3 2.34874 0.13
4 2.35693 0.13
5 2.35561 0.13
6 2.34889 0.13
7 2.34961 0.13
8 2.34183 0.13
9 2.34504 0.13
10 2.34760 0.13
------------------------------------------------------
Average performance: 0.13 +- 0.00 GFLOP/s
------------------------------------------------------ |
Update:I have first fixed all problems under MSVC and the result is just sobering ... The performance under Windows is catastrophic & I just don't know why. I could now reach these values (Compiled with MSVC X64): > MinSizeRel\peakperf.exe -w 0
------------------------------------------------------
peakperf (https://github.com/Dr-Noob/peakperf)
------------------------------------------------------
CPU: AMD Ryzen 9 3900X 12-Core Processor
Microarch: Zen 2
Benchmark: Zen 2 (AVX2)
Iterations: 1.00e+09
GFLOP: 3840.00
Threads: 24
N┬║ Time(s) GFLOP/s
1 3.34606 1147.62
2 3.29593 1165.07
3 3.29363 1165.89
4 3.31118 1159.71
5 3.29446 1165.59
6 3.29978 1163.71
7 3.28723 1168.16
8 3.29356 1165.91
9 3.30641 1161.38
10 3.29375 1165.84
------------------------------------------------------
Average performance: 1162.86 +- 5.59 GFLOP/s
------------------------------------------------------ @Dr-Noob , do you have any suggestions on what else I could test? |
In the case of peakperf, it is not as easy as the others programs (cpufetch, gpufetch). peakperf is extremely delicate; if the compiler does not generate the exact needed assembly instructions in the correct order, you will not reach the peak performance (you will not be even close to it, as you have experienced). Honestly, I think that making peakperf work on Windows is quite hard (I don't know if it is even possible), so this is more of a thing that I should do on my own. That said, the thing I would do in your case is to check the assembly. The generated assembly will surely be wrong (you can compare it against the assembly generated in Linux), and you would need to make the compiler to generate the correct one. As I said, this is not trivial at all... |
Under Windows it is really not easy. MinGW only produces segfaults, I also found something about this on StackOverflow (but I don't know if this is still true, anyway I link it).
MSVC produces only miserable assembly code and I can't optimize it further, so I'll have another look at clang (was also recommended on StackOverflow, maybe I can achieve something with it). |
UpdateAfter intensive research, I was at least able to achieve better performance with Clang under Windows. But it doesn't come close to the performance under Linux ... The assembly of the compute functions looks good so far (have added compute_zen2 here respectively, since I own a Ryzen 9 3900X). Now I have to do a profiling and look where there are still performance hotspots to eliminate them. Result of the microbenchmark with the CPU AMD Ryzen 9 3900X under Windows 10: C:\Users\[USER NAME]\Documents\AAA_Environment\peakperf.win.env\peakperf-clang\build-clang-msvc>peakperf.exe -w 0
------------------------------------------------------
peakperf (https://github.com/Dr-Noob/peakperf)
------------------------------------------------------
CPU: AMD Ryzen 9 3900X 12-Core Processor
Microarch: Zen 2
Benchmark: Zen 2 (AVX2)
Iterations: 1.00e+09
GFLOP: 3840.00
Threads: 24
N┬║ Time(s) GFLOP/s
1 2.34313 1638.84
2 2.34213 1639.54
3 2.33812 1642.34
4 2.36314 1624.95
5 2.34313 1638.83
6 2.33912 1641.64
7 2.34713 1636.04
8 2.35013 1633.95
9 2.34813 1635.34
10 2.34413 1638.14
------------------------------------------------------
Average performance: 1636.95 +- 4.72 GFLOP/s
------------------------------------------------------
C:\Users\[USER NAME]\Documents\AAA_Environment\peakperf.win.env\peakperf-clang\build-clang-msvc> The assembler code from the file .text
.def @feat.00;
.scl 3;
.type 0;
.endef
.globl @feat.00
.set @feat.00, 0
.intel_syntax noprefix
.file "zen2.cpp"
.def "?compute_zen2@@YAXPEAT__m256@@T1@H@Z";
.scl 2;
.type 32;
.endef
.section .text,"xr",one_only,"?compute_zen2@@YAXPEAT__m256@@T1@H@Z"
.globl "?compute_zen2@@YAXPEAT__m256@@T1@H@Z" # -- Begin function ?compute_zen2@@YAXPEAT__m256@@T1@H@Z
.p2align 4, 0x90
"?compute_zen2@@YAXPEAT__m256@@T1@H@Z": # @"?compute_zen2@@YAXPEAT__m256@@T1@H@Z"
.seh_proc "?compute_zen2@@YAXPEAT__m256@@T1@H@Z"
# %bb.0:
sub rsp, 184
movsxd rax, esi
lea rdi, [rip + farr_zen2]
.seh_stackalloc 184
.seh_endprologue
mov edx, 1000000000
lea rsi, [rax + 4*rax]
shl rsi, 7
vmovaps ymm11, ymmword ptr [rsi + rdi + 160]
vmovaps ymm2, ymmword ptr [rsi + rdi + 32]
vmovaps ymm3, ymmword ptr [rsi + rdi + 96]
vmovaps ymm1, ymmword ptr [rsi + rdi]
vmovaps ymm10, ymmword ptr [rsi + rdi + 128]
vmovaps ymm9, ymmword ptr [rsi + rdi + 192]
vmovaps ymm8, ymmword ptr [rsi + rdi + 256]
vmovaps ymm7, ymmword ptr [rsi + rdi + 320]
vmovaps ymm6, ymmword ptr [rsi + rdi + 384]
vmovaps ymm5, ymmword ptr [rsi + rdi + 448]
vmovaps ymm4, ymmword ptr [rsi + rdi + 512]
vmovaps ymm12, ymmword ptr [rsi + rdi + 416]
vmovaps ymm13, ymmword ptr [rsi + rdi + 480]
vmovaps ymm14, ymmword ptr [rsi + rdi + 544]
vmovaps ymm15, ymmword ptr [rsi + rdi + 608]
lea rcx, [rdi + rsi]
lea rax, [rsi + rdi + 64]
vmovups ymmword ptr [rsp + 64], ymm11 # 32-byte Spill
vmovaps ymm11, ymmword ptr [rsi + rdi + 224]
vmovups ymmword ptr [rsp + 128], ymm2 # 32-byte Spill
vmovups ymmword ptr [rsp + 96], ymm3 # 32-byte Spill
vmovaps ymm2, ymmword ptr [rsi + rdi + 64]
vmovaps ymm3, ymmword ptr [rsi + rdi + 576]
vmovups ymmword ptr [rsp + 32], ymm11 # 32-byte Spill
vmovaps ymm11, ymmword ptr [rsi + rdi + 288]
vmovups ymmword ptr [rsp], ymm11 # 32-byte Spill
vmovaps ymm11, ymmword ptr [rsi + rdi + 352]
.p2align 4, 0x90
.LBB0_1: # =>This Inner Loop Header: Depth=1
vfmadd213ps ymm1, ymm0, ymmword ptr [rsp + 128] # 32-byte Folded Reload
# ymm1 = (ymm0 * ymm1) + mem
vfmadd213ps ymm2, ymm0, ymmword ptr [rsp + 96] # 32-byte Folded Reload
# ymm2 = (ymm0 * ymm2) + mem
vfmadd213ps ymm10, ymm0, ymmword ptr [rsp + 64] # 32-byte Folded Reload
# ymm10 = (ymm0 * ymm10) + mem
vfmadd213ps ymm9, ymm0, ymmword ptr [rsp + 32] # 32-byte Folded Reload
# ymm9 = (ymm0 * ymm9) + mem
vfmadd213ps ymm8, ymm0, ymmword ptr [rsp] # 32-byte Folded Reload
# ymm8 = (ymm0 * ymm8) + mem
vfmadd213ps ymm7, ymm0, ymm11 # ymm7 = (ymm0 * ymm7) + ymm11
vfmadd213ps ymm6, ymm0, ymm12 # ymm6 = (ymm0 * ymm6) + ymm12
vfmadd213ps ymm5, ymm0, ymm13 # ymm5 = (ymm0 * ymm5) + ymm13
vfmadd213ps ymm4, ymm0, ymm14 # ymm4 = (ymm0 * ymm4) + ymm14
vfmadd213ps ymm3, ymm0, ymm15 # ymm3 = (ymm0 * ymm3) + ymm15
dec rdx
jne .LBB0_1
# %bb.2:
vmovaps ymmword ptr [rcx], ymm1
vmovaps ymmword ptr [rax], ymm2
vmovaps ymmword ptr [rax + 64], ymm10
vmovaps ymmword ptr [rax + 128], ymm9
vmovaps ymmword ptr [rax + 192], ymm8
vmovaps ymmword ptr [rax + 256], ymm7
vmovaps ymmword ptr [rax + 320], ymm6
vmovaps ymmword ptr [rax + 384], ymm5
vmovaps ymmword ptr [rax + 448], ymm4
vmovaps ymmword ptr [rax + 512], ymm3
add rsp, 184
vzeroupper
ret
.seh_endproc
# -- End function
.lcomm farr_zen2,327680,64 # @farr_zen2
.section .drectve,"yn"
.ascii " /DEFAULTLIB:libcmt.lib"
.ascii " /DEFAULTLIB:oldnames.lib"
.addrsig
.globl _fltused |
What performance do you achieve in Linux? 1600 GFLOP/s doesn't look bad for your CPU. |
Okay, that's funny … Under Linux (headless) I achieve between 1675 GFLOP/s to 1700 GFLOP/s.
Under Windows, the GFLOP/s calculation is a bit slower. I will do some more tests under Windows. The previous result may have been distorted by the antivirus program or other background services. But on the whole it already looks very good :) |
Yeah, that's what I meant, it looks like your last version is working in Windows because you are already achieving the peak performance. So, in the end, what you did was just to use clang instead of msvc and mingw? Did you do any changes to the source code? |
In the end, I did indeed use clang. The Cmake file I had to adapt partly, because some GCC options also clang does not understand under Windows. I have listed the biggest changes below. File /* [+](Added) __attribute__((sysv_abi)) to the function pointers */
struct benchmark_cpu {
void (*compute_function_256)(__m256*farr_ptr, __m256, int32_t) __attribute__((sysv_abi));
void (*compute_function_512)(__m512 *farr_ptr, __m512, int32_t) __attribute__((sysv_abi));
const char* name;
double gflops;
int32_t n_threads;
bench_type benchmark_type;
};
/* ... */
/* [+](Added) __attribute__((ms_abi)) */
__attribute__((ms_abi)) bool compute_cpu (struct benchmark_cpu* bench, double* e_time)
{
/* ... */
} File #ifndef __ZEN2__
#define __ZEN2__
#include "arch.hpp"
/* [+](Added) __attribute__((sysv_abi)) */
__attribute__((sysv_abi)) void compute_zen2(__m256 *farr_ptr, __m256 mult, int32_t index);
#endif File #include "zen2.hpp"
#define OP_PER_IT B_256_10_OP_IT
/* [+](Added) Keyword 'static' because clang throws a lot of warnings */
static TYPE farr_zen2[MAX_NUMBER_THREADS][SIZE] __attribute__((aligned(64)));
/* [+](Added) __attribute__((sysv_abi))
* [~](Updated)
* int index (changed to) -> int32_t index
* long i (changed to) -> int64_t i
*/
__attribute__((sysv_abi)) void compute_zen2(TYPE *farr, TYPE mult, int32_t index)
{
farr = farr_zen2[index];
for(int64_t i=0; i < BENCHMARK_CPU_ITERS; i++)
{
farr[0] = _mm256_fmadd_ps(mult, farr[0], farr[1]);
farr[2] = _mm256_fmadd_ps(mult, farr[2], farr[3]);
farr[4] = _mm256_fmadd_ps(mult, farr[4], farr[5]);
farr[6] = _mm256_fmadd_ps(mult, farr[6], farr[7]);
farr[8] = _mm256_fmadd_ps(mult, farr[8], farr[9]);
farr[10] = _mm256_fmadd_ps(mult, farr[10], farr[11]);
farr[12] = _mm256_fmadd_ps(mult, farr[12], farr[13]);
farr[14] = _mm256_fmadd_ps(mult, farr[14], farr[15]);
farr[16] = _mm256_fmadd_ps(mult, farr[16], farr[17]);
farr[18] = _mm256_fmadd_ps(mult, farr[18], farr[19]);
}
} These are mainly the most important changes. Of course I had to implement other things like the method 'gettimeofday' myself ... Maybe it is even possible to get more performance out of it. |
My friend was kind enough to test the program on her computer. She has an i7 4790K @ 4.0 GHz and the results are horrible ... There is definitely something wrong yet, I'll have another look at the assembler code. ------------------------------------------------------
peakperf (https://github.com/Dr-Noob/peakperf)
------------------------------------------------------
CPU: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
Microarch: Haswell
Benchmark: Haswell (AVX2)
Iterations: 1.00e+09
GFLOP: 640.00
Threads: 4
N┬║ Time(s) GFLOP/s
1 1.93591 330.59
2 1.90480 335.99
3 1.92203 332.98
4 1.95600 327.20
5 1.96473 325.74
6 2.05039 312.14
7 2.30242 277.97
8 2.00046 319.93
9 2.03192 314.97
10 2.11022 303.29
------------------------------------------------------
Average performance: 317.16 +- 16.51 GFLOP/s
------------------------------------------------------ |
I've tried running FLOPS in Windows:
First, one have to change some int and long to stdint's type (
int32_t
andint64_t
). After that, I tried running it and the performance was horrible. Then, I figured out, looking at assembly, that the loop was compiled poorly. I noticed I was using a 32 bit compiler (mingw
which is 32 bits). Lastly, I tried compiling it with a 64 bit compiler (I usedmingw-w64
), using the Windows build (MingW-W64-builds
) and the Arch Linux one (mingw-w64-gcc-bin
). Both gave me the same result: segmentation fault. I found somewhere that running it with 32 bits but having a segfault with 64 bits could be caused to some issues with the stack, which should be solved using-fno-stack-protector
. This does not solve the segfault. I love you, Windows ❤️The text was updated successfully, but these errors were encountered: