Benchmark example using Intel MKL (for history) #10
Newer results:

```sh
cd Downloads/Nim
rm -rf laser
source /opt/intel/mkl/bin/mklvars.sh intel64
export OMP_NUM_THREADS=1
git clone --recursive git://github.com/numforge/laser
cd laser
git checkout dbfb31d
git submodule init
git submodule update
cd build
gedit benchmarks/third_party/blas.nim
gedit benchmarks/gemm/gemm_bench_float32.nim
gedit laser/primitives/matrix_multiplication/gemm_tiling.nim
export OMP_NUM_THREADS=72
rm -rf build
mkdir build
LD_LIBRARY_PATH=/opt/intel/compilers_and_libraries_2019.0.117/linux/mkl/lib/intel64_lin nim cpp --dynlibOverride:libmkl_intel_ilp64 --passL:"/opt/intel/compilers_and_libraries_2019.0.117/linux/mkl/lib/intel64_lin/libmkl_intel_ilp64.a -Wl,--no-as-needed -lmkl_intel_ilp64 -lmkl_gnu_thread -lmkl_core -lgomp -lpthread -lm -ldl" --passC:"-D_GNU_SOURCE -L$MKLROOT/lib/intel64_lin -DMKL_ILP64 -m64" -r -d:release -d:openmp -o:build/bench_gemm benchmarks/gemm/gemm_bench_float32.nim
```

Results (OpenBLAS = MKL):
(This issue is kept for history and for potential later improvements to Laser, especially AVX-512 and dual-port AVX-512.)
After chatting for hours with @mratsim to benchmark Laser on a 72-thread machine and to get a working MKL setup, here is an example benchmark using Intel MKL. We assume multiple MKL installations and use a specific version stored in
/opt/intel/compilers_and_libraries_2019.0.117
with the following settings. We also assume you do not have a Nim installation; if you do, you know which lines to skip.
Change the number of threads (OMP_NUM_THREADS) right at the beginning. We are using commit 990e59f. Before compiling, change the line at https://github.com/numforge/laser/blob/990e59fffe50779cdef33aa0b8f22da19e1eb328/benchmarks/blas.nim#L5 so that it loads MKL (change the MKL folders if needed):
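A minimal sketch of that edit, assuming the dynamic-library constant in blas.nim is named `blas` (the MKL library name matches the `--dynlibOverride` flag used in the compile command):

```nim
# Sketch only: point the benchmark's BLAS binding at MKL's ILP64 library
# instead of OpenBLAS. The constant name `blas` is assumed; match it to
# whatever benchmarks/blas.nim actually declares on line 5.
const blas = "libmkl_intel_ilp64.so"
```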
Change the following at https://github.com/numforge/laser/blob/990e59fffe50779cdef33aa0b8f22da19e1eb328/benchmarks/gemm/gemm_bench_float32.nim#L53-L55, and tune the values to your liking; here I used my dual Xeon Gold 6154 and 100 repeated computations (a sketch follows the CpuFlopCycle note below).
For the CpuFlopCycle, you need to check the implemented instructions here:
https://github.com/numforge/laser/blob/990e59fffe50779cdef33aa0b8f22da19e1eb328/laser/primitives/matrix_multiplication/gemm_ukernel_avx_fma.nim#L10-L23
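A hedged sketch of the tuned constants for this setup (constant names other than CpuFlopCycle are assumed from the benchmark file; the values follow from the description: 100 repetitions, 3.7 GHz, 36 physical cores, and an AVX+FMA float32 kernel, i.e. 2 FMAs/cycle × 8 lanes × 2 flops = 32 flops/cycle):

```nim
# Hedged sketch of the tuned benchmark constants; names other than
# CpuFlopCycle are assumed from gemm_bench_float32.nim.
const
  NbSamples    = 100   # 100 repeated computations
  CpuGhz       = 3.7   # Xeon Gold 6154 all-core turbo
  NumCpuCores  = 36    # physical cores across both sockets
  CpuFlopCycle = 32    # AVX+FMA: 2 FMA/cycle * 8 float32 lanes * 2 flops
```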
Also tune this to your preference: https://github.com/numforge/laser/blob/990e59fffe50779cdef33aa0b8f22da19e1eb328/laser/primitives/matrix_multiplication/gemm_tiling.nim#L234-L235 (again tuned for my dual Xeon Gold 6154):
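As an illustration only (the `mc`/`kc` names, divisors, and values below are assumptions rather than the file's actual contents), the tuning amounts to choosing cache-blocking sizes along these lines:

```nim
# Hypothetical sketch of the cache-blocking tuning behind the linked lines of
# gemm_tiling.nim -- names, divisors and values are illustrative assumptions,
# not the library's actual code. The idea is to pick mc/kc so the packed
# panels of A and B stay resident in the Skylake-SP L1/L2 caches.
proc tunedBlockSizes(T: typedesc; M, K: int): tuple[mc, kc: int] =
  result.mc = min( 768 div sizeof(T), M)   # rows of A packed per macro-tile
  result.kc = min(4096 div sizeof(T), K)   # shared depth of the packed A/B panels

echo tunedBlockSizes(float32, 1920, 1920)  # -> (mc: 192, kc: 1024)
```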
And now you can compile with MKL (change the MKL folders if needed); the full nim cpp invocation is essentially the same as the one shown in the newer-results comment above.
On a dual Xeon Gold 6154 setup (36 physical cores / 72 logical cores, 3.7 GHz all-core turbo), you should get the following:
As you can see, we are nearly reaching the maximum possible theoretical performance:
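For reference, the theoretical float32 peak implied by the constants above is CpuGhz × NumCpuCores × CpuFlopCycle = 3.7 × 36 × 32 ≈ 4262 GFLOP/s (assuming the AVX+FMA kernel; an AVX-512 kernel would double the per-cycle figure).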