BlockCRS Benchmark
This benchmark measures the performance of Tpetra::BlockCrsMatrix in a realistic application context. The code uses a 7-point stencil operator to mimic a finite-volume CFD code. The problem domain is a 3D cube distributed over MPI processes. Within each node, the code exploits parallelism using Kokkos. The benchmark measures the following performance features:
- local/global graph construction
- local/global block crs matrix and multivector fill
- block crs matrix vector multiplication
- equivalent flat scalar matrix vector multiplication
This benchmark provides a performance baseline for the current Tpetra::BlockCrsMatrix implementation.
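To get a feel for the problem sizes involved, the following sketch estimates the matrix dimensions produced by a 7-point stencil on a structured cell grid. The helper name `block_crs_size` is hypothetical and not part of the benchmark. It assumes that the element counts describe the global mesh and that a cell on the domain boundary simply has no coupling across that face.

```bash
#!/bin/bash
# Hypothetical helper: estimate matrix sizes for a 7-point stencil on an
# ni x nj x nk cell grid with square blocks of size bs.
# Assumptions: counts are global; boundary faces contribute no coupling.
block_crs_size() {
  local ni=$1 nj=$2 nk=$3 bs=$4
  local cells=$(( ni * nj * nk ))
  # Each cell couples to itself and up to six face neighbors; every
  # boundary face removes one coupling.
  local boundary_faces=$(( 2 * (ni * nj + nj * nk + ni * nk) ))
  local nnz_blocks=$(( 7 * cells - boundary_faces ))
  echo "block_rows=${cells}"
  echo "scalar_rows=$(( cells * bs ))"
  echo "nnz_blocks=${nnz_blocks}"
  echo "nnz_scalars=$(( nnz_blocks * bs * bs ))"
}

block_crs_size 32 32 32 5
```

Under these assumptions, a 32x32x32 grid with block size 5 gives 32768 block rows, 163840 scalar rows, 223232 nonzero blocks, and 5580800 scalar nonzeros in the equivalent flat matrix.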
This section shows how to configure Trilinos for Intel and NVIDIA GPU architectures. We first give the base configuration common to our target architectures, then explain the CMake variables and setup that must be customized for each target.
#!/bin/bash
USE_CUDA=OFF # ON if GPU
USE_OPENMP=ON
EXAMPLE=ON
TEST=ON
BUILD_TYPE=RELEASE # or DEBUG
TRILINOS_DIR=/your/trilinos/source/directory
INSTALL_DIR=/your/trilinos/install/directory
rm -rf C* # remove stale CMake output (CMakeCache.txt, CMakeFiles, ...); run only inside the build directory
cmake \
-D BUILD_SHARED_LIBS:BOOL=OFF \
-D Trilinos_ENABLE_EXPLICIT_INSTANTIATION:BOOL=ON \
-D Trilinos_ENABLE_INSTALL_CMAKE_CONFIG_FILES:BOOL=ON \
-D Trilinos_ENABLE_EXAMPLES:BOOL=${EXAMPLE} \
-D Trilinos_ENABLE_TESTS:BOOL=${TEST} \
-D Trilinos_ENABLE_Fortran:BOOL=OFF \
-D Trilinos_ENABLE_KokkosCore:BOOL=ON \
-D Trilinos_ENABLE_KokkosAlgorithms:BOOL=ON \
-D Trilinos_ENABLE_ALL_PACKAGES:BOOL=OFF \
-D Trilinos_ENABLE_ALL_OPTIONAL_PACKAGES:BOOL=OFF \
-D Trilinos_ENABLE_Tpetra:BOOL=ON \
-D Teuchos_ENABLE_LONG_LONG_INT:BOOL=OFF \
-D CMAKE_BUILD_TYPE:STRING=${BUILD_TYPE} \
-D CMAKE_CXX_COMPILER:FILEPATH="mpicxx" \
-D CMAKE_VERBOSE_MAKEFILE:BOOL=OFF \
-D CMAKE_SKIP_RULE_DEPENDENCY=ON \
-D CMAKE_INSTALL_PREFIX:PATH=${INSTALL_DIR} \
-D TPL_ENABLE_GLM=OFF \
-D TPL_ENABLE_MPI:BOOL=ON \
-D TPL_ENABLE_LAPACK:BOOL=ON \
-D TPL_ENABLE_BLAS:BOOL=ON \
-D Trilinos_ENABLE_OpenMP=${USE_OPENMP} \
-D Kokkos_ENABLE_OpenMP:BOOL=${USE_OPENMP} \
-D Kokkos_ENABLE_TESTS:BOOL=ON \
-D TPL_ENABLE_CUDA:BOOL=${USE_CUDA} \
-D TPL_ENABLE_CUSPARSE:BOOL=${USE_CUDA} \
-D Kokkos_ENABLE_Cuda:BOOL=${USE_CUDA} \
-D Kokkos_ENABLE_Cuda_UVM:BOOL=${USE_CUDA} \
$TRILINOS_DIR
- specify KOKKOS_ARCH
-D KOKKOS_ARCH="[OPT]", available options are
[AMD]
AMDAVX = AMD CPU
[ARM]
ARMv80 = ARMv8.0 Compatible CPU
ARMv81 = ARMv8.1 Compatible CPU
ARMv8-ThunderX = ARMv8 Cavium ThunderX CPU
[IBM]
Power7 = IBM POWER7 and POWER7+ CPUs
Power8 = IBM POWER8 CPUs
Power9 = IBM POWER9 CPUs
[Intel]
WSM = Intel Westmere CPUs
SNB = Intel Sandy/Ivy Bridge CPUs
HSW = Intel Haswell CPUs
BDW = Intel Broadwell Xeon E-class CPUs
SKX = Intel Sky Lake Xeon E-class HPC CPUs (AVX512)
[Intel Xeon Phi]
KNC = Intel Knights Corner Xeon Phi
KNL = Intel Knights Landing Xeon Phi
[NVIDIA]
Kepler30 = NVIDIA Kepler generation CC 3.0
Kepler32 = NVIDIA Kepler generation CC 3.2
Kepler35 = NVIDIA Kepler generation CC 3.5
Kepler37 = NVIDIA Kepler generation CC 3.7
Maxwell50 = NVIDIA Maxwell generation CC 5.0
Maxwell52 = NVIDIA Maxwell generation CC 5.2
Maxwell53 = NVIDIA Maxwell generation CC 5.3
Pascal60 = NVIDIA Pascal generation CC 6.0
Pascal61 = NVIDIA Pascal generation CC 6.1
Volta70 = NVIDIA Volta generation CC 7.0
Volta72 = NVIDIA Volta generation CC 7.2
For heterogeneous architectures, list the architecture names separated by commas,
e.g., "Power8,Pascal60".
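As an illustration of the comma-separated form, the sketch below builds a KOKKOS_ARCH value from the USE_CUDA switch used in the base script. `HOST_ARCH`, `GPU_ARCH`, and `kokkos_arch` are illustrative names introduced here, not Trilinos or Kokkos settings. The defaults (HSW, Pascal60) are simply two entries from the list above and should be replaced with the values matching your hardware.

```bash
#!/bin/bash
# Sketch: pick a KOKKOS_ARCH value based on USE_CUDA.
# HOST_ARCH and GPU_ARCH are illustrative variables (assumptions).
USE_CUDA=${USE_CUDA:-OFF}
HOST_ARCH=${HOST_ARCH:-HSW}
GPU_ARCH=${GPU_ARCH:-Pascal60}

kokkos_arch() {
  if [ "${USE_CUDA}" = "ON" ]; then
    # Heterogeneous node: host and device architectures, comma separated.
    echo "${HOST_ARCH},${GPU_ARCH}"
  else
    echo "${HOST_ARCH}"
  fi
}

# Extra argument to add to the cmake command line of the base script.
echo "-D KOKKOS_ARCH=\"$(kokkos_arch)\""
```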
- specify LAPACK and BLAS libraries
-D TPL_LAPACK_LIBRARIES:FILEPATH="-llapack" or "-mkl" (Intel compiler)
-D TPL_BLAS_LIBRARIES:FILEPATH="-lblas" or "-mkl" (Intel compiler)
If your BLAS and LAPACK libraries are located in a non-standard path, please
append that path to LD_LIBRARY_PATH.
- For CUDA, set the CUDA-specific environment variables as follows.
export OMPI_CXX=${TRILINOS_DIR}/packages/kokkos/bin/nvcc_wrapper
export CUDA_LAUNCH_BLOCKING=1
export CUDA_MANAGED_FORCE_DEVICE_ALLOC=1
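A misconfigured wrapper path is easy to miss until the build fails, so it can help to check for nvcc_wrapper before exporting these variables. The following is a minimal sketch. `setup_cuda_env` is a hypothetical helper, and the path argument is assumed to be the Trilinos source directory from the base script.

```bash
#!/bin/bash
# Sketch: export the CUDA-related variables only if nvcc_wrapper exists.
# setup_cuda_env is a hypothetical helper; $1 is the Trilinos source tree.
setup_cuda_env() {
  local wrapper="$1/packages/kokkos/bin/nvcc_wrapper"
  if [ ! -x "${wrapper}" ]; then
    echo "nvcc_wrapper not found or not executable: ${wrapper}" >&2
    return 1
  fi
  export OMPI_CXX="${wrapper}"   # Open MPI's mpicxx will invoke nvcc_wrapper
  export CUDA_LAUNCH_BLOCKING=1
  export CUDA_MANAGED_FORCE_DEVICE_ALLOC=1
}

setup_cuda_env "${TRILINOS_DIR:-/your/trilinos/source/directory}" \
  || echo "CUDA environment not set"
```

Note that OMPI_CXX is read by the Open MPI compiler wrappers. Other MPI implementations use different variables to override the underlying compiler.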
- $BUILD/packages/tpetra/core/example/BlockCrs/TpetraCore_BlockCrsPerfTest.exe
[kyukim @bread] BlockCrs > ./TpetraCore_BlockCrsPerfTest.exe --help
Usage: ./TpetraCore_BlockCrsPerfTest.exe [options]
options:
--help Prints this help message
--pause-for-debugging Pauses for user input to allow attaching a debugger
--echo-command-line Echo the command-line but continue as normal
--num-elements-i int Number of cells in the I dimension.
(default: --num-elements-i=2)
--num-elements-j int Number of cells in the J dimension.
(default: --num-elements-j=2)
--num-elements-k int Number of cells in the K dimension.
(default: --num-elements-k=2)
--num-procs-i int Processor grid of (npi,npj,npk); npi*npj*npk should be equal to the number of MPI ranks.
(default: --num-procs-i=1)
--num-procs-j int Processor grid of (npi,npj,npk); npi*npj*npk should be equal to the number of MPI ranks.
(default: --num-procs-j=1)
--num-procs-k int Processor grid of (npi,npj,npk); npi*npj*npk should be equal to the number of MPI ranks.
(default: --num-procs-k=1)
--blocksize int Block size. The # of DOFs coupled in a multiphysics flow problem.
(default: --blocksize=5)
--nrhs int Number of right hand sides to solve for.
(default: --nrhs=1)
--repeat int Number of iterations of matvec operations to measure performance.
(default: --repeat=100)
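The help text requires npi*npj*npk to equal the number of MPI ranks. A launch script can check this before calling mpirun, as in the sketch below. `check_proc_grid` is a hypothetical helper and not part of the benchmark.

```bash
#!/bin/bash
# Sketch: verify that npi*npj*npk matches the MPI rank count before launching.
# check_proc_grid is a hypothetical helper (assumption).
check_proc_grid() {
  local nranks=$1 npi=$2 npj=$3 npk=$4
  if [ $(( npi * npj * npk )) -ne "${nranks}" ]; then
    echo "processor grid ${npi}x${npj}x${npk} does not match ${nranks} ranks" >&2
    return 1
  fi
}

# 4 x 8 x 1 = 32, so this grid is valid for "mpirun -np 32".
check_proc_grid 32 4 8 1 && echo "grid ok"
```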
- Single Node OpenMP Strong Scale
OMP_NUM_THREADS=4 OMP_PROC_BIND=spread OMP_PLACES=threads \
./TpetraCore_BlockCrsPerfTest.exe \
--num-elements-i=32 --num-elements-j=32 --num-elements-k=32 \
--blocksize=5 --nrhs=1 \
--repeat=20
- Single Node CUDA
OMP_NUM_THREADS=1 \
./TpetraCore_BlockCrsPerfTest.exe \
--num-elements-i=32 --num-elements-j=32 --num-elements-k=32 \
--blocksize=5 --nrhs=1 \
--repeat=20
- Multi Node Weak Scale
OMP_NUM_THREADS=2 OMP_PROC_BIND=spread OMP_PLACES=threads mpirun -np 32 \
./TpetraCore_BlockCrsPerfTest.exe \
--num-elements-i=32 --num-elements-j=32 --num-elements-k=32 \
--num-procs-i=4 --num-procs-j=8 --num-procs-k=1 \
--blocksize=5 --nrhs=1 \
--repeat=20
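The single-node OpenMP run above uses one thread count. A strong-scaling study repeats the same problem size over several thread counts. The sketch below only prints the commands so they can be inspected first. `THREADS`, `EXE`, and `strong_scale_cmds` are illustrative names, and the thread counts are an arbitrary example.

```bash
#!/bin/bash
# Sketch: print one strong-scaling command per thread count (fixed problem size).
# THREADS and EXE are illustrative; pipe the output to bash to actually run it.
THREADS=${THREADS:-"1 2 4 8"}
EXE=${EXE:-./TpetraCore_BlockCrsPerfTest.exe}

strong_scale_cmds() {
  local n
  for n in ${THREADS}; do
    echo "OMP_NUM_THREADS=${n} OMP_PROC_BIND=spread OMP_PLACES=threads ${EXE}" \
         "--num-elements-i=32 --num-elements-j=32 --num-elements-k=32" \
         "--blocksize=5 --nrhs=1 --repeat=20"
  done
}

strong_scale_cmds
```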
- Platform used:
- Summary or screenshot: