TpetraBelos: solver time not decreasing with increased threading #13665
Comments
@trilinos/belos
@mredenti What BLAS library are you using? This info should be in the CMake configure output.
@cgcgcg I am using
-- Found BLAS: <>/linux-rhel8-icelake/gcc-11.3.0/openblas-0.3.21-rihdzrfndzafks2zngo4tkiyw23ayeza/lib/libopenblas.so
I would expect Belos to parallelise via Kokkos, just as I am currently doing for the assembly step. Are you suggesting that threading support should come from the BLAS library? I guess Kokkos Kernels does interface to BLAS, unless there is a way to tell Trilinos to use Kokkos Kernels' own implementation of the BLAS operations?
Correct, I would expect thread parallelism for SpMV and dense operations to come from Kokkos Kernels. Could you check if you have
EDIT: Looks like you're using a Spack-installed OpenBLAS. Could you check that OpenMP threads were enabled in it?
With a single thread
mpirun -x OPENBLAS_NUM_THREADS=1 -x OMP_NUM_THREADS=1 -x OMP_PLACES=threads -x OMP_PROC_BIND=spread -n 1 --bind-to none ./sim.fixed_cylinder_2d
I get
Kokkos execution space: OpenMP
Number of threads: 1
*** Teuchos::StackedTimer::report() - Remainder for a level will be ***
*** incorrect if a timer in the level does not exist on every rank ***
*** of the MPI Communicator. ***
Flexible GMRES: 10.0946 [1]
| RILUK::initialize: 0.789323 - 7.81926% [1]
| RILUK::compute: 0.561141 - 5.55882% [1]
| Belos: Operation Op*x: 0.00354874 - 0.0351548% [1]
| Belos: BlockGmresSolMgr total solve time: 8.72281 - 86.4107% [1]
| | Belos: ICGS[2]: Orthogonalization: 5.63935 - 64.6506% [238]
| | | Belos: ICGS[2]: Ortho (Norm): 0.0136018 - 0.241195% [238]
| | | Belos: ICGS[2]: Ortho (Inner Product): 3.25822 - 57.7764% [472]
| | | Belos: ICGS[2]: Ortho (Update): 2.35588 - 41.7756% [472]
| | | Remainder: 0.0116587 - 0.206739%
| | Belos: Operation Prec*x: 2.25213 - 25.8188% [236]
| | | RILUK::apply: 2.25165 - 99.9787% [236]
| | | Remainder: 0.00047899 - 0.0212683%
| | Belos: Operation Op*x: 0.810919 - 9.29654% [237]
| | Remainder: 0.02041 - 0.233984%
| Remainder: 0.0177771 - 0.176105%
while with 4 threads
mpirun -x OPENBLAS_NUM_THREADS=4 -x OMP_NUM_THREADS=4 -x OMP_PLACES=threads -x OMP_PROC_BIND=spread -n 1 --bind-to none ./sim.fixed_cylinder_2d
I have
Kokkos execution space: OpenMP
Number of threads: 4
*** Teuchos::StackedTimer::report() - Remainder for a level will be ***
*** incorrect if a timer in the level does not exist on every rank ***
*** of the MPI Communicator. ***
Flexible GMRES: 9.58006 [1]
| RILUK::initialize: 0.822165 - 8.58204% [1]
| RILUK::compute: 0.559013 - 5.83518% [1]
| Belos: Operation Op*x: 0.00131843 - 0.0137622% [1]
| Belos: BlockGmresSolMgr total solve time: 8.18074 - 85.3934% [1]
| | Belos: ICGS[2]: Orthogonalization: 5.63029 - 68.8238% [238]
| | | Belos: ICGS[2]: Ortho (Norm): 0.0134419 - 0.238742% [238]
| | | Belos: ICGS[2]: Ortho (Inner Product): 3.22177 - 57.2221% [472]
| | | Belos: ICGS[2]: Ortho (Update): 2.36602 - 42.0231% [472]
| | | Remainder: 0.0290529 - 0.516011%
| | Belos: Operation Prec*x: 2.24144 - 27.3989% [236]
| | | RILUK::apply: 2.24078 - 99.9709% [236]
| | | Remainder: 0.000652138 - 0.0290947%
| | Belos: Operation Op*x: 0.28825 - 3.52351% [237]
| | Remainder: 0.0207603 - 0.25377%
| Remainder: 0.0168208 - 0.175581%
so I still see no effect for the solver. I guess I should check whether the OpenBLAS installation has threading enabled, likely through pthreads.
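Comparing reports like these by eye is error-prone, so here is a minimal parsing sketch. The line format it assumes (nesting marked by `|`, then `name: seconds - percent% [count]`) is inferred only from the output pasted in this thread, not from Teuchos documentation, and `parse_report` is a hypothetical helper name.

```python
import re

# One timer line: optional "| | " nesting, a name (which may itself contain
# colons), the time in seconds, then an optional percentage and call count.
LINE = re.compile(
    r"^(?P<indent>[| ]*)(?P<name>.+?): (?P<secs>\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)"
    r"(?P<pct> - [\d.eE+-]+%)?(?P<count> \[\d+\])?\s*$"
)

def parse_report(text):
    """Return {"parent/child/...": seconds} for each timer line in a report."""
    timings = {}
    stack = []  # names of the enclosing timers, one entry per nesting level
    for line in text.splitlines():
        m = LINE.match(line)
        # Skip banner lines and headers such as "Number of threads: 4",
        # which carry neither a percentage nor a call count.
        if not m or not (m["pct"] or m["count"]):
            continue
        depth = m["indent"].count("|")
        del stack[depth:]          # drop timers from deeper or sibling levels
        stack.append(m["name"])
        timings["/".join(stack)] = float(m["secs"])
    return timings

# A few lines copied from the single-thread report in this thread.
sample = """\
Number of threads: 1
Flexible GMRES: 10.0946 [1]
| Belos: BlockGmresSolMgr total solve time: 8.72281 - 86.4107% [1]
| | Belos: Operation Prec*x: 2.25213 - 25.8188% [236]
| | | RILUK::apply: 2.25165 - 99.9787% [236]
| | Belos: Operation Op*x: 0.810919 - 9.29654% [237]
| | Remainder: 0.02041 - 0.233984%
"""

timings = parse_report(sample)
for path, seconds in timings.items():
    print(f"{seconds:10.6f}  {path}")
```

Keying by the full path keeps the repeated `Remainder` entries at different levels from overwriting each other.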
From your results, with 1 thread:
| | Belos: Operation Op*x: 0.810919 - 9.29654% [237]
but with 4 threads:
| | Belos: Operation Op*x: 0.28825 - 3.52351% [237]
So it looks like SpMV does run using threads. The orthogonalization and the preconditioner, on the other hand, do not. I am not sure how much parallelism RILUK exposes, but ortho should speed up.
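As a quick check of that reading, the per-component speedup can be computed from the two reports. This is only a sketch: the timings are copied from the 1-thread and 4-thread outputs earlier in the thread, and the dictionary labels are shorthand chosen here, not Belos timer names.

```python
# Timings in seconds, copied from the two StackedTimer reports in this thread.
one_thread = {
    "Orthogonalization": 5.63935,
    "Prec*x (RILUK::apply)": 2.25213,
    "Op*x (SpMV)": 0.810919,
}
four_threads = {
    "Orthogonalization": 5.63029,
    "Prec*x (RILUK::apply)": 2.24144,
    "Op*x (SpMV)": 0.28825,
}

def speedups(base, threaded):
    """Ratio of baseline time to threaded time for each component."""
    return {name: base[name] / threaded[name] for name in base}

result = speedups(one_thread, four_threads)
for name, ratio in result.items():
    print(f"{name}: {ratio:.2f}x")
# Orthogonalization: 1.00x
# Prec*x (RILUK::apply): 1.00x
# Op*x (SpMV): 2.81x
```

Only the SpMV shows a speedup on 4 threads; the other two components are unchanged to within timing noise.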
Right, missed that :D OK, so Kokkos Kernels is doing its job as you suggested, at least for the SpMV. Just to recap, you would expect at least the ortho to speed up because it is being offloaded to Kokkos Kernels? I guess it would be useful to have a reference implementation that shows what is supposed to speed up... Would you suggest looking into the threading support of the underlying BLAS library?
Yes, ortho is a bunch of BLAS operations, so that should speed up for a threaded BLAS. RILUK does have Kokkos Kernels support, so there should be a way to get improvements. But it might depend on the settings that are used.
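One way to check whether a given OpenBLAS shared library is threaded is to ask the library itself. The sketch below assumes the library exports `openblas_get_parallel()` under its default name (builds with a symbol suffix would rename it) and that `ctypes.util.find_library` can locate it; for a Spack install you would likely need to pass the full path to `libopenblas.so`. The helper names are hypothetical.

```python
import ctypes
import ctypes.util

# Return values of openblas_get_parallel(), as defined in OpenBLAS's cblas.h.
PARALLEL_MODES = {0: "sequential (no threading)", 1: "pthreads", 2: "OpenMP"}

def describe_parallel_mode(code):
    """Translate the integer code into a readable threading model."""
    return PARALLEL_MODES.get(code, f"unknown ({code})")

def openblas_threading(path=None):
    """Return the threading model of an OpenBLAS library, or None if it
    cannot be loaded or does not export the expected symbol."""
    path = path or ctypes.util.find_library("openblas")
    if path is None:
        return None
    try:
        lib = ctypes.CDLL(path)
        lib.openblas_get_parallel.restype = ctypes.c_int
        return describe_parallel_mode(lib.openblas_get_parallel())
    except (OSError, AttributeError):
        return None

if __name__ == "__main__":
    # Assumption: pass the libopenblas.so path reported by CMake if the
    # library is not on the default search path.
    print(openblas_threading())
```

A result of "sequential (no threading)" would mean that setting `OPENBLAS_NUM_THREADS` or `OMP_NUM_THREADS` cannot speed up the dense operations.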
OK, I will look into this. Thank you for the suggestions. I will also look into whether I can tell Kokkos Kernels not to interface to a TP BLAS but to use its own BLAS implementations instead.
@cgcgcg, I have checked the OpenBLAS installation I was using, and indeed no threading (neither OpenMP nor pthreads) had been enabled. I have therefore installed OpenBLAS with OpenMP threading
spack install [email protected]%[email protected]~bignuma~consistent_fpcsr+fortran~ilp64+locking+pic+shared build_system=makefile symbol_suffix=none threads=openmp arch=linux-rhel8-icelake
and I can now observe the speedup in the orthogonalisation step too
mpirun -x OMP_NUM_THREADS=4 -x OMP_PLACES=threads -x OMP_PROC_BIND=spread -n 1 --bind-to none ./sim.fixed_cylinder_2d
Kokkos execution space: OpenMP
Number of threads: 4
*** Teuchos::StackedTimer::report() - Remainder for a level will be ***
*** incorrect if a timer in the level does not exist on every rank ***
*** of the MPI Communicator. ***
Flexible GMRES: 5.46512 [1]
| RILUK::initialize: 0.68263 - 12.4907% [1]
| RILUK::compute: 0.499326 - 9.13659% [1]
| Belos: Operation Op*x: 0.00128426 - 0.0234993% [1]
| Belos: BlockGmresSolMgr total solve time: 4.26543 - 78.0482% [1]
| | Belos: ICGS[2]: Orthogonalization: 1.59607 - 37.4187% [238]
| | | Belos: ICGS[2]: Ortho (Norm): 0.0121343 - 0.760262% [238]
| | | Belos: ICGS[2]: Ortho (Inner Product): 0.973465 - 60.9915% [472]
| | | Belos: ICGS[2]: Ortho (Update): 0.605409 - 37.9313% [472]
| | | Remainder: 0.00505928 - 0.316984%
| | Belos: Operation Prec*x: 2.38311 - 55.8704% [236]
| | | RILUK::apply: 2.38254 - 99.9762% [236]
| | | Remainder: 0.000568038 - 0.023836%
| | Belos: Operation Op*x: 0.275299 - 6.4542% [237]
| | Remainder: 0.0109497 - 0.256707%
| Remainder: 0.0164537 - 0.301067%
So this confirms what you were saying: SpMV operations are offloaded to Kokkos Kernels, while dense operations go to the TP BLAS. Maybe there is still room for threading the preconditioner operations.
It looks like RILUK is serial by default. You could try setting "fact: type" = "KSPILUK" in the construction to see if that performs better.
KSPILUK might bring some benefits, but it is still limited by the nature of the ILU factorization, so I am not expecting anything nearly as good as the scaling for a dot product or a sparse matrix-vector product. That said, it is good that you already see a reasonable amount of speedup :)
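To see how much the preconditioner now limits further gains, here is a rough Amdahl's-law estimate based on the 4-thread run with the OpenMP-enabled OpenBLAS. It is an idealization: it treats `Prec*x` as fully serial and everything else in the solve as perfectly parallel, which neither the thread nor the measurements guarantee.

```python
def amdahl_speedup(serial_fraction, n):
    """Amdahl's law: speedup when only the non-serial part scales by n."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n)

# Seconds, copied from the report obtained with the OpenMP-enabled OpenBLAS.
solve_time = 4.26543  # Belos: BlockGmresSolMgr total solve time
prec_time = 2.38311   # Belos: Operation Prec*x (RILUK::apply)

serial_fraction = prec_time / solve_time
limit = 1.0 / serial_fraction  # bound if everything except Prec*x cost nothing

print(f"serial fraction: {serial_fraction:.1%}")
print(f"upper bound on further speedup: {limit:.2f}x")
for n in (2, 4, 8):
    extra = amdahl_speedup(serial_fraction, n)
    print(f"{n}x more parallel resources -> {extra:.2f}x")
```

Under these assumptions the solve cannot get more than about 1.79x faster than this run unless the preconditioner application itself is threaded or replaced.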
Yes, this is already an improvement, and at least now I know what was wrong. I will try KSPILUK out of curiosity, but I will also need to check the implications in terms of convergence. I am still curious to compare the performance of a TP BLAS with Kokkos Kernels' own implementation, although I suspect a TP BLAS can also take advantage of vectorization, while I am not sure this can be achieved by Kokkos Kernels' own BLAS implementation.
Question
Hi,
I’m working on a finite element application that uses Trilinos (Tpetra, Belos) for the solution phase. The assembly of my system is parallelized with Kokkos and, as I increase the number of threads (using OpenMP), I see a significant decrease in the assembly time, which is great. However, when it comes to the Tpetra-Belos solver step (block GMRES), the solver time does not seem to improve at all as I increase the number of threads. I really do not understand this. Even with the examples like
packages/belos/tpetra/example/BlockGmres/Belos_Tpetra_BlockGmres_Galeri_Ex.exe
I do not observe changes in timings with OpenMP threading. Is there a required configuration or runtime parameter to enable threading in Belos' iterative solvers that I am missing?