
TpetraBelos: solver time not decreasing with increased threading #13665

Open
mredenti opened this issue Dec 10, 2024 · 13 comments

@mredenti

mredenti commented Dec 10, 2024

Question

Hi,
I’m working on a finite element application that uses Trilinos (Tpetra, Belos) for the solution phase. The assembly of my system is parallelized with Kokkos and, as I increase the number of threads (using OpenMP), I see a significant decrease in the assembly time, which is great. However, when it comes to the Tpetra/Belos solver step (block GMRES), the solver time does not seem to improve at all as I increase the number of threads. I really do not understand this. Even with bundled examples such as packages/belos/tpetra/example/BlockGmres/Belos_Tpetra_BlockGmres_Galeri_Ex.exe, I do not observe any change in timings with OpenMP threading.

Is there some required configuration option or runtime parameter to enable threading in Belos’ iterative solvers that I am missing?
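For context, this is roughly how the application initializes Kokkos/Tpetra and reports the execution space and thread count (a minimal sketch, not my exact code):

// Minimal sketch: initialize MPI + Kokkos through Tpetra and report which
// execution space and how many threads are actually in use at runtime.
#include <iostream>
#include <Kokkos_Core.hpp>
#include <Tpetra_Core.hpp>

int main(int argc, char* argv[]) {
  Tpetra::ScopeGuard tpetraScope(&argc, &argv);  // initializes MPI and Kokkos
  {
    using exec_space = Kokkos::DefaultExecutionSpace;
    std::cout << "Kokkos execution space: " << exec_space::name() << "\n"
              << "Number of threads: " << exec_space().concurrency() << std::endl;
    // ... Kokkos-parallel assembly, then Tpetra/Belos solve ...
  }
  return 0;
}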

@cgcgcg
Contributor

cgcgcg commented Dec 10, 2024

@trilinos/belos

@cgcgcg
Contributor

cgcgcg commented Dec 10, 2024

@mredenti What BLAS library are you using? This info should be in the CMake configure output.

@mredenti
Author

mredenti commented Dec 10, 2024

@mredenti What BLAS library are you using? This info should be in the CMake configure output.

@cgcgcg I am using OpenBLAS version 0.3.21:

-- Found BLAS:  <>/linux-rhel8-icelake/gcc-11.3.0/openblas-0.3.21-rihdzrfndzafks2zngo4tkiyw23ayeza/lib/libopenblas.so

I would expect Belos to parallelise via Kokkos, just as I am currently doing for the assembly step. Are you suggesting that threading support should come from the BLAS library? I guess Kokkos Kernels does interface to BLAS, unless there is a way to tell Trilinos to use Kokkos Kernels' own implementations of the BLAS operations?

@cgcgcg
Contributor

cgcgcg commented Dec 10, 2024

Correct, I would expect thread parallelism for SpMV and dense operations to come from Kokkos Kernels. Could you check if you have OPENBLAS_NUM_THREADS set in your environment, as well as OMP_NUM_THREADS?

EDIT: Looks like you're using a Spack-installed OpenBLAS. Could you check whether OpenMP threading was enabled in it?
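One way to check (a small standalone sketch, assuming the library exposes the usual OpenBLAS runtime queries; compile it and link against the same libopenblas.so):

// Hypothetical standalone check: ask the OpenBLAS runtime how it was built
// and how many threads it will use. openblas_get_parallel() returns
// 0 for a sequential build, 1 for pthreads, 2 for OpenMP.
#include <cstdio>

extern "C" {
int   openblas_get_parallel(void);
int   openblas_get_num_threads(void);
char* openblas_get_config(void);
}

int main() {
  std::printf("OpenBLAS config: %s\n", openblas_get_config());
  std::printf("Parallel model : %d (0 = serial, 1 = pthreads, 2 = OpenMP)\n",
              openblas_get_parallel());
  std::printf("Threads        : %d\n", openblas_get_num_threads());
  return 0;
}

Alternatively, the Spack spec of the installed package (e.g. spack find -v openblas) should show the threads= variant.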

@mredenti
Author

mredenti commented Dec 10, 2024

With a single thread

mpirun -x OPENBLAS_NUM_THREADS=1 -x OMP_NUM_THREADS=1 -x OMP_PLACES=threads -x OMP_PROC_BIND=spread -n 1 --bind-to none ./sim.fixed_cylinder_2d

I get

Kokkos execution space: OpenMP
Number of threads: 1
*** Teuchos::StackedTimer::report() - Remainder for a level will be ***
*** incorrect if a timer in the level does not exist on every rank  ***
*** of the MPI Communicator.                                        ***
Flexible GMRES: 10.0946 [1]
|   RILUK::initialize: 0.789323 - 7.81926% [1]
|   RILUK::compute: 0.561141 - 5.55882% [1]
|   Belos: Operation Op*x: 0.00354874 - 0.0351548% [1]
|   Belos: BlockGmresSolMgr total solve time: 8.72281 - 86.4107% [1]
|   |   Belos: ICGS[2]: Orthogonalization: 5.63935 - 64.6506% [238]
|   |   |   Belos: ICGS[2]: Ortho (Norm): 0.0136018 - 0.241195% [238]
|   |   |   Belos: ICGS[2]: Ortho (Inner Product): 3.25822 - 57.7764% [472]
|   |   |   Belos: ICGS[2]: Ortho (Update): 2.35588 - 41.7756% [472]
|   |   |   Remainder: 0.0116587 - 0.206739%
|   |   Belos: Operation Prec*x: 2.25213 - 25.8188% [236]
|   |   |   RILUK::apply: 2.25165 - 99.9787% [236]
|   |   |   Remainder: 0.00047899 - 0.0212683%
|   |   Belos: Operation Op*x: 0.810919 - 9.29654% [237]
|   |   Remainder: 0.02041 - 0.233984%
|   Remainder: 0.0177771 - 0.176105%

while with 4 threads

mpirun -x OPENBLAS_NUM_THREADS=4 -x OMP_NUM_THREADS=4 -x OMP_PLACES=threads -x OMP_PROC_BIND=spread -n 1 --bind-to none ./sim.fixed_cylinder_2d

I have

Kokkos execution space: OpenMP
Number of threads: 4
*** Teuchos::StackedTimer::report() - Remainder for a level will be ***
*** incorrect if a timer in the level does not exist on every rank  ***
*** of the MPI Communicator.                                        ***
Flexible GMRES: 9.58006 [1]
|   RILUK::initialize: 0.822165 - 8.58204% [1]
|   RILUK::compute: 0.559013 - 5.83518% [1]
|   Belos: Operation Op*x: 0.00131843 - 0.0137622% [1]
|   Belos: BlockGmresSolMgr total solve time: 8.18074 - 85.3934% [1]
|   |   Belos: ICGS[2]: Orthogonalization: 5.63029 - 68.8238% [238]
|   |   |   Belos: ICGS[2]: Ortho (Norm): 0.0134419 - 0.238742% [238]
|   |   |   Belos: ICGS[2]: Ortho (Inner Product): 3.22177 - 57.2221% [472]
|   |   |   Belos: ICGS[2]: Ortho (Update): 2.36602 - 42.0231% [472]
|   |   |   Remainder: 0.0290529 - 0.516011%
|   |   Belos: Operation Prec*x: 2.24144 - 27.3989% [236]
|   |   |   RILUK::apply: 2.24078 - 99.9709% [236]
|   |   |   Remainder: 0.000652138 - 0.0290947%
|   |   Belos: Operation Op*x: 0.28825 - 3.52351% [237]
|   |   Remainder: 0.0207603 - 0.25377%
|   Remainder: 0.0168208 - 0.175581%

so I still see no effect on the solver time. I should probably check whether the OpenBLAS installation has threading enabled, likely through pthreads.

@cgcgcg
Contributor

cgcgcg commented Dec 10, 2024

From your results, 1 thread:

|   |   Belos: Operation Op*x: 0.810919 - 9.29654% [237]

but 4 threads

|   |   Belos: Operation Op*x: 0.28825 - 3.52351% [237]

So it looks like SpMV does run using threads. The orthogonalization and the preconditioner, on the other hand, do not. I am not sure what amount of parallelism RILUK exposes, but ortho should speed up.

@mredenti
Author

mredenti commented Dec 10, 2024

Right, missed that :D

Ok, so kokkos-kernels is doing its job as you suggested - at least for the SpMV.

Just to recap, you would expect at least the ortho to speed up because it is being offloaded to kokkos-kernels?

I guess it would be useful to have a reference example that shows what is supposed to speed up... would you suggest looking into the threading support of the underlying BLAS library?

@cgcgcg
Contributor

cgcgcg commented Dec 10, 2024

Yes, ortho is a bunch of BLAS operations, so that should speed up with a threaded BLAS. RILUK does have Kokkos Kernels support, so there should be a way to get improvements, but it might depend on the settings that are used.

@mredenti
Author

Ok, I will look into this. Thank you for the suggestions.

I will also look into whether I can tell Kokkos Kernels not to interface to a third-party (TPL) BLAS but rather use its own BLAS implementations.

@mredenti
Author

mredenti commented Dec 10, 2024

@cgcgcg,

I have checked the OpenBLAS installation I was using and indeed no threading (neither OpenMP nor pthreads) had been enabled.

Thus, I've installed OpenBLAS with OpenMP threading:

spack install openblas@0.3.21%gcc@11.3.0~bignuma~consistent_fpcsr+fortran~ilp64+locking+pic+shared build_system=makefile symbol_suffix=none threads=openmp arch=linux-rhel8-icelake

and I can now observe the speedup in the orthogonalisation step too

mpirun -x OMP_NUM_THREADS=4 -x OMP_PLACES=threads -x OMP_PROC_BIND=spread -n 1 --bind-to none ./sim.fixed_cylinder_2d
Kokkos execution space: OpenMP
Number of threads: 4
*** Teuchos::StackedTimer::report() - Remainder for a level will be ***
*** incorrect if a timer in the level does not exist on every rank  ***
*** of the MPI Communicator.                                        ***
Flexible GMRES: 5.46512 [1]
|   RILUK::initialize: 0.68263 - 12.4907% [1]
|   RILUK::compute: 0.499326 - 9.13659% [1]
|   Belos: Operation Op*x: 0.00128426 - 0.0234993% [1]
|   Belos: BlockGmresSolMgr total solve time: 4.26543 - 78.0482% [1]
|   |   Belos: ICGS[2]: Orthogonalization: 1.59607 - 37.4187% [238]
|   |   |   Belos: ICGS[2]: Ortho (Norm): 0.0121343 - 0.760262% [238]
|   |   |   Belos: ICGS[2]: Ortho (Inner Product): 0.973465 - 60.9915% [472]
|   |   |   Belos: ICGS[2]: Ortho (Update): 0.605409 - 37.9313% [472]
|   |   |   Remainder: 0.00505928 - 0.316984%
|   |   Belos: Operation Prec*x: 2.38311 - 55.8704% [236]
|   |   |   RILUK::apply: 2.38254 - 99.9762% [236]
|   |   |   Remainder: 0.000568038 - 0.023836%
|   |   Belos: Operation Op*x: 0.275299 - 6.4542% [237]
|   |   Remainder: 0.0109497 - 0.256707%
|   Remainder: 0.0164537 - 0.301067%

So this confirms what you were saying: SpMV ops are offloaded to Kokkos Kernels, while the dense operations go to the TPL BLAS.

Maybe there is still room for threading the preconditioner ops.

@cgcgcg
Contributor

cgcgcg commented Dec 10, 2024

It looks like RILUK is serial by default. You could try setting "fact: type" = "KSPILUK" when constructing the preconditioner to see if that performs better.
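A rough sketch of what that could look like, assuming the preconditioner is built through Ifpack2::Factory (types simplified to the Tpetra defaults; adapt to your scalar/ordinal choices):

// Sketch: request the Kokkos Kernels ILU(k) path in Ifpack2's RILUK via the
// "fact: type" parameter, instead of the serial default factorization.
#include <Ifpack2_Factory.hpp>
#include <Teuchos_ParameterList.hpp>
#include <Teuchos_RCP.hpp>
#include <Tpetra_CrsMatrix.hpp>

using crs_matrix_type = Tpetra::CrsMatrix<>;  // default Scalar/LO/GO/Node
using row_matrix_type = Tpetra::RowMatrix<
    crs_matrix_type::scalar_type, crs_matrix_type::local_ordinal_type,
    crs_matrix_type::global_ordinal_type, crs_matrix_type::node_type>;

auto makeRilukPrec(const Teuchos::RCP<const row_matrix_type>& A)
{
  auto prec = Ifpack2::Factory::create<row_matrix_type>("RILUK", A);

  Teuchos::ParameterList params;
  params.set("fact: type", "KSPILUK");  // switch from the serial factorization to SpILUK
  prec->setParameters(params);          // other "fact: ..." options can stay as before

  prec->initialize();
  prec->compute();
  return prec;
}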

@lucbv
Contributor

lucbv commented Dec 11, 2024

KSPILUK might bring some benefit, but it is still limited by the nature of the ILU factorization, so I am not expecting scaling nearly as good as for dot products or sparse matrix-vector products. That said, it is good that you can already find a reasonable amount of speedup : )

@mredenti
Author

mredenti commented Dec 11, 2024

Yes, this is already an improvement, and at least now I know what was wrong. I will try KSPILUK out of curiosity, but I will also need to check the implications in terms of convergence. I am still curious to compare the performance of the TPL BLAS with Kokkos Kernels' own implementation, although I suspect the TPL BLAS can also take advantage of vectorization, while I am not sure this can be achieved by Kokkos Kernels' own BLAS implementation.
