
TpetraBelos: solver time not decreasing with increased threading #13665

Open
mredenti opened this issue Dec 10, 2024 · 13 comments

@mredenti

mredenti commented Dec 10, 2024

Question

Hi,
I’m working on a finite element application that uses Trilinos (Tpetra, Belos) for the solution phase. The assembly of my system is parallelized with Kokkos and, as I increase the number of threads (using OpenMP), I see a significant decrease in the assembly time, which is great. However, when it comes to the Tpetra/Belos solver step (block GMRES), the solver time does not seem to improve at all as I increase the number of threads. I really do not understand this. Even with bundled examples such as packages/belos/tpetra/example/BlockGmres/Belos_Tpetra_BlockGmres_Galeri_Ex.exe, I do not observe any change in timings with OpenMP threading.

Is there some required configuration option or runtime parameter to enable threading in Belos’ iterative solvers that I am missing?
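For context, this is roughly how the application initializes Kokkos/Tpetra and reports the execution space and thread count (a minimal sketch, not my exact code):

// Minimal sketch: initialize MPI + Kokkos through Tpetra and report which
// execution space and how many threads are actually in use at runtime.
#include <iostream>
#include <Kokkos_Core.hpp>
#include <Tpetra_Core.hpp>

int main(int argc, char* argv[]) {
  Tpetra::ScopeGuard tpetraScope(&argc, &argv);  // initializes MPI and Kokkos
  {
    using exec_space = Kokkos::DefaultExecutionSpace;
    std::cout << "Kokkos execution space: " << exec_space::name() << "\n"
              << "Number of threads: " << exec_space().concurrency() << std::endl;
    // ... Kokkos-parallel assembly, then Tpetra/Belos solve ...
  }
  return 0;
}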

@cgcgcg
Contributor

cgcgcg commented Dec 10, 2024

@trilinos/belos

@cgcgcg
Contributor

cgcgcg commented Dec 10, 2024

@mredenti What BLAS library are you using? This info should be in the CMake configure output.

@mredenti
Author

mredenti commented Dec 10, 2024

@mredenti What BLAS library are you using? This info should be in the CMake configure output.

@cgcgcg I am using OpenBLAS version 0.3.21:

-- Found BLAS:  <>/linux-rhel8-icelake/gcc-11.3.0/openblas-0.3.21-rihdzrfndzafks2zngo4tkiyw23ayeza/lib/libopenblas.so

I would expect Belos to parallelise via Kokkos, just as I am currently doing for the assembly step. Are you suggesting that threading support should come from the BLAS library? I guess Kokkos Kernels does interface to BLAS, unless there is a way to tell Trilinos to use Kokkos Kernels' own implementations of the BLAS operations?

@cgcgcg
Contributor

cgcgcg commented Dec 10, 2024

Correct, I would expect thread parallelism for SpMV and dense operations to come from Kokkos Kernels. Could you check if you have OPENBLAS_NUM_THREADS set in your environment, as well as OMP_NUM_THREADS?

EDIT: Looks like you're using a Spack-installed OpenBLAS. Could you check whether OpenMP threading was enabled in it?
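One way to check (a small standalone sketch, assuming the library exposes the usual OpenBLAS runtime queries; compile it and link against the same libopenblas.so):

// Hypothetical standalone check: ask the OpenBLAS runtime how it was built
// and how many threads it will use. openblas_get_parallel() returns
// 0 for a sequential build, 1 for pthreads, 2 for OpenMP.
#include <cstdio>

extern "C" {
int   openblas_get_parallel(void);
int   openblas_get_num_threads(void);
char* openblas_get_config(void);
}

int main() {
  std::printf("OpenBLAS config: %s\n", openblas_get_config());
  std::printf("Parallel model : %d (0 = serial, 1 = pthreads, 2 = OpenMP)\n",
              openblas_get_parallel());
  std::printf("Threads        : %d\n", openblas_get_num_threads());
  return 0;
}

Alternatively, the Spack spec of the installed package (e.g. spack find -v openblas) should show the threads= variant.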

@mredenti
Author

mredenti commented Dec 10, 2024

With a single thread

mpirun -x OPENBLAS_NUM_THREADS=1 -x OMP_NUM_THREADS=1 -x OMP_PLACES=threads -x OMP_PROC_BIND=spread -n 1 --bind-to none ./sim.fixed_cylinder_2d

I get

Kokkos execution space: OpenMP
Number of threads: 1
*** Teuchos::StackedTimer::report() - Remainder for a level will be ***
*** incorrect if a timer in the level does not exist on every rank  ***
*** of the MPI Communicator.                                        ***
Flexible GMRES: 10.0946 [1]
|   RILUK::initialize: 0.789323 - 7.81926% [1]
|   RILUK::compute: 0.561141 - 5.55882% [1]
|   Belos: Operation Op*x: 0.00354874 - 0.0351548% [1]
|   Belos: BlockGmresSolMgr total solve time: 8.72281 - 86.4107% [1]
|   |   Belos: ICGS[2]: Orthogonalization: 5.63935 - 64.6506% [238]
|   |   |   Belos: ICGS[2]: Ortho (Norm): 0.0136018 - 0.241195% [238]
|   |   |   Belos: ICGS[2]: Ortho (Inner Product): 3.25822 - 57.7764% [472]
|   |   |   Belos: ICGS[2]: Ortho (Update): 2.35588 - 41.7756% [472]
|   |   |   Remainder: 0.0116587 - 0.206739%
|   |   Belos: Operation Prec*x: 2.25213 - 25.8188% [236]
|   |   |   RILUK::apply: 2.25165 - 99.9787% [236]
|   |   |   Remainder: 0.00047899 - 0.0212683%
|   |   Belos: Operation Op*x: 0.810919 - 9.29654% [237]
|   |   Remainder: 0.02041 - 0.233984%
|   Remainder: 0.0177771 - 0.176105%

while with 4 threads

mpirun -x OPENBLAS_NUM_THREADS=4 -x OMP_NUM_THREADS=4 -x OMP_PLACES=threads -x OMP_PROC_BIND=spread -n 1 --bind-to none ./sim.fixed_cylinder_2d

I have

Kokkos execution space: OpenMP
Number of threads: 4
*** Teuchos::StackedTimer::report() - Remainder for a level will be ***
*** incorrect if a timer in the level does not exist on every rank  ***
*** of the MPI Communicator.                                        ***
Flexible GMRES: 9.58006 [1]
|   RILUK::initialize: 0.822165 - 8.58204% [1]
|   RILUK::compute: 0.559013 - 5.83518% [1]
|   Belos: Operation Op*x: 0.00131843 - 0.0137622% [1]
|   Belos: BlockGmresSolMgr total solve time: 8.18074 - 85.3934% [1]
|   |   Belos: ICGS[2]: Orthogonalization: 5.63029 - 68.8238% [238]
|   |   |   Belos: ICGS[2]: Ortho (Norm): 0.0134419 - 0.238742% [238]
|   |   |   Belos: ICGS[2]: Ortho (Inner Product): 3.22177 - 57.2221% [472]
|   |   |   Belos: ICGS[2]: Ortho (Update): 2.36602 - 42.0231% [472]
|   |   |   Remainder: 0.0290529 - 0.516011%
|   |   Belos: Operation Prec*x: 2.24144 - 27.3989% [236]
|   |   |   RILUK::apply: 2.24078 - 99.9709% [236]
|   |   |   Remainder: 0.000652138 - 0.0290947%
|   |   Belos: Operation Op*x: 0.28825 - 3.52351% [237]
|   |   Remainder: 0.0207603 - 0.25377%
|   Remainder: 0.0168208 - 0.175581%

so I still see no effect on the solver time. I should probably check whether the OpenBLAS installation has threading enabled, likely through pthreads.

@cgcgcg
Contributor

cgcgcg commented Dec 10, 2024

From your results, 1 thread:

|   |   Belos: Operation Op*x: 0.810919 - 9.29654% [237]

but 4 threads

|   |   Belos: Operation Op*x: 0.28825 - 3.52351% [237]

So it looks like SpMV does run using threads. The orthogonalization and the preconditioner, on the other hand, do not. I am not sure what amount of parallelism RILUK exposes, but ortho should speed up.

@mredenti
Author

mredenti commented Dec 10, 2024

Right, missed that :D

Ok, so kokkos-kernels is doing its job as you suggested - at least for the SpMV.

Just to recap, you would expect at least the ortho to speed up because it is being offloaded to kokkos-kernels?

I guess it would be useful to have a reference example that shows what is supposed to speed up... would you suggest looking into the threading support of the underlying BLAS library?

@cgcgcg
Contributor

cgcgcg commented Dec 10, 2024

Yes, ortho is a bunch of BLAS operations, so that should speed up with a threaded BLAS. RILUK does have Kokkos Kernels support, so there should be a way to get improvements, but it might depend on the settings that are used.

@mredenti
Author

Ok, I will look into this. Thank you for the suggestions.

I will also look into whether I can tell Kokkos Kernels not to interface to a third-party (TPL) BLAS but rather use its own BLAS implementations.

@mredenti
Author

mredenti commented Dec 10, 2024

@cgcgcg,

I have checked the OpenBLAS installation I was using and indeed no threading (neither OpenMP nor pthreads) had been enabled.

Thus, I've installed OpenBLAS with OpenMP threading:

spack install openblas@0.3.21%gcc@11.3.0~bignuma~consistent_fpcsr+fortran~ilp64+locking+pic+shared build_system=makefile symbol_suffix=none threads=openmp arch=linux-rhel8-icelake

and I can now observe the speedup in the orthogonalisation step too

mpirun -x OMP_NUM_THREADS=4 -x OMP_PLACES=threads -x OMP_PROC_BIND=spread -n 1 --bind-to none ./sim.fixed_cylinder_2d
Kokkos execution space: OpenMP
Number of threads: 4
*** Teuchos::StackedTimer::report() - Remainder for a level will be ***
*** incorrect if a timer in the level does not exist on every rank  ***
*** of the MPI Communicator.                                        ***
Flexible GMRES: 5.46512 [1]
|   RILUK::initialize: 0.68263 - 12.4907% [1]
|   RILUK::compute: 0.499326 - 9.13659% [1]
|   Belos: Operation Op*x: 0.00128426 - 0.0234993% [1]
|   Belos: BlockGmresSolMgr total solve time: 4.26543 - 78.0482% [1]
|   |   Belos: ICGS[2]: Orthogonalization: 1.59607 - 37.4187% [238]
|   |   |   Belos: ICGS[2]: Ortho (Norm): 0.0121343 - 0.760262% [238]
|   |   |   Belos: ICGS[2]: Ortho (Inner Product): 0.973465 - 60.9915% [472]
|   |   |   Belos: ICGS[2]: Ortho (Update): 0.605409 - 37.9313% [472]
|   |   |   Remainder: 0.00505928 - 0.316984%
|   |   Belos: Operation Prec*x: 2.38311 - 55.8704% [236]
|   |   |   RILUK::apply: 2.38254 - 99.9762% [236]
|   |   |   Remainder: 0.000568038 - 0.023836%
|   |   Belos: Operation Op*x: 0.275299 - 6.4542% [237]
|   |   Remainder: 0.0109497 - 0.256707%
|   Remainder: 0.0164537 - 0.301067%

So this confirms what you were saying: SpMV ops are offloaded to Kokkos Kernels, while the dense operations go to the TPL BLAS.

Maybe there is still room for threading the preconditioner ops.

@cgcgcg
Contributor

cgcgcg commented Dec 10, 2024

It looks like RILUK is serial by default. You could try setting "fact: type" = "KSPILUK" when constructing the preconditioner to see if that performs better.
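A rough sketch of what that could look like, assuming the preconditioner is built through Ifpack2::Factory (types simplified to the Tpetra defaults; adapt to your scalar/ordinal choices):

// Sketch: request the Kokkos Kernels ILU(k) path in Ifpack2's RILUK via the
// "fact: type" parameter, instead of the serial default factorization.
#include <Ifpack2_Factory.hpp>
#include <Teuchos_ParameterList.hpp>
#include <Teuchos_RCP.hpp>
#include <Tpetra_CrsMatrix.hpp>

using crs_matrix_type = Tpetra::CrsMatrix<>;  // default Scalar/LO/GO/Node
using row_matrix_type = Tpetra::RowMatrix<
    crs_matrix_type::scalar_type, crs_matrix_type::local_ordinal_type,
    crs_matrix_type::global_ordinal_type, crs_matrix_type::node_type>;

auto makeRilukPrec(const Teuchos::RCP<const row_matrix_type>& A)
{
  auto prec = Ifpack2::Factory::create<row_matrix_type>("RILUK", A);

  Teuchos::ParameterList params;
  params.set("fact: type", "KSPILUK");  // switch from the serial factorization to SpILUK
  prec->setParameters(params);          // other "fact: ..." options can stay as before

  prec->initialize();
  prec->compute();
  return prec;
}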

@lucbv
Contributor

lucbv commented Dec 11, 2024

KSPILUK might bring some benefit, but it is still limited by the nature of the ILU factorization, so I am not expecting scaling nearly as good as for dot products or sparse matrix-vector products. That said, it is good that you can already find a reasonable amount of speedup : )

@mredenti
Author

mredenti commented Dec 11, 2024

Yes, this is already an improvement, and at least now I know what was wrong. I will try KSPILUK out of curiosity, but I will also need to check the implications in terms of convergence. I am still curious to compare the performance of the TPL BLAS with Kokkos Kernels' own implementation, although I suspect the TPL BLAS can also take advantage of vectorization, while I am not sure this can be achieved by Kokkos Kernels' own BLAS implementation.
