Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve the performance of buoyancy_gradients #2530

Closed
7 of 9 tasks
szy21 opened this issue Jan 16, 2024 · 15 comments
Closed
7 of 9 tasks

Improve the performance of buoyancy_gradients #2530

szy21 opened this issue Jan 16, 2024 · 15 comments
Assignees

Comments

@szy21
Copy link
Member

szy21 commented Jan 16, 2024

#2456 makes all the dycore simulations slower by a factor of two. When looking at it more closely, it seems most of the slowdown is from buoyancy_gradients. As a temporary solution, the mixing length and thus buoyancy gradients calculation is moved to a callback in #2466 without EDMF. With EDMF, this is still called at each timestep. It would be good if we can improve the performance of buoyancy_gradients. cc @charleskawczynski

Tasks

Preview Give feedback
@charleskawczynski
Copy link
Member

charleskawczynski commented Feb 12, 2024

Update on this: the performance of buoyancy_gradients is poor, however, the full story is that two kernels are expensive, from calling cloud_fraction_model_callback!:

The NVTX trace, showing the relative performance hit compared to other high level kernels:
Screen Shot 2024-02-12 at 6 45 11 PM

And the flame graph, showing what's being called with more granularity:

Screen Shot 2024-02-12 at 6 47 04 PM

So, the buoyancy_gradients is about 33% responsible for the cost of adding the call of cloud_fraction_model_callback!, and the quad_loop call is about 63% responsible. There are a few other broadcast expressions, but they are relatively inexpensive, so I'll focus on these two.

I think, to start with, we should simply break up buoyancy_gradients and try to see if/when there's a catastrophic performance improvement. Just eye-balling it, I'd say that we should hoist out ᶜgradᵥ(ᶠinterp(ϕ)) for both broadcast expressions, and see where that gets us. At the very least, we could reuse these precomputed quantities.

We may need to apply some RootSolvers/Thermodynamics optimizations (e.g., CliMA/RootSolvers.jl#51) to reduce the arithmetic intensity of those kernels (they are very expensive).

Also, next, I'll add a gpu job and see what the performance of this kernel looks like on the gpu, since that's probably a more important target to optimize.

cc @szy21, @trontrytel, @tapios

@szy21
Copy link
Member Author

szy21 commented Feb 13, 2024

That's interesting, thanks! I reached the conclusion that buoyancy_gradients is more expensive than quad_loop as when I commented out buoyancy_gradients the time-to-solution is similar to commenting out the entire cloud_fraction_model_callback!, when I first noticed the problem. But maybe that's not an accurate way to estimate it, or maybe we improved the performance of buoyancy_gradients since then.

@trontrytel
Copy link
Member

Sounds good. If there is anything easy to optimize in Thermodynamics, I think we should start with it. We are calling each function 9 times when running with quadrature points. So even small improvements should show up

@charleskawczynski
Copy link
Member

Here are some updates (cc @szy21):

  • I've converted the cloud diagnostics job into a gpu job (Make cloud diagnostics job run on gpu #2684), and the kernel is significantly cheaper (our gpu implementation is much better than our cpu implementation). So, one solution is to just switch more jobs over to the gpu.
  • The quad loop has optimization room in Thermodynamics / RootSolvers (by doing less work), which will improve cpu and gpu performance
  • The buoyancy gradients kernel may have optimization room in using shared memory for our FD kernels. This will mean fewer reads/writes, which are serial on the cpu, and parallel on the gpu (i.e., improve global memory traffic). Using shared memory for our FD kernels is already on our perf road map: Performance roadmap #2632, so it's not just the buoyancy gradients kernel that will benefit.

@szy21
Copy link
Member Author

szy21 commented Feb 26, 2024

Could you post the time-to-solution for the job with and without cloud diagnostics on GPU? Other than that I'm ok with closing this issue. Thanks for all the work!

@charleskawczynski
Copy link
Member

Yep, from this PR (with 1 p100 gpu):

[ Info: solve!: 450.030 s (149709528 allocations: 19.45 GiB)
[ Info: sypd: 5.259918262132573
[ Info: wall_time_per_timestep: 234 milliseconds, 390 microseconds

So, it's actually not bad.

@szy21
Copy link
Member Author

szy21 commented Feb 26, 2024

Yep, from this PR (with 1 p100 gpu):

[ Info: solve!: 450.030 s (149709528 allocations: 19.45 GiB)
[ Info: sypd: 5.259918262132573
[ Info: wall_time_per_timestep: 234 milliseconds, 390 microseconds

So, it's actually not bad.

And how about the one without cloud diagnostics on GPU?

@charleskawczynski
Copy link
Member

Good question, I'll convert the other one, too in that PR so that we can compare.

@charleskawczynski
Copy link
Member

Without cloud diagnostics, we have:

[ Info: solve!: 442.539 s (150425525 allocations: 19.44 GiB)
[ Info: sypd: 5.348959731633479
[ Info: wall_time_per_timestep: 230 milliseconds, 489 microseconds

cc @szy21

@szy21
Copy link
Member Author

szy21 commented Feb 27, 2024

Great, thanks!

@charleskawczynski
Copy link
Member

@tapios mentioned that this is still an issue, so I'm reopening.

@charleskawczynski
Copy link
Member

We should get updated numbers.

@tapios
Copy link
Contributor

tapios commented Apr 17, 2024

For cloud diagnostics, the issue is not directly buoyancy gradients, but gradientes of moisture/enthalpy. This may be related to the buoyancy gradient issue though. @szy21 knows more.

@szy21
Copy link
Member Author

szy21 commented Apr 18, 2024

The latest build has 2.03 SYPD for held suarez and 1.47 SYPD for held suarez with cloud diagnostics per stage, so ~30% difference (assuming they are using the same GPU on central). I don't know whether the slowdown is mostly from compute_gm_mixing_length! (which includes buoyancy_gradients) or quad_loop.

@charleskawczynski
Copy link
Member

The buoyancy gradients itself is now only 547 μs (xref: #2951 (comment)). Closing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants