Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Applying @time to simulate reveals discrepancies on resources and timing reports #392

Closed
gsahonero opened this issue May 15, 2024 · 25 comments · Fixed by #444 or #431
Closed

Applying @time to simulate reveals discrepancies on resources and timing reports #392

gsahonero opened this issue May 15, 2024 · 25 comments · Fixed by #444 or #431

Comments

@gsahonero
Copy link
Member

gsahonero commented May 15, 2024

Hi,

I simulated some sequences using simulate and found that the timed data differed from what @time provides. Check this screenshot for reference:

image

There, the last line refers to the data that @time produces, and the previous one refers to what simulate provides.
To reproduce this, the following code can be used:

using KomaMRI

x_range = [0]
y_range = [0]
z_range = [0]
T1_range = [1000e-3]
T2_range = [35e-3]
T2s_range = [20e-3]
rho_range = [1]
obj = Phantom{Float64}(name="Test phantom", x = x_range, y = y_range, z = z_range, T1 = T1_range, T2 = T2_range, T2s = T2s_range, ρ=rho_range);

seq = read_seq("...\\ge.seq")

plot_seq(seq)
sys = Scanner()
@time raw_x = simulate(obj, seq, sys)

I think the differences are due to other functions unrelated to run_sime_time_iter which is the one with @time on the simulate function, but it would be useful to trace them and check what could be optimized even further (~2GB is way much larger than 300 MB!). Also, it would be useful to clarify that the timing information returned by simulate considers only the simulation. Thinking of end-users, they might get confused - like me - by the reported and the perceived time.

Best,
Guillermo

@cncastillo
Copy link
Member

As you said, there are some preprocessing steps before the simularion that are not accounted for in the reported time. As run_sime_time_iter is the actual simulation I think is fair to time that function, if the preprocessing is taking 2 seconds that is a problem, and needs to be optimized.

Can you send me the sequence you are using?

@gsahonero
Copy link
Member Author

I used the pulseq gradient echo sequence in the examples folder for the screenshot I shared. However, I have two other sequences with different ADC durations that show increased times and resources too. I'm attaching them here:

sequences.zip

The zip contains a JLD2 file with a dictionary including two sequences.

@gsahonero
Copy link
Member Author

Hello there,

By checking simulate, this line seems to take a lot of time by itself:

if sim_params["gpu"] GC.gc(true); CUDA.reclaim() end

Since I'm using a dedicated GPU, it makes sense to have it executed all the time. CPU-only users don't need it and should not experience these significant timing discrepancies.

In any case, I wonder whether that line can be removed or skipped given that according to https://cuda.juliagpu.org/stable/usage/memory/#Garbage-collection, "There is no need for manual memory management, just make sure your objects are not reachable (i.e., there are no instances or references)."

@cncastillo
Copy link
Member

We could just remove that line, maybe it doesn't help. If you do a for loop of simulate's, and there is no memory problems without without that line, I would say we can just remove it. Related to #125.

@cncastillo
Copy link
Member

@gsahonero can you report if this did not generate any problems?

If you do a for loop of simulate's, and there is no memory problems without without that line, I would say we can just remove it. Related to #125.

@gsahonero
Copy link
Member Author

This past month, I have been using the modified version (without the garbage collection function) of simulate. I didn't find any problem. In any case, the discrepancies between reports are still there, but they no longer show that large gap. I think the documentation should state that the reported @time information belongs to the simulation only and that other processes are around too, i.e., state a clarification.

@cncastillo
Copy link
Member

Where would you put it in the docs? in the docstring for simulate?

@gsahonero
Copy link
Member Author

Yes! Actually, after checking https://juliahealth.org/KomaMRI.jl/dev/reference/3-koma-core/#KomaMRICore.simulate, I can see that there is no mention of the @time prints. Understanding what is happening with simulate prints could be useful.

@cncastillo
Copy link
Member

Sounds like a good first PR 😄 (You can do it directly on GitHub by pressing the ✏️ icon)

https://github.com/JuliaHealth/KomaMRI.jl/blob/master/KomaMRICore/src/simulation/SimulatorCore.jl

@gsahonero
Copy link
Member Author

The PR is there: #431 :)

@Tooine
Copy link

Tooine commented Jul 10, 2024

Hi,

I'm using KomaMRI to simulate MR Fingerprinting sequences. For such long sequences with multiple pulses, I observe a much longer preprocessing time than the simulation time, either on CPU or GPU.

For instance, on the simulation below, I get a simulation time of 9min12 while the total run is 49min, giving a preprocessing time of about 40min!
I used the following MRF sequence with 70 pulses:
2D_MRF_bSSFP_TR21_TE5_10_15_inv.zip
My phantom has about 1,000,000 points but it does not seem to explain the long preprocessing time (which is much less if I simulate with smaller sequences). In that case, it seems that the longest task is the preprocessing of the sequence (either seqd = seqd |> f32 or seqd = seqd |> gpu?).

image

I don't know how this could be optimized on a single run of the simulate function. Yet, if I want to simulate several phantoms with the same sequence, with the current implementation I have to suffer this precompilation time for each phantom. Maybe, there might be a way to modify the implementation to separate the precompilation from the simulation, in which case the sequence preprocessing could be done only once? (Same for the opposite case where one deals with a very large phantom and several sequences: the phantom could be preprocessed only once).

Anyway, it is a pleasure to use KomaMRI! :-) Thanks to the team for developing this powerful tool!

Best,
Antoine

PS1: My goal is to simulate much longer sequences (about 300 pulses) but it seems infeasible today as the preprocessing time is more than linear in the number of pulses: for 40 pulses I get approximately a 15 min preprocessing.

PS2: apart from that, when computing on CPU (with smaller sequences), I have trouble with parallel processing: an increase of the number of threads N_threads leads to an increase of the simulation time... I don't think it is linked, but maybe I'm facing technical issues that might also affect the precompilation of the simulate function...

@cncastillo
Copy link
Member

cncastillo commented Jul 10, 2024

Hi @Tooine, first of all, thanks for using Koma 😄!

That definitely sounds like a problem. Thanks for attaching the sequence so we can investigate.

Are you using the newest version of all the Koma-related packages? (v0.8.X) We fixed a huge slowdown when the sequences had too many ADC points in KomaCore v0.8.3. If you cannot update to v0.8.X, that probably means that you need to update Julia to >=1.9.

The CPU problem could be that you did not open Julia with multiple threads julia --threads=auto; you can check with Threads.nthreads().

Let me know if that helps, if not, I will fix it 💪. @rkierulf Maybe this is a good sequence to test the performance improvements coming to v0.9.

@cncastillo
Copy link
Member

cncastillo commented Jul 10, 2024

I forgot to answer this:

Maybe, there might be a way to modify the implementation to separate the precompilation from the simulation, in which case the sequence preprocessing could be done only once

You can already do that in a few ways (without waiting for me to fix the issue):

  1. Add the phantoms together obj_new = obj1 + obj2. The problem is that the signals of each phantom will be combined into one. You can fix this by using an approach similar to this PR created BlochVoxelDictSimulation simulation method #348 from @depedraza, where you define a new simulation method that does that. I haven't had time to include this in Koma yet, and sadly there is no documentation on how to extend the simulation methods. Basically, you would need to define:
struct MyMethod <: SimulationMethod end # You can add parameters to this

KomaMRICore.sim_output_dim(obj::Phantom{T}, seq::Sequence, sys::Scanner, sim_method::MyMethod) where {T<:Real}

KomaMRICore.run_spin_precession!(p::Phantom{T}, seq::DiscreteSequence{T}, sig::AbstractArray{Complex{T}}, M::Mag{T}, sim_method::MyMethod) where {T<:Real}

# Optional
KomaMRICore.run_spin_excitation!(p::Phantom{T}, seq::DiscreteSequence{T}, sig::AbstractArray{Complex{T}}, M::Mag{T}, sim_method::MyMethod) where {T<:Real}

and then

sim_param["sim_method"] = MyMethod()

Check KomaMRICore/src/simulation/Bloch/BlochSimulationMethod.jl (defines all functions) and KomaMRICore/src/simulation/Bloch/BlochDictSimulationMethod.jl (defines just a subset) as examples. You don't need to define all of the functions, only the ones you want to replace, Bloch() is the fallback in case the function is not defined.

Important

Note that you don't need to "touch" Koma's source code to extend it! You can do it completely externally.

  1. Use the BlochDict simulation method and add the magnetizations externally to get the signal https://juliahealth.org/KomaMRI.jl/dev/explanation/3-simulation/#BlochDict-Simulation-Method

  2. if instead of using simulate you use run_sim_time_iter! directly.

The first one would be the recommended method, as you have more flexibility. If you found the PR useful and can help (1) to clean it up so we can merge it or (2) help document the simulation method extensibility (it is a very powerful but unknown feature) as a tutorial (just create in examples/3.tutorials/lit-06-SimulationMethods.jl with a simple example), that would help greatly 😄.

@gsahonero
Copy link
Member Author

I agree with @cncastillo! You need to use the last version to reduce the simulation time. The speedup between 0.7x and 0.8x is significant.

Complementing

  1. Use the BlochDict simulation method and add the magnetizations externally to get the signal https://juliahealth.org/KomaMRI.jl/dev/explanation/3-simulation/#BlochDict-Simulation-Method

In this case, simulate requires you to provide N spins to have N output signals (using return_type="mat" on sim_params). Each signal is an independent simulated signal (as if you were simulating a point-like phantom). Moreover, I don't think you would like to handle 1M output signals! So, you should think about your phantom again and understand how to use those N output signals for MRF.

Finally, please, check whether the GPU garbage collection is not affecting the performance when you use a GPU. For this, check this line if sim_params["gpu"] GC.gc(true); CUDA.reclaim() end in SimulatorCore.jl. If it exists and it is still being run, then comment that line, replace the simulate function with that modified version, and check whether performance is improved.

@Tooine
Copy link

Tooine commented Jul 11, 2024

Hi,

thanks for those quick and detailed answers!

First of all, the CPU problem is solved, I'm using Julia on a Jupyter Notebook with VS Code and indeed, it required a Julia environment with a specified number of threads. I'm not a Julia expert, so I missed this, thanks!

Are you using the newest version of all the Koma-related packages? (v0.8.X) We fixed a huge slowdown when the sequences had too many ADC points in KomaCore v0.8.3. If you cannot update to v0.8.X, that probably means that you need to update Julia to >=1.9.

Yes, I'm using KomaMRI v0.8.2 with Julia v1.10.2. Looking at my Julia packages, I see that I get

  1. KomaMRICore v0.8.3 which is a sub-package of KomaMRI v0.8.2,
  2. KomaMRICore v0.8.2 as a package by itself.

In my code, I only imported KomaMRI, but I don't know which version of KomaMRICore is called when using simulate. I though it was the option 1, but to be sure I updated option 2 to 0.8.3 and then relaunched a simulation.

  • The preprocessing time did not improve. Actually, it seems that the simulation time increased a little bit to 11min30, with very different gc time and compilation time ... (not sure I used exactly the same sequence, but the sequence length is the same)
  • if I interrupt the simulation, I see in the log that the simulation is using option 2, i.e., KomaMRICore as a separate package (which is quite unexpected I guess?).

So, in a nutshell (not sure I was very clear), I'm now sure to use KomaMRICore v0.8.3 and this doesn't improve performance.

image

Maybe, there might be a way to modify the implementation to separate the precompilation from the simulation, in which case the sequence preprocessing could be done only once

You can already do that in a few ways (without waiting for me to fix the issue):

Thanks for highlighting those possibilities. Currently, I'm more in the setting where I get one phantom and several sequences to test, but this might be interesting for me in the future (in which case I can try to submit a PR, but I don't know if i have the necessary Julia skills for this).

Finally, please, check whether the GPU garbage collection is not affecting the performance when you use a GPU. For this, check this line if sim_params["gpu"] GC.gc(true); CUDA.reclaim() end in SimulatorCore.jl. If it exists and it is still being run, then comment that line, replace the simulate function with that modified version, and check whether performance is improved.

Ok, I'm going to check this now. I can see this line in the simulate function but it is after the precompilation (and even after the simulation in fact) so I don't know if there is a chance of improvement?

@gsahonero
Copy link
Member Author

Hi,

Just a quick answer: that GPU garbage collection line may take a long time and resources. I understand the line will be removed soon (@cncastillo, I can send a PR just for this if needed). If you want to look deeper at the time and resources to profile what is taking too long for your simulation, use @time in some lines of simulate inside SimulatorCore.jl and run your simulation (replacing the default version of the function). That will throw some light.

In any case, simulate is always slower the first time is used. Following calls should be faster.

@cncastillo
Copy link
Member

cncastillo commented Jul 11, 2024

Yeah I agree that the GC line is not worth checking.

I will profile your sequence during the weekend. (Btw for this I recommend to use @profview in VSCode instead a sprinkling @times's. I reaaaally suggest reading this website https://modernjuliaworkflows.github.io/ full of tips and tricks!)

My intuition is that the problem comes from seqd = discretize(seq), there are some low hanging fruits that I could tackle to speed it up.

@JanWP
Copy link

JanWP commented Jul 11, 2024

Hi all, I'm a colleague of @Tooine

I just looked into this and it seems the problem is in the discretization of the sequence, and more specifically in get_samples.

I think the issue may be the one encountered in this thread on the Julia Discourse.

Using the sequence attached by @Tooine, I see the following:
@btime t_rf = reduce(vcat, T0[i] .+ times(seq.RF[1,i], :A) for i in range)
368.622 s (458569 allocations: 1029.27 GiB)

Whereas by modifying the line slightly according to one of the proposed alternatives in that thread, I get:
@btime t_rf2 = reduce(vcat, collect(T0[i] .+ times(seq.RF[1,i], :A) for i in range))
89.593 ms (377652 allocations: 325.26 MiB)

Disclaimer: I'm an absolute Julia Newbie, so even though I checked the results were identical for our case, there may be something I'm missing.

@cncastillo
Copy link
Member

Oh god! That could make sense. I believe a similar problem was detected by @gabuzi and fixed in #220, but in that case, the problem was a little different.

Could you check if

 @btime t_rf3 = reduce(vcat, [T0[i] .+ times(seq.RF[1,i], :A) for i in range])

also fixes the problem?

We can push a fix for that very easily!

@JanWP
Copy link

JanWP commented Jul 12, 2024

IIRC I tried exactly that line too, and the result was off. My interpretation was that the shape of [...] and collect(...) seemed to be different, but I didn't look closer. I'll try to find some time to check again later today.

@Tooine
Copy link

Tooine commented Jul 12, 2024

Hey,

thank you all for your consideration of the issue, and thank you @JanWP for probably identifying its cause!

I didn't have many time to test the proposed solutions this day, and I will be off for a few weeks, so don't worry if the problem is not solved immediately.

@cncastillo
Copy link
Member

cncastillo commented Jul 12, 2024

I believe that it doesn't hurt to push the reduce(vcat, collect(...)) fix immediately.

We run automated benchmarking and will notice if it generates any performance/accuracy regression. I will push a branch with the fix using KomaMRICore 0.0.9-DEV (that has the benchmarks setup), and if everything looks good, backport it to KomaMRICore v0.8.4.

@cncastillo
Copy link
Member

collect(...) and [...] are lowered to the same:

julia> Meta.@lower [i for i in 1:10]
:($(Expr(:thunk, CodeInfo(
    @ none within `top-level scope`
1%1 = 1:10%2 = Base.Generator(Base.identity, %1)
│   %3 = Base.collect(%2)
└──      return %3
))))

julia> Meta.@lower collect(i for i in 1:10)
:($(Expr(:thunk, CodeInfo(
    @ none within `top-level scope`
1%1 = 1:10%2 = Base.Generator(Base.identity, %1)
│   %3 = collect(%2)
└──      return %3
))))

Also

julia> @btime reduce(vcat, collect(ones(100) for i in 1:10_000));
  1.622 ms (10006 allocations: 16.33 MiB)

julia> @btime reduce(vcat, [ones(100) for i in 1:10_000]);
  1.622 ms (10006 allocations: 16.33 MiB)

I will now use [...], which looks a little cleaner.

@cncastillo
Copy link
Member

cncastillo commented Jul 13, 2024

Ok! The benchmarks are in! We can see a clear speed increase and reduced memory allocation size.

https://juliahealth.org/KomaMRI.jl/benchmarks/?option=memory

I just realized the changes are in KomaMRIBase, so I will tag a new non-breaking release for that sub-package (KomaMRIBase v0.8.5).

@JanWP
Copy link

JanWP commented Jul 15, 2024

I'm a bit late to the party, but I can confirm that this fixes our issue. I put a @time on the call to discretize, and with the MRF sequence by @Tooine I'm using for testing, the difference is night and day:

Before commit e7bfc1f, I get:
1827.977852 seconds (6.93 M allocations: 3.056 TiB, 18.10% gc time)

After the commit:
1.783379 seconds (6.22 M allocations: 1.614 GiB, 3.86% gc time)

Quite the improvement! Thanks for fixing this so quickly!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants