Faster Bloch GPU #537
base: master
Conversation
Wow, amazing! Thanks @rkierulf! Currently the CI is a little messed up, so I cannot merge it straight away. I benchmarked the previous implementation with this code:

The previous method didn't scale very well. Do you think the changes you have made can improve this? There are clearly two regimes, flat and linear growth; I'm not sure if they correspond to the memory- and compute-bound regimes. What are your thoughts on this?
Interesting! My guess is that the transition from flat to linear growth begins when the simulation becomes large enough to fully saturate the resources of the GPU. After that, due to the memory-bound nature of the problem, the growth in time should be proportional to the number of additional reads + writes to GPU global memory per added spin index. Since the new implementation has far fewer GPU memory operations per spin index, I would expect the slope to be less steep. And hopefully it is more efficient overall, so the curve would shift somewhat to the right as well. I'd be curious to see the results of this script run again with the new changes. Regarding the CI failures:
Thanks for investigating. I will rerun the benchmarks using this PR when I have more time (moving to Stanford tomorrow, and then I have a conference in Washington DC). I believe the problem with Julia 1.11 and oneAPI is JuliaGPU/oneAPI.jl#473. Recently they changed the internals of all the GPU packages to be based on KA.jl. oneAPI.jl v2.0 should include these changes, so bumping the compat to include it should fix the problem. Don't mind the GitHub CI errors, @Stockless is currently fixing those.
To keep my GPU programming skills sharp while applying to jobs, I've been working on a faster version of the Bloch GPU code I wrote this past summer. The first thing I noticed while working on this was that my earlier profiling wasn't 100% correct - in Julia, GPU kernel launches are all asynchronous, but they appear to be synchronous since kernels execute in the same order as they are submitted. So I needed to also add `KA.synchronize(backend)` before each `@signpost_event` to see which parts were taking the longest. The main parts taking the longest were these lines:

```julia
Δϕ .= (Bz[:,2:end] .+ Bz[:,1:end-1]) .* Δt .* T(-π .* γ)
pre.Mxy[:,2:end] .= M.xy .* exp.(-seq_block.tp_ADC' ./ p.T2) .* cis.(ϕ_ADC)
pre.φ .= T(-π .* γ) .* (pre.B[:,1:end-1] .* seq.Δt')
```

I think that with the slicing of the Bz, B, and M.xy matrices, it is too hard for the compiler to generate efficient kernels for these operations without allocating new memory, so writing simple kernels for them will be faster.
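For example, the Δϕ line can be handled by a small KernelAbstractions kernel along these lines (a simplified sketch, not the exact kernel in this PR; the dummy arrays, the workgroup size of 256, and the stand-in γ value are only for illustration):

```julia
using KernelAbstractions
const KA = KernelAbstractions

# One thread per spin: the two Bz values per time step stay in registers and no
# temporary matrices are allocated. Δϕ is assumed to have one column per time
# step (one fewer than Bz), matching Bz[:,2:end] .+ Bz[:,1:end-1].
@kernel function phase_kernel!(Δϕ, @Const(Bz), @Const(Δt), c)
    i = @index(Global, Linear)            # spin index
    for t in 1:size(Δϕ, 2)
        Δϕ[i, t] = (Bz[i, t + 1] + Bz[i, t]) * Δt[t] * c
    end
end

# Example launch with dummy CPU arrays (on the GPU these would be device arrays):
Nspins, Nt = 10_000, 100
Bz = rand(Float32, Nspins, Nt)
Δt = rand(Float32, Nt - 1)
Δϕ = zeros(Float32, Nspins, Nt - 1)
γ_stand_in = 42.58f6                      # placeholder for the γ constant used above
backend = KA.get_backend(Bz)
phase_kernel!(backend, 256)(Δϕ, Bz, Δt, Float32(-π * γ_stand_in); ndrange = Nspins)
KA.synchronize(backend)
```

Since the matrices are column-major, adjacent spins are contiguous for each time point, so looping over time inside the kernel also keeps the reads and writes coalesced across threads.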
The other major thing I noticed was that the previous approach of trying to distribute work to one thread per time point and spin index is likely not optimal, since the previous implementation was memory bound. Each time we create a new matrix, as in the line below:

```julia
pre.ΔT1 .= exp.(-seq.Δt' ./ p.T1)
```

this requires `number of spins x number of time points` writes to global GPU memory, plus the same number of reads later. In theory, only `number of spins + number of time points` global memory reads are needed. Consider the following cumsum example: the "naive" version outperforms the default Metal accumulate function, which does a tree-based reduction, even though it uses far fewer threads, since both kernels are memory bound and the naive version has a more efficient memory access pattern for this example.
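A simplified sketch of that kind of comparison (not the exact benchmark script; the array sizes and the workgroup size of 256 are arbitrary):

```julia
using Metal, KernelAbstractions
const KA = KernelAbstractions

# "Naive" scan: one thread per row (spin), each walking its row sequentially.
# Every element is read and written exactly once, and for a fixed time index the
# accesses of adjacent threads are contiguous (column-major), i.e. coalesced.
@kernel function rowwise_cumsum!(out, @Const(A))
    i = @index(Global, Linear)            # row index
    acc = zero(eltype(out))
    for t in 1:size(A, 2)
        acc += A[i, t]
        out[i, t] = acc
    end
end

A   = MtlArray(rand(Float32, 200_000, 100))   # spins × time points
out = similar(A)
backend = KA.get_backend(A)

# Naive version (one thread per row):
rowwise_cumsum!(backend, 256)(out, A; ndrange = size(A, 1))
KA.synchronize(backend)

# Library version (tree-based scan) for comparison:
out2 = accumulate(+, A; dims = 2)
KA.synchronize(backend)
```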
With this in mind, I've rewritten the `run_spin_precession!` and `run_spin_excitation!` functions with a focus on minimizing the total number of global memory reads, so that in each block only the phantom and sequence arrays are read from, and temporary values calculated from these arrays are stored in registers, not GPU global memory. For the signal output, we do still need to use a matrix of size `number of spins x number of time points` in global memory, but this can be reduced by a factor of the block size by doing a block-level reduction in shared memory before writing to the output (see the sketch at the end of this description).

Testing the changes on my Mac M1, the benchmarks are about 3-5x faster, but they seem to make the most difference with a large phantom object - for example, if I get rid of the [1:10,000] part setting up the slice selection benchmark so it is using the full 3D brain phantom, then it is over 10x faster. There is also a strange issue where two of the tests are failing, but only in my test environment - if I copy-paste the test body into its own script, it produces the expected results. I haven't been able to figure out the reason for this yet.
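A rough sketch of the block-level reduction idea for one ADC time point (not the actual kernel from this PR; the workgroup size of 256, the serial in-block sum, and the dummy launch are simplifications):

```julia
using KernelAbstractions
const KA = KernelAbstractions

# Each workgroup stages the transverse magnetization of its spins in shared
# memory and writes a single partial sum to global memory, instead of one value
# per spin. The partial sums are combined afterwards (here on the host).
@kernel function reduce_signal!(partial, @Const(Mxy))
    i  = @index(Global, Linear)           # spin index (ndrange is padded)
    li = @index(Local, Linear)            # index within the workgroup
    gi = @index(Group, Linear)            # workgroup index

    shared = @localmem eltype(partial) (256,)    # must match the workgroup size
    shared[li] = i <= length(Mxy) ? Mxy[i] : zero(eltype(partial))
    @synchronize

    # Serial in-block sum by the first thread; a tree reduction would be faster,
    # but the global-memory traffic (one write per workgroup) is the same.
    if li == 1
        acc = zero(eltype(partial))
        for j in 1:256
            acc += shared[j]
        end
        partial[gi] = acc
    end
end

# Example launch with dummy data; the ndrange is padded to a multiple of the
# workgroup size so that every thread reaches @synchronize.
Mxy     = rand(ComplexF32, 100_000)
ngroups = cld(length(Mxy), 256)
partial = zeros(ComplexF32, ngroups)
backend = KA.get_backend(Mxy)
reduce_signal!(backend, 256)(partial, Mxy; ndrange = ngroups * 256)
KA.synchronize(backend)
signal_sample = sum(partial)              # one signal sample for this time point
```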