
Faster Bloch GPU #537

Open · wants to merge 1 commit into master
Conversation

rkierulf (Collaborator)

To keep my GPU programming skills sharp while applying to jobs, I've been working on a faster version of the Bloch GPU code I wrote this past summer. The first thing I noticed was that my earlier profiling wasn't entirely correct: in Julia, GPU kernel launches are asynchronous, but they can appear synchronous because kernels execute in the same order they are submitted. So I also needed to add KA.synchronize(backend) before each @signpost_event to see which parts were actually taking the longest.
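A minimal illustration of the effect (arbitrary array and operation, not the actual KomaMRI instrumentation):

using Metal, KernelAbstractions
const KA = KernelAbstractions

A = MtlArray(rand(Float32, 10_000, 1_000))
backend = KA.get_backend(A)

# The broadcast returns as soon as the kernel is queued, so this mostly
# measures launch overhead:
t_launch = @elapsed (A .= 2f0 .* A)

# Synchronizing inside the timed region measures the actual execution time,
# which is why KA.synchronize(backend) is needed before each @signpost_event:
t_real = @elapsed begin
    A .= 2f0 .* A
    KA.synchronize(backend)
end

With the extra synchronization in place, the main parts taking the longest were these lines: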

Δϕ .= (Bz[:,2:end] .+ Bz[:,1:end-1]) .* Δt .* T(-π .* γ)

pre.Mxy[:,2:end] .= M.xy .* exp.(-seq_block.tp_ADC' ./ p.T2) .* cis.(ϕ_ADC)

pre.φ .= T(-π .* γ) .* (pre.B[:,1:end-1] .* seq.Δt')

I think that with the slicing of the Bz, B, and M.xy matrices, it is too hard for the compiler to generate efficient kernels for these operations that don't allocate new memory, so writing simple kernels for them should be faster. Roughly what I mean is sketched below.
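As a sketch only (the names, shapes, and the assumption that Δt is indexed per time step are mine, not the exact PR code), the first line above could become a kernel that indexes Bz directly instead of materializing the two slices:

using KernelAbstractions

# One thread per (spin, time-step) pair; no temporary slices are allocated.
@kernel function dphi_kernel!(Δϕ, @Const(Bz), @Const(Δt), γπ)
    i, j = @index(Global, NTuple)
    @inbounds Δϕ[i, j] = (Bz[i, j+1] + Bz[i, j]) * Δt[j] * γπ
end

# Launched over an (Nspins, Nt-1) grid, e.g.:
# dphi_kernel!(backend)(Δϕ, Bz, Δt, T(-π * γ); ndrange=(size(Bz, 1), size(Bz, 2) - 1))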

The other major thing I noticed was that the previous approach of distributing work to one thread per (time point, spin index) pair is likely not optimal, since the implementation is memory bound. Each time we create a new matrix, as in the line below:

pre.ΔT1 .= exp.(-seq.Δt' ./ p.T1)

this requires (number of spins × number of time points) writes to global GPU memory, plus the same number of reads later. In theory, only (number of spins + number of time points) global memory reads are needed. Consider the following cumsum example:

using Metal, BenchmarkTools, KernelAbstractions

A = MtlArray(rand(Float32, (10000, 1000)))
B = similar(A)

@kernel function naive_cumsum!(B, @Const(A))
    # One thread per row: accumulate serially across the columns.
    i = @index(Global)
    T = eltype(A)

    cur_val = zero(T)
    for k ∈ 1:size(A, 2)
        @inbounds cur_val += A[i, k]
        @inbounds B[i, k] = cur_val
    end
end

@benchmark Metal.@sync cumsum!(B, A; dims=2)

BenchmarkTools.Trial: 402 samples with 1 evaluation per sample.
 Range (min … max):  11.520 ms …  21.151 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     12.301 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   12.433 ms ± 688.393 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                ▁▃ ▄█▄▇▂▆▄         ▁  ▁     ▅▁                  
  ▃▃▆▄▁▁▁▁▃▁▁▃▄▃██▆████████▇▅▅▄▆▅▄▆█████▆▆▅▇███▅▃▃▁▃▃▃▁▁▃▃▁▃▁▃ ▄
  11.5 ms         Histogram: frequency by time         13.4 ms <

 Memory estimate: 7.33 KiB, allocs estimate: 309.

@benchmark Metal.@sync naive_cumsum!(MetalBackend())(B, A, ndrange=size(A,1))

BenchmarkTools.Trial: 1836 samples with 1 evaluation per sample.
 Range (min … max):  1.690 ms …   3.963 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.864 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.713 ms ± 547.521 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                    ▇█                         
  ▂▂▄▂▃▃▂▃▄▅▅▅▄▄▇▅▃▃▃▂▃▃▂▂▂▂▂▃▄▄▄▃▄▄██▆▄▄▃▂▃▂▃▃▁▂▂▂▂▃▃▂▂▂▂▃▅▃ ▂
  1.69 ms         Histogram: frequency by time        3.76 ms <

 Memory estimate: 5.16 KiB, allocs estimate: 216.

The "naive" version outperforms the default Metal accumulate function which does a tree-based reduction even though it uses far less threads, since both kernels are memory bound and the naive version has a more efficient memory access pattern for this example.

With this in mind, I've rewritten the run_spin_precession! and run_spin_excitation! functions with a focus on minimizing the total number of global memory reads, so that in each block only the phantom and sequence arrays are read from global memory, and temporary values calculated from them are kept in registers rather than in GPU global memory. For the signal output we do still need a matrix of size (number of spins × number of time points) in global memory, but this can be reduced by a factor of the block size by doing a block-level reduction in shared memory before writing to the output; a sketch of that idea is below.
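A minimal sketch of the shared-memory reduction idea with KernelAbstractions (the names partial and Mxy, the fixed workgroup size of 256, and the single-time-point layout are assumptions for illustration, not the PR's actual kernel):

using KernelAbstractions

@kernel function block_sum_signal!(partial, @Const(Mxy))
    i  = @index(Global, Linear)          # spin index
    li = @index(Local,  Linear)          # index within the workgroup
    g  = @index(Group,  Linear)          # workgroup index

    # Stage each thread's contribution in shared memory (one slot per thread).
    shared = @localmem ComplexF32 (256,)
    @inbounds shared[li] = i <= length(Mxy) ? Mxy[i] : zero(ComplexF32)
    @synchronize

    # The first thread of each workgroup accumulates the group's values and does
    # the single global write (a tree reduction would be faster, but this keeps
    # the sketch short), so global traffic drops by a factor of the block size.
    if li == 1
        acc = zero(ComplexF32)
        for k in 1:256
            @inbounds acc += shared[k]
        end
        @inbounds partial[g] = acc
    end
end

# Possible launch: one partial value per workgroup, then a small final sum.
# ngroups = cld(length(Mxy), 256)
# block_sum_signal!(backend, 256)(partial, Mxy; ndrange=ngroups * 256)
# signal = sum(partial)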

Testing the changes on my Mac M1, the benchmarks are about 3-5x faster, and they seem to make the most difference with a large phantom object - for example, if I remove the [1:10,000] part when setting up the slice selection benchmark so that it uses the full 3D brain phantom, it is over 10x faster. There is also a strange issue where two of the tests fail, but only in my test environment - if I copy-paste the test body into its own script, it produces the expected results. I haven't been able to figure out the reason for this yet.

@rkierulf requested a review from cncastillo as a code owner on January 23, 2025 at 23:49
@cncastillo (Member)

Wow, amazing! Thanks @rkierulf! Currently the CI is a little messed up, so I cannot merge it straight away. I benchmarked the previous implementation with this code:

GPU_Benchmarks_ISMRM2025.zip

And got the scaling results shown in the attached plot.

The previous method didn't scale very well. Do you think the changes you have made can improve this? There are clearly two regimes, flat and linear growth; I'm not sure whether they correspond to the memory-bound and compute-bound regimes. What are your thoughts on this?
