To keep my GPU programming skills sharp while applying to jobs, I've been working on a faster version of the Bloch GPU code I wrote this past summer. The first thing I noticed while working on this was that my earlier profiling wasn't 100% correct - in Julia, GPU kernel launches are all asynchronous, but they appear to be synchronous since kernels execute in the same order they are submitted. So I also needed to add `KA.synchronize(backend)` before each `@signpost_event` to see which parts were actually taking the longest.
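To illustrate why the extra synchronization matters, here is a minimal standalone sketch (Metal.jl is assumed as the backend; the array size and variable names are arbitrary, not code from this PR):

```julia
import KernelAbstractions as KA
using Metal  # assumed backend; any KA backend behaves the same way

backend = Metal.MetalBackend()
x = KA.allocate(backend, Float32, 10_000_000)

t0 = time()
x .= exp.(x)             # the broadcast launches a GPU kernel asynchronously
launch = time() - t0     # nearly zero: only the kernel launch was timed

KA.synchronize(backend)  # wait for the queued kernel to actually finish
total = time() - t0      # now the elapsed time includes the kernel's execution
```

Without the synchronization, a timing marker placed right after the broadcast would attribute almost none of the elapsed time to it.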
The main parts taking the longest were these lines:

```julia
Δϕ .= (Bz[:,2:end] .+ Bz[:,1:end-1]) .* Δt .* T(-π .* γ)
pre.Mxy[:,2:end] .= M.xy .* exp.(-seq_block.tp_ADC' ./ p.T2) .* cis.(ϕ_ADC)
pre.φ .= T(-π .* γ) .* (pre.B[:,1:end-1] .* seq.Δt')
```
I think that with the slicing of the `Bz`, `B`, and `M.xy` matrices, it is too hard for the compiler to generate efficient kernels for these operations that don't allocate new memory, so writing simple kernels for these operations by hand should be faster.
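As a rough sketch of what such a hand-written kernel could look like for the Δϕ update (the kernel name and the assumed shapes - `Bz` of size number of spins × (number of time points + 1), `Δt` a vector of time steps indexed by time point - are illustrative, not the code in this PR):

```julia
using KernelAbstractions

# Fused, allocation-free kernel for the Δϕ update:
# one thread per (spin, time point) pair, no temporary arrays.
@kernel function compute_Δϕ!(Δϕ, @Const(Bz), @Const(Δt), γ)
    i, t = @index(Global, NTuple)
    T = eltype(Δϕ)
    Δϕ[i, t] = (Bz[i, t+1] + Bz[i, t]) * Δt[t] * T(-π * γ)
end

# Assumed usage:
# backend = get_backend(Δϕ)
# compute_Δϕ!(backend)(Δϕ, Bz, Δt, γ; ndrange=size(Δϕ))
# KernelAbstractions.synchronize(backend)
```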
The other major thing I noticed was that the previous approach of distributing work to one thread per time point and spin index is likely not optimal, since the previous implementation was memory bound. Each time we create a new matrix, as in the line below:

```julia
pre.ΔT1 .= exp.(-seq.Δt' ./ p.T1)
```

this requires *number of spins × number of time points* writes to global GPU memory, plus the same number of reads later. In theory, only *number of spins + number of time points* global memory reads are needed. Consider the following cumsum example: the "naive" version outperforms the default Metal accumulate function, which does a tree-based reduction, even though it uses far fewer threads, since both kernels are memory bound and the naive version has a more efficient memory access pattern for this example.
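The benchmark itself isn't reproduced here, but a minimal sketch of what a "naive" row-wise cumsum kernel along the time dimension might look like is below (names and usage are illustrative assumptions):

```julia
using KernelAbstractions

# One thread per spin (row); each thread walks its row serially.
# Far fewer threads than a tree-based scan, but since Julia arrays are
# column-major, consecutive threads touch adjacent elements at every step,
# so the global memory accesses coalesce nicely.
@kernel function naive_cumsum!(out, @Const(A))
    i = @index(Global)
    acc = zero(eltype(out))
    for t in 1:size(A, 2)
        acc += A[i, t]
        out[i, t] = acc
    end
end

# Assumed usage, timed against the default GPU accumulate mentioned above:
# out = similar(A)
# naive_cumsum!(get_backend(A))(out, A; ndrange=size(A, 1))
# KernelAbstractions.synchronize(get_backend(A))
```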
With this in mind, I've rewritten the `run_spin_precession!` and `run_spin_excitation!` functions with a focus on minimizing the total number of global memory reads, so that in each block only the phantom and sequence arrays are read from, and temporary values calculated from these arrays are stored in registers, not GPU global memory. For the signal output, we do still need a matrix of size *number of spins × number of time points* in global memory, but this can be reduced by a factor of the block size by doing a block-level reduction in shared memory before writing to the output (see the sketch at the end of this description).

Testing the changes on my Mac M1, the benchmarks are about 3-5x faster, but they seem to make the most difference with a large phantom object - for example, if I remove the [1:10,000] part when setting up the slice selection benchmark, so that it uses the full 3D brain phantom, the new version is over 10x faster. There is also a strange issue where two of the tests are failing, but only in my test environment - if I copy-paste the test body into its own script, it produces the expected results. I haven't been able to figure out the reason for this yet.
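To make the block-level reduction concrete, here is a rough sketch of the idea for a single time point (the fixed workgroup size of 256 and all names are assumptions, not the actual implementation in this PR):

```julia
using KernelAbstractions

const GROUP_SIZE = 256  # assumed fixed workgroup size

# Each workgroup sums its spins' transverse magnetization for time point t in
# shared memory, so only one partial sum per workgroup is written to global
# memory instead of one value per spin.
@kernel function block_sum_signal!(partial, @Const(Mxy), t)
    i  = @index(Global)
    li = @index(Local)
    g  = @index(Group)

    buf = @localmem eltype(partial) (GROUP_SIZE,)
    buf[li] = i <= size(Mxy, 1) ? Mxy[i, t] : zero(eltype(partial))
    @synchronize

    # Standard tree reduction within the workgroup (shared memory only).
    s = GROUP_SIZE ÷ 2
    while s >= 1
        if li <= s
            buf[li] += buf[li+s]
        end
        @synchronize
        s ÷= 2
    end

    if li == 1
        partial[g] = buf[1]  # single global memory write per workgroup
    end
end

# Assumed usage for one time point t; the per-workgroup partial sums are
# summed afterwards (their number is smaller by a factor of GROUP_SIZE):
# nblocks = cld(size(Mxy, 1), GROUP_SIZE)
# partial = KernelAbstractions.zeros(get_backend(Mxy), eltype(Mxy), nblocks)
# block_sum_signal!(get_backend(Mxy), GROUP_SIZE)(partial, Mxy, t; ndrange = nblocks * GROUP_SIZE)
# signal_t = sum(partial)
```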