Revert making :shared_memory the default
It can be much faster for high-density problems, but this is not the
case for low-density ones. Performance seems to be dominated by global
<-> shared memory transfers.
jipolanco committed Nov 4, 2024
1 parent 9e3f8ce commit 8cd0e91
Showing 3 changed files with 11 additions and 13 deletions.
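As the commit message notes, the shared-memory kernels remain available and can still pay off for dense problems. A minimal sketch of opting into them explicitly, assuming the package's public `PlanNUFFT` constructor and a CUDA backend (the grid size is illustrative):

```julia
using NonuniformFFTs
using CUDA  # provides CUDABackend()

Ns = (256, 256, 256)  # illustrative grid size

# After this commit, omitting gpu_method is equivalent to passing :global_memory.
plan_default = PlanNUFFT(ComplexF64, Ns; backend = CUDABackend())

# Dense problems can still request the shared-memory kernels explicitly:
plan_dense = PlanNUFFT(ComplexF64, Ns; backend = CUDABackend(), gpu_method = :shared_memory)
```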
5 changes: 1 addition & 4 deletions CHANGELOG.md
@@ -7,10 +7,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

### Changed

-- Make `gpu_method = :shared_memory` the default, as it seems to be faster than
-  the `:global_memory` method for a wide range of spreading widths.
-
-- Vastly improve performance of atomic operations (affecting type-1 transforms) on AMD
+- Improve performance of atomic operations (affecting type-1 transforms) on AMD
GPUs by using `@atomic :monotonic`.

- Change a few defaults on AMD GPUs to improve performance.
2 changes: 1 addition & 1 deletion src/NonuniformFFTs.jl
@@ -56,7 +56,7 @@ default_kernel_evalmode(::KA.Backend) = FastApproximation()
default_block_size(::Dims, ::CPU) = 4096 # in number of linear elements
default_block_size(::Dims, ::GPU) = 1024 # except in 2D and 3D (see below)

-# TODO: adapt this based on size of shared memory and on element type T (and padding 2M)?
+# TODO: adapt this based on size of shared memory and on element type T (and padding 2M - 1)?
default_block_size(::Dims{2}, ::GPU) = (32, 32)
default_block_size(::Dims{3}, ::GPU) = (16, 16, 4) # tuned on A100 with 256³ non-oversampled grid, σ = 2 and m = HalfSupport(4)

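The TODO above is about fitting one padded workgroup block into GPU shared memory. A back-of-the-envelope sketch of that constraint, assuming a padding of roughly `2M - 1` points per dimension (from the comment) and the ~48 KiB budget cited in the plan.jl docstring below; the helper is illustrative and not part of the package:

```julia
# Illustrative only: shared memory needed by one padded workgroup block of size `bs`,
# for element type T and kernel half-support M.
shared_memory_bytes(bs::Dims, ::Type{T}, M::Integer) where {T} =
    prod(bs .+ (2M - 1)) * sizeof(T)

shared_memory_bytes((16, 16, 4), ComplexF32, 4)  # ≈ 45 KiB: fits within a 48 KiB budget
shared_memory_bytes((16, 16, 4), ComplexF64, 4)  # ≈ 91 KiB: too large, so blocks must shrink with T
```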
17 changes: 9 additions & 8 deletions src/plan.jl
@@ -114,20 +114,21 @@ the order of ``10^{-7}`` for `Float64` or `ComplexF64` data.
- `gpu_method`: allows to select between different implementations of
GPU transforms. Possible options are:
-* `:global_memory`: directly read and write onto arrays in global memory in spreading
+* `:global_memory` (default): directly read and write onto arrays in global memory in spreading
(type-1) and interpolation (type-2) operations;
* `:shared_memory`: copy data between global memory and shared memory (local
to each GPU workgroup) and perform most operations in the latter, which is faster and
can help avoid some atomic operations in type-1 transforms. We try to use as much shared
memory as is typically available on current GPUs (which is typically 48 KiB on
-CUDA and 64 KiB on AMDGPU). This method can be much faster than `:global_memory`,
-especially for not too large spreading widths (up to `HalfSupport(6)` at least).
-Note that this method completely ignores the `block_size` parameter, as the actual block
-size is adjusted to maximise shared memory usage. When this method is enabled, one can
-play with the `gpu_batch_size` parameter (see below) to further tune performance.
+CUDA and 64 KiB on AMDGPU). Note that this method completely ignores the `block_size`
+parameter, as the actual block size is adjusted to maximise shared memory usage. When
+this method is enabled, one can play with the `gpu_batch_size` parameter (see below) to
+further tune performance.
-The default is `:shared_memory` but this may change in the future.
+For highly dense problems (number of non-uniform points comparable to the total grid
+size), the `:shared_memory` method can be much faster, especially when the `HalfSupport`
+is 4 or less (accuracies up to `1e-7` for `σ = 2`).
- `fftw_flags = FFTW.MEASURE`: parameters passed to the FFTW planner when `backend = CPU()`.
@@ -339,7 +340,7 @@ function _PlanNUFFT(
kernel_evalmode::EvaluationMode = default_kernel_evalmode(backend),
block_size::Union{Integer, Dims{D}, Nothing} = default_block_size(Ns, backend),
synchronise::Bool = false,
-gpu_method::Symbol = :shared_memory,
+gpu_method::Symbol = :global_memory,
gpu_batch_size::Val = Val(DEFAULT_GPU_BATCH_SIZE), # currently only used in shared-memory GPU spreading
) where {T <: Number, D}
ks = init_wavenumbers(T, Ns)
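The updated docstring says the shared-memory method pays off mainly for highly dense problems (number of non-uniform points comparable to the grid size). A hypothetical helper expressing that rule of thumb; the threshold of one point per grid node is an illustrative guess, not a package default:

```julia
# Not package code: pick a gpu_method from the point density, per the rule of thumb above.
choose_gpu_method(Np::Integer, Ns::Dims; threshold = 1.0) =
    Np ≥ threshold * prod(Ns) ? :shared_memory : :global_memory

choose_gpu_method(10^6, (256, 256, 256))       # low density  → :global_memory
choose_gpu_method(2 * 256^3, (256, 256, 256))  # highly dense → :shared_memory
```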
