Test against cudnn? #30
Thanks, Chris, for your suggestion and your interest in ParallelStencil. We would be very happy to do some comparison against the cudnn wrappers. Do you have an example using cudnn that is similar to one of our ParallelStencil examples (e.g. heat diffusion, shallow ice...) which could serve as a good basis for creating a maximally equivalent code for comparison? BTW: Comparing against CuArray broadcasting also makes sense for us, because it is the standard way of doing array computations in Julia and it enables writing code that is syntactically very similar to code written with ParallelStencil. |
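For context, the CuArray-broadcasting flavour of such a comparison might look roughly like the sketch below (illustrative only, not code from this thread; the same expression runs on a plain Array or a CuArray):
using CUDA   # only needed for the GPU variant

# 7-point Laplacian of T written into the interior of T2, purely via broadcasting over views
function lap_broadcast!(T2, T)
    @views T2[2:end-1, 2:end-1, 2:end-1] .=
        T[3:end, 2:end-1, 2:end-1] .+ T[1:end-2, 2:end-1, 2:end-1] .+
        T[2:end-1, 3:end, 2:end-1] .+ T[2:end-1, 1:end-2, 2:end-1] .+
        T[2:end-1, 2:end-1, 3:end] .+ T[2:end-1, 2:end-1, 1:end-2] .-
        6 .* T[2:end-1, 2:end-1, 2:end-1]
    return T2
end

T  = CUDA.rand(Float32, 128, 128, 128)   # or rand(Float32, 128, 128, 128) on the CPU
T2 = similar(T)
lap_broadcast!(T2, T)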
I see that, but by far and away the most common way to do stencil calculations with CuArrays (and CUDA in general) is cudnn, so it's the baseline benchmark someone would look for. I don't have a nice testing code handy, but the docs are here: https://fluxml.ai/Flux.jl/stable/models/nnlib/#NNlib.conv You'd create the Laplacian weight matrix and send it to that. |
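A minimal sketch of that suggestion (illustrative only; it assumes NNlib's WHCN data layout and the flipped keyword of NNlib.conv, and uses the 2D 5-point Laplacian for brevity):
using NNlib

# 2D Laplacian as a 3x3 convolution kernel, shaped (kw, kh, channels_in, channels_out)
w = reshape(Float32[0 1 0; 1 -4 1; 0 1 0], 3, 3, 1, 1)

# a single-channel, single-sample image, shaped (width, height, channels, batch)
x = reshape(rand(Float32, 128, 128), 128, 128, 1, 1)

# flipped=true applies the kernel as written (cross-correlation), which is what a stencil wants
y = NNlib.conv(x, w; flipped = true)   # size (126, 126, 1, 1)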
We do not have any experience with cudnn, so we will most certainly not be able to create an example that uses cudnn in its most performant way, as needed for a meaningful comparison. Would you be willing to create such an example? We would then do the corresponding ParallelStencil example to compare against. If so, how about the shallow ice example - it is nonlinear and meaningful, but at the same time very simple and concise. Moreover, we should just compare the time for one iteration, else too many factors come into play (and we don't want to compare apples and oranges). |
IIRC it's just: |
Yes, just comparing one iteration makes sense. Also, you might need to go a little lower into the wrapper to find the in-place conv operation. |
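Concretely, the in-place path referred to here is NNlib.conv! together with DenseConvDims, the same calls used in the benchmarks further down (a small self-contained sketch):
using NNlib

x = rand(Float32, 128, 128, 1, 1)
w = reshape(Float32[0 1 0; 1 -4 1; 0 1 0], 3, 3, 1, 1)
dims = NNlib.DenseConvDims(x, w, flipkernel = true)   # records sizes, stride, padding, kernel flipping
y = Array{Float32}(undef, 126, 126, 1, 1)             # preallocated output
NNlib.conv!(y, x, w, dims)                            # writes into y, no output allocation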
LoopVectorization is CPU-only, but the cross-over point would be interesting. I'll definitely have benchmarks for training a few different (small) neural networks with NNlibCPU vs NNlibGPU, to get an idea of where the crossing-over point is for at least a couple different CPUs and GPUs. |
Yeah, I was thinking the CPU tests could be against LoopVectorization (since NNPACK is pretty bad) and the GPU tests against cudnn. I think just doing the 3D Laplacian in all cases on 128x128x128 sounds reasonable to me. I'm swapping out to a new desktop in the next week so I won't be able to run it for a bit though. |
Do you mean something other than this?
julia> [0;0;0;;0;1;0;;0;0;0;;;0;1;0;;1;-6;1;;0;1;0;;;0;0;0;;0;1;0;;0;0;0] # Julia 1.7 syntax
3×3×3 Array{Int64, 3}:
[:, :, 1] =
0 0 0
0 1 0
0 0 0
[:, :, 2] =
0 1 0
1 -6 1
0 1 0
[:, :, 3] =
0 0 0
0 1 0
0 0 0
(Or alternatively, the 27-point stencil?) For 128x128x128, we should probably also try a range of batch sizes. Although there wouldn't be a point if the GPU is already faster with a batch size of 1. I don't have a usable GPU, so I'd probably have to run that part on arctic. |
A 7-point stencil should be fine to start with. Maybe one could do both 2D and 3D. Domain resolution may be adapted upwards depending on what's needed to saturate the memory bandwidth (mainly on the GPU). ParallelStencil implements finite-difference building blocks for both 2D and 3D (FiniteDifferences2D and FiniteDifferences3D).
Yes, Laplacian (2D and 3D) should be fine. It would be relevant though to have the tests done with a variable (e.g. diffusion) coefficient, as that is closer to real applications. |
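To make concrete what a variable coefficient means for the comparison, a step of the form T2 = D .* lap(T) could look like the following (an untested sketch, not from the thread; D is a hypothetical spatially varying coefficient array of the same size as T):
using LoopVectorization

function vardiff_step!(T2, T, D)
    # 7-point Laplacian scaled pointwise by D; a single fixed-weight convolution (as in cudnn)
    # cannot express this directly, since the effective weights differ at every grid point.
    @tturbo for iz in 2:size(T,3)-1, iy in 2:size(T,2)-1, ix in 2:size(T,1)-1
        T2[ix,iy,iz] = D[ix,iy,iz] * (T[ix+1,iy,iz] + T[ix-1,iy,iz] +
                                      T[ix,iy+1,iz] + T[ix,iy-1,iz] +
                                      T[ix,iy,iz+1] + T[ix,iy,iz-1] -
                                      6*T[ix,iy,iz])
    end
    return T2
end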
Here are a couple benchmarks. First, 3d laplace on 128x128x128x1x1 -> 126x126x126x1x1:
using LoopVectorization, CUDA, NNlib, NNlibCUDA, BenchmarkTools
function laplace_sparse!(
out::AbstractArray{<:Any,3}, img::AbstractArray{<:Any,3}
)
@tturbo for j₁ ∈ axes(out,1), j₂ ∈ axes(out,2), j₃ ∈ axes(out,3)
s_0 = img[j₁, j₂ + 1, j₃ + 1]
s_1 = s_0 + img[j₁ + 2, j₂ + 1, j₃ + 1]
s_2 = s_1 + img[j₁ + 1, j₂, j₃ + 1]
s_3 = s_2 + img[j₁ + 1, j₂ + 2, j₃ + 1]
s_4 = s_3 + img[j₁ + 1, j₂ + 1, j₃ ]
s_5 = s_4 + img[j₁ + 1, j₂ + 1, j₃ + 2]
s_6 = s_5 - 6img[j₁ + 1, j₂ + 1, j₃ + 1]
out[j₁, j₂, j₃] = s_6
end
end
laplace_kern = reshape(cat(
Float32[0 0 0; 0 1 0; 0 0 0],
Float32[0 1 0; 1 -6 1; 0 1 0],
Float32[0 0 0; 0 1 0; 0 0 0];
dims=3
), (3,3,3,1,1));
img3d = rand(Float32, 128, 128, 128, 1, 1);
dcdlaplace = NNlib.DenseConvDims(img3d, laplace_kern, flipkernel = true);
img3d_squeezed = reshape(img3d, size(img3d)[1:3]);
out3d = Array{Float32}(undef, size(img3d_squeezed) .+ 1 .- size(laplace_kern)[1:3]);
cuimg3d = CuArray(img3d);
cuout3d = CuArray{Float32}(undef, size(img3d,1)+1-size(laplace_kern,1), size(img3d,2)+1-size(laplace_kern,2), size(img3d,3)+1-size(laplace_kern,3), size(laplace_kern,5), size(img3d,5));
culaplace_kern = CuArray(laplace_kern);
@time laplace_sparse!(out3d, img3d_squeezed);
CUDA.@time NNlib.conv!(cuout3d, cuimg3d, culaplace_kern, dcdlaplace);
reshape(Array(cuout3d), size(out3d)) ≈ out3d
@benchmark laplace_sparse!($out3d, $img3d_squeezed)
@benchmark CUDA.@sync NNlib.conv!($cuout3d, $cuimg3d, $culaplace_kern, $dcdlaplace)
Running on arctic3, I get:
julia> @benchmark laplace_sparse!($out3d, $img3d_squeezed)
BenchmarkTools.Trial: 10000 samples with 1 evaluations.
Range (min … max): 129.043 μs … 591.036 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 139.638 μs ┊ GC (median): 0.00%
Time (mean ± σ): 139.897 μs ± 10.116 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▃▇█▇▄▂
▂▂▃▅▆▅▃▂▂▂▂▃▃▅▇██████▇▅▄▃▃▃▃▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂▁▁▁▂▁▂▁▁▂▂▂▁▁▂ ▃
129 μs Histogram: frequency by time 166 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark CUDA.@sync NNlib.conv!($cuout3d, $cuimg3d, $culaplace_kern, $dcdlaplace)
BenchmarkTools.Trial: 10000 samples with 1 evaluations.
Range (min … max): 211.893 μs … 5.860 ms ┊ GC (min … max): 0.00% … 88.05%
Time (median): 276.628 μs ┊ GC (median): 0.00%
Time (mean ± σ): 278.146 μs ± 56.924 μs ┊ GC (mean ± σ): 0.19% ± 0.88%
▁ ▂ ▁ ▁ ▁▁▂▅▄▅█▇▄▅▅▂▁▅▃▂ ▁▁ ▂
█▅█▇▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁█▅▃██▄█████████████████▆▆▅▆██▆▅▄▄▃▁█ █
212 μs Histogram: log(frequency) by time 317 μs <
Memory estimate: 3.52 KiB, allocs estimate: 138.
julia> @benchmark laplace_sparse!($out3d, $img3d_squeezed)
BenchmarkTools.Trial: 10000 samples with 1 evaluations.
Range (min … max): 68.660 μs … 2.308 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 87.495 μs ┊ GC (median): 0.00%
Time (mean ± σ): 101.762 μs ± 36.557 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▇▆▄ ▂█ ▁▅▃ ▃▁
▂▄████▇▅▄▃███▂▂▂▂▂▂▂▃▄▂▂▂▂▂▁▂▂▁▂▂▁▂▃███▃▄██▄▂▂▂▂▂▁▂▂▁▂▁▁▁▁▂▂ ▃
68.7 μs Histogram: frequency by time 167 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
Not sure why the CPU time is so inconsistent on arctic3; I get either of the above, switching between results like the first vs the second from run to run. Running the CPU part on my 10980XE yields much more consistent results:
julia> @benchmark laplace_sparse!($out3d, $img3d_squeezed)
BenchmarkTools.Trial: 10000 samples with 1 evaluations.
Range (min … max): 36.103 μs … 195.001 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 38.425 μs ┊ GC (median): 0.00%
Time (mean ± σ): 38.488 μs ± 1.829 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▁▂▂▃▄▅██▆▇▇▆▆▅▅▅▄▃▂▁▁
▁▁▁▁▂▂▂▂▃▃▃▃▄▄▅▅▆▆▆▇██████████████████████▇▆▆▄▄▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁
36.1 μs Histogram: frequency by time 41.1 μs (top 1%)
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark laplace_sparse!($out3d, $img3d_squeezed)
BenchmarkTools.Trial: 10000 samples with 1 evaluations.
Range (min … max): 36.041 μs … 197.193 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 38.378 μs ┊ GC (median): 0.00%
Time (mean ± σ): 38.405 μs ± 1.809 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▁▁▂▄▃▄▅█████▇▇▇▇▅▅▄▃▃▂▁▁
▁▁▁▁▁▂▂▂▃▃▃▃▃▄▄▄▄▅▅▅▆█▇████████████████████████▇▇▆▅▅▄▃▃▃▃▃▂▂▂▂▁▂▂▁
36 μs Histogram: frequency by time 40.5 μs (top 1%)
Memory estimate: 0 bytes, allocs estimate: 0.
The benchmark I'd been using for convolutions is a 2D one, with a 5x5 kernel mapping 3 input channels to 6 output channels, using 260x260 -> 256x256 images:
julia> dcd
DenseConvDims: (260, 260, 3) * (5, 5) -> (256, 256, 6), stride: (1, 1), pad: (0, 0, 0, 0), dil: (1, 1), flip: true
Script:
using NNlib, Polyester, LoopVectorization, Static, CUDA, NNlibCUDA, BenchmarkTools
function kernaxes(::DenseConvDims{2,K,C_in, C_out}) where {K,C_in, C_out} # LoopVectorization can take advantage of static size information
K₁ = StaticInt(1):StaticInt(K[1])
K₂ = StaticInt(1):StaticInt(K[2])
Cᵢₙ = StaticInt(1):StaticInt(C_in)
Cₒᵤₜ = StaticInt(1):StaticInt(C_out)
(K₁, K₂, Cᵢₙ, Cₒᵤₜ)
end
function convlayer!(
out::AbstractArray{<:Any,4}, img, kern,
dcd::DenseConvDims{2, <:Any, <:Any, <:Any, (1, 1), (0, 0, 0, 0), (1, 1), true}
)
@batch for d ∈ axes(out,4)
(K₁, K₂, Cᵢₙ, Cₒᵤₜ) = kernaxes(dcd)
for o ∈ Cₒᵤₜ
@turbo for j₁ ∈ axes(out,1), j₂ ∈ axes(out,2)
s = zero(eltype(out))
for k₁ ∈ K₁, k₂ ∈ K₂, i ∈ Cᵢₙ
s += img[j₁ + k₁ - 1, j₂ + k₂ - 1, i, d] * kern[k₁, k₂, i, o]
end
out[j₁, j₂, o, d] = s
end
end
end
end
img = rand(Float32, 260, 260, 3, 100);
kern = rand(Float32, 5, 5, 3, 6);
out1 = Array{Float32}(undef, size(img,1)+1-size(kern,1), size(img,2)+1-size(kern,2), size(kern,4), size(img,4));
out2 = similar(out1);
dcd = NNlib.DenseConvDims(img, kern, flipkernel = true);
@time NNlib.conv!(out1, img, kern, dcd);
@time convlayer!(out2, img, kern, dcd);
out1 ≈ out2
cuout = CuArray(out1);
cuimg = CuArray(img);
cukern = CuArray(kern);
CUDA.@time NNlib.conv!(cuout, cuimg, cukern, dcd);
Array(cuout) ≈ out2
@benchmark NNlib.conv!($out1, $img, $kern, $dcd)
@benchmark convlayer!($out1, $img, $kern, $dcd)
@benchmark CUDA.@sync NNlib.conv!($cuout, $cuimg, $cukern, $dcd)
Results on arctic3:
julia> @benchmark NNlib.conv!($out1, $img, $kern, $dcd)
BenchmarkTools.Trial: 13 samples with 1 evaluations.
Range (min … max): 364.162 ms … 476.280 ms ┊ GC (min … max): 0.00% … 17.16%
Time (median): 384.447 ms ┊ GC (median): 0.00%
Time (mean ± σ): 409.685 ms ± 43.502 ms ┊ GC (mean ± σ): 6.24% ± 8.39%
▁ ▁▁ ▁▁▁▁ ▁ ▁ ▁ ▁ █▁
█▁▁▁▁██▁████▁▁▁▁▁█▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁█▁▁▁█ ▁
364 ms Histogram: frequency by time 476 ms <
Memory estimate: 375.01 MiB, allocs estimate: 117.
julia> @benchmark convlayer!($out1, $img, $kern, $dcd)
BenchmarkTools.Trial: 474 samples with 1 evaluations.
Range (min … max): 7.617 ms … 72.980 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 8.559 ms ┊ GC (median): 0.00%
Time (mean ± σ): 10.492 ms ± 4.004 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
▆ █▅ ▃ ▁▅▅▄
█▆▆▄▁▄███▁██▅▅▄▁▁▁▄▁▁▁▁▁▁▁▁▁▄▄▁▁▄▄▁▁▄▁▄▄▄▇▅▁▁▄▁▁▁▁▁▁▁▄▁████ ▇
7.62 ms Histogram: log(frequency) by time 14.5 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark CUDA.@sync NNlib.conv!($cuout, $cuimg, $cukern, $dcd)
BenchmarkTools.Trial: 667 samples with 1 evaluations.
Range (min … max): 5.050 ms … 9.009 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 7.497 ms ┊ GC (median): 0.00%
Time (mean ± σ): 7.491 ms ± 278.196 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
█▇
▂▁▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▃▄▃▃██▂▆▃▁▂▁▂▁▂▂▂▂ ▂
5.05 ms Histogram: frequency by time 8.2 ms <
Memory estimate: 64.44 KiB, allocs estimate: 4043.
CPU results on my 10980XE:
julia> @benchmark NNlib.conv!($out1, $img, $kern, $dcd)
BenchmarkTools.Trial: 26 samples with 1 evaluations.
Range (min … max): 189.316 ms … 208.082 ms ┊ GC (min … max): 1.68% … 0.00%
Time (median): 197.510 ms ┊ GC (median): 0.00%
Time (mean ± σ): 197.727 ms ± 4.325 ms ┊ GC (mean ± σ): 0.49% ± 0.73%
▁ ▁ ▁ ▁ ▁▁ ▁▁ ██ ▁ █ ▁ ▁▁ ▁▁ ▁▁▁ ▁ ▁ ▁
█▁▁▁█▁▁▁▁▁▁█▁█▁██▁▁▁▁▁██▁██▁█▁█▁█▁██▁██▁███▁▁▁▁▁▁▁▁▁█▁▁▁▁█▁▁▁▁▁▁▁▁█
189 ms Histogram: frequency by time 208 ms (top 1%)
Memory estimate: 337.51 MiB, allocs estimate: 105.
julia> @benchmark convlayer!($out1, $img, $kern, $dcd)
BenchmarkTools.Trial: 925 samples with 1 evaluations.
Range (min … max): 5.295 ms … 5.903 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 5.398 ms ┊ GC (median): 0.00%
Time (mean ± σ): 5.395 ms ± 38.166 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▁▃▂▃▁▃▂▁▃▂▄▄▃▅█▅█▃▃▇▁▂ ▂▂
▂▂▂▂▃▁▂▃▃▄▅▄▅▄▄▆▄▅██▆▅▇▆█████████████████████████▇███▅▅▆▆▅▄▄▄▃▃▂
5.29 ms Histogram: frequency by time 5.47 ms (top 1%)
Memory estimate: 0 bytes, allocs estimate: 0.
At these sizes, the CPUs seem much faster than the GPU.
julia> CUDA.device()
CuDevice(0): Tesla T4
julia> versioninfo()
Julia Version 1.6.1
Commit 6aaedecc44 (2021-04-23 05:59 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz
Arctic seems to be a dual socket system, with 2x Xeon Silver 4114s (based on …). My 10980XE:
julia> versioninfo()
Julia Version 1.6.2-pre.2
Commit ff1827d117* (2021-05-02 02:37 UTC)
Platform Info:
OS: Linux (x86_64-generic-linux)
CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
The Tesla T4 has 2560 "cores", and sells for over twice as much as the 10980XE. |
You forgot ParallelStencil.jl haha. The code should be something like:
using ParallelStencil.FiniteDifferences3D
#(...)
@parallel function diffusion3D_step!(T2, T, Ci, lam)
@inn(T2) = (lam*@inn(Ci)*(@d2_xi(T) + @d2_yi(T) + @d2_zi(T)));
return
end
I'm omitting the dx^2, dy^2, dz^2 to make it match.
Many stencil compilers like cudnn only allow constant coefficients. I assumed that would be the limitation here as well, but if that isn't specialized on, then cudnn would probably be a lot faster while only applying to more limited scenarios. That is probably worth mentioning in the docs as a comparison then. |
using ParallelStencil, ParallelStencil.FiniteDifferences3D
@init_parallel_stencil(Threads, Float64, 3);
@parallel function diffusion3D_step!(T2, T)
@inn(T2) = @d2_xi(T) + @d2_yi(T) + @d2_zi(T)
return
end
out3d_pad = similar(img3d_squeezed);
@time @parallel diffusion3D_step!(out3d_pad, img3d_squeezed)
@time laplace_sparse!(out3d, img3d_squeezed);
subaxes = map(x -> UnitRange(x) .+ 1, axes(out3d))
out3d ≈ view(out3d_pad, subaxes...)
On the 10980XE:
julia> @benchmark @parallel diffusion3D_step!($out3d_pad, $img3d_squeezed)
BenchmarkTools.Trial: 6145 samples with 1 evaluations.
Range (min … max): 797.289 μs … 1.032 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 809.144 μs ┊ GC (median): 0.00%
Time (mean ± σ): 809.137 μs ± 6.676 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▂ ▁▃▁▂▂▂▂ ▃▆█▅▂
▁▁▁▁▁▂▃▃▂▁▂▂▂▂▃▅██▇▇███████▇███████████▇▅▅▄▃▃▂▃▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁
797 μs Histogram: frequency by time 823 μs (top 1%)
Memory estimate: 9.08 KiB, allocs estimate: 110.
julia> @benchmark laplace_sparse!($out3d, $img3d_squeezed)
BenchmarkTools.Trial: 10000 samples with 1 evaluations.
Range (min … max): 35.854 μs … 189.335 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 37.263 μs ┊ GC (median): 0.00%
Time (mean ± σ): 37.368 μs ± 1.675 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▃▃▅▇▆▇██▇▇▇▅▆▅▅▄▄▅▃▄▃▃▁▁
▁▁▁▁▁▁▂▂▃▄▄▆████████████████████████████▇▇▆▅▅▄▄▃▃▃▂▃▂▂▂▂▂▂▁▂▂▁▁▂▁▁
35.9 μs Histogram: frequency by time 39.4 μs (top 1%)
Memory estimate: 0 bytes, allocs estimate: 0.
So laplace_sparse! (the @tturbo version) is roughly 20x faster here than the @parallel Threads version. |
On the T4 GPU:
using ParallelStencil, ParallelStencil.FiniteDifferences3D
@init_parallel_stencil(CUDA, Float64, 3);
cuout3d_pad = CuArray(out3d_pad);
cuimg3d = CuArray(img3d_squeezed);
@time @parallel diffusion3D_step!(cuout3d_pad, cuimg3d)
out3d ≈ Array(view(cuout3d_pad, subaxes...))
@benchmark @parallel diffusion3D_step!($cuout3d_pad, $cuimg3d)
Results:
julia> @benchmark @parallel diffusion3D_step!($cuout3d_pad, $cuimg3d) # GPU
BenchmarkTools.Trial: 10000 samples with 1 evaluations.
Range (min … max): 110.729 μs … 215.262 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 154.285 μs ┊ GC (median): 0.00%
Time (mean ± σ): 154.545 μs ± 10.290 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▇▁▅█
▃▂▂▂▂▂▁▁▂▁▂▁▁▁▂▃▂▂▁▁▁▁▂▁▁▁▁▁▃▃▆████▇▆▄▃▃▂▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂ ▃
111 μs Histogram: frequency by time 189 μs <
Memory estimate: 7.67 KiB, allocs estimate: 195.
julia> @benchmark laplace_sparse!($out3d, $img3d_squeezed) # @tturbo, CPU
BenchmarkTools.Trial: 10000 samples with 1 evaluations.
Range (min … max): 69.342 μs … 4.140 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 89.844 μs ┊ GC (median): 0.00%
Time (mean ± σ): 94.171 μs ± 70.249 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▃▁ ▁█ ▄▇▂
▂▃▅███▅▃▃▄▃██▆▃▄▃▆███▅▃▂▂▂▃▃▃▂▂▂▁▁▁▂▂▂▁▁▂▂▂▂▂▃▆▆▃▂▂▂▅▆▅▃▂▂▂ ▃
69.3 μs Histogram: frequency by time 140 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
So about a 2x improvement over the NNlib.conv! (cudnn) time on the same GPU. |
Maybe @DhairyaLGandhi knows a switch on NNlib we should try here. But yeah, it looks promising on GPU but much less so on CPU. |
Having it optionally apply something like @tturbo would be nice, but the generated kernel body looks like this:
if var"##ix, ParallelStencil#264" <= size(T2, 1) - 2 && (var"##iy, ParallelStencil#265" <= size(T2, 2) - 2 && var"##iz, ParallelStencil#266" <= size(T2, 3) - 2)
#= /home/chriselrod/.julia/packages/ParallelStencil/0VULM/src/parallel.jl:103 =#
T2[var"##ix, ParallelStencil#264" + 1, var"##iy, ParallelStencil#265" + 1, var"##iz, ParallelStencil#266" + 1] = ((T[(var"##ix, ParallelStencil#264" + 1) + 1, var"##iy, ParallelStencil#265" + 1, var"##iz, ParallelStencil#266" + 1] - T[var"##ix, ParallelStencil#264" + 1, var"##iy, ParallelStencil#265" + 1, var"##iz, ParallelStencil#266" + 1]) - (T[var"##ix, ParallelStencil#264" + 1, var"##iy, ParallelStencil#265" + 1, var"##iz, ParallelStencil#266" + 1] - T[(var"##ix, ParallelStencil#264" + 1) - 1, var"##iy, ParallelStencil#265" + 1, var"##iz, ParallelStencil#266" + 1])) + ((T[var"##ix, ParallelStencil#264" + 1, (var"##iy, ParallelStencil#265" + 1) + 1, var"##iz, ParallelStencil#266" + 1] - T[var"##ix, ParallelStencil#264" + 1, var"##iy, ParallelStencil#265" + 1, var"##iz, ParallelStencil#266" + 1]) - (T[var"##ix, ParallelStencil#264" + 1, var"##iy, ParallelStencil#265" + 1, var"##iz, ParallelStencil#266" + 1] - T[var"##ix, ParallelStencil#264" + 1, (var"##iy, ParallelStencil#265" + 1) - 1, var"##iz, ParallelStencil#266" + 1])) + ((T[var"##ix, ParallelStencil#264" + 1, var"##iy, ParallelStencil#265" + 1, (var"##iz, ParallelStencil#266" + 1) + 1] - T[var"##ix, ParallelStencil#264" + 1, var"##iy, ParallelStencil#265" + 1, var"##iz, ParallelStencil#266" + 1]) - (T[var"##ix, ParallelStencil#264" + 1, var"##iy, ParallelStencil#265" + 1, var"##iz, ParallelStencil#266" + 1] - T[var"##ix, ParallelStencil#264" + 1, var"##iy, ParallelStencil#265" + 1, (var"##iz, ParallelStencil#266" + 1) - 1]))
end
And yeah, I wouldn't be surprised if I misused CUDNN here / if there's a way to tell it about all those zeros in the kernel. |
Thanks for reporting - interesting results. Looks like you did the ParallelStencil tests using Float64 (in @init_parallel_stencil) while the arrays are Float32. One could also give a try to the @parallel_indices approach:
using BenchmarkTools, ParallelStencil, ParallelStencil.FiniteDifferences3D
@init_parallel_stencil(Threads, Float32, 3)
@parallel_indices (ix,iy,iz) function diffusion3D_step2!(T2::Data.Array, T::Data.Array)
T2[ix,iy,iz] = T[ix+1,iy,iz] + T[ix-1,iy,iz] +
T[ix,iy+1,iz] + T[ix,iy-1,iz] +
T[ix,iy,iz+1] + T[ix,iy,iz-1] -
6.0*T[ix,iy,iz]
return
end
A = ones(Float32, 128, 128, 128)
B = similar(A)
@benchmark @parallel (2:size(A,1)-1, 2:size(A,2)-1, 2:size(A,3)-1) diffusion3D_step2!($B, $A)
Yes, ParallelStencil does not include that by default. For CPU execution, it could be interesting to see the perf one gets adding @tturbo to the kernel. |
Oops. Wasn't paying attention to the element type.
julia> @benchmark @parallel (2:size($A,1)-1, 2:size($A,2)-1, 2:size($A,3)-1) diffusion3D_step2!($B, $A)
BenchmarkTools.Trial: 10000 samples with 1 evaluations.
Range (min … max): 88.396 μs … 4.841 ms ┊ GC (min … max): 0.00% … 95.27%
Time (median): 132.150 μs ┊ GC (median): 0.00%
Time (mean ± σ): 119.220 μs ± 50.733 μs ┊ GC (mean ± σ): 0.39% ± 0.95%
▃▅ ▂█▂
▂▂▂▂▃▃▄▅▄▃▄██▄▃▃▃▃▂▂▂▂▃▃▃▃▃▂▂▃▃▂▂▂▂▂▂▂▂▁▂▂▁▂▁▂▁▁▁▂▂▂▂▃▄▄▅▇███▅▄▄▄▃
88.4 μs Histogram: frequency by time 141 μs (top 1%)
Memory estimate: 8.16 KiB, allocs estimate: 91.
julia> out3d = Array{Float32}(undef, size(A) .+ 1 .- 3);
julia> @benchmark laplace_sparse!($out3d, $A)
BenchmarkTools.Trial: 10000 samples with 1 evaluations.
Range (min … max): 35.796 μs … 200.787 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 37.115 μs ┊ GC (median): 0.00%
Time (mean ± σ): 37.193 μs ± 1.745 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▂▃▄▄▆▇█▇██▇▆▇▅▅▄▂▁▁
▂▁▂▂▂▂▂▃▃▃▃▄▅▆▇▇█████████████████████▇▆▅▅▄▄▄▃▃▃▂▃▃▂▃▂▂▃▃▂▃▂▃▂▃▂▂▂▂
35.8 μs Histogram: frequency by time 39.1 μs (top 1%)
Memory estimate: 0 bytes, allocs estimate: 0.
It's slower with 36 threads (the CPU has 18 cores/36 threads).
That'll require some more changes.
julia> @parallel_indices (ix,iy,iz) function diffusion3D_step2!(T2::Data.Array, T::Data.Array)
T2[ix,iy,iz] = T[ix+1,iy,iz] + T[ix-1,iy,iz] +
T[ix,iy+1,iz] + T[ix,iy-1,iz] +
T[ix,iy,iz+1] + T[ix,iy,iz-1] -
6.0*T[ix,iy,iz]
return
end
ERROR: LoadError: Unrecognized loop range type: var"##ranges, ParallelStencil.ParallelKernel#257"[3].
Stacktrace:
[1] register_single_loop!(ls::LoopVectorization.LoopSet, looprange::Expr)
@ LoopVectorization ~/.julia/packages/LoopVectorization/o15wm/src/modeling/graphs.jl:930
LoopVectorization wants loops with simple indices.
CUDA performance:
julia> @benchmark @parallel (2:size($cuA,1)-1, 2:size($cuA,2)-1, 2:size($cuA,3)-1) diffusion3D_step2!($cuB, $cuA)
BenchmarkTools.Trial: 10000 samples with 1 evaluations.
Range (min … max): 87.434 μs … 473.461 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 90.375 μs ┊ GC (median): 0.00%
Time (mean ± σ): 93.133 μs ± 9.758 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▃▃▆█▅▂▃ ▁▃ ▂ ▂ ▁▁ ▂ ▁
████████▆▅▄▅▅██▇▆▄▄▄▁██▆▆▅▄▁▄▁▁▄█▇██▅▅▃▃▁▃▁▁▁▃▁▃▁▆█▆▅▄▁▃▁▁▁█ █
87.4 μs Histogram: log(frequency) by time 138 μs <
Memory estimate: 1.55 KiB, allocs estimate: 83. |
Cool.
Yeah, somehow multi-threading performance is best when the number of threads matches the number of physical cores rather than the number of hardware threads. |
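For reference, the thread count is fixed at Julia startup, so matching threads to physical cores is done when launching (illustrative; 18 is just the core count of the CPU discussed in this thread):
# one thread per physical core, e.g. for an 18-core CPU:
#   julia -t 18
using Base.Threads
nthreads()   # reports the count Julia was started with; @tturbo/@batch then run on these threads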
What if one tries something like this (here without ParallelStencil):
function diffusion3D_step3!(T2::Data.Array, T::Data.Array)
@tturbo for iz=2:size(T,3)-1
for iy=2:size(T,2)-1
for ix=2:size(T,1)-1
T2[ix,iy,iz] = T[ix+1,iy,iz] + T[ix-1,iy,iz] +
T[ix,iy+1,iz] + T[ix,iy-1,iz] +
T[ix,iy,iz+1] + T[ix,iy,iz-1] -
6.0*T[ix,iy,iz]
end
end
end
return
end
One could also make a global index and thus a single index loop...
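A sketch of that single-global-index idea (untested and illustrative only; note it also writes wrapped-around values into the x/y boundary layers of the interior range, which would need a separate fix-up or masking):
using LoopVectorization

function diffusion3D_step_lin!(T2, T)
    nx, ny, nz = size(T)
    sx, sy, sz = 1, nx, nx*ny                        # linear-index strides along x, y, z
    @tturbo for i in (sx + sy + sz + 1):(length(T) - sx - sy - sz)
        T2[i] = T[i+sx] + T[i-sx] + T[i+sy] + T[i-sy] + T[i+sz] + T[i-sz] - 6*T[i]
    end
    return T2
end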
Nice. The problem may be a bit small for the GPU though... Also, a 3D stencil on the GPU may need a more advanced implementation (like 2.5D blocking) to get full perf... |
I'm looking into why it's slower at the moment:
julia> @benchmark diffusion3D_step3!($B,$A)
BenchmarkTools.Trial: 10000 samples with 1 evaluations.
Range (min … max): 46.454 μs … 227.213 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 47.727 μs ┊ GC (median): 0.00%
Time (mean ± σ): 47.884 μs ± 2.009 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▂▂▅▅▆█▇▇▆▅▃▃▁
▁▁▁▁▂▂▂▄▄▅▇███████████████▇▆▅▄▃▃▃▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▂▁▁▂▁▂▁▂▁▁▁
46.5 μs Histogram: frequency by time 50.8 μs (top 1%)
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark laplace_sparse!($out3d,$A)
BenchmarkTools.Trial: 10000 samples with 1 evaluations.
Range (min … max): 35.687 μs … 190.651 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 37.186 μs ┊ GC (median): 0.00%
Time (mean ± σ): 37.298 μs ± 1.728 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▁▂▄▄▇▆▇█▅▆▆▅▅▅▄▄▅▃▃▃▄▃▃▂▂▁
▁▁▁▁▂▂▂▃▄▅▆▇███████████████████████████▆▇▆▆▆▅▄▃▃▃▃▂▂▂▂▂▂▁▂▁▂▂▁▁▁▁▁
35.7 μs Histogram: frequency by time 39.6 μs (top 1%)
Memory estimate: 0 bytes, allocs estimate: 0.
If you look at the definitions of out3d and B:
julia> out3d = Array{Float32}(undef, size(A) .+ 1 .- 3);
julia> B = similar(A);
out3d is 126x126x126 while B is the full 128x128x128, which seems to be the cause, i.e. if we make laplace_sparse! write into a 126x126x126 view of the full-size array instead, it is just as slow:
julia> out3d_view = view(B, 2:127, 2:127, 2:127);
julia> @benchmark laplace_sparse!($out3d_view,$A)
BenchmarkTools.Trial: 10000 samples with 1 evaluations.
Range (min … max): 45.549 μs … 230.740 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 46.609 μs ┊ GC (median): 0.00%
Time (mean ± σ): 46.759 μs ± 1.989 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▁▂▄▄▇████▇▆▅▄▃▂▁
▁▁▂▁▂▃▄▆▇████████████████▇▇▆▅▄▃▃▃▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▁▁▂▁▁▂▁▁▁▁▁
45.5 μs Histogram: frequency by time 49.5 μs (top 1%)
Memory estimate: 0 bytes, allocs estimate: 0.
My working theory is that it may have something to do with certain powers of 2 being magic slow numbers. On the theory that the extra size is causing it to run out of cache, …
The bigger strike against the "too much memory" theory is that if I shrink or increase the size of the arrays, the slowdown goes away:
julia> A = ones(Float32, 126, 126, 126);
julia> B = similar(A);
julia> out3d = Array{Float32}(undef, size(A) .+ 1 .- 3);
julia> out3d_view = view(B, 2:size(B,1)-1, 2:size(B,2)-1, 2:size(B,3)-1);
julia> @benchmark diffusion3D_step3!($B,$A)
BenchmarkTools.Trial: 10000 samples with 1 evaluations.
Range (min … max): 26.384 μs … 168.274 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 27.583 μs ┊ GC (median): 0.00%
Time (mean ± σ): 27.679 μs ± 1.512 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▂▄▅▇▇█▆▆▆▃▃▂▁
▂▁▁▁▂▂▂▂▃▃▄▄▅▆▇███████████████▆▆▅▄▄▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂
26.4 μs Histogram: frequency by time 29.9 μs (top 1%)
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark laplace_sparse!($out3d_view,$A)
BenchmarkTools.Trial: 10000 samples with 1 evaluations.
Range (min … max): 26.807 μs … 190.863 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 28.499 μs ┊ GC (median): 0.00%
Time (mean ± σ): 28.685 μs ± 1.852 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▁▄▆▆▇█▇▅▆▅▃▃▃▁▁
▁▁▁▁▁▁▁▂▂▂▃▄▄▆██████████████████▇▆▆▄▅▄▃▃▃▂▂▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▁▁▁
26.8 μs Histogram: frequency by time 31.4 μs (top 1%)
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark laplace_sparse!($out3d,$A)
BenchmarkTools.Trial: 10000 samples with 1 evaluations.
Range (min … max): 27.010 μs … 146.895 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 30.255 μs ┊ GC (median): 0.00%
Time (mean ± σ): 30.137 μs ± 1.531 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▂▄▅▆████▆▅▃▂
▁▁▁▁▂▂▂▂▂▂▂▃▃▂▃▃▃▃▃▃▃▃▄▄▅▅▅▆▆▇▇███████████████▇▅▄▃▃▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁
27 μs Histogram: frequency by time 32.7 μs (top 1%)
Memory estimate: 0 bytes, allocs estimate: 0.
bigger:
julia> A = ones(Float32, 130, 130, 130);
julia> B = similar(A);
julia> out3d = Array{Float32}(undef, size(A) .+ 1 .- 3);
julia> out3d_view = view(B, 2:size(B,1)-1, 2:size(B,2)-1, 2:size(B,3)-1);
julia> @benchmark diffusion3D_step3!($B,$A)
BenchmarkTools.Trial: 10000 samples with 1 evaluations.
Range (min … max): 36.041 μs … 194.086 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 37.261 μs ┊ GC (median): 0.00%
Time (mean ± σ): 37.362 μs ± 1.692 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▂▂▅▇▇▇██▇█▇▅▄▂▁
▁▁▁▁▂▂▂▃▃▄▆▇████████████████▇▇▆▄▄▃▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▂▂▁▁▁▂▁▁▁▁▁
36 μs Histogram: frequency by time 39.9 μs (top 1%)
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark laplace_sparse!($out3d_view,$A)
BenchmarkTools.Trial: 10000 samples with 1 evaluations.
Range (min … max): 37.286 μs … 187.904 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 38.715 μs ┊ GC (median): 0.00%
Time (mean ± σ): 38.837 μs ± 1.659 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▁▂▃▆▆█▇███▅▅▄▃▂▁
▁▁▁▁▁▂▂▃▄▅▅▆█████████████████▇▇▆▄▄▄▃▃▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▂▁▂▂▁▁▁▂▁▁▁▁▁▁
37.3 μs Histogram: frequency by time 41.7 μs (top 1%)
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark laplace_sparse!($out3d,$A)
BenchmarkTools.Trial: 10000 samples with 1 evaluations.
Range (min … max): 34.597 μs … 207.342 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 35.437 μs ┊ GC (median): 0.00%
Time (mean ± σ): 35.531 μs ± 1.807 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▂▃▅▇████▇█▅▄▃▂
▁▁▁▂▂▂▃▃▅▆▇█████████████████▇▆▅▄▄▃▃▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▁▁▁▁▁
34.6 μs Histogram: frequency by time 37.4 μs (top 1%)
Memory estimate: 0 bytes, allocs estimate: 0.
These arrays are larger, but runtimes are about the same as for the fast runs at 128x128x128.
TLDR: Always benchmark at multiple sizes, and not just at powers of 2. Also, note that these require the latest release of VectorizationBase and/or LoopVectorization. The 5 arguments to … |
From Agner Fog's C++ optimization manual, pages 89-90:
The 10980XE and Xeon 4114 both have 1 MiB of L2 cache, 16 ways, which puts the critical stride at
julia> 2^20 ÷ 16
65536
bytes. In a 128x128x128 Float32 array, elements that are adjacent along the third dimension are exactly
julia> 128*128*sizeof(Float32)
65536
bytes apart. |
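A rough back-of-the-envelope check of that conflict, assuming 64-byte cache lines and the 1 MiB, 16-way L2 mentioned above (so 1024 sets; purely illustrative):
line  = 64                            # bytes per cache line (assumed)
sets  = (1 << 20) ÷ (16 * line)       # 1 MiB / (16 ways * 64 B) = 1024 sets
plane = 128 * 128 * sizeof(Float32)   # 65536 bytes between adjacent iz-planes
map(o -> (o ÷ line) % sets, (0, plane, 2plane))   # => (0, 0, 0): the iz-1, iz, iz+1 planes hit the same set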
Thanks for reporting, and interesting to see that the performance is so sensitive to the array sizes. From experience I know that optimal array sizes are critical when doing operations on the GPU, but I never experienced much of this on the CPU side. It all makes sense on the CPU as well though, as there are certainly optimal sizes that minimise cache misses and give an optimal cache stride.
+1 So in conclusion (correct me if I am wrong):
Thanks very much for having done all these investigations @chriselrod ! |
Thanks @chriselrod for sharing your benchmarks with us! Here are a couple of thoughts with respect to the cross-over point that you have mentioned:
|
This is to be expected for this problem size, which fits in the CPU's L2 cache but not in the GPU's caches - see my comment above. For significantly larger problems, PS on the GPU will deliver the best performance. |
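For a sense of scale, simple arithmetic with the sizes and cache figures already mentioned in this thread (the GPU cache sizes are not spelled out here, so this is only indicative):
arr_mib = 128^3 * sizeof(Float32) / 2^20   # 8.0 MiB per 128^3 Float32 array
2 * arr_mib                                # 16.0 MiB for input plus output together
l2_mib  = 18 * 1                           # 18 MiB of aggregate L2 on the 18-core 10980XE (1 MiB per core)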
I'm curious what GPU performance you get against something like the cudnn wrappers of NNlibCUDA.jl. Those would be more appropriate comparisons than CuArray broadcasting.