This repository has been archived by the owner on Nov 4, 2024. It is now read-only.
fix: missing enzyme rules for matmuladd! (CUDA support) #159
Merged
avik-pal force-pushed the ap/enzyme_cuda branch from 77ee6c3 to f797993 on September 15, 2024 02:24
avik-pal force-pushed the ap/enzyme_cuda branch from f797993 to fcd5b08 on September 15, 2024 03:04
LuxLib Benchmarks
Benchmark suite | Current: df2bfd5 | Previous: 987fce9 | Ratio |
---|---|---|---|
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 5917 ns | 6083 ns | 0.97 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 5792 ns | 6250 ns | 0.93 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 8042 ns | 8104 ns | 0.99 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 6125 ns | 5333 ns | 1.15 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 116710 ns | 127763 ns | 0.91 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI | 2852430 ns | 2680722 ns | 1.06 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal | 861459 ns | 817500 ns | 1.05 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 401163 ns | 410844 ns | 0.98 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 10375 ns | 9771 ns | 1.06 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 10000 ns | 9958 ns | 1.00 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 9687.5 ns | 9834 ns | 0.99 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 9979 ns | 9958 ns | 1.00 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 537342 ns | 539870 ns | 1.00 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI | 17823755 ns | 18273784 ns | 0.98 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal | 4963250 ns | 2523292 ns | 1.97 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 717326 ns | 669947 ns | 1.07 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 1500 ns | 2812.5 ns | 0.53 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 1812.5 ns | 1416 ns | 1.28 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 1542 ns | 1584 ns | 0.97 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 1625 ns | 1333 ns | 1.22 |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA | 21658 ns | 21455 ns | 1.01 |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/oneAPI | 1280349 ns | 1323661 ns | 0.97 |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/Metal | 201041 ns | 216625 ns | 0.93 |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/AMDGPU | 35810 ns | 28950 ns | 1.24 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 3291.5 ns | 4458 ns | 0.74 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 3750 ns | 3375 ns | 1.11 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 3937.5 ns | 4167 ns | 0.94 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 4000 ns | 4000 ns | 1 |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA | 144356.5 ns | 142970.5 ns | 1.01 |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/oneAPI | 8866089 ns | 10240879 ns | 0.87 |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/Metal | 1450646 ns | 1524333 ns | 0.95 |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/AMDGPU | 146231 ns | 149491.5 ns | 0.98 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 57084 ns | 57833 ns | 0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 46708 ns | 40417 ns | 1.16 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 46875 ns | 46375 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 83917 ns | 83000 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 36339 ns | 36725 ns | 0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 818655.5 ns | 558408 ns | 1.47 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1057917 ns | 1040458 ns | 1.02 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 80285.5 ns | 81776 ns | 0.98 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2018875 ns | 2036667 ns | 0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2083417 ns | 2086500 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2084333 ns | 2090375 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2014417 ns | 1993667 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 226229 ns | 226490 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 8035998 ns | 7533597 ns | 1.07 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 5478875 ns | 8034167 ns | 0.68 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1440733 ns | 986919 ns | 1.46 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 170417 ns | 146666 ns | 1.16 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 148584 ns | 151000 ns | 0.98 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 149583 ns | 151062.5 ns | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 160562.5 ns | 194750 ns | 0.82 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 166876 ns | 166182 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 7797282 ns | 7689190 ns | 1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1583375 ns | 1596770.5 ns | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 203002 ns | 209312 ns | 0.97 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1102458.5 ns | 1113896 ns | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1102083 ns | 1120062.5 ns | 0.98 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1109583 ns | 1119104 ns | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1116459 ns | 1106542 ns | 1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 696608 ns | 695636.5 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 33320128.5 ns | 34400023 ns | 0.97 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 5999125 ns | 7210396 ns | 0.83 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1026614 ns | 1024730 ns | 1.00 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 4458 ns | 5291 ns | 0.84 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 4937 ns | 4916 ns | 1.00 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 6062.5 ns | 6125 ns | 0.99 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 5000 ns | 4375 ns | 1.14 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 92442 ns | 91792.5 ns | 1.01 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI | 5499696 ns | 5267805 ns | 1.04 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal | 449541 ns | 474000 ns | 0.95 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 67831 ns | 67381 ns | 1.01 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 9000 ns | 8750 ns | 1.03 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8875 ns | 8917 ns | 1.00 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 9000 ns | 8792 ns | 1.02 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8916 ns | 8687.5 ns | 1.03 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 598617.5 ns | 600359 ns | 1.00 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 33059926 ns | 36489972 ns | 0.91 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal | 5980250 ns | 5930125 ns | 1.01 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 387213 ns | 390114 ns | 0.99 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 18666.5 ns | 17562.5 ns | 1.06 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 18084 ns | 17979 ns | 1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 21416.5 ns | 20812.5 ns | 1.03 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 17000.5 ns | 17750 ns | 0.96 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 66296 ns | 66076.5 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 3090203 ns | 3263389.5 ns | 0.95 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1290959 ns | 1274334 ns | 1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 76351 ns | 76030 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 219000 ns | 212792 ns | 1.03 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 218999.5 ns | 213000 ns | 1.03 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 220916 ns | 218292 ns | 1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 221500 ns | 254395.5 ns | 0.87 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 352074 ns | 351925 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 15166593 ns | 15484392 ns | 0.98 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 5645208 ns | 5673084 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 451749 ns | 468334.5 ns | 0.96 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 583 ns | 625 ns | 0.93 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 750 ns | 708 ns | 1.06 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 875 ns | 770.5 ns | 1.14 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 666.5 ns | 666 ns | 1.00 |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA | 20395 ns | 20050 ns | 1.02 |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/oneAPI | 1176380 ns | 1150135.5 ns | 1.02 |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/Metal | 301416 ns | 295625 ns | 1.02 |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/AMDGPU | 30950 ns | 32420 ns | 0.95 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 1334 ns | 1459 ns | 0.91 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 1541.5 ns | 1520.5 ns | 1.01 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 1584 ns | 1459 ns | 1.09 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 1583 ns | 1500 ns | 1.06 |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA | 123898 ns | 122512.5 ns | 1.01 |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/oneAPI | 8669000 ns | 8913698.5 ns | 0.97 |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/Metal | 1653417 ns | 1644687.5 ns | 1.01 |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/AMDGPU | 135441 ns | 135591 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7416 ns | 7334 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 6042 ns | 5417 ns | 1.12 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6167 ns | 6042 ns | 1.02 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10459 ns | 10250 ns | 1.02 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 23631 ns | 23888.5 ns | 0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 1190921 ns | 1207370.5 ns | 0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 601583.5 ns | 446750 ns | 1.35 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 47310 ns | 47420 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 222083 ns | 236834 ns | 0.94 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 235583 ns | 241875 ns | 0.97 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 237708.5 ns | 269875 ns | 0.88 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 256042 ns | 257687.5 ns | 0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 188469 ns | 191906.5 ns | 0.98 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 32610334.5 ns | 32212683 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 9041499.5 ns | 8558250.5 ns | 1.06 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 644350.5 ns | 645121 ns | 1.00 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 4125 ns | 4083 ns | 1.01 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 4083 ns | 4083 ns | 1 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 4125 ns | 4042 ns | 1.02 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 4125 ns | 4125 ns | 1 |
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA | 22829.5 ns | 23307 ns | 0.98 |
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/oneAPI | 1921964 ns | 2000762.5 ns | 0.96 |
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/Metal | 221334 ns | 223875 ns | 0.99 |
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/AMDGPU | 46730 ns | 48080 ns | 0.97 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 16709 ns | 16792 ns | 1.00 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 16708 ns | 16625 ns | 1.00 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 16958 ns | 16792 ns | 1.01 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 17041 ns | 16917 ns | 1.01 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA | 192811 ns | 191629 ns | 1.01 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/oneAPI | 11018487 ns | 10282963 ns | 1.07 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/Metal | 941667 ns | 937125 ns | 1.00 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/AMDGPU | 171082 ns | 176282 ns | 0.97 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 509417 ns | 509292 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 404708 ns | 332354.5 ns | 1.22 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 405084 ns | 404834 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 865041 ns | 865333 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA | 113085 ns | 113483 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/oneAPI | 405181 ns | 392476 ns | 1.03 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/Metal | 493750 ns | 487333 ns | 1.01 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/AMDGPU | 242012 ns | 240773 ns | 1.01 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 2330209 ns | 2308770.5 ns | 1.01 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 2027500 ns | 1756875 ns | 1.15 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 2035083.5 ns | 2033625 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 3196083.5 ns | 3270500 ns | 0.98 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA | 239374 ns | 237569 ns | 1.01 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/oneAPI | 10368785 ns | 11006777.5 ns | 0.94 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/Metal | 1964958 ns | 2028666.5 ns | 0.97 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/AMDGPU | 739347 ns | 739942 ns | 1.00 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 6354.5 ns | 6062.5 ns | 1.05 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 7500 ns | 6584 ns | 1.14 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 8020.5 ns | 8208.5 ns | 0.98 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 6500 ns | 6875 ns | 0.95 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 90359 ns | 91839.5 ns | 0.98 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI | 5333534 ns | 5704966 ns | 0.93 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal | 870542 ns | 776250 ns | 1.12 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 67591 ns | 65360 ns | 1.03 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 11833.5 ns | 11041.5 ns | 1.07 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 12000 ns | 11875 ns | 1.01 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 12250 ns | 11125 ns | 1.10 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 11625 ns | 12187.5 ns | 0.95 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 628260 ns | 637048 ns | 0.99 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 37786795 ns | 37465688 ns | 1.01 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal | 5940750 ns | 5651896.5 ns | 1.05 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 415614 ns | 408644 ns | 1.02 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 500 ns | 500 ns | 1 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 500 ns | 542 ns | 0.92 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 541 ns | 542 ns | 1.00 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 500 ns | 541 ns | 0.92 |
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA | 22889 ns | 22899 ns | 1.00 |
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/oneAPI | 2223683 ns | 1980954 ns | 1.12 |
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/Metal | 317083 ns | 214375 ns | 1.48 |
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/AMDGPU | 48540 ns | 49101 ns | 0.99 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 2083 ns | 2084 ns | 1.00 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 2083 ns | 2083 ns | 1 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 2166 ns | 2208 ns | 0.98 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 2167 ns | 2125 ns | 1.02 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA | 223807 ns | 228216 ns | 0.98 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/oneAPI | 10821378 ns | 11133138.5 ns | 0.97 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/Metal | 1968833 ns | 2019750 ns | 0.97 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/AMDGPU | 176001.5 ns | 180086.5 ns | 0.98 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 9625 ns | 9083 ns | 1.06 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 9333 ns | 8500 ns | 1.10 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 10750 ns | 10833.5 ns | 0.99 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 8542 ns | 8542 ns | 1 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 100074.5 ns | 108383.5 ns | 0.92 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI | 3132454 ns | 3207332 ns | 0.98 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal | 922709 ns | 816208 ns | 1.13 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 71491 ns | 74171 ns | 0.96 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 17625 ns | 16875 ns | 1.04 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 18125 ns | 18792 ns | 0.96 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 18500 ns | 18250 ns | 1.01 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 17917 ns | 17812.5 ns | 1.01 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 568177 ns | 615805 ns | 0.92 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 16550550 ns | 16767446 ns | 0.99 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal | 5440042 ns | 5170312.5 ns | 1.05 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 378703 ns | 383838.5 ns | 0.99 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 500 ns | 500 ns | 1 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 500 ns | 500 ns | 1 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 583 ns | 666 ns | 0.88 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 500 ns | 500 ns | 1 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 34894 ns | 35553 ns | 0.98 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI | 1202690 ns | 1192710.5 ns | 1.01 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal | 451834 ns | 293146 ns | 1.54 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU | 45750 ns | 46141 ns | 0.99 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 8875 ns | 8541.5 ns | 1.04 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 9125 ns | 8541 ns | 1.07 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 9166.5 ns | 9958.5 ns | 0.92 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 8333 ns | 9458.5 ns | 0.88 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 257748.5 ns | 264293 ns | 0.98 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI | 21823797 ns | 18241947 ns | 1.20 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal | 5328916 ns | 5274687.5 ns | 1.01 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 367713.5 ns | 366223 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 396833.5 ns | 396958 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 287875 ns | 215500 ns | 1.34 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 288250 ns | 287792 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 756458 ns | 755333 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA | 111609 ns | 110939.5 ns | 1.01 |
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/oneAPI | 329132 ns | 326929 ns | 1.01 |
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/Metal | 395667 ns | 365521 ns | 1.08 |
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/AMDGPU | 74991 ns | 74351 ns | 1.01 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 1449979 ns | 1446854 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 1131042 ns | 859125 ns | 1.32 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 1133958 ns | 1132854 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 2356500 ns | 2436292 ns | 0.97 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA | 204709 ns | 204467 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/oneAPI | 10635074 ns | 8967194.5 ns | 1.19 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/Metal | 1611750 ns | 1574375 ns | 1.02 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/AMDGPU | 322063 ns | 321063 ns | 1.00 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 7104 ns | 7187.5 ns | 0.99 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 7958 ns | 7270.5 ns | 1.09 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 8958.5 ns | 8541.5 ns | 1.05 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 6562.5 ns | 6979.5 ns | 0.94 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 142105.5 ns | 145872 ns | 0.97 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI | 5936342 ns | 5766375 ns | 1.03 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal | 452000 ns | 448125 ns | 1.01 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 66091 ns | 65611 ns | 1.01 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 12375 ns | 14770.5 ns | 0.84 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 15041 ns | 16916.5 ns | 0.89 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 15917 ns | 15687.5 ns | 1.01 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 12417 ns | 15562.5 ns | 0.80 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 942027 ns | 956937.5 ns | 0.98 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 44977187 ns | 42931711 ns | 1.05 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal | 5815708 ns | 6186333 ns | 0.94 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 435994 ns | 421904 ns | 1.03 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 26250 ns | 25292 ns | 1.04 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 27062 ns | 25292 ns | 1.07 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 30166.5 ns | 28583.5 ns | 1.06 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 27083 ns | 30125 ns | 0.90 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 197974.5 ns | 198270.5 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 7858446 ns | 7924119 ns | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 679708 ns | 654625 ns | 1.04 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 116281 ns | 113131 ns | 1.03 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 111583 ns | 157000 ns | 0.71 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 152937.5 ns | 118479 ns | 1.29 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 153979.5 ns | 118792 ns | 1.30 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 149416 ns | 145083.5 ns | 1.03 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1069746 ns | 1072793 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 42543095 ns | 41512479 ns | 1.02 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 5946875 ns | 5879750 ns | 1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 584775 ns | 587055 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 75459 ns | 76417 ns | 0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 78583 ns | 74917 ns | 1.05 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 86542 ns | 80458 ns | 1.08 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 76458 ns | 82834 ns | 0.92 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 204304 ns | 204563.5 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 7626592.5 ns | 7289524 ns | 1.05 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 525584 ns | 532021 ns | 0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 127951 ns | 126591 ns | 1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 305250 ns | 263209 ns | 1.16 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 309917 ns | 316562 ns | 0.98 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 303687.5 ns | 248479.5 ns | 1.22 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 300459 ns | 210125 ns | 1.43 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1109450.5 ns | 1111658.5 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 42357809 ns | 39831914 ns | 1.06 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6335041.5 ns | 6266000 ns | 1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 692446 ns | 691997 ns | 1.00 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 16583 ns | 16771 ns | 0.99 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 17209 ns | 16791.5 ns | 1.02 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 18583 ns | 17542 ns | 1.06 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 16250 ns | 16750 ns | 0.97 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 143978 ns | 144759.5 ns | 0.99 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI | 5682846.5 ns | 5606829 ns | 1.01 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal | 442583.5 ns | 474208 ns | 0.93 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 231662.5 ns | 232022 ns | 1.00 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 25771 ns | 26895.5 ns | 0.96 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 27895.5 ns | 25167 ns | 1.11 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 27437.5 ns | 27333 ns | 1.00 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 25146 ns | 24167 ns | 1.04 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 970539 ns | 972458 ns | 1.00 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 41685019 ns | 41939896 ns | 0.99 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal | 5938208 ns | 6295958 ns | 0.94 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 688856 ns | 695306.5 ns | 0.99 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 11771 ns | 11209 ns | 1.05 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 11000 ns | 11333.5 ns | 0.97 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 12625 ns | 12416.5 ns | 1.02 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 10375 ns | 11042 ns | 0.94 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 123075.5 ns | 122668.5 ns | 1.00 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI | 3467941 ns | 3386989.5 ns | 1.02 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal | 923417 ns | 858500 ns | 1.08 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 233902 ns | 233942 ns | 1.00 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 22249.5 ns | 21584 ns | 1.03 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 21958 ns | 22563 ns | 0.97 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 22458.5 ns | 22583 ns | 0.99 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 21417 ns | 21291 ns | 1.01 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 700502.5 ns | 697229 ns | 1.00 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 21978852 ns | 21507216 ns | 1.02 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal | 5587958.5 ns | 5485375 ns | 1.02 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 671596 ns | 669687 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
66792 ns |
63104 ns |
1.06 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
64084 ns |
66479 ns |
0.96 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
67917 ns |
66584 ns |
1.02 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
62708 ns |
64208.5 ns |
0.98 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
104530 ns |
105012.5 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
3393921 ns |
3348443 ns |
1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal |
1274250 ns |
1297624.5 ns |
0.98 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
232402 ns |
232172 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
445958.5 ns |
440625 ns |
1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
445104 ns |
448937.5 ns |
0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
474833 ns |
440917 ns |
1.08 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
487708 ns |
438250 ns |
1.11 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
511360 ns |
511759 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
19945666 ns |
20624860 ns |
0.97 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
6173209 ns |
5921625 ns |
1.04 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
714611 ns |
713498 ns |
1.00 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
7541.5 ns |
7521 ns |
1.00 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
8021 ns |
8084 ns |
0.99 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
8458 ns |
8667 ns |
0.98 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
6895.5 ns |
7750 ns |
0.89 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA |
143280 ns |
143457 ns |
1.00 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI |
5507961.5 ns |
5597779 ns |
0.98 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal |
760167 ns |
446771 ns |
1.70 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU |
64870 ns |
64960 ns |
1.00 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
15167 ns |
14875 ns |
1.02 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
15624.5 ns |
15709 ns |
0.99 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
14791.5 ns |
16542 ns |
0.89 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
14625 ns |
15541.5 ns |
0.94 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA |
937676 ns |
938762 ns |
1.00 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI |
39900259 ns |
39706040 ns |
1.00 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal |
5852708 ns |
5775541 ns |
1.01 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU |
399123 ns |
398045 ns |
1.00 |
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) |
6153208.5 ns |
6154854 ns |
1.00 |
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) |
6372791 ns |
3224917 ns |
1.98 |
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) |
6377771 ns |
6376292 ns |
1.00 |
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) |
11908042 ns |
11902583 ns |
1.00 |
batchedmm(512, Bsize=4)/forward/GPU/CUDA |
347117 ns |
347379 ns |
1.00 |
batchedmm(512, Bsize=4)/forward/GPU/AMDGPU |
326032 ns |
297978.5 ns |
1.09 |
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) |
19111958.5 ns |
19104063 ns |
1.00 |
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) |
19917167 ns |
11143020.5 ns |
1.79 |
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) |
20013437.5 ns |
19964417 ns |
1.00 |
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) |
36539583.5 ns |
36518125 ns |
1.00 |
batchedmm(512, Bsize=4)/zygote/GPU/CUDA |
1010211 ns |
1020967.5 ns |
0.99 |
batchedmm(512, Bsize=4)/zygote/GPU/AMDGPU |
1160155 ns |
1158972 ns |
1.00 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) |
958 ns |
958 ns |
1.00 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) |
917 ns |
1000 ns |
0.92 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) |
959 ns |
1000 ns |
0.96 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) |
959 ns |
958 ns |
1.00 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA |
22953 ns |
22897 ns |
1.00 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/oneAPI |
2055203.5 ns |
2091957.5 ns |
0.98 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/Metal |
317500 ns |
232500 ns |
1.37 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/AMDGPU |
206642 ns |
206842 ns |
1.00 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) |
3667 ns |
3708 ns |
0.99 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) |
3667 ns |
3709 ns |
0.99 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) |
3750 ns |
3792 ns |
0.99 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) |
3666 ns |
3667 ns |
1.00 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA |
276264.5 ns |
277378 ns |
1.00 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/oneAPI |
11103867 ns |
11186074 ns |
0.99 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/Metal |
2080563 ns |
2130584 ns |
0.98 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/AMDGPU |
626735 ns |
626357 ns |
1.00 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
8354.5 ns |
7750 ns |
1.08 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
8271.5 ns |
7937.5 ns |
1.04 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
10375 ns |
9771 ns |
1.06 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
8479.5 ns |
7437.5 ns |
1.14 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA |
119642 ns |
119515 ns |
1.00 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI |
3410967 ns |
3487658 ns |
0.98 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal |
806333 ns |
816562.5 ns |
0.99 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU |
65401 ns |
65701 ns |
1.00 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
11916 ns |
11208 ns |
1.06 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
11959 ns |
13416.5 ns |
0.89 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
12334 ns |
12834 ns |
0.96 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
12062 ns |
11584 ns |
1.04 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA |
633604.5 ns |
631148 ns |
1.00 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI |
24085045 ns |
21438278 ns |
1.12 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal |
5097229 ns |
5005375 ns |
1.02 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU |
354493 ns |
354774 ns |
1.00 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) |
250 ns |
291 ns |
0.86 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) |
250 ns |
333 ns |
0.75 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) |
292 ns |
292 ns |
1.00 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) |
292 ns |
292 ns |
1.00 |
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA |
22391 ns |
22106 ns |
1.01 |
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/oneAPI |
2130096.5 ns |
2144977 ns |
0.99 |
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/Metal |
223604.5 ns |
226937 ns |
0.99 |
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/AMDGPU |
46660 ns |
46510 ns |
1.00 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) |
2875 ns |
2875 ns |
1.00 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) |
2875 ns |
3000 ns |
0.96 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) |
3042 ns |
2917 ns |
1.04 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) |
3292 ns |
2958 ns |
1.11 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA |
200381.5 ns |
199810.5 ns |
1.00 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/oneAPI |
9512287 ns |
9182273 ns |
1.04 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/Metal |
1667166.5 ns |
1664167 ns |
1.00 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/AMDGPU |
154861.5 ns |
161676.5 ns |
0.96 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
11188 ns |
11625 ns |
0.96 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
11375 ns |
11979 ns |
0.95 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
13583 ns |
13333 ns |
1.02 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
11167 ns |
11604.5 ns |
0.96 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA |
120676.5 ns |
120755 ns |
1.00 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI |
3356710 ns |
3560641 ns |
0.94 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal |
905416 ns |
1031500 ns |
0.88 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU |
232962.5 ns |
233163 ns |
1.00 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
20896 ns |
20687.5 ns |
1.01 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
20541 ns |
20583 ns |
1.00 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
24042 ns |
23000 ns |
1.05 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
20437.5 ns |
20541.5 ns |
0.99 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA |
592980 ns |
590597 ns |
1.00 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI |
19762992.5 ns |
20721086 ns |
0.95 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal |
4815437.5 ns |
4786083 ns |
1.01 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU |
647055 ns |
646557 ns |
1.00 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) |
4417 ns |
4375 ns |
1.01 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) |
4375 ns |
4417 ns |
0.99 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) |
4375 ns |
4375 ns |
1.00 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) |
4375 ns |
4417 ns |
0.99 |
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA |
23422 ns |
23934 ns |
0.98 |
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/oneAPI |
2056687 ns |
2235095.5 ns |
0.92 |
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/Metal |
223584 ns |
221479.5 ns |
1.01 |
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/AMDGPU |
47230 ns |
47181 ns |
1.00 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) |
16500 ns |
16667 ns |
0.99 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) |
16458 ns |
16541 ns |
0.99 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) |
16833 ns |
16709 ns |
1.01 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) |
16250 ns |
16708 ns |
0.97 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA |
328856 ns |
326329 ns |
1.01 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/oneAPI |
12215144.5 ns |
12543391.5 ns |
0.97 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/Metal |
1709708 ns |
1672458 ns |
1.02 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/AMDGPU |
205042 ns |
204152 ns |
1.00 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
2042 ns |
2084 ns |
0.98 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
2125 ns |
2125 ns |
1.00 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
2125 ns |
2083 ns |
1.02 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
2083 ns |
1958 ns |
1.06 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA |
35411 ns |
35852 ns |
0.99 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI |
1204529 ns |
1224950 ns |
0.98 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal |
468083 ns |
293583 ns |
1.59 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU |
202926.5 ns |
203142 ns |
1.00 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
16875.5 ns |
18208 ns |
0.93 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
17083.5 ns |
17187.5 ns |
0.99 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
18312.5 ns |
18041.5 ns |
1.02 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
17812.5 ns |
17021 ns |
1.05 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA |
291655 ns |
291174 ns |
1.00 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI |
20862505 ns |
21237766 ns |
0.98 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal |
5608145.5 ns |
5676396 ns |
0.99 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU |
685221 ns |
684357.5 ns |
1.00 |
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) |
60604.5 ns |
60208.5 ns |
1.01 |
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) |
64375 ns |
62042 ns |
1.04 |
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) |
66625 ns |
65750 ns |
1.01 |
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) |
51125 ns |
51250 ns |
1.00 |
batchedmm(16, Bsize=512)/forward/GPU/CUDA |
66758 ns |
66352.5 ns |
1.01 |
batchedmm(16, Bsize=512)/forward/GPU/AMDGPU |
117881 ns |
112971 ns |
1.04 |
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) |
203042 ns |
188541.5 ns |
1.08 |
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) |
108479 ns |
140250.5 ns |
0.77 |
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) |
153187.5 ns |
124249.5 ns |
1.23 |
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) |
269187.5 ns |
220125 ns |
1.22 |
batchedmm(16, Bsize=512)/zygote/GPU/CUDA |
213243 ns |
213978 ns |
1.00 |
batchedmm(16, Bsize=512)/zygote/GPU/AMDGPU |
616430 ns |
616297 ns |
1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
83708 ns |
84479 ns |
0.99 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
91041 ns |
83666.5 ns |
1.09 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
86250.5 ns |
86167 ns |
1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
83958 ns |
125666 ns |
0.67 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
193430.5 ns |
193270.5 ns |
1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
5188768 ns |
5699293.5 ns |
0.91 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal |
1750270.5 ns |
1963979.5 ns |
0.89 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
217472 ns |
204042 ns |
1.07 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
1915271 ns |
1887292 ns |
1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
1908688 ns |
1916521 ns |
1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
1884250 ns |
1912333 ns |
0.99 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1925896 ns |
1806250 ns |
1.07 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
526075 ns |
528167 ns |
1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
18387490 ns |
24408984.5 ns |
0.75 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
9059833 ns |
9102667 ns |
1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1068719 ns |
1064601.5 ns |
1.00 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) |
291 ns |
292 ns |
1.00 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) |
292 ns |
333 ns |
0.88 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) |
292 ns |
292 ns |
1.00 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) |
291 ns |
250 ns |
1.16 |
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA |
21289 ns |
21230 ns |
1.00 |
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/oneAPI |
2736431 ns |
2190815.5 ns |
1.25 |
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/Metal |
344083 ns |
367541.5 ns |
0.94 |
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/AMDGPU |
41530 ns |
41291 ns |
1.01 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) |
1792 ns |
1792 ns |
1.00 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) |
1792 ns |
1834 ns |
0.98 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) |
1833 ns |
1875 ns |
0.98 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) |
1792 ns |
1792 ns |
1.00 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA |
249820 ns |
249025 ns |
1.00 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/oneAPI |
9441082 ns |
10051558 ns |
0.94 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/Metal |
1569708 ns |
1526271 ns |
1.03 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/AMDGPU |
178101 ns |
182202 ns |
0.98 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
8709 ns |
8583 ns |
1.01 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
8417 ns |
9542 ns |
0.88 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
11583 ns |
10604 ns |
1.09 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
9042 ns |
8125 ns |
1.11 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA |
116492 ns |
117788.5 ns |
0.99 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI |
3343948 ns |
3476276 ns |
0.96 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal |
864458 ns |
921312.5 ns |
0.94 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU |
232712 ns |
232182 ns |
1.00 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
9125 ns |
9000 ns |
1.01 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
8541 ns |
8958 ns |
0.95 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
12333 ns |
11292 ns |
1.09 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
8750 ns |
9145.5 ns |
0.96 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA |
520944 ns |
518629.5 ns |
1.00 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI |
20822847 ns |
19406043 ns |
1.07 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal |
4608167 ns |
4477584 ns |
1.03 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU |
626775 ns |
626986 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
57667 ns |
57458 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
46209 ns |
39875 ns |
1.16 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
46917 ns |
46750 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
83750 ns |
82583 ns |
1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
39308 ns |
39259 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
1295650 ns |
1309251 ns |
0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal |
1112083 ns |
1121542 ns |
0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
78771 ns |
74341 ns |
1.06 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
1919478.5 ns |
1867542 ns |
1.03 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
1959291.5 ns |
1978791 ns |
0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
1979917 ns |
1977229 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1905563 ns |
1853979.5 ns |
1.03 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
217725.5 ns |
219172 ns |
0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
34778309 ns |
32964288 ns |
1.06 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
11556353.5 ns |
11253292 ns |
1.03 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1038339 ns |
1160142 ns |
0.90 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
417958 ns |
419229.5 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
418542 ns |
435958 ns |
0.96 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
421667 ns |
420208 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
417917 ns |
417291.5 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
207355 ns |
208124 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
9387178.5 ns |
8033766 ns |
1.17 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal |
516354 ns |
539333.5 ns |
0.96 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
281432 ns |
280723 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
752313 ns |
718729.5 ns |
1.05 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
756708.5 ns |
670917 ns |
1.13 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
755042 ns |
681646 ns |
1.11 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
714791.5 ns |
671125 ns |
1.07 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
1039645 ns |
1045689 ns |
0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
48331709 ns |
44612818 ns |
1.08 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
7339229.5 ns |
6579583 ns |
1.12 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
900947 ns |
909619.5 ns |
0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
3463292 ns |
3431646 ns |
1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
3411041 ns |
3418041.5 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
3450270.5 ns |
3459666 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
3450729 ns |
3424604 ns |
1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
172028 ns |
172982 ns |
0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
8247685 ns |
8225049 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal |
1400042 ns |
1418875 ns |
0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
423798.5 ns |
438875 ns |
0.97 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
6207666.5 ns |
6211958.5 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
6191917 ns |
6239125 ns |
0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
6203042 ns |
6228166.5 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
6155750 ns |
6164812.5 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
983454.5 ns |
989377 ns |
0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
49424156 ns |
49957898 ns |
0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
7415333.5 ns |
7609083 ns |
0.97 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1640284 ns |
1545101 ns |
1.06 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) |
472625 ns |
470459 ns |
1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) |
341958 ns |
254333 ns |
1.34 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) |
342229.5 ns |
342000 ns |
1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) |
901125 ns |
901833 ns |
1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA |
46305 ns |
45850.5 ns |
1.01 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/oneAPI |
387810 ns |
874511 ns |
0.44 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/Metal |
488417 ns |
485291 ns |
1.01 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/AMDGPU |
242132 ns |
241413 ns |
1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) |
2334541 ns |
2331458 ns |
1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) |
2033792 ns |
1762250 ns |
1.15 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) |
2039166 ns |
2040791.5 ns |
1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) |
3201625 ns |
3281083 ns |
0.98 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA |
263325 ns |
263882 ns |
1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/oneAPI |
8472401 ns |
13135947 ns |
0.64 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/Metal |
2171166 ns |
2243500 ns |
0.97 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/AMDGPU |
769296 ns |
765467.5 ns |
1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
57166 ns |
57083 ns |
1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
46000 ns |
38854.5 ns |
1.18 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
46209 ns |
46125 ns |
1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
83666 ns |
82875 ns |
1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
27911 ns |
28162 ns |
0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
1037054 ns |
1368315 ns |
0.76 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal |
1128875 ns |
1138958 ns |
0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
74385.5 ns |
74570.5 ns |
1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
2031375.5 ns |
2033792 ns |
1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
2049062.5 ns |
2094125 ns |
0.98 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
2082791.5 ns |
2089041.5 ns |
1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
2007792 ns |
2003042 ns |
1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
230367.5 ns |
231932 ns |
0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
36228924 ns |
35712411 ns |
1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
11447958 ns |
11300791.5 ns |
1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1195775 ns |
1044461 ns |
1.14 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
58062.5 ns |
57500 ns |
1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
46667 ns |
39917 ns |
1.17 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
46854.5 ns |
46500 ns |
1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
83583 ns |
82625 ns |
1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
49082 ns |
48905 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
780496 ns |
744836.5 ns |
1.05 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal |
1087542 ns |
1117520.5 ns |
0.97 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
73710 ns |
64946 ns |
1.13 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
1920625 ns |
1922750 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
1979271 ns |
1974334 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
1967333.5 ns |
1956833.5 ns |
1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1873958 ns |
1889708 ns |
0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
235469 ns |
239067 ns |
0.98 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
16628287 ns |
16476478 ns |
1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
9899750 ns |
9755374.5 ns |
1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1045729 ns |
916609 ns |
1.14 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
292 ns |
291 ns |
1.00 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
292 ns |
333 ns |
0.88 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
375 ns |
375 ns |
1.00 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
333 ns |
292 ns |
1.14 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA |
34659 ns |
35081.5 ns |
0.99 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI |
1165228.5 ns |
1290014 ns |
0.90 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal |
274666 ns |
287438 ns |
0.96 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU |
48121 ns |
45840 ns |
1.05 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
6583 ns |
6541 ns |
1.01 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
6584 ns |
6687.5 ns |
0.98 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
7125 ns |
7000 ns |
1.02 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
6708 ns |
6500 ns |
1.03 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA |
205946 ns |
205115.5 ns |
1.00 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI |
20242299 ns |
20319441.5 ns |
1.00 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal |
5155666.5 ns |
5303083 ns |
0.97 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU |
368933 ns |
367174 ns |
1.00 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) |
291 ns |
250 ns |
1.16 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) |
250 ns |
292 ns |
0.86 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) |
292 ns |
292 ns |
1.00 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) |
291 ns |
291 ns |
1.00 |
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA |
31792 ns |
31894 ns |
1.00 |
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/oneAPI |
1265965 ns |
1192240 ns |
1.06 |
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/Metal |
256250 ns |
254292 ns |
1.01 |
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/AMDGPU |
38370 ns |
36310 ns |
1.06 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) |
2875 ns |
3334 ns |
0.86 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) |
2833 ns |
2958 ns |
0.96 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) |
3042 ns |
3167 ns |
0.96 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) |
2959 ns |
2958 ns |
1.00 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA |
186536 ns |
185317.5 ns |
1.01 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/oneAPI |
7354138 ns |
7518628 ns |
0.98 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/Metal |
1072208 ns |
1115709 ns |
0.96 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/AMDGPU |
157402 ns |
149472 ns |
1.05 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
455958 ns |
422083 ns |
1.08 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
420603.5 ns |
423833 ns |
0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
425479.5 ns |
427834 ns |
0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
423104.5 ns |
424937.5 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
137903 ns |
137292 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
5954417 ns |
5779699.5 ns |
1.03 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal |
2040521 ns |
2076458 ns |
0.98 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
364623 ns |
366143.5 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
3804583.5 ns |
3813229.5 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
3812292 ns |
3824249.5 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
3795875 ns |
3788084 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
3812541.5 ns |
3812042 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
705085 ns |
705310 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
32654529 ns |
31262641 ns |
1.04 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
11097208 ns |
10824937.5 ns |
1.03 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1465927 ns |
1464005 ns |
1.00 |
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) |
49850208 ns |
49892959 ns |
1.00 |
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) |
35526250 ns |
26011834 ns |
1.37 |
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) |
35528084 ns |
35523145.5 ns |
1.00 |
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) |
97085042 ns |
97645833 ns |
0.99 |
batchedmm(512, Bsize=32)/forward/GPU/CUDA |
1598092 ns |
1616287 ns |
0.99 |
batchedmm(512, Bsize=32)/forward/GPU/AMDGPU |
1049079 ns |
1048102 ns |
1.00 |
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) |
154749104.5 ns |
154680021 ns |
1.00 |
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) |
112480041.5 ns |
88850291.5 ns |
1.27 |
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) |
112311583 ns |
112398500 ns |
1.00 |
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) |
295479854.5 ns |
298306271 ns |
0.99 |
batchedmm(512, Bsize=32)/zygote/GPU/CUDA |
6475380 ns |
6498761 ns |
1.00 |
batchedmm(512, Bsize=32)/zygote/GPU/AMDGPU |
5567482 ns |
5545318 ns |
1.00 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) |
20395.5 ns |
19937.5 ns |
1.02 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) |
16708 ns |
15167 ns |
1.10 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) |
17083.5 ns |
17041.5 ns |
1.00 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) |
15125 ns |
14792 ns |
1.02 |
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA |
21066 ns |
20017 ns |
1.05 |
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/oneAPI |
1101176 ns |
1149888 ns |
0.96 |
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/Metal |
215250 ns |
229541 ns |
0.94 |
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/AMDGPU |
25670 ns |
27001 ns |
0.95 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) |
11166 ns |
10417 ns |
1.07 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) |
9021 ns |
7250 ns |
1.24 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) |
9271 ns |
9104 ns |
1.02 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) |
17167 ns |
17375 ns |
0.99 |
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA |
257676 ns |
257217 ns |
1.00 |
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/oneAPI |
10263846 ns |
9674368 ns |
1.06 |
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/Metal |
1588958 ns |
1641396 ns |
0.97 |
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/AMDGPU |
147461 ns |
147861 ns |
1.00 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
8750 ns |
8063 ns |
1.09 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
8916 ns |
9125 ns |
0.98 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
11167 ns |
10667 ns |
1.05 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
8250 ns |
8917 ns |
0.93 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA |
122695.5 ns |
114750.5 ns |
1.07 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI |
3611838 ns |
3651219 ns |
0.99 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal |
810334 ns |
861125 ns |
0.94 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU |
234252 ns |
233283 ns |
1.00 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
9875 ns |
9792 ns |
1.01 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
10042 ns |
10750 ns |
0.93 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
10500 ns |
10917 ns |
0.96 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
9666.5 ns |
10271 ns |
0.94 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA |
614564 ns |
614307 ns |
1.00 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI |
22515968 ns |
28192305 ns |
0.80 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal |
5300125.5 ns |
5310750 ns |
1.00 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU |
650845 ns |
649747 ns |
1.00 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
10375 ns |
9708 ns |
1.07 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
10208 ns |
10000 ns |
1.02 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
10937.5 ns |
11541 ns |
0.95 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
8958 ns |
9584 ns |
0.93 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA |
119909.5 ns |
119206 ns |
1.01 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI |
3438435.5 ns |
3481764 ns |
0.99 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal |
940042 ns |
937459 ns |
1.00 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU |
71350 ns |
72050 ns |
0.99 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
15166 ns |
17479.5 ns |
0.87 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
17249.5 ns |
14375 ns |
1.20 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
16500 ns |
15125 ns |
1.09 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
15625 ns |
14667 ns |
1.07 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA |
588113 ns |
586931 ns |
1.00 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI |
18956216 ns |
19607421 ns |
0.97 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal |
4795145.5 ns |
4735125 ns |
1.01 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU |
345433 ns |
343533 ns |
1.01 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
500 ns |
500 ns |
1.00 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
541 ns |
584 ns |
0.93 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
583 ns |
584 ns |
1.00 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
458 ns |
459 ns |
1.00 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA |
34662 ns |
34228 ns |
1.01 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI |
1200054.5 ns |
1215476 ns |
0.99 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal |
277750.5 ns |
314188 ns |
0.88 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU |
204216.5 ns |
203452 ns |
1.00 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
7834 ns |
9334 ns |
0.84 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
8209 ns |
8604.5 ns |
0.95 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
9291.5 ns |
9041 ns |
1.03 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
8625 ns |
8250 ns |
1.05 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA |
231220.5 ns |
230655 ns |
1.00 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI |
21949133 ns |
22072831 ns |
0.99 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal |
5504979 ns |
5460541 ns |
1.01 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU |
657786 ns |
654892 ns |
1.00 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) |
17625 ns |
17375 ns |
1.01 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) |
15458 ns |
14792 ns |
1.05 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) |
15667 ns |
16000 ns |
0.98 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) |
10375 ns |
10458 ns |
0.99 |
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA |
21715 ns |
21718 ns |
1.00 |
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/oneAPI |
1135978 ns |
1102903 ns |
1.03 |
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/Metal |
207667 ns |
208666 ns |
1.00 |
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/AMDGPU |
185231 ns |
184622 ns |
1.00 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) |
31895.5 ns |
31542 ns |
1.01 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) |
32312.5 ns |
32000 ns |
1.01 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) |
32187.5 ns |
32208 ns |
1.00 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) |
32250 ns |
32354.5 ns |
1.00 |
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA |
272921 ns |
271707 ns |
1.00 |
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/oneAPI |
11269975 ns |
10769694 ns |
1.05 |
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/Metal |
1796334 ns |
1820875 ns |
0.99 |
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/AMDGPU |
590985 ns |
588176 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
487000 ns |
452584 ns |
1.08 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
446417 ns |
441979.5 ns |
1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
443208.5 ns |
467167 ns |
0.95 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
445125 ns |
438521 ns |
1.02 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
194886.5 ns |
194827 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
5952118 ns |
5920885 ns |
1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal |
1965209 ns |
1997667 ns |
0.98 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
374993 ns |
368184 ns |
1.02 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
3824209 ns |
3829250 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
3832854 ns |
3838292 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
3826979.5 ns |
3802021 ns |
1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
3837854 ns |
3830584 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
537990 ns |
544632 ns |
0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
27884609 ns |
28778535 ns |
0.97 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
9399354 ns |
9720812.5 ns |
0.97 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1357231 ns |
1358284 ns |
1.00 |
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) |
783420645.5 ns |
831986833 ns |
0.94 |
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) |
542712125 ns |
416264500 ns |
1.30 |
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) |
544393542 ns |
543217708 ns |
1.00 |
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) |
1506067228.5 ns |
1509789750 ns |
1.00 |
batchedmm(512, Bsize=512)/forward/GPU/CUDA |
22742302 ns |
22539644.5 ns |
1.01 |
batchedmm(512, Bsize=512)/forward/GPU/AMDGPU |
14706275 ns |
14678121 ns |
1.00 |
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) |
2515795250 ns |
3779013833 ns |
0.67 |
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) |
1789560125 ns |
1885743917 ns |
0.95 |
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) |
3174171250 ns |
1788587042 ns |
1.77 |
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) |
4755520667 ns |
4810183875 ns |
0.99 |
batchedmm(512, Bsize=512)/zygote/GPU/CUDA |
365925509 ns |
364565745 ns |
1.00 |
batchedmm(512, Bsize=512)/zygote/GPU/AMDGPU |
88872223 ns |
88375525 ns |
1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
78271 ns |
75520.5 ns |
1.04 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
77042 ns |
76416.5 ns |
1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
79750 ns |
79958.5 ns |
1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
76333 ns |
78625 ns |
0.97 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
206560.5 ns |
207155.5 ns |
1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
7729887 ns |
7714255 ns |
1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal |
524875 ns |
534709 ns |
0.98 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
118801 ns |
106301.5 ns |
1.12 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
278708.5 ns |
235667 ns |
1.18 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
193229.5 ns |
283229.5 ns |
0.68 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
282667 ns |
247208 ns |
1.14 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
193562 ns |
210874.5 ns |
0.92 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
1040522 ns |
1048818 ns |
0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
42904324 ns |
44375934 ns |
0.97 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
6113958 ns |
6248084 ns |
0.98 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
659445 ns |
631246 ns |
1.04 |
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) |
199659188 ns |
199488333 ns |
1.00 |
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) |
138835417 ns |
103922541.5 ns |
1.34 |
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) |
139066292 ns |
139224666 ns |
1.00 |
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) |
389188917 ns |
393811292 ns |
0.99 |
batchedmm(512, Bsize=128)/forward/GPU/CUDA |
5820554 ns |
5835255 ns |
1.00 |
batchedmm(512, Bsize=128)/forward/GPU/AMDGPU |
3565645.5 ns |
3578582 ns |
1.00 |
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) |
620296312.5 ns |
620321291.5 ns |
1.00 |
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) |
440668667 ns |
354710917 ns |
1.24 |
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) |
438212625 ns |
440219958 ns |
1.00 |
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) |
1178525042 ns |
1185414250 ns |
0.99 |
batchedmm(512, Bsize=128)/zygote/GPU/CUDA |
26711415.5 ns |
26495134 ns |
1.01 |
batchedmm(512, Bsize=128)/zygote/GPU/AMDGPU |
22018976 ns |
22065145 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
7375 ns |
7417 ns |
0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
6042 ns |
5417 ns |
1.12 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
6250 ns |
6292 ns |
0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
10042 ns |
10145.5 ns |
0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
27655.5 ns |
27466 ns |
1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
1222067 ns |
1213453.5 ns |
1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal |
582458 ns |
432833 ns |
1.35 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
48381 ns |
47620 ns |
1.02 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
213583 ns |
213000 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
220395.5 ns |
223041 ns |
0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
220604 ns |
220917 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
208021 ns |
206896 ns |
1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
219078 ns |
223324 ns |
0.98 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
31953958 ns |
31525343 ns |
1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
9203250.5 ns |
9133958 ns |
1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
526564 ns |
524095 ns |
1.00 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
10375 ns |
8854.5 ns |
1.17 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
8625 ns |
9312.5 ns |
0.93 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
11271 ns |
10583 ns |
1.07 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
7375 ns |
9625 ns |
0.77 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA |
117396 ns |
116401 ns |
1.01 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI |
3513776 ns |
3333892 ns |
1.05 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal |
877687.5 ns |
911750 ns |
0.96 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU |
69271 ns |
69370 ns |
1.00 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
9333 ns |
7437.5 ns |
1.25 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
11458 ns |
8854 ns |
1.29 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
8125 ns |
7959 ns |
1.02 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
9458.5 ns |
9145.5 ns |
1.03 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA |
518767 ns |
515224 ns |
1.01 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI |
20215136 ns |
18606821 ns |
1.09 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal |
4550083 ns |
4708917 ns |
0.97 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU |
317923 ns |
318334 ns |
1.00 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
417 ns |
375 ns |
1.11 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
459 ns |
709 ns |
0.65 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
542 ns |
500 ns |
1.08 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
625 ns |
500 ns |
1.25 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA |
26366 ns |
25690 ns |
1.03 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI |
1195376 ns |
1183861 ns |
1.01 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal |
477666.5 ns |
493792 ns |
0.97 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU |
46620 ns |
46791 ns |
1.00 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
8459 ns |
9000 ns |
0.94 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
8770.5 ns |
10791.5 ns |
0.81 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
9749.5 ns |
9854.5 ns |
0.99 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
10500 ns |
10042 ns |
1.05 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA |
252367 ns |
251338.5 ns |
1.00 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI |
23130848 ns |
23713128.5 ns |
0.98 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal |
5955833.5 ns |
6062250 ns |
0.98 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU |
388573 ns |
386044 ns |
1.01 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) |
107187 ns |
107354.5 ns |
1.00 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) |
98312 ns |
84667 ns |
1.16 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) |
100917 ns |
100375 ns |
1.01 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) |
146541 ns |
146729.5 ns |
1.00 |
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA |
24461 ns |
24618 ns |
0.99 |
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/oneAPI |
1106973 ns |
1206806.5 ns |
0.92 |
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/Metal |
268041.5 ns |
266292 ns |
1.01 |
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/AMDGPU |
190491 ns |
190862 ns |
1.00 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) |
500833 ns |
478500 ns |
1.05 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) |
513042 ns |
492271 ns |
1.04 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) |
505270.5 ns |
481000 ns |
1.05 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) |
478625 ns |
479145.5 ns |
1.00 |
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA |
230476 ns |
230580 ns |
1.00 |
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/oneAPI |
11382686 ns |
11914566 ns |
0.96 |
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/Metal |
2119250.5 ns |
2188458.5 ns |
0.97 |
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/AMDGPU |
606815 ns |
605276 ns |
1.00 |
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) |
5750 ns |
6042 ns |
0.95 |
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) |
5770.5 ns |
7000 ns |
0.82 |
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) |
7166 ns |
7583 ns |
0.95 |
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) |
4062.5 ns |
6000 ns |
0.68 |
batchedmm(16, Bsize=32)/forward/GPU/CUDA |
15861 ns |
16947 ns |
0.94 |
batchedmm(16, Bsize=32)/forward/GPU/AMDGPU |
79281 ns |
79345.5 ns |
1.00 |
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) |
11375 ns |
12062.5 ns |
0.94 |
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) |
10791.5 ns |
10542 ns |
1.02 |
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) |
11125 ns |
10917 ns |
1.02 |
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) |
17375 ns |
18208 ns |
0.95 |
batchedmm(16, Bsize=32)/zygote/GPU/CUDA |
210987.5 ns |
212062.5 ns |
0.99 |
batchedmm(16, Bsize=32)/zygote/GPU/AMDGPU |
375968 ns |
367674 ns |
1.02 |
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) |
39958 ns |
39750 ns |
1.01 |
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) |
51000 ns |
50708 ns |
1.01 |
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) |
52854.5 ns |
52625 ns |
1.00 |
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) |
13625 ns |
13750 ns |
0.99 |
batchedmm(16, Bsize=128)/forward/GPU/CUDA |
21619 ns |
19888.5 ns |
1.09 |
batchedmm(16, Bsize=128)/forward/GPU/AMDGPU |
86201 ns |
87991 ns |
0.98 |
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) |
36083 ns |
36500 ns |
0.99 |
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) |
30937.5 ns |
28959 ns |
1.07 |
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) |
30937 ns |
31500 ns |
0.98 |
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) |
58145.5 ns |
58583 ns |
0.99 |
batchedmm(16, Bsize=128)/zygote/GPU/CUDA |
191196.5 ns |
190552 ns |
1.00 |
batchedmm(16, Bsize=128)/zygote/GPU/AMDGPU |
397084 ns |
413955 ns |
0.96 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) |
1708 ns |
1750 ns |
0.98 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) |
1833 ns |
1937.5 ns |
0.95 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) |
2208 ns |
2125 ns |
1.04 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) |
1708 ns |
1792 ns |
0.95 |
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA |
20510 ns |
20369 ns |
1.01 |
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/oneAPI |
1162226 ns |
1137759 ns |
1.02 |
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/Metal |
310521 ns |
312000 ns |
1.00 |
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/AMDGPU |
33601 ns |
32711 ns |
1.03 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) |
2125 ns |
2250 ns |
0.94 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) |
2084 ns |
2396 ns |
0.87 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) |
2250 ns |
2333 ns |
0.96 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) |
2125 ns |
2250 ns |
0.94 |
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA |
202165 ns |
201543.5 ns |
1.00 |
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/oneAPI |
9408568.5 ns |
9195441 ns |
1.02 |
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/Metal |
1661458 ns |
1575208 ns |
1.05 |
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/AMDGPU |
135986.5 ns |
136711 ns |
0.99 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
6187.5 ns |
4562.5 ns |
1.36 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
4708 ns |
4708.5 ns |
1.00 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
6125 ns |
6834 ns |
0.90 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
4917 ns |
5125 ns |
0.96 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA |
142748.5 ns |
144149.5 ns |
0.99 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI |
5789328 ns |
5753580 ns |
1.01 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal |
742959 ns |
707854 ns |
1.05 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU |
70511 ns |
69031 ns |
1.02 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
8333 ns |
8167 ns |
1.02 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
8500 ns |
9250 ns |
0.92 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
8542 ns |
8667 ns |
0.99 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
9375 ns |
9209 ns |
1.02 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA |
867834 ns |
867994 ns |
1.00 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI |
39735735.5 ns |
37396018.5 ns |
1.06 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal |
5582541.5 ns |
5747500 ns |
0.97 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU |
384434 ns |
386354 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
56709 ns |
56917 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
57708 ns |
56875 ns |
1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
57667 ns |
57833 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
58209 ns |
58125 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
37190 ns |
37109 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
1219055 ns |
1131214.5 ns |
1.08 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal |
385729.5 ns |
421167 ns |
0.92 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
215301 ns |
203222.5 ns |
1.06 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
447916 ns |
451020.5 ns |
0.99 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
463563 ns |
475979 ns |
0.97 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
464458.5 ns |
465354 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
433979 ns |
487041.5 ns |
0.89 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
263643 ns |
264507 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
26923755 ns |
28501147 ns |
0.94 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
8401250 ns |
7943604 ns |
1.06 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
823927 ns |
830424 ns |
0.99 |
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) | 3319417 ns | 3311000 ns | 1.00 |
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) | 2332542 ns | 1770250 ns | 1.32 |
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) | 2335291 ns | 2337729.5 ns | 1.00 |
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) | 6329458 ns | 6302417 ns | 1.00 |
batchedmm(128, Bsize=128)/forward/GPU/CUDA | 205595 ns | 204131.5 ns | 1.01 |
batchedmm(128, Bsize=128)/forward/GPU/AMDGPU | 208502 ns | 211992 ns | 0.98 |
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) | 11418312.5 ns | 11485250 ns | 0.99 |
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) | 8313229.5 ns | 6571812.5 ns | 1.26 |
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) | 8312625.5 ns | 8309250 ns | 1.00 |
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) | 21175145.5 ns | 21151875.5 ns | 1.00 |
batchedmm(128, Bsize=128)/zygote/GPU/CUDA | 735748 ns | 735481 ns | 1.00 |
batchedmm(128, Bsize=128)/zygote/GPU/AMDGPU | 1053319 ns | 1057071 ns | 1.00 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 5333 ns | 5125 ns | 1.04 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 5145.5 ns | 5375 ns | 0.96 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 6770.5 ns | 7125 ns | 0.95 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 5250 ns | 6208.5 ns | 0.85 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 135858 ns | 137212.5 ns | 0.99 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI | 5802378 ns | 5624260 ns | 1.03 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal | 848125 ns | 793500 ns | 1.07 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU | 56221 ns | 56010 ns | 1.00 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7375 ns | 7000 ns | 1.05 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7250 ns | 7500 ns | 0.97 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7292 ns | 7458 ns | 0.98 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7416 ns | 9083 ns | 0.82 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 750240.5 ns | 754137 ns | 0.99 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI | 36356150 ns | 34576213 ns | 1.05 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal | 5157167 ns | 5244167 ns | 0.98 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 368223 ns | 366813 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 102291 ns | 103250 ns | 0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 94459 ns | 103875 ns | 0.91 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 98875 ns | 125291 ns | 0.79 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 95417 ns | 101042 ns | 0.94 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 150934 ns | 151348 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 6128379 ns | 6050689.5 ns | 1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 2030146 ns | 2052375 ns | 0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 203502 ns | 203192 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2023916 ns | 2018375 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2024250 ns | 2029000 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2027646 ns | 2023521 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2034083 ns | 1991417 ns | 1.02 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 701876 ns | 703391 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 32467383 ns | 31442085 ns | 1.03 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 10818208 ns | 11046312.5 ns | 0.98 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1249360 ns | 1250762 ns | 1.00 |
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) | 35292 ns | 34667 ns | 1.02 |
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) | 35875 ns | 34750 ns | 1.03 |
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) | 35083 ns | 35041.5 ns | 1.00 |
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) | 791 ns | 646 ns | 1.22 |
batchedmm(2, Bsize=4)/forward/GPU/CUDA | 15147 ns | 15242 ns | 0.99 |
batchedmm(2, Bsize=4)/forward/GPU/AMDGPU | 79081 ns | 79571 ns | 0.99 |
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) | 2645.5 ns | 2729.5 ns | 0.97 |
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) | 2709 ns | 2917 ns | 0.93 |
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) | 2875 ns | 3000 ns | 0.96 |
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) | 2083 ns | 2208 ns | 0.94 |
batchedmm(2, Bsize=4)/zygote/GPU/CUDA | 138748 ns | 139866 ns | 0.99 |
batchedmm(2, Bsize=4)/zygote/GPU/AMDGPU | 340393 ns | 342158.5 ns | 0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7167 ns | 7167 ns | 1 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5958 ns | 5417 ns | 1.10 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6042 ns | 6084 ns | 0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10041 ns | 10042 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 36394 ns | 36552 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 1208263.5 ns | 1221281.5 ns | 0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 351562.5 ns | 674708 ns | 0.52 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 47980 ns | 48261 ns | 0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 212958.5 ns | 213624.5 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 220083.5 ns | 221166.5 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 220313 ns | 220812.5 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 205916 ns | 205833 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 241977 ns | 243393.5 ns | 0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 27639627 ns | 25870086.5 ns | 1.07 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 8048833 ns | 7741583 ns | 1.04 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 571310 ns | 575566 ns | 0.99 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3917 ns | 3958 ns | 0.99 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3917 ns | 3959 ns | 0.99 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3917 ns | 3958 ns | 0.99 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3958 ns | 3958 ns | 1 |
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA | 21257 ns | 21563 ns | 0.99 |
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/oneAPI | 2138204 ns | 2027782.5 ns | 1.05 |
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/Metal | 243542 ns | 250542 ns | 0.97 |
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/AMDGPU | 43251 ns | 43640 ns | 0.99 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 14875 ns | 14917 ns | 1.00 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 14834 ns | 14791 ns | 1.00 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 14917 ns | 14958 ns | 1.00 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 14667 ns | 14917 ns | 0.98 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA | 307482 ns | 306375 ns | 1.00 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/oneAPI | 11303877 ns | 11210297 ns | 1.01 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/Metal | 1023584 ns | 1037625 ns | 0.99 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/AMDGPU | 191196 ns | 194327 ns | 0.98 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 107000 ns | 105583 ns | 1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 100833 ns | 106167 ns | 0.95 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 104000 ns | 124875 ns | 0.83 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 101417 ns | 102583 ns | 0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 146441 ns | 139877 ns | 1.05 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 6052419 ns | 5810927 ns | 1.04 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 2004875 ns | 2048416 ns | 0.98 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 204652 ns | 208802 ns | 0.98 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1922750 ns | 1878500 ns | 1.02 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1900416.5 ns | 1927583.5 ns | 0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1923479 ns | 1867521 ns | 1.03 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1924771 ns | 1917937.5 ns | 1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 688454 ns | 684487.5 ns | 1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 31437331 ns | 30087516 ns | 1.04 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 10713709 ns | 10640458 ns | 1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1150920 ns | 1063341 ns | 1.08 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 20792 ns | 17583 ns | 1.18 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 18688 ns | 19500 ns | 0.96 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 20833 ns | 20708 ns | 1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 17542 ns | 18791 ns | 0.93 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 108224.5 ns | 109550 ns | 0.99 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 3604915 ns | 3331480 ns | 1.08 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1331375 ns | 1318708 ns | 1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 79140 ns | 80701 ns | 0.98 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 216291.5 ns | 216271 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 221625 ns | 222292 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 217000.5 ns | 217916 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 215666.5 ns | 216167 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 517432 ns | 516519 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 18685063.5 ns | 19724665.5 ns | 0.95 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6219542 ns | 6017791.5 ns | 1.03 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 476044 ns | 477585 ns | 1.00 |
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) | 26292 ns | 26583 ns | 0.99 |
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) | 30000 ns | 28770.5 ns | 1.04 |
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) | 28875 ns | 29104 ns | 0.99 |
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) | 1292 ns | 1334 ns | 0.97 |
batchedmm(16, Bsize=4)/forward/GPU/CUDA | 15962 ns | 15984 ns | 1.00 |
batchedmm(16, Bsize=4)/forward/GPU/AMDGPU | 80911 ns | 81921 ns | 0.99 |
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) | 4833.5 ns | 4833.5 ns | 1 |
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) | 5000.5 ns | 4833 ns | 1.03 |
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) | 5354.5 ns | 5208.5 ns | 1.03 |
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) | 4209 ns | 4333 ns | 0.97 |
batchedmm(16, Bsize=4)/zygote/GPU/CUDA | 205690.5 ns | 206128 ns | 1.00 |
batchedmm(16, Bsize=4)/zygote/GPU/AMDGPU | 379593 ns | 379654 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 305875 ns | 305792 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 305541 ns | 306042 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 307729 ns | 306833 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 306916 ns | 307083 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 227217 ns | 227988.5 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 7859423.5 ns | 7778230 ns | 1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 633291.5 ns | 1241125 ns | 0.51 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 272362 ns | 272793 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 534458 ns | 535708 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 597417 ns | 533084 ns | 1.12 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 590709 ns | 538208 ns | 1.10 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 530417 ns | 530917 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1072354 ns | 1080430 ns | 0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 42952925 ns | 42644591.5 ns | 1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6417187.5 ns | 6182083 ns | 1.04 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 865447 ns | 851073.5 ns | 1.02 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 21875 ns | 19125 ns | 1.14 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 20042 ns | 20624.5 ns | 0.97 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 22000 ns | 21458 ns | 1.03 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 19750 ns | 20000 ns | 0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 112832 ns | 112864 ns | 1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 3590647 ns | 3473281 ns | 1.03 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1450250 ns | 1444854 ns | 1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 79501 ns | 80611 ns | 0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 214292 ns | 220167 ns | 0.97 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 223625 ns | 222791.5 ns | 1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 215459 ns | 214771 ns | 1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 212459 ns | 212625 ns | 1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 736403 ns | 737028 ns | 1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 25495544 ns | 25214419 ns | 1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 7182729 ns | 7109375 ns | 1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 535594 ns | 531685 ns | 1.01 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 6667 ns | 5916 ns | 1.13 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 7375 ns | 7083 ns | 1.04 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 8125 ns | 8604.5 ns | 0.94 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 6333 ns | 6500 ns | 0.97 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 138851.5 ns | 140088 ns | 0.99 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI | 5554452.5 ns | 5562789 ns | 1.00 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal | 874521 ns | 803937.5 ns | 1.09 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU | 65381 ns | 64661 ns | 1.01 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 9917 ns | 10000 ns | 0.99 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 10562.5 ns | 10937.5 ns | 0.97 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 10938 ns | 10750 ns | 1.02 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 10334 ns | 10041 ns | 1.03 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 820981 ns | 822803 ns | 1.00 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI | 36828865 ns | 36817844 ns | 1.00 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal | 5371520.5 ns | 5484583 ns | 0.98 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 382893 ns | 382033 ns | 1.00 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 5917 ns | 4334 ns | 1.37 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 5042 ns | 5291 ns | 0.95 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 6041 ns | 7333 ns | 0.82 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 4792 ns | 5584 ns | 0.86 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 142980.5 ns | 142901.5 ns | 1.00 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI | 5709761 ns | 5758977.5 ns | 0.99 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal | 854000 ns | 800458 ns | 1.07 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 68440 ns | 66271 ns | 1.03 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7375 ns | 7208 ns | 1.02 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7604.5 ns | 7646 ns | 0.99 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7833 ns | 7750 ns | 1.01 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7542 ns | 7583 ns | 0.99 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 780235.5 ns | 782456.5 ns | 1.00 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI | 37248785 ns | 39501262 ns | 0.94 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal | 5909667 ns | 6034250 ns | 0.98 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 394663.5 ns | 392794 ns | 1.00 |
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) | 14461375 ns | 14539375 ns | 0.99 |
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) | 10111500 ns | 7723291.5 ns | 1.31 |
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) | 10115583 ns | 10145625 ns | 1.00 |
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) | 27830500 ns | 27763416 ns | 1.00 |
batchedmm(128, Bsize=512)/forward/GPU/CUDA | 529523 ns | 554910 ns | 0.95 |
batchedmm(128, Bsize=512)/forward/GPU/AMDGPU | 393383 ns | 393434 ns | 1.00 |
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) | 46327458.5 ns | 46429208.5 ns | 1.00 |
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) | 33413250 ns | 26609416 ns | 1.26 |
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) | 33418167 ns | 33517458 ns | 1.00 |
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) | 85714292 ns | 85405667 ns | 1.00 |
batchedmm(128, Bsize=512)/zygote/GPU/CUDA | 2641351 ns | 2664805 ns | 0.99 |
batchedmm(128, Bsize=512)/zygote/GPU/AMDGPU | 3305837.5 ns | 3291838.5 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 67750 ns | 66292 ns | 1.02 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 67188 ns | 67875 ns | 0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 68625 ns | 68250 ns | 1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 66125 ns | 65917 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 120297 ns | 119249 ns | 1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 3401525.5 ns | 3647654 ns | 0.93 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1454375 ns | 1440312.5 ns | 1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 233432 ns | 232702 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 442208 ns | 441250 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 440500 ns | 441625 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 450791 ns | 447167 ns | 1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 440250 ns | 441478.5 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 724096 ns | 727144.5 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 27244425 ns | 26208342 ns | 1.04 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 7830291 ns | 7477375 ns | 1.05 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 787496 ns | 793922.5 ns | 0.99 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 500 ns | 500 ns | 1 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 500 ns | 584 ns | 0.86 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 625 ns | 625 ns | 1 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 542 ns | 583 ns | 0.93 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 31907 ns | 31836 ns | 1.00 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI | 1176105 ns | 1180672 ns | 1.00 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal | 456084 ns | 286667 ns | 1.59 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 48991 ns | 47841 ns | 1.02 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 8833 ns | 9458 ns | 0.93 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 9166 ns | 9271 ns | 0.99 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 9792 ns | 9750 ns | 1.00 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 9417 ns | 9416 ns | 1.00 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 283657 ns | 283587 ns | 1.00 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 20890296.5 ns | 22547365 ns | 0.93 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal | 5446750 ns | 5502666.5 ns | 0.99 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 375983 ns | 374188.5 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 9792 ns | 9792 ns | 1 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 9833 ns | 9833 ns | 1 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 9833 ns | 9875 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 9834 ns | 9875 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA | 22906 ns | 22851 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/oneAPI | 2026316 ns | 2120178 ns | 0.96 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/Metal | 223375 ns | 221333 ns | 1.01 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/AMDGPU | 208602 ns | 207772 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 45917 ns | 46167 ns | 0.99 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 46125 ns | 46083 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 46166 ns | 46417 ns | 0.99 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 45667 ns | 46062.5 ns | 0.99 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA | 288163 ns | 287950 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/oneAPI | 11404659.5 ns | 12273456 ns | 0.93 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/Metal | 1352041 ns | 1033833.5 ns | 1.31 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/AMDGPU | 600205 ns | 600566 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 56250 ns | 56167 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 57042 ns | 56875 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 57167 ns | 57166 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 58000 ns | 57875 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 28634 ns | 28495 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 1199631 ns | 1157087.5 ns | 1.04 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 678271 ns | 660125 ns | 1.03 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 202202 ns | 202572 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 456521 ns | 448229 ns | 1.02 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 464562.5 ns | 464979 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 473020.5 ns | 472292 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 434104 ns | 474437.5 ns | 0.91 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 243054 ns | 244496.5 ns | 0.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 32218901.5 ns | 33157318.5 ns | 0.97 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 9969250 ns | 9248750 ns | 1.08 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 881277 ns | 888349 ns | 0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 651521 ns | 614125 ns | 1.06 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 642167 ns | 648750 ns | 0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 608000 ns | 652521 ns | 0.93 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 619354.5 ns | 642542 ns | 0.96 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 204222 ns | 208606.5 ns | 0.98 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 8182516.5 ns | 7841403 ns | 1.04 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1372521 ns | 1401250 ns | 0.98 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 304733 ns | 305493 ns | 1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2224041.5 ns | 2245937.5 ns | 0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2229479 ns | 2247291 ns | 0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2229542 ns | 2238062.5 ns | 1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2244687 ns | 2241541 ns | 1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 967724 ns | 971988 ns | 1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 49487921 ns | 48958299 ns | 1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 7073541 ns | 7597458.5 ns | 0.93 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1316971.5 ns | 1213901.5 ns | 1.08 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 21333 ns | 19333 ns | 1.10 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 20895.5 ns | 21646 ns | 0.97 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 23000 ns | 21833 ns | 1.05 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 19042 ns | 24291 ns | 0.78 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 112093 ns | 111706.5 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 3575165.5 ns | 3500994.5 ns | 1.02 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1370042 ns | 1437895.5 ns | 0.95 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 80870.5 ns | 79141 ns | 1.02 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 221250 ns | 219459 ns | 1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 253917 ns | 219791.5 ns | 1.16 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 228958 ns | 222104.5 ns | 1.03 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 219542 ns | 219875 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 727914 ns | 728212.5 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 27540388.5 ns | 26675294 ns | 1.03 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 7698208 ns | 7278312 ns | 1.06 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 552954 ns | 555140 ns | 1.00 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 500 ns | 500 ns | 1 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 500 ns | 584 ns | 0.86 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 625 ns | 667 ns | 0.94 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 500 ns | 583 ns | 0.86 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 22957 ns | 22972 ns | 1.00 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI | 1275000 ns | 1186538 ns | 1.07 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal | 478875.5 ns | 461542 ns | 1.04 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 47341 ns | 49541 ns | 0.96 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 9792 ns | 9750 ns | 1.00 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 9333 ns | 9333.5 ns | 1.00 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 9291.5 ns | 9896 ns | 0.94 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 9104 ns | 10000 ns | 0.91 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 266060 ns | 265448 ns | 1.00 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 24079846.5 ns | 24827341.5 ns | 0.97 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal | 6178000 ns | 6076333 ns | 1.02 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 397054 ns | 415154 ns | 0.96 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 9000 ns | 7917 ns | 1.14 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 8313 ns | 10208 ns | 0.81 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 11229.5 ns | 10542 ns | 1.07 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 7791 ns | 9292 ns | 0.84 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 119220 ns | 118520 ns | 1.01 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI | 3267882.5 ns | 3378687 ns | 0.97 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal | 897312 ns | 891583 ns | 1.01 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU | 68721 ns | 75371 ns | 0.91 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7312.5 ns | 7291.5 ns | 1.00 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7292 ns | 7875 ns | 0.93 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7916 ns | 7833.5 ns | 1.01 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7437.5 ns | 7708 ns | 0.96 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 506373 ns | 503824 ns | 1.01 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI | 18976858 ns | 17507211 ns | 1.08 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal | 4417625 ns | 4534375 ns | 0.97 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 319133 ns | 318933 ns | 1.00 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 1583 ns | 1437.5 ns | 1.10 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 1479.5 ns | 1667 ns | 0.89 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 1958 ns | 1917 ns | 1.02 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 1459 ns | 1417 ns | 1.03 |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA | 21067 ns | 21272 ns | 0.99 |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/oneAPI | 1168198 ns | 1191094 ns | 0.98 |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/Metal | 303833.5 ns | 307229 ns | 0.99 |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/AMDGPU | 188261 ns | 189132 ns | 1.00 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 3208 ns | 3292 ns | 0.97 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 3459 ns | 3333 ns | 1.04 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 3375 ns | 3500 ns | 0.96 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 3250 ns | 3500 ns | 0.93 |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA | 219575 ns | 216668.5 ns | 1.01 |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/oneAPI | 10655031 ns | 10523301.5 ns | 1.01 |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/Metal | 1792187.5 ns | 1655750 ns | 1.08 |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/AMDGPU | 578625 ns | 579466 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) | 148104 ns | 148229.5 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) | 127729 ns | 106166.5 ns | 1.20 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) | 129479 ns | 129250 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) | 225625 ns | 225167 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA | 23937 ns | 23640 ns | 1.01 |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/oneAPI | 1210144 ns | 1169047 ns | 1.04 |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/Metal | 272875 ns | 281229 ns | 0.97 |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/AMDGPU | 39341 ns | 40580 ns | 0.97 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) | 143270.5 ns | 143125 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) | 123208 ns | 87375 ns | 1.41 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) | 110459 ns | 112875.5 ns | 0.98 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) | 251895.5 ns | 250792 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA | 215563.5 ns | 214898 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/oneAPI | 10368835 ns | 10468792 ns | 0.99 |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/Metal | 2024375 ns | 2056708 ns | 0.98 |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/AMDGPU | 267812 ns | 266232 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7209 ns | 7208 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 6041 ns | 5375 ns | 1.12 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6000 ns | 6083 ns | 0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10208 ns | 10000 ns | 1.02 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 32658 ns | 33010 ns | 0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 1187497 ns | 1218913 ns | 0.97 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 564146 ns | 357271 ns | 1.58 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 48321 ns | 50911 ns | 0.95 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 226938 ns | 227938 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 227916 ns | 228354.5 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 228125 ns | 235708 ns | 0.97 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 212958.5 ns | 249729 ns | 0.85 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 259894 ns | 263220 ns | 0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 27209302 ns | 28851277 ns | 0.94 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 8471583 ns | 8089625 ns | 1.05 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 594335 ns | 591956 ns | 1.00 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 14917 ns | 15375 ns | 0.97 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 15187 ns | 14917 ns | 1.02 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 16937.5 ns | 16834 ns | 1.01 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 14708 ns | 15583 ns | 0.94 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 138769.5 ns | 138290 ns | 1.00 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI | 5609166 ns | 5390404 ns | 1.04 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal | 857375 ns | 805167 ns | 1.06 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 231592 ns | 231372.5 ns | 1.00 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 23625 ns | 23333 ns | 1.01 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 24229 ns | 23438 ns | 1.03 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 23625 ns | 24459 ns | 0.97 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 23417 ns | 23666 ns | 0.99 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 859944 ns | 863635.5 ns | 1.00 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 40379406 ns | 39146915 ns | 1.03 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal | 6046000 ns | 5702250 ns | 1.06 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 690055 ns | 683727 ns | 1.01 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 9834 ns | 8875 ns | 1.11 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 9478.5 ns | 10041.5 ns | 0.94 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 12145.5 ns | 11750 ns | 1.03 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 8459 ns | 9917 ns | 0.85 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 123004 ns | 122685 ns | 1.00 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI | 3326674 ns | 3570923 ns | 0.93 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal | 859125 ns | 917271 ns | 0.94 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU | 76391 ns | 75270 ns | 1.01 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 13459 ns | 14166 ns | 0.95 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 14166.5 ns | 14458.5 ns | 0.98 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 14042 ns | 14979.5 ns | 0.94 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 13791 ns | 13542 ns | 1.02 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 666299 ns | 660959 ns | 1.01 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI | 21583968 ns | 21424061 ns | 1.01 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal | 5388292 ns | 5279979 ns | 1.02 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 375715 ns | 365744 ns | 1.03 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 10125 ns | 8417 ns | 1.20 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 9083 ns | 10146 ns | 0.90 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 11041.5 ns | 12125 ns | 0.91 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 9459 ns | 9792 ns | 0.97 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 121924 ns | 121433.5 ns | 1.00 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI | 3403856.5 ns | 3352559.5 ns | 1.02 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal | 933083 ns | 952146 ns | 0.98 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU | 72331 ns | 72460 ns | 1.00 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 12167 ns | 13166 ns | 0.92 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 12083 ns | 12938 ns | 0.93 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 13125 ns | 13125 ns | 1 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 12104.5 ns | 12916 ns | 0.94 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 550706 ns | 548948 ns | 1.00 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI | 19323121 ns | 18645332 ns | 1.04 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal | 4728041 ns | 4735063 ns | 1.00 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 341615 ns | 340583 ns | 1.00 |
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) | 29770.5 ns | 31125.5 ns | 0.96 |
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) | 34792 ns | 31520.5 ns | 1.10 |
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) | 31646 ns | 32333.5 ns | 0.98 |
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) | 1834 ns | 1834 ns | 1 |
batchedmm(2, Bsize=128)/forward/GPU/CUDA | 15895 ns | 16210 ns | 0.98 |
batchedmm(2, Bsize=128)/forward/GPU/AMDGPU | 80311 ns | 80860 ns | 0.99 |
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) | 5354.5 ns | 5229.5 ns | 1.02 |
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) | 5312.5 ns | 4959 ns | 1.07 |
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) | 5333.5 ns | 5250 ns | 1.02 |
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) | 6166 ns | 6334 ns | 0.97 |
batchedmm(2, Bsize=128)/zygote/GPU/CUDA | 138354.5 ns | 138594 ns | 1.00 |
batchedmm(2, Bsize=128)/zygote/GPU/AMDGPU | 385195 ns | 388224 ns | 0.99 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 250 ns | 291 ns | 0.86 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 250 ns | 375 ns | 0.67 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 375 ns | 1 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 291 ns | 334 ns | 0.87 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 26010 ns | 25350 ns | 1.03 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI | 1215712 ns | 1199368 ns | 1.01 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal | 456396 ns | 478250.5 ns | 0.95 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU | 48931 ns | 49490 ns | 0.99 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 6291 ns | 6292 ns | 1.00 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 6416.5 ns | 6750 ns | 0.95 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 6625 ns | 6792 ns | 0.98 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6292 ns | 6584 ns | 0.96 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 186334.5 ns | 186417 ns | 1.00 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI | 24541666 ns | 23013025 ns | 1.07 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal | 5713208.5 ns | 5920458 ns | 0.96 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 388596 ns | 393209 ns | 0.99 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 2000 ns | 1958 ns | 1.02 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 1959 ns | 2042 ns | 0.96 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 2042 ns | 2083 ns | 0.98 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 2000 ns | 2000 ns | 1 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 26508.5 ns | 25999.5 ns | 1.02 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI | 1230942 ns | 1183440.5 ns | 1.04 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal | 461166.5 ns | 314229 ns | 1.47 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 205793 ns | 206522 ns | 1.00 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 16125 ns | 16583.5 ns | 0.97 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 16625 ns | 15958 ns | 1.04 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 16875 ns | 16854 ns | 1.00 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 16000 ns | 16791.5 ns | 0.95 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 274700 ns | 272947 ns | 1.01 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 25617930.5 ns | 25132475.5 ns | 1.02 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal | 6184792 ns | 6200500 ns | 1.00 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 700900 ns | 699897 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 195959 ns | 158000 ns | 1.24 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 148437.5 ns | 152895.5 ns | 0.97 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 156042 ns | 179875 ns | 0.87 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 146563 ns | 175625 ns | 0.83 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 201860 ns | 205507.5 ns | 0.98 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 7750006.5 ns | 8109426 ns | 0.96 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1388208 ns | 1459854.5 ns | 0.95 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 223213 ns | 213437 ns | 1.05 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1322291.5 ns | 1279667 ns | 1.03 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1326083 ns | 1336958 ns | 0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1330937.5 ns | 1276333 ns | 1.04 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1337083 ns | 1332729.5 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 903712 ns | 907688 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 45331996 ns | 46524861.5 ns | 0.97 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 6766854.5 ns | 6921834 ns | 0.98 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1095360 ns | 1109576 ns | 0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 25750 ns | 25937.5 ns | 0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 25875 ns | 25750 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 27709 ns | 27437.5 ns | 1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 24125 ns | 24042 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 234038 ns | 236630 ns | 0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 7956003 ns | 7924614 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 976958 ns | 1195645.5 ns | 0.82 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 113361.5 ns | 112891.5 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 117812.5 ns | 117812.5 ns | 1 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 176771 ns | 125958 ns | 1.40 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 132292 ns | 130667 ns | 1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 124833 ns | 132625 ns | 0.94 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1069728 ns | 1078111.5 ns | 0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 46049440 ns | 48454865.5 ns | 0.95 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6183167 ns | 6291354 ns | 0.98 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 611759 ns | 604836 ns | 1.01 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 250 ns | 250 ns | 1 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 250 ns | 375 ns | 0.67 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 375 ns | 1 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 291 ns | 334 ns | 0.87 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 22728 ns | 22703 ns | 1.00 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI | 1231861 ns | 1228350.5 ns | 1.00 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal | 452750 ns | 303875 ns | 1.49 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 46601 ns | 47155.5 ns | 0.99 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 6625 ns | 6333 ns | 1.05 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 6458 ns | 6937.5 ns | 0.93 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 6792 ns | 6750 ns | 1.01 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6395.5 ns | 6687.5 ns | 0.96 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 202592 ns | 201918.5 ns | 1.00 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 25214681 ns | 24022047 ns | 1.05 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal | 6171875 ns | 6154291 ns | 1.00 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 391445 ns | 390799 ns | 1.00 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 6167 ns | 5584 ns | 1.10 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 5791.5 ns | 6729 ns | 0.86 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 7583 ns | 7834 ns | 0.97 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 5916 ns | 6333 ns | 0.93 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 143756.5 ns | 144556.5 ns | 0.99 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI | 5728741 ns | 5802837 ns | 0.99 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal | 472791 ns | 465083.5 ns | 1.02 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 232863 ns | 231623 ns | 1.01 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 10166.5 ns | 9875 ns | 1.03 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 10083 ns | 10500 ns | 0.96 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10083 ns | 10250 ns | 0.98 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 9958 ns | 10084 ns | 0.99 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 893713 ns | 898422 ns | 0.99 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 42287345 ns | 41540865 ns | 1.02 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal | 6328334 ns | 5925625 ns | 1.07 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 669549 ns | 667721.5 ns | 1.00 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 625 ns | 625 ns | 1 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 625 ns | 625 ns | 1 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 666 ns | 625 ns | 1.07 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 667 ns | 667 ns | 1 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA | 22019 ns | 22281 ns | 0.99 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/oneAPI | 1940958 ns | 2048848.5 ns | 0.95 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/Metal | 222667 ns | 228500 ns | 0.97 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/AMDGPU | 206602.5 ns | 205022 ns | 1.01 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 4583 ns | 4625 ns | 0.99 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 4625 ns | 4625 ns | 1 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 4916 ns | 4791 ns | 1.03 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 4625 ns | 4584 ns | 1.01 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA | 224705 ns | 224113.5 ns | 1.00 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/oneAPI | 10067778 ns | 11648202 ns | 0.86 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/Metal | 1710291 ns | 1667208 ns | 1.03 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/AMDGPU | 577498 ns | 578966 ns | 1.00 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 8167 ns | 8604.5 ns | 0.95 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 8041.5 ns | 9500 ns | 0.85 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 10500 ns | 10125 ns | 1.04 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 7833 ns | 8125 ns | 0.96 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 121515 ns | 121216 ns | 1.00 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI | 3793308.5 ns | 3493631.5 ns | 1.09 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal | 796583 ns | 797562.5 ns | 1.00 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU | 73461 ns | 73391 ns | 1.00 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8292 ns | 8166.5 ns | 1.02 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8291.5 ns | 9020.5 ns | 0.92 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8541 ns | 9292 ns | 0.92 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8625 ns | 8834 ns | 0.98 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 586857 ns | 585686 ns | 1.00 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI | 20229154 ns | 21659888 ns | 0.93 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal | 5084312.5 ns | 5138604.5 ns | 0.99 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 342285 ns | 345673 ns | 0.99 |
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) | 126646 ns | 128166 ns | 0.99 |
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) | 129146 ns | 95895.5 ns | 1.35 |
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) | 129834 ns | 130416 ns | 1.00 |
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) | 180625 ns | 193500 ns | 0.93 |
batchedmm(128, Bsize=4)/forward/GPU/CUDA | 45632 ns | 45829 ns | 1.00 |
batchedmm(128, Bsize=4)/forward/GPU/AMDGPU | 100411 ns | 100941 ns | 0.99 |
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) | 316563 ns | 335583 ns | 0.94 |
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) | 326709 ns | 167167 ns | 1.95 |
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) | 329187.5 ns | 354375 ns | 0.93 |
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) | 584417 ns | 609249.5 ns | 0.96 |
batchedmm(128, Bsize=4)/zygote/GPU/CUDA | 190484 ns | 190876 ns | 1.00 |
batchedmm(128, Bsize=4)/zygote/GPU/AMDGPU | 499046 ns | 517555 ns | 0.96 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 398250 ns | 397541 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 288042 ns | 215333 ns | 1.34 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 288458 ns | 288458 ns | 1 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 756875 ns | 756458 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA | 43262 ns | 43687 ns | 0.99 |
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/oneAPI | 1353668 ns | 1356444.5 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/Metal | 416333.5 ns | 420167 ns | 0.99 |
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/AMDGPU | 80431 ns | 80321 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 1450583 ns | 1457000 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 1135583 ns | 862125 ns | 1.32 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 1135479.5 ns | 1134520.5 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 2361583 ns | 2444500 ns | 0.97 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA | 246229 ns | 251807.5 ns | 0.98 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/oneAPI | 11102894 ns | 10565821 ns | 1.05 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/Metal | 1848812.5 ns | 1852750 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/AMDGPU | 352430 ns | 350374 ns | 1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 627958 ns | 683334 ns | 0.92 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 648250 ns | 650583 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 652979 ns | 641791.5 ns | 1.02 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 649354 ns | 653250 ns | 0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 201989 ns | 202465 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 8297683 ns | 8364163.5 ns | 0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1350375 ns | 1384458 ns | 0.98 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 302004 ns | 302773 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2448208 ns | 2447209 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2442541 ns | 2468625 ns | 0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2446583 ns | 2446166.5 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2461375 ns | 2452188 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 986950 ns | 992979 ns | 0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 51627338 ns | 51629265.5 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 12287584 ns | 9882875 ns | 1.24 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1385198 ns | 1311863 ns | 1.06 |
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) | 34583 ns | 34667 ns | 1.00 |
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) | 34604.5 ns | 34291.5 ns | 1.01 |
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) | 34187 ns | 35521 ns | 0.96 |
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) | 979 ns | 875 ns | 1.12 |
batchedmm(2, Bsize=32)/forward/GPU/CUDA | 15115 ns | 15660 ns | 0.97 |
batchedmm(2, Bsize=32)/forward/GPU/AMDGPU | 78861 ns | 78941 ns | 1.00 |
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) | 3250 ns | 3125 ns | 1.04 |
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) | 3083.5 ns | 3458.5 ns | 0.89 |
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) | 3292 ns | 3312.5 ns | 0.99 |
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) | 3000 ns | 3084 ns | 0.97 |
batchedmm(2, Bsize=32)/zygote/GPU/CUDA | 136884.5 ns | 137070.5 ns | 1.00 |
batchedmm(2, Bsize=32)/zygote/GPU/AMDGPU | 333775 ns | 338254 ns | 0.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 406333.5 ns | 406166 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 407959 ns | 404458 ns | 1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 408250 ns | 408458 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 421625 ns | 420458 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 42816 ns | 42995 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 1337237.5 ns | 1466063 ns | 0.91 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1160458.5 ns | 1144125 ns | 1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 237128 ns | 238192 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 3867166 ns | 3877875 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 3999583 ns | 3990896 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 3994541 ns | 3992562.5 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 3753104 ns | 3778146 ns | 0.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 240654 ns | 240990 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 36363725.5 ns | 36589646 ns | 0.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 11696917 ns | 11933709 ns | 0.98 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1245696 ns | 1433854 ns | 0.87 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) |
3917 ns |
3916 ns |
1.00 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) |
3875 ns |
3958 ns |
0.98 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) |
3916 ns |
3917 ns |
1.00 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) |
3917 ns |
3917 ns |
1 |
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA |
33162 ns |
33931 ns |
0.98 |
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/oneAPI |
1219454 ns |
1232713.5 ns |
0.99 |
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/Metal |
179458 ns |
183709 ns |
0.98 |
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/AMDGPU |
37931 ns |
38031 ns |
1.00 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) |
15625 ns |
15708 ns |
0.99 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) |
15708 ns |
15750 ns |
1.00 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) |
16000 ns |
15958 ns |
1.00 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) |
15500 ns |
15750 ns |
0.98 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA |
251974 ns |
252887 ns |
1.00 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/oneAPI |
10610220 ns |
9179273 ns |
1.16 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/Metal |
863958.5 ns |
893625 ns |
0.97 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/AMDGPU |
162802 ns |
172862 ns |
0.94 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) |
404583 ns |
404417 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) |
295833 ns |
221125 ns |
1.34 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) |
295625 ns |
296500 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) |
760708 ns |
761125 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA |
113013 ns |
112867 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/oneAPI |
1012749 ns |
1050270.5 ns |
0.96 |
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/Metal |
433875 ns |
406792 ns |
1.07 |
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/AMDGPU |
87531 ns |
87471 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) |
1468208 ns |
1471292 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) |
1161479 ns |
884000 ns |
1.31 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) |
1161334 ns |
1160146 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) |
2383584 ns |
2466083.5 ns |
0.97 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA |
236752 ns |
238614 ns |
0.99 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/oneAPI |
10443178 ns |
9255273 ns |
1.13 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/Metal |
1920167 ns |
1932833 ns |
0.99 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/AMDGPU |
354105 ns |
350549 ns |
1.01 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
500 ns |
500 ns |
1 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
459 ns |
583 ns |
0.79 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
584 ns |
583 ns |
1.00 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
500 ns |
583 ns |
0.86 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA |
25752 ns |
25487 ns |
1.01 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI |
1271548 ns |
1217335.5 ns |
1.04 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal |
477750.5 ns |
387333 ns |
1.23 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU |
205523 ns |
206202 ns |
1.00 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
7292 ns |
7375 ns |
0.99 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
7542 ns |
8020.5 ns |
0.94 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
7687.5 ns |
7916 ns |
0.97 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
7583 ns |
7542 ns |
1.01 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA |
216586.5 ns |
209854.5 ns |
1.03 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI |
26033942 ns |
25469136 ns |
1.02 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal |
6012333 ns |
6294375 ns |
0.96 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU |
686818.5 ns |
684857 ns |
1.00 |
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) |
840000 ns |
833124.5 ns |
1.01 |
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) |
615666 ns |
467292 ns |
1.32 |
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) |
621333 ns |
621750 ns |
1.00 |
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) |
1558000 ns |
1543666 ns |
1.01 |
batchedmm(128, Bsize=32)/forward/GPU/CUDA |
133759 ns |
130036 ns |
1.03 |
batchedmm(128, Bsize=32)/forward/GPU/AMDGPU |
234603 ns |
230222 ns |
1.02 |
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) |
2678083 ns |
2684437.5 ns |
1.00 |
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) |
2002000 ns |
1538583 ns |
1.30 |
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) |
2002187.5 ns |
2002583 ns |
1.00 |
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) |
4935667 ns |
4933354 ns |
1.00 |
batchedmm(128, Bsize=32)/zygote/GPU/CUDA |
239631.5 ns |
243369 ns |
0.98 |
batchedmm(128, Bsize=32)/zygote/GPU/AMDGPU |
807670 ns |
836303.5 ns |
0.97 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
292 ns |
250 ns |
1.17 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
250 ns |
375 ns |
0.67 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
375 ns |
375 ns |
1 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
292 ns |
334 ns |
0.87 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA |
31737 ns |
31581 ns |
1.00 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI |
1190684.5 ns |
1181114.5 ns |
1.01 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal |
318125 ns |
425666.5 ns |
0.75 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU |
45701 ns |
49050 ns |
0.93 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
6125 ns |
6291 ns |
0.97 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
6270.5 ns |
6708.5 ns |
0.93 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
6584 ns |
6667 ns |
0.99 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
6083 ns |
6375 ns |
0.95 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA |
223270 ns |
222549 ns |
1.00 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI |
20675696 ns |
20723673 ns |
1.00 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal |
5388000 ns |
5408500 ns |
1.00 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU |
363375 ns |
364253.5 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
2402167 ns |
2412916 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
2417416 ns |
2399708 ns |
1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
2383750 ns |
2391250 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
2375125 ns |
2406375 ns |
0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
199863.5 ns |
201130.5 ns |
0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
8222543 ns |
8039466.5 ns |
1.02 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal |
1463104 ns |
1500813 ns |
0.97 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
372574 ns |
371169 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
4640208 ns |
4645417 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
4646500 ns |
4666145.5 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
4660209 ns |
4648375 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
4644979.5 ns |
4646334 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
893947 ns |
899895.5 ns |
0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
51330419 ns |
47712828 ns |
1.08 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
6764792 ns |
6893375 ns |
0.98 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1345616.5 ns |
1384804 ns |
0.97 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) |
7542 ns |
7083 ns |
1.06 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) |
7083 ns |
7000 ns |
1.01 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) |
7375 ns |
7750 ns |
0.95 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) |
6833 ns |
6792 ns |
1.01 |
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA |
23243 ns |
23107 ns |
1.01 |
bias_activation(512, act=relu)(512 x 128)/forward/GPU/oneAPI |
1128838 ns |
1160499 ns |
0.97 |
bias_activation(512, act=relu)(512 x 128)/forward/GPU/Metal |
261584 ns |
282458 ns |
0.93 |
bias_activation(512, act=relu)(512 x 128)/forward/GPU/AMDGPU |
36650 ns |
40431 ns |
0.91 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) |
47854 ns |
48667 ns |
0.98 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) |
51771 ns |
57125 ns |
0.91 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) |
48833 ns |
51042 ns |
0.96 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) |
32833 ns |
33354.5 ns |
0.98 |
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA |
215568 ns |
215404 ns |
1.00 |
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/oneAPI |
10412929 ns |
10709204 ns |
0.97 |
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/Metal |
1911875 ns |
2066833 ns |
0.93 |
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/AMDGPU |
230092 ns |
264313 ns |
0.87 |
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) |
22729.5 ns |
22854 ns |
0.99 |
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) |
24646 ns |
24375.5 ns |
1.01 |
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) |
23458 ns |
24917 ns |
0.94 |
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) |
5250 ns |
5209 ns |
1.01 |
batchedmm(2, Bsize=512)/forward/GPU/CUDA |
16364.5 ns |
16790 ns |
0.97 |
batchedmm(2, Bsize=512)/forward/GPU/AMDGPU |
83781 ns |
89191 ns |
0.94 |
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) |
12166 ns |
12250 ns |
0.99 |
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) |
10375 ns |
9375 ns |
1.11 |
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) |
10917 ns |
10604.5 ns |
1.03 |
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) |
17750 ns |
18083 ns |
0.98 |
batchedmm(2, Bsize=512)/zygote/GPU/CUDA |
226319.5 ns |
225960 ns |
1.00 |
batchedmm(2, Bsize=512)/zygote/GPU/AMDGPU |
369134 ns |
387419 ns |
0.95 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) |
406042 ns |
406584 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) |
297417 ns |
223292 ns |
1.33 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) |
296958 ns |
297000 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) |
762792 ns |
762667 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA |
46179 ns |
45879 ns |
1.01 |
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/oneAPI |
1332014 ns |
1417981 ns |
0.94 |
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/Metal |
486500 ns |
424354.5 ns |
1.15 |
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/AMDGPU |
89011 ns |
89741 ns |
0.99 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) |
1485666.5 ns |
1486000.5 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) |
1167500 ns |
892208.5 ns |
1.31 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) |
1166000 ns |
1169500 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) |
2386416 ns |
2471625 ns |
0.97 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA |
280427 ns |
279157 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/oneAPI |
13856466.5 ns |
13109750 ns |
1.06 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/Metal |
2074958 ns |
2047333 ns |
1.01 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/AMDGPU |
379774 ns |
376633 ns |
1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
433667 ns |
433500 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
436959 ns |
430292 ns |
1.02 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
436959 ns |
436292 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
448375 ns |
446958 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
53930.5 ns |
54004 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
1009165 ns |
1003277 ns |
1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal |
1060375 ns |
1090562.5 ns |
0.97 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
233243 ns |
236733 ns |
0.99 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
3904625 ns |
3866292 ns |
1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
4021125 ns |
4019812.5 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
4027208 ns |
4022583.5 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
3804375 ns |
3812208.5 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
260750 ns |
261348.5 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
31826505 ns |
32496173.5 ns |
0.98 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
10958041.5 ns |
10504750 ns |
1.04 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1342696 ns |
1365148 ns |
0.98 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) |
8750 ns |
8708 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) |
7625 ns |
6958 ns |
1.10 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) |
7666 ns |
7667 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) |
12417 ns |
12417 ns |
1 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA |
23537 ns |
23411 ns |
1.01 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/oneAPI |
2186463 ns |
2120051 ns |
1.03 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/Metal |
224396 ns |
229334 ns |
0.98 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/AMDGPU |
211122 ns |
208012 ns |
1.01 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) |
44916 ns |
45583 ns |
0.99 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) |
44916 ns |
45291 ns |
0.99 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) |
45375 ns |
45416 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) |
44709 ns |
45042 ns |
0.99 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA |
343620.5 ns |
345424.5 ns |
0.99 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/oneAPI |
13641380 ns |
13588599 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/Metal |
1814145.5 ns |
1751750 ns |
1.04 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/AMDGPU |
656018 ns |
653876 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
86792 ns |
113812.5 ns |
0.76 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
125000 ns |
90020.5 ns |
1.39 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
88208 ns |
88625 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
86000 ns |
81000 ns |
1.06 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
190101 ns |
190227.5 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
6192741 ns |
6167893 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal |
1953792 ns |
2705500 ns |
0.72 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
217797.5 ns |
221462 ns |
0.98 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
2013000 ns |
1871229 ns |
1.08 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
2017875 ns |
2028479 ns |
0.99 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
2014000 ns |
2015645.5 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
2026416 ns |
2020395.5 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
532007 ns |
534895 ns |
0.99 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
27385258 ns |
28188330 ns |
0.97 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
9714708 ns |
9724208 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1093783 ns |
1078565.5 ns |
1.01 |
This comment was automatically generated by workflow using github-action-benchmark.
avik-pal commented Sep 15, 2024
Fixes #148. I am still seeing some failures in the end-to-end Lux case, but let's get part of the solution in for now.