This repository has been archived by the owner on Nov 4, 2024. It is now read-only.
fix: enzyme reverse bias needs a check on Const
#160
Merged
avik-pal force-pushed the ap/fix_check branch from 083c1f8 to ac37989 on September 16, 2024 15:34
avik-pal force-pushed the ap/fix_check branch from ac37989 to 1538324 on September 16, 2024 15:53
LuxLib Benchmarks
Benchmark suite | Current: 1538324 | Previous: 7ba127a | Ratio |
---|---|---|---|
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 7208 ns | 4667 ns | 1.54 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 5958 ns | 6666.5 ns | 0.89 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 7812.5 ns | 7500 ns | 1.04 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 5500 ns | 5750 ns | 0.96 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 116591 ns | 117321 ns | 0.99 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal | 835125 ns | 3008750 ns | 0.28 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 407004 ns | 404195 ns | 1.01 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 9749.5 ns | 9896 ns | 0.99 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 10208.5 ns | 9833 ns | 1.04 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 9750 ns | 9979 ns | 0.98 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 9646 ns | 9958.5 ns | 0.97 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 532910 ns | 533872 ns | 1.00 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal | 5057208 ns | 2324292 ns | 2.18 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 684507 ns | 674968 ns | 1.01 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 2042 ns | 1437.5 ns | 1.42 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 1375 ns | 2875 ns | 0.48 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 2021 ns | 2083 ns | 0.97 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 1875 ns | 1437.5 ns | 1.30 |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA | 21601 ns | 21479 ns | 1.01 |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/Metal | 208458 ns | 190209 ns | 1.10 |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/AMDGPU | 29330 ns | 29540 ns | 0.99 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 4084 ns | 4250 ns | 0.96 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 3812.5 ns | 4167 ns | 0.91 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 4541.5 ns | 4145.5 ns | 1.10 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 4208 ns | 4375 ns | 0.96 |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA | 142653 ns | 144438.5 ns | 0.99 |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/Metal | 1611188 ns | 1604875 ns | 1.00 |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/AMDGPU | 147161 ns | 145092 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 58083 ns | 55875 ns | 1.04 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 46291 ns | 39209 ns | 1.18 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 47125 ns | 46625 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 82708 ns | 84167 ns | 0.98 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 36579 ns | 36824 ns | 0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1064500 ns | 1333104 ns | 0.80 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 83101 ns | 81391 ns | 1.02 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2013666 ns | 2024917 ns | 0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2083625 ns | 2079125 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2087792 ns | 2081625 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1990541 ns | 1993125 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 225517.5 ns | 226688 ns | 0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 4569584 ns | 7427958 ns | 0.62 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 975579 ns | 1252074 ns | 0.78 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 177958 ns | 174750 ns | 1.02 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 146625 ns | 164541.5 ns | 0.89 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 176041 ns | 148812.5 ns | 1.18 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 153250 ns | 144375 ns | 1.06 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 165529 ns | 165480 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1562708.5 ns | 1457521 ns | 1.07 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 208972 ns | 204852 ns | 1.02 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1129895.5 ns | 1117250 ns | 1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1113270.5 ns | 1109375.5 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1116688 ns | 1113334 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1108000 ns | 1112187.5 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 691122 ns | 694582 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 6441708.5 ns | 6238375 ns | 1.03 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1025005 ns | 1026961 ns | 1.00 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 5083 ns | 4417 ns | 1.15 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 4458 ns | 5041 ns | 0.88 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 6708 ns | 5208 ns | 1.29 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 5125 ns | 4583 ns | 1.12 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 91713 ns | 93299.5 ns | 0.98 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal | 467917 ns | 634041.5 ns | 0.74 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 67821 ns | 69460 ns | 0.98 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8708 ns | 8375 ns | 1.04 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8667 ns | 8542 ns | 1.01 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8917 ns | 8833 ns | 1.01 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8292 ns | 8833 ns | 0.94 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 595976 ns | 604485 ns | 0.99 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal | 6425709 ns | 5669937.5 ns | 1.13 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 386609 ns | 388374 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 17854 ns | 17000 ns | 1.05 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 18125 ns | 17709 ns | 1.02 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 20833 ns | 18021 ns | 1.16 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 16791.5 ns | 16895.5 ns | 0.99 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 65447 ns | 66654.5 ns | 0.98 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1286000 ns | 477833 ns | 2.69 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 77060 ns | 78451 ns | 0.98 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 213000 ns | 216834 ns | 0.98 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 218125 ns | 219896 ns | 0.99 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 225458 ns | 225583.5 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 215709 ns | 217625 ns | 0.99 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 347209 ns | 356473 ns | 0.97 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 5699333 ns | 5644395.5 ns | 1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 463334 ns | 465005 ns | 1.00 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 958 ns | 667 ns | 1.44 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 792 ns | 750 ns | 1.06 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 834 ns | 812.5 ns | 1.03 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 625 ns | 625 ns | 1 |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA | 20255 ns | 20462 ns | 0.99 |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/Metal | 303708 ns | 302625 ns | 1.00 |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/AMDGPU | 31221 ns | 32870 ns | 0.95 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 1417 ns | 1417 ns | 1 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 1583 ns | 1458 ns | 1.09 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 1583 ns | 1417 ns | 1.12 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 1334 ns | 1416 ns | 0.94 |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA | 122871 ns | 125127 ns | 0.98 |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/Metal | 1648916 ns | 1526500 ns | 1.08 |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/AMDGPU | 136871 ns | 136521 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7417 ns | 7208 ns | 1.03 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 6083 ns | 5416 ns | 1.12 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6167 ns | 6125 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10041 ns | 10666 ns | 0.94 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 23633 ns | 23625 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 656708 ns | 356458 ns | 1.84 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 47130 ns | 48881 ns | 0.96 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 229750 ns | 226166 ns | 1.02 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 269375 ns | 265333 ns | 1.02 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 270667 ns | 234854 ns | 1.15 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 219458 ns | 219500 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 187028 ns | 192027 ns | 0.97 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 8744750 ns | 9046313 ns | 0.97 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 648261 ns | 649247 ns | 1.00 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 4084 ns | 4125 ns | 0.99 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 4125 ns | 4083 ns | 1.01 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 4125 ns | 4084 ns | 1.01 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 4083 ns | 4083 ns | 1 |
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA | 22803 ns | 23477 ns | 0.97 |
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/Metal | 224083 ns | 214833 ns | 1.04 |
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/AMDGPU | 46881 ns | 47261 ns | 0.99 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 16584 ns | 17083 ns | 0.97 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 16666 ns | 17000 ns | 0.98 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 17375 ns | 16833 ns | 1.03 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 16916 ns | 17334 ns | 0.98 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA | 191515.5 ns | 195303 ns | 0.98 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/Metal | 962437.5 ns | 918208 ns | 1.05 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/AMDGPU | 172696.5 ns | 174652 ns | 0.99 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 512042 ns | 508750 ns | 1.01 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 405354.5 ns | 330583 ns | 1.23 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 405583 ns | 404666 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 865958 ns | 864791 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA | 113233 ns | 113620 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/Metal | 462708 ns | 490979 ns | 0.94 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/AMDGPU | 242192 ns | 242133 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 2271854 ns | 2313834 ns | 0.98 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 2032625 ns | 1747479 ns | 1.16 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 2029500 ns | 2035208 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 3281500 ns | 3272708.5 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA | 239296 ns | 241207 ns | 0.99 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/Metal | 2003084 ns | 2011770.5 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/AMDGPU | 742307 ns | 743443 ns | 1.00 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 6875 ns | 4708.5 ns | 1.46 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 6625 ns | 7625 ns | 0.87 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 8417 ns | 7708 ns | 1.09 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 6583 ns | 5479.5 ns | 1.20 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 90887.5 ns | 92351.5 ns | 0.98 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal | 884937.5 ns | 783479 ns | 1.13 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 66440 ns | 65411 ns | 1.02 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 10396 ns | 10333.5 ns | 1.01 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 11604 ns | 11875 ns | 0.98 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 11958.5 ns | 11750 ns | 1.02 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 11729 ns | 12062.5 ns | 0.97 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 634525 ns | 634956 ns | 1.00 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal | 6103459 ns | 5457291.5 ns | 1.12 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 411954 ns | 409979.5 ns | 1.00 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 541 ns | 541 ns | 1 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 500 ns | 583 ns | 0.86 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 542 ns | 500 ns | 1.08 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 500 ns | 500 ns | 1 |
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA | 22874 ns | 23181 ns | 0.99 |
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/Metal | 325500 ns | 332584 ns | 0.98 |
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/AMDGPU | 48711 ns | 47221 ns | 1.03 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 2125 ns | 2166 ns | 0.98 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 2125 ns | 2167 ns | 0.98 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 2167 ns | 2084 ns | 1.04 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 2084 ns | 2084 ns | 1 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA | 231590.5 ns | 215755 ns | 1.07 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/Metal | 2053250 ns | 1978417 ns | 1.04 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/AMDGPU | 172611.5 ns | 172626.5 ns | 1.00 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 8875 ns | 8937.5 ns | 0.99 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 9708 ns | 9729.5 ns | 1.00 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 10417 ns | 9459 ns | 1.10 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 8791 ns | 8958 ns | 0.98 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 100356.5 ns | 96639 ns | 1.04 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal | 929291 ns | 876000 ns | 1.06 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 71930.5 ns | 71941 ns | 1.00 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 17645.5 ns | 18521 ns | 0.95 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 17749.5 ns | 19104.5 ns | 0.93 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 19896.5 ns | 17625 ns | 1.13 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 17458 ns | 18812.5 ns | 0.93 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 603312.5 ns | 554001 ns | 1.09 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal | 5371083.5 ns | 5180916.5 ns | 1.04 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 376938.5 ns | 378539 ns | 1.00 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 542 ns | 458 ns | 1.18 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 459 ns | 625 ns | 0.73 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 584 ns | 666 ns | 0.88 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 500 ns | 500 ns | 1 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 35224.5 ns | 35213 ns | 1.00 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal | 467666 ns | 466396 ns | 1.00 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU | 45901 ns | 46270 ns | 0.99 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 9854.5 ns | 9312.5 ns | 1.06 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 8875 ns | 9916.5 ns | 0.89 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 9792 ns | 9167 ns | 1.07 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 9104.5 ns | 9458.5 ns | 0.96 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 259410 ns | 267136 ns | 0.97 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal | 5178229 ns | 4572250 ns | 1.13 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 364483 ns | 367694 ns | 0.99 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 398833 ns | 395333 ns | 1.01 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 287791 ns | 214416 ns | 1.34 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 288166 ns | 288292 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 755917 ns | 756291 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA | 111747.5 ns | 111882 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/Metal | 365292 ns | 300208.5 ns | 1.22 |
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/AMDGPU | 75320 ns | 77331 ns | 0.97 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 1398542 ns | 1453791.5 ns | 0.96 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 1133583 ns | 852583 ns | 1.33 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 1133583 ns | 1132645.5 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 2439666 ns | 2440625 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA | 204923 ns | 207032 ns | 0.99 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/Metal | 1643875 ns | 1668041.5 ns | 0.99 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/AMDGPU | 323223 ns | 324428.5 ns | 1.00 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 6896 ns | 7041.5 ns | 0.98 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 7750 ns | 7750 ns | 1 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 8854 ns | 9396 ns | 0.94 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 7604 ns | 7791.5 ns | 0.98 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 142390 ns | 144806.5 ns | 0.98 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal | 502541.5 ns | 437250 ns | 1.15 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 65701 ns | 66071 ns | 0.99 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 14979.5 ns | 13083 ns | 1.14 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 16042 ns | 14479 ns | 1.11 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 14791.5 ns | 15709 ns | 0.94 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 13812.5 ns | 15354.5 ns | 0.90 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 954176.5 ns | 956377 ns | 1.00 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal | 6228229 ns | 5700250 ns | 1.09 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 429743 ns | 428955 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 24521 ns | 24000 ns | 1.02 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 30584 ns | 24875 ns | 1.23 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 30083 ns | 29292 ns | 1.03 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 26042 ns | 27667 ns | 0.94 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 196332 ns | 199144 ns | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 597417 ns | 999584 ns | 0.60 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 114751 ns | 116931 ns | 0.98 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 155541 ns | 103583 ns | 1.50 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 149792 ns | 152687 ns | 0.98 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 118812.5 ns | 153583 ns | 0.77 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 153083 ns | 151000 ns | 1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1064087 ns | 1075746 ns | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 5878770.5 ns | 5733792 ns | 1.03 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 588085.5 ns | 590946.5 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 75500 ns | 75000 ns | 1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 76750 ns | 77084 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 81834 ns | 86333.5 ns | 0.95 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 75417 ns | 74875 ns | 1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 203196 ns | 205585 ns | 0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 544541.5 ns | 519187.5 ns | 1.05 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 127411.5 ns | 127562 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 319167 ns | 293542 ns | 1.09 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 292208 ns | 308750 ns | 0.95 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 319437.5 ns | 315187.5 ns | 1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 254396 ns | 304208 ns | 0.84 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1106728 ns | 1108118 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6853687.5 ns | 6276458 ns | 1.09 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 692376 ns | 695017 ns | 1.00 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 16167 ns | 15875 ns | 1.02 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 17042 ns | 17521 ns | 0.97 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 19208 ns | 18500 ns | 1.04 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 16625 ns | 16958 ns | 0.98 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 144171.5 ns | 146489 ns | 0.98 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal | 453916 ns | 723083.5 ns | 0.63 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 232502 ns | 232683 ns | 1.00 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 27645.5 ns | 26667 ns | 1.04 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 27875 ns | 26687.5 ns | 1.04 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 27104.5 ns | 28208.5 ns | 0.96 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 27209 ns | 27708.5 ns | 0.98 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 966948 ns | 982068.5 ns | 0.98 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal | 6307750 ns | 5743229 ns | 1.10 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 687236 ns | 686807.5 ns | 1.00 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 11542 ns | 11083 ns | 1.04 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 11000 ns | 12042 ns | 0.91 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 11959 ns | 12334 ns | 0.97 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 10708.5 ns | 10791 ns | 0.99 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 123398.5 ns | 124134 ns | 0.99 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal | 941104 ns | 880000 ns | 1.07 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 235762 ns | 234213 ns | 1.01 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 22125 ns | 21958 ns | 1.01 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 22270.5 ns | 22729.5 ns | 0.98 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 23229 ns | 21895.5 ns | 1.06 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 21500 ns | 22000 ns | 0.98 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 695351 ns | 701831.5 ns | 0.99 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal | 5572437 ns | 5204750 ns | 1.07 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 677346 ns | 674667 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 62708 ns | 63437.5 ns | 0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 63666 ns | 65521 ns | 0.97 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 65167 ns | 66750 ns | 0.98 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 63041.5 ns | 63042 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 104762.5 ns | 106345.5 ns | 0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1332187.5 ns | 480667 ns | 2.77 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 233822 ns | 233433 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 445500 ns | 437896 ns | 1.02 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 444000 ns | 456000 ns | 0.97 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 450354 ns | 450542 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 448604 ns | 444000 ns | 1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 508730 ns | 515188 ns | 0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6142645.5 ns | 6095791.5 ns | 1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 714332 ns | 717017.5 ns | 1.00 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 7000 ns | 6792 ns | 1.03 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 7521 ns | 8000 ns | 0.94 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 8313 ns | 8583.5 ns | 0.97 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 7937.5 ns | 6917 ns | 1.15 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 143368 ns | 146052.5 ns | 0.98 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal | 776916 ns | 726500 ns | 1.07 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU | 69001 ns | 65301 ns | 1.06 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 14354.5 ns | 14292 ns | 1.00 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 14625 ns | 15292 ns | 0.96 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 15500.5 ns | 14084 ns | 1.10 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 15333 ns | 16209 ns | 0.95 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 936852 ns | 947670 ns | 0.99 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal | 5938604.5 ns | 5499875 ns | 1.08 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 401428 ns | 399764 ns | 1.00 |
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) | 6155875 ns | 6131500 ns | 1.00 |
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) | 6375916.5 ns | 3224875 ns | 1.98 |
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) | 6374000 ns | 6379229.5 ns | 1.00 |
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) | 11901292 ns | 11911084 ns | 1.00 |
batchedmm(512, Bsize=4)/forward/GPU/CUDA | 350778 ns | 349856 ns | 1.00 |
batchedmm(512, Bsize=4)/forward/GPU/AMDGPU | 323393 ns | 303248 ns | 1.07 |
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) | 19147458.5 ns | 19059708.5 ns | 1.00 |
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) | 19965187.5 ns | 11090437.5 ns | 1.80 |
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) | 19960104 ns | 20005646 ns | 1.00 |
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) | 36468541.5 ns | 36446770.5 ns | 1.00 |
batchedmm(512, Bsize=4)/zygote/GPU/CUDA | 1066981 ns | 1081781.5 ns | 0.99 |
batchedmm(512, Bsize=4)/zygote/GPU/AMDGPU | 1167291 ns | 1153782 ns | 1.01 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 958 ns | 958 ns | 1 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 1000 ns | 1000 ns | 1 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 1000 ns | 958 ns | 1.04 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 958 ns | 917 ns | 1.04 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA | 22956 ns | 23071 ns | 1.00 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/Metal | 326979.5 ns | 332541.5 ns | 0.98 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/AMDGPU | 207932 ns | 207622 ns | 1.00 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 3625 ns | 3667 ns | 0.99 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 3709 ns | 3750 ns | 0.99 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 3791 ns | 3708 ns | 1.02 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 3666 ns | 3667 ns | 1.00 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA | 278049 ns | 281551.5 ns | 0.99 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/Metal | 2145896 ns | 2129583 ns | 1.01 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/AMDGPU | 628290.5 ns | 626307 ns | 1.00 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 7583 ns | 8042 ns | 0.94 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 8437.5 ns | 8145.5 ns | 1.04 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 10687 ns | 9042 ns | 1.18 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 7604.5 ns | 7937.5 ns | 0.96 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 119392 ns | 121104 ns | 0.99 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal | 855709 ns | 802541.5 ns | 1.07 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 65571 ns | 65471 ns | 1.00 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 12062 ns | 13125 ns | 0.92 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 11875 ns | 12875 ns | 0.92 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 13583 ns | 11417 ns | 1.19 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 11375 ns | 12708 ns | 0.90 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 628558 ns | 638151 ns | 0.98 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal | 5354374.5 ns | 4390333 ns | 1.22 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 352454 ns | 355644 ns | 0.99 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 292 ns | 292 ns | 1 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 292 ns | 333 ns | 0.88 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 333 ns | 292 ns | 1.14 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 291 ns | 291 ns | 1 |
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA | 22280 ns | 22337 ns | 1.00 |
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/Metal | 325417 ns | 207833 ns | 1.57 |
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/AMDGPU | 47591 ns | 47401 ns | 1.00 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 2917 ns | 3042 ns | 0.96 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 2833 ns | 3375 ns | 0.84 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 3375 ns | 2916 ns | 1.16 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 2875 ns | 3333 ns | 0.86 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA | 200188 ns | 204047 ns | 0.98 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/Metal | 1755125 ns | 1611395.5 ns | 1.09 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/AMDGPU | 166112 ns | 157641.5 ns | 1.05 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 12062.5 ns | 10250 ns | 1.18 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 11709 ns | 12167 ns | 0.96 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 14167 ns | 12187.5 ns | 1.16 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 12208 ns | 10604 ns | 1.15 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 120653.5 ns | 121713.5 ns | 0.99 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal | 957583 ns | 904791.5 ns | 1.06 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 233782 ns | 233512.5 ns | 1.00 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 22521 ns | 21104.5 ns | 1.07 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 21229 ns | 22583 ns | 0.94 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 21709 ns | 21083 ns | 1.03 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 21271 ns | 21708 ns | 0.98 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 588112 ns | 595173 ns | 0.99 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal | 4873042 ns | 4095583 ns | 1.19 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 645610.5 ns | 638246.5 ns | 1.01 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 4417 ns | 4417 ns | 1 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 4417 ns | 4375 ns | 1.01 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 4416 ns | 4375 ns | 1.01 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 4416 ns | 4417 ns | 1.00 |
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA | 23872 ns | 24193.5 ns | 0.99 |
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/Metal | 227375 ns | 215041 ns | 1.06 |
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/AMDGPU | 47721 ns | 47690 ns | 1.00 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 16458 ns | 16292 ns | 1.01 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 16459 ns | 16291 ns | 1.01 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 16750 ns | 16667 ns | 1.00 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 16375 ns | 16416 ns | 1.00 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA | 328973 ns | 330020.5 ns | 1.00 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/Metal | 1096437.5 ns | 1639709 ns | 0.67 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/AMDGPU | 205102 ns | 206457.5 ns | 0.99 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 2000 ns | 1917 ns | 1.04 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 2167 ns | 2167 ns | 1 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 2125 ns | 2084 ns | 1.02 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 2083 ns | 2084 ns | 1.00 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 35620 ns | 35891 ns | 0.99 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal | 479292 ns | 474917 ns | 1.01 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 203502 ns | 204052 ns | 1.00 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 16666.5 ns | 19687.5 ns | 0.85 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 16625 ns | 17187.5 ns | 0.97 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 17917 ns | 17750 ns | 1.01 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 16645.5 ns | 16667 ns | 1.00 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 291232 ns | 293976.5 ns | 0.99 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal | 5011416 ns | 4767354.5 ns | 1.05 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 685261 ns | 686777 ns | 1.00 |
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) | 60083 ns | 55771 ns | 1.08 |
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) | 66500 ns | 62792 ns | 1.06 |
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) | 65291.5 ns | 65604.5 ns | 1.00 |
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) | 53250 ns | 51333 ns | 1.04 |
batchedmm(16, Bsize=512)/forward/GPU/CUDA | 66478.5 ns | 66418 ns | 1.00 |
batchedmm(16, Bsize=512)/forward/GPU/AMDGPU | 114861 ns | 114241 ns | 1.01 |
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) | 189708.5 ns | 202896 ns | 0.94 |
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) | 131041.5 ns | 135104 ns | 0.97 |
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) | 158375 ns | 130083 ns | 1.22 |
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) | 299229 ns | 245666 ns | 1.22 |
batchedmm(16, Bsize=512)/zygote/GPU/CUDA | 213976.5 ns | 215296 ns | 0.99 |
batchedmm(16, Bsize=512)/zygote/GPU/AMDGPU | 614765 ns | 607861 ns | 1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 84000 ns | 79709 ns | 1.05 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 83417 ns | 107104 ns | 0.78 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 117459 ns | 85167 ns | 1.38 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 90334 ns | 124166.5 ns | 0.73 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 192051 ns | 192861 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1810583.5 ns | 1816084 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 203912 ns | 203512 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1915666 ns | 1869895.5 ns | 1.02 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1913209 ns | 1901084 ns | 1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1915500 ns | 1917666.5 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1911313 ns | 1889333 ns | 1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 534916.5 ns | 531825 ns | 1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 8918708 ns | 8859584 ns | 1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 927078 ns | 925670 ns | 1.00 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 292 ns | 291 ns | 1.00 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 292 ns | 292 ns | 1 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 333 ns | 291 ns | 1.14 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 291 ns | 291 ns | 1 |
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA | 21774 ns | 21389 ns | 1.02 |
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/Metal | 368584 ns | 336229.5 ns | 1.10 |
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/AMDGPU | 41120 ns | 42770.5 ns | 0.96 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 1792 ns | 1834 ns | 0.98 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 1875 ns | 1834 ns | 1.02 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 1833 ns | 1792 ns | 1.02 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 1792 ns | 1792 ns | 1 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA | 255691 ns | 253832 ns | 1.01 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/Metal | 1105833 ns | 1009479 ns | 1.10 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/AMDGPU | 182671.5 ns | 184376.5 ns | 0.99 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 8792 ns | 8000 ns | 1.10 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 9750 ns | 10042 ns | 0.97 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 10500 ns | 10375 ns | 1.01 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 7833 ns | 8167 ns | 0.96 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 120137.5 ns | 119090.5 ns | 1.01 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal | 886041 ns | 876708 ns | 1.01 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 235302 ns | 232622 ns | 1.01 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 9542 ns | 9083 ns | 1.05 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 9542 ns | 10625 ns | 0.90 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 9667 ns | 9542 ns | 1.01 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 9792 ns | 10125 ns | 0.97 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 530135 ns | 527209 ns | 1.01 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal | 4591292 ns | 3949187.5 ns | 1.16 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 631445 ns | 624237 ns | 1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 58666 ns | 56166 ns | 1.04 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 46458 ns | 38916 ns | 1.19 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 47084 ns | 46125 ns | 1.02 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 82542 ns | 83958 ns | 0.98 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 40746 ns | 40233 ns | 1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1137875 ns | 1123667 ns | 1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 74291 ns | 76266 ns | 0.97 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1797833 ns | 1923750 ns | 0.93 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1970562.5 ns | 1952750.5 ns | 1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1981541 ns | 1982854 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1786375 ns | 1850708.5 ns | 0.97 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 222550 ns | 221906.5 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 11379375.5 ns | 11408021 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1174290 ns | 1191052 ns | 0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 419270.5 ns | 416333 ns | 1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 424500 ns | 421645.5 ns | 1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 421958 ns | 421208.5 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 416084 ns | 417667 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 209953 ns | 208798 ns | 1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 539875 ns | 518208 ns | 1.04 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 282843 ns | 282883 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 669937.5 ns | 747916.5 ns | 0.90 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 718333 ns | 671583 ns | 1.07 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 773833 ns | 673562.5 ns | 1.15 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 683417 ns | 748021 ns | 0.91 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1056591 ns | 1048327.5 ns | 1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6669292 ns | 6335208.5 ns | 1.05 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 910917 ns | 914290 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 3426458 ns | 3428937.5 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 3445083 ns | 3384709 ns | 1.02 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 3449250 ns | 3435000 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 3430354 ns | 3417875 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 176943 ns | 175238.5 ns | 1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1423917 ns | 1424083 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 448819 ns | 426124 ns | 1.05 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 6195708.5 ns | 6191270.5 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 6031458 ns | 6170041 ns | 0.98 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 6208500 ns | 6167416.5 ns | 1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 6217667 ns | 6190792 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1004532.5 ns | 994959 ns | 1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 7516103.5 ns | 7413750 ns | 1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1546963 ns | 1549811 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 470709 ns | 470666 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 341708 ns | 252458 ns | 1.35 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 342084 ns | 342417 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 904895.5 ns | 901125 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA | 47024 ns | 46139 ns | 1.02 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/Metal | 536459 ns | 368208 ns | 1.46 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/AMDGPU | 244742 ns | 243602 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 2262542 ns | 2334750 ns | 0.97 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 2032625.5 ns | 1752562 ns | 1.16 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 2039042 ns | 2041187.5 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 3286625 ns | 3280124.5 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA | 272283 ns | 255952 ns | 1.06 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/Metal | 2240916 ns | 2244770.5 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/AMDGPU | 767047 ns | 770018 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 58209 ns | 55708 ns | 1.04 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 45792 ns | 39041 ns | 1.17 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 46834 ns | 46020.5 ns | 1.02 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 82833 ns | 84125 ns | 0.98 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 28553 ns | 28321 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1155250 ns | 1106875 ns | 1.04 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 76011 ns | 76505.5 ns | 0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1966583 ns | 2029708 ns | 0.97 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2097041.5 ns | 2082292 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2105333 ns | 2090958 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1928854 ns | 1949604 ns | 0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 234931 ns | 232547 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 11838750 ns | 11649979 ns | 1.02 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1200890 ns | 1052311 ns | 1.14 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 58000 ns | 55833 ns | 1.04 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 46666 ns | 39083.5 ns | 1.19 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 46333 ns | 46375 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 82333 ns | 84042 ns | 0.98 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 50602 ns | 49287 ns | 1.03 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1136750 ns | 1049084 ns | 1.08 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 75801 ns | 69820 ns | 1.09 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1932208 ns | 1919458 ns | 1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1947417 ns | 1955416.5 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1987250 ns | 1946334 ns | 1.02 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1888333 ns | 1890750 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 241387 ns | 239685 ns | 1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 10094000 ns | 9788042 ns | 1.03 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1035989 ns | 918859 ns | 1.13 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 333 ns | 292 ns | 1.14 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 333 ns | 417 ns | 0.80 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 292 ns | 1.28 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 292 ns | 292 ns | 1 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 35237 ns | 34717 ns | 1.01 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal | 443125.5 ns | 263500 ns | 1.68 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU | 48850 ns | 46211 ns | 1.06 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7125 ns | 6333 ns | 1.13 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 6625 ns | 7500 ns | 0.88 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7291 ns | 6583 ns | 1.11 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6500 ns | 7000 ns | 0.93 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 210223.5 ns | 208392.5 ns | 1.01 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal | 5397396 ns | 4479667 ns | 1.20 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 368313 ns | 365124 ns | 1.01 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 291 ns | 291 ns | 1 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 292 ns | 292 ns | 1 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 291 ns | 250 ns | 1.16 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 250 ns | 250 ns | 1 |
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA | 33000.5 ns | 32562 ns | 1.01 |
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/Metal | 258666 ns | 258000 ns | 1.00 |
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/AMDGPU | 39550 ns | 37000 ns | 1.07 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 2834 ns | 2750 ns | 1.03 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 3417 ns | 3625 ns | 0.94 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 3625 ns | 2709 ns | 1.34 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 3000 ns | 2917 ns | 1.03 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA | 191838.5 ns | 189309.5 ns | 1.01 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/Metal | 1338041 ns | 905666.5 ns | 1.48 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/AMDGPU | 161651 ns | 151136.5 ns | 1.07 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 422791.5 ns | 467667 ns | 0.90 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 422146 ns | 444750 ns | 0.95 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 435479.5 ns | 425999.5 ns | 1.02 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 420583.5 ns | 421833.5 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 138548 ns | 137895 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 2802625 ns | 2386500 ns | 1.17 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 369014 ns | 367024 ns | 1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 3806000 ns | 3802521 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 3822896 ns | 3765917 ns | 1.02 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 3824375 ns | 3811417 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 3805604 ns | 3799541.5 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 714199.5 ns | 709425 ns | 1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 11413916 ns | 10457896 ns | 1.09 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1476123 ns | 1471404 ns | 1.00 |
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) | 49897812.5 ns | 49735229.5 ns | 1.00 |
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) | 35524000 ns | 25984959 ns | 1.37 |
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) | 35517334 ns | 35560875 ns | 1.00 |
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) | 96963667 ns | 96902041.5 ns | 1.00 |
batchedmm(512, Bsize=32)/forward/GPU/CUDA | 1624121 ns | 1616773 ns | 1.00 |
batchedmm(512, Bsize=32)/forward/GPU/AMDGPU | 1037819 ns | 1045271 ns | 0.99 |
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) | 154672625 ns | 153907333 ns | 1.00 |
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) | 112393437.5 ns | 89247291.5 ns | 1.26 |
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) | 112506958 ns | 112379750 ns | 1.00 |
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) | 296345729 ns | 294166500 ns | 1.01 |
batchedmm(512, Bsize=32)/zygote/GPU/CUDA | 6489470 ns | 6515848 ns | 1.00 |
batchedmm(512, Bsize=32)/zygote/GPU/AMDGPU | 5543337 ns | 5562255.5 ns | 1.00 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) | 19041.5 ns | 14521 ns | 1.31 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) | 18917 ns | 14958 ns | 1.26 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) | 17167 ns | 16833 ns | 1.02 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) | 15667 ns | 14854.5 ns | 1.05 |
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA | 20662 ns | 20539.5 ns | 1.01 |
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/Metal | 323166 ns | 206959 ns | 1.56 |
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/AMDGPU | 26030 ns | 26060 ns | 1.00 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) | 11083 ns | 10625 ns | 1.04 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) | 9062.5 ns | 7771 ns | 1.17 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) | 9209 ns | 9208 ns | 1.00 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) | 17417 ns | 17437.5 ns | 1.00 |
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA | 262818 ns | 260548 ns | 1.01 |
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/Metal | 1604333.5 ns | 1587125 ns | 1.01 |
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/AMDGPU | 148741 ns | 149326.5 ns | 1.00 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 8437.5 ns | 7958 ns | 1.06 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 8979 ns | 9292 ns | 0.97 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 9895.5 ns | 9500 ns | 1.04 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 7938 ns | 7958.5 ns | 1.00 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 127524.5 ns | 116273.5 ns | 1.10 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal | 820209 ns | 810375 ns | 1.01 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 233142 ns | 233683 ns | 1.00 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 9583 ns | 9208.5 ns | 1.04 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 10458 ns | 10645.5 ns | 0.98 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10041.5 ns | 10208 ns | 0.98 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 9687.5 ns | 10375 ns | 0.93 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 625567 ns | 619508.5 ns | 1.01 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal | 5349854 ns | 4432792 ns | 1.21 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 648161 ns | 654786 ns | 0.99 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 10937.5 ns | 8291.5 ns | 1.32 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 10250 ns | 10459 ns | 0.98 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 11666.5 ns | 10042 ns | 1.16 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 9146 ns | 9250 ns | 0.99 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 122855.5 ns | 120531 ns | 1.02 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal | 976417 ns | 901792 ns | 1.08 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 70821 ns | 71071 ns | 1.00 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 14667 ns | 13250 ns | 1.11 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 15333 ns | 16042 ns | 0.96 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 14208 ns | 17208 ns | 0.83 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 13417 ns | 15167 ns | 0.88 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 596853 ns | 592138 ns | 1.01 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal | 4884688 ns | 4027062.5 ns | 1.21 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 343503 ns | 345753 ns | 0.99 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
541 ns |
459 ns |
1.18 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
625 ns |
583 ns |
1.07 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
583 ns |
500 ns |
1.17 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
459 ns |
541 ns |
0.85 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA |
35561 ns |
34521 ns |
1.03 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal |
474104 ns |
371562.5 ns |
1.28 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU |
204042 ns |
206352 ns |
0.99 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
8354.5 ns |
7062.5 ns |
1.18 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
8375 ns |
8333.5 ns |
1.00 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
7750 ns |
8583 ns |
0.90 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
6875 ns |
8000 ns |
0.86 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA |
235390.5 ns |
233771 ns |
1.01 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal |
5839792 ns |
4885833 ns |
1.20 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU |
656975.5 ns |
662116 ns |
0.99 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) |
16584 ns |
12292 ns |
1.35 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) |
17042 ns |
13229 ns |
1.29 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) |
14291 ns |
15125 ns |
0.94 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) |
10042 ns |
10167 ns |
0.99 |
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA |
22367 ns |
22042 ns |
1.01 |
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/Metal |
222667 ns |
189125 ns |
1.18 |
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/AMDGPU |
185752 ns |
189132 ns |
0.98 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) |
32458 ns |
31875 ns |
1.02 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) |
32145.5 ns |
32333.5 ns |
0.99 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) |
32416.5 ns |
32291.5 ns |
1.00 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) |
32125 ns |
32000 ns |
1.00 |
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA |
277897 ns |
276327 ns |
1.01 |
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/Metal |
1844750 ns |
1697542 ns |
1.09 |
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/AMDGPU |
593515 ns |
595015.5 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
442500 ns |
480875 ns |
0.92 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
443792 ns |
441083 ns |
1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
444209 ns |
450250 ns |
0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
450666 ns |
490979 ns |
0.92 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
194406.5 ns |
194024 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal |
2002791 ns |
2629708 ns |
0.76 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
367493 ns |
368063.5 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
3843396 ns |
3822958 ns |
1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
3822375 ns |
3807354 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
3837021 ns |
3827834 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
3828583 ns |
3826167 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
546587.5 ns |
544349 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
9324958 ns |
9196542 ns |
1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1364051 ns |
1359983 ns |
1.00 |
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) |
778516791 ns |
838219667 ns |
0.93 |
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) |
542702208 ns |
415052604.5 ns |
1.31 |
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) |
545647834 ns |
543102500 ns |
1.00 |
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) |
1521122979.5 ns |
1525021500 ns |
1.00 |
batchedmm(512, Bsize=512)/forward/GPU/CUDA |
22763268.5 ns |
22764607.5 ns |
1.00 |
batchedmm(512, Bsize=512)/forward/GPU/AMDGPU |
14742526.5 ns |
14772276 ns |
1.00 |
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) |
2527539958 ns |
3570164958 ns |
0.71 |
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) |
1822368542 ns |
1502049709 ns |
1.21 |
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) |
2266716458 ns |
2269221042 ns |
1.00 |
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) |
4816725334 ns |
4773617583 ns |
1.01 |
batchedmm(512, Bsize=512)/zygote/GPU/CUDA |
332342534.5 ns |
369302709 ns |
0.90 |
batchedmm(512, Bsize=512)/zygote/GPU/AMDGPU |
88513087.5 ns |
87924411 ns |
1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
78250 ns |
79646 ns |
0.98 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
77750 ns |
78895.5 ns |
0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
79500 ns |
78667 ns |
1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
76708 ns |
77583 ns |
0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
208775.5 ns |
207237 ns |
1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal |
543354 ns |
520375 ns |
1.04 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
107265.5 ns |
107601 ns |
1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
191958 ns |
250834 ns |
0.77 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
193916 ns |
294583.5 ns |
0.66 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
278354 ns |
285708.5 ns |
0.97 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
191917 ns |
222333.5 ns |
0.86 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
1056329 ns |
1049109.5 ns |
1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
6246083 ns |
6122958 ns |
1.02 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
634790 ns |
640576 ns |
0.99 |
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) |
199782750 ns |
199656458.5 ns |
1.00 |
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) |
139205750 ns |
103769666.5 ns |
1.34 |
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) |
139486250 ns |
139342042 ns |
1.00 |
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) |
388433333 ns |
388182208 ns |
1.00 |
batchedmm(512, Bsize=128)/forward/GPU/CUDA |
5833082.5 ns |
5838796 ns |
1.00 |
batchedmm(512, Bsize=128)/forward/GPU/AMDGPU |
3539050 ns |
3577840.5 ns |
0.99 |
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) |
620133542 ns |
616451521 ns |
1.01 |
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) |
441272958 ns |
351188291.5 ns |
1.26 |
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) |
439771292 ns |
439680896 ns |
1.00 |
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) |
1178807834 ns |
1178137125 ns |
1.00 |
batchedmm(512, Bsize=128)/zygote/GPU/CUDA |
26447699.5 ns |
26651952 ns |
0.99 |
batchedmm(512, Bsize=128)/zygote/GPU/AMDGPU |
22039567 ns |
22092888 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
7250 ns |
7333 ns |
0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
6167 ns |
5292 ns |
1.17 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
6125 ns |
6084 ns |
1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
9792 ns |
10167 ns |
0.96 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
28196 ns |
27714.5 ns |
1.02 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal |
596833 ns |
351458 ns |
1.70 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
48371 ns |
48481 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
212937 ns |
218291.5 ns |
0.98 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
221500 ns |
222250 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
229333 ns |
221209 ns |
1.04 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
205916 ns |
213708.5 ns |
0.96 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
224394 ns |
222292 ns |
1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
9258896 ns |
9125125 ns |
1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
520850 ns |
529665 ns |
0.98 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
9395.5 ns |
7271 ns |
1.29 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
9437.5 ns |
9541.5 ns |
0.99 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
10354.5 ns |
9791 ns |
1.06 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
9291.5 ns |
8187.5 ns |
1.13 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA |
118055 ns |
117715.5 ns |
1.00 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal |
906375 ns |
885458 ns |
1.02 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU |
68451 ns |
69700 ns |
0.98 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
7625 ns |
7479 ns |
1.02 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
9958 ns |
10479.5 ns |
0.95 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
8000 ns |
10875 ns |
0.74 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
7458.5 ns |
8875 ns |
0.84 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA |
524850 ns |
519786.5 ns |
1.01 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal |
3849333 ns |
3961208 ns |
0.97 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU |
312977.5 ns |
316073 ns |
0.99 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
541 ns |
416 ns |
1.30 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
584 ns |
750 ns |
0.78 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
542 ns |
459 ns |
1.18 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
416 ns |
500 ns |
0.83 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA |
26875 ns |
26338 ns |
1.02 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal |
469042 ns |
488604.5 ns |
0.96 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU |
46651 ns |
46820 ns |
1.00 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
9333 ns |
9291 ns |
1.00 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
10812.5 ns |
10416 ns |
1.04 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
9667 ns |
9208.5 ns |
1.05 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
8709 ns |
11583 ns |
0.75 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA |
256059 ns |
253612 ns |
1.01 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal |
6077334 ns |
5171833.5 ns |
1.18 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU |
383208.5 ns |
388624 ns |
0.99 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) |
107542 ns |
104834 ns |
1.03 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) |
100125 ns |
84834 ns |
1.18 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) |
101479.5 ns |
99500 ns |
1.02 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) |
146625 ns |
146333 ns |
1.00 |
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA |
25218 ns |
24613 ns |
1.02 |
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/Metal |
273854 ns |
246062.5 ns |
1.11 |
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/AMDGPU |
189951 ns |
192062 ns |
0.99 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) |
495334 ns |
526854 ns |
0.94 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) |
478291.5 ns |
478875 ns |
1.00 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) |
516708 ns |
500416.5 ns |
1.03 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) |
478125 ns |
478958.5 ns |
1.00 |
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA |
234048 ns |
232619 ns |
1.01 |
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/Metal |
2325583 ns |
1709625 ns |
1.36 |
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/AMDGPU |
610996 ns |
610896 ns |
1.00 |
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) |
5791 ns |
5125 ns |
1.13 |
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) |
6500 ns |
7167 ns |
0.91 |
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) |
6916 ns |
6791 ns |
1.02 |
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) |
6375 ns |
4042 ns |
1.58 |
batchedmm(16, Bsize=32)/forward/GPU/CUDA |
16672 ns |
16580 ns |
1.01 |
batchedmm(16, Bsize=32)/forward/GPU/AMDGPU |
84840 ns |
79701 ns |
1.06 |
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) |
11709 ns |
11708 ns |
1.00 |
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) |
11917 ns |
11584 ns |
1.03 |
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) |
11041 ns |
10792 ns |
1.02 |
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) |
16625 ns |
17687.5 ns |
0.94 |
batchedmm(16, Bsize=32)/zygote/GPU/CUDA |
215620.5 ns |
214143.5 ns |
1.01 |
batchedmm(16, Bsize=32)/zygote/GPU/AMDGPU |
379153 ns |
366964 ns |
1.03 |
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) |
39709 ns |
35792 ns |
1.11 |
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) |
51834 ns |
50791 ns |
1.02 |
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) |
52458 ns |
51833.5 ns |
1.01 |
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) |
14687.5 ns |
13542 ns |
1.08 |
batchedmm(16, Bsize=128)/forward/GPU/CUDA |
20332 ns |
21568 ns |
0.94 |
batchedmm(16, Bsize=128)/forward/GPU/AMDGPU |
92151 ns |
87241 ns |
1.06 |
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) |
36208 ns |
38979.5 ns |
0.93 |
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) |
31917 ns |
30708 ns |
1.04 |
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) |
37000.5 ns |
30416 ns |
1.22 |
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) |
57166 ns |
58458 ns |
0.98 |
batchedmm(16, Bsize=128)/zygote/GPU/CUDA |
194385.5 ns |
192010 ns |
1.01 |
batchedmm(16, Bsize=128)/zygote/GPU/AMDGPU |
394518.5 ns |
395119 ns |
1.00 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) |
1791.5 ns |
1729.5 ns |
1.04 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) |
1917 ns |
1875 ns |
1.02 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) |
2250 ns |
2146 ns |
1.05 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) |
1750 ns |
1709 ns |
1.02 |
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA |
20812 ns |
20594 ns |
1.01 |
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/Metal |
311104 ns |
326833 ns |
0.95 |
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/AMDGPU |
32150 ns |
33120 ns |
0.97 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) |
2167 ns |
2125 ns |
1.02 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) |
2208 ns |
2333 ns |
0.95 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) |
2292 ns |
2250 ns |
1.02 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) |
2125 ns |
2042 ns |
1.04 |
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA |
205188 ns |
204587 ns |
1.00 |
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/Metal |
1645562.5 ns |
1518500 ns |
1.08 |
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/AMDGPU |
135871 ns |
136826.5 ns |
0.99 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
5250 ns |
4417 ns |
1.19 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
5270.5 ns |
5250 ns |
1.00 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
7334 ns |
6375.5 ns |
1.15 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
5812.5 ns |
4041.5 ns |
1.44 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA |
146117 ns |
145077 ns |
1.01 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal |
458167 ns |
725208 ns |
0.63 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU |
68481 ns |
69471 ns |
0.99 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
8125 ns |
8041 ns |
1.01 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
9250 ns |
8958 ns |
1.03 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
8459 ns |
8416 ns |
1.01 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
8292 ns |
9208 ns |
0.90 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA |
880887.5 ns |
875812.5 ns |
1.01 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal |
6096687.5 ns |
5580917 ns |
1.09 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU |
386063.5 ns |
389804 ns |
0.99 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
56833 ns |
56792 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
57750 ns |
56875 ns |
1.02 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
57667 ns |
57584 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
58042 ns |
58375 ns |
0.99 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
38026 ns |
37054 ns |
1.03 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal |
486292 ns |
336000 ns |
1.45 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
202862 ns |
203242 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
448000 ns |
485813 ns |
0.92 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
464562 ns |
499958.5 ns |
0.93 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
514167 ns |
468208 ns |
1.10 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
433250 ns |
438854.5 ns |
0.99 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
268686 ns |
268055 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
8272083 ns |
8122166.5 ns |
1.02 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
828927 ns |
832729 ns |
1.00 |
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) |
3318958 ns |
3291250 ns |
1.01 |
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) |
2343459 ns |
1764708 ns |
1.33 |
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) |
2341458.5 ns |
2339021 ns |
1.00 |
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) |
6308646 ns |
6260292 ns |
1.01 |
batchedmm(128, Bsize=128)/forward/GPU/CUDA |
205558 ns |
204625 ns |
1.00 |
batchedmm(128, Bsize=128)/forward/GPU/AMDGPU |
215872 ns |
209992 ns |
1.03 |
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) |
11518666.5 ns |
11332208 ns |
1.02 |
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) |
8313187 ns |
6550833 ns |
1.27 |
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) |
8355021 ns |
8325250 ns |
1.00 |
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) |
21045041.5 ns |
20937125 ns |
1.01 |
batchedmm(128, Bsize=128)/zygote/GPU/CUDA |
745952 ns |
734916 ns |
1.02 |
batchedmm(128, Bsize=128)/zygote/GPU/AMDGPU |
1057229 ns |
1048155.5 ns |
1.01 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
6792 ns |
4291 ns |
1.58 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
5458.5 ns |
5875 ns |
0.93 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
6500 ns |
6583 ns |
0.99 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
4166 ns |
4896 ns |
0.85 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA |
138428.5 ns |
137991.5 ns |
1.00 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal |
882854 ns |
785625 ns |
1.12 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU |
56141 ns |
56390 ns |
1.00 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
7417 ns |
7042 ns |
1.05 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
7375 ns |
10562.5 ns |
0.70 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
7375 ns |
7104.5 ns |
1.04 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
6895.5 ns |
7833 ns |
0.88 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA |
761776 ns |
754679 ns |
1.01 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal |
5740125 ns |
5245042 ns |
1.09 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU |
368623 ns |
371414 ns |
0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
97541 ns |
127625 ns |
0.76 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
97291 ns |
95624.5 ns |
1.02 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
102834 ns |
100000 ns |
1.03 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
95833 ns |
95708 ns |
1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
152013 ns |
152137 ns |
1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal |
2103917 ns |
2635166.5 ns |
0.80 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
203602 ns |
203242 ns |
1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
2028917 ns |
2017959 ns |
1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
2020833 ns |
2027771 ns |
1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
2026375 ns |
2021167 ns |
1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
2015416 ns |
1987167 ns |
1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
713055.5 ns |
703925.5 ns |
1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
10783958 ns |
11055292 ns |
0.98 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1109779 ns |
1255893 ns |
0.88 |
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) |
33958 ns |
29375 ns |
1.16 |
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) |
37500 ns |
34500 ns |
1.09 |
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) |
34833 ns |
35250 ns |
0.99 |
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) |
625 ns |
583 ns |
1.07 |
batchedmm(2, Bsize=4)/forward/GPU/CUDA |
15998 ns |
15622 ns |
1.02 |
batchedmm(2, Bsize=4)/forward/GPU/AMDGPU |
78530 ns |
80130 ns |
0.98 |
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) |
2667 ns |
2542 ns |
1.05 |
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) |
2875 ns |
3125 ns |
0.92 |
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) |
3042 ns |
2834 ns |
1.07 |
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) |
2208.5 ns |
3000 ns |
0.74 |
batchedmm(2, Bsize=4)/zygote/GPU/CUDA |
142041 ns |
141408 ns |
1.00 |
batchedmm(2, Bsize=4)/zygote/GPU/AMDGPU |
342123 ns |
343344 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
7125 ns |
7125 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
5958 ns |
5375 ns |
1.11 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
6125 ns |
6000 ns |
1.02 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
10125 ns |
10209 ns |
0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
37390 ns |
36671 ns |
1.02 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal |
676167 ns |
331459 ns |
2.04 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
48060 ns |
48221 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
212729 ns |
217479 ns |
0.98 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
220042 ns |
229625 ns |
0.96 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
234520.5 ns |
225000 ns |
1.04 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
205667 ns |
212875 ns |
0.97 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
246638 ns |
244929 ns |
1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
8074875 ns |
7984187.5 ns |
1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
575124.5 ns |
574266 ns |
1.00 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) |
3958 ns |
3959 ns |
1.00 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) |
3958 ns |
3917 ns |
1.01 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) |
3958 ns |
3917 ns |
1.01 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) |
3917 ns |
3917 ns |
1.00 |
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA |
21990 ns |
21419 ns |
1.03 |
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/Metal |
252375 ns |
234583 ns |
1.08 |
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/AMDGPU |
42541 ns |
42620 ns |
1.00 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) |
14750 ns |
14791 ns |
1.00 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) |
15083 ns |
14750 ns |
1.02 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) |
14959 ns |
14875 ns |
1.01 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) |
14875 ns |
14833 ns |
1.00 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA |
313538 ns |
311492 ns |
1.01 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/Metal |
1028333 ns |
982000 ns |
1.05 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/AMDGPU |
197971.5 ns |
192231.5 ns |
1.03 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
101625 ns |
140834 ns |
0.72 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
103750 ns |
127417 ns |
0.81 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
106333 ns |
105167 ns |
1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
103459 ns |
141000 ns |
0.73 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
142836 ns |
152595 ns |
0.94 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal |
2101916 ns |
2057334 ns |
1.02 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
204827 ns |
213297 ns |
0.96 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
1936209 ns |
1917833 ns |
1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
1845500 ns |
1898875 ns |
0.97 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
1928833 ns |
1922083 ns |
1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1923833 ns |
1898854 ns |
1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
695235 ns |
692137 ns |
1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
10431000 ns |
10436541 ns |
1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1217070 ns |
1217872 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
17333 ns |
18250 ns |
0.95 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
18521 ns |
18625 ns |
0.99 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
21896 ns |
20750 ns |
1.06 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
16667 ns |
17749.5 ns |
0.94 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
109599.5 ns |
110137 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal |
1364167 ns |
480541.5 ns |
2.84 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
78990 ns |
79421 ns |
0.99 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
215333 ns |
252041.5 ns |
0.85 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
216916 ns |
217541.5 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
260791.5 ns |
219687.5 ns |
1.19 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
215375 ns |
222729.5 ns |
0.97 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
522595 ns |
519298 ns |
1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
6229875 ns |
6194812.5 ns |
1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
475894 ns |
478425 ns |
0.99 |
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) | 23895.5 ns | 23291.5 ns | 1.03 |
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) | 32458 ns | 28583 ns | 1.14 |
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) | 29292 ns | 28792 ns | 1.02 |
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) | 3458 ns | 1229.5 ns | 2.81 |
batchedmm(16, Bsize=4)/forward/GPU/CUDA | 16387 ns | 16210 ns | 1.01 |
batchedmm(16, Bsize=4)/forward/GPU/AMDGPU | 81460 ns | 82241 ns | 0.99 |
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) | 4854 ns | 4292 ns | 1.13 |
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) | 5208 ns | 4729 ns | 1.10 |
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) | 5395.5 ns | 5042 ns | 1.07 |
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) | 4792 ns | 5771 ns | 0.83 |
batchedmm(16, Bsize=4)/zygote/GPU/CUDA | 208972.5 ns | 207444.5 ns | 1.01 |
batchedmm(16, Bsize=4)/zygote/GPU/AMDGPU | 378003 ns | 378084 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 307375 ns | 305417 ns | 1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 306083 ns | 306250 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 306834 ns | 308084 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 305666 ns | 305750 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 229495 ns | 228609 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 591083 ns | 604584 ns | 0.98 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 272703 ns | 273963 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 530459 ns | 532917 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 539625 ns | 538167 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 568541.5 ns | 539125 ns | 1.05 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 529458 ns | 572709 ns | 0.92 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1085835 ns | 1074383 ns | 1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6524292 ns | 6115208.5 ns | 1.07 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 862227 ns | 858603.5 ns | 1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 20875 ns | 19291 ns | 1.08 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 19979.5 ns | 20708 ns | 0.96 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 23042 ns | 22375.5 ns | 1.03 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 19750 ns | 19875 ns | 0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 115151 ns | 114907 ns | 1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1456958.5 ns | 593916 ns | 2.45 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 79391 ns | 79421 ns | 1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 213000 ns | 215708 ns | 0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 213542 ns | 220584 ns | 0.97 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 246521 ns | 213625 ns | 1.15 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 213167 ns | 215875 ns | 0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 753347 ns | 762395 ns | 0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 7160542 ns | 7232562.5 ns | 0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 538225 ns | 542290.5 ns | 0.99 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 6708 ns | 6125 ns | 1.10 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 6667 ns | 7083 ns | 0.94 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 8375 ns | 7917 ns | 1.06 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 6917 ns | 6208 ns | 1.11 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 141312.5 ns | 140165.5 ns | 1.01 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal | 873084 ns | 799291 ns | 1.09 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU | 65570 ns | 65270 ns | 1.00 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 10916 ns | 9542 ns | 1.14 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 10625 ns | 10333.5 ns | 1.03 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 10312.5 ns | 10375 ns | 0.99 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 9500 ns | 11145.5 ns | 0.85 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 830338 ns | 826456 ns | 1.00 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal | 5676125 ns | 5311708 ns | 1.07 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 385548 ns | 387474 ns | 1.00 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 6375 ns | 4875 ns | 1.31 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 5000 ns | 6917 ns | 0.72 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 7000 ns | 7250 ns | 0.97 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 5833 ns | 4812.5 ns | 1.21 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 145831.5 ns | 144262 ns | 1.01 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal | 860333 ns | 808375 ns | 1.06 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 66980 ns | 66621 ns | 1.01 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7666 ns | 7458 ns | 1.03 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7375 ns | 8083 ns | 0.91 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7708 ns | 7541.5 ns | 1.02 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7250 ns | 7833 ns | 0.93 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 791858 ns | 783702 ns | 1.01 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal | 5957062 ns | 5566229 ns | 1.07 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 392953 ns | 395004 ns | 0.99 |
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) | 14530458 ns | 14350584 ns | 1.01 |
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) | 10121208 ns | 7693688 ns | 1.32 |
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) | 10148584 ns | 10127042 ns | 1.00 |
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) | 27684375 ns | 27615959 ns | 1.00 |
batchedmm(128, Bsize=512)/forward/GPU/CUDA | 535738 ns | 548306 ns | 0.98 |
batchedmm(128, Bsize=512)/forward/GPU/AMDGPU | 397784 ns | 393134 ns | 1.01 |
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) | 46490166.5 ns | 45943208 ns | 1.01 |
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) | 33477083 ns | 26437417 ns | 1.27 |
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) | 33537500 ns | 33454833 ns | 1.00 |
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) | 85247125 ns | 84782667 ns | 1.01 |
batchedmm(128, Bsize=512)/zygote/GPU/CUDA | 2681552 ns | 2657066 ns | 1.01 |
batchedmm(128, Bsize=512)/zygote/GPU/AMDGPU | 3309727.5 ns | 3290613 ns | 1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 67375 ns | 66375 ns | 1.02 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 68792 ns | 68584 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 69854.5 ns | 69333.5 ns | 1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 66416 ns | 65979 ns | 1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 122151 ns | 121920.5 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1445250 ns | 508166 ns | 2.84 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 226167 ns | 229397.5 ns | 0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 440688 ns | 446833 ns | 0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 451750 ns | 452437.5 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 492625 ns | 446375 ns | 1.10 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 441000 ns | 445834 ns | 0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 732469 ns | 728139 ns | 1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 7909125 ns | 7552104 ns | 1.05 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 791857 ns | 790108 ns | 1.00 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 542 ns | 500 ns | 1.08 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 583 ns | 666 ns | 0.88 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 666 ns | 500 ns | 1.33 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 500 ns | 667 ns | 0.75 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 33043 ns | 32311 ns | 1.02 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal | 460958 ns | 473500 ns | 0.97 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 47400 ns | 47340 ns | 1.00 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 8958 ns | 8666 ns | 1.03 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 9500.5 ns | 9208 ns | 1.03 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 8875 ns | 8458 ns | 1.05 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 8958 ns | 17104 ns | 0.52 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 288238 ns | 286358 ns | 1.01 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal | 5647625 ns | 4681395.5 ns | 1.21 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 374718.5 ns | 375004 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 9833 ns | 9875 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 9875 ns | 9875 ns | 1 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 9792 ns | 9792 ns | 1 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 9792 ns | 9833 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA | 23362.5 ns | 23012 ns | 1.02 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/Metal | 224083 ns | 215645.5 ns | 1.04 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/AMDGPU | 203972 ns | 205762 ns | 0.99 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 45833 ns | 45958 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 45959 ns | 46042 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 46292 ns | 46041 ns | 1.01 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 45958 ns | 46250 ns | 0.99 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA | 293937 ns | 290878 ns | 1.01 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/Metal | 980625 ns | 942542 ns | 1.04 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/AMDGPU | 605265 ns | 607695 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 56375 ns | 56250 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 57083 ns | 56458 ns | 1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 57042 ns | 57083 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 57625 ns | 57709 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 29177 ns | 28552 ns | 1.02 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 672479.5 ns | 663666.5 ns | 1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 203142 ns | 203541.5 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 448687 ns | 448583 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 464312.5 ns | 465562 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 510417 ns | 465458.5 ns | 1.10 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 433854.5 ns | 454041.5 ns | 0.96 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 247396 ns | 245887 ns | 1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 9563709 ns | 9545520.5 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 893168 ns | 887779 ns | 1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 544437.5 ns | 645812.5 ns | 0.84 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 597375 ns | 575959 ns | 1.04 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 648250 ns | 640542 ns | 1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 624874.5 ns | 646271 ns | 0.97 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 210492 ns | 208584 ns | 1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1402270.5 ns | 1406395.5 ns | 1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 311727.5 ns | 315503 ns | 0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2217500 ns | 2214979 ns | 1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2240396 ns | 2211999.5 ns | 1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2229708.5 ns | 2220812.5 ns | 1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2225709 ns | 2227958 ns | 1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 983621 ns | 978439 ns | 1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 7187583 ns | 10481646 ns | 0.69 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1224520 ns | 1213952 ns | 1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 21375 ns | 18625 ns | 1.15 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 20229 ns | 20729 ns | 0.98 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 22833 ns | 21583 ns | 1.06 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 20334 ns | 18875 ns | 1.08 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 113138 ns | 113850.5 ns | 0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1461708.5 ns | 497958 ns | 2.94 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 78945.5 ns | 79731 ns | 0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 221084 ns | 227375 ns | 0.97 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 218833.5 ns | 259417 ns | 0.84 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 258896 ns | 225541 ns | 1.15 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 219333.5 ns | 227084 ns | 0.97 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 734120 ns | 729838 ns | 1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 7799875 ns | 7560500 ns | 1.03 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 554300 ns | 554315 ns | 1.00 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 542 ns | 500 ns | 1.08 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 583 ns | 584 ns | 1.00 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 667 ns | 541 ns | 1.23 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 500 ns | 500 ns | 1 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 23471 ns | 23274 ns | 1.01 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal | 494166 ns | 484250 ns | 1.02 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 47550 ns | 48040 ns | 0.99 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 9166 ns | 9083 ns | 1.01 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 9792 ns | 10437.5 ns | 0.94 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 9458 ns | 9541 ns | 0.99 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 8958 ns | 9500 ns | 0.94 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 270206 ns | 268183 ns | 1.01 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal | 6261937.5 ns | 5000875 ns | 1.25 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 395404 ns | 398234 ns | 0.99 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 9292 ns | 7250 ns | 1.28 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 9625 ns | 9187.5 ns | 1.05 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 10021 ns | 9645.5 ns | 1.04 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 8792 ns | 8041 ns | 1.09 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 121366 ns | 118921.5 ns | 1.02 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal | 911666 ns | 886791.5 ns | 1.03 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU | 69440 ns | 71801 ns | 0.97 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7458 ns | 7604 ns | 0.98 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7625 ns | 8125 ns | 0.94 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 8000 ns | 7500 ns | 1.07 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7291 ns | 7562.5 ns | 0.96 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 511555 ns | 507494 ns | 1.01 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal | 4435417 ns | 3782375 ns | 1.17 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 317033 ns | 320313 ns | 0.99 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 1583 ns | 1500 ns | 1.06 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 1604 ns | 1708.5 ns | 0.94 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 2000 ns | 1791 ns | 1.12 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 1375 ns | 1375 ns | 1 |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA | 21695 ns | 21598 ns | 1.00 |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/Metal | 314792 ns | 313375 ns | 1.00 |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/AMDGPU | 189691 ns | 190932 ns | 0.99 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 3375 ns | 3541 ns | 0.95 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 3458 ns | 3583 ns | 0.97 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 3458.5 ns | 3458 ns | 1.00 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 3250 ns | 3292 ns | 0.99 |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA | 223980.5 ns | 218452 ns | 1.03 |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/Metal | 1832500 ns | 1797375 ns | 1.02 |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/AMDGPU | 582325 ns | 583116 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) | 148250 ns | 148104.5 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) | 128875 ns | 106833 ns | 1.21 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) | 128958.5 ns | 128562.5 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) | 225250 ns | 225000 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA | 24262 ns | 23975 ns | 1.01 |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/Metal | 283354.5 ns | 254292 ns | 1.11 |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/AMDGPU | 41141 ns | 41470 ns | 0.99 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) | 143416.5 ns | 157645.5 ns | 0.91 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) | 112166.5 ns | 87625 ns | 1.28 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) | 112958 ns | 112000 ns | 1.01 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) | 250792 ns | 250708.5 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA | 218729 ns | 218220.5 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/Metal | 2142416 ns | 1096666 ns | 1.95 |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/AMDGPU | 268067 ns | 269773 ns | 0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7334 ns | 7167 ns | 1.02 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5875 ns | 5333 ns | 1.10 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6000 ns | 6000 ns | 1 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 9958 ns | 10458 ns | 0.95 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 33218 ns | 32755 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 655791.5 ns | 330458 ns | 1.98 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 50271 ns | 50720 ns | 0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 220458 ns | 253104 ns | 0.87 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 227458 ns | 229041.5 ns | 0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 270542 ns | 234187.5 ns | 1.16 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 212354.5 ns | 227938 ns | 0.93 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 266380 ns | 263186.5 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 8511103.5 ns | 8237750 ns | 1.03 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 592610 ns | 594190.5 ns | 1.00 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 14833 ns | 13792 ns | 1.08 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 15333 ns | 15166 ns | 1.01 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 17291.5 ns | 16499.5 ns | 1.05 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 15084 ns | 14667 ns | 1.03 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 140084.5 ns | 139540 ns | 1.00 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal | 909875 ns | 786729 ns | 1.16 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 231072 ns | 232963 ns | 0.99 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 23750 ns | 23000 ns | 1.03 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 24667 ns | 23937.5 ns | 1.03 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 23250 ns | 23875 ns | 0.97 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 23000 ns | 23979.5 ns | 0.96 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 872721 ns | 870094.5 ns | 1.00 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal | 5802750 ns | 5595708 ns | 1.04 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 678796 ns | 679366 ns | 1.00 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 9208 ns | 8750 ns | 1.05 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 9292 ns | 10312.5 ns | 0.90 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 10834 ns | 11271 ns | 0.96 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 9459 ns | 9584 ns | 0.99 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 124867 ns | 123388.5 ns | 1.01 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal | 859792 ns | 858292 ns | 1.00 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU | 72841 ns | 74460 ns | 0.98 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 14208 ns | 13375 ns | 1.06 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 13708 ns | 14458.5 ns | 0.95 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 13916 ns | 13958 ns | 1.00 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 13708 ns | 13625 ns | 1.01 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 673605 ns | 667308 ns | 1.01 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal | 5390083 ns | 4997708 ns | 1.08 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 367303 ns | 365743 ns | 1.00 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 9208 ns | 8583 ns | 1.07 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 10208.5 ns | 10333 ns | 0.99 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 11062.5 ns | 10312.5 ns | 1.07 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 9416 ns | 9166 ns | 1.03 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 124133 ns | 121770.5 ns | 1.02 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal | 963958 ns | 906625 ns | 1.06 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU | 72331 ns | 75170 ns | 0.96 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 12666 ns | 12292 ns | 1.03 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 12521 ns | 13437.5 ns | 0.93 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 12833 ns | 12916 ns | 0.99 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 11959 ns | 12458 ns | 0.96 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 558576.5 ns | 553718.5 ns | 1.01 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal | 4777458 ns | 3865125.5 ns | 1.24 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 342592 ns | 341293 ns | 1.00 |
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) | 29729 ns | 26354.5 ns | 1.13 |
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) | 34979 ns | 30645.5 ns | 1.14 |
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) | 31541.5 ns | 31541 ns | 1.00 |
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) | 1833 ns | 1833 ns | 1 |
batchedmm(2, Bsize=128)/forward/GPU/CUDA | 16426 ns | 16183 ns | 1.02 |
batchedmm(2, Bsize=128)/forward/GPU/AMDGPU | 81111 ns | 81001 ns | 1.00 |
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) | 5625 ns | 5209 ns | 1.08 |
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) | 4958 ns | 5021 ns | 0.99 |
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) | 5416 ns | 5417 ns | 1.00 |
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) | 6542 ns | 6604 ns | 0.99 |
batchedmm(2, Bsize=128)/zygote/GPU/CUDA | 141297.5 ns | 140577.5 ns | 1.01 |
batchedmm(2, Bsize=128)/zygote/GPU/AMDGPU | 371714 ns | 370423.5 ns | 1.00 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 292 ns | 250 ns | 1.17 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 375 ns | 1 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 250 ns | 1.50 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 291 ns | 291 ns | 1 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 26334 ns | 25697 ns | 1.02 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal | 487083 ns | 465667 ns | 1.05 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU | 47261 ns | 47180 ns | 1.00 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 6417 ns | 6125 ns | 1.05 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 6458 ns | 6729 ns | 0.96 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 6792 ns | 6333 ns | 1.07 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6250 ns | 6312.5 ns | 0.99 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 189443.5 ns | 187721.5 ns | 1.01 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal | 6171125 ns | 4952833.5 ns | 1.25 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 386868 ns | 386429 ns | 1.00 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 2000 ns | 1959 ns | 1.02 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 2083 ns | 2042 ns | 1.02 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 2125 ns | 2000 ns | 1.06 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 1958 ns | 1959 ns | 1.00 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 27204 ns | 26463 ns | 1.03 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal | 471084 ns | 479625 ns | 0.98 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 205931 ns | 206252 ns | 1.00 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 16042 ns | 16250 ns | 0.99 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 16833 ns | 16666 ns | 1.01 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 16791 ns | 16208.5 ns | 1.04 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 16646 ns | 16417 ns | 1.01 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 277576.5 ns | 276067 ns | 1.01 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal | 6251000 ns | 5326083 ns | 1.17 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 699851 ns | 700836 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 148375 ns | 173875 ns | 0.85 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 150437.5 ns | 148750 ns | 1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 153292 ns | 155708 ns | 0.98 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 153125 ns | 147458 ns | 1.04 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 211229 ns | 203847 ns | 1.04 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1433375 ns | 1561917 ns | 0.92 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 214362 ns | 232482 ns | 0.92 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1324083 ns | 1328917 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1324958 ns | 1311771 ns | 1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1328917 ns | 1320791 ns | 1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1318208 ns | 1322500 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 915681 ns | 909940.5 ns | 1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 6666625 ns | 7124333 ns | 0.94 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1106330 ns | 995559.5 ns | 1.11 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
25708 ns |
22958 ns |
1.12 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
24833 ns |
26833 ns |
0.93 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
27541 ns |
27625 ns |
1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
26083 ns |
24667 ns |
1.06 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
237833 ns |
234608.5 ns |
1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal |
1183708 ns |
576541 ns |
2.05 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
114451 ns |
116011 ns |
0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
117687.5 ns |
118166.5 ns |
1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
125583 ns |
122375 ns |
1.03 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
136583.5 ns |
158041.5 ns |
0.86 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
128500 ns |
123833.5 ns |
1.04 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
1088124 ns |
1073695 ns |
1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
6499917 ns |
6127166 ns |
1.06 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
608510 ns |
612925 ns |
0.99 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
292 ns |
250 ns |
1.17 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
375 ns |
375 ns |
1 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
375 ns |
291 ns |
1.29 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
250 ns |
250 ns |
1 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA |
23244 ns |
23160 ns |
1.00 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal |
492313 ns |
478542 ns |
1.03 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU |
47000 ns |
47471 ns |
0.99 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
6395.5 ns |
6291 ns |
1.02 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
6750 ns |
6833.5 ns |
0.99 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
6958 ns |
6458 ns |
1.08 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
6541 ns |
6584 ns |
0.99 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA |
206201.5 ns |
204382.5 ns |
1.01 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal |
6167167 ns |
5334937.5 ns |
1.16 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU |
386698 ns |
388703 ns |
0.99 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
7000 ns |
5208 ns |
1.34 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
6208.5 ns |
7021 ns |
0.88 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
8250 ns |
7458 ns |
1.11 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
5958 ns |
5667 ns |
1.05 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA |
146258 ns |
145933.5 ns |
1.00 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal |
457750 ns |
753959 ns |
0.61 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU |
232042 ns |
234802 ns |
0.99 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
9917 ns |
9583 ns |
1.03 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
10333 ns |
10375 ns |
1.00 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
10334 ns |
10125 ns |
1.02 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
9958 ns |
10042 ns |
0.99 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA |
909058 ns |
903827 ns |
1.01 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal |
6298000 ns |
5826479 ns |
1.08 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU |
665070.5 ns |
668457 ns |
0.99 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) |
667 ns |
667 ns |
1 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) |
667 ns |
709 ns |
0.94 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) |
667 ns |
625 ns |
1.07 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) |
666 ns |
625 ns |
1.07 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA |
22848 ns |
22371 ns |
1.02 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/Metal |
325125 ns |
208416 ns |
1.56 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/AMDGPU |
205392 ns |
207552 ns |
0.99 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) |
4791 ns |
4584 ns |
1.05 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) |
4625 ns |
4833 ns |
0.96 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) |
4750 ns |
4666 ns |
1.02 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) |
4542 ns |
4584 ns |
0.99 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA |
231258 ns |
228749 ns |
1.01 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/Metal |
1797833 ns |
1654416.5 ns |
1.09 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/AMDGPU |
580615 ns |
580735 ns |
1.00 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
7812.5 ns |
7750 ns |
1.01 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
7521 ns |
9166.5 ns |
0.82 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
10250 ns |
8834 ns |
1.16 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
7604.5 ns |
8291 ns |
0.92 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA |
123509.5 ns |
121959 ns |
1.01 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal |
806208 ns |
827916 ns |
0.97 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU |
73451 ns |
74011 ns |
0.99 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
8729.5 ns |
8625 ns |
1.01 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
8833 ns |
9041.5 ns |
0.98 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
9042 ns |
8583.5 ns |
1.05 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
8375 ns |
8375 ns |
1 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA |
596738 ns |
591884.5 ns |
1.01 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal |
4941250 ns |
4264875 ns |
1.16 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU |
340602 ns |
342784 ns |
0.99 |
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) |
126458 ns |
122750 ns |
1.03 |
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) |
129666 ns |
96459 ns |
1.34 |
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) |
129958 ns |
130187.5 ns |
1.00 |
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) |
180979.5 ns |
180875 ns |
1.00 |
batchedmm(128, Bsize=4)/forward/GPU/CUDA |
46506 ns |
45830 ns |
1.01 |
batchedmm(128, Bsize=4)/forward/GPU/AMDGPU |
101881 ns |
101721 ns |
1.00 |
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) |
302250 ns |
328000 ns |
0.92 |
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) |
315979.5 ns |
166666 ns |
1.90 |
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) |
345021 ns |
347541.5 ns |
0.99 |
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) |
586229 ns |
608646 ns |
0.96 |
batchedmm(128, Bsize=4)/zygote/GPU/CUDA |
192939 ns |
192063 ns |
1.00 |
batchedmm(128, Bsize=4)/zygote/GPU/AMDGPU |
509434 ns |
505519.5 ns |
1.01 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) |
399292 ns |
395916 ns |
1.01 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) |
287875 ns |
214250 ns |
1.34 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) |
287875 ns |
288167 ns |
1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) |
756125 ns |
756500 ns |
1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA |
44157 ns |
43676.5 ns |
1.01 |
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/Metal |
416625 ns |
429792 ns |
0.97 |
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/AMDGPU |
80330.5 ns |
82131 ns |
0.98 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) |
1393583.5 ns |
1458834 ns |
0.96 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) |
1132333 ns |
857583 ns |
1.32 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) |
1134896 ns |
1134333 ns |
1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) |
2442729 ns |
2441958.5 ns |
1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA |
251447 ns |
249859 ns |
1.01 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/Metal |
1849937 ns |
1909646 ns |
0.97 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/AMDGPU |
351723 ns |
352903 ns |
1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
657209 ns |
616500 ns |
1.07 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
644583 ns |
598250 ns |
1.08 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
647625 ns |
648916.5 ns |
1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
646500 ns |
642667 ns |
1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
203355 ns |
200586.5 ns |
1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal |
1396542 ns |
1363291 ns |
1.02 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
308307 ns |
313733 ns |
0.98 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
2493938 ns |
2445375 ns |
1.02 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
2448666 ns |
2426917 ns |
1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
2455417 ns |
2441500 ns |
1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
2437875 ns |
2440750 ns |
1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
995041 ns |
994961 ns |
1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
10855500 ns |
9661291 ns |
1.12 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1303071 ns |
1307388 ns |
1.00 |
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) |
33167 ns |
28521 ns |
1.16 |
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) |
37708 ns |
34625 ns |
1.09 |
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) |
34395.5 ns |
33916.5 ns |
1.01 |
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) |
792 ns |
875 ns |
0.91 |
batchedmm(2, Bsize=32)/forward/GPU/CUDA |
15536 ns |
15425.5 ns |
1.01 |
batchedmm(2, Bsize=32)/forward/GPU/AMDGPU |
78885.5 ns |
79381 ns |
0.99 |
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) |
3187 ns |
3062.5 ns |
1.04 |
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) |
3208.5 ns |
3416 ns |
0.94 |
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) |
3375 ns |
3208 ns |
1.05 |
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) |
3167 ns |
3209 ns |
0.99 |
batchedmm(2, Bsize=32)/zygote/GPU/CUDA |
138978.5 ns |
139741 ns |
0.99 |
batchedmm(2, Bsize=32)/zygote/GPU/AMDGPU |
337663 ns |
338953 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
407167 ns |
404500 ns |
1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
407625 ns |
402125 ns |
1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
408917 ns |
408334 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
420083 ns |
422458 ns |
0.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
43841 ns |
43145 ns |
1.02 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal |
1162041.5 ns |
1128750.5 ns |
1.03 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
242687 ns |
239562 ns |
1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
3888000 ns |
3863292 ns |
1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
3966812.5 ns |
3971625 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
4003187.5 ns |
3996791 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
3753604.5 ns |
3757979.5 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
242130 ns |
242826 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
11627208 ns |
11673750 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1434622 ns |
1433229 ns |
1.00 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) |
3917 ns |
3959 ns |
0.99 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) |
3959 ns |
3917 ns |
1.01 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) |
3958 ns |
3916 ns |
1.01 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) |
3875 ns |
3917 ns |
0.99 |
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA |
34345 ns |
33968 ns |
1.01 |
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/Metal |
265875 ns |
167334 ns |
1.59 |
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/AMDGPU |
40000 ns |
38620 ns |
1.04 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) |
15458 ns |
15666 ns |
0.99 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) |
16000 ns |
15750 ns |
1.02 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) |
15875 ns |
15625 ns |
1.02 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) |
15750 ns |
15625 ns |
1.01 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA |
255938 ns |
255128 ns |
1.00 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/Metal |
886916.5 ns |
843520.5 ns |
1.05 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/AMDGPU |
177676.5 ns |
169816.5 ns |
1.05 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) |
404750 ns |
402625 ns |
1.01 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) |
295584 ns |
220209 ns |
1.34 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) |
295625 ns |
295959 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) |
760584 ns |
760791.5 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA |
113304 ns |
113239 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/Metal |
453333.5 ns |
348895.5 ns |
1.30 |
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/AMDGPU |
89391 ns |
89300.5 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) |
1423000 ns |
1474958.5 ns |
0.96 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) |
1161417 ns |
881146 ns |
1.32 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) |
1159042 ns |
1159083.5 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) |
2466333.5 ns |
2461917 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA |
240454.5 ns |
241292 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/Metal |
1919791 ns |
1946459 ns |
0.99 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/AMDGPU |
331642 ns |
354883 ns |
0.93 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
500 ns |
500 ns |
1 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
542 ns |
542 ns |
1 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
584 ns |
500 ns |
1.17 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
500 ns |
500 ns |
1 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA |
26071 ns |
25844 ns |
1.01 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal |
479125 ns |
496709 ns |
0.96 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU |
205842 ns |
209382 ns |
0.98 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
7375 ns |
7375 ns |
1 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
7666.5 ns |
8104.5 ns |
0.95 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
7958 ns |
7500 ns |
1.06 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
7334 ns |
7375 ns |
0.99 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA |
211355 ns |
217033.5 ns |
0.97 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal |
6383916.5 ns |
5254333.5 ns |
1.21 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU |
685985.5 ns |
685977 ns |
1.00 |
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) |
831521 ns |
825125.5 ns |
1.01 |
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) |
618542 ns |
468584 ns |
1.32 |
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) |
621583 ns |
621500 ns |
1.00 |
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) |
1545084 ns |
1536542 ns |
1.01 |
batchedmm(128, Bsize=32)/forward/GPU/CUDA |
130366.5 ns |
130845.5 ns |
1.00 |
batchedmm(128, Bsize=32)/forward/GPU/AMDGPU |
230342 ns |
229862 ns |
1.00 |
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) |
2689875 ns |
2661979 ns |
1.01 |
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) |
1997791 ns |
1535250.5 ns |
1.30 |
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) |
2003375 ns |
2000792 ns |
1.00 |
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) |
4944895.5 ns |
4906416 ns |
1.01 |
batchedmm(128, Bsize=32)/zygote/GPU/CUDA |
259056 ns |
242304 ns |
1.07 |
batchedmm(128, Bsize=32)/zygote/GPU/AMDGPU |
850477 ns |
841449 ns |
1.01 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
333 ns |
292 ns |
1.14 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
334 ns |
375 ns |
0.89 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
375 ns |
250 ns |
1.50 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
292 ns |
291 ns |
1.00 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA |
32798 ns |
32216 ns |
1.02 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal |
459854.5 ns |
464375 ns |
0.99 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU |
47071 ns |
47630 ns |
0.99 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
6208 ns |
6125 ns |
1.01 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
6500 ns |
6708 ns |
0.97 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
6750 ns |
6500 ns |
1.04 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
6084 ns |
6375 ns |
0.95 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA |
224304.5 ns |
224154.5 ns |
1.00 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal |
5499188 ns |
4615291 ns |
1.19 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU |
357363 ns |
357793.5 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
2461458 ns |
2392708 ns |
1.03 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
2378542 ns |
2371959 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
2403542 ns |
2404416 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
2387833 ns |
2370084 ns |
1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
200880 ns |
200035.5 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal |
1494834 ns |
1597041.5 ns |
0.94 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
373593 ns |
373933 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
4658209 ns |
4648292 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
4661750.5 ns |
4644250 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
4678791.5 ns |
4636708 ns |
1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
4636334 ns |
4642750 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
901278.5 ns |
891890 ns |
1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
6685687.5 ns |
6938541.5 ns |
0.96 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1397522 ns |
1391633 ns |
1.00 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) |
6916.5 ns |
7187.5 ns |
0.96 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) |
7292 ns |
7542 ns |
0.97 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) |
7166 ns |
7125 ns |
1.01 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) |
7354.5 ns |
6875 ns |
1.07 |
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA |
23290 ns |
23289 ns |
1.00 |
bias_activation(512, act=relu)(512 x 128)/forward/GPU/Metal |
278083 ns |
243458.5 ns |
1.14 |
bias_activation(512, act=relu)(512 x 128)/forward/GPU/AMDGPU |
40180 ns |
39800 ns |
1.01 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) |
33146 ns |
46396.5 ns |
0.71 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) |
48833.5 ns |
32917 ns |
1.48 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) |
46125 ns |
45875.5 ns |
1.01 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) |
33958.5 ns |
67312 ns |
0.50 |
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA |
216552 ns |
214725 ns |
1.01 |
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/Metal |
2082666 ns |
1121562 ns |
1.86 |
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/AMDGPU |
268892 ns |
269102.5 ns |
1.00 |
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) |
21041.5 ns |
19604.5 ns |
1.07 |
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) |
26625 ns |
24021 ns |
1.11 |
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) |
24937.5 ns |
23750 ns |
1.05 |
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) |
7334 ns |
5084 ns |
1.44 |
batchedmm(2, Bsize=512)/forward/GPU/CUDA |
16735 ns |
17227 ns |
0.97 |
batchedmm(2, Bsize=512)/forward/GPU/AMDGPU |
84431 ns |
83741 ns |
1.01 |
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) |
11916 ns |
11916 ns |
1 |
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) |
10395.5 ns |
9354.5 ns |
1.11 |
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) |
10583 ns |
10417 ns |
1.02 |
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) |
18021 ns |
17958 ns |
1.00 |
batchedmm(2, Bsize=512)/zygote/GPU/CUDA |
227961 ns |
225890 ns |
1.01 |
batchedmm(2, Bsize=512)/zygote/GPU/AMDGPU |
370163 ns |
371753 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) |
406167 ns |
404000 ns |
1.01 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) |
297250 ns |
222584 ns |
1.34 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) |
297333 ns |
296875 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) |
762625 ns |
762667 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA |
46476 ns |
46288 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/Metal |
449292 ns |
358375 ns |
1.25 |
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/AMDGPU |
88291 ns |
89491 ns |
0.99 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) |
1437042 ns |
1480896 ns |
0.97 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) |
1162625 ns |
888250 ns |
1.31 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) |
1166208 ns |
1164959 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) |
2472895.5 ns |
2465417 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA |
282746.5 ns |
288016 ns |
0.98 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/Metal |
2110792 ns |
2117375 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/AMDGPU |
378543 ns |
381744 ns |
0.99 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
434250 ns |
432125 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
436625 ns |
430333 ns |
1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
436625 ns |
436917 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
447208 ns |
448604.5 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
54844 ns |
54122.5 ns |
1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal |
1118291.5 ns |
1059021 ns |
1.06 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
235652 ns |
234952 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
3915542 ns |
3895042 ns |
1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
4019584 ns |
4004458 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
4030583.5 ns |
4030291.5 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
3785166.5 ns |
3789979 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
262526 ns |
260055 ns |
1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
10655166 ns |
10349458.5 ns |
1.03 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1364571 ns |
1223712 ns |
1.12 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) |
8791 ns |
8750 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) |
7708 ns |
6917 ns |
1.11 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) |
7625 ns |
7583 ns |
1.01 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) |
12458 ns |
12416 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA |
23849 ns |
23553.5 ns |
1.01 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/Metal |
231188 ns |
214667 ns |
1.08 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/AMDGPU |
209912 ns |
211142 ns |
0.99 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) |
44833 ns |
44958 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) |
45083 ns |
45083 ns |
1 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) |
45334 ns |
45000 ns |
1.01 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) |
44958 ns |
44958 ns |
1 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA |
348178 ns |
344550 ns |
1.01 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/Metal |
1901333 ns |
1862458 ns |
1.02 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/AMDGPU |
655620 ns |
659011.5 ns |
0.99 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
84187.5 ns |
122729 ns |
0.69 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
86833.5 ns |
83521 ns |
1.04 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
128541 ns |
87354.5 ns |
1.47 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
90229.5 ns |
105375 ns |
0.86 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
190096 ns |
190055 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal |
2015000 ns |
1972791.5 ns |
1.02 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
218242 ns |
214447 ns |
1.02 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
2025917 ns |
2012458.5 ns |
1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
1983125 ns |
1980000 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
2025145.5 ns |
2023917 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
2020771 ns |
2011645.5 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
533574 ns |
529776 ns |
1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
9232375 ns |
9305500.5 ns |
0.99 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
961658 ns |
1088680 ns |
0.88 |
This comment was automatically generated by workflow using github-action-benchmark.