This repository has been archived by the owner on Nov 4, 2024. It is now read-only.
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
fix: enzyme reverse bias needs a check on Const
Showing 4 changed files with 11 additions and 11 deletions.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | | @@ -1,7 +1,7 @@ |
| 1 | 1 | name = "LuxLib" |
| 2 | 2 | uuid = "82251201-b29d-42c6-8e01-566dec8acb11" |
| 3 | 3 | authors = ["Avik Pal <[email protected]> and contributors"] |
| 4 | | - version = "1.2.2" |
| | 4 | + version = "1.2.3" |
| 5 | 5 | |
| 6 | 6 | [deps] |
| 7 | 7 | ArrayInterface = "4fba245c-0d91-5ea0-9b3e-6abc04ee57a9" |
0df09fa
@JuliaRegistrator register
0df09fa
Registration pull request created: JuliaRegistries/General/115299
Tip: Release Notes
Did you know you can add release notes too? Just add markdown formatted text underneath the comment after the text
"Release notes:" and it will be added to the registry PR, and if TagBot is installed it will also be added to the
release that TagBot creates. i.e.
To add them here just re-invoke and the PR will be updated.
Tagging
After the above pull request is merged, it is recommended that a tag is created on this repository for the registered package version.
This will be done automatically if the Julia TagBot GitHub Action is installed, or can be done manually through the github interface, or via:
0df09fa
LuxLib Benchmarks
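Each entry in this listing pairs a benchmark name with two timings in nanoseconds and a ratio. The ratio values are consistent with the first timing divided by the second, rounded to two decimals (for example, 6938 / 4667 ≈ 1.49). The Python sketch below reproduces that arithmetic for three entries taken from the listing. The helper names `ratio` and `slower_than` and the 1.2 cutoff are illustrative assumptions, not part of the benchmark tooling.

```python
# Sketch: recompute the ratio column from the two timings of an entry.
# Assumption: ratio = first timing / second timing, rounded to 2 decimals.
# `ratio`, `slower_than` and the 1.2 cutoff are hypothetical helpers/values.

def ratio(first_ns: float, second_ns: float) -> float:
    """Return first_ns / second_ns rounded to two decimals."""
    return round(first_ns / second_ns, 2)


def slower_than(rows, cutoff):
    """Return the names of entries whose ratio exceeds `cutoff`."""
    return [
        name
        for name, first_ns, second_ns in rows
        if ratio(first_ns, second_ns) > cutoff
    ]


# (name, first timing in ns, second timing in ns), copied from the listing.
rows = [
    ("layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s)", 6938, 4667),
    ("layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA", 133931, 117321),
    ("layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal", 741167, 3008750),
]

for name, first_ns, second_ns in rows:
    print(f"{ratio(first_ns, second_ns):.2f}  {name}")
```

A ratio above 1 means the first timing is larger than the second; which cutoff counts as a meaningful change depends on the noise of the benchmark machine, so the value used here is only a placeholder.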
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s)
6938
ns4667
ns1.49
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s)
7438
ns6666.5
ns1.12
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s)
7541
ns7500
ns1.01
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s)
5750
ns5750
ns1
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA
133931
ns117321
ns1.14
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI
2868757
ns2723919
ns1.05
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal
741167
ns3008750
ns0.25
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU
407074
ns404195
ns1.01
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s)
9916.5
ns9896
ns1.00
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s)
9625
ns9833
ns0.98
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s)
9937.5
ns9979
ns1.00
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s)
9916.5
ns9958.5
ns1.00
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA
536526
ns533872
ns1.00
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI
17845684
ns18512917
ns0.96
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal
2422500
ns2324292
ns1.04
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU
678976
ns674968
ns1.01
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s)
1583
ns1437.5
ns1.10
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s)
3145.5
ns2875
ns1.09
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s)
2812.5
ns2083
ns1.35
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s)
1541.5
ns1437.5
ns1.07
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA
21370
ns21479
ns0.99
bias_activation(32, act=relu)(32 x 128)/forward/GPU/oneAPI
1416739
ns1282166
ns1.10
bias_activation(32, act=relu)(32 x 128)/forward/GPU/Metal
237500
ns190209
ns1.25
bias_activation(32, act=relu)(32 x 128)/forward/GPU/AMDGPU
29161
ns29540
ns0.99
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s)
4166
ns4250
ns0.98
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s)
4291
ns4167
ns1.03
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s)
4417
ns4145.5
ns1.07
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s)
4104
ns4375
ns0.94
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA
143094
ns144438.5
ns0.99
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/oneAPI
9766798.5
ns9108147.5
ns1.07
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/Metal
1569250
ns1604875
ns0.98
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/AMDGPU
144301
ns145092
ns0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
58000
ns55875
ns1.04
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
46834
ns39209
ns1.19
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
46584
ns46625
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
82333
ns84167
ns0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
36625
ns36824
ns0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI
686115
ns542002
ns1.27
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal
1069291
ns1333104
ns0.80
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU
78821
ns81391
ns0.97
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
2031375
ns2024917
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
2084708
ns2079125
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
2090291
ns2081625
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
1985542
ns1993125
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
225038
ns226688
ns0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI
8235886
ns7623752
ns1.08
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal
5106125
ns7427958
ns0.69
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU
987279
ns1252074
ns0.79
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
174500
ns174750
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
162104.5
ns164541.5
ns0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
165229
ns148812.5
ns1.11
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
145875
ns144375
ns1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
165145
ns165480
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI
8411274
ns7680925
ns1.10
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal
1520666
ns1457521
ns1.04
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU
209957
ns204852
ns1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
1119979
ns1117250
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
1112166.5
ns1109375.5
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
1117709
ns1113334
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
1107125
ns1112187.5
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
687949
ns694582
ns0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI
35372606
ns33705507.5
ns1.05
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal
6112291
ns6238375
ns0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU
1024164.5
ns1026961
ns1.00
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s)
4625.5
ns4417
ns1.05
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s)
5104
ns5041
ns1.01
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s)
5583
ns5208
ns1.07
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s)
5042
ns4583
ns1.10
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA
92273
ns93299.5
ns0.99
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI
5823843
ns5368327
ns1.08
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal
499583.5
ns634041.5
ns0.79
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU
67701
ns69460
ns0.97
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
9000
ns8375
ns1.07
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
8500
ns8542
ns1.00
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
9187.5
ns8833
ns1.04
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
8417
ns8833
ns0.95
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA
600949
ns604485
ns0.99
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI
36561430
ns36365543
ns1.01
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal
5960250
ns5669937.5
ns1.05
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU
389274
ns388374
ns1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
19625
ns17000
ns1.15
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
17791
ns17709
ns1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
20291
ns18021
ns1.13
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
16645.5
ns16895.5
ns0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
65239
ns66654.5
ns0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI
3323140
ns2923981.5
ns1.14
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal
1293104
ns477833
ns2.71
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU
73656
ns78451
ns0.94
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
220959
ns216834
ns1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
212333
ns219896
ns0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
212541
ns225583.5
ns0.94
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
212000
ns217625
ns0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
347340
ns356473
ns0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI
13974103
ns14201022
ns0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal
5755333
ns5644395.5
ns1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU
462604
ns465005
ns0.99
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s)
666
ns667
ns1.00
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s)
833.5
ns750
ns1.11
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s)
875
ns812.5
ns1.08
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s)
584
ns625
ns0.93
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA
20357
ns20462
ns0.99
bias_activation(2, act=relu)(2 x 128)/forward/GPU/oneAPI
1288251
ns1162134.5
ns1.11
bias_activation(2, act=relu)(2 x 128)/forward/GPU/Metal
292667
ns302625
ns0.97
bias_activation(2, act=relu)(2 x 128)/forward/GPU/AMDGPU
31491
ns32870
ns0.96
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s)
1416.5
ns1417
ns1.00
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s)
1416
ns1458
ns0.97
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s)
1625
ns1417
ns1.15
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s)
1416
ns1416
ns1
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA
123399.5
ns125127
ns0.99
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/oneAPI
9450809
ns8831211
ns1.07
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/Metal
1493229
ns1526500
ns0.98
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/AMDGPU
135231
ns136521
ns0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
7500
ns7208
ns1.04
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
6042
ns5416
ns1.12
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
6000
ns6125
ns0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
10125
ns10666
ns0.95
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA
23818
ns23625
ns1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI
1331154.5
ns1207481
ns1.10
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal
628937.5
ns356458
ns1.76
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU
46911
ns48881
ns0.96
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
219750
ns226166
ns0.97
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
265167
ns265333
ns1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
264416
ns234854
ns1.13
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
249854
ns219500
ns1.14
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
189311.5
ns192027
ns0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI
33158982
ns31211143.5
ns1.06
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal
9299979.5
ns9046313
ns1.03
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU
643876
ns649247
ns0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s)
4125
ns4125
ns1
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s)
4125
ns4083
ns1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s)
4083
ns4084
ns1.00
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s)
4083
ns4083
ns1
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA
23427
ns23477
ns1.00
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/oneAPI
2124740.5
ns2001417
ns1.06
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/Metal
222770.5
ns214833
ns1.04
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/AMDGPU
46290
ns47261
ns0.98
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s)
16833
ns17083
ns0.99
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s)
16792
ns17000
ns0.99
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s)
16750
ns16833
ns1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s)
16792
ns17334
ns0.97
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA
191493
ns195303
ns0.98
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/oneAPI
11757211
ns14536946
ns0.81
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/Metal
955313
ns918208
ns1.04
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/AMDGPU
171341.5
ns174652
ns0.98
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s)
511167
ns508750
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s)
405458
ns330583
ns1.23
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s)
405000
ns404666
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s)
858250
ns864791
ns0.99
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA
113156
ns113620
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/oneAPI
448835
ns401393
ns1.12
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/Metal
471209
ns490979
ns0.96
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/AMDGPU
240532
ns242133
ns0.99
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s)
2268250
ns2313834
ns0.98
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s)
2031416
ns1747479
ns1.16
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s)
2030917
ns2035208
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s)
3275750
ns3272708.5
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA
236871
ns241207
ns0.98
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/oneAPI
10359638.5
ns10021457.5
ns1.03
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/Metal
1993250
ns2011770.5
ns0.99
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/AMDGPU
739142
ns743443
ns0.99
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s)
6583
ns4708.5
ns1.40
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s)
6875
ns7625
ns0.90
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s)
7709
ns7708
ns1.00
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s)
6292
ns5479.5
ns1.15
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA
90224.5
ns92351.5
ns0.98
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI
5882879
ns5442998
ns1.08
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal
771000
ns783479
ns0.98
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU
65250
ns65411
ns1.00
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
12333.5
ns10333.5
ns1.19
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
11375
ns11875
ns0.96
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
11312.5
ns11750
ns0.96
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
11833.5
ns12062.5
ns0.98
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA
622443
ns634956
ns0.98
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI
41746922
ns40400531.5
ns1.03
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal
5637750
ns5457291.5
ns1.03
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU
407854
ns409979.5
ns0.99
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s)
541
ns541
ns1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s)
500
ns583
ns0.86
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s)
500
ns500
ns1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s)
500
ns500
ns1
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA
22944
ns23181
ns0.99
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/oneAPI
2423476.5
ns2216579
ns1.09
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/Metal
326750
ns332584
ns0.98
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/AMDGPU
48960
ns47221
ns1.04
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s)
2125
ns2166
ns0.98
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s)
2125
ns2167
ns0.98
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s)
2083
ns2084
ns1.00
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s)
2125
ns2084
ns1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA
217144
ns215755
ns1.01
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/oneAPI
12060454
ns11357397.5
ns1.06
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/Metal
1960083
ns1978417
ns0.99
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/AMDGPU
180236.5
ns172626.5
ns1.04
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
8625
ns8937.5
ns0.97
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
9646
ns9729.5
ns0.99
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
11229
ns9459
ns1.19
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
8792
ns8958
ns0.98
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA
103267
ns96639
ns1.07
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI
3427494
ns3207607
ns1.07
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal
875083
ns876000
ns1.00
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU
73431
ns71941
ns1.02
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
17834
ns18521
ns0.96
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
17916
ns19104.5
ns0.94
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
17333
ns17625
ns0.98
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
18000
ns18812.5
ns0.96
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA
586862
ns554001
ns1.06
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI
17435012.5
ns16517942.5
ns1.06
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal
5223458
ns5180916.5
ns1.01
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU
377954
ns378539
ns1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s)
583
ns458
ns1.27
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s)
500
ns625
ns0.80
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s)
542
ns666
ns0.81
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s)
541
ns500
ns1.08
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA
34849.5
ns35213
ns0.99
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI
1279718
ns1186873
ns1.08
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal
435291
ns466396
ns0.93
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU
45841
ns46270
ns0.99
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
8979.5
ns9312.5
ns0.96
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
9250
ns9916.5
ns0.93
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
8917
ns9167
ns0.97
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
8146
ns9458.5
ns0.86
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA
260579
ns267136
ns0.98
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI
19733483
ns18948901
ns1.04
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal
4985875
ns4572250
ns1.09
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU
366004
ns367694
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s)
398667
ns395333
ns1.01
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s)
287958
ns214416
ns1.34
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s)
287750
ns288292
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s)
756458
ns756291
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA
111261.5
ns111882
ns0.99
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/oneAPI
376549
ns329474.5
ns1.14
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/Metal
367583.5
ns300208.5
ns1.22
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/AMDGPU
74430
ns77331
ns0.96
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s)
1400375
ns1453791.5
ns0.96
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s)
1135375
ns852583
ns1.33
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s)
1132354
ns1132645.5
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s)
2440958
ns2440625
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA
203910
ns207032
ns0.98
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/oneAPI
9225527
ns10204120
ns0.90
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/Metal
1662875
ns1668041.5
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/AMDGPU
321818
ns324428.5
ns0.99
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
7604.5
ns7041.5
ns1.08
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
8083
ns7750
ns1.04
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
8729
ns9396
ns0.93
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
7437.5
ns7791.5
ns0.95
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA
142785
ns144806.5
ns0.99
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI
6299176.5
ns5813106.5
ns1.08
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal
521292
ns437250
ns1.19
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU
65420
ns66071
ns0.99
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
12583
ns13083
ns0.96
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
12437.5
ns14479
ns0.86
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
14521
ns15709
ns0.92
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
14979.5
ns15354.5
ns0.98
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA
943733.5
ns956377
ns0.99
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI
47612069
ns42729213
ns1.11
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal
5885062.5
ns5700250
ns1.03
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU
417444
ns428955
ns0.97
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
30395.5
ns24000
ns1.27
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
29604
ns24875
ns1.19
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
27709
ns29292
ns0.95
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
25083.5
ns27667
ns0.91
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
195905
ns199144
ns0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI
8216412
ns7744284
ns1.06
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal
990125
ns999584
ns0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU
116401
ns116931
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
154583.5
ns103583
ns1.49
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
155500
ns152687
ns1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
114042
ns153583
ns0.74
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
113187.5
ns151000
ns0.75
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
1061855
ns1075746
ns0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI
46328998
ns43042130
ns1.08
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal
5883041
ns5733792
ns1.03
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU
586901
ns590946.5
ns0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
74459
ns75000
ns0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
75833
ns77084
ns0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
78208
ns86333.5
ns0.91
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
75958
ns74875
ns1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA
203068
ns205585
ns0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI
7813436
ns8027595.5
ns0.97
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal
533437.5
ns519187.5
ns1.03
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU
127391
ns127562
ns1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
298166
ns293542
ns1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
303208
ns308750
ns0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
306041.5
ns315187.5
ns0.97
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
295666
ns304208
ns0.97
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
1104226
ns1108118
ns1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI
44772773.5
ns40422383
ns1.11
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal
6766000
ns6276458
ns1.08
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU
694176
ns695017
ns1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
17000
ns15875
ns1.07
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
17292
ns17521
ns0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
18375
ns18500
ns0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
16792
ns16958
ns0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA
145201.5
ns146489
ns0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI
6348029
ns5586208
ns1.14
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal
448000.5
ns723083.5
ns0.62
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU
231113
ns232683
ns0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
27208
ns26667
ns1.02
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
28625
ns26687.5
ns1.07
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
27187.5
ns28208.5
ns0.96
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
26145.5
ns27708.5
ns0.94
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA
972527
ns982068.5
ns0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI
44334727.5
ns40344043
ns1.10
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal
5935916
ns5743229
ns1.03
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU
684627
ns686807.5
ns1.00
| groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 11375 ns | 11083 ns | 1.03 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 11625 ns | 12042 ns | 0.97 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 14042 ns | 12334 ns | 1.14 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 10416 ns | 10791 ns | 0.97 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 123261.5 ns | 124134 ns | 0.99 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI | 3725175 ns | 3473152 ns | 1.07 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal | 904958 ns | 880000 ns | 1.03 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 233272 ns | 234213 ns | 1.00 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 22000 ns | 21958 ns | 1.00 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 21666 ns | 22729.5 ns | 0.95 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 21542 ns | 21895.5 ns | 0.98 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 21916 ns | 22000 ns | 1.00 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 697545 ns | 701831.5 ns | 0.99 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 22814286 ns | 21157140 ns | 1.08 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal | 5479812.5 ns | 5204750 ns | 1.05 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 668531 ns | 674667 ns | 0.99 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 67459 ns | 63437.5 ns | 1.06 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 63625 ns | 65521 ns | 0.97 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 65084 ns | 66750 ns | 0.98 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 62667 ns | 63042 ns | 0.99 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 105558.5 ns | 106345.5 ns | 0.99 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 3699497 ns | 3373870 ns | 1.10 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1336625 ns | 480667 ns | 2.78 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 231652 ns | 233433 ns | 0.99 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 450250 ns | 437896 ns | 1.03 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 451792 ns | 456000 ns | 0.99 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 446041.5 ns | 450542 ns | 0.99 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 484250 ns | 444000 ns | 1.09 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 508079 ns | 515188 ns | 0.99 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 22280153.5 ns | 21597008 ns | 1.03 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6164479 ns | 6095791.5 ns | 1.01 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 712097 ns | 717017.5 ns | 0.99 |
| layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 7667 ns | 6792 ns | 1.13 |
| layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 8458 ns | 8000 ns | 1.06 |
| layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 8041.5 ns | 8583.5 ns | 0.94 |
| layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 7083.5 ns | 6917 ns | 1.02 |
| layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 142974 ns | 146052.5 ns | 0.98 |
| layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI | 5983895.5 ns | 5510181.5 ns | 1.09 |
| layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal | 687104.5 ns | 726500 ns | 0.95 |
| layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU | 68961 ns | 65301 ns | 1.06 |
| layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 14333 ns | 14292 ns | 1.00 |
| layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 14312 ns | 15292 ns | 0.94 |
| layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 15021 ns | 14084 ns | 1.07 |
| layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 15250 ns | 16209 ns | 0.94 |
| layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 941966 ns | 947670 ns | 0.99 |
| layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI | 40659493.5 ns | 39845105 ns | 1.02 |
| layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal | 5744375 ns | 5499875 ns | 1.04 |
| layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 395784 ns | 399764 ns | 0.99 |
| batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) | 6161520.5 ns | 6131500 ns | 1.00 |
| batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) | 6378125.5 ns | 3224875 ns | 1.98 |
| batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) | 6377708.5 ns | 6379229.5 ns | 1.00 |
| batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) | 11920959 ns | 11911084 ns | 1.00 |
| batchedmm(512, Bsize=4)/forward/GPU/CUDA | 347985 ns | 349856 ns | 0.99 |
| batchedmm(512, Bsize=4)/forward/GPU/AMDGPU | 320268 ns | 303248 ns | 1.06 |
| batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) | 19132416 ns | 19059708.5 ns | 1.00 |
| batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) | 20009458 ns | 11090437.5 ns | 1.80 |
| batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) | 19937708 ns | 20005646 ns | 1.00 |
| batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) | 36464229.5 ns | 36446770.5 ns | 1.00 |
| batchedmm(512, Bsize=4)/zygote/GPU/CUDA | 1013485 ns | 1081781.5 ns | 0.94 |
| batchedmm(512, Bsize=4)/zygote/GPU/AMDGPU | 1165921 ns | 1153782 ns | 1.01 |
| dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 917 ns | 958 ns | 0.96 |
| dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 1000 ns | 1000 ns | 1 |
| dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 917 ns | 958 ns | 0.96 |
| dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 958 ns | 917 ns | 1.04 |
| dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA | 23221 ns | 23071 ns | 1.01 |
| dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/oneAPI | 2197390 ns | 2085318 ns | 1.05 |
| dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/Metal | 332458.5 ns | 332541.5 ns | 1.00 |
| dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/AMDGPU | 205762 ns | 207622 ns | 0.99 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 3667 ns | 3667 ns | 1 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 3709 ns | 3750 ns | 0.99 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 3667 ns | 3708 ns | 0.99 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 3667 ns | 3667 ns | 1 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA | 277792 ns | 281551.5 ns | 0.99 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/oneAPI | 12494000 ns | 12095727 ns | 1.03 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/Metal | 2076312.5 ns | 2129583 ns | 0.97 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/AMDGPU | 624236 ns | 626307 ns | 1.00 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 8792 ns | 8042 ns | 1.09 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 8875.5 ns | 8145.5 ns | 1.09 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 9875 ns | 9042 ns | 1.09 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 7625 ns | 7937.5 ns | 0.96 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 119047.5 ns | 121104 ns | 0.98 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI | 3910252.5 ns | 3679976 ns | 1.06 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal | 795416.5 ns | 802541.5 ns | 0.99 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 65320 ns | 65471 ns | 1.00 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 11374.5 ns | 13125 ns | 0.87 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 12208 ns | 12875 ns | 0.95 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 11792 ns | 11417 ns | 1.03 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 11979.5 ns | 12708 ns | 0.94 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 629697.5 ns | 638151 ns | 0.99 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 23515262 ns | 22685670 ns | 1.04 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal | 5019875 ns | 4390333 ns | 1.14 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 352263 ns | 355644 ns | 0.99 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 292 ns | 333 ns | 0.88 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 250 ns | 292 ns | 0.86 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 250 ns | 291 ns | 0.86 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA | 22203 ns | 22337 ns | 0.99 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/oneAPI | 2289294 ns | 2195388.5 ns | 1.04 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/Metal | 228916 ns | 207833 ns | 1.10 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/AMDGPU | 46161 ns | 47401 ns | 0.97 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 3084 ns | 3042 ns | 1.01 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 2959 ns | 3375 ns | 0.88 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 2917 ns | 2916 ns | 1.00 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 2875 ns | 3333 ns | 0.86 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA | 200155 ns | 204047 ns | 0.98 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/oneAPI | 9757264 ns | 14763707.5 ns | 0.66 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/Metal | 1632083 ns | 1611395.5 ns | 1.01 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/AMDGPU | 153411.5 ns | 157641.5 ns | 0.97 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 11563 ns | 10250 ns | 1.13 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 11334 ns | 12167 ns | 0.93 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 12292 ns | 12187.5 ns | 1.01 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 10854 ns | 10604 ns | 1.02 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 120519 ns | 121713.5 ns | 0.99 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI | 3640370.5 ns | 3281210 ns | 1.11 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal | 897667 ns | 904791.5 ns | 0.99 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 232282 ns | 233512.5 ns | 0.99 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 20750 ns | 21104.5 ns | 0.98 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 21083 ns | 22583 ns | 0.93 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 21959 ns | 21083 ns | 1.04 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 21458.5 ns | 21708 ns | 0.99 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 590202 ns | 595173 ns | 0.99 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 22574638 ns | 20531194.5 ns | 1.10 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal | 4746958.5 ns | 4095583 ns | 1.16 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 639216 ns | 638246.5 ns | 1.00 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 4375 ns | 4417 ns | 0.99 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 4375 ns | 4375 ns | 1 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 4375 ns | 4375 ns | 1 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 4375 ns | 4417 ns | 0.99 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA | 23877 ns | 24193.5 ns | 0.99 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/oneAPI | 2442376 ns | 2211530 ns | 1.10 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/Metal | 225708 ns | 215041 ns | 1.05 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/AMDGPU | 46800 ns | 47690 ns | 0.98 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 16291 ns | 16292 ns | 1.00 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 16625 ns | 16291 ns | 1.02 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 16459 ns | 16667 ns | 0.99 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 16500 ns | 16416 ns | 1.01 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA | 326023.5 ns | 330020.5 ns | 0.99 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/oneAPI | 13171553 ns | 12280627 ns | 1.07 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/Metal | 1188229 ns | 1639709 ns | 0.72 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/AMDGPU | 205042 ns | 206457.5 ns | 0.99 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 2042 ns | 1917 ns | 1.07 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 2083 ns | 2167 ns | 0.96 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 2083 ns | 2084 ns | 1.00 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 2084 ns | 2084 ns | 1 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 35572 ns | 35891 ns | 0.99 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI | 1338351 ns | 1213015 ns | 1.10 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal | 435459 ns | 474917 ns | 0.92 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 202812 ns | 204052 ns | 0.99 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 16520.5 ns | 19687.5 ns | 0.84 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 17104.5 ns | 17187.5 ns | 1.00 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 18375 ns | 17750 ns | 1.04 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 18770.5 ns | 16667 ns | 1.13 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 291395 ns | 293976.5 ns | 0.99 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 23003699 ns | 21212198 ns | 1.08 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal | 5678333 ns | 4767354.5 ns | 1.19 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 682086 ns | 686777 ns | 0.99 |
| batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) | 58979 ns | 55771 ns | 1.06 |
| batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) | 67125 ns | 62792 ns | 1.07 |
| batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) | 66917 ns | 65604.5 ns | 1.02 |
| batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) | 51625 ns | 51333 ns | 1.01 |
| batchedmm(16, Bsize=512)/forward/GPU/CUDA | 66452 ns | 66418 ns | 1.00 |
| batchedmm(16, Bsize=512)/forward/GPU/AMDGPU | 114721 ns | 114241 ns | 1.00 |
| batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) | 162292 ns | 202896 ns | 0.80 |
| batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) | 147229 ns | 135104 ns | 1.09 |
| batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) | 130229 ns | 130083 ns | 1.00 |
| batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) | 296770.5 ns | 245666 ns | 1.21 |
| batchedmm(16, Bsize=512)/zygote/GPU/CUDA | 213701 ns | 215296 ns | 0.99 |
| batchedmm(16, Bsize=512)/zygote/GPU/AMDGPU | 607926 ns | 607861 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 84250 ns | 79709 ns | 1.06 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 124729 ns | 107104 ns | 1.16 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 85875 ns | 85167 ns | 1.01 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 123833 ns | 124166.5 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 193440 ns | 192861 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 7291287 ns | 5531381 ns | 1.32 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1831167 ns | 1816084 ns | 1.01 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 203522 ns | 203512 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1928271 ns | 1869895.5 ns | 1.03 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1891125 ns | 1901084 ns | 0.99 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1902250 ns | 1917666.5 ns | 0.99 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1914749.5 ns | 1889333 ns | 1.01 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 525346 ns | 531825 ns | 0.99 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 26967967.5 ns | 32650285 ns | 0.83 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 9298209 ns | 8859584 ns | 1.05 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 927389 ns | 925670 ns | 1.00 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 292 ns | 291 ns | 1.00 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 291 ns | 1.00 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 291 ns | 291 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA | 21417 ns | 21389 ns | 1.00 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/oneAPI | 2392141 ns | 2065883 ns | 1.16 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/Metal | 342188 ns | 336229.5 ns | 1.02 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/AMDGPU | 42200 ns | 42770.5 ns | 0.99 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 1833 ns | 1834 ns | 1.00 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 1834 ns | 1834 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 1792 ns | 1792 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 1791 ns | 1792 ns | 1.00 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA | 249016 ns | 253832 ns | 0.98 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/oneAPI | 10390055 ns | 10417238 ns | 1.00 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/Metal | 1093187.5 ns | 1009479 ns | 1.08 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/AMDGPU | 179602 ns | 184376.5 ns | 0.97 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 9667 ns | 8000 ns | 1.21 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 10125 ns | 10042 ns | 1.01 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 10249.5 ns | 10375 ns | 0.99 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 9375 ns | 8167 ns | 1.15 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 118409 ns | 119090.5 ns | 0.99 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI | 3710566 ns | 3309191 ns | 1.12 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal | 886083.5 ns | 876708 ns | 1.01 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 231452 ns | 232622 ns | 0.99 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 9209 ns | 9083 ns | 1.01 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 10000 ns | 10625 ns | 0.94 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 9770.5 ns | 9542 ns | 1.02 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 9500 ns | 10125 ns | 0.94 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 517575.5 ns | 527209 ns | 0.98 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI | 21956361 ns | 22247571 ns | 0.99 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal | 4314937.5 ns | 3949187.5 ns | 1.09 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 624606 ns | 624237 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 58209 ns | 56166 ns | 1.04 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 46542 ns | 38916 ns | 1.20 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 46750 ns | 46125 ns | 1.01 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 83000 ns | 83958 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 39682 ns | 40233 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 1450337.5 ns | 1343252 ns | 1.08 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1115958 ns | 1123667 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 74661 ns | 76266 ns | 0.98 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1939500 ns | 1923750 ns | 1.01 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1983125 ns | 1952750.5 ns | 1.02 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1951312.5 ns | 1982854 ns | 0.98 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1897667 ns | 1850708.5 ns | 1.03 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 216819.5 ns | 221906.5 ns | 0.98 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 37812796.5 ns | 33376877 ns | 1.13 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 10968478.5 ns | 11408021 ns | 0.96 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1185212 ns | 1191052 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 417625 ns | 416333 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 419834 ns | 421645.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 420958 ns | 421208.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 417208 ns | 417667 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 204963.5 ns | 208798 ns | 0.98 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 8983027 ns | 7659621 ns | 1.17 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 546875 ns | 518208 ns | 1.06 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 280603 ns | 282883 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 669791.5 ns | 747916.5 ns | 0.90 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 780667 ns | 671583 ns | 1.16 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 689645.5 ns | 673562.5 ns | 1.02 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 725292 ns | 748021 ns | 0.97 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1038703 ns | 1048327.5 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 49679972 ns | 45569778.5 ns | 1.09 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6487209 ns | 6335208.5 ns | 1.02 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 909389 ns | 914290 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 3413542 ns | 3428937.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 3417875 ns | 3384709 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 3420479 ns | 3435000 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 3414187 ns | 3417875 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 168543 ns | 175238.5 ns | 0.96 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 8597060 ns | 8069034 ns | 1.07 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1366458.5 ns | 1424083 ns | 0.96 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 434404 ns | 426124 ns | 1.02 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 6191104 ns | 6191270.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 6232645.5 ns | 6170041 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 6213854 ns | 6167416.5 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 6216250 ns | 6190792 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 979877 ns | 994959 ns | 0.98 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 50928344 ns | 50094330 ns | 1.02 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 7557875 ns | 7413750 ns | 1.02 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1538944 ns | 1549811 ns | 0.99 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 471584 ns | 470666 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 341687.5 ns | 252458 ns | 1.35 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 340375 ns | 342417 ns | 0.99 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 902500 ns | 901125 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA | 46568 ns | 46139 ns | 1.01 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/oneAPI | 450349 ns | 884569 ns | 0.51 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/Metal | 504562.5 ns | 368208 ns | 1.37 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/AMDGPU | 241952 ns | 243602 ns | 0.99 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 2276916 ns | 2334750 ns | 0.98 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 2038666 ns | 1752562 ns | 1.16 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 2034583 ns | 2041187.5 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 3280958 ns | 3280124.5 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA | 253153 ns | 255952 ns | 0.99 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/oneAPI | 14086050 ns | 12850913 ns | 1.10 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/Metal | 2208291.5 ns | 2244770.5 ns | 0.98 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/AMDGPU | 765407 ns | 770018 ns | 0.99 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 57959 ns | 55708 ns | 1.04 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 46250 ns | 39041 ns | 1.18 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 46250 ns | 46020.5 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 82792 ns | 84125 ns | 0.98 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 28134 ns | 28321 ns | 0.99 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 1575508 ns | 1407008 ns | 1.12 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1135958 ns | 1106875 ns | 1.03 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 73405.5 ns | 76505.5 ns | 0.96 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1962520.5 ns | 2029708 ns | 0.97 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2093312.5 ns | 2082292 ns | 1.01 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2086834 ns | 2090958 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2000458.5 ns | 1949604 ns | 1.03 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 229351 ns | 232547 ns | 0.99 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 38934321 ns | 35887652 ns | 1.08 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 11662250 ns | 11649979 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1196771 ns | 1052311 ns | 1.14 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 58208 ns | 55833 ns | 1.04 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 46812.5 ns | 39083.5 ns | 1.20 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 46708 ns | 46375 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 82375 ns | 84042 ns | 0.98 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 49491 ns | 49287 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 947062 ns | 790006.5 ns | 1.20 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1068833 ns | 1049084 ns | 1.02 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 77751 ns | 69820 ns | 1.11 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1937792 ns | 1919458 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1974209 ns | 1955416.5 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1960000 ns | 1946334 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1899959 ns | 1890750 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 235535 ns | 239685 ns | 0.98 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 22349832.5 ns | 17609091 ns | 1.27 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 9994166 ns | 9788042 ns | 1.02 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 915999 ns | 918859 ns | 1.00 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 292 ns | 292 ns | 1 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 333 ns | 417 ns | 0.80 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 292 ns | 292 ns | 1 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 292 ns | 292 ns | 1 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 34420 ns | 34717 ns | 0.99 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI | 1328125 ns | 1181143 ns | 1.12 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal | 278292 ns | 263500 ns | 1.06 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU | 45880 ns | 46211 ns | 0.99 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6541 ns | 6333 ns | 1.03 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 6917 ns | 7500 ns | 0.92 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 6584 ns | 6583 ns | 1.00 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6458 ns | 7000 ns | 0.92 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 209753 ns | 208392.5 ns | 1.01 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI | 22541718 ns | 20162243 ns | 1.12 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal | 4971437.5 ns | 4479667 ns | 1.11 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 368183 ns | 365124 ns | 1.01 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 250 ns | 291 ns | 0.86 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 250 ns | 250 ns | 1 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 250 ns | 250 ns | 1 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA | 31457 ns | 32562 ns | 0.97 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/oneAPI | 1340759 ns | 1251080 ns | 1.07 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/Metal | 258291 ns | 258000 ns | 1.00 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/AMDGPU | 36451 ns | 37000 ns | 0.99 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 3458 ns | 2750 ns | 1.26 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 3292 ns | 3625 ns | 0.91 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 2917 ns | 2709 ns | 1.08 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 2917 ns | 2917 ns | 1 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA | 185714.5 ns | 189309.5 ns | 0.98 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/oneAPI | 8803725 ns | 7798739 ns | 1.13 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/Metal | 950374.5 ns | 905666.5 ns | 1.05 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/AMDGPU | 150601 ns | 151136.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 424603.5 ns | 467667 ns | 0.91 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 425000 ns | 444750 ns | 0.96 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 430459 ns | 425999.5 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 443562.5 ns | 421833.5 ns | 1.05 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 136540 ns | 137895 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 6325011.5 ns | 5774821 ns | 1.10 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 2056896 ns | 2386500 ns | 0.86 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 365713 ns | 367024 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 3790417 ns | 3802521 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 3803834 ns | 3765917 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 3804250 ns | 3811417 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 3813000 ns | 3799541.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 699295 ns | 709425 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 34149887.5 ns | 33554230 ns | 1.02 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 11037916.5 ns | 10457896 ns | 1.06 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1464794 ns | 1471404 ns | 1.00 |
| batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) | 49877979 ns | 49735229.5 ns | 1.00 |
| batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) | 35522250 ns | 25984959 ns | 1.37 |
| batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) | 35535229 ns | 35560875 ns | 1.00 |
| batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) | 96934583 ns | 96902041.5 ns | 1.00 |
| batchedmm(512, Bsize=32)/forward/GPU/CUDA | 1591242 ns | 1616773 ns | 0.98 |
| batchedmm(512, Bsize=32)/forward/GPU/AMDGPU | 1047550 ns | 1045271 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) | 154708541.5 ns | 153907333 ns | 1.01 |
| batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) | 112454083.5 ns | 89247291.5 ns | 1.26 |
| batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) | 112480333 ns | 112379750 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) | 296379229 ns | 294166500 ns | 1.01 |
| batchedmm(512, Bsize=32)/zygote/GPU/CUDA | 6494323.5 ns | 6515848 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/GPU/AMDGPU | 5551012.5 ns | 5562255.5 ns | 1.00 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s)
19062.5
ns14521
ns1.31
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s)
17833.5
ns14958
ns1.19
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s)
17041
ns16833
ns1.01
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s)
15875
ns14854.5
ns1.07
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA
21028
ns20539.5
ns1.02
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/oneAPI
1230713
ns1114507
ns1.10
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/Metal
219604.5
ns206959
ns1.06
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/AMDGPU
25950
ns26060
ns1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s)
10958
ns10625
ns1.03
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s)
9041
ns7771
ns1.16
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s)
9041.5
ns9208
ns0.98
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s)
17375
ns17437.5
ns1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA
257331
ns260548
ns0.99
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/oneAPI
10803925
ns9528073.5
ns1.13
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/Metal
1552917
ns1587125
ns0.98
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/AMDGPU
147801
ns149326.5
ns0.99
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s)
9354.5
ns7958
ns1.18
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s)
10000
ns9292
ns1.08
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s)
10458
ns9500
ns1.10
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s)
7458
ns7958.5
ns0.94
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA
114779
ns116273.5
ns0.99
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI
3881100
ns3476228
ns1.12
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal
797833
ns810375
ns0.98
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU
233502
ns233683
ns1.00
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
9916
ns9208.5
ns1.08
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
9708
ns10645.5
ns0.91
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
9334
ns10208
ns0.91
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
9709
ns10375
ns0.94
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA
616669
ns619508.5
ns1.00
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI
25342914
ns22906068.5
ns1.11
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal
4989750
ns4432792
ns1.13
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU
651926
ns654786
ns1.00
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s)
10583
ns8291.5
ns1.28
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s)
9146
ns10459
ns0.87
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s)
10584
ns10042
ns1.05
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s)
9875
ns9250
ns1.07
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA
120200.5
ns120531
ns1.00
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI
3758991
ns3436472
ns1.09
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal
905750
ns901792
ns1.00
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU
71611
ns71071
ns1.01
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
13541
ns13250
ns1.02
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
15500
ns16042
ns0.97
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
15458.5
ns17208
ns0.90
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
18125
ns15167
ns1.20
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA
585824
ns592138
ns0.99
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI
21389400
ns18951458.5
ns1.13
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal
4649750
ns4027062.5
ns1.15
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU
343933
ns345753
ns0.99
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s)
500
ns459
ns1.09
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s)
584
ns583
ns1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s)
500
ns500
ns1
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s)
500
ns541
ns0.92
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA
34550
ns34521
ns1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI
1371228
ns1191899
ns1.15
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal
447645.5
ns371562.5
ns1.20
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU
203956.5
ns206352
ns0.99
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s)
8270.5
ns7062.5
ns1.17
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s)
8708
ns8333.5
ns1.04
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s)
9167
ns8583
ns1.07
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s)
10625
ns8000
ns1.33
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA
231015.5
ns233771
ns0.99
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI
24528595.5
ns23357164
ns1.05
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal
5171458.5
ns4885833
ns1.06
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU
654796
ns662116
ns0.99
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s)
16167
ns12292
ns1.32
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s)
15895.5
ns13229
ns1.20
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s)
15979
ns15125
ns1.06
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s)
11875
ns10167
ns1.17
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA
21988
ns22042
ns1.00
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/oneAPI
1304948
ns1119591.5
ns1.17
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/Metal
257646
ns189125
ns1.36
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/AMDGPU
184412
ns189132
ns0.98
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s)
32084
ns31875
ns1.01
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s)
31875
ns32333.5
ns0.99
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s)
32250
ns32291.5
ns1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s)
31708
ns32000
ns0.99
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA
271511.5
ns276327
ns0.98
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/oneAPI
12350146
ns12201192
ns1.01
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/Metal
1659167
ns1697542
ns0.98
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/AMDGPU
587425
ns595015.5
ns0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
504958
ns480875
ns1.05
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
481520.5
ns441083
ns1.09
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
443208
ns450250
ns0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
488374.5
ns490979
ns0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
195092
ns194024
ns1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI
6520561
ns5766516
ns1.13
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal
1945520.5
ns2629708
ns0.74
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU
367668
ns368063.5
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
3839417
ns3822958
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
3824437.5
ns3807354
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
3828250
ns3827834
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
3827604.5
ns3826167
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
535436
ns544349
ns0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI
32985580
ns29050298
ns1.14
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal
9639667
ns9196542
ns1.05
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU
1204966.5
ns1359983
ns0.89
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s)
781980875
ns838219667
ns0.93
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s)
543423875
ns415052604.5
ns1.31
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s)
542625875
ns543102500
ns1.00
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s)
1559677978.5
ns1525021500
ns1.02
batchedmm(512, Bsize=512)/forward/GPU/CUDA
22745322
ns22764607.5
ns1.00
batchedmm(512, Bsize=512)/forward/GPU/AMDGPU
14786409
ns14772276
ns1.00
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s)
2528971583
ns3570164958
ns0.71
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s)
2254450917
ns1502049709
ns1.50
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s)
2476668541
ns2269221042
ns1.09
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s)
6300456542
ns4773617583
ns1.32
batchedmm(512, Bsize=512)/zygote/GPU/CUDA
366701385
ns369302709
ns0.99
batchedmm(512, Bsize=512)/zygote/GPU/AMDGPU
88751089
ns87924411
ns1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
75666
ns79646
ns0.95
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
79041.5
ns78895.5
ns1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
79458.5
ns78667
ns1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
76208
ns77583
ns0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA
203948
ns207237
ns0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI
9083475
ns7871351
ns1.15
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal
526062.5
ns520375
ns1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU
106536
ns107601
ns0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
270270.5
ns250834
ns1.08
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
292875
ns294583.5
ns0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
198312
ns285708.5
ns0.69
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
194667
ns222333.5
ns0.88
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
1034833
ns1049109.5
ns0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI
46783284
ns43337417.5
ns1.08
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal
6115521
ns6122958
ns1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU
633286
ns640576
ns0.99
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s)
199771000
ns199656458.5
ns1.00
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s)
138674666
ns103769666.5
ns1.34
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s)
138669167
ns139342042
ns1.00
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s)
388512334
ns388182208
ns1.00
batchedmm(512, Bsize=128)/forward/GPU/CUDA
5812826
ns5838796
ns1.00
batchedmm(512, Bsize=128)/forward/GPU/AMDGPU
3596784
ns3577840.5
ns1.01
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s)
621035604.5
ns616451521
ns1.01
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s)
439829542
ns351188291.5
ns1.25
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s)
440801667
ns439680896
ns1.00
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s)
1196350375
ns1178137125
ns1.02
batchedmm(512, Bsize=128)/zygote/GPU/CUDA
26769444
ns26651952
ns1.00
batchedmm(512, Bsize=128)/zygote/GPU/AMDGPU
21887487
ns22092888
ns0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
7291
ns7333
ns0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
6042
ns5292
ns1.14
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
6125
ns6084
ns1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
9875
ns10167
ns0.97
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA
27497
ns27714.5
ns0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI
1432348
ns1202781
ns1.19
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal
374083
ns351458
ns1.06
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU
46690
ns48481
ns0.96
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
216042
ns218291.5
ns0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
224375
ns222250
ns1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
220687.5
ns221209
ns1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
208062.5
ns213708.5
ns0.97
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
218341
ns222292
ns0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI
35298334.5
ns31765824
ns1.11
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal
9155708
ns9125125
ns1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU
528325
ns529665
ns1.00
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s)
9354.5
ns7271
ns1.29
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s)
9396
ns9541.5
ns0.98
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s)
9750
ns9791
ns1.00
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s)
7958.5
ns8187.5
ns0.97
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA
118295
ns117715.5
ns1.00
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI
3790588
ns3188633
ns1.19
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal
873834
ns885458
ns0.99
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU
69600
ns69700
ns1.00
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s)
8562.5
ns7479
ns1.14
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s)
9834
ns10479.5
ns0.94
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s)
9500
ns10875
ns0.87
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s)
12312.5
ns8875
ns1.39
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA
512184
ns519786.5
ns0.99
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI
21450760
ns18597573.5
ns1.15
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal
4433459
ns3961208
ns1.12
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU
315553
ns316073
ns1.00
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s)
542
ns416
ns1.30
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s)
709
ns750
ns0.95
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s)
500
ns459
ns1.09
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s)
458
ns500
ns0.92
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA
26098
ns26338
ns0.99
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI
1299422.5
ns1200694
ns1.08
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal
479708.5
ns488604.5
ns0.98
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU
46840
ns46820
ns1.00
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
9167
ns9291
ns0.99
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
11416
ns10416
ns1.10
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
11062.5
ns9208.5
ns1.20
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
9416
ns11583
ns0.81
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA
251250
ns253612
ns0.99
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI
25886430.5
ns25803867.5
ns1.00
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal
5832146.5
ns5171833.5
ns1.13
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU
387883
ns388624
ns1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s)
107916
ns104834
ns1.03
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s)
99250
ns84834
ns1.17
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s)
100645.5
ns99500
ns1.01
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s)
146583
ns146333
ns1.00
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA
24989
ns24613
ns1.02
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/oneAPI
1282751
ns1194962
ns1.07
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/Metal
267229.5
ns246062.5
ns1.09
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/AMDGPU
189842
ns192062
ns0.99
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s)
514208
ns526854
ns0.98
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s)
478541.5
ns478875
ns1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s)
478375
ns500416.5
ns0.96
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s)
482875
ns478958.5
ns1.01
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA
229903
ns232619
ns0.99
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/oneAPI
12990087
ns11733131
ns1.11
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/Metal
2133042
ns1709625
ns1.25
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/AMDGPU
608146
ns610896
ns1.00
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s)
5666
ns5125
ns1.11
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s)
7250
ns7167
ns1.01
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s)
6291
ns6791
ns0.93
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s)
6625
ns4042
ns1.64
batchedmm(16, Bsize=32)/forward/GPU/CUDA
16240.5
ns16580
ns0.98
batchedmm(16, Bsize=32)/forward/GPU/AMDGPU
79631
ns79701
ns1.00
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s)
12417
ns11708
ns1.06
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s)
11167
ns11584
ns0.96
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s)
12041.5
ns10792
ns1.12
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s)
16416.5
ns17687.5
ns0.93
batchedmm(16, Bsize=32)/zygote/GPU/CUDA
211157
ns214143.5
ns0.99
batchedmm(16, Bsize=32)/zygote/GPU/AMDGPU
375234
ns366964
ns1.02
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s)
39750
ns35792
ns1.11
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s)
52000
ns50791
ns1.02
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s)
53021
ns51833.5
ns1.02
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s)
16042
ns13542
ns1.18
batchedmm(16, Bsize=128)/forward/GPU/CUDA
19539
ns21568
ns0.91
batchedmm(16, Bsize=128)/forward/GPU/AMDGPU
90780.5
ns87241
ns1.04
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s)
42917
ns38979.5
ns1.10
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s)
32167
ns30708
ns1.05
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s)
32875
ns30416
ns1.08
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s)
57042
ns58458
ns0.98
batchedmm(16, Bsize=128)/zygote/GPU/CUDA
190769.5
ns192010
ns0.99
batchedmm(16, Bsize=128)/zygote/GPU/AMDGPU
392564
ns395119
ns0.99
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s)
1833.5
ns1729.5
ns1.06
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s)
1875
ns1875
ns1
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s)
2083
ns2146
ns0.97
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s)
1792
ns1709
ns1.05
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA
20462
ns20594
ns0.99
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/oneAPI
1239481
ns1163029.5
ns1.07
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/Metal
307042
ns326833
ns0.94
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/AMDGPU
31870
ns33120
ns0.96
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s)
2125
ns2125
ns1
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s)
2208
ns2333
ns0.95
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s)
2291
ns2250
ns1.02
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s)
2208
ns2042
ns1.08
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA
201344.5
ns204587
ns0.98
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/oneAPI
10165131
ns9292587
ns1.09
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/Metal
1570917
ns1518500
ns1.03
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/AMDGPU
136316.5
ns136826.5
ns1.00
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s)
6520.5
ns4417
ns1.48
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s)
5000
ns5250
ns0.95
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s)
5625
ns6375.5
ns0.88
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s)
5500
ns4041.5
ns1.36
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA
143896
ns145077
ns0.99
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI
6277095.5
ns5424296
ns1.16
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal
750374.5
ns725208
ns1.03
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU
69261
ns69471
ns1.00
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
8645.5
ns8041
ns1.08
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
8583.5
ns8958
ns0.96
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
9291.5
ns8416
ns1.10
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
8020.5
ns9208
ns0.87
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA
867420
ns875812.5
ns0.99
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI
42275328
ns40742928.5
ns1.04
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal
5663374.5
ns5580917
ns1.01
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU
387123
ns389804
ns0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
56875
ns56792
ns1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
57833
ns56875
ns1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
57750
ns57584
ns1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
58375
ns58375
ns1
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
36655
ns37054
ns0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI
1241845
ns1234596.5
ns1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal
541750
ns336000
ns1.61
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU
202922
ns203242
ns1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
468937.5
ns485813
ns0.97
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
477229.5
ns499958.5
ns0.95
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
464541
ns468208
ns0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
433625
ns438854.5
ns0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
263574
ns268055
ns0.98
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI
28829027
ns27322975
ns1.06
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal
8162250
ns8122166.5
ns1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU
827187.5
ns832729
ns0.99
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s)
3317521
ns3291250
ns1.01
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s)
2329500
ns1764708
ns1.32
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s)
2336167
ns2339021
ns1.00
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s)
6302416.5
ns6260292
ns1.01
batchedmm(128, Bsize=128)/forward/GPU/CUDA
204892
ns204625
ns1.00
batchedmm(128, Bsize=128)/forward/GPU/AMDGPU
208562
ns209992
ns0.99
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s)
11517062.5
ns11332208
ns1.02
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s)
8328812.5
ns6550833
ns1.27
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s)
8342500
ns8325250
ns1.00
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s)
21059354.5
ns20937125
ns1.01
batchedmm(128, Bsize=128)/zygote/GPU/CUDA
734814.5
ns734916
ns1.00
batchedmm(128, Bsize=128)/zygote/GPU/AMDGPU
1048679.5
ns1048155.5
ns1.00
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s)
5604.5
ns4291
ns1.31
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s)
5875
ns5875
ns1
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s)
6395.5
ns6583
ns0.97
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s)
4750
ns4896
ns0.97
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA
136624.5
ns137991.5
ns0.99
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI
6038921
ns5581467
ns1.08
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal
813000
ns785625
ns1.03
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU
56330
ns56390
ns1.00
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s)
9834
ns7042
ns1.40
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s)
11375
ns10562.5
ns1.08
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s)
10792
ns7104.5
ns1.52
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s)
7083
ns7833
ns0.90
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA
751768
ns754679
ns1.00
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI
37322808
ns34960226
ns1.07
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal
5368750
ns5245042
ns1.02
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU
366754
ns371414
ns0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
126417
ns127625
ns0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
101833
ns95624.5
ns1.06
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
97167
ns100000
ns0.97
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
135458.5
ns95708
ns1.42
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA
149617
ns152137
ns0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI
6377317.5
ns5871279.5
ns1.09
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal
2013729
ns2635166.5
ns0.76
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU
203027
ns203242
ns1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
1956250
ns2017959
ns0.97
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
2025708
ns2027771
ns1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
2023583
ns2021167
ns1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
2023875
ns1987167
ns1.02
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
699728
ns703925.5
ns0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI
32486459.5
ns31965494
ns1.02
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal
11144687.5
ns11055292
ns1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU
1109856
ns1255893
ns0.88
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s)
33708
ns29375
ns1.15
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s)
36250
ns34500
ns1.05
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s)
35292
ns35250
ns1.00
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s)
667
ns583
ns1.14
batchedmm(2, Bsize=4)/forward/GPU/CUDA
15147
ns15622
ns0.97
batchedmm(2, Bsize=4)/forward/GPU/AMDGPU
78750
ns80130
ns0.98
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s)
3166
ns2542
ns1.25
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s)
3292
ns3125
ns1.05
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s)
3916
ns2834
ns1.38
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s)
2125
ns3000
ns0.71
batchedmm(2, Bsize=4)/zygote/GPU/CUDA
138043.5
ns141408
ns0.98
batchedmm(2, Bsize=4)/zygote/GPU/AMDGPU
341483.5
ns343344
ns0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
7333
ns7125
ns1.03
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
6125
ns5375
ns1.14
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
5959
ns6000
ns0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
10083
ns10209
ns0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
36390.5
ns36671
ns0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI
1443013
ns1208337
ns1.19
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal
577687.5
ns331459
ns1.74
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU
48291
ns48221
ns1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
217083
ns217479
ns1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
233729
ns229625
ns1.02
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
220875
ns225000
ns0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
206583
ns212875
ns0.97
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
241954
ns244929
ns0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI
28863209.5
ns26091309.5
ns1.11
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal
8063584
ns7984187.5
ns1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU
578495
ns574266
ns1.01
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3958 ns | 3959 ns | 1.00 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3958 ns | 3917 ns | 1.01 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3917 ns | 3917 ns | 1 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3916 ns | 3917 ns | 1.00 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA | 21377 ns | 21419 ns | 1.00 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/oneAPI | 2296242.5 ns | 2118188.5 ns | 1.08 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/Metal | 246729.5 ns | 234583 ns | 1.05 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/AMDGPU | 42010 ns | 42620 ns | 0.99 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 14750 ns | 14791 ns | 1.00 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 15000 ns | 14750 ns | 1.02 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 14834 ns | 14875 ns | 1.00 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 14937.5 ns | 14833 ns | 1.01 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA | 306378 ns | 311492 ns | 0.98 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/oneAPI | 12904688 ns | 10906139 ns | 1.18 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/Metal | 1048854 ns | 982000 ns | 1.07 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/AMDGPU | 192742 ns | 192231.5 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 128750 ns | 140834 ns | 0.91 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 128042 ns | 127417 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 102500 ns | 105167 ns | 0.97 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 128458 ns | 141000 ns | 0.91 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 133598 ns | 152595 ns | 0.88 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 6098969 ns | 6050834 ns | 1.01 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1992062.5 ns | 2057334 ns | 0.97 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 203872 ns | 213297 ns | 0.96 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1913375 ns | 1917833 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1918875.5 ns | 1898875 ns | 1.01 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1920354.5 ns | 1922083 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1922729.5 ns | 1898854 ns | 1.01 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 684636 ns | 692137 ns | 0.99 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 31678268 ns | 31139112 ns | 1.02 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 10983583.5 ns | 10436541 ns | 1.05 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1217291 ns | 1217872 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 19708 ns | 18250 ns | 1.08 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 18000 ns | 18625 ns | 0.97 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 21250 ns | 20750 ns | 1.02 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 17541 ns | 17749.5 ns | 0.99 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 107089 ns | 110137 ns | 0.97 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 3668703 ns | 3282416 ns | 1.12 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1366125 ns | 480541.5 ns | 2.84 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 79431 ns | 79421 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 222250 ns | 252041.5 ns | 0.88 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 227291.5 ns | 217541.5 ns | 1.04 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 221667 ns | 219687.5 ns | 1.01 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 217604.5 ns | 222729.5 ns | 0.98 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 512942 ns | 519298 ns | 0.99 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 20906293 ns | 20051825.5 ns | 1.04 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6227916.5 ns | 6194812.5 ns | 1.01 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 478915 ns | 478425 ns | 1.00 |
| batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) | 24625 ns | 23291.5 ns | 1.06 |
| batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) | 32084 ns | 28583 ns | 1.12 |
| batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) | 29583.5 ns | 28792 ns | 1.03 |
| batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) | 1354 ns | 1229.5 ns | 1.10 |
| batchedmm(16, Bsize=4)/forward/GPU/CUDA | 15775 ns | 16210 ns | 0.97 |
| batchedmm(16, Bsize=4)/forward/GPU/AMDGPU | 87130 ns | 82241 ns | 1.06 |
| batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) | 5208 ns | 4292 ns | 1.21 |
| batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) | 4937.5 ns | 4729 ns | 1.04 |
| batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) | 6250 ns | 5042 ns | 1.24 |
| batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) | 4208 ns | 5771 ns | 0.73 |
| batchedmm(16, Bsize=4)/zygote/GPU/CUDA | 205104 ns | 207444.5 ns | 0.99 |
| batchedmm(16, Bsize=4)/zygote/GPU/AMDGPU | 375704 ns | 378084 ns | 0.99 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 305500 ns | 305417 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 305958 ns | 306250 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 308125 ns | 308084 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 304792 ns | 305750 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 224810.5 ns | 228609 ns | 0.98 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 8473173 ns | 7545946 ns | 1.12 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1064042 ns | 604584 ns | 1.76 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 272523 ns | 273963 ns | 0.99 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 588083 ns | 532917 ns | 1.10 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 540979 ns | 538167 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 532187.5 ns | 539125 ns | 0.99 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 530000 ns | 572709 ns | 0.93 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1066787 ns | 1074383 ns | 0.99 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 49056863 ns | 44755027.5 ns | 1.10 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6401167 ns | 6115208.5 ns | 1.05 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 857918.5 ns | 858603.5 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 20292 ns | 19291 ns | 1.05 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 21021 ns | 20708 ns | 1.02 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 21584 ns | 22375.5 ns | 0.96 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 19459 ns | 19875 ns | 0.98 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 111914.5 ns | 114907 ns | 0.97 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 3915484 ns | 3614583 ns | 1.08 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1445124.5 ns | 593916 ns | 2.43 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 79161 ns | 79421 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 259584 ns | 215708 ns | 1.20 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 218709 ns | 220584 ns | 0.99 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 213833 ns | 213625 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 221709 ns | 215875 ns | 1.03 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 729277 ns | 762395 ns | 0.96 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 28351086 ns | 25444001 ns | 1.11 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 7519125 ns | 7232562.5 ns | 1.04 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 535735 ns | 542290.5 ns | 0.99 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 7542 ns | 6125 ns | 1.23 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 6750 ns | 7083 ns | 0.95 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 7854 ns | 7917 ns | 0.99 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 6416 ns | 6208 ns | 1.03 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 139596.5 ns | 140165.5 ns | 1.00 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI | 6386332.5 ns | 5168559 ns | 1.24 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal | 812791.5 ns | 799291 ns | 1.02 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU | 64971 ns | 65270 ns | 1.00 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 12937 ns | 9542 ns | 1.36 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 9604 ns | 10333.5 ns | 0.93 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 10479 ns | 10375 ns | 1.01 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 9042 ns | 11145.5 ns | 0.81 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 821389 ns | 826456 ns | 0.99 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI | 41212440 ns | 37337383 ns | 1.10 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal | 5394125 ns | 5311708 ns | 1.02 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 376673 ns | 387474 ns | 0.97 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 5542 ns | 4875 ns | 1.14 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 6041 ns | 6917 ns | 0.87 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 5979 ns | 7250 ns | 0.82 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 4125 ns | 4812.5 ns | 0.86 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 143159 ns | 144262 ns | 0.99 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI | 6135330 ns | 5426091.5 ns | 1.13 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal | 841208 ns | 808375 ns | 1.04 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 66410 ns | 66621 ns | 1.00 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 8125 ns | 7458 ns | 1.09 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7625 ns | 8083 ns | 0.94 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7333 ns | 7541.5 ns | 0.97 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7458 ns | 7833 ns | 0.95 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 779803.5 ns | 783702 ns | 1.00 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI | 42489606.5 ns | 37497088 ns | 1.13 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal | 5806041.5 ns | 5566229 ns | 1.04 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 385114 ns | 395004 ns | 0.97 |
| batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) | 14517875 ns | 14350584 ns | 1.01 |
| batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) | 10107833 ns | 7693688 ns | 1.31 |
| batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) | 10123375 ns | 10127042 ns | 1.00 |
| batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) | 27737959 ns | 27615959 ns | 1.00 |
| batchedmm(128, Bsize=512)/forward/GPU/CUDA | 529900 ns | 548306 ns | 0.97 |
| batchedmm(128, Bsize=512)/forward/GPU/AMDGPU | 392854 ns | 393134 ns | 1.00 |
| batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) | 46502041.5 ns | 45943208 ns | 1.01 |
| batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) | 33504375 ns | 26437417 ns | 1.27 |
| batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) | 33527167 ns | 33454833 ns | 1.00 |
| batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) | 85258875 ns | 84782667 ns | 1.01 |
| batchedmm(128, Bsize=512)/zygote/GPU/CUDA | 2630210 ns | 2657066 ns | 0.99 |
| batchedmm(128, Bsize=512)/zygote/GPU/AMDGPU | 3305402 ns | 3290613 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 68083 ns | 66375 ns | 1.03 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 66021 ns | 68584 ns | 0.96 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 69042 ns | 69333.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 66875 ns | 65979 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 120187.5 ns | 121920.5 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 3913619.5 ns | 3593431.5 ns | 1.09 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1439458.5 ns | 508166 ns | 2.83 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 224532 ns | 229397.5 ns | 0.98 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 502375 ns | 446833 ns | 1.12 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 452542 ns | 452437.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 441146 ns | 446375 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 444833 ns | 445834 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 732944 ns | 728139 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 29462542.5 ns | 26912797 ns | 1.09 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 7794083 ns | 7552104 ns | 1.03 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 779447 ns | 790108 ns | 0.99 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 625 ns | 500 ns | 1.25 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 583 ns | 666 ns | 0.88 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 583 ns | 500 ns | 1.17 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 500 ns | 667 ns | 0.75 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 33084 ns | 32311 ns | 1.02 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI | 1348590.5 ns | 1198752.5 ns | 1.12 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal | 458416.5 ns | 473500 ns | 0.97 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 47291 ns | 47340 ns | 1.00 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 9209 ns | 8666 ns | 1.06 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 9500 ns | 9208 ns | 1.03 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 8666 ns | 8458 ns | 1.02 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 9167 ns | 17104 ns | 0.54 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 289186 ns | 286358 ns | 1.01 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 24166950 ns | 20778583 ns | 1.16 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal | 5210708.5 ns | 4681395.5 ns | 1.11 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 381324 ns | 375004 ns | 1.02 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 9792 ns | 9875 ns | 0.99 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 9834 ns | 9875 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 9792 ns | 9792 ns | 1 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 9792 ns | 9833 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA | 23519 ns | 23012 ns | 1.02 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/oneAPI | 2258808.5 ns | 2014844 ns | 1.12 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/Metal | 221041.5 ns | 215645.5 ns | 1.03 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/AMDGPU | 207272 ns | 205762 ns | 1.01 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 45959 ns | 45958 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 45959 ns | 46042 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 46041 ns | 46041 ns | 1 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 46375 ns | 46250 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA | 292709.5 ns | 290878 ns | 1.01 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/oneAPI | 13279604 ns | 9152947 ns | 1.45 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/Metal | 963562.5 ns | 942542 ns | 1.02 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/AMDGPU | 601736 ns | 607695 ns | 0.99 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 56834 ns | 56250 ns | 1.01 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 57208 ns | 56458 ns | 1.01 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 57000 ns | 57083 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 57791 ns | 57709 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 28797 ns | 28552 ns | 1.01 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 1296667 ns | 1253508.5 ns | 1.03 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 599375 ns | 663666.5 ns | 0.90 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 214467.5 ns | 203541.5 ns | 1.05 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 488583 ns | 448583 ns | 1.09 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 506875 ns | 465562 ns | 1.09 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 467854 ns | 465458.5 ns | 1.01 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 444854 ns | 454041.5 ns | 0.98 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 247966 ns | 245887 ns | 1.01 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 35422277.5 ns | 33424426 ns | 1.06 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 9625250 ns | 9545520.5 ns | 1.01 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 889783 ns | 887779 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 662791 ns | 645812.5 ns | 1.03 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 645583 ns | 575959 ns | 1.12 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 641458 ns | 640542 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 654708 ns | 646271 ns | 1.01 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 204631.5 ns | 208584 ns | 0.98 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 9404311.5 ns | 8406939 ns | 1.12 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1366041 ns | 1406395.5 ns | 0.97 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 307612.5 ns | 315503 ns | 0.97 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2256146 ns | 2214979 ns | 1.02 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2230917 ns | 2211999.5 ns | 1.01 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2237292 ns | 2220812.5 ns | 1.01 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2235916 ns | 2227958 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 983378 ns | 978439 ns | 1.01 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 51532010 ns | 47363900 ns | 1.09 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 7223667 ns | 10481646 ns | 0.69 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1360743 ns | 1213952 ns | 1.12 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 21208 ns | 18625 ns | 1.14 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 21895.5 ns | 20729 ns | 1.06 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 24000 ns | 21583 ns | 1.11 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 18708 ns | 18875 ns | 0.99 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 113606 ns | 113850.5 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 4029922 ns | 3565557.5 ns | 1.13 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1470375 ns | 497958 ns | 2.95 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 81911 ns | 79731 ns | 1.03 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 263833 ns | 227375 ns | 1.16 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 230917 ns | 259417 ns | 0.89 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 221375 ns | 225541 ns | 0.98 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 261833.5 ns | 227084 ns | 1.15 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 732293.5 ns | 729838 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 28666996 ns | 26163617 ns | 1.10 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 7932292 ns | 7560500 ns | 1.05 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 557920 ns | 554315 ns | 1.01 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 584 ns | 500 ns | 1.17 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 584 ns | 584 ns | 1 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 583 ns | 541 ns | 1.08 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 500 ns | 500 ns | 1 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 23564 ns | 23274 ns | 1.01 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI | 1402930.5 ns | 1191789 ns | 1.18 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal | 479854.5 ns | 484250 ns | 0.99 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 49551 ns | 48040 ns | 1.03 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 10042 ns | 9083 ns | 1.11 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 9833 ns | 10437.5 ns | 0.94 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 9208 ns | 9541 ns | 0.97 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 8625 ns | 9500 ns | 0.91 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 271175.5 ns | 268183 ns | 1.01 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 27354439 ns | 24685731.5 ns | 1.11 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal | 5706584 ns | 5000875 ns | 1.14 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 399053 ns | 398234 ns | 1.00 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 9709 ns | 7250 ns | 1.34 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 9104 ns | 9187.5 ns | 0.99 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 9437.5 ns | 9645.5 ns | 0.98 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 8375 ns | 8041 ns | 1.04 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 122324.5 ns | 118921.5 ns | 1.03 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI | 3848922 ns | 3382327 ns | 1.14 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal | 890083 ns | 886791.5 ns | 1.00 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU | 69951 ns | 71801 ns | 0.97 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7417 ns | 7604 ns | 0.98 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7500 ns | 8125 ns | 0.92 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7625 ns | 7500 ns | 1.02 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7333 ns | 7562.5 ns | 0.97 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 514534 ns | 507494 ns | 1.01 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI | 19594222 ns | 17189656.5 ns | 1.14 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal | 4165479 ns | 3782375 ns | 1.10 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 320028 ns | 320313 ns | 1.00 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 1562.5 ns | 1500 ns | 1.04 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 1708.5 ns | 1708.5 ns | 1 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 1833.5 ns | 1791 ns | 1.02 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 1333 ns | 1375 ns | 0.97 |
| bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA | 21964 ns | 21598 ns | 1.02 |
| bias_activation(2, act=gelu)(2 x 128)/forward/GPU/oneAPI | 1238732.5 ns | 1189888 ns | 1.04 |
| bias_activation(2, act=gelu)(2 x 128)/forward/GPU/Metal | 302542 ns | 313375 ns | 0.97 |
| bias_activation(2, act=gelu)(2 x 128)/forward/GPU/AMDGPU | 188582 ns | 190932 ns | 0.99 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 3333 ns | 3541 ns | 0.94 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 3458 ns | 3583 ns | 0.97 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 3334 ns | 3458 ns | 0.96 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 3250 ns | 3292 ns | 0.99 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA | 224397.5 ns | 218452 ns | 1.03 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/oneAPI | 10897600 ns | 9603283 ns | 1.13 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/Metal | 1688875 ns | 1797375 ns | 0.94 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/AMDGPU | 578505.5 ns | 583116 ns | 0.99 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) | 148875 ns | 148104.5 ns | 1.01 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) | 132708 ns | 106833 ns | 1.24 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) | 130750 ns | 128562.5 ns | 1.02 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) | 225250 ns | 225000 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA | 24103 ns | 23975 ns | 1.01 |
| bias_activation(512, act=tanh)(512 x 128)/forward/GPU/oneAPI | 1297180 ns | 1165725 ns | 1.11 |
| bias_activation(512, act=tanh)(512 x 128)/forward/GPU/Metal | 269833 ns | 254292 ns | 1.06 |
| bias_activation(512, act=tanh)(512 x 128)/forward/GPU/AMDGPU | 40231 ns | 41470 ns | 0.97 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) | 162604 ns | 157645.5 ns | 1.03 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) | 127166 ns | 87625 ns | 1.45 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) | 112750 ns | 112000 ns | 1.01 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) | 265229 ns | 250708.5 ns | 1.06 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA | 219287 ns | 218220.5 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/oneAPI | 11195277.5 ns | 10460438 ns | 1.07 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/Metal | 1990375 ns | 1096666 ns | 1.81 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/AMDGPU | 267987.5 ns | 269773 ns | 0.99 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7375 ns | 7167 ns | 1.03 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5959 ns | 5333 ns | 1.12 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6000 ns | 6000 ns | 1 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10209 ns | 10458 ns | 0.98 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 33200 ns | 32755 ns | 1.01 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 1323539 ns | 1178842 ns | 1.12 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 615604 ns | 330458 ns | 1.86 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 50040 ns | 50720 ns | 0.99 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 260750 ns | 253104 ns | 1.03 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 234833 ns | 229041.5 ns | 1.03 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 265125 ns | 234187.5 ns | 1.13 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 221333 ns | 227938 ns | 0.97 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 264591 ns | 263186.5 ns | 1.01 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 29454390 ns | 27448206 ns | 1.07 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 8466083 ns | 8237750 ns | 1.03 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 592630 ns | 594190.5 ns | 1.00 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 15750 ns | 13792 ns | 1.14 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 15667 ns | 15166 ns | 1.03 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 16167 ns | 16499.5 ns | 0.98 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 14541 ns | 14667 ns | 0.99 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 140225 ns | 139540 ns | 1.00 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI | 6115964 ns | 5436668.5 ns | 1.12 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal | 798333 ns | 786729 ns | 1.01 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 232492 ns | 232963 ns | 1.00 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 23708 ns | 23000 ns | 1.03 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 23479 ns | 23937.5 ns | 0.98 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 23562.5 ns | 23875 ns | 0.99 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 22667 ns | 23979.5 ns | 0.95 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 872247 ns | 870094.5 ns | 1.00 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 42738683 ns | 40010466.5 ns | 1.07 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal | 5646770.5 ns | 5595708 ns | 1.01 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 676987 ns | 679366 ns | 1.00 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 10041 ns | 8750 ns | 1.15 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 10187.5 ns | 10312.5 ns | 0.99 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 11666 ns | 11271 ns | 1.04 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 8792 ns | 9584 ns | 0.92 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 125357.5 ns | 123388.5 ns | 1.02 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI | 3857738.5 ns | 3563169 ns | 1.08 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal | 898625 ns | 858292 ns | 1.05 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU | 75221 ns | 74460 ns | 1.01 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 14000 ns | 13375 ns | 1.05 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 13812.5 ns | 14458.5 ns | 0.96 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 14062.5 ns | 13958 ns | 1.01 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 14292 ns | 13625 ns | 1.05 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 675390 ns | 667308 ns | 1.01 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI | 23526980.5 ns | 21257602 ns | 1.11 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal | 5359958.5 ns | 4997708 ns | 1.07 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 365113 ns | 365743 ns | 1.00 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 10292 ns | 8583 ns | 1.20 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 9646 ns | 10333 ns | 0.93 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 10958 ns | 10312.5 ns | 1.06 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 8542 ns | 9166 ns | 0.93 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 124246 ns | 121770.5 ns | 1.02 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI | 3650341 ns | 3365145.5 ns | 1.08 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal | 890042 ns | 906625 ns | 0.98 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU | 72050 ns | 75170 ns | 0.96 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 13084 ns | 12292 ns | 1.06 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 12896 ns | 13437.5 ns | 0.96 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 12542 ns | 12916 ns | 0.97 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 12667 ns | 12458 ns | 1.02 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 557269 ns | 553718.5 ns | 1.01 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI | 20940364 ns | 18868109 ns | 1.11 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal | 4415208 ns | 3865125.5 ns | 1.14 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 341913.5 ns | 341293 ns | 1.00 |
| batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) | 30438 ns | 26354.5 ns | 1.15 |
| batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) | 32771 ns | 30645.5 ns | 1.07 |
| batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) | 32145.5 ns | 31541 ns | 1.02 |
| batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) | 1875 ns | 1833 ns | 1.02 |
| batchedmm(2, Bsize=128)/forward/GPU/CUDA | 16382 ns | 16183 ns | 1.01 |
| batchedmm(2, Bsize=128)/forward/GPU/AMDGPU | 80651 ns | 81001 ns | 1.00 |
| batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) | 5375 ns | 5209 ns | 1.03 |
| batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) | 4937 ns | 5021 ns | 0.98 |
| batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) | 5208 ns | 5417 ns | 0.96 |
| batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) | 6292 ns | 6604 ns | 0.95 |
| batchedmm(2, Bsize=128)/zygote/GPU/CUDA | 141456.5 ns | 140577.5 ns | 1.01 |
| batchedmm(2, Bsize=128)/zygote/GPU/AMDGPU | 382544 ns | 370423.5 ns | 1.03 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s)
292
ns250
ns1.17
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s)
375
ns375
ns1
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s)
292
ns250
ns1.17
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s)
250
ns291
ns0.86
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA
26188
ns25697
ns1.02
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI
1349689
ns1197018
ns1.13
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal
455771
ns465667
ns0.98
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU
48850
ns47180
ns1.04
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
6583
ns6125
ns1.07
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
6375
ns6729
ns0.95
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
6250
ns6333
ns0.99
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
6250
ns6312.5
ns0.99
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA
190177
ns187721.5
ns1.01
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI
25715880
ns23736279.5
ns1.08
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal
5628084
ns4952833.5
ns1.14
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU
388664
ns386429
ns1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
2042
ns1959
ns1.04
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
2042
ns2042
ns1
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
2125
ns2000
ns1.06
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
1958
ns1959
ns1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA
26944
ns26463
ns1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI
1363088
ns1170027.5
ns1.17
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal
471437.5
ns479625
ns0.98
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU
205032
ns206252
ns0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
16958
ns16250
ns1.04
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
16250
ns16666
ns0.98
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
16749.5
ns16208.5
ns1.03
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
16250
ns16417
ns0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA
278717.5
ns276067
ns1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI
26543319
ns24921263
ns1.07
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal
6143666
ns5326083
ns1.15
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU
701356
ns700836
ns1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
193791
ns173875
ns1.11
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
174166.5
ns148750
ns1.17
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
151875
ns155708
ns0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
161458
ns147458
ns1.09
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
200117.5
ns203847
ns0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI
8677326
ns8347024.5
ns1.04
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal
1431250
ns1561917
ns0.92
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU
224822
ns232482
ns0.97
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
1332708
ns1328917
ns1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
1313042
ns1311771
ns1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
1321250
ns1320791
ns1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
1320542
ns1322500
ns1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
914262.5
ns909940.5
ns1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI
52072722
ns44667022
ns1.17
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal
6865145.5
ns7124333
ns0.96
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU
1099471
ns995559.5
ns1.10
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
25270.5
ns22958
ns1.10
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
25750
ns26833
ns0.96
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
28167
ns27625
ns1.02
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
24645.5
ns24667
ns1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
236681
ns234608.5
ns1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI
8520645.5
ns7924652
ns1.08
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal
960167
ns576541
ns1.67
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU
114711
ns116011
ns0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
128833.5
ns118166.5
ns1.09
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
184437.5
ns122375
ns1.51
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
126541.5
ns158041.5
ns0.80
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
117313
ns123833.5
ns0.95
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
1084581
ns1073695
ns1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI
48584064.5
ns44153968
ns1.10
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal
6244708
ns6127166
ns1.02
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU
609766
ns612925
ns0.99
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s)
334
ns250
ns1.34
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s)
375
ns375
ns1
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s)
292
ns291
ns1.00
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s)
250
ns250
ns1
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA
23179.5
ns23160
ns1.00
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI
1352649.5
ns1212472
ns1.12
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal
470375
ns478542
ns0.98
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU
47251
ns47471
ns1.00
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
6875
ns6291
ns1.09
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
6667
ns6833.5
ns0.98
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
6250
ns6458
ns0.97
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
6604
ns6584
ns1.00
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA
206812
ns204382.5
ns1.01
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI
26430531.5
ns24496787
ns1.08
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal
5939666
ns5334937.5
ns1.11
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU
393154
ns388703
ns1.01
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s)
6750
ns5208
ns1.30
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s)
6416.5
ns7021
ns0.91
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s)
7042
ns7458
ns0.94
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s)
6750
ns5667
ns1.19
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA
147041
ns145933.5
ns1.01
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI
6204224
ns5745568
ns1.08
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal
711062.5
ns753959
ns0.94
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU
232702
ns234802
ns0.99
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
10250
ns9583
ns1.07
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
9875
ns10375
ns0.95
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
10250
ns10125
ns1.01
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
9792
ns10042
ns0.98
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA
908474
ns903827
ns1.01
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI
42229280
ns42297357
ns1.00
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal
6135833
ns5826479
ns1.05
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU
665637
ns668457
ns1.00
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s)
667
ns667
ns1
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s)
667
ns709
ns0.94
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s)
667
ns625
ns1.07
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s)
625
ns625
ns1
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA
22806
ns22371
ns1.02
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/oneAPI
2183221
ns2015786
ns1.08
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/Metal
228667
ns208416
ns1.10
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/AMDGPU
206602
ns207552
ns1.00
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s)
4625
ns4584
ns1.01
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s)
4666
ns4833
ns0.97
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s)
4625
ns4666
ns0.99
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s)
4584
ns4584
ns1
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA
229835
ns228749
ns1.00
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/oneAPI
10794904
ns10461831
ns1.03
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/Metal
1685770.5
ns1654416.5
ns1.02
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/AMDGPU
577495
ns580735
ns0.99
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s)
9042
ns7750
ns1.17
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s)
9083.5
ns9166.5
ns0.99
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s)
9354
ns8834
ns1.06
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s)
7834
ns8291
ns0.94
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA
124219
ns121959
ns1.02
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI
3899985
ns3411255
ns1.14
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal
810375
ns827916
ns0.98
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU
74040.5
ns74011
ns1.00
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
9000
ns8625
ns1.04
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
8291
ns9041.5
ns0.92
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
8750
ns8583.5
ns1.02
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
8375
ns8375
ns1
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA
596441
ns591884.5
ns1.01
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI
23179785
ns20708574.5
ns1.12
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal
4819896
ns4264875
ns1.13
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU
338953
ns342784
ns0.99
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s)
127000
ns122750
ns1.03
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s)
131000
ns96459
ns1.36
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s)
129584
ns130187.5
ns1.00
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s)
180958.5
ns180875
ns1.00
batchedmm(128, Bsize=4)/forward/GPU/CUDA
46329
ns45830
ns1.01
batchedmm(128, Bsize=4)/forward/GPU/AMDGPU
104561
ns101721
ns1.03
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s)
341167
ns328000
ns1.04
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s)
333583
ns166666
ns2.00
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s)
325333
ns347541.5
ns0.94
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s)
588354
ns608646
ns0.97
batchedmm(128, Bsize=4)/zygote/GPU/CUDA
194256.5
ns192063
ns1.01
batchedmm(128, Bsize=4)/zygote/GPU/AMDGPU
512055
ns505519.5
ns1.01
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s)
399208
ns395916
ns1.01
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s)
288166.5
ns214250
ns1.35
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s)
287875
ns288167
ns1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s)
755750
ns756500
ns1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA
43515
ns43676.5
ns1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/oneAPI
1420150
ns1411321
ns1.01
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/Metal
420292
ns429792
ns0.98
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/AMDGPU
81701
ns82131
ns0.99
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s)
1396437
ns1458834
ns0.96
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s)
1134500
ns857583
ns1.32
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s)
1133416.5
ns1134333
ns1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s)
2443791.5
ns2441958.5
ns1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA
250930
ns249859
ns1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/oneAPI
12447603
ns10370982
ns1.20
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/Metal
1797500
ns1909646
ns0.94
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/AMDGPU
352383.5
ns352903
ns1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
658917
ns616500
ns1.07
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
647083.5
ns598250
ns1.08
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
625729
ns648916.5
ns0.96
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
629562.5
ns642667
ns0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA
202467
ns200586.5
ns1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI
9193261
ns7794534
ns1.18
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal
1344749.5
ns1363291
ns0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU
311273
ns313733
ns0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
2486625
ns2445375
ns1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
2447229
ns2426917
ns1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
2446229
ns2441500
ns1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
2455167
ns2440750
ns1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
999287
ns994961
ns1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI
61254580
ns50766350
ns1.21
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal
10164208
ns9661291
ns1.05
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU
1302412
ns1307388
ns1.00
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s)
33437.5
ns28521
ns1.17
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s)
35145.5
ns34625
ns1.02
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s)
33896
ns33916.5
ns1.00
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s)
875
ns875
ns1
batchedmm(2, Bsize=32)/forward/GPU/CUDA
15909
ns15425.5
ns1.03
batchedmm(2, Bsize=32)/forward/GPU/AMDGPU
84991
ns79381
ns1.07
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s)
3250
ns3062.5
ns1.06
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s)
3083.5
ns3416
ns0.90
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s)
3333
ns3208
ns1.04
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s)
3041
ns3209
ns0.95
batchedmm(2, Bsize=32)/zygote/GPU/CUDA
139820.5
ns139741
ns1.00
batchedmm(2, Bsize=32)/zygote/GPU/AMDGPU
335653
ns338953
ns0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
409291
ns404500
ns1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
408167
ns402125
ns1.02
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
408916
ns408334
ns1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
420042
ns422458
ns0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA
43861
ns43145
ns1.02
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI
1610692
ns1417291
ns1.14
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal
1146937.5
ns1128750.5
ns1.02
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU
241802
ns239562
ns1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
3890500
ns3863292
ns1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
3991792
ns3971625
ns1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
3995938
ns3996791
ns1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
3777541.5
ns3757979.5
ns1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
245384
ns242826
ns1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI
40053105
ns38623864
ns1.04
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal
11890208
ns11673750
ns1.02
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU
1427303
ns1433229
ns1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s)
3917
ns3959
ns0.99
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s)
3958
ns3917
ns1.01
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s)
3917
ns3916
ns1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s)
3875
ns3917
ns0.99
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA
33956
ns33968
ns1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/oneAPI
1415999
ns1232483
ns1.15
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/Metal
180646
ns167334
ns1.08
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/AMDGPU
39530
ns38620
ns1.02
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s)
15583
ns15666
ns0.99
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s)
15708
ns15750
ns1.00
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s)
15708
ns15625
ns1.01
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s)
15625
ns15625
ns1
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA
256980
ns255128
ns1.01
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/oneAPI
9741901
ns8717525
ns1.12
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/Metal
867771
ns843520.5
ns1.03
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/AMDGPU
177356.5
ns169816.5
ns1.04
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s)
403959
ns402625
ns1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s)
295875
ns220209
ns1.34
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s)
295292
ns295959
ns1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s)
760750
ns760791.5
ns1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA
113403.5
ns113239
ns1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/oneAPI
1056307
ns1047524
ns1.01
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/Metal
458041
ns348895.5
ns1.31
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/AMDGPU
89041
ns89300.5
ns1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s)
1445458
ns1474958.5
ns0.98
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s)
1158000
ns881146
ns1.31
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s)
1156604
ns1159083.5
ns1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s)
2464729.5
ns2461917
ns1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA
241604
ns241292
ns1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/oneAPI
12919628
ns9318727.5
ns1.39
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/Metal
1936541.5
ns1946459
ns0.99
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/AMDGPU
353843
ns354883
ns1.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s)
583
ns500
ns1.17
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s)
584
ns542
ns1.08
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s)
583
ns500
ns1.17
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s)
459
ns500
ns0.92
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA
26174
ns25844
ns1.01
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI
1343237.5
ns1200537.5
ns1.12
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal
430334
ns496709
ns0.87
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU
209062
ns209382
ns1.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
7875
ns7375
ns1.07
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
7708
ns8104.5
ns0.95
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
7625
ns7500
ns1.02
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
7250
ns7375
ns0.98
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA
214822.5
ns217033.5
ns0.99
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI
28436000
ns25754399
ns1.10
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal
5825750
ns5254333.5
ns1.11
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU
684816
ns685977
ns1.00
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s)
836604
ns825125.5
ns1.01
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s)
618875
ns468584
ns1.32
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s)
620167
ns621500
ns1.00
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s)
1552792
ns1536542
ns1.01
batchedmm(128, Bsize=32)/forward/GPU/CUDA
130046
ns130845.5
ns0.99
batchedmm(128, Bsize=32)/forward/GPU/AMDGPU
229912
ns229862
ns1.00
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s)
2694187.5
ns2661979
ns1.01
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s)
2000104.5
ns1535250.5
ns1.30
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s)
1999042
ns2000792
ns1.00
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s)
4936792
ns4906416
ns1.01
batchedmm(128, Bsize=32)/zygote/GPU/CUDA
251857
ns242304
ns1.04
batchedmm(128, Bsize=32)/zygote/GPU/AMDGPU
837543
ns841449
ns1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s)
375
ns292
ns1.28
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s)
334
ns375
ns0.89
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s)
291
ns250
ns1.16
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s)
291
ns291
ns1
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA
32688
ns32216
ns1.01
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI
1331487
ns1218492
ns1.09
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal
447625
ns464375
ns0.96
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU
46711
ns47630
ns0.98
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s)
6666
ns6125
ns1.09
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s)
6458
ns6708
ns0.96
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s)
6208
ns6500
ns0.96
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s)
6417
ns6375
ns1.01
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA
232857
ns224154.5
ns1.04
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI
24854567
ns21407773
ns1.16
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal
5311167
ns4615291
ns1.15
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU
359813.5
ns357793.5
ns1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
2405750
ns2392708
ns1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
2416666
ns2371959
ns1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
2377375
ns2404416
ns0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
2392666
ns2370084
ns1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
201638
ns200035.5
ns1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI
8402298
ns7868335
ns1.07
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal
1416500
ns1597041.5
ns0.89
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU
372683.5
ns373933
ns1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
4654167
ns4648292
ns1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
4665479
ns4644250
ns1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
4644229.5
ns4636708
ns1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
4648583
ns4642750
ns1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
902404.5
ns891890
ns1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI
52065462
ns46027858
ns1.13
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal
6861875
ns6938541.5
ns0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU
1391004
ns1391633
ns1.00
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s)
6708.5
ns7187.5
ns0.93
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s)
7208
ns7542
ns0.96
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s)
7645.5
ns7125
ns1.07
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s)
13396
ns6875
ns1.95
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA
23661
ns23289
ns1.02
bias_activation(512, act=relu)(512 x 128)/forward/GPU/oneAPI
1330674.5
ns1167669
ns1.14
bias_activation(512, act=relu)(512 x 128)/forward/GPU/Metal
266208
ns243458.5
ns1.09
bias_activation(512, act=relu)(512 x 128)/forward/GPU/AMDGPU
39961
ns39800
ns1.00
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s)
51604
ns46396.5
ns1.11
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s)
49000
ns32917
ns1.49
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s)
45750
ns45875.5
ns1.00
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s)
45375
ns67312
ns0.67
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA
218958
ns214725
ns1.02
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/oneAPI
11575244
ns10485830
ns1.10
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/Metal
2067250
ns1121562
ns1.84
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/AMDGPU
264843
ns269102.5
ns0.98
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s)
21396
ns19604.5
ns1.09
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s)
25667
ns24021
ns1.07
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s)
24249.5
ns23750
ns1.02
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s)
7375
ns5084
ns1.45
batchedmm(2, Bsize=512)/forward/GPU/CUDA
17124
ns17227
ns0.99
batchedmm(2, Bsize=512)/forward/GPU/AMDGPU
84151
ns83741
ns1.00
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s)
12229
ns11916
ns1.03
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s)
10687
ns9354.5
ns1.14
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s)
10229
ns10417
ns0.98
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s)
17792
ns17958
ns0.99
batchedmm(2, Bsize=512)/zygote/GPU/CUDA
229557
ns225890
ns1.02
batchedmm(2, Bsize=512)/zygote/GPU/AMDGPU
371578.5
ns371753
ns1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s)
406750
ns404000
ns1.01
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s)
297125
ns222584
ns1.33
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s)
296834
ns296875
ns1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s)
762417
ns762667
ns1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA
46955
ns46288
ns1.01
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/oneAPI
1453711
ns1401617.5
ns1.04
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/Metal
484187.5
ns358375
ns1.35
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/AMDGPU
88881
ns89491
ns0.99
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s)
1431645.5
ns1480896
ns0.97
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s)
1166209
ns888250
ns1.31
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s)
1164750
ns1164959
ns1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s)
2471229
ns2465417
ns1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA
294082.5
ns288016
ns1.02
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/oneAPI
12353848
ns12678894
ns0.97
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/Metal
2093020.5
ns2117375
ns0.99
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/AMDGPU
380814
ns381744
ns1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
434500
ns432125
ns1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
437125
ns430333
ns1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
437250
ns436917
ns1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
447542
ns448604.5
ns1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
54894
ns54122.5
ns1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI
1083139
ns1002212
ns1.08
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal
1087416
ns1059021
ns1.03
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU
233642
ns234952
ns0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
3902292
ns3895042
ns1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
4012625
ns4004458
ns1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
4016541
ns4030291.5
ns1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
3808250
ns3789979
ns1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
266487.5
ns260055
ns1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI
35233900
ns30675954
ns1.15
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal
10616978.5
ns10349458.5
ns1.03
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU
1364063
ns1223712
ns1.11
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 8750 ns | 8750 ns | 1 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 7667 ns | 6917 ns | 1.11 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 7667 ns | 7583 ns | 1.01 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 12375 ns | 12416 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA | 24395 ns | 23553.5 ns | 1.04 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/oneAPI | 2388137.5 ns | 2134096 ns | 1.12 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/Metal | 229041 ns | 214667 ns | 1.07 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/AMDGPU | 209122 ns | 211142 ns | 0.99 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 44875 ns | 44958 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 45000 ns | 45083 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 45000 ns | 45000 ns | 1 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 45292 ns | 44958 ns | 1.01 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA | 350021 ns | 344550 ns | 1.02 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/oneAPI | 14581645 ns | 14001329.5 ns | 1.04 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/Metal | 1777208 ns | 1862458 ns | 0.95 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/AMDGPU | 655627 ns | 659011.5 ns | 0.99 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 124000 ns | 122729 ns | 1.01 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 96270.5 ns | 83521 ns | 1.15 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 86562.5 ns | 87354.5 ns | 0.99 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 86958.5 ns | 105375 ns | 0.83 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 189446 ns | 190055 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 6078785 ns | 5969481 ns | 1.02 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1983729 ns | 1972791.5 ns | 1.01 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 221122 ns | 214447 ns | 1.03 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2025375 ns | 2012458.5 ns | 1.01 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2011792 ns | 1980000 ns | 1.02 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2010229 ns | 2023917 ns | 0.99 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2013666.5 ns | 2011645.5 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 536819 ns | 529776 ns | 1.01 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 29198754 ns | 29142428 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 9376375 ns | 9305500.5 ns | 1.01 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 967839 ns | 1088680 ns | 0.89 |
This comment was automatically generated by workflow using github-action-benchmark.