This repository has been archived by the owner on Nov 4, 2024. It is now read-only.
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Showing 2 changed files with 4 additions and 1 deletion.
7ba127a
@JuliaRegistrator register
7ba127a
Registration pull request created: JuliaRegistries/General/115248
Tip: Release Notes
Did you know you can add release notes too? Add Markdown-formatted text underneath the comment, after the text "Release notes:", and it will be added to the registry PR. If TagBot is installed, it will also be added to the release that TagBot creates.
To add them here, just re-invoke and the PR will be updated.
Tagging
After the above pull request is merged, it is recommended that a tag is created on this repository for the registered package version.
This will be done automatically if the Julia TagBot GitHub Action is installed. Otherwise it can be done manually through the GitHub interface or with git on the command line.
7ba127a
LuxLib Benchmarks
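Each benchmark entry below lists two timings in nanoseconds followed by a ratio. The following is a minimal Python sketch of how the ratio relates to the two timings, using three rows copied from the listing. It assumes the first timing belongs to the current commit and the second to the previous baseline; the text does not state this, and the helper names `ratio` and `direction` are hypothetical, introduced only for illustration.

```python
# Rows copied from the listing: (name, first timing in ns, second timing in ns, reported ratio).
ROWS = [
    ("layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s)", 4667, 6083, 0.77),
    ("layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal", 3008750, 817500, 3.68),
    ("groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal", 477833, 1274334, 0.37),
]


def ratio(first_ns: float, second_ns: float) -> float:
    """Divide the first timing by the second and round to two decimals, as in the listing."""
    return round(first_ns / second_ns, 2)


def direction(r: float) -> str:
    # Assumption: first timing = current commit, second timing = previous baseline,
    # so a ratio above 1 means the current commit took longer.
    if r > 1:
        return "slower"
    if r < 1:
        return "faster"
    return "unchanged"


for name, first_ns, second_ns, reported in ROWS:
    computed = ratio(first_ns, second_ns)
    print(f"{name}: computed {computed:.2f}, reported {reported:.2f}, {direction(computed)}")
```

Under that assumption, a ratio well above 1 (such as 3.68) marks an entry worth a second look, while ratios close to 1 are likely within run-to-run noise for timings this small.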
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s)
4667
ns6083
ns0.77
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s)
6666.5
ns6250
ns1.07
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s)
7500
ns8104
ns0.93
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s)
5750
ns5333
ns1.08
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA
117321
ns127763
ns0.92
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI
2723919
ns2680722
ns1.02
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal
3008750
ns817500
ns3.68
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU
404195
ns410844
ns0.98
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s)
9896
ns9771
ns1.01
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s)
9833
ns9958
ns0.99
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s)
9979
ns9834
ns1.01
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s)
9958.5
ns9958
ns1.00
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA
533872
ns539870
ns0.99
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI
18512917
ns18273784
ns1.01
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal
2324292
ns2523292
ns0.92
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU
674968
ns669947
ns1.01
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s)
1437.5
ns2812.5
ns0.51
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s)
2875
ns1416
ns2.03
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s)
2083
ns1584
ns1.32
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s)
1437.5
ns1333
ns1.08
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA
21479
ns21455
ns1.00
bias_activation(32, act=relu)(32 x 128)/forward/GPU/oneAPI
1282166
ns1323661
ns0.97
bias_activation(32, act=relu)(32 x 128)/forward/GPU/Metal
190209
ns216625
ns0.88
bias_activation(32, act=relu)(32 x 128)/forward/GPU/AMDGPU
29540
ns28950
ns1.02
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s)
4250
ns4458
ns0.95
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s)
4167
ns3375
ns1.23
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s)
4145.5
ns4167
ns0.99
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s)
4375
ns4000
ns1.09
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA
144438.5
ns142970.5
ns1.01
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/oneAPI
9108147.5
ns10240879
ns0.89
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/Metal
1604875
ns1524333
ns1.05
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/AMDGPU
145092
ns149491.5
ns0.97
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
55875
ns57833
ns0.97
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
39209
ns40417
ns0.97
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
46625
ns46375
ns1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
84167
ns83000
ns1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
36824
ns36725
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI
542002
ns558408
ns0.97
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal
1333104
ns1040458
ns1.28
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU
81391
ns81776
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
2024917
ns2036667
ns0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
2079125
ns2086500
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
2081625
ns2090375
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
1993125
ns1993667
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
226688
ns226490
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI
7623752
ns7533597
ns1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal
7427958
ns8034167
ns0.92
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU
1252074
ns986919
ns1.27
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
174750
ns146666
ns1.19
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
164541.5
ns151000
ns1.09
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
148812.5
ns151062.5
ns0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
144375
ns194750
ns0.74
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
165480
ns166182
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI
7680925
ns7689190
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal
1457521
ns1596770.5
ns0.91
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU
204852
ns209312
ns0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
1117250
ns1113896
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
1109375.5
ns1120062.5
ns0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
1113334
ns1119104
ns0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
1112187.5
ns1106542
ns1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
694582
ns695636.5
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI
33705507.5
ns34400023
ns0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal
6238375
ns7210396
ns0.87
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU
1026961
ns1024730
ns1.00
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s)
4417
ns5291
ns0.83
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s)
5041
ns4916
ns1.03
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s)
5208
ns6125
ns0.85
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s)
4583
ns4375
ns1.05
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA
93299.5
ns91792.5
ns1.02
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI
5368327
ns5267805
ns1.02
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal
634041.5
ns474000
ns1.34
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU
69460
ns67381
ns1.03
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
8375
ns8750
ns0.96
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
8542
ns8917
ns0.96
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
8833
ns8792
ns1.00
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
8833
ns8687.5
ns1.02
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA
604485
ns600359
ns1.01
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI
36365543
ns36489972
ns1.00
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal
5669937.5
ns5930125
ns0.96
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU
388374
ns390114
ns1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
17000
ns17562.5
ns0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
17709
ns17979
ns0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
18021
ns20812.5
ns0.87
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
16895.5
ns17750
ns0.95
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
66654.5
ns66076.5
ns1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI
2923981.5
ns3263389.5
ns0.90
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal
477833
ns1274334
ns0.37
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU
78451
ns76030
ns1.03
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
216834
ns212792
ns1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
219896
ns213000
ns1.03
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
225583.5
ns218292
ns1.03
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
217625
ns254395.5
ns0.86
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
356473
ns351925
ns1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI
14201022
ns15484392
ns0.92
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal
5644395.5
ns5673084
ns0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU
465005
ns468334.5
ns0.99
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s)
667
ns625
ns1.07
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s)
750
ns708
ns1.06
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s)
812.5
ns770.5
ns1.05
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s)
625
ns666
ns0.94
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA
20462
ns20050
ns1.02
bias_activation(2, act=relu)(2 x 128)/forward/GPU/oneAPI
1162134.5
ns1150135.5
ns1.01
bias_activation(2, act=relu)(2 x 128)/forward/GPU/Metal
302625
ns295625
ns1.02
bias_activation(2, act=relu)(2 x 128)/forward/GPU/AMDGPU
32870
ns32420
ns1.01
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s)
1417
ns1459
ns0.97
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s)
1458
ns1520.5
ns0.96
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s)
1417
ns1459
ns0.97
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s)
1416
ns1500
ns0.94
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA
125127
ns122512.5
ns1.02
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/oneAPI
8831211
ns8913698.5
ns0.99
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/Metal
1526500
ns1644687.5
ns0.93
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/AMDGPU
136521
ns135591
ns1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
7208
ns7334
ns0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
5416
ns5417
ns1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
6125
ns6042
ns1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
10666
ns10250
ns1.04
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA
23625
ns23888.5
ns0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI
1207481
ns1207370.5
ns1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal
356458
ns446750
ns0.80
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU
48881
ns47420
ns1.03
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
226166
ns236834
ns0.95
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
265333
ns241875
ns1.10
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
234854
ns269875
ns0.87
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
219500
ns257687.5
ns0.85
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
192027
ns191906.5
ns1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI
31211143.5
ns32212683
ns0.97
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal
9046313
ns8558250.5
ns1.06
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU
649247
ns645121
ns1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s)
4125
ns4083
ns1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s)
4083
ns4083
ns1
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s)
4084
ns4042
ns1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s)
4083
ns4125
ns0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA
23477
ns23307
ns1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/oneAPI
2001417
ns2000762.5
ns1.00
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/Metal
214833
ns223875
ns0.96
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/AMDGPU
47261
ns48080
ns0.98
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s)
17083
ns16792
ns1.02
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s)
17000
ns16625
ns1.02
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s)
16833
ns16792
ns1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s)
17334
ns16917
ns1.02
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA
195303
ns191629
ns1.02
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/oneAPI
14536946
ns10282963
ns1.41
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/Metal
918208
ns937125
ns0.98
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/AMDGPU
174652
ns176282
ns0.99
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s)
508750
ns509292
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s)
330583
ns332354.5
ns0.99
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s)
404666
ns404834
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s)
864791
ns865333
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA
113620
ns113483
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/oneAPI
401393
ns392476
ns1.02
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/Metal
490979
ns487333
ns1.01
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/AMDGPU
242133
ns240773
ns1.01
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s)
2313834
ns2308770.5
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s)
1747479
ns1756875
ns0.99
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s)
2035208
ns2033625
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s)
3272708.5
ns3270500
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA
241207
ns237569
ns1.02
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/oneAPI
10021457.5
ns11006777.5
ns0.91
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/Metal
2011770.5
ns2028666.5
ns0.99
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/AMDGPU
743443
ns739942
ns1.00
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s)
4708.5
ns6062.5
ns0.78
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s)
7625
ns6584
ns1.16
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s)
7708
ns8208.5
ns0.94
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s)
5479.5
ns6875
ns0.80
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA
92351.5
ns91839.5
ns1.01
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI
5442998
ns5704966
ns0.95
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal
783479
ns776250
ns1.01
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU
65411
ns65360
ns1.00
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
10333.5
ns11041.5
ns0.94
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
11875
ns11875
ns1
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
11750
ns11125
ns1.06
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
12062.5
ns12187.5
ns0.99
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA
634956
ns637048
ns1.00
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI
40400531.5
ns37465688
ns1.08
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal
5457291.5
ns5651896.5
ns0.97
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU
409979.5
ns408644
ns1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s)
541
ns500
ns1.08
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s)
583
ns542
ns1.08
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s)
500
ns542
ns0.92
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s)
500
ns541
ns0.92
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA
23181
ns22899
ns1.01
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/oneAPI
2216579
ns1980954
ns1.12
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/Metal
332584
ns214375
ns1.55
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/AMDGPU
47221
ns49101
ns0.96
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s)
2166
ns2084
ns1.04
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s)
2167
ns2083
ns1.04
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s)
2084
ns2208
ns0.94
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s)
2084
ns2125
ns0.98
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA
215755
ns228216
ns0.95
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/oneAPI
11357397.5
ns11133138.5
ns1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/Metal
1978417
ns2019750
ns0.98
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/AMDGPU
172626.5
ns180086.5
ns0.96
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
8937.5
ns9083
ns0.98
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
9729.5
ns8500
ns1.14
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
9459
ns10833.5
ns0.87
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
8958
ns8542
ns1.05
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA
96639
ns108383.5
ns0.89
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI
3207607
ns3207332
ns1.00
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal
876000
ns816208
ns1.07
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU
71941
ns74171
ns0.97
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
18521
ns16875
ns1.10
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
19104.5
ns18792
ns1.02
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
17625
ns18250
ns0.97
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
18812.5
ns17812.5
ns1.06
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA
554001
ns615805
ns0.90
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI
16517942.5
ns16767446
ns0.99
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal
5180916.5
ns5170312.5
ns1.00
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU
378539
ns383838.5
ns0.99
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s)
458
ns500
ns0.92
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s)
625
ns500
ns1.25
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s)
666
ns666
ns1
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s)
500
ns500
ns1
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA
35213
ns35553
ns0.99
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI
1186873
ns1192710.5
ns1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal
466396
ns293146
ns1.59
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU
46270
ns46141
ns1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
9312.5
ns8541.5
ns1.09
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
9916.5
ns8541
ns1.16
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
9167
ns9958.5
ns0.92
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
9458.5
ns9458.5
ns1
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA
267136
ns264293
ns1.01
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI
18948901
ns18241947
ns1.04
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal
4572250
ns5274687.5
ns0.87
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU
367694
ns366223
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s)
395333
ns396958
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s)
214416
ns215500
ns0.99
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s)
288292
ns287792
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s)
756291
ns755333
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA
111882
ns110939.5
ns1.01
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/oneAPI
329474.5
ns326929
ns1.01
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/Metal
300208.5
ns365521
ns0.82
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/AMDGPU
77331
ns74351
ns1.04
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s)
1453791.5
ns1446854
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s)
852583
ns859125
ns0.99
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s)
1132645.5
ns1132854
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s)
2440625
ns2436292
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA
207032
ns204467
ns1.01
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/oneAPI
10204120
ns8967194.5
ns1.14
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/Metal
1668041.5
ns1574375
ns1.06
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/AMDGPU
324428.5
ns321063
ns1.01
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
7041.5
ns7187.5
ns0.98
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
7750
ns7270.5
ns1.07
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
9396
ns8541.5
ns1.10
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
7791.5
ns6979.5
ns1.12
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA
144806.5
ns145872
ns0.99
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI
5813106.5
ns5766375
ns1.01
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal
437250
ns448125
ns0.98
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU
66071
ns65611
ns1.01
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
13083
ns14770.5
ns0.89
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
14479
ns16916.5
ns0.86
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
15709
ns15687.5
ns1.00
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
15354.5
ns15562.5
ns0.99
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA
956377
ns956937.5
ns1.00
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI
42729213
ns42931711
ns1.00
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal
5700250
ns6186333
ns0.92
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU
428955
ns421904
ns1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
24000
ns25292
ns0.95
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
24875
ns25292
ns0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
29292
ns28583.5
ns1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
27667
ns30125
ns0.92
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
199144
ns198270.5
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI
7744284
ns7924119
ns0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal
999584
ns654625
ns1.53
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU
116931
ns113131
ns1.03
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
103583
ns157000
ns0.66
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
152687
ns118479
ns1.29
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
153583
ns118792
ns1.29
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
151000
ns145083.5
ns1.04
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
1075746
ns1072793
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI
43042130
ns41512479
ns1.04
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal
5733792
ns5879750
ns0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU
590946.5
ns587055
ns1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
75000
ns76417
ns0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
77084
ns74917
ns1.03
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
86333.5
ns80458
ns1.07
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
74875
ns82834
ns0.90
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA
205585
ns204563.5
ns1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI
8027595.5
ns7289524
ns1.10
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal
519187.5
ns532021
ns0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU
127562
ns126591
ns1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
293542
ns263209
ns1.12
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
308750
ns316562
ns0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
315187.5
ns248479.5
ns1.27
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
304208
ns210125
ns1.45
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
1108118
ns1111658.5
ns1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI
40422383
ns39831914
ns1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal
6276458
ns6266000
ns1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU
695017
ns691997
ns1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
15875
ns16771
ns0.95
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
17521
ns16791.5
ns1.04
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
18500
ns17542
ns1.05
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
16958
ns16750
ns1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA
146489
ns144759.5
ns1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI
5586208
ns5606829
ns1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal
723083.5
ns474208
ns1.52
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU
232683
ns232022
ns1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
26667
ns26895.5
ns0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
26687.5
ns25167
ns1.06
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
28208.5
ns27333
ns1.03
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
27708.5
ns24167
ns1.15
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA
982068.5
ns972458
ns1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI
40344043
ns41939896
ns0.96
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal
5743229
ns6295958
ns0.91
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU
686807.5
ns695306.5
ns0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
11083
ns11209
ns0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
12042
ns11333.5
ns1.06
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
12334
ns12416.5
ns0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
10791
ns11042
ns0.98
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA
124134
ns122668.5
ns1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI
3473152
ns3386989.5
ns1.03
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal
880000
ns858500
ns1.03
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU
234213
ns233942
ns1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
21958
ns21584
ns1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
22729.5
ns22563
ns1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
21895.5
ns22583
ns0.97
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
22000
ns21291
ns1.03
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA
701831.5
ns697229
ns1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI
21157140
ns21507216
ns0.98
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal
5204750
ns5485375
ns0.95
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU
674667
ns669687
ns1.01
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 63437.5 ns | 63104 ns | 1.01 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 65521 ns | 66479 ns | 0.99 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 66750 ns | 66584 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 63042 ns | 64208.5 ns | 0.98 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 106345.5 ns | 105012.5 ns | 1.01 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 3373870 ns | 3348443 ns | 1.01 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 480667 ns | 1297624.5 ns | 0.37 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 233433 ns | 232172 ns | 1.01 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 437896 ns | 440625 ns | 0.99 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 456000 ns | 448937.5 ns | 1.02 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 450542 ns | 440917 ns | 1.02 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 444000 ns | 438250 ns | 1.01 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 515188 ns | 511759 ns | 1.01 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 21597008 ns | 20624860 ns | 1.05 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6095791.5 ns | 5921625 ns | 1.03 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 717017.5 ns | 713498 ns | 1.00 |
| layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 6792 ns | 7521 ns | 0.90 |
| layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 8000 ns | 8084 ns | 0.99 |
| layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 8583.5 ns | 8667 ns | 0.99 |
| layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 6917 ns | 7750 ns | 0.89 |
| layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 146052.5 ns | 143457 ns | 1.02 |
| layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI | 5510181.5 ns | 5597779 ns | 0.98 |
| layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal | 726500 ns | 446771 ns | 1.63 |
| layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU | 65301 ns | 64960 ns | 1.01 |
| layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 14292 ns | 14875 ns | 0.96 |
| layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 15292 ns | 15709 ns | 0.97 |
| layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 14084 ns | 16542 ns | 0.85 |
| layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 16209 ns | 15541.5 ns | 1.04 |
| layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 947670 ns | 938762 ns | 1.01 |
| layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI | 39845105 ns | 39706040 ns | 1.00 |
| layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal | 5499875 ns | 5775541 ns | 0.95 |
| layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 399764 ns | 398045 ns | 1.00 |
| batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) | 6131500 ns | 6154854 ns | 1.00 |
| batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) | 3224875 ns | 3224917 ns | 1.00 |
| batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) | 6379229.5 ns | 6376292 ns | 1.00 |
| batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) | 11911084 ns | 11902583 ns | 1.00 |
| batchedmm(512, Bsize=4)/forward/GPU/CUDA | 349856 ns | 347379 ns | 1.01 |
| batchedmm(512, Bsize=4)/forward/GPU/AMDGPU | 303248 ns | 297978.5 ns | 1.02 |
| batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) | 19059708.5 ns | 19104063 ns | 1.00 |
| batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) | 11090437.5 ns | 11143020.5 ns | 1.00 |
| batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) | 20005646 ns | 19964417 ns | 1.00 |
| batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) | 36446770.5 ns | 36518125 ns | 1.00 |
| batchedmm(512, Bsize=4)/zygote/GPU/CUDA | 1081781.5 ns | 1020967.5 ns | 1.06 |
| batchedmm(512, Bsize=4)/zygote/GPU/AMDGPU | 1153782 ns | 1158972 ns | 1.00 |
| dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 958 ns | 958 ns | 1 |
| dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 1000 ns | 1000 ns | 1 |
| dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 958 ns | 1000 ns | 0.96 |
| dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 917 ns | 958 ns | 0.96 |
| dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA | 23071 ns | 22897 ns | 1.01 |
| dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/oneAPI | 2085318 ns | 2091957.5 ns | 1.00 |
| dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/Metal | 332541.5 ns | 232500 ns | 1.43 |
| dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/AMDGPU | 207622 ns | 206842 ns | 1.00 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 3667 ns | 3708 ns | 0.99 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 3750 ns | 3709 ns | 1.01 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 3708 ns | 3792 ns | 0.98 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 3667 ns | 3667 ns | 1 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA | 281551.5 ns | 277378 ns | 1.02 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/oneAPI | 12095727 ns | 11186074 ns | 1.08 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/Metal | 2129583 ns | 2130584 ns | 1.00 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/AMDGPU | 626307 ns | 626357 ns | 1.00 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 8042 ns | 7750 ns | 1.04 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 8145.5 ns | 7937.5 ns | 1.03 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 9042 ns | 9771 ns | 0.93 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 7937.5 ns | 7437.5 ns | 1.07 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 121104 ns | 119515 ns | 1.01 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI | 3679976 ns | 3487658 ns | 1.06 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal | 802541.5 ns | 816562.5 ns | 0.98 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 65471 ns | 65701 ns | 1.00 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 13125 ns | 11208 ns | 1.17 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 12875 ns | 13416.5 ns | 0.96 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 11417 ns | 12834 ns | 0.89 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 12708 ns | 11584 ns | 1.10 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 638151 ns | 631148 ns | 1.01 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 22685670 ns | 21438278 ns | 1.06 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal | 4390333 ns | 5005375 ns | 0.88 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 355644 ns | 354774 ns | 1.00 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 292 ns | 291 ns | 1.00 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 333 ns | 333 ns | 1 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 291 ns | 292 ns | 1.00 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA | 22337 ns | 22106 ns | 1.01 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/oneAPI | 2195388.5 ns | 2144977 ns | 1.02 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/Metal | 207833 ns | 226937 ns | 0.92 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/AMDGPU | 47401 ns | 46510 ns | 1.02 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 3042 ns | 2875 ns | 1.06 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 3375 ns | 3000 ns | 1.13 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 2916 ns | 2917 ns | 1.00 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 3333 ns | 2958 ns | 1.13 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA | 204047 ns | 199810.5 ns | 1.02 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/oneAPI | 14763707.5 ns | 9182273 ns | 1.61 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/Metal | 1611395.5 ns | 1664167 ns | 0.97 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/AMDGPU | 157641.5 ns | 161676.5 ns | 0.98 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 10250 ns | 11625 ns | 0.88 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 12167 ns | 11979 ns | 1.02 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 12187.5 ns | 13333 ns | 0.91 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 10604 ns | 11604.5 ns | 0.91 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 121713.5 ns | 120755 ns | 1.01 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI | 3281210 ns | 3560641 ns | 0.92 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal | 904791.5 ns | 1031500 ns | 0.88 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 233512.5 ns | 233163 ns | 1.00 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 21104.5 ns | 20687.5 ns | 1.02 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 22583 ns | 20583 ns | 1.10 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 21083 ns | 23000 ns | 0.92 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 21708 ns | 20541.5 ns | 1.06 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 595173 ns | 590597 ns | 1.01 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 20531194.5 ns | 20721086 ns | 0.99 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal | 4095583 ns | 4786083 ns | 0.86 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 638246.5 ns | 646557 ns | 0.99 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 4417 ns | 4375 ns | 1.01 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 4375 ns | 4417 ns | 0.99 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 4375 ns | 4375 ns | 1 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 4417 ns | 4417 ns | 1 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA | 24193.5 ns | 23934 ns | 1.01 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/oneAPI | 2211530 ns | 2235095.5 ns | 0.99 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/Metal | 215041 ns | 221479.5 ns | 0.97 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/AMDGPU | 47690 ns | 47181 ns | 1.01 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 16292 ns | 16667 ns | 0.98 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 16291 ns | 16541 ns | 0.98 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 16667 ns | 16709 ns | 1.00 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 16416 ns | 16708 ns | 0.98 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA | 330020.5 ns | 326329 ns | 1.01 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/oneAPI | 12280627 ns | 12543391.5 ns | 0.98 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/Metal | 1639709 ns | 1672458 ns | 0.98 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/AMDGPU | 206457.5 ns | 204152 ns | 1.01 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 1917 ns | 2084 ns | 0.92 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 2167 ns | 2125 ns | 1.02 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 2084 ns | 2083 ns | 1.00 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 2084 ns | 1958 ns | 1.06 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 35891 ns | 35852 ns | 1.00 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI | 1213015 ns | 1224950 ns | 0.99 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal | 474917 ns | 293583 ns | 1.62 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 204052 ns | 203142 ns | 1.00 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 19687.5 ns | 18208 ns | 1.08 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 17187.5 ns | 17187.5 ns | 1 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 17750 ns | 18041.5 ns | 0.98 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 16667 ns | 17021 ns | 0.98 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 293976.5 ns | 291174 ns | 1.01 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 21212198 ns | 21237766 ns | 1.00 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal | 4767354.5 ns | 5676396 ns | 0.84 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 686777 ns | 684357.5 ns | 1.00 |
| batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) | 55771 ns | 60208.5 ns | 0.93 |
| batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) | 62792 ns | 62042 ns | 1.01 |
| batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) | 65604.5 ns | 65750 ns | 1.00 |
| batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) | 51333 ns | 51250 ns | 1.00 |
| batchedmm(16, Bsize=512)/forward/GPU/CUDA | 66418 ns | 66352.5 ns | 1.00 |
| batchedmm(16, Bsize=512)/forward/GPU/AMDGPU | 114241 ns | 112971 ns | 1.01 |
| batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) | 202896 ns | 188541.5 ns | 1.08 |
| batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) | 135104 ns | 140250.5 ns | 0.96 |
| batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) | 130083 ns | 124249.5 ns | 1.05 |
| batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) | 245666 ns | 220125 ns | 1.12 |
| batchedmm(16, Bsize=512)/zygote/GPU/CUDA | 215296 ns | 213978 ns | 1.01 |
| batchedmm(16, Bsize=512)/zygote/GPU/AMDGPU | 607861 ns | 616297 ns | 0.99 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 79709 ns | 84479 ns | 0.94 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 107104 ns | 83666.5 ns | 1.28 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 85167 ns | 86167 ns | 0.99 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 124166.5 ns | 125666 ns | 0.99 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 192861 ns | 193270.5 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 5531381 ns | 5699293.5 ns | 0.97 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1816084 ns | 1963979.5 ns | 0.92 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 203512 ns | 204042 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1869895.5 ns | 1887292 ns | 0.99 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1901084 ns | 1916521 ns | 0.99 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1917666.5 ns | 1912333 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1889333 ns | 1806250 ns | 1.05 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 531825 ns | 528167 ns | 1.01 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 32650285 ns | 24408984.5 ns | 1.34 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 8859584 ns | 9102667 ns | 0.97 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 925670 ns | 1064601.5 ns | 0.87 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 291 ns | 292 ns | 1.00 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 292 ns | 333 ns | 0.88 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 291 ns | 292 ns | 1.00 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 291 ns | 250 ns | 1.16 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA | 21389 ns | 21230 ns | 1.01 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/oneAPI | 2065883 ns | 2190815.5 ns | 0.94 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/Metal | 336229.5 ns | 367541.5 ns | 0.91 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/AMDGPU | 42770.5 ns | 41291 ns | 1.04 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 1834 ns | 1792 ns | 1.02 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 1834 ns | 1834 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 1792 ns | 1875 ns | 0.96 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 1792 ns | 1792 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA | 253832 ns | 249025 ns | 1.02 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/oneAPI | 10417238 ns | 10051558 ns | 1.04 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/Metal | 1009479 ns | 1526271 ns | 0.66 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/AMDGPU | 184376.5 ns | 182202 ns | 1.01 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 8000 ns | 8583 ns | 0.93 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 10042 ns | 9542 ns | 1.05 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 10375 ns | 10604 ns | 0.98 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 8167 ns | 8125 ns | 1.01 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 119090.5 ns | 117788.5 ns | 1.01 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI | 3309191 ns | 3476276 ns | 0.95 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal | 876708 ns | 921312.5 ns | 0.95 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 232622 ns | 232182 ns | 1.00 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 9083 ns | 9000 ns | 1.01 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 10625 ns | 8958 ns | 1.19 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 9542 ns | 11292 ns | 0.85 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 10125 ns | 9145.5 ns | 1.11 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 527209 ns | 518629.5 ns | 1.02 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI | 22247571 ns | 19406043 ns | 1.15 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal | 3949187.5 ns | 4477584 ns | 0.88 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 624237 ns | 626986 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 56166 ns | 57458 ns | 0.98 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 38916 ns | 39875 ns | 0.98 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 46125 ns | 46750 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 83958 ns | 82583 ns | 1.02 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 40233 ns | 39259 ns | 1.02 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 1343252 ns | 1309251 ns | 1.03 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1123667 ns | 1121542 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 76266 ns | 74341 ns | 1.03 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1923750 ns | 1867542 ns | 1.03 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1952750.5 ns | 1978791 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1982854 ns | 1977229 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1850708.5 ns | 1853979.5 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 221906.5 ns | 219172 ns | 1.01 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 33376877 ns | 32964288 ns | 1.01 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 11408021 ns | 11253292 ns | 1.01 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1191052 ns | 1160142 ns | 1.03 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 416333 ns | 419229.5 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 421645.5 ns | 435958 ns | 0.97 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 421208.5 ns | 420208 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 417667 ns | 417291.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 208798 ns | 208124 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 7659621 ns | 8033766 ns | 0.95 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 518208 ns | 539333.5 ns | 0.96 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 282883 ns | 280723 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 747916.5 ns | 718729.5 ns | 1.04 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 671583 ns | 670917 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 673562.5 ns | 681646 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 748021 ns | 671125 ns | 1.11 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1048327.5 ns | 1045689 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 45569778.5 ns | 44612818 ns | 1.02 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6335208.5 ns | 6579583 ns | 0.96 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 914290 ns | 909619.5 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 3428937.5 ns | 3431646 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 3384709 ns | 3418041.5 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 3435000 ns | 3459666 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 3417875 ns | 3424604 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 175238.5 ns | 172982 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 8069034 ns | 8225049 ns | 0.98 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1424083 ns | 1418875 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 426124 ns | 438875 ns | 0.97 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 6191270.5 ns | 6211958.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 6170041 ns | 6239125 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 6167416.5 ns | 6228166.5 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 6190792 ns | 6164812.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 994959 ns | 989377 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 50094330 ns | 49957898 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 7413750 ns | 7609083 ns | 0.97 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1549811 ns | 1545101 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 470666 ns | 470459 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 252458 ns | 254333 ns | 0.99 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 342417 ns | 342000 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 901125 ns | 901833 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA | 46139 ns | 45850.5 ns | 1.01 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/oneAPI | 884569 ns | 874511 ns | 1.01 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/Metal | 368208 ns | 485291 ns | 0.76 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/AMDGPU | 243602 ns | 241413 ns | 1.01 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 2334750 ns | 2331458 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 1752562 ns | 1762250 ns | 0.99 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 2041187.5 ns | 2040791.5 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 3280124.5 ns | 3281083 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA | 255952 ns | 263882 ns | 0.97 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/oneAPI | 12850913 ns | 13135947 ns | 0.98 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/Metal | 2244770.5 ns | 2243500 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/AMDGPU | 770018 ns | 765467.5 ns | 1.01 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 55708 ns | 57083 ns | 0.98 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 39041 ns | 38854.5 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 46020.5 ns | 46125 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 84125 ns | 82875 ns | 1.02 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 28321 ns | 28162 ns | 1.01 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 1407008 ns | 1368315 ns | 1.03 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1106875 ns | 1138958 ns | 0.97 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 76505.5 ns | 74570.5 ns | 1.03 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2029708 ns | 2033792 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2082292 ns | 2094125 ns | 0.99 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2090958 ns | 2089041.5 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1949604 ns | 2003042 ns | 0.97 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 232547 ns | 231932 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 35887652 ns | 35712411 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 11649979 ns | 11300791.5 ns | 1.03 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1052311 ns | 1044461 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 55833 ns | 57500 ns | 0.97 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 39083.5 ns | 39917 ns | 0.98 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 46375 ns | 46500 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 84042 ns | 82625 ns | 1.02 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 49287 ns | 48905 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 790006.5 ns | 744836.5 ns | 1.06 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1049084 ns | 1117520.5 ns | 0.94 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 69820 ns | 64946 ns | 1.08 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1919458 ns | 1922750 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1955416.5 ns | 1974334 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1946334 ns | 1956833.5 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1890750 ns | 1889708 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 239685 ns | 239067 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 17609091 ns | 16476478 ns | 1.07 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 9788042 ns | 9755374.5 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 918859 ns | 916609 ns | 1.00 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 292 ns | 291 ns | 1.00 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 417 ns | 333 ns | 1.25 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 292 ns | 375 ns | 0.78 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 292 ns | 292 ns | 1 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 34717 ns | 35081.5 ns | 0.99 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI | 1181143 ns | 1290014 ns | 0.92 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal | 263500 ns | 287438 ns | 0.92 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU | 46211 ns | 45840 ns | 1.01 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6333 ns | 6541 ns | 0.97 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7500 ns | 6687.5 ns | 1.12 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 6583 ns | 7000 ns | 0.94 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7000 ns | 6500 ns | 1.08 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 208392.5 ns | 205115.5 ns | 1.02 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI | 20162243 ns | 20319441.5 ns | 0.99 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal | 4479667 ns | 5303083 ns | 0.84 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 365124 ns | 367174 ns | 0.99 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 291 ns | 250 ns | 1.16 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 250 ns | 292 ns | 0.86 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 250 ns | 291 ns | 0.86 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA | 32562 ns | 31894 ns | 1.02 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/oneAPI | 1251080 ns | 1192240 ns | 1.05 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/Metal | 258000 ns | 254292 ns | 1.01 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/AMDGPU | 37000 ns | 36310 ns | 1.02 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 2750 ns | 3334 ns | 0.82 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 3625 ns | 2958 ns | 1.23 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 2709 ns | 3167 ns | 0.86 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 2917 ns | 2958 ns | 0.99 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA | 189309.5 ns | 185317.5 ns | 1.02 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/oneAPI | 7798739 ns | 7518628 ns | 1.04 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/Metal | 905666.5 ns | 1115709 ns | 0.81 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/AMDGPU | 151136.5 ns | 149472 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 467667 ns | 422083 ns | 1.11 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 444750 ns | 423833 ns | 1.05 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 425999.5 ns | 427834 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 421833.5 ns | 424937.5 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 137895 ns | 137292 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 5774821 ns | 5779699.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 2386500 ns | 2076458 ns | 1.15 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 367024 ns | 366143.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 3802521 ns | 3813229.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 3765917 ns | 3824249.5 ns | 0.98 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 3811417 ns | 3788084 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 3799541.5 ns | 3812042 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 709425 ns | 705310 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 33554230 ns | 31262641 ns | 1.07 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 10457896 ns | 10824937.5 ns | 0.97 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1471404 ns | 1464005 ns | 1.01 |
| batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) | 49735229.5 ns | 49892959 ns | 1.00 |
| batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) | 25984959 ns | 26011834 ns | 1.00 |
| batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) | 35560875 ns | 35523145.5 ns | 1.00 |
| batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) | 96902041.5 ns | 97645833 ns | 0.99 |
| batchedmm(512, Bsize=32)/forward/GPU/CUDA | 1616773 ns | 1616287 ns | 1.00 |
| batchedmm(512, Bsize=32)/forward/GPU/AMDGPU | 1045271 ns | 1048102 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) | 153907333 ns | 154680021 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) | 89247291.5 ns | 88850291.5 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) | 112379750 ns | 112398500 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) | 294166500 ns | 298306271 ns | 0.99 |
| batchedmm(512, Bsize=32)/zygote/GPU/CUDA | 6515848 ns | 6498761 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/GPU/AMDGPU | 5562255.5 ns | 5545318 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) | 14521 ns | 19937.5 ns | 0.73 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) | 14958 ns | 15167 ns | 0.99 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) | 16833 ns | 17041.5 ns | 0.99 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) | 14854.5 ns | 14792 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA | 20539.5 ns | 20017 ns | 1.03 |
| bias_activation(32, act=tanh)(32 x 128)/forward/GPU/oneAPI | 1114507 ns | 1149888 ns | 0.97 |
| bias_activation(32, act=tanh)(32 x 128)/forward/GPU/Metal | 206959 ns | 229541 ns | 0.90 |
| bias_activation(32, act=tanh)(32 x 128)/forward/GPU/AMDGPU | 26060 ns | 27001 ns | 0.97 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) | 10625 ns | 10417 ns | 1.02 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) | 7771 ns | 7250 ns | 1.07 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) | 9208 ns | 9104 ns | 1.01 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) | 17437.5 ns | 17375 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA | 260548 ns | 257217 ns | 1.01 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/oneAPI | 9528073.5 ns | 9674368 ns | 0.98 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/Metal | 1587125 ns | 1641396 ns | 0.97 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/AMDGPU | 149326.5 ns | 147861 ns | 1.01 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 7958 ns | 8063 ns | 0.99 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 9292 ns | 9125 ns | 1.02 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 9500 ns | 10667 ns | 0.89 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 7958.5 ns | 8917 ns | 0.89 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 116273.5 ns | 114750.5 ns | 1.01 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI | 3476228 ns | 3651219 ns | 0.95 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal | 810375 ns | 861125 ns | 0.94 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 233683 ns | 233283 ns | 1.00 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 9208.5 ns | 9792 ns | 0.94 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 10645.5 ns | 10750 ns | 0.99 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10208 ns | 10917 ns | 0.94 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 10375 ns | 10271 ns | 1.01 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 619508.5 ns | 614307 ns | 1.01 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 22906068.5 ns | 28192305 ns | 0.81 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal | 4432792 ns | 5310750 ns | 0.83 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 654786 ns | 649747 ns | 1.01 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 8291.5 ns | 9708 ns | 0.85 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 10459 ns | 10000 ns | 1.05 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 10042 ns | 11541 ns | 0.87 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 9250 ns | 9584 ns | 0.97 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 120531 ns | 119206 ns | 1.01 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI | 3436472 ns | 3481764 ns | 0.99 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal | 901792 ns | 937459 ns | 0.96 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 71071 ns | 72050 ns | 0.99 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 13250 ns | 17479.5 ns | 0.76 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 16042 ns | 14375 ns | 1.12 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 17208 ns | 15125 ns | 1.14 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 15167 ns | 14667 ns | 1.03 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 592138 ns | 586931 ns | 1.01 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 18951458.5 ns | 19607421 ns | 0.97 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal | 4027062.5 ns | 4735125 ns | 0.85 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 345753 ns | 343533 ns | 1.01 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 459 ns | 500 ns | 0.92 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 583 ns | 584 ns | 1.00 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 500 ns | 584 ns | 0.86 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 541 ns | 459 ns | 1.18 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 34521 ns | 34228 ns | 1.01 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI | 1191899 ns | 1215476 ns | 0.98 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal | 371562.5 ns | 314188 ns | 1.18 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 206352 ns | 203452 ns | 1.01 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7062.5 ns | 9334 ns | 0.76 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 8333.5 ns | 8604.5 ns | 0.97 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 8583 ns | 9041 ns | 0.95 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 8000 ns | 8250 ns | 0.97 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 233771 ns | 230655 ns | 1.01 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI | 23357164 ns | 22072831 ns | 1.06 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal | 4885833 ns | 5460541 ns | 0.89 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 662116 ns | 654892 ns | 1.01 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 12292 ns | 17375 ns | 0.71 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 13229 ns | 14792 ns | 0.89 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 15125 ns | 16000 ns | 0.95 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 10167 ns | 10458 ns | 0.97 |
| bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA | 22042 ns | 21718 ns | 1.01 |
| bias_activation(32, act=gelu)(32 x 128)/forward/GPU/oneAPI | 1119591.5 ns | 1102903 ns | 1.02 |
| bias_activation(32, act=gelu)(32 x 128)/forward/GPU/Metal | 189125 ns | 208666 ns | 0.91 |
| bias_activation(32, act=gelu)(32 x 128)/forward/GPU/AMDGPU | 189132 ns | 184622 ns | 1.02 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 31875 ns | 31542 ns | 1.01 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 32333.5 ns | 32000 ns | 1.01 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 32291.5 ns | 32208 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 32000 ns | 32354.5 ns | 0.99 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA | 276327 ns | 271707 ns | 1.02 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/oneAPI | 12201192 ns | 10769694 ns | 1.13 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/Metal | 1697542 ns | 1820875 ns | 0.93 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/AMDGPU | 595015.5 ns | 588176 ns | 1.01 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 480875 ns | 452584 ns | 1.06 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 441083 ns | 441979.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 450250 ns | 467167 ns | 0.96 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 490979 ns | 438521 ns | 1.12 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 194024 ns | 194827 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 5766516 ns | 5920885 ns | 0.97 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 2629708 ns | 1997667 ns | 1.32 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 368063.5 ns | 368184 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 3822958 ns | 3829250 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 3807354 ns | 3838292 ns | 0.99 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 3827834 ns | 3802021 ns | 1.01 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 3826167 ns | 3830584 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 544349 ns | 544632 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 29050298 ns | 28778535 ns | 1.01 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 9196542 ns | 9720812.5 ns | 0.95 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1359983 ns | 1358284 ns | 1.00 |
| batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) | 838219667 ns | 831986833 ns | 1.01 |
| batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) | 415052604.5 ns | 416264500 ns | 1.00 |
| batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) | 543102500 ns | 543217708 ns | 1.00 |
| batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) | 1525021500 ns | 1509789750 ns | 1.01 |
| batchedmm(512, Bsize=512)/forward/GPU/CUDA | 22764607.5 ns | 22539644.5 ns | 1.01 |
| batchedmm(512, Bsize=512)/forward/GPU/AMDGPU | 14772276 ns | 14678121 ns | 1.01 |
| batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) | 3570164958 ns | 3779013833 ns | 0.94 |
| batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) | 1502049709 ns | 1885743917 ns | 0.80 |
| batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) | 2269221042 ns | 1788587042 ns | 1.27 |
| batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) | 4773617583 ns | 4810183875 ns | 0.99 |
| batchedmm(512, Bsize=512)/zygote/GPU/CUDA | 369302709 ns | 364565745 ns | 1.01 |
| batchedmm(512, Bsize=512)/zygote/GPU/AMDGPU | 87924411 ns | 88375525 ns | 0.99 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 79646 ns | 75520.5 ns | 1.05 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 78895.5 ns | 76416.5 ns | 1.03 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 78667 ns | 79958.5 ns | 0.98 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 77583 ns | 78625 ns | 0.99 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 207237 ns | 207155.5 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 7871351 ns | 7714255 ns | 1.02 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 520375 ns | 534709 ns | 0.97 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 107601 ns | 106301.5 ns | 1.01 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 250834 ns | 235667 ns | 1.06 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 294583.5 ns | 283229.5 ns | 1.04 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 285708.5 ns | 247208 ns | 1.16 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 222333.5 ns | 210874.5 ns | 1.05 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1049109.5 ns | 1048818 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 43337417.5 ns | 44375934 ns | 0.98 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6122958 ns | 6248084 ns | 0.98 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 640576 ns | 631246 ns | 1.01 |
| batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) | 199656458.5 ns | 199488333 ns | 1.00 |
| batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) | 103769666.5 ns | 103922541.5 ns | 1.00 |
| batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) | 139342042 ns | 139224666 ns | 1.00 |
| batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) | 388182208 ns | 393811292 ns | 0.99 |
| batchedmm(512, Bsize=128)/forward/GPU/CUDA | 5838796 ns | 5835255 ns | 1.00 |
| batchedmm(512, Bsize=128)/forward/GPU/AMDGPU | 3577840.5 ns | 3578582 ns | 1.00 |
| batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) | 616451521 ns | 620321291.5 ns | 0.99 |
| batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) | 351188291.5 ns | 354710917 ns | 0.99 |
| batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) | 439680896 ns | 440219958 ns | 1.00 |
| batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) | 1178137125 ns | 1185414250 ns | 0.99 |
| batchedmm(512, Bsize=128)/zygote/GPU/CUDA | 26651952 ns | 26495134 ns | 1.01 |
| batchedmm(512, Bsize=128)/zygote/GPU/AMDGPU | 22092888 ns | 22065145 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7333 ns | 7417 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5292 ns | 5417 ns | 0.98 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6084 ns | 6292 ns | 0.97 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10167 ns | 10145.5 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 27714.5 ns | 27466 ns | 1.01 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 1202781 ns | 1213453.5 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 351458 ns | 432833 ns | 0.81 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 48481 ns | 47620 ns | 1.02 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 218291.5 ns | 213000 ns | 1.02 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 222250 ns | 223041 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 221209 ns | 220917 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 213708.5 ns | 206896 ns | 1.03 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 222292 ns | 223324 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 31765824 ns | 31525343 ns | 1.01 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 9125125 ns | 9133958 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 529665 ns | 524095 ns | 1.01 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 7271 ns | 8854.5 ns | 0.82 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 9541.5 ns | 9312.5 ns | 1.02 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 9791 ns | 10583 ns | 0.93 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 8187.5 ns | 9625 ns | 0.85 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 117715.5 ns | 116401 ns | 1.01 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI | 3188633 ns | 3333892 ns | 0.96 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal | 885458 ns | 911750 ns | 0.97 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 69700 ns | 69370 ns | 1.00 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7479 ns | 7437.5 ns | 1.01 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 10479.5 ns | 8854 ns | 1.18 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 10875 ns | 7959 ns | 1.37 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 8875 ns | 9145.5 ns | 0.97 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 519786.5 ns | 515224 ns | 1.01 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI | 18597573.5 ns | 18606821 ns | 1.00 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal | 3961208 ns | 4708917 ns | 0.84 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 316073 ns | 318334 ns | 0.99 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 416 ns | 375 ns | 1.11 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 750 ns | 709 ns | 1.06 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 459 ns | 500 ns | 0.92 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 500 ns | 500 ns | 1 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 26338 ns | 25690 ns | 1.03 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI | 1200694 ns | 1183861 ns | 1.01 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal | 488604.5 ns | 493792 ns | 0.99 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU | 46820 ns | 46791 ns | 1.00 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 9291 ns | 9000 ns | 1.03 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 10416 ns | 10791.5 ns | 0.97 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 9208.5 ns | 9854.5 ns | 0.93 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 11583 ns | 10042 ns | 1.15 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 253612 ns | 251338.5 ns | 1.01 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI | 25803867.5 ns | 23713128.5 ns | 1.09 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal | 5171833.5 ns | 6062250 ns | 0.85 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 388624 ns | 386044 ns | 1.01 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 104834 ns | 107354.5 ns | 0.98 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 84834 ns | 84667 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 99500 ns | 100375 ns | 0.99 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 146333 ns | 146729.5 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA | 24613 ns | 24618 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/forward/GPU/oneAPI | 1194962 ns | 1206806.5 ns | 0.99 |
| bias_activation(512, act=gelu)(512 x 128)/forward/GPU/Metal | 246062.5 ns | 266292 ns | 0.92 |
| bias_activation(512, act=gelu)(512 x 128)/forward/GPU/AMDGPU | 192062 ns | 190862 ns | 1.01 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 526854 ns | 478500 ns | 1.10 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 478875 ns | 492271 ns | 0.97 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 500416.5 ns | 481000 ns | 1.04 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 478958.5 ns | 479145.5 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA | 232619 ns | 230580 ns | 1.01 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/oneAPI | 11733131 ns | 11914566 ns | 0.98 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/Metal | 1709625 ns | 2188458.5 ns | 0.78 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/AMDGPU | 610896 ns | 605276 ns | 1.01 |
| batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) | 5125 ns | 6042 ns | 0.85 |
| batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) | 7167 ns | 7000 ns | 1.02 |
| batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) | 6791 ns | 7583 ns | 0.90 |
| batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) | 4042 ns | 6000 ns | 0.67 |
| batchedmm(16, Bsize=32)/forward/GPU/CUDA | 16580 ns | 16947 ns | 0.98 |
| batchedmm(16, Bsize=32)/forward/GPU/AMDGPU | 79701 ns | 79345.5 ns | 1.00 |
| batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) | 11708 ns | 12062.5 ns | 0.97 |
| batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) | 11584 ns | 10542 ns | 1.10 |
| batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) | 10792 ns | 10917 ns | 0.99 |
| batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) | 17687.5 ns | 18208 ns | 0.97 |
| batchedmm(16, Bsize=32)/zygote/GPU/CUDA | 214143.5 ns | 212062.5 ns | 1.01 |
| batchedmm(16, Bsize=32)/zygote/GPU/AMDGPU | 366964 ns | 367674 ns | 1.00 |
| batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) | 35792 ns | 39750 ns | 0.90 |
| batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) | 50791 ns | 50708 ns | 1.00 |
| batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) | 51833.5 ns | 52625 ns | 0.98 |
| batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) | 13542 ns | 13750 ns | 0.98 |
| batchedmm(16, Bsize=128)/forward/GPU/CUDA | 21568 ns | 19888.5 ns | 1.08 |
| batchedmm(16, Bsize=128)/forward/GPU/AMDGPU | 87241 ns | 87991 ns | 0.99 |
| batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) | 38979.5 ns | 36500 ns | 1.07 |
| batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) | 30708 ns | 28959 ns | 1.06 |
| batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) | 30416 ns | 31500 ns | 0.97 |
| batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) | 58458 ns | 58583 ns | 1.00 |
| batchedmm(16, Bsize=128)/zygote/GPU/CUDA | 192010 ns | 190552 ns | 1.01 |
| batchedmm(16, Bsize=128)/zygote/GPU/AMDGPU | 395119 ns | 413955 ns | 0.95 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) | 1729.5 ns | 1750 ns | 0.99 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) | 1875 ns | 1937.5 ns | 0.97 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) | 2146 ns | 2125 ns | 1.01 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) | 1709 ns | 1792 ns | 0.95 |
| bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA | 20594 ns | 20369 ns | 1.01 |
| bias_activation(2, act=tanh)(2 x 128)/forward/GPU/oneAPI | 1163029.5 ns | 1137759 ns | 1.02 |
| bias_activation(2, act=tanh)(2 x 128)/forward/GPU/Metal | 326833 ns | 312000 ns | 1.05 |
| bias_activation(2, act=tanh)(2 x 128)/forward/GPU/AMDGPU | 33120 ns | 32711 ns | 1.01 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) | 2125 ns | 2250 ns | 0.94 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) | 2333 ns | 2396 ns | 0.97 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) | 2250 ns | 2333 ns | 0.96 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) | 2042 ns | 2250 ns | 0.91 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA | 204587 ns | 201543.5 ns | 1.02 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/oneAPI | 9292587 ns | 9195441 ns | 1.01 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/Metal | 1518500 ns | 1575208 ns | 0.96 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/AMDGPU | 136826.5 ns | 136711 ns | 1.00 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 4417 ns | 4562.5 ns | 0.97 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 5250 ns | 4708.5 ns | 1.12 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 6375.5 ns | 6834 ns | 0.93 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 4041.5 ns | 5125 ns | 0.79 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 145077 ns | 144149.5 ns | 1.01 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI | 5424296 ns | 5753580 ns | 0.94 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal | 725208 ns | 707854 ns | 1.02 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU | 69471 ns | 69031 ns | 1.01 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8041 ns | 8167 ns | 0.98 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8958 ns | 9250 ns | 0.97 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8416 ns | 8667 ns | 0.97 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 9208 ns | 9209 ns | 1.00 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 875812.5 ns | 867994 ns | 1.01 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI | 40742928.5 ns | 37396018.5 ns | 1.09 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal | 5580917 ns | 5747500 ns | 0.97 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 389804 ns | 386354 ns | 1.01 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 56792 ns | 56917 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 56875 ns | 56875 ns | 1 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 57584 ns | 57833 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 58375 ns | 58125 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 37054 ns | 37109 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 1234596.5 ns | 1131214.5 ns | 1.09 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 336000 ns | 421167 ns | 0.80 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 203242 ns | 203222.5 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 485813 ns | 451020.5 ns | 1.08 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 499958.5 ns | 475979 ns | 1.05 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 468208 ns | 465354 ns | 1.01 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 438854.5 ns | 487041.5 ns | 0.90 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 268055 ns | 264507 ns | 1.01 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 27322975 ns | 28501147 ns | 0.96 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 8122166.5 ns | 7943604 ns | 1.02 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 832729 ns | 830424 ns | 1.00 |
| batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) | 3291250 ns | 3311000 ns | 0.99 |
| batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) | 1764708 ns | 1770250 ns | 1.00 |
| batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) | 2339021 ns | 2337729.5 ns | 1.00 |
| batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) | 6260292 ns | 6302417 ns | 0.99 |
| batchedmm(128, Bsize=128)/forward/GPU/CUDA | 204625 ns | 204131.5 ns | 1.00 |
| batchedmm(128, Bsize=128)/forward/GPU/AMDGPU | 209992 ns | 211992 ns | 0.99 |
| batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) | 11332208 ns | 11485250 ns | 0.99 |
| batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) | 6550833 ns | 6571812.5 ns | 1.00 |
| batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) | 8325250 ns | 8309250 ns | 1.00 |
| batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) | 20937125 ns | 21151875.5 ns | 0.99 |
| batchedmm(128, Bsize=128)/zygote/GPU/CUDA | 734916 ns | 735481 ns | 1.00 |
| batchedmm(128, Bsize=128)/zygote/GPU/AMDGPU | 1048155.5 ns | 1057071 ns | 0.99 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 4291 ns | 5125 ns | 0.84 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 5875 ns | 5375 ns | 1.09 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 6583 ns | 7125 ns | 0.92 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 4896 ns | 6208.5 ns | 0.79 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 137991.5 ns | 137212.5 ns | 1.01 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI | 5581467 ns | 5624260 ns | 0.99 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal | 785625 ns | 793500 ns | 0.99 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU | 56390 ns | 56010 ns | 1.01 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7042 ns | 7000 ns | 1.01 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 10562.5 ns | 7500 ns | 1.41 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7104.5 ns | 7458 ns | 0.95 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7833 ns | 9083 ns | 0.86 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 754679 ns | 754137 ns | 1.00 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI | 34960226 ns | 34576213 ns | 1.01 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal | 5245042 ns | 5244167 ns | 1.00 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 371414 ns | 366813 ns | 1.01 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 127625 ns | 103250 ns | 1.24 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 95624.5 ns | 103875 ns | 0.92 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 100000 ns | 125291 ns | 0.80 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 95708 ns | 101042 ns | 0.95 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 152137 ns | 151348 ns | 1.01 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 5871279.5 ns | 6050689.5 ns | 0.97 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 2635166.5 ns | 2052375 ns | 1.28 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 203242 ns | 203192 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2017959 ns | 2018375 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2027771 ns | 2029000 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2021167 ns | 2023521 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1987167 ns | 1991417 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 703925.5 ns | 703391 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 31965494 ns | 31442085 ns | 1.02 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 11055292 ns | 11046312.5 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1255893 ns | 1250762 ns | 1.00 |
| batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) | 29375 ns | 34667 ns | 0.85 |
| batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) | 34500 ns | 34750 ns | 0.99 |
| batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) | 35250 ns | 35041.5 ns | 1.01 |
| batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) | 583 ns | 646 ns | 0.90 |
| batchedmm(2, Bsize=4)/forward/GPU/CUDA | 15622 ns | 15242 ns | 1.02 |
| batchedmm(2, Bsize=4)/forward/GPU/AMDGPU | 80130 ns | 79571 ns | 1.01 |
| batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) | 2542 ns | 2729.5 ns | 0.93 |
| batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) | 3125 ns | 2917 ns | 1.07 |
| batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) | 2834 ns | 3000 ns | 0.94 |
| batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) | 3000 ns | 2208 ns | 1.36 |
| batchedmm(2, Bsize=4)/zygote/GPU/CUDA | 141408 ns | 139866 ns | 1.01 |
| batchedmm(2, Bsize=4)/zygote/GPU/AMDGPU | 343344 ns | 342158.5 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7125 ns | 7167 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5375 ns | 5417 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6000 ns | 6084 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10209 ns | 10042 ns | 1.02 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 36671 ns | 36552 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 1208337 ns | 1221281.5 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 331459 ns | 674708 ns | 0.49 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 48221 ns | 48261 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 217479 ns | 213624.5 ns | 1.02 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 229625 ns | 221166.5 ns | 1.04 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 225000 ns | 220812.5 ns | 1.02 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 212875 ns | 205833 ns | 1.03 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 244929 ns | 243393.5 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 26091309.5 ns | 25870086.5 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 7984187.5 ns | 7741583 ns | 1.03 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 574266 ns | 575566 ns | 1.00 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3959 ns | 3958 ns | 1.00 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3917 ns | 3959 ns | 0.99 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3917 ns | 3958 ns | 0.99 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3917 ns | 3958 ns | 0.99 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA | 21419 ns | 21563 ns | 0.99 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/oneAPI | 2118188.5 ns | 2027782.5 ns | 1.04 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/Metal | 234583 ns | 250542 ns | 0.94 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/AMDGPU | 42620 ns | 43640 ns | 0.98 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 14791 ns | 14917 ns | 0.99 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 14750 ns | 14791 ns | 1.00 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 14875 ns | 14958 ns | 0.99 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 14833 ns | 14917 ns | 0.99 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA | 311492 ns | 306375 ns | 1.02 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/oneAPI | 10906139 ns | 11210297 ns | 0.97 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/Metal | 982000 ns | 1037625 ns | 0.95 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/AMDGPU | 192231.5 ns | 194327 ns | 0.99 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 140834 ns | 105583 ns | 1.33 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 127417 ns | 106167 ns | 1.20 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 105167 ns | 124875 ns | 0.84 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 141000 ns | 102583 ns | 1.37 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 152595 ns | 139877 ns | 1.09 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 6050834 ns | 5810927 ns | 1.04 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 2057334 ns | 2048416 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 213297 ns | 208802 ns | 1.02 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1917833 ns | 1878500 ns | 1.02 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1898875 ns | 1927583.5 ns | 0.99 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1922083 ns | 1867521 ns | 1.03 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1898854 ns | 1917937.5 ns | 0.99 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 692137 ns | 684487.5 ns | 1.01 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 31139112 ns | 30087516 ns | 1.03 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 10436541 ns | 10640458 ns | 0.98 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1217872 ns | 1063341 ns | 1.15 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 18250 ns | 17583 ns | 1.04 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 18625 ns | 19500 ns | 0.96 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 20750 ns | 20708 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 17749.5 ns | 18791 ns | 0.94 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 110137 ns | 109550 ns | 1.01 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 3282416 ns | 3331480 ns | 0.99 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 480541.5 ns | 1318708 ns | 0.36 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 79421 ns | 80701 ns | 0.98 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 252041.5 ns | 216271 ns | 1.17 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 217541.5 ns | 222292 ns | 0.98 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 219687.5 ns | 217916 ns | 1.01 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 222729.5 ns | 216167 ns | 1.03 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 519298 ns | 516519 ns | 1.01 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 20051825.5 ns | 19724665.5 ns | 1.02 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6194812.5 ns | 6017791.5 ns | 1.03 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 478425 ns | 477585 ns | 1.00 |
| batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) | 23291.5 ns | 26583 ns | 0.88 |
| batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) | 28583 ns | 28770.5 ns | 0.99 |
| batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) | 28792 ns | 29104 ns | 0.99 |
| batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) | 1229.5 ns | 1334 ns | 0.92 |
| batchedmm(16, Bsize=4)/forward/GPU/CUDA | 16210 ns | 15984 ns | 1.01 |
| batchedmm(16, Bsize=4)/forward/GPU/AMDGPU | 82241 ns | 81921 ns | 1.00 |
| batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) | 4292 ns | 4833.5 ns | 0.89 |
| batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) | 4729 ns | 4833 ns | 0.98 |
| batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) | 5042 ns | 5208.5 ns | 0.97 |
| batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) | 5771 ns | 4333 ns | 1.33 |
| batchedmm(16, Bsize=4)/zygote/GPU/CUDA | 207444.5 ns | 206128 ns | 1.01 |
| batchedmm(16, Bsize=4)/zygote/GPU/AMDGPU | 378084 ns | 379654 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 305417 ns | 305792 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 306250 ns | 306042 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 308084 ns | 306833 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 305750 ns | 307083 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 228609 ns | 227988.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 7545946 ns | 7778230 ns | 0.97 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 604584 ns | 1241125 ns | 0.49 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 273963 ns | 272793 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 532917 ns | 535708 ns | 0.99 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 538167 ns | 533084 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 539125 ns | 538208 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 572709 ns | 530917 ns | 1.08 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1074383 ns | 1080430 ns | 0.99 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 44755027.5 ns | 42644591.5 ns | 1.05 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6115208.5 ns | 6182083 ns | 0.99 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 858603.5 ns | 851073.5 ns | 1.01 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 19291 ns | 19125 ns | 1.01 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 20708 ns | 20624.5 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 22375.5 ns | 21458 ns | 1.04 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 19875 ns | 20000 ns | 0.99 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 114907 ns | 112864 ns | 1.02 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 3614583 ns | 3473281 ns | 1.04 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 593916 ns | 1444854 ns | 0.41 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 79421 ns | 80611 ns | 0.99 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 215708 ns | 220167 ns | 0.98 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 220584 ns | 222791.5 ns | 0.99 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 213625 ns | 214771 ns | 0.99 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 215875 ns | 212625 ns | 1.02 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 762395 ns | 737028 ns | 1.03 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 25444001 ns | 25214419 ns | 1.01 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 7232562.5 ns | 7109375 ns | 1.02 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 542290.5 ns | 531685 ns | 1.02 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 6125 ns | 5916 ns | 1.04 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 7083 ns | 7083 ns | 1 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 7917 ns | 8604.5 ns | 0.92 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 6208 ns | 6500 ns | 0.96 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 140165.5 ns | 140088 ns | 1.00 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI | 5168559 ns | 5562789 ns | 0.93 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal | 799291 ns | 803937.5 ns | 0.99 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU | 65270 ns | 64661 ns | 1.01 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 9542 ns | 10000 ns | 0.95 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 10333.5 ns | 10937.5 ns | 0.94 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 10375 ns | 10750 ns | 0.97 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 11145.5 ns | 10041 ns | 1.11 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 826456 ns | 822803 ns | 1.00 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI | 37337383 ns | 36817844 ns | 1.01 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal | 5311708 ns | 5484583 ns | 0.97 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 387474 ns | 382033 ns | 1.01 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 4875 ns | 4334 ns | 1.12 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 6917 ns | 5291 ns | 1.31 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 7250 ns | 7333 ns | 0.99 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 4812.5 ns | 5584 ns | 0.86 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 144262 ns | 142901.5 ns | 1.01 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI | 5426091.5 ns | 5758977.5 ns | 0.94 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal | 808375 ns | 800458 ns | 1.01 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 66621 ns | 66271 ns | 1.01 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7458 ns | 7208 ns | 1.03 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 8083 ns | 7646 ns | 1.06 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7541.5 ns | 7750 ns | 0.97 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7833 ns | 7583 ns | 1.03 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 783702 ns | 782456.5 ns | 1.00 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI | 37497088 ns | 39501262 ns | 0.95 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal | 5566229 ns | 6034250 ns | 0.92 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 395004 ns | 392794 ns | 1.01 |
| batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) | 14350584 ns | 14539375 ns | 0.99 |
| batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) | 7693688 ns | 7723291.5 ns | 1.00 |
| batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) | 10127042 ns | 10145625 ns | 1.00 |
| batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) | 27615959 ns | 27763416 ns | 0.99 |
| batchedmm(128, Bsize=512)/forward/GPU/CUDA | 548306 ns | 554910 ns | 0.99 |
| batchedmm(128, Bsize=512)/forward/GPU/AMDGPU | 393134 ns | 393434 ns | 1.00 |
| batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) | 45943208 ns | 46429208.5 ns | 0.99 |
| batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) | 26437417 ns | 26609416 ns | 0.99 |
| batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) | 33454833 ns | 33517458 ns | 1.00 |
| batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) | 84782667 ns | 85405667 ns | 0.99 |
| batchedmm(128, Bsize=512)/zygote/GPU/CUDA | 2657066 ns | 2664805 ns | 1.00 |
| batchedmm(128, Bsize=512)/zygote/GPU/AMDGPU | 3290613 ns | 3291838.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 66375 ns | 66292 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 68584 ns | 67875 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 69333.5 ns | 68250 ns | 1.02 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 65979 ns | 65917 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 121920.5 ns | 119249 ns | 1.02 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 3593431.5 ns | 3647654 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 508166 ns | 1440312.5 ns | 0.35 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 229397.5 ns | 232702 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 446833 ns | 441250 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 452437.5 ns | 441625 ns | 1.02 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 446375 ns | 447167 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 445834 ns | 441478.5 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 728139 ns | 727144.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 26912797 ns | 26208342 ns | 1.03 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 7552104 ns | 7477375 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 790108 ns | 793922.5 ns | 1.00 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 500 ns | 500 ns | 1 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 666 ns | 584 ns | 1.14 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 500 ns | 625 ns | 0.80 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 667 ns | 583 ns | 1.14 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 32311 ns | 31836 ns | 1.01 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI | 1198752.5 ns | 1180672 ns | 1.02 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal | 473500 ns | 286667 ns | 1.65 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 47340 ns | 47841 ns | 0.99 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 8666 ns | 9458 ns | 0.92 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 9208 ns | 9271 ns | 0.99 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 8458 ns | 9750 ns | 0.87 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 17104 ns | 9416 ns | 1.82 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 286358 ns | 283587 ns | 1.01 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 20778583 ns | 22547365 ns | 0.92 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal | 4681395.5 ns | 5502666.5 ns | 0.85 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 375004 ns | 374188.5 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 9875 ns | 9792 ns | 1.01 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 9875 ns | 9833 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 9792 ns | 9875 ns | 0.99 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 9833 ns | 9875 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA | 23012 ns | 22851 ns | 1.01 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/oneAPI | 2014844 ns | 2120178 ns | 0.95 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/Metal | 215645.5 ns | 221333 ns | 0.97 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/AMDGPU | 205762 ns | 207772 ns | 0.99 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 45958 ns | 46167 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 46042 ns | 46083 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 46041 ns | 46417 ns | 0.99 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 46250 ns | 46062.5 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA | 290878 ns | 287950 ns | 1.01 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/oneAPI | 9152947 ns | 12273456 ns | 0.75 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/Metal | 942542 ns | 1033833.5 ns | 0.91 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/AMDGPU | 607695 ns | 600566 ns | 1.01 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 56250 ns | 56167 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 56458 ns | 56875 ns | 0.99 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 57083 ns | 57166 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 57709 ns | 57875 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 28552 ns | 28495 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 1253508.5 ns | 1157087.5 ns | 1.08 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 663666.5 ns | 660125 ns | 1.01 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 203541.5 ns | 202572 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 448583 ns | 448229 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 465562 ns | 464979 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 465458.5 ns | 472292 ns | 0.99 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 454041.5 ns | 474437.5 ns | 0.96 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 245887 ns | 244496.5 ns | 1.01 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 33424426 ns | 33157318.5 ns | 1.01 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 9545520.5 ns | 9248750 ns | 1.03 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 887779 ns | 888349 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 645812.5 ns | 614125 ns | 1.05 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 575959 ns | 648750 ns | 0.89 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 640542 ns | 652521 ns | 0.98 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 646271 ns | 642542 ns | 1.01 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 208584 ns | 208606.5 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 8406939 ns | 7841403 ns | 1.07 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1406395.5 ns | 1401250 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 315503 ns | 305493 ns | 1.03 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2214979 ns | 2245937.5 ns | 0.99 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2211999.5 ns | 2247291 ns | 0.98 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2220812.5 ns | 2238062.5 ns | 0.99 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2227958 ns | 2241541 ns | 0.99 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 978439 ns | 971988 ns | 1.01 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 47363900 ns | 48958299 ns | 0.97 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 10481646 ns | 7597458.5 ns | 1.38 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1213952 ns | 1213901.5 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 18625 ns | 19333 ns | 0.96 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 20729 ns | 21646 ns | 0.96 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 21583 ns | 21833 ns | 0.99 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 18875 ns | 24291 ns | 0.78 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 113850.5 ns | 111706.5 ns | 1.02 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 3565557.5 ns | 3500994.5 ns | 1.02 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 497958 ns | 1437895.5 ns | 0.35 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 79731 ns | 79141 ns | 1.01 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 227375 ns | 219459 ns | 1.04 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 259417 ns | 219791.5 ns | 1.18 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 225541 ns | 222104.5 ns | 1.02 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 227084 ns | 219875 ns | 1.03 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 729838 ns | 728212.5 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 26163617 ns | 26675294 ns | 0.98 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 7560500 ns | 7278312 ns | 1.04 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 554315 ns | 555140 ns | 1.00 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 500 ns | 500 ns | 1 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 584 ns | 584 ns | 1 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 541 ns | 667 ns | 0.81 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 500 ns | 583 ns | 0.86 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 23274 ns | 22972 ns | 1.01 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI | 1191789 ns | 1186538 ns | 1.00 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal | 484250 ns | 461542 ns | 1.05 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 48040 ns | 49541 ns | 0.97 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 9083 ns | 9750 ns | 0.93 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 10437.5 ns | 9333.5 ns | 1.12 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 9541 ns | 9896 ns | 0.96 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 9500 ns | 10000 ns | 0.95 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 268183 ns | 265448 ns | 1.01 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 24685731.5 ns | 24827341.5 ns | 0.99 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal | 5000875 ns | 6076333 ns | 0.82 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 398234 ns | 415154 ns | 0.96 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 7250 ns | 7917 ns | 0.92 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 9187.5 ns | 10208 ns | 0.90 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 9645.5 ns | 10542 ns | 0.91 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 8041 ns | 9292 ns | 0.87 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 118921.5 ns | 118520 ns | 1.00 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI | 3382327 ns | 3378687 ns | 1.00 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal | 886791.5 ns | 891583 ns | 0.99 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU | 71801 ns | 75371 ns | 0.95 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7604 ns | 7291.5 ns | 1.04 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 8125 ns | 7875 ns | 1.03 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7500 ns | 7833.5 ns | 0.96 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7562.5 ns | 7708 ns | 0.98 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 507494 ns | 503824 ns | 1.01 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI | 17189656.5 ns | 17507211 ns | 0.98 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal | 3782375 ns | 4534375 ns | 0.83 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 320313 ns | 318933 ns | 1.00 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 1500 ns | 1437.5 ns | 1.04 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 1708.5 ns | 1667 ns | 1.02 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 1791 ns | 1917 ns | 0.93 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 1375 ns | 1417 ns | 0.97 |
| bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA | 21598 ns | 21272 ns | 1.02 |
| bias_activation(2, act=gelu)(2 x 128)/forward/GPU/oneAPI | 1189888 ns | 1191094 ns | 1.00 |
| bias_activation(2, act=gelu)(2 x 128)/forward/GPU/Metal | 313375 ns | 307229 ns | 1.02 |
| bias_activation(2, act=gelu)(2 x 128)/forward/GPU/AMDGPU | 190932 ns | 189132 ns | 1.01 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 3541 ns | 3292 ns | 1.08 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 3583 ns | 3333 ns | 1.08 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 3458 ns | 3500 ns | 0.99 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 3292 ns | 3500 ns | 0.94 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA | 218452 ns | 216668.5 ns | 1.01 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/oneAPI | 9603283 ns | 10523301.5 ns | 0.91 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/Metal | 1797375 ns | 1655750 ns | 1.09 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/AMDGPU | 583116 ns | 579466 ns | 1.01 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) | 148104.5 ns | 148229.5 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) | 106833 ns | 106166.5 ns | 1.01 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) | 128562.5 ns | 129250 ns | 0.99 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) | 225000 ns | 225167 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA | 23975 ns | 23640 ns | 1.01 |
| bias_activation(512, act=tanh)(512 x 128)/forward/GPU/oneAPI | 1165725 ns | 1169047 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/forward/GPU/Metal | 254292 ns | 281229 ns | 0.90 |
| bias_activation(512, act=tanh)(512 x 128)/forward/GPU/AMDGPU | 41470 ns | 40580 ns | 1.02 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) | 157645.5 ns | 143125 ns | 1.10 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) | 87625 ns | 87375 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) | 112000 ns | 112875.5 ns | 0.99 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) | 250708.5 ns | 250792 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA | 218220.5 ns | 214898 ns | 1.02 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/oneAPI | 10460438 ns | 10468792 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/Metal | 1096666 ns | 2056708 ns | 0.53 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/AMDGPU | 269773 ns | 266232 ns | 1.01 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7167 ns | 7208 ns | 0.99 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5333 ns | 5375 ns | 0.99 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6000 ns | 6083 ns | 0.99 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10458 ns | 10000 ns | 1.05 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 32755 ns | 33010 ns | 0.99 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 1178842 ns | 1218913 ns | 0.97 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 330458 ns | 357271 ns | 0.92 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 50720 ns | 50911 ns | 1.00 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 253104 ns | 227938 ns | 1.11 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 229041.5 ns | 228354.5 ns | 1.00 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 234187.5 ns | 235708 ns | 0.99 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 227938 ns | 249729 ns | 0.91 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 263186.5 ns | 263220 ns | 1.00 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 27448206 ns | 28851277 ns | 0.95 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 8237750 ns | 8089625 ns | 1.02 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 594190.5 ns | 591956 ns | 1.00 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 13792 ns | 15375 ns | 0.90 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 15166 ns | 14917 ns | 1.02 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 16499.5 ns | 16834 ns | 0.98 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 14667 ns | 15583 ns | 0.94 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 139540 ns | 138290 ns | 1.01 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI | 5436668.5 ns | 5390404 ns | 1.01 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal | 786729 ns | 805167 ns | 0.98 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 232963 ns | 231372.5 ns | 1.01 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 23000 ns | 23333 ns | 0.99 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 23937.5 ns | 23438 ns | 1.02 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 23875 ns | 24459 ns | 0.98 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 23979.5 ns | 23666 ns | 1.01 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 870094.5 ns | 863635.5 ns | 1.01 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 40010466.5 ns | 39146915 ns | 1.02 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal | 5595708 ns | 5702250 ns | 0.98 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 679366 ns | 683727 ns | 0.99 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 8750 ns | 8875 ns | 0.99 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 10312.5 ns | 10041.5 ns | 1.03 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 11271 ns | 11750 ns | 0.96 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 9584 ns | 9917 ns | 0.97 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 123388.5 ns | 122685 ns | 1.01 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI | 3563169 ns | 3570923 ns | 1.00 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal | 858292 ns | 917271 ns | 0.94 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU | 74460 ns | 75270 ns | 0.99 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 13375 ns | 14166 ns | 0.94 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 14458.5 ns | 14458.5 ns | 1 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 13958 ns | 14979.5 ns | 0.93 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 13625 ns | 13542 ns | 1.01 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 667308 ns | 660959 ns | 1.01 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI | 21257602 ns | 21424061 ns | 0.99 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal | 4997708 ns | 5279979 ns | 0.95 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 365743 ns | 365744 ns | 1.00 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 8583 ns | 8417 ns | 1.02 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 10333 ns | 10146 ns | 1.02 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 10312.5 ns | 12125 ns | 0.85 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 9166 ns | 9792 ns | 0.94 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 121770.5 ns | 121433.5 ns | 1.00 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI | 3365145.5 ns | 3352559.5 ns | 1.00 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal | 906625 ns | 952146 ns | 0.95 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU | 75170 ns | 72460 ns | 1.04 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 12292 ns | 13166 ns | 0.93 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 13437.5 ns | 12938 ns | 1.04 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 12916 ns | 13125 ns | 0.98 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 12458 ns | 12916 ns | 0.96 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 553718.5 ns | 548948 ns | 1.01 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI | 18868109 ns | 18645332 ns | 1.01 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal | 3865125.5 ns | 4735063 ns | 0.82 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 341293 ns | 340583 ns | 1.00 |
| batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) | 26354.5 ns | 31125.5 ns | 0.85 |
| batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) | 30645.5 ns | 31520.5 ns | 0.97 |
| batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) | 31541 ns | 32333.5 ns | 0.98 |
| batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) | 1833 ns | 1834 ns | 1.00 |
| batchedmm(2, Bsize=128)/forward/GPU/CUDA | 16183 ns | 16210 ns | 1.00 |
| batchedmm(2, Bsize=128)/forward/GPU/AMDGPU | 81001 ns | 80860 ns | 1.00 |
| batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) | 5209 ns | 5229.5 ns | 1.00 |
| batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) | 5021 ns | 4959 ns | 1.01 |
| batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) | 5417 ns | 5250 ns | 1.03 |
| batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) | 6604 ns | 6334 ns | 1.04 |
| batchedmm(2, Bsize=128)/zygote/GPU/CUDA | 140577.5 ns | 138594 ns | 1.01 |
| batchedmm(2, Bsize=128)/zygote/GPU/AMDGPU | 370423.5 ns | 388224 ns | 0.95 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 250 ns | 291 ns | 0.86 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 375 ns | 1 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 250 ns | 375 ns | 0.67 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 291 ns | 334 ns | 0.87 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 25697 ns | 25350 ns | 1.01 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI | 1197018 ns | 1199368 ns | 1.00 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal | 465667 ns | 478250.5 ns | 0.97 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU | 47180 ns | 49490 ns | 0.95 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 6125 ns | 6292 ns | 0.97 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 6729 ns | 6750 ns | 1.00 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 6333 ns | 6792 ns | 0.93 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6312.5 ns | 6584 ns | 0.96 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 187721.5 ns | 186417 ns | 1.01 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI | 23736279.5 ns | 23013025 ns | 1.03 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal | 4952833.5 ns | 5920458 ns | 0.84 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 386429 ns | 393209 ns | 0.98 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 1959 ns | 1958 ns | 1.00 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 2042 ns | 2042 ns | 1 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 2000 ns | 2083 ns | 0.96 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 1959 ns | 2000 ns | 0.98 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 26463 ns | 25999.5 ns | 1.02 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI | 1170027.5 ns | 1183440.5 ns | 0.99 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal | 479625 ns | 314229 ns | 1.53 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 206252 ns | 206522 ns | 1.00 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 16250 ns | 16583.5 ns | 0.98 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 16666 ns | 15958 ns | 1.04 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 16208.5 ns | 16854 ns | 0.96 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 16417 ns | 16791.5 ns | 0.98 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 276067 ns | 272947 ns | 1.01 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 24921263 ns | 25132475.5 ns | 0.99 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal | 5326083 ns | 6200500 ns | 0.86 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 700836 ns | 699897 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 173875 ns | 158000 ns | 1.10 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 148750 ns | 152895.5 ns | 0.97 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 155708 ns | 179875 ns | 0.87 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 147458 ns | 175625 ns | 0.84 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 203847 ns | 205507.5 ns | 0.99 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 8347024.5 ns | 8109426 ns | 1.03 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1561917 ns | 1459854.5 ns | 1.07 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 232482 ns | 213437 ns | 1.09 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1328917 ns | 1279667 ns | 1.04 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1311771 ns | 1336958 ns | 0.98 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1320791 ns | 1276333 ns | 1.03 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1322500 ns | 1332729.5 ns | 0.99 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 909940.5 ns | 907688 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 44667022 ns | 46524861.5 ns | 0.96 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 7124333 ns | 6921834 ns | 1.03 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 995559.5 ns | 1109576 ns | 0.90 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 22958 ns | 25937.5 ns | 0.89 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 26833 ns | 25750 ns | 1.04 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 27625 ns | 27437.5 ns | 1.01 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 24667 ns | 24042 ns | 1.03 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 234608.5 ns | 236630 ns | 0.99 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 7924652 ns | 7924614 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 576541 ns | 1195645.5 ns | 0.48 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 116011 ns | 112891.5 ns | 1.03 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 118166.5 ns | 117812.5 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 122375 ns | 125958 ns | 0.97 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 158041.5 ns | 130667 ns | 1.21 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 123833.5 ns | 132625 ns | 0.93 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1073695 ns | 1078111.5 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 44153968 ns | 48454865.5 ns | 0.91 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6127166 ns | 6291354 ns | 0.97 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 612925 ns | 604836 ns | 1.01 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 250 ns | 250 ns | 1 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 375 ns | 1 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 291 ns | 375 ns | 0.78 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 250 ns | 334 ns | 0.75 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 23160 ns | 22703 ns | 1.02 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI | 1212472 ns | 1228350.5 ns | 0.99 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal | 478542 ns | 303875 ns | 1.57 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 47471 ns | 47155.5 ns | 1.01 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 6291 ns | 6333 ns | 0.99 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 6833.5 ns | 6937.5 ns | 0.99 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 6458 ns | 6750 ns | 0.96 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6584 ns | 6687.5 ns | 0.98 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 204382.5 ns | 201918.5 ns | 1.01 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 24496787 ns | 24022047 ns | 1.02 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal | 5334937.5 ns | 6154291 ns | 0.87 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 388703 ns | 390799 ns | 0.99 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 5208 ns | 5584 ns | 0.93 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 7021 ns | 6729 ns | 1.04 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 7458 ns | 7834 ns | 0.95 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 5667 ns | 6333 ns | 0.89 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 145933.5 ns | 144556.5 ns | 1.01 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI | 5745568 ns | 5802837 ns | 0.99 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal | 753959 ns | 465083.5 ns | 1.62 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 234802 ns | 231623 ns | 1.01 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 9583 ns | 9875 ns | 0.97 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 10375 ns | 10500 ns | 0.99 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10125 ns | 10250 ns | 0.99 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 10042 ns | 10084 ns | 1.00 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 903827 ns | 898422 ns | 1.01 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 42297357 ns | 41540865 ns | 1.02 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal | 5826479 ns | 5925625 ns | 0.98 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 668457 ns | 667721.5 ns | 1.00 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 667 ns | 625 ns | 1.07 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 709 ns | 625 ns | 1.13 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 625 ns | 625 ns | 1 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 625 ns | 667 ns | 0.94 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA | 22371 ns | 22281 ns | 1.00 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/oneAPI | 2015786 ns | 2048848.5 ns | 0.98 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/Metal | 208416 ns | 228500 ns | 0.91 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/AMDGPU | 207552 ns | 205022 ns | 1.01 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 4584 ns | 4625 ns | 0.99 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 4833 ns | 4625 ns | 1.04 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 4666 ns | 4791 ns | 0.97 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 4584 ns | 4584 ns | 1 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA | 228749 ns | 224113.5 ns | 1.02 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/oneAPI | 10461831 ns | 11648202 ns | 0.90 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/Metal | 1654416.5 ns | 1667208 ns | 0.99 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/AMDGPU | 580735 ns | 578966 ns | 1.00 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 7750 ns | 8604.5 ns | 0.90 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 9166.5 ns | 9500 ns | 0.96 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 8834 ns | 10125 ns | 0.87 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 8291 ns | 8125 ns | 1.02 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 121959 ns | 121216 ns | 1.01 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI | 3411255 ns | 3493631.5 ns | 0.98 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal | 827916 ns | 797562.5 ns | 1.04 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU | 74011 ns | 73391 ns | 1.01 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8625 ns | 8166.5 ns | 1.06 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 9041.5 ns | 9020.5 ns | 1.00 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8583.5 ns | 9292 ns | 0.92 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8375 ns | 8834 ns | 0.95 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 591884.5 ns | 585686 ns | 1.01 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI | 20708574.5 ns | 21659888 ns | 0.96 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal | 4264875 ns | 5138604.5 ns | 0.83 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 342784 ns | 345673 ns | 0.99 |
| batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) | 122750 ns | 128166 ns | 0.96 |
| batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) | 96459 ns | 95895.5 ns | 1.01 |
| batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) | 130187.5 ns | 130416 ns | 1.00 |
| batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) | 180875 ns | 193500 ns | 0.93 |
| batchedmm(128, Bsize=4)/forward/GPU/CUDA | 45830 ns | 45829 ns | 1.00 |
| batchedmm(128, Bsize=4)/forward/GPU/AMDGPU | 101721 ns | 100941 ns | 1.01 |
| batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) | 328000 ns | 335583 ns | 0.98 |
| batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) | 166666 ns | 167167 ns | 1.00 |
| batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) | 347541.5 ns | 354375 ns | 0.98 |
| batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) | 608646 ns | 609249.5 ns | 1.00 |
| batchedmm(128, Bsize=4)/zygote/GPU/CUDA | 192063 ns | 190876 ns | 1.01 |
| batchedmm(128, Bsize=4)/zygote/GPU/AMDGPU | 505519.5 ns | 517555 ns | 0.98 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 395916 ns | 397541 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 214250 ns | 215333 ns | 0.99 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 288167 ns | 288458 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 756500 ns | 756458 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA | 43676.5 ns | 43687 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/oneAPI | 1411321 ns | 1356444.5 ns | 1.04 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/Metal | 429792 ns | 420167 ns | 1.02 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/AMDGPU | 82131 ns | 80321 ns | 1.02 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 1458834 ns | 1457000 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 857583 ns | 862125 ns | 0.99 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 1134333 ns | 1134520.5 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 2441958.5 ns | 2444500 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA | 249859 ns | 251807.5 ns | 0.99 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/oneAPI | 10370982 ns | 10565821 ns | 0.98 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/Metal | 1909646 ns | 1852750 ns | 1.03 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/AMDGPU | 352903 ns | 350374 ns | 1.01 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 616500 ns | 683334 ns | 0.90 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 598250 ns | 650583 ns | 0.92 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 648916.5 ns | 641791.5 ns | 1.01 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 642667 ns | 653250 ns | 0.98 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 200586.5 ns | 202465 ns | 0.99 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 7794534 ns | 8364163.5 ns | 0.93 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1363291 ns | 1384458 ns | 0.98 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 313733 ns | 302773 ns | 1.04 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2445375 ns | 2447209 ns | 1.00 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2426917 ns | 2468625 ns | 0.98 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2441500 ns | 2446166.5 ns | 1.00 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2440750 ns | 2452188 ns | 1.00 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 994961 ns | 992979 ns | 1.00 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 50766350 ns | 51629265.5 ns | 0.98 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 9661291 ns | 9882875 ns | 0.98 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1307388 ns | 1311863 ns | 1.00 |
| batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) | 28521 ns | 34667 ns | 0.82 |
| batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) | 34625 ns | 34291.5 ns | 1.01 |
| batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) | 33916.5 ns | 35521 ns | 0.95 |
| batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) | 875 ns | 875 ns | 1 |
| batchedmm(2, Bsize=32)/forward/GPU/CUDA | 15425.5 ns | 15660 ns | 0.99 |
| batchedmm(2, Bsize=32)/forward/GPU/AMDGPU | 79381 ns | 78941 ns | 1.01 |
| batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) | 3062.5 ns | 3125 ns | 0.98 |
| batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) | 3416 ns | 3458.5 ns | 0.99 |
| batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) | 3208 ns | 3312.5 ns | 0.97 |
| batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) | 3209 ns | 3084 ns | 1.04 |
| batchedmm(2, Bsize=32)/zygote/GPU/CUDA | 139741 ns | 137070.5 ns | 1.02 |
| batchedmm(2, Bsize=32)/zygote/GPU/AMDGPU | 338953 ns | 338254 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 404500 ns | 406166 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 402125 ns | 404458 ns | 0.99 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 408334 ns | 408458 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 422458 ns | 420458 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 43145 ns | 42995 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 1417291 ns | 1466063 ns | 0.97 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1128750.5 ns | 1144125 ns | 0.99 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 239562 ns | 238192 ns | 1.01 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 3863292 ns | 3877875 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 3971625 ns | 3990896 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 3996791 ns | 3992562.5 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 3757979.5 ns | 3778146 ns | 0.99 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 242826 ns | 240990 ns | 1.01 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 38623864 ns | 36589646 ns | 1.06 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 11673750 ns | 11933709 ns | 0.98 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1433229 ns | 1433854 ns | 1.00 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3959 ns | 3916 ns | 1.01 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3917 ns | 3958 ns | 0.99 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3916 ns | 3917 ns | 1.00 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3917 ns | 3917 ns | 1 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA | 33968 ns | 33931 ns | 1.00 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/oneAPI | 1232483 ns | 1232713.5 ns | 1.00 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/Metal | 167334 ns | 183709 ns | 0.91 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/AMDGPU | 38620 ns | 38031 ns | 1.02 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 15666 ns | 15708 ns | 1.00 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 15750 ns | 15750 ns | 1 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 15625 ns | 15958 ns | 0.98 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 15625 ns | 15750 ns | 0.99 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA | 255128 ns | 252887 ns | 1.01 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/oneAPI | 8717525 ns | 9179273 ns | 0.95 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/Metal | 843520.5 ns | 893625 ns | 0.94 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/AMDGPU | 169816.5 ns | 172862 ns | 0.98 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 402625 ns | 404417 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 220209 ns | 221125 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 295959 ns | 296500 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 760791.5 ns | 761125 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA | 113239 ns | 112867 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/oneAPI | 1047524 ns | 1050270.5 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/Metal | 348895.5 ns | 406792 ns | 0.86 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/AMDGPU | 89300.5 ns | 87471 ns | 1.02 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 1474958.5 ns | 1471292 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 881146 ns | 884000 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 1159083.5 ns | 1160146 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 2461917 ns | 2466083.5 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA | 241292 ns | 238614 ns | 1.01 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/oneAPI | 9318727.5 ns | 9255273 ns | 1.01 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/Metal | 1946459 ns | 1932833 ns | 1.01 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/AMDGPU | 354883 ns | 350549 ns | 1.01 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 500 ns | 500 ns | 1 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 542 ns | 583 ns | 0.93 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 500 ns | 583 ns | 0.86 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 500 ns | 583 ns | 0.86 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 25844 ns | 25487 ns | 1.01 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI | 1200537.5 ns | 1217335.5 ns | 0.99 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal | 496709 ns | 387333 ns | 1.28 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 209382 ns | 206202 ns | 1.02 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 7375 ns | 7375 ns | 1 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8104.5 ns | 8020.5 ns | 1.01 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 7500 ns | 7916 ns | 0.95 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 7375 ns | 7542 ns | 0.98 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 217033.5 ns | 209854.5 ns | 1.03 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 25754399 ns | 25469136 ns | 1.01 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal | 5254333.5 ns | 6294375 ns | 0.83 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 685977 ns | 684857 ns | 1.00 |
| batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) | 825125.5 ns | 833124.5 ns | 0.99 |
| batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) | 468584 ns | 467292 ns | 1.00 |
| batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) | 621500 ns | 621750 ns | 1.00 |
| batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) | 1536542 ns | 1543666 ns | 1.00 |
| batchedmm(128, Bsize=32)/forward/GPU/CUDA | 130845.5 ns | 130036 ns | 1.01 |
| batchedmm(128, Bsize=32)/forward/GPU/AMDGPU | 229862 ns | 230222 ns | 1.00 |
| batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) | 2661979 ns | 2684437.5 ns | 0.99 |
| batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) | 1535250.5 ns | 1538583 ns | 1.00 |
| batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) | 2000792 ns | 2002583 ns | 1.00 |
| batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) | 4906416 ns | 4933354 ns | 0.99 |
| batchedmm(128, Bsize=32)/zygote/GPU/CUDA | 242304 ns | 243369 ns | 1.00 |
| batchedmm(128, Bsize=32)/zygote/GPU/AMDGPU | 841449 ns | 836303.5 ns | 1.01 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 292 ns | 250 ns | 1.17 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 375 ns | 1 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 250 ns | 375 ns | 0.67 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 291 ns | 334 ns | 0.87 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 32216 ns | 31581 ns | 1.02 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI | 1218492 ns | 1181114.5 ns | 1.03 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal | 464375 ns | 425666.5 ns | 1.09 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 47630 ns | 49050 ns | 0.97 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6125 ns | 6291 ns | 0.97 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 6708 ns | 6708.5 ns | 1.00 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 6500 ns | 6667 ns | 0.97 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6375 ns | 6375 ns | 1 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 224154.5 ns | 222549 ns | 1.01 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI | 21407773 ns | 20723673 ns | 1.03 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal | 4615291 ns | 5408500 ns | 0.85 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 357793.5 ns | 364253.5 ns | 0.98 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 2392708 ns | 2412916 ns | 0.99 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 2371959 ns | 2399708 ns | 0.99 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 2404416 ns | 2391250 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 2370084 ns | 2406375 ns | 0.98 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 200035.5 ns | 201130.5 ns | 0.99 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 7868335 ns | 8039466.5 ns | 0.98 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1597041.5 ns | 1500813 ns | 1.06 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 373933 ns | 371169 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 4648292 ns | 4645417 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 4644250 ns | 4666145.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 4636708 ns | 4648375 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4642750 ns | 4646334 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 891890 ns | 899895.5 ns | 0.99 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 46027858 ns | 47712828 ns | 0.96 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 6938541.5 ns | 6893375 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1391633 ns | 1384804 ns | 1.00 |
| bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 7187.5 ns | 7083 ns | 1.01 |
| bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 7542 ns | 7000 ns | 1.08 |
| bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 7125 ns | 7750 ns | 0.92 |
| bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 6875 ns | 6792 ns | 1.01 |
| bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA | 23289 ns | 23107 ns | 1.01 |
| bias_activation(512, act=relu)(512 x 128)/forward/GPU/oneAPI | 1167669 ns | 1160499 ns | 1.01 |
| bias_activation(512, act=relu)(512 x 128)/forward/GPU/Metal | 243458.5 ns | 282458 ns | 0.86 |
| bias_activation(512, act=relu)(512 x 128)/forward/GPU/AMDGPU | 39800 ns | 40431 ns | 0.98 |
| bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 46396.5 ns | 48667 ns | 0.95 |
| bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 32917 ns | 57125 ns | 0.58 |
| bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 45875.5 ns | 51042 ns | 0.90 |
| bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 67312 ns | 33354.5 ns | 2.02 |
| bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA | 214725 ns | 215404 ns | 1.00 |
| bias_activation(512, act=relu)(512 x 128)/zygote/GPU/oneAPI | 10485830 ns | 10709204 ns | 0.98 |
| bias_activation(512, act=relu)(512 x 128)/zygote/GPU/Metal | 1121562 ns | 2066833 ns | 0.54 |
| bias_activation(512, act=relu)(512 x 128)/zygote/GPU/AMDGPU | 269102.5 ns | 264313 ns | 1.02 |
| batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) | 19604.5 ns | 22854 ns | 0.86 |
| batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) | 24021 ns | 24375.5 ns | 0.99 |
| batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) | 23750 ns | 24917 ns | 0.95 |
| batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) | 5084 ns | 5209 ns | 0.98 |
| batchedmm(2, Bsize=512)/forward/GPU/CUDA | 17227 ns | 16790 ns | 1.03 |
| batchedmm(2, Bsize=512)/forward/GPU/AMDGPU | 83741 ns | 89191 ns | 0.94 |
| batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) | 11916 ns | 12250 ns | 0.97 |
| batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) | 9354.5 ns | 9375 ns | 1.00 |
| batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) | 10417 ns | 10604.5 ns | 0.98 |
| batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) | 17958 ns | 18083 ns | 0.99 |
| batchedmm(2, Bsize=512)/zygote/GPU/CUDA | 225890 ns | 225960 ns | 1.00 |
| batchedmm(2, Bsize=512)/zygote/GPU/AMDGPU | 371753 ns | 387419 ns | 0.96 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 404000 ns | 406584 ns | 0.99 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 222584 ns | 223292 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 296875 ns | 297000 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 762667 ns | 762667 ns | 1 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA | 46288 ns | 45879 ns | 1.01 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/oneAPI | 1401617.5 ns | 1417981 ns | 0.99 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/Metal | 358375 ns | 424354.5 ns | 0.84 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/AMDGPU | 89491 ns | 89741 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 1480896 ns | 1486000.5 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 888250 ns | 892208.5 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 1164959 ns | 1169500 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 2465417 ns | 2471625 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA | 288016 ns | 279157 ns | 1.03 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/oneAPI | 12678894 ns | 13109750 ns | 0.97 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/Metal | 2117375 ns | 2047333 ns | 1.03 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/AMDGPU | 381744 ns | 376633 ns | 1.01 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 432125 ns | 433500 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 430333 ns | 430292 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 436917 ns | 436292 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 448604.5 ns | 446958 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 54122.5 ns | 54004 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 1002212 ns | 1003277 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1059021 ns | 1090562.5 ns | 0.97 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 234952 ns | 236733 ns | 0.99 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 3895042 ns | 3866292 ns | 1.01 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 4004458 ns | 4019812.5 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 4030291.5 ns | 4022583.5 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 3789979 ns | 3812208.5 ns | 0.99 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 260055 ns | 261348.5 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 30675954 ns | 32496173.5 ns | 0.94 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 10349458.5 ns | 10504750 ns | 0.99 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1223712 ns | 1365148 ns | 0.90 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 8750 ns | 8708 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 6917 ns | 6958 ns | 0.99 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 7583 ns | 7667 ns | 0.99 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 12416 ns | 12417 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA | 23553.5 ns | 23411 ns | 1.01 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/oneAPI | 2134096 ns | 2120051 ns | 1.01 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/Metal | 214667 ns | 229334 ns | 0.94 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/AMDGPU | 211142 ns | 208012 ns | 1.02 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 44958 ns | 45583 ns | 0.99 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 45083 ns | 45291 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 45000 ns | 45416 ns | 0.99 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 44958 ns | 45042 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA | 344550 ns | 345424.5 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/oneAPI | 14001329.5 ns | 13588599 ns | 1.03 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/Metal | 1862458 ns | 1751750 ns | 1.06 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/AMDGPU | 659011.5 ns | 653876 ns | 1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
122729
ns113812.5
ns1.08
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
83521
ns90020.5
ns0.93
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
87354.5
ns88625
ns0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
105375
ns81000
ns1.30
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
190055
ns190227.5
ns1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI
5969481
ns6167893
ns0.97
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal
1972791.5
ns2705500
ns0.73
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU
214447
ns221462
ns0.97
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
2012458.5
ns1871229
ns1.08
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
1980000
ns2028479
ns0.98
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
2023917
ns2015645.5
ns1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
2011645.5
ns2020395.5
ns1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
529776
ns534895
ns0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI
29142428
ns28188330
ns1.03
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal
9305500.5
ns9724208
ns0.96
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU
1088680
ns1078565.5
ns1.01
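
For readers checking the numbers by hand: in every row above, the last value matches the first timing divided by the second, rounded to two decimals, so a value above 1 means the newer run was slower. The sketch below reproduces that arithmetic on three rows copied from this report. The helper names `ratio` and `classify`, and the 10% `tolerance` cutoff, are assumptions made for illustration; they are not settings taken from the benchmark workflow.

```python
def ratio(current_ns, previous_ns):
    """Return current / previous, rounded to two decimals as shown in the report."""
    return round(current_ns / previous_ns, 2)


def classify(current_ns, previous_ns, tolerance=0.10):
    """Label a row as slower, faster, or unchanged.

    `tolerance` is a hypothetical cutoff chosen for this sketch only.
    """
    r = current_ns / previous_ns
    if r > 1 + tolerance:
        return "slower"
    if r < 1 - tolerance:
        return "faster"
    return "unchanged"


# (current ns, previous ns) pairs copied from three rows of the report.
rows = {
    "groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)": (105375, 81000),
    "groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal": (1972791.5, 2705500),
    "dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s)": (12416, 12417),
}

for name, (cur, prev) in rows.items():
    print(f"{ratio(cur, prev):.2f}  {classify(cur, prev):9}  {name}")
```

Timings in the tens of microseconds, such as the CPU `groupnorm` forward rows, can shift noticeably between runs, so a single outlying ratio is better read as a prompt to re-run than as a confirmed regression.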
This comment was automatically generated by workflow using github-action-benchmark.