This repository has been archived by the owner on Nov 4, 2024. It is now read-only.
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Showing 3 changed files with 6 additions and 7 deletions.
@@ -5,4 +5,4 @@ using Static: True

Utils.is_extension_loaded(::Val{:Enzyme}) = True()

end
end
99fc6ac
@JuliaRegistrator register
Registration pull request created: JuliaRegistries/General/115662
Tip: Release Notes
Did you know you can add release notes too? Just add Markdown-formatted text underneath the comment, after the text "Release notes:", and it will be added to the registry PR; if TagBot is installed, it will also be added to the release that TagBot creates.
To add them here, just re-invoke and the PR will be updated.
Tagging
After the above pull request is merged, it is recommended that a tag is created on this repository for the registered package version.
This will be done automatically if the Julia TagBot GitHub Action is installed, or can be done manually through the GitHub interface, or via:
LuxLib Benchmarks
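Each benchmark entry in this comment pairs two timings in nanoseconds with a ratio. As a rough illustration of how those ratios can be reproduced, the sketch below divides the first timing by the second. Reading the first figure as the current commit's timing and the second as the previous one is an assumption, although it is consistent with the reported ratios. The helper names and the `tolerance` band are hypothetical and are not part of LuxLib or of the benchmark tooling.

```python
def timing_ratio(current_ns: float, previous_ns: float) -> float:
    """Return current / previous; values above 1 mean the current run took longer."""
    if previous_ns <= 0:
        raise ValueError("previous timing must be positive")
    return current_ns / previous_ns


def classify(ratio: float, tolerance: float = 1.05) -> str:
    """Label a ratio; `tolerance` is a hypothetical noise band, not a project setting."""
    if ratio > tolerance:
        return "slower"
    if ratio < 1 / tolerance:
        return "faster"
    return "unchanged"


# (first ns, second ns) pairs copied from the
# layernorm(2, act=gelu, affine=false)(4 x 32) entries in this comment.
samples = {
    "forward/CPU/2 thread(s)": (7000, 5666),
    "forward/GPU/CUDA": (88896, 117778),
    "zygote/GPU/AMDGPU": (665927, 11501326),
}

for name, (current, previous) in samples.items():
    r = timing_ratio(current, previous)
    print(f"{name}: {r:.2f} ({classify(r)})")
```

Under this reading, a ratio well below 1 (as in most CUDA rows) would indicate that the first timing is the shorter of the two, while the many CPU rows near 1.00 would fall inside any reasonable noise band.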
| Benchmark suite | Current | Previous | Ratio |
|---|---|---|---|
| layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 7000 ns | 5666 ns | 1.24 |
| layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 5874.5 ns | 5667 ns | 1.04 |
| layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 8250 ns | 7062.5 ns | 1.17 |
| layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 5625 ns | 5541.5 ns | 1.02 |
| layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 88896 ns | 117778 ns | 0.75 |
| layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 400425 ns | 404275 ns | 0.99 |
| layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 9958 ns | 9937.5 ns | 1.00 |
| layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 9708 ns | 10041 ns | 0.97 |
| layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 9875 ns | 10291 ns | 0.96 |
| layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 9979.5 ns | 9875 ns | 1.01 |
| layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 370778 ns | 544239 ns | 0.68 |
| layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 665927 ns | 11501326 ns | 0.0579000195281831 |
| bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 1249.5 ns | 1416.5 ns | 0.88 |
| bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 3000 ns | 1479 ns | 2.03 |
| bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 1959 ns | 1625 ns | 1.21 |
| bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 1687.5 ns | 1542 ns | 1.09 |
| bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA | 13908 ns | 21518 ns | 0.65 |
| bias_activation(32, act=relu)(32 x 128)/forward/GPU/AMDGPU | 30060 ns | 29030 ns | 1.04 |
| bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 3959 ns | 4250 ns | 0.93 |
| bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 4291 ns | 4333 ns | 0.99 |
| bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 3875 ns | 4313 ns | 0.90 |
| bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 4375 ns | 4459 ns | 0.98 |
| bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA | 104640 ns | 145904.5 ns | 0.72 |
| bias_activation(32, act=relu)(32 x 128)/zygote/GPU/AMDGPU | 145602 ns | 145511 ns | 1.00 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 58042 ns | 58625 ns | 0.99 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 39708.5 ns | 39750 ns | 1.00 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 40084 ns | 40042 ns | 1.00 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 82708 ns | 83395.5 ns | 0.99 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 30831 ns | 37436 ns | 0.82 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 79190 ns | 80685.5 ns | 0.98 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2061042 ns | 2046125 ns | 1.01 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2079750 ns | 2077896 ns | 1.00 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2084916 ns | 2083625.5 ns | 1.00 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2001229 ns | 1999104 ns | 1.00 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 181552 ns | 229936 ns | 0.79 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1440455 ns | 1490545 ns | 0.97 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 148042 ns | 162312.5 ns | 0.91 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 148000 ns | 164083 ns | 0.90 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 155708 ns | 174959 ns | 0.89 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 176313 ns | 153854 ns | 1.15 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 168318 ns | 166305 ns | 1.01 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 203247.5 ns | 198262 ns | 1.03 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1122729.5 ns | 1121458.5 ns | 1.00 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1119625 ns | 1114979 ns | 1.00 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1125833 ns | 1119209 ns | 1.01 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1123854.5 ns | 1123521 ns | 1.00 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 539424 ns | 696644 ns | 0.77 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 912000 ns | 1026480.5 ns | 0.89 |
| layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 4625 ns | 4875 ns | 0.95 |
| layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 5084 ns | 4916 ns | 1.03 |
| layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 6125 ns | 5875 ns | 1.04 |
| layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 4125 ns | 5375 ns | 0.77 |
| layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 60787 ns | 92112 ns | 0.66 |
| layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 67560 ns | 69791 ns | 0.97 |
| layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8500 ns | 8875 ns | 0.96 |
| layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8584 ns | 8917 ns | 0.96 |
| layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8667 ns | 8959 ns | 0.97 |
| layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8417 ns | 8625 ns | 0.98 |
| layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 418528 ns | 596620 ns | 0.70 |
| layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 384969 ns | 389954 ns | 0.99 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 17542 ns | 18312 ns | 0.96 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 17542 ns | 18104.5 ns | 0.97 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 20458 ns | 20021 ns | 1.02 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 18770.5 ns | 17771 ns | 1.06 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 59728.5 ns | 67875.5 ns | 0.88 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 76240 ns | 77581 ns | 0.98 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 224208 ns | 235917 ns | 0.95 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 219500 ns | 212458 ns | 1.03 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 221312.5 ns | 213667 ns | 1.04 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 213000 ns | 225292 ns | 0.95 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 293183.5 ns | 353373 ns | 0.83 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 463935 ns | 470510 ns | 0.99 |
| bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 667 ns | 708 ns | 0.94 |
| bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 625 ns | 625 ns | 1 |
| bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 916 ns | 959 ns | 0.96 |
| bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 625 ns | 729.5 ns | 0.86 |
| bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA | 13248 ns | 20362 ns | 0.65 |
| bias_activation(2, act=relu)(2 x 128)/forward/GPU/AMDGPU | 30930 ns | 32440 ns | 0.95 |
| bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 1459 ns | 1375 ns | 1.06 |
| bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 1417 ns | 1458 ns | 0.97 |
| bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 1417 ns | 1459 ns | 0.97 |
| bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 1417 ns | 1375 ns | 1.03 |
| bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA | 92361 ns | 125347.5 ns | 0.74 |
| bias_activation(2, act=relu)(2 x 128)/zygote/GPU/AMDGPU | 136232 ns | 135651 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7417 ns | 7458 ns | 0.99 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5333 ns | 5292 ns | 1.01 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 5416 ns | 5458 ns | 0.99 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10375 ns | 10416 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 18749 ns | 24280.5 ns | 0.77 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 48581 ns | 48481 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 231083 ns | 256833 ns | 0.90 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 237166.5 ns | 268834 ns | 0.88 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 241042 ns | 238167 ns | 1.01 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 255583 ns | 213521 ns | 1.20 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 154979 ns | 190543 ns | 0.81 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 646107 ns | 644671.5 ns | 1.00 |
| dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 4125 ns | 4125 ns | 1 |
| dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 4084 ns | 4083 ns | 1.00 |
| dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 4125 ns | 4084 ns | 1.01 |
| dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 4084 ns | 4083 ns | 1.00 |
| dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA | 19985 ns | 23269 ns | 0.86 |
| dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/AMDGPU | 46780 ns | 48260 ns | 0.97 |
| dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 16458 ns | 16542 ns | 0.99 |
| dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 16500 ns | 16542 ns | 1.00 |
| dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 16625 ns | 16833 ns | 0.99 |
| dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 16791 ns | 16583 ns | 1.01 |
| dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA | 176107 ns | 195985.5 ns | 0.90 |
| dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/AMDGPU | 175202 ns | 174616.5 ns | 1.00 |
| dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 511792 ns | 511667 ns | 1.00 |
| dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 331959 ns | 331875 ns | 1.00 |
| dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 332000 ns | 332042 ns | 1.00 |
| dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 865083 ns | 865458 ns | 1.00 |
| dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA | 116899.5 ns | 113196 ns | 1.03 |
| dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/AMDGPU | 241233 ns | 243182 ns | 0.99 |
| dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 2275354 ns | 2277833 ns | 1.00 |
| dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 1753833 ns | 1758208 ns | 1.00 |
| dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 1758916 ns | 1758041.5 ns | 1.00 |
| dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 3193500 ns | 3193625 ns | 1.00 |
| dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA | 203284.5 ns | 242653 ns | 0.84 |
| dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/AMDGPU | 738868 ns | 741122 ns | 1.00 |
| layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 7459 ns | 6396 ns | 1.17 |
| layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 6854.5 ns | 7021 ns | 0.98 |
| layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 6895.5 ns | 7583 ns | 0.91 |
| layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 6459 ns | 6084 ns | 1.06 |
| layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 84654 ns | 90386 ns | 0.94 |
| layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 65201 ns | 65841 ns | 0.99 |
| layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 11604 ns | 11812 ns | 0.98 |
| layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 11125 ns | 11729.5 ns | 0.95 |
| layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 12083 ns | 12250 ns | 0.99 |
| layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 12021 ns | 10125 ns | 1.19 |
| layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 566453.5 ns | 626387 ns | 0.90 |
| layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 408354 ns | 405759 ns | 1.01 |
| dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 500 ns | 500 ns | 1 |
| dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 541 ns | 542 ns | 1.00 |
| dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 583 ns | 542 ns | 1.08 |
| dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 541 ns | 500 ns | 1.08 |
| dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA | 20386 ns | 23421 ns | 0.87 |
| dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/AMDGPU | 47011 ns | 46570 ns | 1.01 |
| dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 2125 ns | 2083 ns | 1.02 |
| dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 2083 ns | 2208 ns | 0.94 |
| dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 2166 ns | 2167 ns | 1.00 |
| dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 2084 ns | 2084 ns | 1 |
| dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA | 228468 ns | 221475.5 ns | 1.03 |
| dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/AMDGPU | 179272 ns | 174101.5 ns | 1.03 |
| groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 8250 ns | 9041 ns | 0.91 |
| groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 8833 ns | 9292 ns | 0.95 |
| groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 9292 ns | 10375 ns | 0.90 |
| groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 8875 ns | 9000 ns | 0.99 |
| groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 107454 ns | 94379 ns | 1.14 |
| groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 74891 ns | 72281 ns | 1.04 |
| groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 16812.5 ns | 17375 ns | 0.97 |
| groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 17750 ns | 17729 ns | 1.00 |
| groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 19271 ns | 19209 ns | 1.00 |
| groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 17791.5 ns | 17562.5 ns | 1.01 |
| groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 534728 ns | 576225.5 ns | 0.93 |
| groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 378084 ns | 378363 ns | 1.00 |
| batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 500 ns | 500 ns | 1 |
| batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 500 ns | 542 ns | 0.92 |
| batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 625 ns | 625 ns | 1 |
| batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 500 ns | 500 ns | 1 |
| batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 27220 ns | 35667 ns | 0.76 |
| batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU | 48461 ns | 46061 ns | 1.05 |
| batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 10021 ns | 10687.5 ns | 0.94 |
| batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 9125 ns | 9083.5 ns | 1.00 |
| batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 9584 ns | 9750 ns | 0.98 |
| batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 9729 ns | 8666.5 ns | 1.12 |
| batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 168737.5 ns | 258995 ns | 0.65 |
| batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 367733.5 ns | 366948.5 ns | 1.00 |
| dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 399000 ns | 399292 ns | 1.00 |
| dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 215542 ns | 215291 ns | 1.00 |
| dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 215541 ns | 215292 ns | 1.00 |
| dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 756208 ns | 756083 ns | 1.00 |
| dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA | 110802 ns | 113061 ns | 0.98 |
| dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/AMDGPU | 76450 ns | 74731 ns | 1.02 |
| dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 1398875 ns | 1407958 ns | 0.99 |
| dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 858375 ns | 860333 ns | 1.00 |
| dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 861479 ns | 860854 ns | 1.00 |
| dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 2355542 ns | 2357500 ns | 1.00 |
| dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA | 178308 ns | 211180.5 ns | 0.84 |
| dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/AMDGPU | 321323 ns | 323393 ns | 0.99 |
| layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 7354 ns | 7125 ns | 1.03 |
| layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 7042 ns | 7542 ns | 0.93 |
| layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 8666.5 ns | 9000 ns | 0.96 |
| layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 7563 ns | 7250.5 ns | 1.04 |
| layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 114410.5 ns | 143379.5 ns | 0.80 |
| layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 65791 ns | 66420 ns | 0.99 |
| layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 13354.5 ns | 15250 ns | 0.88 |
| layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 13542 ns | 14959 ns | 0.91 |
| layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 15667 ns | 13687.5 ns | 1.14 |
| layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 14979 ns | 12333.5 ns | 1.21 |
| layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 689799.5 ns | 942342 ns | 0.73 |
| layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 423374 ns | 425844 ns | 0.99 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 25770.5 ns | 24646 ns | 1.05 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 25875 ns | 28000 ns | 0.92 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 29083 ns | 26666 ns | 1.09 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 27854 ns | 28334 ns | 0.98 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 168075.5 ns | 199235 ns | 0.84 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 114031 ns | 114286.5 ns | 1.00 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 118417 ns | 153084 ns | 0.77 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 119041 ns | 157166.5 ns | 0.76 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 141458.5 ns | 145958.5 ns | 0.97 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 155166 ns | 153417 ns | 1.01 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 861211 ns | 1075111 ns | 0.80 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 582431 ns | 585190.5 ns | 1.00 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 74666 ns | 76625 ns | 0.97 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 75750 ns | 76729 ns | 0.99 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 84875 ns | 81229 ns | 1.04 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 77084 ns | 79750 ns | 0.97 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 169153 ns | 206416.5 ns | 0.82 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 126942 ns | 129541 ns | 0.98 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 278291 ns | 307729 ns | 0.90 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 305021 ns | 294250 ns | 1.04 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 305833 ns | 290520.5 ns | 1.05 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 287270.5 ns | 291458 ns | 0.99 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 972909 ns | 1105738.5 ns | 0.88 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 695847 ns | 696697 ns | 1.00 |
| layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 16917 ns | 16875 ns | 1.00 |
| layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 17000 ns | 16500 ns | 1.03 |
| layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 18354.5 ns | 18375 ns | 1.00 |
| layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 16458 ns | 17584 ns | 0.94 |
| layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 113778 ns | 145532.5 ns | 0.78 |
| layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 231482 ns | 232517.5 ns | 1.00 |
| layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 27604.5 ns | 27125 ns | 1.02 |
| layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 25875 ns | 26750 ns | 0.97 |
| layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 26958.5 ns | 27208 ns | 0.99 |
| layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 28166.5 ns | 26604 ns | 1.06 |
| layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 702837 ns | 980431.5 ns | 0.72 |
| layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 696858 ns | 686517 ns | 1.02 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 10375 ns | 11625 ns | 0.89 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 10875 ns | 12250 ns | 0.89 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 13625 ns | 13875 ns | 0.98 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 10625 ns | 10458 ns | 1.02 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 112473.5 ns | 123683.5 ns | 0.91 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 236187.5 ns | 236852 ns | 1.00 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 21583 ns | 22709 ns | 0.95 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 22396 ns | 22063 ns | 1.02 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 22250 ns | 23083 ns | 0.96 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 22041 ns | 21833 ns | 1.01 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 556668 ns | 703893 ns | 0.79 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 670387 ns | 673557 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 65542 ns | 64250 ns | 1.02 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 64437.5 ns | 69208 ns | 0.93 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 66333 ns | 65937.5 ns | 1.01 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 66167 ns | 63250 ns | 1.05 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 96734 ns | 107264.5 ns | 0.90 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 232362 ns | 232543 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 437459 ns | 457334 ns | 0.96 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 479417 ns | 450791 ns | 1.06 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 438167 ns | 449333.5 ns | 0.98 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 498625 ns | 488708 ns | 1.02 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 442769 ns | 515904.5 ns | 0.86 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 712032 ns | 701456.5 ns | 1.02 |
| layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 7562.5 ns | 7333.5 ns | 1.03 |
| layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 7625 ns | 7750 ns | 0.98 |
| layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 8125 ns | 9208 ns | 0.88 |
| layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 7250 ns | 6979 ns | 1.04 |
| layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 113892.5 ns | 144382.5 ns | 0.79 |
| layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU | 69331 ns | 65051 ns | 1.07 |
| layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 14334 ns | 14354.5 ns | 1.00 |
| layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 14500 ns | 15459 ns | 0.94 |
| layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 16562 ns | 15000 ns | 1.10 |
| layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 11709 ns | 15604 ns | 0.75 |
| layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 675585.5 ns | 949171 ns | 0.71 |
| layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 399579 ns | 399874 ns | 1.00 |
| batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) | 6158208 ns | 6153958.5 ns | 1.00 |
| batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) | 3224959 ns | 3225750 ns | 1.00 |
| batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) | 3225125 ns | 3225687.5 ns | 1.00 |
| batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) | 11921125 ns | 11912750 ns | 1.00 |
| batchedmm(512, Bsize=4)/forward/GPU/CUDA | 347611.5 ns | 350232.5 ns | 0.99 |
| batchedmm(512, Bsize=4)/forward/GPU/AMDGPU | 322793 ns | 320283 ns | 1.01 |
| batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) | 19113166.5 ns | 19165042 ns | 1.00 |
| batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) | 11081437.5 ns | 11087125 ns | 1.00 |
| batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) | 11182250 ns | 11132791 ns | 1.00 |
| batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) | 36513062 ns | 36531187.5 ns | 1.00 |
| batchedmm(512, Bsize=4)/zygote/GPU/CUDA | 1026355 ns | 1015711 ns | 1.01 |
| batchedmm(512, Bsize=4)/zygote/GPU/AMDGPU | 1162657.5 ns | 1168797 ns | 0.99 |
| dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 958 ns | 958 ns | 1 |
| dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 958 ns | 1000 ns | 0.96 |
| dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 1041 ns | 1000 ns | 1.04 |
| dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 1000 ns | 917 ns | 1.09 |
| dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA | 20341 ns | 23879 ns | 0.85 |
| dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/AMDGPU | 206602 ns | 206962 ns | 1.00 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 3708 ns | 3667 ns | 1.01 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 3666 ns | 3750 ns | 0.98 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 3750 ns | 3709 ns | 1.01 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 3709 ns | 3667 ns | 1.01 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA | 243936 ns | 284113 ns | 0.86 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/AMDGPU | 622497 ns | 623016 ns | 1.00 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 8125 ns | 8312.5 ns | 0.98 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 8145.5 ns | 8604.5 ns | 0.95 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 10209 ns | 10083 ns | 1.01 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 7645.5 ns | 8146 ns | 0.94 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 110001.5 ns | 119881.5 ns | 0.92 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 64821 ns | 71901 ns | 0.90 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 11417 ns | 12166.5 ns | 0.94 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 12146 ns | 12145.5 ns | 1.00 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 12625 ns | 13313 ns | 0.95 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 12083 ns | 11395.5 ns | 1.06 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 533401.5 ns | 642520 ns | 0.83 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 351113 ns | 357894 ns | 0.98 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 291 ns | 291 ns | 1 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 292 ns | 333 ns | 0.88 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 333 ns | 292 ns | 1.14 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 333 ns | 291 ns | 1.14 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA | 20031 ns | 22935 ns | 0.87 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/AMDGPU | 47010 ns | 46631 ns | 1.01 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 2875 ns | 2917 ns | 0.99 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 2917 ns | 2917 ns | 1 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 3125 ns | 3167 ns | 0.99 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 3042 ns | 2958 ns | 1.03 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA | 139419 ns | 206899.5 ns | 0.67 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/AMDGPU | 160172 ns | 161012 ns | 0.99 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 11708 ns | 12500 ns | 0.94 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 11208 ns | 11354 ns | 0.99 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 12917 ns | 13083 ns | 0.99 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 11708 ns | 10958.5 ns | 1.07 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 52993 ns | 121271 ns | 0.44 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 232812 ns | 233822 ns | 1.00 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 20666.5 ns | 20291.5 ns | 1.02 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 20208 ns | 21083 ns | 0.96 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 22458 ns | 22187.5 ns | 1.01 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 21187.5 ns | 20104.5 ns | 1.05 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 249123.5 ns | 597659.5 ns | 0.42 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 648996.5 ns | 638656 ns | 1.02 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 4375 ns | 4375 ns | 1 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 4375 ns | 4417 ns | 0.99 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 4458 ns | 4417 ns | 1.01 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 4417 ns | 4416 ns | 1.00 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA | 20585 ns | 24156 ns | 0.85 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/AMDGPU | 48820 ns | 47331 ns | 1.03 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 16375 ns | 16167 ns | 1.01 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 16250 ns | 16375 ns | 0.99 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 16458 ns | 16333 ns | 1.01 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 16208 ns | 16333 ns | 0.99 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA | 169722 ns | 333657 ns | 0.51 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/AMDGPU | 209702 ns | 207757 ns | 1.01 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 1958 ns | 2125 ns | 0.92 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 1958 ns | 2125 ns | 0.92 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 2084 ns | 2084 ns | 1 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 2042 ns | 2041 ns | 1.00 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 28203 ns | 36462 ns | 0.77 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 202342 ns | 202982 ns | 1.00 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 17125 ns | 17021 ns | 1.01 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 16791.5 ns | 17625 ns | 0.95 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 17542 ns | 16667 ns | 1.05 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 17209 ns | 17083.5 ns | 1.01 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 147741 ns | 296284 ns | 0.50 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 682312 ns | 684797 ns | 1.00 |
| batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) | 59062 ns | 59562.5 ns | 0.99 |
| batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) | 62416 ns | 61667 ns | 1.01 |
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s)
61312.5
ns61875
ns0.99
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s)
53875
ns50958
ns1.06
batchedmm(16, Bsize=512)/forward/GPU/CUDA
71192
ns66679
ns1.07
batchedmm(16, Bsize=512)/forward/GPU/AMDGPU
116711
ns117392
ns0.99
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s)
202750.5
ns190771
ns1.06
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s)
98750
ns149541
ns0.66
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s)
118104
ns116312.5
ns1.02
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s)
297958
ns298166
ns1.00
batchedmm(16, Bsize=512)/zygote/GPU/CUDA
170047
ns219498
ns0.77
batchedmm(16, Bsize=512)/zygote/GPU/AMDGPU
616606
ns614646
ns1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
84208
ns83166.5
ns1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
83646
ns83395.5
ns1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
85166
ns110041.5
ns0.77
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
128334
ns83020.5
ns1.55
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
184384
ns190710.5
ns0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU
203702
ns206032
ns0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
1889375
ns1873645.5
ns1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
1916750
ns1919416
ns1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
1919083
ns1920792
ns1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
1899041
ns1919291.5
ns0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
379904
ns533490
ns0.71
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU
1068311
ns1074210
ns0.99
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s)
291
ns292
ns1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s)
291
ns292
ns1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s)
292
ns292
ns1
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s)
292
ns292
ns1
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA
18502
ns21800
ns0.85
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/AMDGPU
41550.5
ns43000
ns0.97
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s)
1750
ns1792
ns0.98
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s)
1791
ns1875
ns0.96
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s)
1833
ns1833
ns1
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s)
1834
ns1792
ns1.02
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA
145894.5
ns256181.5
ns0.57
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/AMDGPU
181622
ns182412
ns1.00
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s)
8458
ns8458
ns1
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s)
8937.5
ns9958
ns0.90
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s)
11208.5
ns11708
ns0.96
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s)
8875
ns7583
ns1.17
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA
51415
ns119063.5
ns0.43
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU
232043
ns234272.5
ns0.99
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s)
9125
ns9208
ns0.99
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s)
8667
ns9854
ns0.88
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s)
10458.5
ns9792
ns1.07
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s)
9583
ns8750
ns1.10
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA
241818.5
ns528065
ns0.46
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU
623402
ns634101
ns0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
58604.5
ns58208
ns1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
39333
ns39375
ns1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
39792
ns39959
ns1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
83417
ns83291
ns1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA
32658
ns39916.5
ns0.82
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU
79585.5
ns79101
ns1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
1931459
ns1906833
ns1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
1973750
ns1969916.5
ns1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
1980958.5
ns1979458
ns1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
1884875
ns1901458
ns0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
152863
ns221725
ns0.69
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU
1040311
ns1161491.5
ns0.90
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
418333
ns417125
ns1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
418709
ns420562.5
ns1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
422000
ns422103.5
ns1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
418583.5
ns417979
ns1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA
94366
ns210226
ns0.45
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU
281763
ns283213
ns0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
673562.5
ns680083.5
ns0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
753812.5
ns675125
ns1.12
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
769958
ns672375
ns1.15
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
751938
ns672542
ns1.12
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
470483
ns1049720
ns0.45
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU
903129
ns908698.5
ns0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
3419645.5
ns3405187.5
ns1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
3437875
ns3449917
ns1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
3451375
ns3463646
ns1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
3429042
ns3430687
ns1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA
140481
ns170640
ns0.82
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU
441684
ns450759.5
ns0.98
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
6220250
ns6244167
ns1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
6224937
ns6219417
ns1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
6214292
ns6254812
ns0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
6141041.5
ns6201688
ns0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
620637
ns1001354
ns0.62
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU
1629761.5
ns1637156.5
ns1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s)
474958
ns474833
ns1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s)
253000
ns253792
ns1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s)
253292
ns253584
ns1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s)
901709
ns901250
ns1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA
43146
ns47396
ns0.91
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/AMDGPU
241942.5
ns241892
ns1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s)
2271000
ns2269791
ns1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s)
1763792
ns1760416
ns1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s)
1760167
ns1763687.5
ns1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s)
3188958
ns3197937.5
ns1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA
200260
ns271388
ns0.74
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/AMDGPU
764328
ns765898
ns1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
58125
ns58541
ns0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
39334
ns39292
ns1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
39750
ns39792
ns1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
83375
ns84166
ns0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA
23268
ns28606
ns0.81
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU
74721
ns73921
ns1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
2035750
ns2031396
ns1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
2088417
ns2088958.5
ns1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
2090333
ns2084000
ns1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
1963541
ns1977812.5
ns0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
155158
ns235137
ns0.66
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU
1195637.5
ns1110895.5
ns1.08
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
58625
ns58667
ns1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
39834
ns39833
ns1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
40083
ns40000
ns1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
83042
ns83291
ns1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
41354
ns49806.5
ns0.83
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU
77975.5
ns76691
ns1.02
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
1927125
ns1930083.5
ns1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
1971541.5
ns1967645.5
ns1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
1976833
ns1961750
ns1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
1885312.5
ns1797166
ns1.05
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
164726
ns240260.5
ns0.69
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU
1051246
ns929734.5
ns1.13
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s)
291
ns250
ns1.16
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s)
333
ns375
ns0.89
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s)
416
ns416
ns1
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s)
292
ns292
ns1
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA
26436
ns35036
ns0.75
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU
46511
ns46470
ns1.00
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s)
7333
ns7584
ns0.97
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s)
6500
ns6875
ns0.95
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s)
6917
ns7458
ns0.93
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s)
7834
ns5916
ns1.32
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA
132779
ns213960
ns0.62
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU
364088.5
ns368994
ns0.99
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s)
250
ns250
ns1
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s)
291
ns292
ns1.00
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s)
292
ns292
ns1
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s)
292
ns250
ns1.17
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA
30026
ns33302
ns0.90
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/AMDGPU
40500
ns36481
ns1.11
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s)
3250
ns2959
ns1.10
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s)
2958
ns3083
ns0.96
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s)
3042
ns3042
ns1
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s)
2792
ns2625
ns1.06
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA
139460
ns192793
ns0.72
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/AMDGPU
156362
ns151232
ns1.03
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
453562
ns420458.5
ns1.08
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
426854
ns458333.5
ns0.93
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
424771
ns443562.5
ns0.96
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
454396.5
ns454625
ns1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA
128743
ns138662
ns0.93
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU
374513
ns376564
ns0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
3812646
ns3808250
ns1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
3818687.5
ns3812458
ns1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
3824687.5
ns3814333.5
ns1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
3809020.5
ns3779687.5
ns1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
467612
ns712866
ns0.66
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU
1414714
ns1464519
ns0.97
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s)
49937813
ns49902208
ns1.00
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s)
25988125
ns26041000
ns1.00
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s)
26009646
ns26000917
ns1.00
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s)
97113375
ns97099875
ns1.00
batchedmm(512, Bsize=32)/forward/GPU/CUDA
1610536
ns1600470
ns1.01
batchedmm(512, Bsize=32)/forward/GPU/AMDGPU
1049471
ns1045150
ns1.00
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s)
154792729.5
ns154793291.5
ns1.00
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s)
89048958.5
ns88667041.5
ns1.00
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s)
89207416
ns89550541
ns1.00
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s)
294786708.5
ns294974291.5
ns1.00
batchedmm(512, Bsize=32)/zygote/GPU/CUDA
6494841
ns6495543
ns1.00
batchedmm(512, Bsize=32)/zygote/GPU/AMDGPU
5562936
ns5606170
ns0.99
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s)
18916.5
ns18750
ns1.01
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s)
15584
ns15666.5
ns0.99
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s)
14667
ns14167
ns1.04
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s)
15896
ns15270.5
ns1.04
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA
13971
ns20352.5
ns0.69
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/AMDGPU
27630
ns25851
ns1.07
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s)
11291
ns11041
ns1.02
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s)
7458.5
ns7833
ns0.95
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s)
7750
ns7958
ns0.97
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s)
17520.5
ns17083
ns1.03
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA
101782
ns261162.5
ns0.39
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/AMDGPU
148192
ns148401.5
ns1.00
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s)
9541.5
ns8375
ns1.14
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s)
9125.5
ns9083
ns1.00
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s)
10333
ns10583
ns0.98
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s)
8542
ns7916.5
ns1.08
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA
53666.5
ns113294.5
ns0.47
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU
235372
ns234072
ns1.01
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
9541
ns10521
ns0.91
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
10209
ns10416.5
ns0.98
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
10458
ns10042
ns1.04
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
10250
ns9666.5
ns1.06
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA
269358
ns615911
ns0.44
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU
652326
ns655506
ns1.00
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s)
9812.5
ns9625
ns1.02
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s)
9250
ns9833
ns0.94
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s)
10812.5
ns12042
ns0.90
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s)
9562.5
ns8479
ns1.13
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA
53391
ns120314
ns0.44
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU
71711
ns71931
ns1.00
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
14333
ns13083
ns1.10
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
14083
ns15021
ns0.94
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
15167
ns14542
ns1.04
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
16625
ns13417
ns1.24
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA
251184.5
ns587303
ns0.43
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU
344093
ns344908.5
ns1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s)
458
ns459
ns1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s)
458
ns583
ns0.79
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s)
583
ns542
ns1.08
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s)
583
ns459
ns1.27
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA
27208
ns34757
ns0.78
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU
203792
ns201632
ns1.01
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s)
8625
ns7333.5
ns1.18
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s)
8125
ns9270.5
ns0.88
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s)
8604.5
ns7833
ns1.10
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s)
8416.5
ns7229.5
ns1.16
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA
147255
ns231923.5
ns0.63
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU
656126
ns657851
ns1.00
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s)
16625
ns15875
ns1.05
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s)
14500
ns14645.5
ns0.99
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s)
13354
ns12167
ns1.10
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s)
10229
ns10375
ns0.99
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA
13896.5
ns21214
ns0.66
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/AMDGPU
186472
ns184672
ns1.01
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s)
31750
ns31375
ns1.01
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s)
32000
ns32416
ns0.99
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s)
32042
ns32270.5
ns0.99
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s)
31833
ns31541
ns1.01
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA
110682.5
ns276539
ns0.40
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/AMDGPU
592116
ns588126
ns1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
450209
ns444792
ns1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
445500
ns484417
ns0.92
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
444167
ns448792
ns0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
462958
ns443250
ns1.04
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
188096.5
ns194813
ns0.97
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU
367068.5
ns367924
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
3834209
ns3843833
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
3836666
ns3831916.5
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
3847459
ns3838417
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
3828250
ns3835042
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
383846
ns537386
ns0.71
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU
1358354
ns1358632
ns1.00
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s)
784152667
ns784101083
ns1.00
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s)
416079687.5
ns418358083
ns0.99
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s)
422584917
ns418383604.5
ns1.01
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s)
1509956229
ns1504938187.5
ns1.00
batchedmm(512, Bsize=512)/forward/GPU/CUDA
22771101.5
ns22745060.5
ns1.00
batchedmm(512, Bsize=512)/forward/GPU/AMDGPU
14743999
ns14695345
ns1.00
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s)
2524849666
ns2524662875
ns1.00
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s)
1511960000
ns1518103167
ns1.00
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s)
1536159417
ns1524361625
ns1.01
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s)
4778947333
ns4741835375
ns1.01
batchedmm(512, Bsize=512)/zygote/GPU/CUDA
119521542
ns366822106
ns0.33
batchedmm(512, Bsize=512)/zygote/GPU/AMDGPU
87915389
ns88277685
ns1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
78208.5
ns76417
ns1.02
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
80271
ns76792
ns1.05
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
82708
ns80333
ns1.03
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
77334
ns77208
ns1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA
93705
ns206105.5
ns0.45
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU
118801
ns118901
ns1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
291334
ns191562.5
ns1.52
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
210333
ns287750
ns0.73
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
261874.5
ns209417
ns1.25
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
202208.5
ns253812.5
ns0.80
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
458544
ns1033097.5
ns0.44
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU
662017
ns658411
ns1.01
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s)
200217604
ns200015521
ns1.00
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s)
103846750
ns103790000.5
ns1.00
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s)
104247042
ns104076875
ns1.00
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s)
389363833
ns389226000
ns1.00
batchedmm(512, Bsize=128)/forward/GPU/CUDA
5840254.5
ns5819295
ns1.00
batchedmm(512, Bsize=128)/forward/GPU/AMDGPU
3591326
ns3575713
ns1.00
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s)
620550500
ns621801500
ns1.00
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s)
352840416.5
ns353125646
ns1.00
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s)
353679646
ns354434874.5
ns1.00
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s)
1181355417
ns1181638875
ns1.00
batchedmm(512, Bsize=128)/zygote/GPU/CUDA
26562043
ns26630294
ns1.00
batchedmm(512, Bsize=128)/zygote/GPU/AMDGPU
22008202.5
ns22185623
ns0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
7167
ns7167
ns1
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
5292
ns5375
ns0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
5458
ns5375
ns1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
10000
ns10500
ns0.95
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA
20844
ns27436
ns0.76
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU
48671
ns46631
ns1.04
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
245770.5
ns212500
ns1.16
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
243083
ns220750
ns1.10
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
221208
ns220458
ns1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
207979
ns206104.5
ns1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
137816.5
ns220558
ns0.62
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU
523805
ns523545
ns1.00
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s)
8334
ns10541.5
ns0.79
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s)
8166.5
ns9541.5
ns0.86
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s)
11041
ns10875
ns1.02
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s)
9020.5
ns8312
ns1.09
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA
50777
ns117824.5
ns0.43
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU
69381
ns70451
ns0.98
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s)
8875
ns7583.5
ns1.17
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s)
8583
ns9792
ns0.88
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s)
8166
ns8187.5
ns1.00
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s)
10854.5
ns7562.5
ns1.44
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA
245858
ns515354.5
ns0.48
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU
312998.5
ns318733
ns0.98
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s)
500
ns500
ns1
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s)
500
ns583
ns0.86
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s)
625
ns625
ns1
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s)
584
ns459
ns1.27
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA
19411
ns26054
ns0.75
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU
48630
ns46610
ns1.04
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
10333
ns9083
ns1.14
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
11375
ns9604
ns1.18
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
9770.5
ns8958
ns1.09
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
9708
ns9166
ns1.06
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA
120697
ns252407.5
ns0.48
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU
388289
ns388539
ns1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s)
105500
ns107458.5
ns0.98
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s)
85875
ns84708
ns1.01
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s)
87000
ns86000
ns1.01
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s)
146333.5
ns146750
ns1.00
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA
16870
ns23950.5
ns0.70
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/AMDGPU
190057
ns191282
ns0.99
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s)
478500
ns516625
ns0.93
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s)
485458
ns502312.5
ns0.97
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s)
481521
ns478354.5
ns1.01
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s)
478833
ns498167
ns0.96
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA
117100
ns232559
ns0.50
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/AMDGPU
608201.5
ns606451
ns1.00
| batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) | 5959 ns | 5250 ns | 1.14 |
| batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) | 6625 ns | 6500 ns | 1.02 |
| batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) | 7479.5 ns | 7749.5 ns | 0.97 |
| batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) | 6229.5 ns | 5687.5 ns | 1.10 |
| batchedmm(16, Bsize=32)/forward/GPU/CUDA | 14736 ns | 16126.5 ns | 0.91 |
| batchedmm(16, Bsize=32)/forward/GPU/AMDGPU | 79970 ns | 85781 ns | 0.93 |
| batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) | 13500 ns | 11625 ns | 1.16 |
| batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) | 9750 ns | 9917 ns | 0.98 |
| batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) | 10167 ns | 10104.5 ns | 1.01 |
| batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) | 17125 ns | 16584 ns | 1.03 |
| batchedmm(16, Bsize=32)/zygote/GPU/CUDA | 109548 ns | 215162.5 ns | 0.51 |
| batchedmm(16, Bsize=32)/zygote/GPU/AMDGPU | 366884 ns | 378354 ns | 0.97 |
| batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) | 40458 ns | 38708 ns | 1.05 |
| batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) | 50417 ns | 51125 ns | 0.99 |
| batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) | 51354 ns | 52146 ns | 0.98 |
| batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) | 13667 ns | 14417 ns | 0.95 |
| batchedmm(16, Bsize=128)/forward/GPU/CUDA | 20278.5 ns | 19504 ns | 1.04 |
| batchedmm(16, Bsize=128)/forward/GPU/AMDGPU | 85591 ns | 93401 ns | 0.92 |
| batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) | 37250 ns | 36334 ns | 1.03 |
| batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) | 29541 ns | 28167 ns | 1.05 |
| batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) | 29875 ns | 28625 ns | 1.04 |
| batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) | 57562.5 ns | 56895.5 ns | 1.01 |
| batchedmm(16, Bsize=128)/zygote/GPU/CUDA | 119274.5 ns | 190765 ns | 0.63 |
| batchedmm(16, Bsize=128)/zygote/GPU/AMDGPU | 395964 ns | 410848.5 ns | 0.96 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) | 1833 ns | 1666.5 ns | 1.10 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) | 1667 ns | 2000 ns | 0.83 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) | 2291 ns | 2167 ns | 1.06 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) | 2041.5 ns | 1667 ns | 1.22 |
| bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA | 13524 ns | 20338 ns | 0.66 |
| bias_activation(2, act=tanh)(2 x 128)/forward/GPU/AMDGPU | 32690 ns | 32440 ns | 1.01 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) | 2167 ns | 2042 ns | 1.06 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) | 2145.5 ns | 2375 ns | 0.90 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) | 2395.5 ns | 2417 ns | 0.99 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) | 2312.5 ns | 2083 ns | 1.11 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA | 89460.5 ns | 202489 ns | 0.44 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/AMDGPU | 136351 ns | 136411 ns | 1.00 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 6104 ns | 6750 ns | 0.90 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 4708.5 ns | 4833 ns | 0.97 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 6187.5 ns | 5896 ns | 1.05 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 5874.5 ns | 4916.5 ns | 1.19 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 58659.5 ns | 142403 ns | 0.41 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU | 67281 ns | 69051 ns | 0.97 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 9083.5 ns | 8395.5 ns | 1.08 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 9000 ns | 8625 ns | 1.04 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8709 ns | 8542 ns | 1.02 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8750 ns | 8292 ns | 1.06 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 386636 ns | 858082 ns | 0.45 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 384884 ns | 388048.5 ns | 0.99 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 56916 ns | 56834 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 56833 ns | 56916 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 56958 ns | 56917 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 58291 ns | 58291 ns | 1 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 29539 ns | 37048 ns | 0.80 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 203102.5 ns | 204772 ns | 0.99 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 453791.5 ns | 484583.5 ns | 0.94 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 466875 ns | 475541.5 ns | 0.98 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 465666.5 ns | 465562.5 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 436208 ns | 445666 ns | 0.98 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 167893 ns | 263380 ns | 0.64 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 823238 ns | 819218 ns | 1.00 |
| batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) | 3327646 ns | 3332458 ns | 1.00 |
| batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) | 1773958 ns | 1767958 ns | 1.00 |
| batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) | 1770208 ns | 1766125 ns | 1.00 |
| batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) | 6318167 ns | 6295583.5 ns | 1.00 |
| batchedmm(128, Bsize=128)/forward/GPU/CUDA | 203665 ns | 206330 ns | 0.99 |
| batchedmm(128, Bsize=128)/forward/GPU/AMDGPU | 213597.5 ns | 212392 ns | 1.01 |
| batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) | 11522375 ns | 11495438 ns | 1.00 |
| batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) | 6550792 ns | 6565688 ns | 1.00 |
| batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) | 6579708.5 ns | 6570438 ns | 1.00 |
| batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) | 21256687.5 ns | 21167562.5 ns | 1.00 |
| batchedmm(128, Bsize=128)/zygote/GPU/CUDA | 761872 ns | 737845 ns | 1.03 |
| batchedmm(128, Bsize=128)/zygote/GPU/AMDGPU | 1057191 ns | 1062630 ns | 0.99 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 6667 ns | 4833 ns | 1.38 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 4917 ns | 5583 ns | 0.88 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 7000 ns | 7333 ns | 0.95 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 5166 ns | 4500 ns | 1.15 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 57961.5 ns | 136011 ns | 0.43 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU | 56041 ns | 56600 ns | 0.99 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 11458 ns | 7125 ns | 1.61 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 8750 ns | 7500 ns | 1.17 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7541 ns | 7541.5 ns | 1.00 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 8625 ns | 7292 ns | 1.18 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 382208 ns | 746443 ns | 0.51 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 361754 ns | 370888 ns | 0.98 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 126917 ns | 155000 ns | 0.82 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 102541 ns | 124709 ns | 0.82 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 101792 ns | 98541 ns | 1.03 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 98333 ns | 98709 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 127201 ns | 150159 ns | 0.85 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 206327 ns | 204262 ns | 1.01 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2039750.5 ns | 2031188 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2028645.5 ns | 2031500 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2040937.5 ns | 2037125 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1948458 ns | 2033000 ns | 0.96 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 443232 ns | 697162 ns | 0.64 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1211817 ns | 1208931 ns | 1.00 |
| batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) | 33542 ns | 33209 ns | 1.01 |
| batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) | 34416 ns | 34833 ns | 0.99 |
| batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) | 34583 ns | 33042 ns | 1.05 |
| batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) | 625 ns | 541 ns | 1.16 |
| batchedmm(2, Bsize=4)/forward/GPU/CUDA | 13510 ns | 15393 ns | 0.88 |
| batchedmm(2, Bsize=4)/forward/GPU/AMDGPU | 79871 ns | 79290 ns | 1.01 |
| batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) | 3750 ns | 2583 ns | 1.45 |
| batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) | 3209 ns | 3083 ns | 1.04 |
| batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) | 3041 ns | 3209 ns | 0.95 |
| batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) | 2333 ns | 2125 ns | 1.10 |
| batchedmm(2, Bsize=4)/zygote/GPU/CUDA | 89708.5 ns | 138753 ns | 0.65 |
| batchedmm(2, Bsize=4)/zygote/GPU/AMDGPU | 340203 ns | 341213 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7209 ns | 7250 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5292 ns | 5416 ns | 0.98 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 5417 ns | 5416 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10042 ns | 10458 ns | 0.96 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 29375 ns | 36086 ns | 0.81 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 49300 ns | 49460 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 222374.5 ns | 213395.5 ns | 1.04 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 221270.5 ns | 227750 ns | 0.97 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 221458 ns | 220792 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 206500 ns | 205667 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 159760 ns | 240787.5 ns | 0.66 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 572920.5 ns | 569246 ns | 1.01 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3958 ns | 3917 ns | 1.01 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3917 ns | 3959 ns | 0.99 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3958 ns | 3917 ns | 1.01 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3958 ns | 3917 ns | 1.01 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA | 18490 ns | 21637 ns | 0.85 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/AMDGPU | 43450 ns | 42161 ns | 1.03 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 14667 ns | 14625 ns | 1.00 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 14666 ns | 14750 ns | 0.99 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 14709 ns | 14667 ns | 1.00 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 14708 ns | 14625 ns | 1.01 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA | 165588 ns | 307620 ns | 0.54 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/AMDGPU | 197842 ns | 192746.5 ns | 1.03 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 130708 ns | 100834 ns | 1.30 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 101313 ns | 118500 ns | 0.85 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 105000.5 ns | 101833 ns | 1.03 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 106666.5 ns | 102417 ns | 1.04 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 125911 ns | 136873 ns | 0.92 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 204662 ns | 205777 ns | 0.99 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1925042 ns | 1916625 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1928041 ns | 1916542 ns | 1.01 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1930583 ns | 1926979 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1855291 ns | 1898334 ns | 0.98 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 429902 ns | 683667 ns | 0.63 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1148786.5 ns | 1215256.5 ns | 0.95 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 18166 ns | 19000 ns | 0.96 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 18979 ns | 19000 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 22458 ns | 22250 ns | 1.01 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 18125 ns | 16916 ns | 1.07 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 63187.5 ns | 107183.5 ns | 0.59 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 79155.5 ns | 78581 ns | 1.01 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 252792 ns | 217813 ns | 1.16 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 261875 ns | 222833 ns | 1.18 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 219958 ns | 217417 ns | 1.01 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 217125 ns | 216770.5 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 279978 ns | 512086.5 ns | 0.55 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 475684 ns | 476669.5 ns | 1.00 |
| batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) | 24729.5 ns | 24750 ns | 1.00 |
| batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) | 28125 ns | 28937.5 ns | 0.97 |
| batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) | 27000 ns | 26875 ns | 1.00 |
| batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) | 1375 ns | 1083 ns | 1.27 |
| batchedmm(16, Bsize=4)/forward/GPU/CUDA | 13843 ns | 16054 ns | 0.86 |
| batchedmm(16, Bsize=4)/forward/GPU/AMDGPU | 81051 ns | 81581 ns | 0.99 |
| batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) | 5479.5 ns | 4896.5 ns | 1.12 |
| batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) | 5167 ns | 4917 ns | 1.05 |
| batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) | 5270.5 ns | 5333 ns | 0.99 |
| batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) | 4708 ns | 4229 ns | 1.11 |
| batchedmm(16, Bsize=4)/zygote/GPU/CUDA | 110586.5 ns | 206611 ns | 0.54 |
| batchedmm(16, Bsize=4)/zygote/GPU/AMDGPU | 379244 ns | 377863 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 308792 ns | 306208 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 305625 ns | 305084 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 307291 ns | 309729.5 ns | 0.99 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 306834 ns | 307625 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 102299 ns | 224320 ns | 0.46 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 272803 ns | 274612 ns | 0.99 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 544417 ns | 531959 ns | 1.02 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 575000 ns | 543458 ns | 1.06 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 545958.5 ns | 535333.5 ns | 1.02 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 538167 ns | 542209 ns | 0.99 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 500049 ns | 1058263 ns | 0.47 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 849309 ns | 853108 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 22000 ns | 22084 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 21083 ns | 21083 ns | 1 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 22042 ns | 23542 ns | 0.94 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 19667 ns | 19459 ns | 1.01 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 64471.5 ns | 112165.5 ns | 0.57 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 78011 ns | 78361 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 226000 ns | 221750 ns | 1.02 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 245604 ns | 217666.5 ns | 1.13 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 215584 ns | 224750 ns | 0.96 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 212791 ns | 222416 ns | 0.96 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 344357 ns | 732048.5 ns | 0.47 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 535535 ns | 533125 ns | 1.00 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 7542 ns | 6958 ns | 1.08 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 5791.5 ns | 6958 ns | 0.83 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 8416 ns | 9208 ns | 0.91 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 7167 ns | 6417 ns | 1.12 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 63232 ns | 137815 ns | 0.46 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU | 65391 ns | 65160 ns | 1.00 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 13667 ns | 9958 ns | 1.37 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 11916 ns | 10792 ns | 1.10 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 10125 ns | 10541 ns | 0.96 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 10041 ns | 9875 ns | 1.02 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 396144.5 ns | 815812 ns | 0.49 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 386814 ns | 385314 ns | 1.00 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 6541.5 ns | 4750 ns | 1.38 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 4666 ns | 5208 ns | 0.90 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 6500 ns | 6271 ns | 1.04 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 7042 ns | 5000 ns | 1.41 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 64824 ns | 141314 ns | 0.46 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 68750 ns | 66780 ns | 1.03 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 8083 ns | 7709 ns | 1.05 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 8166 ns | 7916 ns | 1.03 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7708 ns | 7875 ns | 0.98 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7583 ns | 7959 ns | 0.95 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 423945 ns | 775695 ns | 0.55 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 394914 ns | 388324 ns | 1.02 |
| batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) | 14516708 ns | 14550291 ns | 1.00 |
| batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) | 7713187.5 ns | 7721375 ns | 1.00 |
| batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) | 7704854 ns | 7712187.5 ns | 1.00 |
| batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) | 27801334 ns | 27857958 ns | 1.00 |
| batchedmm(128, Bsize=512)/forward/GPU/CUDA | 531151.5 ns | 529799 ns | 1.00 |
| batchedmm(128, Bsize=512)/forward/GPU/AMDGPU | 393889 ns | 389819 ns | 1.01 |
| batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) | 46558771.5 ns | 46686916.5 ns | 1.00 |
| batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) | 26529584 ns | 26553583 ns | 1.00 |
| batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) | 26598312 ns | 26597104.5 ns | 1.00 |
| batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) | 85686792 ns | 85700209 ns | 1.00 |
| batchedmm(128, Bsize=512)/zygote/GPU/CUDA | 3208907 ns | 2648481 ns | 1.21 |
| batchedmm(128, Bsize=512)/zygote/GPU/AMDGPU | 3300533 ns | 3297251 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 67833 ns | 66125 ns | 1.03 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 65625 ns | 68667 ns | 0.96 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 69333.5 ns | 70437.5 ns | 0.98 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 67292 ns | 66917 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 68650 ns | 117160.5 ns | 0.59 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 232393 ns | 233212 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 450333 ns | 455375 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 453834 ns | 452500 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 446417 ns | 453833.5 ns | 0.98 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 441584 ns | 441375 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 394734 ns | 721437 ns | 0.55 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 788457.5 ns | 786047 ns | 1.00 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 625 ns | 542 ns | 1.15 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 542 ns | 583 ns | 0.93 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 625 ns | 667 ns | 0.94 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 583 ns | 500 ns | 1.17 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 26112 ns | 32085 ns | 0.81 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 47140 ns | 47371 ns | 1.00 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 10542 ns | 8667 ns | 1.22 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 9583 ns | 9042 ns | 1.06 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 9250 ns | 10000 ns | 0.93 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 10708 ns | 8458 ns | 1.27 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 152524.5 ns | 282627 ns | 0.54 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 373324 ns | 375423.5 ns | 0.99 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 9833 ns | 9792 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 9792 ns | 9833 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 9833 ns | 9792 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 9833 ns | 9833 ns | 1 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA | 20835 ns | 22901 ns | 0.91 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/AMDGPU | 208092 ns | 208212 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 46333 ns | 45625 ns | 1.02 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 45833 ns | 45958 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 46000 ns | 45875 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 45959 ns | 45917 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA | 189222 ns | 288260 ns | 0.66 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/AMDGPU | 603691 ns | 607426 ns | 0.99 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 56334 ns | 56625 ns | 0.99 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 56375 ns | 56833 ns | 0.99 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 56458 ns | 56834 ns | 0.99 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 57875 ns | 58250 ns | 0.99 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 21828 ns | 28250 ns | 0.77 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 202032 ns | 202042 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 464834 ns | 496854 ns | 0.94 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 474250.5 ns | 504833 ns | 0.94 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 465771 ns | 482959 ns | 0.96 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 434770.5 ns | 434145.5 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 162400 ns | 242768 ns | 0.67 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 877129 ns | 877308 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 651104.5 ns | 642729 ns | 1.01 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 683542 ns | 659250 ns | 1.04 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 656292 ns | 650437.5 ns | 1.01 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 616541.5 ns | 609291.5 ns | 1.01 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 140209 ns | 203473.5 ns | 0.69 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 305778 ns | 309673 ns | 0.99 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2262562.5 ns | 2253979 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2231521 ns | 2246042 ns | 0.99 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2245125 ns | 2231375 ns | 1.01 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2244604.5 ns | 2238292 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 644538 ns | 956636.5 ns | 0.67 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1307248 ns | 1324473 ns | 0.99 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 21625 ns | 20292 ns | 1.07 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 20833 ns | 23500 ns | 0.89 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 23208 ns | 24250 ns | 0.96 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 20125 ns | 19333 ns | 1.04 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 69407.5 ns | 111824.5 ns | 0.62 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 78811 ns | 80571 ns | 0.98 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 233042 ns | 271000 ns | 0.86 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 233125 ns | 258000 ns | 0.90 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 221333 ns | 231875 ns | 0.95 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 224875 ns | 221125 ns | 1.02 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 410361 ns | 720921 ns | 0.57 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 557581 ns | 554706 ns | 1.01 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 625 ns | 500 ns | 1.25 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 500 ns | 583 ns | 0.86 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 625 ns | 667 ns | 0.94 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 625 ns | 500 ns | 1.25 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 18190 ns | 22764 ns | 0.80 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 47870 ns | 47580 ns | 1.01 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 9812.5 ns | 9541 ns | 1.03 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 9250 ns | 9625 ns | 0.96 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 10042 ns | 10208 ns | 0.98 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 10000 ns | 9333 ns | 1.07 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 136633 ns | 264550 ns | 0.52 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 397114 ns | 398354 ns | 1.00 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 8958 ns | 10750 ns | 0.83 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 8438 ns | 8875 ns | 0.95 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 10750 ns | 11125 ns | 0.97 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 8084 ns | 8917 ns | 0.91 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 64696.5 ns | 117075.5 ns | 0.55 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU | 71891 ns | 69781 ns | 1.03 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7666 ns | 7500 ns | 1.02 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7250 ns | 7750 ns | 0.94 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 8417 ns | 8083 ns | 1.04 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7708 ns | 7750 ns | 0.99 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 292900 ns | 498929 ns | 0.59 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 318078 ns | 322428 ns | 0.99 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 1542 ns | 1458 ns | 1.06 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 1458 ns | 1584 ns | 0.92 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 2208 ns | 2000 ns | 1.10 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 1708 ns | 1541 ns | 1.11 |
| bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA | 13397 ns | 20430 ns | 0.66 |
| bias_activation(2, act=gelu)(2 x 128)/forward/GPU/AMDGPU | 188372 ns | 188361 ns | 1.00 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 3312.5 ns | 3292 ns | 1.01 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 3375 ns | 3458 ns | 0.98 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 3667 ns | 3541 ns | 1.04 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 3375 ns | 3208 ns | 1.05 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA | 117821 ns | 218522.5 ns | 0.54 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/AMDGPU | 578906 ns | 578345 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) | 147437.5 ns | 148312.5 ns | 0.99 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) | 106312.5 ns | 105937.5 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) | 107750 ns | 108125 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) | 226021 ns | 226084 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA | 16777 ns | 23769 ns | 0.71 |
| bias_activation(512, act=tanh)(512 x 128)/forward/GPU/AMDGPU | 40540 ns | 40471 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) | 163417 ns | 173291.5 ns | 0.94 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) | 106833 ns | 104500 ns | 1.02 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) | 98125 ns | 105208 ns | 0.93 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) | 251458 ns | 287062 ns | 0.88 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA | 141681 ns | 215904 ns | 0.66 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/AMDGPU | 266553 ns | 268567 ns | 0.99 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7292 ns | 7250 ns | 1.01 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5333 ns | 5333 ns | 1 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 5375 ns | 5416 ns | 0.99 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10209 ns | 10416 ns | 0.98 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 26669.5 ns | 32778 ns | 0.81 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 48681 ns | 48640 ns | 1.00 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 256208 ns | 226583 ns | 1.13 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 258709 ns | 229645.5 ns | 1.13 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 231395.5 ns | 238083 ns | 0.97 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 224896 ns | 213229.5 ns | 1.05 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 185868.5 ns | 258784 ns | 0.72 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 589590.5 ns | 595636 ns | 0.99 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 16125 ns | 15375 ns | 1.05 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 14750 ns | 15125 ns | 0.98 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 17000 ns | 16959 ns | 1.00 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 15375 ns | 15083 ns | 1.02 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 76403.5 ns | 137028 ns | 0.56 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 230202 ns | 230152 ns | 1.00 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 24416 ns | 23500 ns | 1.04 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 23708 ns | 24208 ns | 0.98 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 23792 ns | 24500 ns | 0.97 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 23417 ns | 24375 ns | 0.96 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 496390.5 ns | 858623.5 ns | 0.58 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 676296.5 ns | 679476 ns | 1.00 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 10334 ns | 9750 ns | 1.06 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 9375 ns | 10104.5 ns | 0.93 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 11666.5 ns | 11000 ns | 1.06 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 9292 ns | 9084 ns | 1.02 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 81566 ns | 120301.5 ns | 0.68 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU | 72771 ns | 74161 ns | 0.98 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 14333 ns | 13875 ns | 1.03 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 13666.5 ns | 14646 ns | 0.93 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 14729.5 ns | 15000 ns | 0.98 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 14750 ns | 13958 ns | 1.06 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 412717 ns | 655428 ns | 0.63 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 362433 ns | 366138.5 ns | 0.99 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 8917 ns | 10250 ns | 0.87 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 9750 ns | 10625.5 ns | 0.92 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 11896 ns | 11792 ns | 1.01 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 9542 ns | 9125 ns | 1.05 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 84716 ns | 119866.5 ns | 0.71 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU | 71721 ns | 72421 ns | 0.99 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 13250 ns | 12208 ns | 1.09 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 12521 ns | 12791.5 ns | 0.98 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 13542 ns | 13084 ns | 1.04 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 12875 ns | 12875 ns | 1 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 346105 ns | 541791 ns | 0.64 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 338603.5 ns | 341643 ns | 0.99 |
| batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) | 31041.5 ns | 30750 ns | 1.01 |
| batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) | 32438 ns | 32333 ns | 1.00 |
| batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) | 29625 ns | 29792 ns | 0.99 |
| batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) | 2167 ns | 1625 ns | 1.33 |
| batchedmm(2, Bsize=128)/forward/GPU/CUDA | 14504 ns | 16024 ns | 0.91 |
| batchedmm(2, Bsize=128)/forward/GPU/AMDGPU | 80601 ns | 80551 ns | 1.00 |
| batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) | 5250 ns | 5042 ns | 1.04 |
| batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) | 4750 ns | 5458 ns | 0.87 |
| batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) | 5208 ns | 5083 ns | 1.02 |
| batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) | 6541 ns | 6209 ns | 1.05 |
| batchedmm(2, Bsize=128)/zygote/GPU/CUDA | 107471 ns | 139561 ns | 0.77 |
| batchedmm(2, Bsize=128)/zygote/GPU/AMDGPU | 370164 ns | 368314 ns | 1.01 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 292 ns | 291 ns | 1.00 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 291 ns | 375 ns | 0.78 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 375 ns | 1 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 375 ns | 250 ns | 1.50 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 18911 ns | 25032 ns | 0.76 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU | 46920 ns | 46980 ns | 1.00 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 6542 ns | 6167 ns | 1.06 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 6292 ns | 6666.5 ns | 0.94 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 6958.5 ns | 6958 ns | 1.00 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6708 ns | 6125 ns | 1.10 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 135126 ns | 184207 ns | 0.73 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 386254 ns | 388954 ns | 0.99 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 2000 ns | 2000 ns | 1 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 1958 ns | 2042 ns | 0.96 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 2083 ns | 2083 ns | 1 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 2042 ns | 1959 ns | 1.04 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 20048 ns | 26042 ns | 0.77 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 204122 ns | 204582 ns | 1.00 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 16937.5 ns | 17083 ns | 0.99 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 17042 ns | 16875 ns | 1.01 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 17000 ns | 16896 ns | 1.01 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 15875 ns | 16584 ns | 0.96 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 151188.5 ns | 271146.5 ns | 0.56 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 698796.5 ns | 701017 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 150292 ns | 147458 ns | 1.02 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 188375 ns | 175562.5 ns | 1.07 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 152834 ns | 153292 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 152750 ns | 152541 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 169794 ns | 195620 ns | 0.87 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 225092 ns | 226692 ns | 0.99 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1328166 ns | 1323500 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1339625 ns | 1327791 ns | 1.01 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1339979 ns | 1331125 ns | 1.01 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1321375 ns | 1301042 ns | 1.02 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 738732.5 ns | 891045 ns | 0.83 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1067311 ns | 1116140.5 ns | 0.96 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 26042 ns | 25000 ns | 1.04 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 25313 ns | 24437.5 ns | 1.04 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 28208 ns | 28250 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 25750 ns | 25979.5 ns | 0.99 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 179072 ns | 231362.5 ns | 0.77 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 113981 ns | 115561 ns | 0.99 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 181083.5 ns | 178562 ns | 1.01 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 169917 ns | 126166 ns | 1.35 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 118875 ns | 178437.5 ns | 0.67 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 125563 ns | 157500 ns | 0.80 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 736737.5 ns | 1053949 ns | 0.70 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 606996 ns | 608216 ns | 1.00 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 375 ns | 292 ns | 1.28 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 292 ns | 334 ns | 0.87 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 333 ns | 375 ns | 0.89 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 375 ns | 250 ns | 1.50 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 17782 ns | 22518 ns | 0.79 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 47020 ns | 47580 ns | 0.99 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 6917 ns | 6416 ns | 1.08 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 6500 ns | 6834 ns | 0.95 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 7270.5 ns | 7020.5 ns | 1.04 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6958 ns | 6417 ns | 1.08 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 149426 ns | 200663 ns | 0.74 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 389994 ns | 396354 ns | 0.98 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 6209 ns | 7062.5 ns | 0.88 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 5708 ns | 5874.5 ns | 0.97 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 7666 ns | 7791 ns | 0.98 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 5958 ns | 6791 ns | 0.88 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 100369 ns | 142964.5 ns | 0.70 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 231643 ns | 231792 ns | 1.00 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 10083 ns | 10208.5 ns | 0.99 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 9666.5 ns | 10250 ns | 0.94 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10333 ns | 10500 ns | 0.98 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 10125 ns | 10333 ns | 0.98 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 656519 ns | 887713 ns | 0.74 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 676037 ns | 669276 ns | 1.01 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 708 ns | 667 ns | 1.06 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 667 ns | 667 ns | 1 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 667 ns | 667 ns | 1 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 667 ns | 625 ns | 1.07 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA | 20098 ns | 22120 ns | 0.91 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/AMDGPU | 205502 ns | 205382 ns | 1.00 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 4667 ns | 4667 ns | 1 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 4584 ns | 4833 ns | 0.95 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 4875 ns | 4833 ns | 1.01 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 4709 ns | 4584 ns | 1.03 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA | 183686.5 ns | 224988.5 ns | 0.82 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/AMDGPU | 577406 ns | 575835.5 ns | 1.00 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 8062 ns | 8167 ns | 0.99 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 8083 ns | 8437 ns | 0.96 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 10062 ns | 9833 ns | 1.02 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 7979.5 ns | 7958 ns | 1.00 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 112521 ns | 119167.5 ns | 0.94 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU | 75781 ns | 74331 ns | 1.02 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8625 ns | 8416 ns | 1.02 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8750 ns | 8938 ns | 0.98 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 9459 ns | 9625 ns | 0.98 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8959 ns | 8458 ns | 1.06 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 542270 ns | 578635 ns | 0.94 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 339298.5 ns | 344473 ns | 0.98 |
| batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) | 126979.5 ns | 126875 ns | 1.00 |
| batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) | 100291 ns | 97229 ns | 1.03 |
| batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) | 97208 ns | 97333.5 ns | 1.00 |
| batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) | 180729.5 ns | 183291.5 ns | 0.99 |
| batchedmm(128, Bsize=4)/forward/GPU/CUDA | 44342 ns | 45455.5 ns | 0.98 |
| batchedmm(128, Bsize=4)/forward/GPU/AMDGPU | 101011 ns | 101051 ns | 1.00 |
| batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) | 340250 ns | 340292 ns | 1.00 |
| batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) | 192146 ns | 182250 ns | 1.05 |
| batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) | 167166 ns | 191959 ns | 0.87 |
| batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) | 573958.5 ns | 612416.5 ns | 0.94 |
| batchedmm(128, Bsize=4)/zygote/GPU/CUDA | 199334 ns | 191737 ns | 1.04 |
| batchedmm(128, Bsize=4)/zygote/GPU/AMDGPU | 515465 ns | 516500 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 399208 ns | 399042 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 215250 ns | 215417 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 215625 ns | 215333 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 756875 ns | 756333 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA | 40054 ns | 43626 ns | 0.92 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/AMDGPU | 80551 ns | 81280 ns | 0.99 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 1406459 ns | 1398374.5 ns | 1.01 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 862312 ns | 864000 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 864000 ns | 864270.5 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 2359542 ns | 2358708.5 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA | 234952 ns | 253991.5 ns | 0.93 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/AMDGPU | 353324 ns | 350903.5 ns | 1.01 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 659917 ns | 653500 ns | 1.01 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 658270.5 ns | 655916 ns | 1.00 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 624271 ns | 653041.5 ns | 0.96 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 677791.5 ns | 622146 ns | 1.09 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 196665.5 ns | 201217.5 ns | 0.98 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 305543 ns | 306973 ns | 1.00 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2481875 ns | 2461125.5 ns | 1.01 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2467479.5 ns | 2469625 ns | 1.00 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2476313 ns | 2481375 ns | 1.00 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2446833 ns | 2480333 ns | 0.99 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 984615.5 ns | 998464.5 ns | 0.99 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1399689 ns | 1392463.5 ns | 1.01 |
| batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) | 34062.5 ns | 32521 ns | 1.05 |
| batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) | 34666.5 ns | 34291 ns | 1.01 |
| batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) | 32791.5 ns | 33084 ns | 0.99 |
| batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) | 958 ns | 833 ns | 1.15 |
| batchedmm(2, Bsize=32)/forward/GPU/CUDA | 14044 ns | 15542.5 ns | 0.90 |
| batchedmm(2, Bsize=32)/forward/GPU/AMDGPU | 84401 ns | 78871 ns | 1.07 |
| batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) | 3166.5 ns | 3000 ns | 1.06 |
| batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) | 3166 ns | 3417 ns | 0.93 |
| batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) | 3500 ns | 3500 ns | 1 |
| batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) | 3250 ns | 3042 ns | 1.07 |
| batchedmm(2, Bsize=32)/zygote/GPU/CUDA | 121484 ns | 141700 ns | 0.86 |
| batchedmm(2, Bsize=32)/zygote/GPU/AMDGPU | 362074 ns | 337663 ns | 1.07 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 406584 ns | 408916 ns | 0.99 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 402458 ns | 403770.5 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 403000 ns | 404375 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 420645.5 ns | 423959 ns | 0.99 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 36583 ns | 43511.5 ns | 0.84 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 238852 ns | 237932 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 3879583 ns | 3878166.5 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 3983541.5 ns | 3999042 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 3998250 ns | 4003416 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 3674250 ns | 3792395.5 ns | 0.97 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 237279.5 ns | 245738 ns | 0.97 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1428125 ns | 1432279 ns | 1.00 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3917 ns | 3958 ns | 0.99 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3917 ns | 3917 ns | 1 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3917 ns | 3917 ns | 1 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3917 ns | 3917 ns | 1 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA | 32312 ns | 34288 ns | 0.94 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/AMDGPU | 38101 ns | 37921 ns | 1.00 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 15459 ns | 15459 ns | 1 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 15459 ns | 15666 ns | 0.99 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 15666 ns | 15666 ns | 1 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 15458 ns | 15458 ns | 1 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA | 242437.5 ns | 258924 ns | 0.94 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/AMDGPU | 167902 ns | 173651.5 ns | 0.97 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 404458 ns | 404583 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 221625 ns | 220833 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 221375 ns | 221125 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 760125 ns | 760833 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA | 117928 ns | 113269 ns | 1.04 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/AMDGPU | 87841 ns | 87641 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 1429417 ns | 1424020.5 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 887583 ns | 888041.5 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 887396 ns | 888875 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 2378208 ns | 2382770.5 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA | 192870.5 ns | 245573 ns | 0.79 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/AMDGPU | 353053 ns | 354303 ns | 1.00 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 542 ns | 500 ns | 1.08 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 458 ns | 583 ns | 0.79 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 583 ns | 583 ns | 1 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 583 ns | 500 ns | 1.17 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 19335 ns | 25789 ns | 0.75 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 205012 ns | 204972 ns | 1.00 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 7500 ns | 7459 ns | 1.01 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 7250 ns | 7667 ns | 0.95 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8166 ns | 7958 ns | 1.03 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8167 ns | 7250 ns | 1.13 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 165885 ns | 217010.5 ns | 0.76 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 683986 ns | 692821.5 ns | 0.99 |
| batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) | 832729.5 ns | 832771 ns | 1.00 |
| batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) | 467000 ns | 467416 ns | 1.00 |
| batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) | 469250 ns | 470562.5 ns | 1.00 |
| batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) | 1575625 ns | 1544541 ns | 1.02 |
| batchedmm(128, Bsize=32)/forward/GPU/CUDA | 129567 ns | 129883 ns | 1.00 |
| batchedmm(128, Bsize=32)/forward/GPU/AMDGPU | 227872 ns | 229272 ns | 0.99 |
| batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) | 2691958.5 ns | 2692000 ns | 1.00 |
| batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) | 1537333 ns | 1540000 ns | 1.00 |
| batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) | 1540083.5 ns | 1542312.5 ns | 1.00 |
| batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) | 4938125 ns | 4931479 ns | 1.00 |
| batchedmm(128, Bsize=32)/zygote/GPU/CUDA | 274489 ns | 248014 ns | 1.11 |
| batchedmm(128, Bsize=32)/zygote/GPU/AMDGPU | 806443 ns | 809797.5 ns | 1.00 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 375 ns | 250 ns | 1.50 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 292 ns | 375 ns | 0.78 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 334 ns | 375 ns | 0.89 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 334 ns | 291 ns | 1.15 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 25740 ns | 32644 ns | 0.79 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 47540 ns | 47000 ns | 1.01 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6625 ns | 6208 ns | 1.07 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 6125 ns | 6562.5 ns | 0.93 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 6791 ns | 6916 ns | 0.98 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6667 ns | 6333 ns | 1.05 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 180518.5 ns | 226410 ns | 0.80 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 359293 ns | 357804 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 2375375 ns | 2407917 ns | 0.99 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 2422500 ns | 2401417 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 2407959 ns | 2386750 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 2370375 ns | 2392333 ns | 0.99 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 178233.5 ns | 200791 ns | 0.89 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 374734 ns | 374543.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 4668709 ns | 4663875 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 4652084 ns | 4666063 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 4665083.5 ns | 4675291 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4600917 ns | 4670208 ns | 0.99 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 872920 ns | 902618 ns | 0.97 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1382244 ns | 1376633 ns | 1.00 |
| bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 9437.5 ns | 6875 ns | 1.37 |
| bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 7833 ns | 7542 ns | 1.04 |
| bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 7292 ns | 7250 ns | 1.01 |
| bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 6834 ns | 6917 ns | 0.99 |
| bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA | 16361 ns | 23477 ns | 0.70 |
| bias_activation(512, act=relu)(512 x 128)/forward/GPU/AMDGPU | 39440 ns | 39221 ns | 1.01 |
| bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 74520.5 ns | 32313 ns | 2.31 |
| bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 49250 ns | 49125 ns | 1.00 |
| bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 51729 ns | 49583 ns | 1.04 |
| bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 49083.5 ns | 52291.5 ns | 0.94 |
| bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA | 212837 ns | 219072.5 ns | 0.97 |
| bias_activation(512, act=relu)(512 x 128)/zygote/GPU/AMDGPU | 266233 ns | 262272 ns | 1.02 |
| batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) | 22250 ns | 21666.5 ns | 1.03 |
| batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) | 25000 ns | 24541.5 ns | 1.02 |
| batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) | 21854.5 ns | 22416.5 ns | 0.97 |
| batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) | 5375 ns | 5166 ns | 1.04 |
| batchedmm(2, Bsize=512)/forward/GPU/CUDA | 15953 ns | 18191 ns | 0.88 |
| batchedmm(2, Bsize=512)/forward/GPU/AMDGPU | 83861 ns | 82841 ns | 1.01 |
| batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) | 11834 ns | 11979 ns | 0.99 |
| batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) | 9187.5 ns | 9645.5 ns | 0.95 |
| batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) | 9520.5 ns | 9541.5 ns | 1.00 |
| batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) | 18354.5 ns | 18062.5 ns | 1.02 |
| batchedmm(2, Bsize=512)/zygote/GPU/CUDA | 203711.5 ns | 231197.5 ns | 0.88 |
| batchedmm(2, Bsize=512)/zygote/GPU/AMDGPU | 388864 ns | 365714 ns | 1.06 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 406375 ns | 406041 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 223500 ns | 223459 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 223792 ns | 223375 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 762958 ns | 762584 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA | 43379 ns | 46689.5 ns | 0.93 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/AMDGPU | 89781 ns | 87501 ns | 1.03 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 1427542 ns | 1427542 ns | 1 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 892959 ns | 894125 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 892958 ns | 896417 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 2385625 ns | 2384229 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA | 239711 ns | 287677.5 ns | 0.83 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/AMDGPU | 376923.5 ns | 377703 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 434375 ns | 434334 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 430000 ns | 430229.5 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 430417 ns | 430333 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 448375 ns | 447583 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 46179 ns | 55000 ns | 0.84 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 235662 ns | 233247 ns | 1.01 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 3912500 ns | 3915625 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 4004000 ns | 4018146 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 4025375.5 ns | 4025959 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 3768792 ns | 3782667 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 251012 ns | 265792.5 ns | 0.94 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1368994 ns | 1207206.5 ns | 1.13 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 8750 ns | 8750 ns | 1 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 6875 ns | 6875 ns | 1 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 6917 ns | 6875 ns | 1.01 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 12458 ns | 12416 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA | 20602 ns | 24680 ns | 0.83 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/AMDGPU | 209952 ns | 210232 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 44958 ns | 44583 ns | 1.01 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 45083 ns | 44959 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 45250 ns | 44875 ns | 1.01 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 44750 ns | 44667 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA | 314279 ns | 349913 ns | 0.90 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/AMDGPU | 653907 ns | 651936 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 115896 ns | 119750.5 ns | 0.97 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 125812.5 ns | 123750 ns | 1.02 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 126604.5 ns | 89667 ns | 1.41 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 89000 ns | 81771 ns | 1.09 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 186375.5 ns | 189502 ns | 0.98 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 219802 ns | 218452 ns | 1.01 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2026583 ns | 2022125 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2025000 ns | 2026083 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2024729.5 ns | 2027729 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2026520.5 ns | 2023895.5 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 566645 ns | 540867 ns | 1.05 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1084851 ns | 1089800 ns | 1.00 |
This comment was automatically generated by workflow using github-action-benchmark.