This repository has been archived by the owner on Nov 4, 2024. It is now read-only.
chore: bump crate-ci/typos from 1.24.5 to 1.24.6
Bumps [crate-ci/typos](https://github.com/crate-ci/typos) from 1.24.5 to 1.24.6.
- [Release notes](https://github.com/crate-ci/typos/releases)
- [Changelog](https://github.com/crate-ci/typos/blob/master/CHANGELOG.md)
- [Commits](crate-ci/typos@v1.24.5...v1.24.6)

---
updated-dependencies:
- dependency-name: crate-ci/typos
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
350b7c7
LuxLib Benchmarks
| Benchmark | Current (ns) | Previous (ns) | Ratio |
|---|---|---|---|
| layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 7625 | 7000 | 1.09 |
| layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 7333 | 5874.5 | 1.25 |
| layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 7437 | 8250 | 0.90 |
| layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 5500 | 5625 | 0.98 |
| layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 88183 | 88896 | 0.99 |
| layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI | 2389684 | | |
| layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 405334 | 400425 | 1.01 |
| layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 9916.5 | 9958 | 1.00 |
| layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 9542 | 9708 | 0.98 |
| layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 9792 | 9875 | 0.99 |
| layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 10000 | 9979.5 | 1.00 |
| layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 383362 | 370778 | 1.03 |
| layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI | 17679354 | | |
| layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 677366 | 665927 | 1.02 |

| Benchmark | Current (ns) | Previous (ns) | Ratio |
|---|---|---|---|
| bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 2334 | 1249.5 | 1.87 |
| bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 1500 | 3000 | 0.50 |
| bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 1688 | 1959 | 0.86 |
| bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 1729.5 | 1687.5 | 1.02 |
| bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA | 14281 | 13908 | 1.03 |
| bias_activation(32, act=relu)(32 x 128)/forward/GPU/oneAPI | 1297688 | | |
| bias_activation(32, act=relu)(32 x 128)/forward/GPU/AMDGPU | 30200 | 30060 | 1.00 |
| bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 4271 | 3959 | 1.08 |
| bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 4458 | 4291 | 1.04 |
| bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 3750 | 3875 | 0.97 |
| bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 3917 | 4375 | 0.90 |
| bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA | 106099.5 | 104640 | 1.01 |
| bias_activation(32, act=relu)(32 x 128)/zygote/GPU/oneAPI | 9298154.5 | | |
| bias_activation(32, act=relu)(32 x 128)/zygote/GPU/AMDGPU | 144956.5 | 145602 | 1.00 |

| Benchmark | Current (ns) | Previous (ns) | Ratio |
|---|---|---|---|
| batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 57333 | 58042 | 0.99 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 46750 | 39708.5 | 1.18 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 46250 | 40084 | 1.15 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 83708 | 82708 | 1.01 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 30588.5 | 30831 | 0.99 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 572856.5 | | |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 77970 | 79190 | 0.98 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2018916 | 2061042 | 0.98 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2087937.5 | 2079750 | 1.00 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2087229 | 2084916 | 1.00 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1997063 | 2001229 | 1.00 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 182309 | 181552 | 1.00 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 7656207 | | |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1482305 | 1440455 | 1.03 |

| Benchmark | Current (ns) | Previous (ns) | Ratio |
|---|---|---|---|
| layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 146584 | 148042 | 0.99 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 174667 | 148000 | 1.18 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 149333.5 | 155708 | 0.96 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 178791.5 | 176313 | 1.01 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 167232 | 168318 | 0.99 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 9038666 | | |
| layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 197432 | 203247.5 | 0.97 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1107750.5 | 1122729.5 | 0.99 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1114208 | 1119625 | 1.00 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1117604.5 | 1125833 | 0.99 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1114000.5 | 1123854.5 | 0.99 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 537253 | 539424 | 1.00 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 35616369 | | |
| layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1026475 | 912000 | 1.13 |

| Benchmark | Current (ns) | Previous (ns) | Ratio |
|---|---|---|---|
| layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 5291 | 4625 | 1.14 |
| layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 4645.5 | 5084 | 0.91 |
| layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 5541 | 6125 | 0.90 |
| layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 4166 | 4125 | 1.01 |
| layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 60281 | 60787 | 0.99 |
| layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI | 5328970.5 | | |
| layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 70560 | 67560 | 1.04 |
| layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8562.5 | 8500 | 1.01 |
| layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8459 | 8584 | 0.99 |
| layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 9166.5 | 8667 | 1.06 |
| layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8750 | 8417 | 1.04 |
| layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 414715.5 | 418528 | 0.99 |
| layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 33923657 | | |
| layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 387834 | 384969 | 1.01 |

| Benchmark | Current (ns) | Previous (ns) | Ratio |
|---|---|---|---|
| groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 17792 | 17542 | 1.01 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 17708 | 17542 | 1.01 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 21500 | 20458 | 1.05 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 17208.5 | 18770.5 | 0.92 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 60282.5 | 59728.5 | 1.01 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 3008486 | | |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 75721 | 76240 | 0.99 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 212333 | 224208 | 0.95 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 212458 | 219500 | 0.97 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 213521 | 221312.5 | 0.96 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 222750 | 213000 | 1.05 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 291687 | 293183.5 | 0.99 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 14295306 | | |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 471954.5 | 463935 | 1.02 |

| Benchmark | Current (ns) | Previous (ns) | Ratio |
|---|---|---|---|
| bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 583 | 667 | 0.87 |
| bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 583 | 625 | 0.93 |
| bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 792 | 916 | 0.86 |
| bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 583.5 | 625 | 0.93 |
| bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA | 13225 | 13248 | 1.00 |
| bias_activation(2, act=relu)(2 x 128)/forward/GPU/oneAPI | 1210151 | | |
| bias_activation(2, act=relu)(2 x 128)/forward/GPU/AMDGPU | 30961 | 30930 | 1.00 |
| bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 1541 | 1459 | 1.06 |
| bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 1542 | 1417 | 1.09 |
| bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 1542 | 1417 | 1.09 |
| bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 1416.5 | 1417 | 1.00 |
| bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA | 92964 | 92361 | 1.01 |
| bias_activation(2, act=relu)(2 x 128)/zygote/GPU/oneAPI | 9171879 | | |
| bias_activation(2, act=relu)(2 x 128)/zygote/GPU/AMDGPU | 134891 | 136232 | 0.99 |

| Benchmark | Current (ns) | Previous (ns) | Ratio |
|---|---|---|---|
| batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7375 | 7417 | 0.99 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 6125 | 5333 | 1.15 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6167 | 5416 | 1.14 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10125 | 10375 | 0.98 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 18616 | 18749 | 0.99 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 1243379.5 | | |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 46921 | 48581 | 0.97 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 263062 | 231083 | 1.14 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 240459 | 237166.5 | 1.01 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 228792 | 241042 | 0.95 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 237750 | 255583 | 0.93 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 154023 | 154979 | 0.99 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 32407548 | | |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 637591 | 646107 | 0.99 |

| Benchmark | Current (ns) | Previous (ns) | Ratio |
|---|---|---|---|
| dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 4083 | 4125 | 0.99 |
| dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 4125 | 4084 | 1.01 |
| dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 4167 | 4125 | 1.01 |
| dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 4083 | 4084 | 1.00 |
| dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA | 20561 | 19985 | 1.03 |
| dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/oneAPI | 2115667 | | |
| dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/AMDGPU | 46550 | 46780 | 1.00 |
| dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 17125 | 16458 | 1.04 |
| dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 16750 | 16500 | 1.02 |
| dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 16958 | 16625 | 1.02 |
| dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 16416 | 16791 | 0.98 |
| dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA | 174545.5 | 176107 | 0.99 |
| dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/oneAPI | 10156857.5 | | |
| dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/AMDGPU | 173982 | 175202 | 0.99 |

| Benchmark | Current (ns) | Previous (ns) | Ratio |
|---|---|---|---|
| dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 509375 | 511792 | 1.00 |
| dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 405541 | 331959 | 1.22 |
| dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 404292 | 332000 | 1.22 |
| dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 864750 | 865083 | 1.00 |
| dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA | 117562 | 116899.5 | 1.01 |
| dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/oneAPI | 397557 | | |
| dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/AMDGPU | 240702 | 241233 | 1.00 |
| dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 2318458 | 2275354 | 1.02 |
| dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 2034500 | 1753833 | 1.16 |
| dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 2032084 | 1758916 | 1.16 |
| dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 3191167 | 3193500 | 1.00 |
| dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA | 202548 | 203284.5 | 1.00 |
| dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/oneAPI | 11415659 | | |
| dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/AMDGPU | 739097 | 738868 | 1.00 |

| Benchmark | Current (ns) | Previous (ns) | Ratio |
|---|---|---|---|
| layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 5979.5 | 7459 | 0.80 |
| layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 6312.5 | 6854.5 | 0.92 |
| layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 8542 | 6895.5 | 1.24 |
| layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 6542 | 6459 | 1.01 |
| layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 84957.5 | 84654 | 1.00 |
| layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI | 5409712 | | |
| layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 66831 | 65201 | 1.02 |
| layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 11937.5 | 11604 | 1.03 |
| layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 11541.5 | 11125 | 1.04 |
| layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 11604 | 12083 | 0.96 |
| layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 10583 | 12021 | 0.88 |
| layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 561493 | 566453.5 | 0.99 |
| layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 37617116 | | |
| layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 405534 | 408354 | 0.99 |

| Benchmark | Current (ns) | Previous (ns) | Ratio |
|---|---|---|---|
| dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 500 | 500 | 1 |
| dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 500 | 541 | 0.92 |
| dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 542 | 583 | 0.93 |
| dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 500 | 541 | 0.92 |
| dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA | 20286 | 20386 | 1.00 |
| dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/oneAPI | 2161771 | | |
| dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/AMDGPU | 51190 | 47011 | 1.09 |
| dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 2084 | 2125 | 0.98 |
| dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 2084 | 2083 | 1.00 |
| dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 2208 | 2166 | 1.02 |
| dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 2125 | 2084 | 1.02 |
| dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA | 223022 | 228468 | 0.98 |
| dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/oneAPI | 10990252.5 | | |
| dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/AMDGPU | 182361 | 179272 | 1.02 |

| Benchmark | Current (ns) | Previous (ns) | Ratio |
|---|---|---|---|
| groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 8625 | 8250 | 1.05 |
| groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 9520.5 | 8833 | 1.08 |
| groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 9334 | 9292 | 1.00 |
| groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 7917 | 8875 | 0.89 |
| groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 108611 | 107454 | 1.01 |
| groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI | 3137439.5 | | |
| groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 74611 | 74891 | 1.00 |
| groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 18395.5 | 16812.5 | 1.09 |
| groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 16917 | 17750 | 0.95 |
| groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 18854 | 19271 | 0.98 |
| groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 18396 | 17791.5 | 1.03 |
| groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 518312 | 534728 | 0.97 |
| groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 16860013 | | |
| groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 380934 | 378084 | 1.01 |

| Benchmark | Current (ns) | Previous (ns) | Ratio |
|---|---|---|---|
| batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 458 | 500 | 0.92 |
| batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 500 | 500 | 1 |
| batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 708 | 625 | 1.13 |
| batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 500 | 500 | 1 |
| batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 27063 | 27220 | 0.99 |
| batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI | 1178178 | | |
| batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU | 46160 | 48461 | 0.95 |
| batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 8500 | 10021 | 0.85 |
| batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 9020.5 | 9125 | 0.99 |
| batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 9208.5 | 9584 | 0.96 |
| batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 8937.5 | 9729 | 0.92 |
| batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 166677 | 168737.5 | 0.99 |
| batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI | 18801518 | | |
| batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 371663 | 367733.5 | 1.01 |

| Benchmark | Current (ns) | Previous (ns) | Ratio |
|---|---|---|---|
| dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 397208 | 399000 | 1.00 |
| dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 288208.5 | 215542 | 1.34 |
| dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 288000 | 215541 | 1.34 |
| dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 756583 | 756208 | 1.00 |
| dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA | 110755 | 110802 | 1.00 |
| dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/oneAPI | 333813 | | |
| dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/AMDGPU | 75971 | 76450 | 0.99 |
| dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 1448374.5 | 1398875 | 1.04 |
| dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 1133083 | 858375 | 1.32 |
| dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 1131833 | 861479 | 1.31 |
| dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 2357875 | 2355542 | 1.00 |
| dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA | 177520.5 | 178308 | 1.00 |
| dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/oneAPI | 10029153 | | |
| dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/AMDGPU | 322173 | 321323 | 1.00 |

| Benchmark | Current (ns) | Previous (ns) | Ratio |
|---|---|---|---|
| layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 7291.5 | 7354 | 0.99 |
| layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 6875 | 7042 | 0.98 |
| layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 8666 | 8666.5 | 1.00 |
| layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 7208 | 7563 | 0.95 |
| layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 110478 | 114410.5 | 0.97 |
| layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI | 5505252 | | |
| layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 65640 | 65791 | 1.00 |
| layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 12145.5 | 13354.5 | 0.91 |
| layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 14167 | 13542 | 1.05 |
| layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 13792 | 15667 | 0.88 |
| layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 14729 | 14979 | 0.98 |
| layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 664318.5 | 689799.5 | 0.96 |
| layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 42216111.5 | | |
| layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 426745 | 423374 | 1.01 |

| Benchmark | Current (ns) | Previous (ns) | Ratio |
|---|---|---|---|
| layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 24770.5 | 25770.5 | 0.96 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 28375 | 25875 | 1.10 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 30459 | 29083 | 1.05 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 25729.5 | 27854 | 0.92 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 167386 | 168075.5 | 1.00 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 7615563 | | |
| layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 113401 | 114031 | 0.99 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 151292 | 118417 | 1.28 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 151187.5 | 119041 | 1.27 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 153583 | 141458.5 | 1.09 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 143875 | 155166 | 0.93 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 857621 | 861211 | 1.00 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 44631154 | | |
| layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 587816 | 582431 | 1.01 |

| Benchmark | Current (ns) | Previous (ns) | Ratio |
|---|---|---|---|
| layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 79833 | 74666 | 1.07 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 85583.5 | 75750 | 1.13 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 80437 | 84875 | 0.95 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 73583 | 77084 | 0.95 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 168427.5 | 169153 | 1.00 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 7736056 | | |
| layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 129412 | 126942 | 1.02 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 285333 | 278291 | 1.03 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 300667 | 305021 | 0.99 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 300791.5 | 305833 | 0.98 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 222625 | 287270.5 | 0.77 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 971830 | 972909 | 1.00 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 41332252 | | |
| layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 696216 | 695847 | 1.00 |

| Benchmark | Current (ns) | Previous (ns) | Ratio |
|---|---|---|---|
| layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 17000 | 16917 | 1.00 |
| layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 16833 | 17000 | 0.99 |
| layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 17125 | 18354.5 | 0.93 |
| layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 16542 | 16458 | 1.01 |
| layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 112981 | 113778 | 0.99 |
| layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI | 5793916 | | |
| layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 231572 | 231482 | 1.00 |
| layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 28083.5 | 27604.5 | 1.02 |
| layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 26500 | 25875 | 1.02 |
| layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 28083.5 | 26958.5 | 1.04 |
| layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 27187.5 | 28166.5 | 0.97 |
| layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 696173.5 | 702837 | 0.99 |
| layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 41169551 | | |
| layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 689617 | 696858 | 0.99 |

| Benchmark | Current (ns) | Previous (ns) | Ratio |
|---|---|---|---|
| groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 10292 | 10375 | 0.99 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 11333.5 | 10875 | 1.04 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 11750 | 13625 | 0.86 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 10250 | 10625 | 0.96 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 111360 | 112473.5 | 0.99 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI | 3372766 | | |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 235923 | 236187.5 | 1.00 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 23687.5 | 21583 | 1.10 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 21375 | 22396 | 0.95 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 22583 | 22250 | 1.01 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 22375 | 22041 | 1.02 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 554045 | 556668 | 1.00 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 22400526.5 | | |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 674936 | 670387 | 1.01 |

| Benchmark | Current (ns) | Previous (ns) | Ratio |
|---|---|---|---|
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 63875 | 65542 | 0.97 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 65292 | 64437.5 | 1.01 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 66458 | 66333 | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 62667 | 66167 | 0.95 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 96846 | 96734 | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 3400257 | | |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 235422 | 232362 | 1.01 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 437167 | 437459 | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 485500 | 479417 | 1.01 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 486250 | 438167 | 1.11 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 442291 | 498625 | 0.89 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 440935 | 442769 | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 20393573 | | |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 716017 | 712032 | 1.01 |

| Benchmark | Current (ns) | Previous (ns) | Ratio |
|---|---|---|---|
| layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 7208 | 7562.5 | 0.95 |
| layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 7250 | 7625 | 0.95 |
| layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 8646 | 8125 | 1.06 |
| layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 6812.5 | 7250 | 0.94 |
| layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 113059.5 | 113892.5 | 0.99 |
| layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI | 5983032 | | |
| layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU | 64461 | 69331 | 0.93 |
| layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 11875 | 14334 | 0.83 |
| layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 13583 | 14500 | 0.94 |
| layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 14854.5 | 16562 | 0.90 |
| layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 14750 | 11709 | 1.26 |
| layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 670072 | 675585.5 | 0.99 |
| layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI | 40018921 | | |
| layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 400084 | 399579 | 1.00 |

| Benchmark | Current (ns) | Previous (ns) | Ratio |
|---|---|---|---|
| batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) | 6149145.5 | 6158208 | 1.00 |
| batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) | 6373791 | 3224959 | 1.98 |
| batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) | 6369958 | 3225125 | 1.98 |
| batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) | 11914917 | 11921125 | 1.00 |
| batchedmm(512, Bsize=4)/forward/GPU/CUDA | 348199 | 347611.5 | 1.00 |
| batchedmm(512, Bsize=4)/forward/GPU/oneAPI | 55221895 | | |
| batchedmm(512, Bsize=4)/forward/GPU/AMDGPU | 318854 | 322793 | 0.99 |
| batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) | 19112395.5 | 19113166.5 | 1.00 |
| batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) | 19954875 | 11081437.5 | 1.80 |
| batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) | 19933333 | 11182250 | 1.78 |
| batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) | 36546937.5 | 36513062 | 1.00 |
| batchedmm(512, Bsize=4)/zygote/GPU/CUDA | 1032394 | 1026355 | 1.01 |
| batchedmm(512, Bsize=4)/zygote/GPU/oneAPI | 78448314.5 | | |
| batchedmm(512, Bsize=4)/zygote/GPU/AMDGPU | 1157393 | 1162657.5 | 1.00 |

| Benchmark | Current (ns) | Previous (ns) | Ratio |
|---|---|---|---|
| dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 958 | 958 | 1 |
| dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 958 | 958 | 1 |
| dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 1000 | 1041 | 0.96 |
| dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 958 | 1000 | 0.96 |
| dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA | 20220 | 20341 | 0.99 |
| dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/oneAPI | 2011379 | | |
| dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/AMDGPU | 207432 | 206602 | 1.00 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 3667 | 3708 | 0.99 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 3667 | 3666 | 1.00 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 3750 | 3750 | 1 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 3625 | 3709 | 0.98 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA | 242662 | 243936 | 0.99 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/oneAPI | 11613706.5 | | |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/AMDGPU | 625907 | 622497 | 1.01 |

| Benchmark | Current (ns) | Previous (ns) | Ratio |
|---|---|---|---|
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 7229.5 | 8125 | 0.89 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 8208 | 8145.5 | 1.01 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 9063 | 10209 | 0.89 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 7833 | 7645.5 | 1.02 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 110132.5 | 110001.5 | 1.00 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI | 3376276 | | |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 72491 | 64821 | 1.12 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 11792 | 11417 | 1.03 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 11708 | 12146 | 0.96 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 12833 | 12625 | 1.02 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 12042 | 12083 | 1.00 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 533463.5 | 533401.5 | 1.00 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 22224767.5 | | |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 357164 | 351113 | 1.02 |

| Benchmark | Current (ns) | Previous (ns) | Ratio |
|---|---|---|---|
| dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 250 | 291 | 0.86 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 333 | 292 | 1.14 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 292 | 333 | 0.88 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 292 | 333 | 0.88 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA | 20014 | 20031 | 1.00 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/oneAPI | 2044805 | | |
| dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/AMDGPU | 46611 | 47010 | 0.99 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 3167 | 2875 | 1.10 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 2875 | 2917 | 0.99 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 3083 | 3125 | 0.99 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 2834 | 3042 | 0.93 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA | 168487.5 | 139419 | 1.21 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/oneAPI | 9185467 | | |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/AMDGPU | 163482 | 160172 | 1.02 |

| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 11500 ns | 11708 ns | 0.98 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 11292 ns | 11208 ns | 1.01 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 13562.5 ns | 12917 ns | 1.05 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 9687.5 ns | 11708 ns | 0.83 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 110742.5 ns | 52993 ns | 2.09 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI | 3318937 ns | | |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 234383 ns | 232812 ns | 1.01 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 22041.5 ns | 20666.5 ns | 1.07 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 21312.5 ns | 20208 ns | 1.05 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 22292 ns | 22458 ns | 0.99 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 21375.5 ns | 21187.5 ns | 1.01 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 445412.5 ns | 249123.5 ns | 1.79 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 20307385 ns | | |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 648033 ns | 648996.5 ns | 1.00 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 4375 ns | 4375 ns | 1 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 4417 ns | 4375 ns | 1.01 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 4417 ns | 4458 ns | 0.99 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 4375 ns | 4417 ns | 0.99 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA | 21103 ns | 20585 ns | 1.03 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/oneAPI | 2254531 ns | | |
| dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/AMDGPU | 47271 ns | 48820 ns | 0.97 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 16542 ns | 16375 ns | 1.01 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 16458 ns | 16250 ns | 1.01 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 16667 ns | 16458 ns | 1.01 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 16542 ns | 16208 ns | 1.02 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA | 292441 ns | 169722 ns | 1.72 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/oneAPI | 12584045 ns | | |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/AMDGPU | 206702.5 ns | 209702 ns | 0.99 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 2041 ns | 1958 ns | 1.04 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 2042 ns | 1958 ns | 1.04 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 2083 ns | 2084 ns | 1.00 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 1916 ns | 2042 ns | 0.94 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 27885 ns | 28203 ns | 0.99 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI | 1248055 ns | | |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 203262 ns | 202342 ns | 1.00 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 16833 ns | 17125 ns | 0.98 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 18250 ns | 16791.5 ns | 1.09 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 18125 ns | 17542 ns | 1.03 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 17667 ns | 17209 ns | 1.03 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 178504 ns | 147741 ns | 1.21 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 21525405 ns | | |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 684992 ns | 682312 ns | 1.00 |
| batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) | 59104 ns | 59062 ns | 1.00 |
| batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) | 65041 ns | 62416 ns | 1.04 |
| batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) | 66583.5 ns | 61312.5 ns | 1.09 |
| batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) | 51125 ns | 53875 ns | 0.95 |
| batchedmm(16, Bsize=512)/forward/GPU/CUDA | 71334 ns | 71192 ns | 1.00 |
| batchedmm(16, Bsize=512)/forward/GPU/oneAPI | 89279199 ns | | |
| batchedmm(16, Bsize=512)/forward/GPU/AMDGPU | 118362 ns | 116711 ns | 1.01 |
| batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) | 163062.5 ns | 202750.5 ns | 0.80 |
| batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) | 151271 ns | 98750 ns | 1.53 |
| batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) | 157250 ns | 118104 ns | 1.33 |
| batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) | 313146 ns | 297958 ns | 1.05 |
| batchedmm(16, Bsize=512)/zygote/GPU/CUDA | 195169 ns | 170047 ns | 1.15 |
| batchedmm(16, Bsize=512)/zygote/GPU/oneAPI | 151578490.5 ns | | |
| batchedmm(16, Bsize=512)/zygote/GPU/AMDGPU | 624817 ns | 616606 ns | 1.01 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 82145.5 ns | 84208 ns | 0.98 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 82749.5 ns | 83646 ns | 0.99 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 86667 ns | 85166 ns | 1.02 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 85000 ns | 128334 ns | 0.66 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 186525 ns | 184384 ns | 1.01 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 5756836 ns | | |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 205352 ns | 203702 ns | 1.01 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1808020.5 ns | 1889375 ns | 0.96 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1915916.5 ns | 1916750 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1905270.5 ns | 1919083 ns | 0.99 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1911375 ns | 1899041 ns | 1.01 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 475978 ns | 379904 ns | 1.25 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 27045542 ns | | |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1069182 ns | 1068311 ns | 1.00 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 292 ns | 291 ns | 1.00 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 291 ns | 291 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 291 ns | 292 ns | 1.00 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA | 18638 ns | 18502 ns | 1.01 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/oneAPI | 2108817.5 ns | | |
| dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/AMDGPU | 42830 ns | 41550.5 ns | 1.03 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 1792 ns | 1750 ns | 1.02 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 1792 ns | 1791 ns | 1.00 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 1833 ns | 1833 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 1791 ns | 1834 ns | 0.98 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA | 225657 ns | 145894.5 ns | 1.55 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/oneAPI | 9833710 ns | | |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/AMDGPU | 182527.5 ns | 181622 ns | 1.00 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 8375 ns | 8458 ns | 0.99 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 9125 ns | 8937.5 ns | 1.02 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 11083 ns | 11208.5 ns | 0.99 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 8041 ns | 8875 ns | 0.91 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 108185.5 ns | 51415 ns | 2.10 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI | 3365841 ns | | |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 232582 ns | 232043 ns | 1.00 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 10167 ns | 9125 ns | 1.11 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 9542 ns | 8667 ns | 1.10 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 10417 ns | 10458.5 ns | 1.00 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 9291 ns | 9583 ns | 0.97 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 420282.5 ns | 241818.5 ns | 1.74 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI | 20467429 ns | | |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 629687 ns | 623402 ns | 1.01 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 57916 ns | 58604.5 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 46583 ns | 39333 ns | 1.18 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 46458 ns | 39792 ns | 1.17 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 83583 ns | 83417 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 32500 ns | 32658 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 1374457 ns | | |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 72281 ns | 79585.5 ns | 0.91 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1911000 ns | 1931459 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1970187.5 ns | 1973750 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1937771 ns | 1980958.5 ns | 0.98 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1899667 ns | 1884875 ns | 1.01 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 176646 ns | 152863 ns | 1.16 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 33503348 ns | | |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1152023 ns | 1040311 ns | 1.11 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 418084 ns | 418333 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 417375 ns | 418709 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 427542 ns | 422000 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 420250 ns | 418583.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 173254.5 ns | 94366 ns | 1.84 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 7736703 ns | | |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 280773 ns | 281763 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 671833.5 ns | 673562.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 766666.5 ns | 753812.5 ns | 1.02 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 684542 ns | 769958 ns | 0.89 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 731041.5 ns | 751938 ns | 0.97 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 887279 ns | 470483 ns | 1.89 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 46741128 ns | | |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 905534.5 ns | 903129 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 3464375 ns | 3419645.5 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 3437833 ns | 3437875 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 3397500 ns | 3451375 ns | 0.98 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 3449958 ns | 3429042 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 148014 ns | 140481 ns | 1.05 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 8945738 ns | | |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 441160 ns | 441684 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 6193666.5 ns | 6220250 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 6178645.5 ns | 6224937 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 6207958 ns | 6214292 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 6230917 ns | 6141041.5 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 821729 ns | 620637 ns | 1.32 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 51511265 ns | | |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1636158 ns | 1629761.5 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 473083.5 ns | 474958 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 342041.5 ns | 253000 ns | 1.35 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 341500 ns | 253292 ns | 1.35 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 902375 ns | 901709 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA | 42882 ns | 43146 ns | 0.99 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/oneAPI | 400566 ns | | |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/AMDGPU | 241152 ns | 241942.5 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 2324750 ns | 2271000 ns | 1.02 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 2038541.5 ns | 1763792 ns | 1.16 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 2032354 ns | 1760167 ns | 1.15 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 3197000 ns | 3188958 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA | 202331 ns | 200260 ns | 1.01 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/oneAPI | 12642725 ns | | |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/AMDGPU | 763338 ns | 764328 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 57520.5 ns | 58125 ns | 0.99 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 46395.5 ns | 39334 ns | 1.18 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 45959 ns | 39750 ns | 1.16 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 83250 ns | 83375 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 23227 ns | 23268 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 1432334 ns | | |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 75651 ns | 74721 ns | 1.01 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2029625 ns | 2035750 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2079979 ns | 2088417 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2070791 ns | 2090333 ns | 0.99 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2000354 ns | 1963541 ns | 1.02 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 191580 ns | 155158 ns | 1.23 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 35959863 ns | | |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1041881.5 ns | 1195637.5 ns | 0.87 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 57291 ns | 58625 ns | 0.98 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 46645.5 ns | 39834 ns | 1.17 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 46625 ns | 40083 ns | 1.16 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 83334 ns | 83042 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 40746 ns | 41354 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 810264 ns | | |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 80396 ns | 77975.5 ns | 1.03 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1890166 ns | 1927125 ns | 0.98 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1976042 ns | 1971541.5 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1971667 ns | 1976833 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1895583 ns | 1885312.5 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 198522 ns | 164726 ns | 1.21 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 17337732 ns | | |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 936080 ns | 1051246 ns | 0.89 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 291 ns | 291 ns | 1 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 291 ns | 333 ns | 0.87 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 416 ns | 416 ns | 1 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 333 ns | 292 ns | 1.14 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 25307.5 ns | 26436 ns | 0.96 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI | 1259521 ns | | |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU | 46650 ns | 46511 ns | 1.00 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6562.5 ns | 7333 ns | 0.89 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 6917 ns | 6500 ns | 1.06 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7292 ns | 6917 ns | 1.05 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6834 ns | 7834 ns | 0.87 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 168328.5 ns | 132779 ns | 1.27 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI | 20601648 ns | | |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 371864 ns | 364088.5 ns | 1.02 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 250 ns | 250 ns | 1 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 291 ns | 291 ns | 1 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 291 ns | 292 ns | 1.00 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 291 ns | 292 ns | 1.00 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA | 30302 ns | 30026 ns | 1.01 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/oneAPI | 1177600.5 ns | | |
| dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/AMDGPU | 37815.5 ns | 40500 ns | 0.93 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 3542 ns | 3250 ns | 1.09 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 2833 ns | 2958 ns | 0.96 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 3250 ns | 3042 ns | 1.07 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 2959 ns | 2792 ns | 1.06 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA | 169119 ns | 139460 ns | 1.21 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/oneAPI | 7614831 ns | | |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/AMDGPU | 152811 ns | 156362 ns | 0.98 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 450021 ns | 453562 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 441041 ns | 426854 ns | 1.03 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 425041.5 ns | 424771 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 422292 ns | 454396.5 ns | 0.93 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 130746.5 ns | 128743 ns | 1.02 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 6115924 ns | | |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 366698.5 ns | 374513 ns | 0.98 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 3801375 ns | 3812646 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 3799958 ns | 3818687.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 3805000 ns | 3824687.5 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 3829062.5 ns | 3809020.5 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 640512 ns | 467612 ns | 1.37 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 35444962 ns | | |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1468321 ns | 1414714 ns | 1.04 |
| batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) | 49831750 ns | 49937813 ns | 1.00 |
| batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) | 35529708 ns | 25988125 ns | 1.37 |
| batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) | 35490875 ns | 26009646 ns | 1.36 |
| batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) | 97095125 ns | 97113375 ns | 1.00 |
| batchedmm(512, Bsize=32)/forward/GPU/CUDA | 1612269 ns | 1610536 ns | 1.00 |
| batchedmm(512, Bsize=32)/forward/GPU/oneAPI | 56680008 ns | | |
| batchedmm(512, Bsize=32)/forward/GPU/AMDGPU | 1041171 ns | 1049471 ns | 0.99 |
| batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) | 154466500.5 ns | 154792729.5 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) | 112376375 ns | 89048958.5 ns | 1.26 |
| batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) | 112311958 ns | 89207416 ns | 1.26 |
| batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) | 295244375 ns | 294786708.5 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/GPU/CUDA | 6476168 ns | 6494841 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/GPU/oneAPI | 174388525 ns | | |
| batchedmm(512, Bsize=32)/zygote/GPU/AMDGPU | 5549710 ns | 5562936 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) | 16979 ns | 18916.5 ns | 0.90 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) | 19562.5 ns | 15584 ns | 1.26 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) | 17188 ns | 14667 ns | 1.17 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) | 15020.5 ns | 15896 ns | 0.94 |
| bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA | 14071 ns | 13971 ns | 1.01 |
| bias_activation(32, act=tanh)(32 x 128)/forward/GPU/oneAPI | 1254861 ns | | |
| bias_activation(32, act=tanh)(32 x 128)/forward/GPU/AMDGPU | 25910 ns | 27630 ns | 0.94 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) | 10520.5 ns | 11291 ns | 0.93 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) | 8709 ns | 7458.5 ns | 1.17 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) | 8917 ns | 7750 ns | 1.15 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) | 17479 ns | 17520.5 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA | 209068 ns | 101782 ns | 2.05 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/oneAPI | 10230351.5 ns | | |
| bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/AMDGPU | 148622 ns | 148192 ns | 1.00 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 7750 ns | 9541.5 ns | 0.81 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 7854.5 ns | 9125.5 ns | 0.86 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 10334 ns | 10333 ns | 1.00 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 7583 ns | 8542 ns | 0.89 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 111568.5 ns | 53666.5 ns | 2.08 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI | 3718095.5 ns | | |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 237553 ns | 235372 ns | 1.01 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 11541.5 ns | 9541 ns | 1.21 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 9687.5 ns | 10209 ns | 0.95 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10708 ns | 10458 ns | 1.02 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 10709 ns | 10250 ns | 1.04 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 501739 ns | 269358 ns | 1.86 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 23065545 ns | | |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 655677 ns | 652326 ns | 1.01 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 8770.5 ns | 9812.5 ns | 0.89 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 9750 ns | 9250 ns | 1.05 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 10583 ns | 10812.5 ns | 0.98 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 8750 ns | 9562.5 ns | 0.92 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 53968 ns | 53391 ns | 1.01 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI | 3498205 ns | | |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 72631 ns | 71711 ns | 1.01 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 13459 ns | 14333 ns | 0.94 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 15479 ns | 14083 ns | 1.10 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 19209 ns | 15167 ns | 1.27 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 14125 ns | 16625 ns | 0.85 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 250540 ns | 251184.5 ns | 1.00 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 20620278 ns | | |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 346043 ns | 344093 ns | 1.01 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 459 ns | 458 ns | 1.00 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 500 ns | 458 ns | 1.09 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 583 ns | 583 ns | 1 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 458 ns | 583 ns | 0.79 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 26861 ns | 27208 ns | 0.99 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI | 1254571 ns | | |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 204762 ns | 203792 ns | 1.00 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7208.5 ns | 8625 ns | 0.84 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 9000 ns | 8125 ns | 1.11 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 9125 ns | 8604.5 ns | 1.06 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 8166 ns | 8416.5 ns | 0.97 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 147122.5 ns | 147255 ns | 1.00 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI | 22634021 ns | | |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 659287 ns | 656126 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 15416 ns | 16625 ns | 0.93 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 16625 ns | 14500 ns | 1.15 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 14917 ns | 13354 ns | 1.12 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 11291 ns | 10229 ns | 1.10 |
| bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA | 13973 ns | 13896.5 ns | 1.01 |
| bias_activation(32, act=gelu)(32 x 128)/forward/GPU/oneAPI | 1108916 ns | | |
| bias_activation(32, act=gelu)(32 x 128)/forward/GPU/AMDGPU | 186562 ns | 186472 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 32000 ns | 31750 ns | 1.01 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 32000 ns | 32000 ns | 1 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 31958 ns | 32042 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 32167 ns | 31833 ns | 1.01 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA | 109160 ns | 110682.5 ns | 0.99 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/oneAPI | 11487029 ns | | |
| bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/AMDGPU | 588817 ns | 592116 ns | 0.99 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 492875 ns | 450209 ns | 1.09 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 442125 ns | 445500 ns | 0.99 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 444958 ns | 444167 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 440604 ns | 462958 ns | 0.95 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 188096.5 ns | 188096.5 ns | 1 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 5891615 ns | | |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 369779 ns | 367068.5 ns | 1.01 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 3834584 ns | 3834209 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 3827292 ns | 3836666 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 3817250 ns | 3847459 ns | 0.99 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 3836104.5 ns | 3828250 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 382999 ns | 383846 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 28452071 ns | | |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1355634 ns | 1358354 ns | 1.00 |
| batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) | 831622791.5 ns | 784152667 ns | 1.06 |
| batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) | 544951167 ns | 416079687.5 ns | 1.31 |
| batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) | 544430500 ns | 422584917 ns | 1.29 |
| batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) | 1552948271 ns | 1509956229 ns | 1.03 |
| batchedmm(512, Bsize=512)/forward/GPU/CUDA | 22763244.5 ns | 22771101.5 ns | 1.00 |
| batchedmm(512, Bsize=512)/forward/GPU/oneAPI | 185795205 ns | | |
| batchedmm(512, Bsize=512)/forward/GPU/AMDGPU | 15420059 ns | 14743999 ns | 1.05 |
| batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) | 3888050458 ns | 2524849666 ns | 1.54 |
| batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) | 3211667750 ns | 1511960000 ns | 2.12 |
| batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) | 1819585250 ns | 1536159417 ns | 1.18 |
| batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) | 4769468292 ns | 4778947333 ns | 1.00 |
| batchedmm(512, Bsize=512)/zygote/GPU/CUDA | 118595684 ns | 119521542 ns | 0.99 |
| batchedmm(512, Bsize=512)/zygote/GPU/oneAPI | 1039230192 ns | | |
| batchedmm(512, Bsize=512)/zygote/GPU/AMDGPU | 88183228 ns | 87915389 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 75333.5 ns | 78208.5 ns | 0.96 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 77458 ns | 80271 ns | 0.96 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 78584 ns | 82708 ns | 0.95 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 76292 ns | 77334 ns | 0.99 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 93335 ns | 93705 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 6083372 ns | | |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 120232 ns | 118801 ns | 1.01 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 279333 ns | 291334 ns | 0.96 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 194937.5 ns | 210333 ns | 0.93 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 234771 ns | 261874.5 ns | 0.90 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 194125 ns | 202208.5 ns | 0.96 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 451188 ns | 458544 ns | 0.98 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 46239896 ns | | |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 657366.5 ns | 662017 ns | 0.99 |
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s)
199509499.5
ns200217604
ns1.00
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s)
139162834
ns103846750
ns1.34
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s)
138977625
ns104247042
ns1.33
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s)
388989959
ns389363833
ns1.00
batchedmm(512, Bsize=128)/forward/GPU/CUDA
5833602
ns5840254.5
ns1.00
batchedmm(512, Bsize=128)/forward/GPU/oneAPI
79568180.5
ns
batchedmm(512, Bsize=128)/forward/GPU/AMDGPU
3573358
ns3591326
ns0.99
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s)
619161479.5
ns620550500
ns1.00
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s)
440796833
ns352840416.5
ns1.25
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s)
439294646
ns353679646
ns1.24
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s)
1189363000
ns1181355417
ns1.01
batchedmm(512, Bsize=128)/zygote/GPU/CUDA
26219564
ns26562043
ns0.99
batchedmm(512, Bsize=128)/zygote/GPU/oneAPI
283162239
ns
batchedmm(512, Bsize=128)/zygote/GPU/AMDGPU
21927537.5
ns22008202.5
ns1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
7417
ns7167
ns1.03
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
6083.5
ns5292
ns1.15
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
6208
ns5458
ns1.14
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
10042
ns10000
ns1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA
21654
ns20844
ns1.04
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI
1302067
ns
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU
48161
ns48671
ns0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
212583
ns245770.5
ns0.86
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
228396
ns243083
ns0.94
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
222250
ns221208
ns1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
213166.5
ns207979
ns1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
136607.5
ns137816.5
ns0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI
29564519.5
ns
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU
524845
ns523805
ns1.00
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s)
9833.5
ns8334
ns1.18
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s)
7979
ns8166.5
ns0.98
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s)
10250
ns11041
ns0.93
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s)
7978.5
ns9020.5
ns0.88
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA
51011
ns50777
ns1.00
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI
3317085
ns
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU
69811
ns69381
ns1.01
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s)
7333.5
ns8875
ns0.83
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s)
9625
ns8583
ns1.12
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s)
13562.5
ns8166
ns1.66
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s)
8250
ns10854.5
ns0.76
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA
242632
ns245858
ns0.99
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI
19151322
ns
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU
316738.5
ns312998.5
ns1.01
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s)
416
ns500
ns0.83
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s)
708
ns500
ns1.42
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s)
500
ns625
ns0.80
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s)
500
ns584
ns0.86
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA
19752
ns19411
ns1.02
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI
1203125.5
ns
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU
46481
ns48630
ns0.96
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
8625
ns10333
ns0.83
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
10249.5
ns11375
ns0.90
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
9500
ns9770.5
ns0.97
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
10292
ns9708
ns1.06
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA
119755.5
ns120697
ns0.99
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI
25677237
ns
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU
388684
ns388289
ns1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s)
105959
ns105500
ns1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s)
98500
ns85875
ns1.15
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s)
101021
ns87000
ns1.16
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s)
146271
ns146333.5
ns1.00
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA
16996
ns16870
ns1.01
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/oneAPI
756914
ns
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/AMDGPU
190327
ns190057
ns1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s)
478333
ns478500
ns1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s)
509583
ns485458
ns1.05
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s)
478459
ns481521
ns0.99
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s)
478458.5
ns478833
ns1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA
113991
ns117100
ns0.97
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/oneAPI
12514796
ns
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/AMDGPU
604977
ns608201.5
ns0.99
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s)
5375
ns5959
ns0.90
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s)
5333
ns6625
ns0.80
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s)
7208
ns7479.5
ns0.96
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s)
6729
ns6229.5
ns1.08
batchedmm(16, Bsize=32)/forward/GPU/CUDA
15434
ns14736
ns1.05
batchedmm(16, Bsize=32)/forward/GPU/oneAPI
73679048
ns
batchedmm(16, Bsize=32)/forward/GPU/AMDGPU
79381
ns79970
ns0.99
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s)
12375
ns13500
ns0.92
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s)
11000
ns9750
ns1.13
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s)
10875
ns10167
ns1.07
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s)
16625
ns17125
ns0.97
batchedmm(16, Bsize=32)/zygote/GPU/CUDA
108121
ns109548
ns0.99
batchedmm(16, Bsize=32)/zygote/GPU/oneAPI
100453387
ns
batchedmm(16, Bsize=32)/zygote/GPU/AMDGPU
364504
ns366884
ns0.99
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s)
39375
ns40458
ns0.97
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s)
51917
ns50417
ns1.03
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s)
52770.5
ns51354
ns1.03
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s)
13604
ns13667
ns1.00
batchedmm(16, Bsize=128)/forward/GPU/CUDA
20011
ns20278.5
ns0.99
batchedmm(16, Bsize=128)/forward/GPU/oneAPI
79258230
ns
batchedmm(16, Bsize=128)/forward/GPU/AMDGPU
85481
ns85591
ns1.00
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s)
36271
ns37250
ns0.97
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s)
35313
ns29541
ns1.20
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s)
31291.5
ns29875
ns1.05
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s)
57750
ns57562.5
ns1.00
batchedmm(16, Bsize=128)/zygote/GPU/CUDA
121997.5
ns119274.5
ns1.02
batchedmm(16, Bsize=128)/zygote/GPU/oneAPI
113144013
ns
batchedmm(16, Bsize=128)/zygote/GPU/AMDGPU
410244.5
ns395964
ns1.04
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s)
1584
ns1833
ns0.86
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s)
1750
ns1667
ns1.05
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s)
2250
ns2291
ns0.98
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s)
1687.5
ns2041.5
ns0.83
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA
13818
ns13524
ns1.02
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/oneAPI
1224877
ns
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/AMDGPU
32640
ns32690
ns1.00
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s)
2166
ns2167
ns1.00
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s)
2292
ns2145.5
ns1.07
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s)
2417
ns2395.5
ns1.01
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s)
2250
ns2312.5
ns0.97
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA
89827
ns89460.5
ns1.00
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/oneAPI
9149897
ns
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/AMDGPU
136461
ns136351
ns1.00
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s)
5666.5
ns6104
ns0.93
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s)
4896
ns4708.5
ns1.04
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s)
6333.5
ns6187.5
ns1.02
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s)
5375
ns5874.5
ns0.91
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA
59437.5
ns58659.5
ns1.01
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI
5810721.5
ns
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU
68755.5
ns67281
ns1.02
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
8167
ns9083.5
ns0.90
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
8542
ns9000
ns0.95
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
8500
ns8709
ns0.98
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
9042
ns8750
ns1.03
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA
383098.5
ns386636
ns0.99
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI
38586019
ns
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU
387674
ns384884
ns1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
56708
ns56916
ns1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
57666
ns56833
ns1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
57625
ns56958
ns1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
58250
ns58291
ns1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
30235
ns29539
ns1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI
1254024
ns
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU
204092
ns203102.5
ns1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
448000
ns453791.5
ns0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
472083.5
ns466875
ns1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
465125
ns465666.5
ns1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
436541.5
ns436208
ns1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
170026
ns167893
ns1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI
28109365.5
ns
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU
826388
ns823238
ns1.00
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s)
3312500
ns3327646
ns1.00
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s)
2340084
ns1773958
ns1.32
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s)
2339583.5
ns1770208
ns1.32
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s)
6318792
ns6318167
ns1.00
batchedmm(128, Bsize=128)/forward/GPU/CUDA
204725
ns203665
ns1.01
batchedmm(128, Bsize=128)/forward/GPU/oneAPI
83409682
ns
batchedmm(128, Bsize=128)/forward/GPU/AMDGPU
240632
ns213597.5
ns1.13
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s)
11441604
ns11522375
ns0.99
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s)
8301208
ns6550792
ns1.27
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s)
8329792
ns6579708.5
ns1.27
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s)
21184729.5
ns21256687.5
ns1.00
batchedmm(128, Bsize=128)/zygote/GPU/CUDA
760406.5
ns761872
ns1.00
batchedmm(128, Bsize=128)/zygote/GPU/oneAPI
125395684.5
ns
batchedmm(128, Bsize=128)/zygote/GPU/AMDGPU
1063686
ns1057191
ns1.01
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s)
5666
ns6667
ns0.85
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s)
5604.5
ns4917
ns1.14
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s)
6438
ns7000
ns0.92
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s)
6312.5
ns5166
ns1.22
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA
57453
ns57961.5
ns0.99
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI
5296827
ns
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU
56241
ns56041
ns1.00
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s)
7125
ns11458
ns0.62
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s)
7333
ns8750
ns0.84
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s)
7250
ns7541
ns0.96
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s)
8292
ns8625
ns0.96
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA
367190
ns382208
ns0.96
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI
35508394
ns
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU
362159
ns361754
ns1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
140708
ns126917
ns1.11
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
123917
ns102541
ns1.21
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
100667
ns101792
ns0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
104958
ns98333
ns1.07
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA
127546.5
ns127201
ns1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI
6179687.5
ns
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU
206197
ns206327
ns1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
1992625
ns2039750.5
ns0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
2016083.5
ns2028645.5
ns0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
2019875
ns2040937.5
ns0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
2026687.5
ns1948458
ns1.04
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
432468
ns443232
ns0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI
33529611.5
ns
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU
1184812.5
ns1211817
ns0.98
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s)
32208.5
ns33542
ns0.96
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s)
37167
ns34416
ns1.08
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s)
35833
ns34583
ns1.04
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s)
583
ns625
ns0.93
batchedmm(2, Bsize=4)/forward/GPU/CUDA
13995
ns13510
ns1.04
batchedmm(2, Bsize=4)/forward/GPU/oneAPI
74212471
ns
batchedmm(2, Bsize=4)/forward/GPU/AMDGPU
79370
ns79871
ns0.99
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s)
2645.5
ns3750
ns0.71
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s)
2750
ns3209
ns0.86
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s)
3020.5
ns3041
ns0.99
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s)
2333
ns2333
ns1.00
batchedmm(2, Bsize=4)/zygote/GPU/CUDA
92140
ns89708.5
ns1.03
batchedmm(2, Bsize=4)/zygote/GPU/oneAPI
94219800
ns
batchedmm(2, Bsize=4)/zygote/GPU/AMDGPU
341683
ns340203
ns1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
7167
ns7209
ns0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
6000
ns5292
ns1.13
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
6083
ns5417
ns1.12
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
10000
ns10042
ns1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
29283
ns29375
ns1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI
1222925.5
ns
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU
48131
ns49300
ns0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
248271
ns222374.5
ns1.12
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
221125
ns221270.5
ns1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
221042
ns221458
ns1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
216625
ns206500
ns1.05
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
158765.5
ns159760
ns0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI
26714332
ns
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU
569975.5
ns572920.5
ns0.99
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s)
3959
ns3958
ns1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s)
4000
ns3917
ns1.02
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s)
3958
ns3958
ns1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s)
3917
ns3958
ns0.99
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA
18821
ns18490
ns1.02
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/oneAPI
2189549
ns
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/AMDGPU
41970
ns43450
ns0.97
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s)
14958
ns14667
ns1.02
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s)
15125
ns14666
ns1.03
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s)
14917
ns14709
ns1.01
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s)
14666
ns14708
ns1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA
163909.5
ns165588
ns0.99
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/oneAPI
11534435
ns
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/AMDGPU
192582
ns197842
ns0.97
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
146416
ns130708
ns1.12
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
103750
ns101313
ns1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
103791
ns105000.5
ns0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
100208
ns106666.5
ns0.94
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA
127104
ns125911
ns1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI
6092161.5
ns
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU
207422.5
ns204662
ns1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
1791000
ns1925042
ns0.93
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
1909958
ns1928041
ns0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
1910875
ns1930583
ns0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
1922250
ns1855291
ns1.04
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
418877
ns429902
ns0.97
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI
29586225
ns
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU
1089381
ns1148786.5
ns0.95
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
17291
ns18166
ns0.95
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
22583
ns18979
ns1.19
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
21062.5
ns22458
ns0.94
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
17417
ns18125
ns0.96
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
61422.5
ns63187.5
ns0.97
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI
3492174
ns
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU
80420
ns79155.5
ns1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
216354.5
ns252792
ns0.86
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
256145.5
ns261875
ns0.98
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
216500
ns219958
ns0.98
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
219146
ns217125
ns1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
272581
ns279978
ns0.97
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI
19535498.5
ns
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU
477435
ns475684
ns1.00
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s)
26813
ns24729.5
ns1.08
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s)
31333
ns28125
ns1.11
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s)
28812.5
ns27000
ns1.07
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s)
1312
ns1375
ns0.95
batchedmm(16, Bsize=4)/forward/GPU/CUDA
14764
ns13843
ns1.07
batchedmm(16, Bsize=4)/forward/GPU/oneAPI
75193108
ns
batchedmm(16, Bsize=4)/forward/GPU/AMDGPU
81295.5
ns81051
ns1.00
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s)
5000
ns5479.5
ns0.91
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s)
4833.5
ns5167
ns0.94
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s)
5083.5
ns5270.5
ns0.96
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s)
4854
ns4708
ns1.03
batchedmm(16, Bsize=4)/zygote/GPU/CUDA
110219
ns110586.5
ns1.00
batchedmm(16, Bsize=4)/zygote/GPU/oneAPI
96235287
ns
batchedmm(16, Bsize=4)/zygote/GPU/AMDGPU
380423.5
ns379244
ns1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
304917
ns308792
ns0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
306417
ns305625
ns1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
307500
ns307291
ns1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
306312
ns306834
ns1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
96559
ns102299
ns0.94
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI
8040746
ns
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU
273553
ns272803
ns1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
534959
ns544417
ns0.98
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
578875
ns575000
ns1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
532250
ns545958.5
ns0.97
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
532292
ns538167
ns0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
478700.5
ns500049
ns0.96
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI
45273096.5
ns
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU
854594
ns849309
ns1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
18833
ns22000
ns0.86
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
21500
ns21083
ns1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
21500
ns22042
ns0.98
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
18729
ns19667
ns0.95
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA
61059
ns64471.5
ns0.95
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI
3648054
ns
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU
79701
ns78011
ns1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
225292
ns226000
ns1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
215459
ns245604
ns0.88
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
214416
ns215584
ns0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
215625
ns212791
ns1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
315781.5
ns344357
ns0.92
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI
25685453.5
ns
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU
536640.5
ns535535
ns1.00
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s)
6875
ns7542
ns0.91
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s)
6729
ns5791.5
ns1.16
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s)
7875.5
ns8416
ns0.94
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s)
6187
ns7167
ns0.86
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA
59473
ns63232
ns0.94
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI
5742399
ns
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU
65660
ns65391
ns1.00
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
9875
ns13667
ns0.72
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
10541.5
ns11916
ns0.88
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
10542
ns10125
ns1.04
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
11395.5
ns10041
ns1.13
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA
375474.5
ns396144.5
ns0.95
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI
37560344
ns
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU
385404
ns386814
ns1.00
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s)
4958
ns6541.5
ns0.76
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s)
5792
ns4666
ns1.24
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s)
6937
ns6500
ns1.07
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s)
4813
ns7042
ns0.68
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA
59336
ns64824
ns0.92
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI
5881412.5
ns
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU
66901
ns68750
ns0.97
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s)
7333
ns8083
ns0.91
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s)
7292
ns8166
ns0.89
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s)
7625
ns7708
ns0.99
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s)
7917
ns7583
ns1.04
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA
400389
ns423945
ns0.94
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI
41438719.5
ns
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU
390804
ns394914
ns0.99
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s)
14514708
ns14516708
ns1.00
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s)
10142334
ns7713187.5
ns1.31
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s)
10128041
ns7704854
ns1.31
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s)
27891250
ns27801334
ns1.00
batchedmm(128, Bsize=512)/forward/GPU/CUDA
532579.5
ns531151.5
ns1.00
batchedmm(128, Bsize=512)/forward/GPU/oneAPI
99192089
ns
batchedmm(128, Bsize=512)/forward/GPU/AMDGPU
394344
ns393889
ns1.00
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s)
46256625
ns46558771.5
ns0.99
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s)
33475978.5
ns26529584
ns1.26
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s)
33502666
ns26598312
ns1.26
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s)
85530791
ns85686792
ns1.00
batchedmm(128, Bsize=512)/zygote/GPU/CUDA
3411776.5
ns3208907
ns1.06
batchedmm(128, Bsize=512)/zygote/GPU/oneAPI
197868624
ns
batchedmm(128, Bsize=512)/zygote/GPU/AMDGPU
3281874
ns3300533
ns0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
66792
ns67833
ns0.98
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
67791
ns65625
ns1.03
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
69583
ns69333.5
ns1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
66542
ns67292
ns0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA
63585
ns68650
ns0.93
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI
3635639
ns
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU
238943
ns232393
ns1.03
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
482792
ns450333
ns1.07
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
490208.5
ns453834
ns1.08
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
443416
ns446417
ns0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
443250
ns441584
ns1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
333625.5
ns394734
ns0.85
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI
27824814.5
ns
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU
796928.5
ns788457.5
ns1.01
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s)
500
ns625
ns0.80
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s)
584
ns542
ns1.08
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s)
584
ns625
ns0.93
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s)
541
ns583
ns0.93
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA
26261
ns26112
ns1.01
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI
1201042
ns
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU
46591
ns47140
ns0.99
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
9624.5
ns10542
ns0.91
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
9458
ns9583
ns0.99
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
9042
ns9250
ns0.98
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
15042
ns10708
ns1.40
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA
154087.5
ns152524.5
ns1.01
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI
22365155
ns
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU
376374
ns373324
ns1.01
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s)
9792
ns9833
ns1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s)
9875
ns9792
ns1.01
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s)
9834
ns9833
ns1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s)
9792
ns9833
ns1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA
21245
ns20835
ns1.02
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/oneAPI
2109275
ns
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/AMDGPU
207407
ns208092
ns1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s)
45834
ns46333
ns0.99
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s)
46083
ns45833
ns1.01
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s)
48125
ns46000
ns1.05
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s)
46000
ns45959
ns1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA
181985
ns189222
ns0.96
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/oneAPI
12501764
nsdense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/AMDGPU
599026
ns603691
ns0.99
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 56292 ns | 56334 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 57208 ns | 56375 ns | 1.01 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 57083 ns | 56458 ns | 1.01 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 57875 ns | 57875 ns | 1 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 22599 ns | 21828 ns | 1.04 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 1231700.5 ns | | |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 210062.5 ns | 202032 ns | 1.04 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 491041.5 ns | 464834 ns | 1.06 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 503250 ns | 474250.5 ns | 1.06 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 465875 ns | 465771 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 440959 ns | 434770.5 ns | 1.01 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 153666 ns | 162400 ns | 0.95 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 33436886 ns | | |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 880644 ns | 877129 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 646396 ns | 651104.5 ns | 0.99 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 656479 ns | 683542 ns | 0.96 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 592854.5 ns | 656292 ns | 0.90 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 616145.5 ns | 616541.5 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 128259.5 ns | 140209 ns | 0.91 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 8403444.5 ns | | |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 302363 ns | 305778 ns | 0.99 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2232145.5 ns | 2262562.5 ns | 0.99 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2230708 ns | 2231521 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2231875 ns | 2245125 ns | 0.99 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2259375 ns | 2244604.5 ns | 1.01 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 617840 ns | 644538 ns | 0.96 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 50658009 ns | | |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1318863 ns | 1307248 ns | 1.01 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 22542 ns | 21625 ns | 1.04 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 19458 ns | 20833 ns | 0.93 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 22583 ns | 23208 ns | 0.97 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 19458 ns | 20125 ns | 0.97 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 64266 ns | 69407.5 ns | 0.93 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 3671624.5 ns | | |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 79151 ns | 78811 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 224000 ns | 233042 ns | 0.96 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 254083 ns | 233125 ns | 1.09 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 221708 ns | 221333 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 220750 ns | 224875 ns | 0.98 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 347077 ns | 410361 ns | 0.85 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 25817148 ns | | |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 554175.5 ns | 557581 ns | 0.99 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 541 ns | 625 ns | 0.87 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 583 ns | 500 ns | 1.17 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 584 ns | 625 ns | 0.93 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 500 ns | 625 ns | 0.80 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 18626 ns | 18190 ns | 1.02 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI | 1230354 ns | | |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 48171 ns | 47870 ns | 1.01 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 8917 ns | 9812.5 ns | 0.91 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 9875 ns | 9250 ns | 1.07 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 9812.5 ns | 10042 ns | 0.98 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 9270.5 ns | 10000 ns | 0.93 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 136039.5 ns | 136633 ns | 1.00 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 29131754 ns | | |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 399044 ns | 397114 ns | 1.00 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 8042 ns | 8958 ns | 0.90 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 9312.5 ns | 8438 ns | 1.10 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 11292 ns | 10750 ns | 1.05 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 8334 ns | 8084 ns | 1.03 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 57821.5 ns | 64696.5 ns | 0.89 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI | 3436196.5 ns | | |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU | 69710.5 ns | 71891 ns | 0.97 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7125 ns | 7666 ns | 0.93 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7791 ns | 7250 ns | 1.07 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7875 ns | 8417 ns | 0.94 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7417 ns | 7708 ns | 0.96 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 255977.5 ns | 292900 ns | 0.87 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI | 18497059 ns | | |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 319743 ns | 318078 ns | 1.01 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 1375 ns | 1542 ns | 0.89 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 1645.5 ns | 1458 ns | 1.13 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 1917 ns | 2208 ns | 0.87 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 1500 ns | 1708 ns | 0.88 |
| bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA | 13693.5 ns | 13397 ns | 1.02 |
| bias_activation(2, act=gelu)(2 x 128)/forward/GPU/oneAPI | 1186814 ns | | |
| bias_activation(2, act=gelu)(2 x 128)/forward/GPU/AMDGPU | 188882 ns | 188372 ns | 1.00 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 3291 ns | 3312.5 ns | 0.99 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 3479.5 ns | 3375 ns | 1.03 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 3625 ns | 3667 ns | 0.99 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 3291 ns | 3375 ns | 0.98 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA | 104102.5 ns | 117821 ns | 0.88 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/oneAPI | 10382640 ns | | |
| bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/AMDGPU | 575736 ns | 578906 ns | 0.99 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) | 146979 ns | 147437.5 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) | 129042 ns | 106312.5 ns | 1.21 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) | 129875 ns | 107750 ns | 1.21 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) | 226000 ns | 226021 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA | 17312 ns | 16777 ns | 1.03 |
| bias_activation(512, act=tanh)(512 x 128)/forward/GPU/oneAPI | 1216751.5 ns | | |
| bias_activation(512, act=tanh)(512 x 128)/forward/GPU/AMDGPU | 39935.5 ns | 40540 ns | 0.99 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) | 159771 ns | 163417 ns | 0.98 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) | 110521 ns | 106833 ns | 1.03 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) | 136250 ns | 98125 ns | 1.39 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) | 251666.5 ns | 251458 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA | 118480.5 ns | 141681 ns | 0.84 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/oneAPI | 10669966 ns | | |
| bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/AMDGPU | 265838 ns | 266553 ns | 1.00 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7292 ns | 7292 ns | 1 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 6041 ns | 5333 ns | 1.13 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6042 ns | 5375 ns | 1.12 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10250 ns | 10209 ns | 1.00 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 26774 ns | 26669.5 ns | 1.00 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 1208039.5 ns | | |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 48681 ns | 48681 ns | 1 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 219937.5 ns | 256208 ns | 0.86 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 227375 ns | 258709 ns | 0.88 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 228667 ns | 231395.5 ns | 0.99 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 212729.5 ns | 224896 ns | 0.95 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 177762.5 ns | 185868.5 ns | 0.96 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 28372083 ns | | |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 589856 ns | 589590.5 ns | 1.00 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 15958 ns | 16125 ns | 0.99 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 16208.5 ns | 14750 ns | 1.10 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 16687.5 ns | 17000 ns | 0.98 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 14792 ns | 15375 ns | 0.96 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 63622 ns | 76403.5 ns | 0.83 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI | 5760147.5 ns | | |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 227543 ns | 230202 ns | 0.99 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 23916 ns | 24416 ns | 0.98 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 24500 ns | 23708 ns | 1.03 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 23458 ns | 23792 ns | 0.99 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 23000 ns | 23417 ns | 0.98 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 431176 ns | 496390.5 ns | 0.87 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 42796325.5 ns | | |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 675657 ns | 676296.5 ns | 1.00 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 9167 ns | 10334 ns | 0.89 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 9834 ns | 9375 ns | 1.05 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 11021 ns | 11666.5 ns | 0.94 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 8729.5 ns | 9292 ns | 0.94 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 64004 ns | 81566 ns | 0.78 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI | 3525023 ns | | |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU | 73491 ns | 72771 ns | 1.01 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 14292 ns | 14333 ns | 1.00 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 13729 ns | 13666.5 ns | 1.00 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 14208 ns | 14729.5 ns | 0.96 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 13459 ns | 14750 ns | 0.91 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 323073 ns | 412717 ns | 0.78 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI | 21480471 ns | | |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 371404 ns | 362433 ns | 1.02 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 8083 ns | 8917 ns | 0.91 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 10416.5 ns | 9750 ns | 1.07 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 10937.5 ns | 11896 ns | 0.92 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 9333 ns | 9542 ns | 0.98 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 66250 ns | 84716 ns | 0.78 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI | 3712952 ns | | |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU | 74871 ns | 71721 ns | 1.04 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 12708 ns | 13250 ns | 0.96 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 13020.5 ns | 12521 ns | 1.04 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 13333.5 ns | 13542 ns | 0.98 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 12417 ns | 12875 ns | 0.96 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 286792 ns | 346105 ns | 0.83 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI | 19725639 ns | | |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 340593.5 ns | 338603.5 ns | 1.01 |
| batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) | 29166 ns | 31041.5 ns | 0.94 |
| batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) | 34604 ns | 32438 ns | 1.07 |
| batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) | 32229.5 ns | 29625 ns | 1.09 |
| batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) | 1750 ns | 2167 ns | 0.81 |
| batchedmm(2, Bsize=128)/forward/GPU/CUDA | 15001 ns | 14504 ns | 1.03 |
| batchedmm(2, Bsize=128)/forward/GPU/oneAPI | 78965877 ns | | |
| batchedmm(2, Bsize=128)/forward/GPU/AMDGPU | 86890 ns | 80601 ns | 1.08 |
| batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) | 5125 ns | 5250 ns | 0.98 |
| batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) | 5062.5 ns | 4750 ns | 1.07 |
| batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) | 5167 ns | 5208 ns | 0.99 |
| batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) | 6292 ns | 6541 ns | 0.96 |
| batchedmm(2, Bsize=128)/zygote/GPU/CUDA | 99425.5 ns | 107471 ns | 0.93 |
| batchedmm(2, Bsize=128)/zygote/GPU/oneAPI | 110379934 ns | | |
| batchedmm(2, Bsize=128)/zygote/GPU/AMDGPU | 383544 ns | 370164 ns | 1.04 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 292 ns | 292 ns | 1 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 334 ns | 291 ns | 1.15 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 375 ns | 1 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 292 ns | 375 ns | 0.78 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 19905 ns | 18911 ns | 1.05 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI | 1150337 ns | | |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU | 48921 ns | 46920 ns | 1.04 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 6292 ns | 6542 ns | 0.96 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 6458 ns | 6292 ns | 1.03 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 6708 ns | 6958.5 ns | 0.96 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6208 ns | 6708 ns | 0.93 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 127212 ns | 135126 ns | 0.94 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI | 23911059 ns | | |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 394834 ns | 386254 ns | 1.02 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 1958 ns | 2000 ns | 0.98 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 2041 ns | 1958 ns | 1.04 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 2042 ns | 2083 ns | 0.98 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 1958 ns | 2042 ns | 0.96 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 20510 ns | 20048 ns | 1.02 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI | 1241051 ns | | |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 210527 ns | 204122 ns | 1.03 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 16896 ns | 16937.5 ns | 1.00 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 17125 ns | 17042 ns | 1.00 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 16750 ns | 17000 ns | 0.99 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 15875 ns | 15875 ns | 1 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 143434 ns | 151188.5 ns | 0.95 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 25814251 ns | | |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 704697 ns | 698796.5 ns | 1.01 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 174541 ns | 150292 ns | 1.16 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 147000 ns | 188375 ns | 0.78 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 152688 ns | 152834 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 150916 ns | 152750 ns | 0.99 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 165022.5 ns | 169794 ns | 0.97 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 7825451.5 ns | | |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 226202.5 ns | 225092 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1318770.5 ns | 1328166 ns | 0.99 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1320292 ns | 1339625 ns | 0.99 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1329500 ns | 1339979 ns | 0.99 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1333791.5 ns | 1321375 ns | 1.01 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 605687 ns | 738732.5 ns | 0.82 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 46439481 ns | | |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1061786 ns | 1067311 ns | 0.99 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 25084 ns | 26042 ns | 0.96 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 25042 ns | 25313 ns | 0.99 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 27854 ns | 28208 ns | 0.99 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 24749.5 ns | 25750 ns | 0.96 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 119756.5 ns | 179072 ns | 0.67 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 7727314 ns | | |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 116991 ns | 113981 ns | 1.03 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 131479 ns | 181083.5 ns | 0.73 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 171708 ns | 169917 ns | 1.01 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 127521 ns | 118875 ns | 1.07 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 117479 ns | 125563 ns | 0.94 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 551436.5 ns | 736737.5 ns | 0.75 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 45901726 ns | | |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 610436 ns | 606996 ns | 1.01 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 291 ns | 375 ns | 0.78 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 292 ns | 1.28 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 333 ns | 1.13 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 292 ns | 375 ns | 0.78 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 17730.5 ns | 17782 ns | 1.00 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI | 1203450 ns | | |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 48751 ns | 47020 ns | 1.04 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 6416.5 ns | 6917 ns | 0.93 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 6542 ns | 6500 ns | 1.01 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 6833 ns | 7270.5 ns | 0.94 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6167 ns | 6958 ns | 0.89 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 136341 ns | 149426 ns | 0.91 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 25006525.5 ns | | |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 393949 ns | 389994 ns | 1.01 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 6666 ns | 6209 ns | 1.07 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 6625 ns | 5708 ns | 1.16 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 6917 ns | 7666 ns | 0.90 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 5666 ns | 5958 ns | 0.95 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 72501 ns | 100369 ns | 0.72 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI | 5996889 ns | | |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 233483 ns | 231643 ns | 1.01 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 9917 ns | 10083 ns | 0.98 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 10062.5 ns | 9666.5 ns | 1.04 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10250 ns | 10333 ns | 0.99 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 9875 ns | 10125 ns | 0.98 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 493431 ns | 656519 ns | 0.75 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 41270237 ns | | |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 675326 ns | 676037 ns | 1.00 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 666 ns | 708 ns | 0.94 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 667 ns | 667 ns | 1 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 667 ns | 667 ns | 1 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 667 ns | 667 ns | 1 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA | 20029 ns | 20098 ns | 1.00 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/oneAPI | 2098576 ns | | |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/AMDGPU | 207872.5 ns | 205502 ns | 1.01 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 4542 ns | 4667 ns | 0.97 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 4625 ns | 4584 ns | 1.01 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 4791 ns | 4875 ns | 0.98 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 4625 ns | 4709 ns | 0.98 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA | 167220.5 ns | 183686.5 ns | 0.91 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/oneAPI | 9409031.5 ns | | |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/AMDGPU | 577916 ns | 577406 ns | 1.00 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 7854 ns | 8062 ns | 0.97 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 8875 ns | 8083 ns | 1.10 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 9750 ns | 10062 ns | 0.97 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 7334 ns | 7979.5 ns | 0.92 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 72767.5 ns | 112521 ns | 0.65 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI | 3713250 ns | | |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU | 77435.5 ns | 75781 ns | 1.02 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8167 ns | 8625 ns | 0.95 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8354.5 ns | 8750 ns | 0.95 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 9041 ns | 9459 ns | 0.96 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8417 ns | 8959 ns | 0.94 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 372607.5 ns | 542270 ns | 0.69 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI | 21133871 ns | | |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 345814 ns | 339298.5 ns | 1.02 |
| batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) | 125395.5 ns | 126979.5 ns | 0.99 |
| batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) | 129042 ns | 100291 ns | 1.29 |
| batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) | 129959 ns | 97208 ns | 1.34 |
| batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) | 180916 ns | 180729.5 ns | 1.00 |
| batchedmm(128, Bsize=4)/forward/GPU/CUDA | 44539 ns | 44342 ns | 1.00 |
| batchedmm(128, Bsize=4)/forward/GPU/oneAPI | 75228887 ns | | |
| batchedmm(128, Bsize=4)/forward/GPU/AMDGPU | 100291 ns | 101011 ns | 0.99 |
| batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) | 310917 ns | 340250 ns | 0.91 |
| batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) | 313833 ns | 192146 ns | 1.63 |
| batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) | 324083.5 ns | 167166 ns | 1.94 |
| batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) | 600354 ns | 573958.5 ns | 1.05 |
| batchedmm(128, Bsize=4)/zygote/GPU/CUDA | 150808 ns | 199334 ns | 0.76 |
| batchedmm(128, Bsize=4)/zygote/GPU/oneAPI | 91943409 ns | | |
| batchedmm(128, Bsize=4)/zygote/GPU/AMDGPU | 502450 ns | 515465 ns | 0.97 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 396750 ns | 399208 ns | 0.99 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 288145.5 ns | 215250 ns | 1.34 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 287583 ns | 215625 ns | 1.33 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 756625 ns | 756875 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA | 40964 ns | 40054 ns | 1.02 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/oneAPI | 1391370 ns | | |
| dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/AMDGPU | 81511 ns | 80551 ns | 1.01 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 1449583.5 ns | 1406459 ns | 1.03 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 1136667 ns | 862312 ns | 1.32 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 1134771 ns | 864000 ns | 1.31 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 2361041.5 ns | 2359542 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA | 207930 ns | 234952 ns | 0.88 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/oneAPI | 10356148 ns | | |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/AMDGPU | 355144 ns | 353324 ns | 1.01 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 647292 ns | 659917 ns | 0.98 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 578500 ns | 658270.5 ns | 0.88 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 639416 ns | 624271 ns | 1.02 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 656333 ns | 677791.5 ns | 0.97 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 154081 ns | 196665.5 ns | 0.78 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 8771052.5 ns | | |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 306073.5 ns | 305543 ns | 1.00 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2453625 ns | 2481875 ns | 0.99 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2424291 ns | 2467479.5 ns | 0.98 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2442542 ns | 2476313 ns | 0.99 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2470583 ns | 2446833 ns | 1.01 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 767152.5 ns | 984615.5 ns | 0.78 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 52532777 ns | | |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1399204.5 ns | 1399689 ns | 1.00 |
| batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) | 32604 ns | 34062.5 ns | 0.96 |
| batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) | 36937.5 ns | 34666.5 ns | 1.07 |
| batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) | 34542 ns | 32791.5 ns | 1.05 |
| batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) | 917 ns | 958 ns | 0.96 |
| batchedmm(2, Bsize=32)/forward/GPU/CUDA | 14042 ns | 14044 ns | 1.00 |
| batchedmm(2, Bsize=32)/forward/GPU/oneAPI | 78232811.5 ns | | |
| batchedmm(2, Bsize=32)/forward/GPU/AMDGPU | 79530 ns | 84401 ns | 0.94 |
| batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) | 3000 ns | 3166.5 ns | 0.95 |
| batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) | 3084 ns | 3166 ns | 0.97 |
| batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) | 3333 ns | 3500 ns | 0.95 |
| batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) | 3000 ns | 3250 ns | 0.92 |
| batchedmm(2, Bsize=32)/zygote/GPU/CUDA | 100283 ns | 121484 ns | 0.83 |
| batchedmm(2, Bsize=32)/zygote/GPU/oneAPI | 96545751 ns | | |
| batchedmm(2, Bsize=32)/zygote/GPU/AMDGPU | 337334 ns | 362074 ns | 0.93 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 405958 ns | 406584 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 408209 ns | 402458 ns | 1.01 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 407958 ns | 403000 ns | 1.01 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 421459 ns | 420645.5 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 36148.5 ns | 36583 ns | 0.99 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 1554049.5 ns | | |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 238757.5 ns | 238852 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 3868375 ns | 3879583 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 3988562.5 ns | 3983541.5 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 3992667 ns | 3998250 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 3776708.5 ns | 3674250 ns | 1.03 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 193888 ns | 237279.5 ns | 0.82 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 37305285.5 ns | | |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1433244 ns | 1428125 ns | 1.00 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3917 ns | 3917 ns | 1 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3917 ns | 3917 ns | 1 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3917 ns | 3917 ns | 1 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3917 ns | 3917 ns | 1 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA | 32924.5 ns | 32312 ns | 1.02 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/oneAPI | 1242082 ns | | |
| dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/AMDGPU | 37990 ns | 38101 ns | 1.00 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 15708 ns | 15459 ns | 1.02 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 15750 ns | 15459 ns | 1.02 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 15958 ns | 15666 ns | 1.02 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 15500 ns | 15458 ns | 1.00 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA | 189381 ns | 242437.5 ns | 0.78 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/oneAPI | 9458441 ns | | |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/AMDGPU | 169642 ns | 167902 ns | 1.01 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s)
404708
ns404458
ns1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s)
295750
ns221625
ns1.33
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s)
295958
ns221375
ns1.34
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s)
761125
ns760125
ns1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA
117898
ns117928
ns1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/oneAPI
1045095
nsdense(512, bias=false, act=relu)(512 x 128)/forward/GPU/AMDGPU
87241
ns87841
ns0.99
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s)
1478500
ns1429417
ns1.03
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s)
1159645.5
ns887583
ns1.31
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s)
1158042
ns887396
ns1.30
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s)
2384583
ns2378208
ns1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA
189114
ns192870.5
ns0.98
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/oneAPI
9516529.5
nsdense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/AMDGPU
351188
ns353053
ns0.99
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s)
500
ns542
ns0.92
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s)
583
ns458
ns1.27
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s)
583
ns583
ns1
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s)
500
ns583
ns0.86
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA
18799
ns19335
ns0.97
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI
1188091
nsbatchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU
205912
ns205012
ns1.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
7292
ns7500
ns0.97
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
7583
ns7250
ns1.05
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
7917
ns8166
ns0.97
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
7375
ns8167
ns0.90
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA
141427.5
ns165885
ns0.85
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI
26708173
nsbatchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU
683937
ns683986
ns1.00
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s)
833042
ns832729.5
ns1.00
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s)
621208
ns467000
ns1.33
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s)
621791
ns469250
ns1.33
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s)
1550917
ns1575625
ns0.98
batchedmm(128, Bsize=32)/forward/GPU/CUDA
134056.5
ns129567
ns1.03
batchedmm(128, Bsize=32)/forward/GPU/oneAPI
77301649
nsbatchedmm(128, Bsize=32)/forward/GPU/AMDGPU
227902
ns227872
ns1.00
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s)
2692167
ns2691958.5
ns1.00
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s)
1995500
ns1537333
ns1.30
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s)
2004812.5
ns1540083.5
ns1.30
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s)
4935000
ns4938125
ns1.00
batchedmm(128, Bsize=32)/zygote/GPU/CUDA
249781
ns274489
ns0.91
batchedmm(128, Bsize=32)/zygote/GPU/oneAPI
100408887.5
nsbatchedmm(128, Bsize=32)/zygote/GPU/AMDGPU
840633.5
ns806443
ns1.04
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s)
291
ns375
ns0.78
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s)
375
ns292
ns1.28
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s)
375
ns334
ns1.12
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s)
292
ns334
ns0.87
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA
25355
ns25740
ns0.99
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI
1311449
nsbatchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU
46990
ns47540
ns0.99
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s)
6209
ns6625
ns0.94
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s)
6708.5
ns6125
ns1.10
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s)
6542
ns6791
ns0.96
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s)
6125
ns6667
ns0.92
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA
157347
ns180518.5
ns0.87
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI
21691879
nsbatchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU
365484
ns359293
ns1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
2366042
ns2375375
ns1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
2395500
ns2422500
ns0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
2374083
ns2407959
ns0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
2382167
ns2370375
ns1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
170643
ns178233.5
ns0.96
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI
8487051
nslayernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU
374764
ns374734
ns1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
4646208
ns4668709
ns1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
4643687
ns4652084
ns1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
4660250
ns4665083.5
ns1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
4569374.5
ns4600917
ns0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
714837
ns872920
ns0.82
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI
50175411.5
nslayernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU
1351724
ns1382244
ns0.98
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s)
7208
ns9437.5
ns0.76
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s)
7084
ns7833
ns0.90
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s)
7208
ns7292
ns0.99
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s)
6833
ns6834
ns1.00
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA
16063.5
ns16361
ns0.98
bias_activation(512, act=relu)(512 x 128)/forward/GPU/oneAPI
1173405
nsbias_activation(512, act=relu)(512 x 128)/forward/GPU/AMDGPU
39030
ns39440
ns0.99
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s)
63792
ns74520.5
ns0.86
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s)
32833
ns49250
ns0.67
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s)
45917
ns51729
ns0.89
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s)
45229.5
ns49083.5
ns0.92
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA
163785
ns212837
ns0.77
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/oneAPI
10469728.5
nsbias_activation(512, act=relu)(512 x 128)/zygote/GPU/AMDGPU
262653
ns266233
ns0.99
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s)
20584
ns22250
ns0.93
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s)
26208
ns25000
ns1.05
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s)
23542
ns21854.5
ns1.08
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s)
5125
ns5375
ns0.95
batchedmm(2, Bsize=512)/forward/GPU/CUDA
16017
ns15953
ns1.00
batchedmm(2, Bsize=512)/forward/GPU/oneAPI
90340662
nsbatchedmm(2, Bsize=512)/forward/GPU/AMDGPU
84110
ns83861
ns1.00
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s)
11791
ns11834
ns1.00
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s)
10229.5
ns9187.5
ns1.11
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s)
10625
ns9520.5
ns1.12
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s)
17895.5
ns18354.5
ns0.97
batchedmm(2, Bsize=512)/zygote/GPU/CUDA
159148
ns203711.5
ns0.78
batchedmm(2, Bsize=512)/zygote/GPU/oneAPI
149555538
nsbatchedmm(2, Bsize=512)/zygote/GPU/AMDGPU
367264
ns388864
ns0.94
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s)
406500
ns406375
ns1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s)
297583
ns223500
ns1.33
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s)
297250
ns223792
ns1.33
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s)
762791
ns762958
ns1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA
43249
ns43379
ns1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/oneAPI
1362482
nsdense(512, bias=true, act=relu)(512 x 128)/forward/GPU/AMDGPU
87411
ns89781
ns0.97
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s)
1484125.5
ns1427542
ns1.04
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s)
1167542
ns892959
ns1.31
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s)
1161667
ns892958
ns1.30
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s)
2387604.5
ns2385625
ns1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA
213476
ns239711
ns0.89
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/oneAPI
13925589
nsdense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/AMDGPU
377604
ns376923.5
ns1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
433583
ns434375
ns1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
436917
ns430000
ns1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
436666
ns430417
ns1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
448291
ns448375
ns1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
45983
ns46179
ns1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI
1048211.5
nsbatchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU
234192
ns235662
ns0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
3894625
ns3912500
ns1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
4022709
ns4004000
ns1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
4024624.5
ns4025375.5
ns1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
3801916.5
ns3768792
ns1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
210260
ns251012
ns0.84
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI
32692776
nsbatchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU
1361238
ns1368994
ns0.99
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s)
8709
ns8750
ns1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s)
7667
ns6875
ns1.12
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s)
7667
ns6917
ns1.11
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s)
12375
ns12458
ns0.99
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA
20402
ns20602
ns0.99
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/oneAPI
2188548.5
nsdense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/AMDGPU
210772
ns209952
ns1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s)
45041
ns44958
ns1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s)
45208
ns45083
ns1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s)
45208
ns45250
ns1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s)
44708
ns44750
ns1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA
253192
ns314279
ns0.81
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/oneAPI
14008146.5
nsdense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/AMDGPU
653917
ns653907
ns1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
82979
ns115896
ns0.72
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
126104.5
ns125812.5
ns1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
86229.5
ns126604.5
ns0.68
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
84875
ns89000
ns0.95
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
184626
ns186375.5
ns0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI
6066708
nsgroupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU
219662
ns219802
ns1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
2017833
ns2026583
ns1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
2016000
ns2025000
ns1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
2006375
ns2024729.5
ns0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
2025083
ns2026520.5
ns1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
496955.5
ns566645
ns0.88
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI
27423881
nsgroupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU
1040810
ns1084851
ns0.96
This comment was automatically generated by workflow using github-action-benchmark.