This repository has been archived by the owner on Nov 4, 2024. It is now read-only.

fix: enzyme reverse bias needs a check on Const #160

Merged 1 commit on Sep 16, 2024

Conversation

avik-pal
Member

No description provided.

@avik-pal avik-pal merged commit 0df09fa into main Sep 16, 2024
64 of 71 checks passed
@avik-pal avik-pal deleted the ap/fix_check branch September 16, 2024 16:34
Contributor

@github-actions github-actions bot left a comment

LuxLib Benchmarks
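
A note on reading the table that follows: the Ratio column appears to be the current timing divided by the previous timing, rounded to two decimals, so values above 1 suggest the current commit was slower on that benchmark and values below 1 suggest it was faster. The sketch below reproduces that arithmetic for two rows. The helper names `parse_ns` and `ratio` are hypothetical and not part of LuxLib or the benchmark tooling, and the current/previous definition is an assumption inferred from the reported numbers.

```python
def parse_ns(text: str) -> float:
    """Parse a timing such as '7208 ns' into nanoseconds as a float."""
    value, unit = text.split()
    if unit != "ns":
        raise ValueError(f"unexpected unit: {unit}")
    return float(value)


def ratio(current: str, previous: str) -> float:
    """Assumed definition: current / previous, rounded to two decimals."""
    return round(parse_ns(current) / parse_ns(previous), 2)


# layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s)
print(ratio("7208 ns", "4667 ns"))
# layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s)
print(ratio("5958 ns", "6666.5 ns"))
```

Many of the larger swings sit on sub-microsecond CPU rows and on the Metal rows, so a single ratio far from 1 is not necessarily a real regression.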

Benchmark suite Current: 1538324 Previous: 7ba127a Ratio
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 7208 ns 4667 ns 1.54
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 5958 ns 6666.5 ns 0.89
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 7812.5 ns 7500 ns 1.04
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 5500 ns 5750 ns 0.96
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 116591 ns 117321 ns 0.99
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal 835125 ns 3008750 ns 0.28
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU 407004 ns 404195 ns 1.01
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 9749.5 ns 9896 ns 0.99
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 10208.5 ns 9833 ns 1.04
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 9750 ns 9979 ns 0.98
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 9646 ns 9958.5 ns 0.97
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 532910 ns 533872 ns 1.00
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal 5057208 ns 2324292 ns 2.18
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 684507 ns 674968 ns 1.01
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) 2042 ns 1437.5 ns 1.42
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) 1375 ns 2875 ns 0.48
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) 2021 ns 2083 ns 0.97
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) 1875 ns 1437.5 ns 1.30
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA 21601 ns 21479 ns 1.01
bias_activation(32, act=relu)(32 x 128)/forward/GPU/Metal 208458 ns 190209 ns 1.10
bias_activation(32, act=relu)(32 x 128)/forward/GPU/AMDGPU 29330 ns 29540 ns 0.99
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 4084 ns 4250 ns 0.96
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 3812.5 ns 4167 ns 0.91
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 4541.5 ns 4145.5 ns 1.10
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 4208 ns 4375 ns 0.96
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA 142653 ns 144438.5 ns 0.99
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/Metal 1611188 ns 1604875 ns 1.00
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/AMDGPU 147161 ns 145092 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 58083 ns 55875 ns 1.04
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 46291 ns 39209 ns 1.18
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 47125 ns 46625 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 82708 ns 84167 ns 0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 36579 ns 36824 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1064500 ns 1333104 ns 0.80
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 83101 ns 81391 ns 1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2013666 ns 2024917 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2083625 ns 2079125 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2087792 ns 2081625 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1990541 ns 1993125 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 225517.5 ns 226688 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 4569584 ns 7427958 ns 0.62
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 975579 ns 1252074 ns 0.78
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 177958 ns 174750 ns 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 146625 ns 164541.5 ns 0.89
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 176041 ns 148812.5 ns 1.18
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 153250 ns 144375 ns 1.06
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 165529 ns 165480 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1562708.5 ns 1457521 ns 1.07
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 208972 ns 204852 ns 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1129895.5 ns 1117250 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1113270.5 ns 1109375.5 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1116688 ns 1113334 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1108000 ns 1112187.5 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 691122 ns 694582 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 6441708.5 ns 6238375 ns 1.03
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1025005 ns 1026961 ns 1.00
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 5083 ns 4417 ns 1.15
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 4458 ns 5041 ns 0.88
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 6708 ns 5208 ns 1.29
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 5125 ns 4583 ns 1.12
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 91713 ns 93299.5 ns 0.98
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal 467917 ns 634041.5 ns 0.74
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU 67821 ns 69460 ns 0.98
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8708 ns 8375 ns 1.04
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8667 ns 8542 ns 1.01
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8917 ns 8833 ns 1.01
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8292 ns 8833 ns 0.94
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 595976 ns 604485 ns 0.99
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal 6425709 ns 5669937.5 ns 1.13
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 386609 ns 388374 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 17854 ns 17000 ns 1.05
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 18125 ns 17709 ns 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 20833 ns 18021 ns 1.16
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 16791.5 ns 16895.5 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 65447 ns 66654.5 ns 0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 1286000 ns 477833 ns 2.69
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 77060 ns 78451 ns 0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 213000 ns 216834 ns 0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 218125 ns 219896 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 225458 ns 225583.5 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 215709 ns 217625 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 347209 ns 356473 ns 0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 5699333 ns 5644395.5 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 463334 ns 465005 ns 1.00
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) 958 ns 667 ns 1.44
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) 792 ns 750 ns 1.06
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) 834 ns 812.5 ns 1.03
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) 625 ns 625 ns 1
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA 20255 ns 20462 ns 0.99
bias_activation(2, act=relu)(2 x 128)/forward/GPU/Metal 303708 ns 302625 ns 1.00
bias_activation(2, act=relu)(2 x 128)/forward/GPU/AMDGPU 31221 ns 32870 ns 0.95
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 1417 ns 1417 ns 1
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 1583 ns 1458 ns 1.09
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 1583 ns 1417 ns 1.12
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 1334 ns 1416 ns 0.94
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA 122871 ns 125127 ns 0.98
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/Metal 1648916 ns 1526500 ns 1.08
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/AMDGPU 136871 ns 136521 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7417 ns 7208 ns 1.03
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6083 ns 5416 ns 1.12
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6167 ns 6125 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10041 ns 10666 ns 0.94
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 23633 ns 23625 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 656708 ns 356458 ns 1.84
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 47130 ns 48881 ns 0.96
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 229750 ns 226166 ns 1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 269375 ns 265333 ns 1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 270667 ns 234854 ns 1.15
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 219458 ns 219500 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 187028 ns 192027 ns 0.97
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 8744750 ns 9046313 ns 0.97
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 648261 ns 649247 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) 4084 ns 4125 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) 4125 ns 4083 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) 4125 ns 4084 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) 4083 ns 4083 ns 1
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA 22803 ns 23477 ns 0.97
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/Metal 224083 ns 214833 ns 1.04
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/AMDGPU 46881 ns 47261 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16584 ns 17083 ns 0.97
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16666 ns 17000 ns 0.98
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 17375 ns 16833 ns 1.03
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16916 ns 17334 ns 0.98
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA 191515.5 ns 195303 ns 0.98
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/Metal 962437.5 ns 918208 ns 1.05
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/AMDGPU 172696.5 ns 174652 ns 0.99
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 512042 ns 508750 ns 1.01
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 405354.5 ns 330583 ns 1.23
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 405583 ns 404666 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 865958 ns 864791 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA 113233 ns 113620 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/Metal 462708 ns 490979 ns 0.94
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/AMDGPU 242192 ns 242133 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 2271854 ns 2313834 ns 0.98
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 2032625 ns 1747479 ns 1.16
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 2029500 ns 2035208 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 3281500 ns 3272708.5 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA 239296 ns 241207 ns 0.99
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/Metal 2003084 ns 2011770.5 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/AMDGPU 742307 ns 743443 ns 1.00
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 6875 ns 4708.5 ns 1.46
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 6625 ns 7625 ns 0.87
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 8417 ns 7708 ns 1.09
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 6583 ns 5479.5 ns 1.20
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 90887.5 ns 92351.5 ns 0.98
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal 884937.5 ns 783479 ns 1.13
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU 66440 ns 65411 ns 1.02
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 10396 ns 10333.5 ns 1.01
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 11604 ns 11875 ns 0.98
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 11958.5 ns 11750 ns 1.02
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 11729 ns 12062.5 ns 0.97
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 634525 ns 634956 ns 1.00
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal 6103459 ns 5457291.5 ns 1.12
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 411954 ns 409979.5 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) 541 ns 541 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) 500 ns 583 ns 0.86
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) 542 ns 500 ns 1.08
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) 500 ns 500 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA 22874 ns 23181 ns 0.99
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/Metal 325500 ns 332584 ns 0.98
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/AMDGPU 48711 ns 47221 ns 1.03
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2125 ns 2166 ns 0.98
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 2125 ns 2167 ns 0.98
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 2167 ns 2084 ns 1.04
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2084 ns 2084 ns 1
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA 231590.5 ns 215755 ns 1.07
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/Metal 2053250 ns 1978417 ns 1.04
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/AMDGPU 172611.5 ns 172626.5 ns 1.00
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 8875 ns 8937.5 ns 0.99
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 9708 ns 9729.5 ns 1.00
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 10417 ns 9459 ns 1.10
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 8791 ns 8958 ns 0.98
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 100356.5 ns 96639 ns 1.04
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal 929291 ns 876000 ns 1.06
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU 71930.5 ns 71941 ns 1.00
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 17645.5 ns 18521 ns 0.95
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 17749.5 ns 19104.5 ns 0.93
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 19896.5 ns 17625 ns 1.13
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 17458 ns 18812.5 ns 0.93
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 603312.5 ns 554001 ns 1.09
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal 5371083.5 ns 5180916.5 ns 1.04
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 376938.5 ns 378539 ns 1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 542 ns 458 ns 1.18
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 459 ns 625 ns 0.73
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 584 ns 666 ns 0.88
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 500 ns 500 ns 1
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 35224.5 ns 35213 ns 1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal 467666 ns 466396 ns 1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU 45901 ns 46270 ns 0.99
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 9854.5 ns 9312.5 ns 1.06
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 8875 ns 9916.5 ns 0.89
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 9792 ns 9167 ns 1.07
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 9104.5 ns 9458.5 ns 0.96
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 259410 ns 267136 ns 0.97
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal 5178229 ns 4572250 ns 1.13
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU 364483 ns 367694 ns 0.99
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) 398833 ns 395333 ns 1.01
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) 287791 ns 214416 ns 1.34
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) 288166 ns 288292 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) 755917 ns 756291 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA 111747.5 ns 111882 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/Metal 365292 ns 300208.5 ns 1.22
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/AMDGPU 75320 ns 77331 ns 0.97
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 1398542 ns 1453791.5 ns 0.96
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 1133583 ns 852583 ns 1.33
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 1133583 ns 1132645.5 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 2439666 ns 2440625 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA 204923 ns 207032 ns 0.99
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/Metal 1643875 ns 1668041.5 ns 0.99
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/AMDGPU 323223 ns 324428.5 ns 1.00
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 6896 ns 7041.5 ns 0.98
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 7750 ns 7750 ns 1
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 8854 ns 9396 ns 0.94
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 7604 ns 7791.5 ns 0.98
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 142390 ns 144806.5 ns 0.98
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal 502541.5 ns 437250 ns 1.15
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU 65701 ns 66071 ns 0.99
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 14979.5 ns 13083 ns 1.14
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 16042 ns 14479 ns 1.11
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 14791.5 ns 15709 ns 0.94
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 13812.5 ns 15354.5 ns 0.90
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 954176.5 ns 956377 ns 1.00
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal 6228229 ns 5700250 ns 1.09
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 429743 ns 428955 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 24521 ns 24000 ns 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 30584 ns 24875 ns 1.23
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 30083 ns 29292 ns 1.03
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 26042 ns 27667 ns 0.94
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 196332 ns 199144 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 597417 ns 999584 ns 0.60
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 114751 ns 116931 ns 0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 155541 ns 103583 ns 1.50
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 149792 ns 152687 ns 0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 118812.5 ns 153583 ns 0.77
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 153083 ns 151000 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1064087 ns 1075746 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 5878770.5 ns 5733792 ns 1.03
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 588085.5 ns 590946.5 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 75500 ns 75000 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 76750 ns 77084 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 81834 ns 86333.5 ns 0.95
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 75417 ns 74875 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 203196 ns 205585 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 544541.5 ns 519187.5 ns 1.05
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 127411.5 ns 127562 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 319167 ns 293542 ns 1.09
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 292208 ns 308750 ns 0.95
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 319437.5 ns 315187.5 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 254396 ns 304208 ns 0.84
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1106728 ns 1108118 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6853687.5 ns 6276458 ns 1.09
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 692376 ns 695017 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 16167 ns 15875 ns 1.02
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 17042 ns 17521 ns 0.97
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 19208 ns 18500 ns 1.04
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 16625 ns 16958 ns 0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 144171.5 ns 146489 ns 0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal 453916 ns 723083.5 ns 0.63
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU 232502 ns 232683 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 27645.5 ns 26667 ns 1.04
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 27875 ns 26687.5 ns 1.04
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 27104.5 ns 28208.5 ns 0.96
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 27209 ns 27708.5 ns 0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 966948 ns 982068.5 ns 0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal 6307750 ns 5743229 ns 1.10
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 687236 ns 686807.5 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 11542 ns 11083 ns 1.04
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 11000 ns 12042 ns 0.91
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 11959 ns 12334 ns 0.97
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 10708.5 ns 10791 ns 0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 123398.5 ns 124134 ns 0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal 941104 ns 880000 ns 1.07
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU 235762 ns 234213 ns 1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 22125 ns 21958 ns 1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 22270.5 ns 22729.5 ns 0.98
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 23229 ns 21895.5 ns 1.06
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 21500 ns 22000 ns 0.98
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 695351 ns 701831.5 ns 0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal 5572437 ns 5204750 ns 1.07
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 677346 ns 674667 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 62708 ns 63437.5 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 63666 ns 65521 ns 0.97
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 65167 ns 66750 ns 0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 63041.5 ns 63042 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 104762.5 ns 106345.5 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 1332187.5 ns 480667 ns 2.77
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 233822 ns 233433 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 445500 ns 437896 ns 1.02
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 444000 ns 456000 ns 0.97
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 450354 ns 450542 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 448604 ns 444000 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 508730 ns 515188 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6142645.5 ns 6095791.5 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 714332 ns 717017.5 ns 1.00
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 7000 ns 6792 ns 1.03
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 7521 ns 8000 ns 0.94
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 8313 ns 8583.5 ns 0.97
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 7937.5 ns 6917 ns 1.15
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 143368 ns 146052.5 ns 0.98
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal 776916 ns 726500 ns 1.07
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU 69001 ns 65301 ns 1.06
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 14354.5 ns 14292 ns 1.00
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 14625 ns 15292 ns 0.96
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 15500.5 ns 14084 ns 1.10
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 15333 ns 16209 ns 0.95
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 936852 ns 947670 ns 0.99
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal 5938604.5 ns 5499875 ns 1.08
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU 401428 ns 399764 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) 6155875 ns 6131500 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) 6375916.5 ns 3224875 ns 1.98
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) 6374000 ns 6379229.5 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) 11901292 ns 11911084 ns 1.00
batchedmm(512, Bsize=4)/forward/GPU/CUDA 350778 ns 349856 ns 1.00
batchedmm(512, Bsize=4)/forward/GPU/AMDGPU 323393 ns 303248 ns 1.07
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) 19147458.5 ns 19059708.5 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) 19965187.5 ns 11090437.5 ns 1.80
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) 19960104 ns 20005646 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) 36468541.5 ns 36446770.5 ns 1.00
batchedmm(512, Bsize=4)/zygote/GPU/CUDA 1066981 ns 1081781.5 ns 0.99
batchedmm(512, Bsize=4)/zygote/GPU/AMDGPU 1167291 ns 1153782 ns 1.01
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 958 ns 958 ns 1
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1000 ns 1000 ns 1
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1000 ns 958 ns 1.04
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 958 ns 917 ns 1.04
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA 22956 ns 23071 ns 1.00
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/Metal 326979.5 ns 332541.5 ns 0.98
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/AMDGPU 207932 ns 207622 ns 1.00
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 3625 ns 3667 ns 0.99
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 3709 ns 3750 ns 0.99
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 3791 ns 3708 ns 1.02
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 3666 ns 3667 ns 1.00
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA 278049 ns 281551.5 ns 0.99
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/Metal 2145896 ns 2129583 ns 1.01
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/AMDGPU 628290.5 ns 626307 ns 1.00
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 7583 ns 8042 ns 0.94
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 8437.5 ns 8145.5 ns 1.04
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 10687 ns 9042 ns 1.18
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 7604.5 ns 7937.5 ns 0.96
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 119392 ns 121104 ns 0.99
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal 855709 ns 802541.5 ns 1.07
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU 65571 ns 65471 ns 1.00
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 12062 ns 13125 ns 0.92
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 11875 ns 12875 ns 0.92
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 13583 ns 11417 ns 1.19
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 11375 ns 12708 ns 0.90
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 628558 ns 638151 ns 0.98
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal 5354374.5 ns 4390333 ns 1.22
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 352454 ns 355644 ns 0.99
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) 292 ns 292 ns 1
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) 292 ns 333 ns 0.88
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) 333 ns 292 ns 1.14
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) 291 ns 291 ns 1
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA 22280 ns 22337 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/Metal 325417 ns 207833 ns 1.57
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/AMDGPU 47591 ns 47401 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2917 ns 3042 ns 0.96
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 2833 ns 3375 ns 0.84
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 3375 ns 2916 ns 1.16
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2875 ns 3333 ns 0.86
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA 200188 ns 204047 ns 0.98
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/Metal 1755125 ns 1611395.5 ns 1.09
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/AMDGPU 166112 ns 157641.5 ns 1.05
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 12062.5 ns 10250 ns 1.18
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 11709 ns 12167 ns 0.96
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 14167 ns 12187.5 ns 1.16
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 12208 ns 10604 ns 1.15
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 120653.5 ns 121713.5 ns 0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal 957583 ns 904791.5 ns 1.06
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU 233782 ns 233512.5 ns 1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 22521 ns 21104.5 ns 1.07
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 21229 ns 22583 ns 0.94
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 21709 ns 21083 ns 1.03
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 21271 ns 21708 ns 0.98
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 588112 ns 595173 ns 0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal 4873042 ns 4095583 ns 1.19
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 645610.5 ns 638246.5 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) 4417 ns 4417 ns 1
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) 4417 ns 4375 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) 4416 ns 4375 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) 4416 ns 4417 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA 23872 ns 24193.5 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/Metal 227375 ns 215041 ns 1.06
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/AMDGPU 47721 ns 47690 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16458 ns 16292 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16459 ns 16291 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 16750 ns 16667 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16375 ns 16416 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA 328973 ns 330020.5 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/Metal 1096437.5 ns 1639709 ns 0.67
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/AMDGPU 205102 ns 206457.5 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 2000 ns 1917 ns 1.04
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 2167 ns 2167 ns 1
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 2125 ns 2084 ns 1.02
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 2083 ns 2084 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 35620 ns 35891 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal 479292 ns 474917 ns 1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU 203502 ns 204052 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 16666.5 ns 19687.5 ns 0.85
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 16625 ns 17187.5 ns 0.97
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 17917 ns 17750 ns 1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 16645.5 ns 16667 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 291232 ns 293976.5 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal 5011416 ns 4767354.5 ns 1.05
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 685261 ns 686777 ns 1.00
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) 60083 ns 55771 ns 1.08
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) 66500 ns 62792 ns 1.06
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) 65291.5 ns 65604.5 ns 1.00
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) 53250 ns 51333 ns 1.04
batchedmm(16, Bsize=512)/forward/GPU/CUDA 66478.5 ns 66418 ns 1.00
batchedmm(16, Bsize=512)/forward/GPU/AMDGPU 114861 ns 114241 ns 1.01
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) 189708.5 ns 202896 ns 0.94
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) 131041.5 ns 135104 ns 0.97
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) 158375 ns 130083 ns 1.22
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) 299229 ns 245666 ns 1.22
batchedmm(16, Bsize=512)/zygote/GPU/CUDA 213976.5 ns 215296 ns 0.99
batchedmm(16, Bsize=512)/zygote/GPU/AMDGPU 614765 ns 607861 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 84000 ns 79709 ns 1.05
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 83417 ns 107104 ns 0.78
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 117459 ns 85167 ns 1.38
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 90334 ns 124166.5 ns 0.73
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 192051 ns 192861 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1810583.5 ns 1816084 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 203912 ns 203512 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1915666 ns 1869895.5 ns 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1913209 ns 1901084 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1915500 ns 1917666.5 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1911313 ns 1889333 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 534916.5 ns 531825 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 8918708 ns 8859584 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 927078 ns 925670 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) 292 ns 291 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) 292 ns 292 ns 1
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) 333 ns 291 ns 1.14
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) 291 ns 291 ns 1
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA 21774 ns 21389 ns 1.02
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/Metal 368584 ns 336229.5 ns 1.10
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/AMDGPU 41120 ns 42770.5 ns 0.96
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 1792 ns 1834 ns 0.98
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 1875 ns 1834 ns 1.02
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 1833 ns 1792 ns 1.02
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 1792 ns 1792 ns 1
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA 255691 ns 253832 ns 1.01
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/Metal 1105833 ns 1009479 ns 1.10
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/AMDGPU 182671.5 ns 184376.5 ns 0.99
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 8792 ns 8000 ns 1.10
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 9750 ns 10042 ns 0.97
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 10500 ns 10375 ns 1.01
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 7833 ns 8167 ns 0.96
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 120137.5 ns 119090.5 ns 1.01
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal 886041 ns 876708 ns 1.01
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU 235302 ns 232622 ns 1.01
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 9542 ns 9083 ns 1.05
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 9542 ns 10625 ns 0.90
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 9667 ns 9542 ns 1.01
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 9792 ns 10125 ns 0.97
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 530135 ns 527209 ns 1.01
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal 4591292 ns 3949187.5 ns 1.16
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 631445 ns 624237 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 58666 ns 56166 ns 1.04
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 46458 ns 38916 ns 1.19
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 47084 ns 46125 ns 1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 82542 ns 83958 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 40746 ns 40233 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1137875 ns 1123667 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 74291 ns 76266 ns 0.97
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1797833 ns 1923750 ns 0.93
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1970562.5 ns 1952750.5 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1981541 ns 1982854 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1786375 ns 1850708.5 ns 0.97
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 222550 ns 221906.5 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 11379375.5 ns 11408021 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1174290 ns 1191052 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 419270.5 ns 416333 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 424500 ns 421645.5 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 421958 ns 421208.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 416084 ns 417667 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 209953 ns 208798 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 539875 ns 518208 ns 1.04
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 282843 ns 282883 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 669937.5 ns 747916.5 ns 0.90
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 718333 ns 671583 ns 1.07
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 773833 ns 673562.5 ns 1.15
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 683417 ns 748021 ns 0.91
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1056591 ns 1048327.5 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6669292 ns 6335208.5 ns 1.05
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 910917 ns 914290 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 3426458 ns 3428937.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 3445083 ns 3384709 ns 1.02
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 3449250 ns 3435000 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 3430354 ns 3417875 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 176943 ns 175238.5 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1423917 ns 1424083 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 448819 ns 426124 ns 1.05
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 6195708.5 ns 6191270.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 6031458 ns 6170041 ns 0.98
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 6208500 ns 6167416.5 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 6217667 ns 6190792 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1004532.5 ns 994959 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 7516103.5 ns 7413750 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1546963 ns 1549811 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 470709 ns 470666 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 341708 ns 252458 ns 1.35
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 342084 ns 342417 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 904895.5 ns 901125 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA 47024 ns 46139 ns 1.02
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/Metal 536459 ns 368208 ns 1.46
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/AMDGPU 244742 ns 243602 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 2262542 ns 2334750 ns 0.97
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 2032625.5 ns 1752562 ns 1.16
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 2039042 ns 2041187.5 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 3286625 ns 3280124.5 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA 272283 ns 255952 ns 1.06
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/Metal 2240916 ns 2244770.5 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/AMDGPU 767047 ns 770018 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 58209 ns 55708 ns 1.04
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 45792 ns 39041 ns 1.17
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 46834 ns 46020.5 ns 1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 82833 ns 84125 ns 0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 28553 ns 28321 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1155250 ns 1106875 ns 1.04
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 76011 ns 76505.5 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1966583 ns 2029708 ns 0.97
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2097041.5 ns 2082292 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2105333 ns 2090958 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1928854 ns 1949604 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 234931 ns 232547 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 11838750 ns 11649979 ns 1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1200890 ns 1052311 ns 1.14
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 58000 ns 55833 ns 1.04
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 46666 ns 39083.5 ns 1.19
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 46333 ns 46375 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 82333 ns 84042 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 50602 ns 49287 ns 1.03
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1136750 ns 1049084 ns 1.08
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 75801 ns 69820 ns 1.09
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1932208 ns 1919458 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1947417 ns 1955416.5 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1987250 ns 1946334 ns 1.02
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1888333 ns 1890750 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 241387 ns 239685 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 10094000 ns 9788042 ns 1.03
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1035989 ns 918859 ns 1.13
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 333 ns 292 ns 1.14
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 333 ns 417 ns 0.80
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 375 ns 292 ns 1.28
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 292 ns 292 ns 1
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 35237 ns 34717 ns 1.01
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal 443125.5 ns 263500 ns 1.68
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU 48850 ns 46211 ns 1.06
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7125 ns 6333 ns 1.13
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6625 ns 7500 ns 0.88
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7291 ns 6583 ns 1.11
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6500 ns 7000 ns 0.93
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 210223.5 ns 208392.5 ns 1.01
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal 5397396 ns 4479667 ns 1.20
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU 368313 ns 365124 ns 1.01
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) 291 ns 291 ns 1
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) 292 ns 292 ns 1
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) 291 ns 250 ns 1.16
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) 250 ns 250 ns 1
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA 33000.5 ns 32562 ns 1.01
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/Metal 258666 ns 258000 ns 1.00
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/AMDGPU 39550 ns 37000 ns 1.07
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 2834 ns 2750 ns 1.03
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 3417 ns 3625 ns 0.94
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 3625 ns 2709 ns 1.34
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 3000 ns 2917 ns 1.03
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA 191838.5 ns 189309.5 ns 1.01
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/Metal 1338041 ns 905666.5 ns 1.48
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/AMDGPU 161651 ns 151136.5 ns 1.07
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 422791.5 ns 467667 ns 0.90
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 422146 ns 444750 ns 0.95
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 435479.5 ns 425999.5 ns 1.02
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 420583.5 ns 421833.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 138548 ns 137895 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 2802625 ns 2386500 ns 1.17
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 369014 ns 367024 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 3806000 ns 3802521 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 3822896 ns 3765917 ns 1.02
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 3824375 ns 3811417 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 3805604 ns 3799541.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 714199.5 ns 709425 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 11413916 ns 10457896 ns 1.09
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1476123 ns 1471404 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) 49897812.5 ns 49735229.5 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) 35524000 ns 25984959 ns 1.37
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) 35517334 ns 35560875 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) 96963667 ns 96902041.5 ns 1.00
batchedmm(512, Bsize=32)/forward/GPU/CUDA 1624121 ns 1616773 ns 1.00
batchedmm(512, Bsize=32)/forward/GPU/AMDGPU 1037819 ns 1045271 ns 0.99
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) 154672625 ns 153907333 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) 112393437.5 ns 89247291.5 ns 1.26
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) 112506958 ns 112379750 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) 296345729 ns 294166500 ns 1.01
batchedmm(512, Bsize=32)/zygote/GPU/CUDA 6489470 ns 6515848 ns 1.00
batchedmm(512, Bsize=32)/zygote/GPU/AMDGPU 5543337 ns 5562255.5 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) 19041.5 ns 14521 ns 1.31
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) 18917 ns 14958 ns 1.26
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) 17167 ns 16833 ns 1.02
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) 15667 ns 14854.5 ns 1.05
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA 20662 ns 20539.5 ns 1.01
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/Metal 323166 ns 206959 ns 1.56
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/AMDGPU 26030 ns 26060 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) 11083 ns 10625 ns 1.04
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) 9062.5 ns 7771 ns 1.17
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) 9209 ns 9208 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) 17417 ns 17437.5 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA 262818 ns 260548 ns 1.01
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/Metal 1604333.5 ns 1587125 ns 1.01
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/AMDGPU 148741 ns 149326.5 ns 1.00
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 8437.5 ns 7958 ns 1.06
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 8979 ns 9292 ns 0.97
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 9895.5 ns 9500 ns 1.04
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 7938 ns 7958.5 ns 1.00
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 127524.5 ns 116273.5 ns 1.10
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal 820209 ns 810375 ns 1.01
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU 233142 ns 233683 ns 1.00
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 9583 ns 9208.5 ns 1.04
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 10458 ns 10645.5 ns 0.98
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10041.5 ns 10208 ns 0.98
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 9687.5 ns 10375 ns 0.93
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 625567 ns 619508.5 ns 1.01
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal 5349854 ns 4432792 ns 1.21
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 648161 ns 654786 ns 0.99
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 10937.5 ns 8291.5 ns 1.32
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 10250 ns 10459 ns 0.98
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 11666.5 ns 10042 ns 1.16
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 9146 ns 9250 ns 0.99
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 122855.5 ns 120531 ns 1.02
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal 976417 ns 901792 ns 1.08
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU 70821 ns 71071 ns 1.00
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 14667 ns 13250 ns 1.11
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 15333 ns 16042 ns 0.96
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 14208 ns 17208 ns 0.83
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 13417 ns 15167 ns 0.88
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 596853 ns 592138 ns 1.01
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal 4884688 ns 4027062.5 ns 1.21
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 343503 ns 345753 ns 0.99
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 541 ns 459 ns 1.18
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 625 ns 583 ns 1.07
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 583 ns 500 ns 1.17
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 459 ns 541 ns 0.85
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 35561 ns 34521 ns 1.03
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal 474104 ns 371562.5 ns 1.28
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU 204042 ns 206352 ns 0.99
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 8354.5 ns 7062.5 ns 1.18
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 8375 ns 8333.5 ns 1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7750 ns 8583 ns 0.90
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6875 ns 8000 ns 0.86
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 235390.5 ns 233771 ns 1.01
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal 5839792 ns 4885833 ns 1.20
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 656975.5 ns 662116 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 16584 ns 12292 ns 1.35
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 17042 ns 13229 ns 1.29
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 14291 ns 15125 ns 0.94
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 10042 ns 10167 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA 22367 ns 22042 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/Metal 222667 ns 189125 ns 1.18
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/AMDGPU 185752 ns 189132 ns 0.98
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 32458 ns 31875 ns 1.02
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 32145.5 ns 32333.5 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 32416.5 ns 32291.5 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 32125 ns 32000 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA 277897 ns 276327 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/Metal 1844750 ns 1697542 ns 1.09
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/AMDGPU 593515 ns 595015.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 442500 ns 480875 ns 0.92
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 443792 ns 441083 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 444209 ns 450250 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 450666 ns 490979 ns 0.92
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 194406.5 ns 194024 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 2002791 ns 2629708 ns 0.76
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 367493 ns 368063.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 3843396 ns 3822958 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 3822375 ns 3807354 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 3837021 ns 3827834 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 3828583 ns 3826167 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 546587.5 ns 544349 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 9324958 ns 9196542 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1364051 ns 1359983 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) 778516791 ns 838219667 ns 0.93
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) 542702208 ns 415052604.5 ns 1.31
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) 545647834 ns 543102500 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) 1521122979.5 ns 1525021500 ns 1.00
batchedmm(512, Bsize=512)/forward/GPU/CUDA 22763268.5 ns 22764607.5 ns 1.00
batchedmm(512, Bsize=512)/forward/GPU/AMDGPU 14742526.5 ns 14772276 ns 1.00
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) 2527539958 ns 3570164958 ns 0.71
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) 1822368542 ns 1502049709 ns 1.21
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) 2266716458 ns 2269221042 ns 1.00
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) 4816725334 ns 4773617583 ns 1.01
batchedmm(512, Bsize=512)/zygote/GPU/CUDA 332342534.5 ns 369302709 ns 0.90
batchedmm(512, Bsize=512)/zygote/GPU/AMDGPU 88513087.5 ns 87924411 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 78250 ns 79646 ns 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 77750 ns 78895.5 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 79500 ns 78667 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 76708 ns 77583 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 208775.5 ns 207237 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 543354 ns 520375 ns 1.04
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 107265.5 ns 107601 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 191958 ns 250834 ns 0.77
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 193916 ns 294583.5 ns 0.66
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 278354 ns 285708.5 ns 0.97
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 191917 ns 222333.5 ns 0.86
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1056329 ns 1049109.5 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6246083 ns 6122958 ns 1.02
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 634790 ns 640576 ns 0.99
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) 199782750 ns 199656458.5 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) 139205750 ns 103769666.5 ns 1.34
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) 139486250 ns 139342042 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) 388433333 ns 388182208 ns 1.00
batchedmm(512, Bsize=128)/forward/GPU/CUDA 5833082.5 ns 5838796 ns 1.00
batchedmm(512, Bsize=128)/forward/GPU/AMDGPU 3539050 ns 3577840.5 ns 0.99
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) 620133542 ns 616451521 ns 1.01
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) 441272958 ns 351188291.5 ns 1.26
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) 439771292 ns 439680896 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) 1178807834 ns 1178137125 ns 1.00
batchedmm(512, Bsize=128)/zygote/GPU/CUDA 26447699.5 ns 26651952 ns 0.99
batchedmm(512, Bsize=128)/zygote/GPU/AMDGPU 22039567 ns 22092888 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7250 ns 7333 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6167 ns 5292 ns 1.17
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6125 ns 6084 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 9792 ns 10167 ns 0.96
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 28196 ns 27714.5 ns 1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 596833 ns 351458 ns 1.70
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 48371 ns 48481 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 212937 ns 218291.5 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 221500 ns 222250 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 229333 ns 221209 ns 1.04
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 205916 ns 213708.5 ns 0.96
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 224394 ns 222292 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 9258896 ns 9125125 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 520850 ns 529665 ns 0.98
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 9395.5 ns 7271 ns 1.29
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 9437.5 ns 9541.5 ns 0.99
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 10354.5 ns 9791 ns 1.06
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 9291.5 ns 8187.5 ns 1.13
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 118055 ns 117715.5 ns 1.00
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal 906375 ns 885458 ns 1.02
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU 68451 ns 69700 ns 0.98
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7625 ns 7479 ns 1.02
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 9958 ns 10479.5 ns 0.95
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 8000 ns 10875 ns 0.74
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7458.5 ns 8875 ns 0.84
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 524850 ns 519786.5 ns 1.01
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal 3849333 ns 3961208 ns 0.97
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 312977.5 ns 316073 ns 0.99
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 541 ns 416 ns 1.30
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 584 ns 750 ns 0.78
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 542 ns 459 ns 1.18
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 416 ns 500 ns 0.83
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 26875 ns 26338 ns 1.02
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal 469042 ns 488604.5 ns 0.96
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU 46651 ns 46820 ns 1.00
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 9333 ns 9291 ns 1.00
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 10812.5 ns 10416 ns 1.04
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 9667 ns 9208.5 ns 1.05
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 8709 ns 11583 ns 0.75
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 256059 ns 253612 ns 1.01
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal 6077334 ns 5171833.5 ns 1.18
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU 383208.5 ns 388624 ns 0.99
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 107542 ns 104834 ns 1.03
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 100125 ns 84834 ns 1.18
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 101479.5 ns 99500 ns 1.02
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 146625 ns 146333 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA 25218 ns 24613 ns 1.02
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/Metal 273854 ns 246062.5 ns 1.11
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/AMDGPU 189951 ns 192062 ns 0.99
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 495334 ns 526854 ns 0.94
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 478291.5 ns 478875 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 516708 ns 500416.5 ns 1.03
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 478125 ns 478958.5 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA 234048 ns 232619 ns 1.01
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/Metal 2325583 ns 1709625 ns 1.36
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/AMDGPU 610996 ns 610896 ns 1.00
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) 5791 ns 5125 ns 1.13
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) 6500 ns 7167 ns 0.91
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) 6916 ns 6791 ns 1.02
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) 6375 ns 4042 ns 1.58
batchedmm(16, Bsize=32)/forward/GPU/CUDA 16672 ns 16580 ns 1.01
batchedmm(16, Bsize=32)/forward/GPU/AMDGPU 84840 ns 79701 ns 1.06
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) 11709 ns 11708 ns 1.00
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) 11917 ns 11584 ns 1.03
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) 11041 ns 10792 ns 1.02
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) 16625 ns 17687.5 ns 0.94
batchedmm(16, Bsize=32)/zygote/GPU/CUDA 215620.5 ns 214143.5 ns 1.01
batchedmm(16, Bsize=32)/zygote/GPU/AMDGPU 379153 ns 366964 ns 1.03
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) 39709 ns 35792 ns 1.11
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) 51834 ns 50791 ns 1.02
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) 52458 ns 51833.5 ns 1.01
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) 14687.5 ns 13542 ns 1.08
batchedmm(16, Bsize=128)/forward/GPU/CUDA 20332 ns 21568 ns 0.94
batchedmm(16, Bsize=128)/forward/GPU/AMDGPU 92151 ns 87241 ns 1.06
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) 36208 ns 38979.5 ns 0.93
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) 31917 ns 30708 ns 1.04
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) 37000.5 ns 30416 ns 1.22
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) 57166 ns 58458 ns 0.98
batchedmm(16, Bsize=128)/zygote/GPU/CUDA 194385.5 ns 192010 ns 1.01
batchedmm(16, Bsize=128)/zygote/GPU/AMDGPU 394518.5 ns 395119 ns 1.00
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) 1791.5 ns 1729.5 ns 1.04
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) 1917 ns 1875 ns 1.02
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) 2250 ns 2146 ns 1.05
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) 1750 ns 1709 ns 1.02
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA 20812 ns 20594 ns 1.01
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/Metal 311104 ns 326833 ns 0.95
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/AMDGPU 32150 ns 33120 ns 0.97
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) 2167 ns 2125 ns 1.02
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) 2208 ns 2333 ns 0.95
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) 2292 ns 2250 ns 1.02
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) 2125 ns 2042 ns 1.04
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA 205188 ns 204587 ns 1.00
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/Metal 1645562.5 ns 1518500 ns 1.08
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/AMDGPU 135871 ns 136826.5 ns 0.99
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 5250 ns 4417 ns 1.19
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 5270.5 ns 5250 ns 1.00
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 7334 ns 6375.5 ns 1.15
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 5812.5 ns 4041.5 ns 1.44
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 146117 ns 145077 ns 1.01
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal 458167 ns 725208 ns 0.63
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU 68481 ns 69471 ns 0.99
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8125 ns 8041 ns 1.01
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 9250 ns 8958 ns 1.03
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8459 ns 8416 ns 1.01
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8292 ns 9208 ns 0.90
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 880887.5 ns 875812.5 ns 1.01
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal 6096687.5 ns 5580917 ns 1.09
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU 386063.5 ns 389804 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 56833 ns 56792 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 57750 ns 56875 ns 1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 57667 ns 57584 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 58042 ns 58375 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 38026 ns 37054 ns 1.03
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 486292 ns 336000 ns 1.45
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 202862 ns 203242 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 448000 ns 485813 ns 0.92
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 464562 ns 499958.5 ns 0.93
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 514167 ns 468208 ns 1.10
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 433250 ns 438854.5 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 268686 ns 268055 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 8272083 ns 8122166.5 ns 1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 828927 ns 832729 ns 1.00
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) 3318958 ns 3291250 ns 1.01
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) 2343459 ns 1764708 ns 1.33
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) 2341458.5 ns 2339021 ns 1.00
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) 6308646 ns 6260292 ns 1.01
batchedmm(128, Bsize=128)/forward/GPU/CUDA 205558 ns 204625 ns 1.00
batchedmm(128, Bsize=128)/forward/GPU/AMDGPU 215872 ns 209992 ns 1.03
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) 11518666.5 ns 11332208 ns 1.02
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) 8313187 ns 6550833 ns 1.27
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) 8355021 ns 8325250 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) 21045041.5 ns 20937125 ns 1.01
batchedmm(128, Bsize=128)/zygote/GPU/CUDA 745952 ns 734916 ns 1.02
batchedmm(128, Bsize=128)/zygote/GPU/AMDGPU 1057229 ns 1048155.5 ns 1.01
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 6792 ns 4291 ns 1.58
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 5458.5 ns 5875 ns 0.93
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 6500 ns 6583 ns 0.99
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 4166 ns 4896 ns 0.85
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 138428.5 ns 137991.5 ns 1.00
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal 882854 ns 785625 ns 1.12
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU 56141 ns 56390 ns 1.00
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7417 ns 7042 ns 1.05
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7375 ns 10562.5 ns 0.70
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7375 ns 7104.5 ns 1.04
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6895.5 ns 7833 ns 0.88
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 761776 ns 754679 ns 1.01
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal 5740125 ns 5245042 ns 1.09
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU 368623 ns 371414 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 97541 ns 127625 ns 0.76
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 97291 ns 95624.5 ns 1.02
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 102834 ns 100000 ns 1.03
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 95833 ns 95708 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 152013 ns 152137 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 2103917 ns 2635166.5 ns 0.80
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 203602 ns 203242 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2028917 ns 2017959 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2020833 ns 2027771 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2026375 ns 2021167 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2015416 ns 1987167 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 713055.5 ns 703925.5 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 10783958 ns 11055292 ns 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1109779 ns 1255893 ns 0.88
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) 33958 ns 29375 ns 1.16
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) 37500 ns 34500 ns 1.09
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) 34833 ns 35250 ns 0.99
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) 625 ns 583 ns 1.07
batchedmm(2, Bsize=4)/forward/GPU/CUDA 15998 ns 15622 ns 1.02
batchedmm(2, Bsize=4)/forward/GPU/AMDGPU 78530 ns 80130 ns 0.98
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) 2667 ns 2542 ns 1.05
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) 2875 ns 3125 ns 0.92
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) 3042 ns 2834 ns 1.07
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) 2208.5 ns 3000 ns 0.74
batchedmm(2, Bsize=4)/zygote/GPU/CUDA 142041 ns 141408 ns 1.00
batchedmm(2, Bsize=4)/zygote/GPU/AMDGPU 342123 ns 343344 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7125 ns 7125 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5958 ns 5375 ns 1.11
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6125 ns 6000 ns 1.02
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10125 ns 10209 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 37390 ns 36671 ns 1.02
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 676167 ns 331459 ns 2.04
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 48060 ns 48221 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 212729 ns 217479 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 220042 ns 229625 ns 0.96
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 234520.5 ns 225000 ns 1.04
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 205667 ns 212875 ns 0.97
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 246638 ns 244929 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 8074875 ns 7984187.5 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 575124.5 ns 574266 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3958 ns 3959 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3958 ns 3917 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3958 ns 3917 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3917 ns 3917 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA 21990 ns 21419 ns 1.03
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/Metal 252375 ns 234583 ns 1.08
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/AMDGPU 42541 ns 42620 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 14750 ns 14791 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 15083 ns 14750 ns 1.02
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 14959 ns 14875 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 14875 ns 14833 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA 313538 ns 311492 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/Metal 1028333 ns 982000 ns 1.05
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/AMDGPU 197971.5 ns 192231.5 ns 1.03
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 101625 ns 140834 ns 0.72
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 103750 ns 127417 ns 0.81
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 106333 ns 105167 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 103459 ns 141000 ns 0.73
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 142836 ns 152595 ns 0.94
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 2101916 ns 2057334 ns 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 204827 ns 213297 ns 0.96
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1936209 ns 1917833 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1845500 ns 1898875 ns 0.97
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1928833 ns 1922083 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1923833 ns 1898854 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 695235 ns 692137 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 10431000 ns 10436541 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1217070 ns 1217872 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 17333 ns 18250 ns 0.95
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 18521 ns 18625 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 21896 ns 20750 ns 1.06
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 16667 ns 17749.5 ns 0.94
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 109599.5 ns 110137 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 1364167 ns 480541.5 ns 2.84
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 78990 ns 79421 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 215333 ns 252041.5 ns 0.85
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 216916 ns 217541.5 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 260791.5 ns 219687.5 ns 1.19
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 215375 ns 222729.5 ns 0.97
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 522595 ns 519298 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6229875 ns 6194812.5 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 475894 ns 478425 ns 0.99
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) 23895.5 ns 23291.5 ns 1.03
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) 32458 ns 28583 ns 1.14
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) 29292 ns 28792 ns 1.02
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) 3458 ns 1229.5 ns 2.81
batchedmm(16, Bsize=4)/forward/GPU/CUDA 16387 ns 16210 ns 1.01
batchedmm(16, Bsize=4)/forward/GPU/AMDGPU 81460 ns 82241 ns 0.99
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) 4854 ns 4292 ns 1.13
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) 5208 ns 4729 ns 1.10
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) 5395.5 ns 5042 ns 1.07
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) 4792 ns 5771 ns 0.83
batchedmm(16, Bsize=4)/zygote/GPU/CUDA 208972.5 ns 207444.5 ns 1.01
batchedmm(16, Bsize=4)/zygote/GPU/AMDGPU 378003 ns 378084 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 307375 ns 305417 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 306083 ns 306250 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 306834 ns 308084 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 305666 ns 305750 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 229495 ns 228609 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 591083 ns 604584 ns 0.98
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 272703 ns 273963 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 530459 ns 532917 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 539625 ns 538167 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 568541.5 ns 539125 ns 1.05
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 529458 ns 572709 ns 0.92
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1085835 ns 1074383 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6524292 ns 6115208.5 ns 1.07
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 862227 ns 858603.5 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 20875 ns 19291 ns 1.08
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 19979.5 ns 20708 ns 0.96
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 23042 ns 22375.5 ns 1.03
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 19750 ns 19875 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 115151 ns 114907 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 1456958.5 ns 593916 ns 2.45
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 79391 ns 79421 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 213000 ns 215708 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 213542 ns 220584 ns 0.97
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 246521 ns 213625 ns 1.15
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 213167 ns 215875 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 753347 ns 762395 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 7160542 ns 7232562.5 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 538225 ns 542290.5 ns 0.99
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 6708 ns 6125 ns 1.10
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 6667 ns 7083 ns 0.94
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 8375 ns 7917 ns 1.06
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 6917 ns 6208 ns 1.11
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 141312.5 ns 140165.5 ns 1.01
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal 873084 ns 799291 ns 1.09
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU 65570 ns 65270 ns 1.00
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 10916 ns 9542 ns 1.14
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 10625 ns 10333.5 ns 1.03
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 10312.5 ns 10375 ns 0.99
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 9500 ns 11145.5 ns 0.85
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 830338 ns 826456 ns 1.00
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal 5676125 ns 5311708 ns 1.07
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU 385548 ns 387474 ns 1.00
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 6375 ns 4875 ns 1.31
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 5000 ns 6917 ns 0.72
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 7000 ns 7250 ns 0.97
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 5833 ns 4812.5 ns 1.21
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 145831.5 ns 144262 ns 1.01
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal 860333 ns 808375 ns 1.06
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU 66980 ns 66621 ns 1.01
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7666 ns 7458 ns 1.03
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7375 ns 8083 ns 0.91
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7708 ns 7541.5 ns 1.02
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7250 ns 7833 ns 0.93
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 791858 ns 783702 ns 1.01
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal 5957062 ns 5566229 ns 1.07
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 392953 ns 395004 ns 0.99
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) 14530458 ns 14350584 ns 1.01
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) 10121208 ns 7693688 ns 1.32
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) 10148584 ns 10127042 ns 1.00
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) 27684375 ns 27615959 ns 1.00
batchedmm(128, Bsize=512)/forward/GPU/CUDA 535738 ns 548306 ns 0.98
batchedmm(128, Bsize=512)/forward/GPU/AMDGPU 397784 ns 393134 ns 1.01
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) 46490166.5 ns 45943208 ns 1.01
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) 33477083 ns 26437417 ns 1.27
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) 33537500 ns 33454833 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) 85247125 ns 84782667 ns 1.01
batchedmm(128, Bsize=512)/zygote/GPU/CUDA 2681552 ns 2657066 ns 1.01
batchedmm(128, Bsize=512)/zygote/GPU/AMDGPU 3309727.5 ns 3290613 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 67375 ns 66375 ns 1.02
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 68792 ns 68584 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 69854.5 ns 69333.5 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 66416 ns 65979 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 122151 ns 121920.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 1445250 ns 508166 ns 2.84
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 226167 ns 229397.5 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 440688 ns 446833 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 451750 ns 452437.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 492625 ns 446375 ns 1.10
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 441000 ns 445834 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 732469 ns 728139 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 7909125 ns 7552104 ns 1.05
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 791857 ns 790108 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 542 ns 500 ns 1.08
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 583 ns 666 ns 0.88
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 666 ns 500 ns 1.33
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 500 ns 667 ns 0.75
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 33043 ns 32311 ns 1.02
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal 460958 ns 473500 ns 0.97
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU 47400 ns 47340 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 8958 ns 8666 ns 1.03
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 9500.5 ns 9208 ns 1.03
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 8875 ns 8458 ns 1.05
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 8958 ns 17104 ns 0.52
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 288238 ns 286358 ns 1.01
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal 5647625 ns 4681395.5 ns 1.21
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 374718.5 ns 375004 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 9833 ns 9875 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 9875 ns 9875 ns 1
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 9792 ns 9792 ns 1
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 9792 ns 9833 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA 23362.5 ns 23012 ns 1.02
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/Metal 224083 ns 215645.5 ns 1.04
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/AMDGPU 203972 ns 205762 ns 0.99
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 45833 ns 45958 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 45959 ns 46042 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 46292 ns 46041 ns 1.01
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 45958 ns 46250 ns 0.99
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA 293937 ns 290878 ns 1.01
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/Metal 980625 ns 942542 ns 1.04
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/AMDGPU 605265 ns 607695 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 56375 ns 56250 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 57083 ns 56458 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 57042 ns 57083 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 57625 ns 57709 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 29177 ns 28552 ns 1.02
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 672479.5 ns 663666.5 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 203142 ns 203541.5 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 448687 ns 448583 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 464312.5 ns 465562 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 510417 ns 465458.5 ns 1.10
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 433854.5 ns 454041.5 ns 0.96
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 247396 ns 245887 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 9563709 ns 9545520.5 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 893168 ns 887779 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 544437.5 ns 645812.5 ns 0.84
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 597375 ns 575959 ns 1.04
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 648250 ns 640542 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 624874.5 ns 646271 ns 0.97
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 210492 ns 208584 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1402270.5 ns 1406395.5 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 311727.5 ns 315503 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2217500 ns 2214979 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2240396 ns 2211999.5 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2229708.5 ns 2220812.5 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2225709 ns 2227958 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 983621 ns 978439 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 7187583 ns 10481646 ns 0.69
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1224520 ns 1213952 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 21375 ns 18625 ns 1.15
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 20229 ns 20729 ns 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 22833 ns 21583 ns 1.06
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 20334 ns 18875 ns 1.08
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 113138 ns 113850.5 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 1461708.5 ns 497958 ns 2.94
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 78945.5 ns 79731 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 221084 ns 227375 ns 0.97
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 218833.5 ns 259417 ns 0.84
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 258896 ns 225541 ns 1.15
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 219333.5 ns 227084 ns 0.97
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 734120 ns 729838 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 7799875 ns 7560500 ns 1.03
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 554300 ns 554315 ns 1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 542 ns 500 ns 1.08
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 583 ns 584 ns 1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 667 ns 541 ns 1.23
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 500 ns 500 ns 1
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 23471 ns 23274 ns 1.01
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal 494166 ns 484250 ns 1.02
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU 47550 ns 48040 ns 0.99
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 9166 ns 9083 ns 1.01
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 9792 ns 10437.5 ns 0.94
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 9458 ns 9541 ns 0.99
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 8958 ns 9500 ns 0.94
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 270206 ns 268183 ns 1.01
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal 6261937.5 ns 5000875 ns 1.25
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 395404 ns 398234 ns 0.99
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 9292 ns 7250 ns 1.28
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 9625 ns 9187.5 ns 1.05
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 10021 ns 9645.5 ns 1.04
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 8792 ns 8041 ns 1.09
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 121366 ns 118921.5 ns 1.02
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal 911666 ns 886791.5 ns 1.03
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU 69440 ns 71801 ns 0.97
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7458 ns 7604 ns 0.98
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7625 ns 8125 ns 0.94
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 8000 ns 7500 ns 1.07
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7291 ns 7562.5 ns 0.96
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 511555 ns 507494 ns 1.01
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal 4435417 ns 3782375 ns 1.17
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU 317033 ns 320313 ns 0.99
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1583 ns 1500 ns 1.06
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1604 ns 1708.5 ns 0.94
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 2000 ns 1791 ns 1.12
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1375 ns 1375 ns 1
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA 21695 ns 21598 ns 1.00
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/Metal 314792 ns 313375 ns 1.00
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/AMDGPU 189691 ns 190932 ns 0.99
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 3375 ns 3541 ns 0.95
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 3458 ns 3583 ns 0.97
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 3458.5 ns 3458 ns 1.00
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 3250 ns 3292 ns 0.99
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA 223980.5 ns 218452 ns 1.03
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/Metal 1832500 ns 1797375 ns 1.02
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/AMDGPU 582325 ns 583116 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) 148250 ns 148104.5 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) 128875 ns 106833 ns 1.21
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) 128958.5 ns 128562.5 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) 225250 ns 225000 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA 24262 ns 23975 ns 1.01
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/Metal 283354.5 ns 254292 ns 1.11
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/AMDGPU 41141 ns 41470 ns 0.99
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) 143416.5 ns 157645.5 ns 0.91
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) 112166.5 ns 87625 ns 1.28
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) 112958 ns 112000 ns 1.01
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) 250792 ns 250708.5 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA 218729 ns 218220.5 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/Metal 2142416 ns 1096666 ns 1.95
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/AMDGPU 268067 ns 269773 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7334 ns 7167 ns 1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5875 ns 5333 ns 1.10
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6000 ns 6000 ns 1
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 9958 ns 10458 ns 0.95
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 33218 ns 32755 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 655791.5 ns 330458 ns 1.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 50271 ns 50720 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 220458 ns 253104 ns 0.87
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 227458 ns 229041.5 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 270542 ns 234187.5 ns 1.16
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 212354.5 ns 227938 ns 0.93
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 266380 ns 263186.5 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 8511103.5 ns 8237750 ns 1.03
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 592610 ns 594190.5 ns 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 14833 ns 13792 ns 1.08
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 15333 ns 15166 ns 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 17291.5 ns 16499.5 ns 1.05
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 15084 ns 14667 ns 1.03
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 140084.5 ns 139540 ns 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal 909875 ns 786729 ns 1.16
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU 231072 ns 232963 ns 0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 23750 ns 23000 ns 1.03
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 24667 ns 23937.5 ns 1.03
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 23250 ns 23875 ns 0.97
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 23000 ns 23979.5 ns 0.96
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 872721 ns 870094.5 ns 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal 5802750 ns 5595708 ns 1.04
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 678796 ns 679366 ns 1.00
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 9208 ns 8750 ns 1.05
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 9292 ns 10312.5 ns 0.90
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 10834 ns 11271 ns 0.96
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 9459 ns 9584 ns 0.99
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 124867 ns 123388.5 ns 1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal 859792 ns 858292 ns 1.00
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU 72841 ns 74460 ns 0.98
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 14208 ns 13375 ns 1.06
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 13708 ns 14458.5 ns 0.95
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 13916 ns 13958 ns 1.00
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 13708 ns 13625 ns 1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 673605 ns 667308 ns 1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal 5390083 ns 4997708 ns 1.08
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU 367303 ns 365743 ns 1.00
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 9208 ns 8583 ns 1.07
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 10208.5 ns 10333 ns 0.99
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 11062.5 ns 10312.5 ns 1.07
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 9416 ns 9166 ns 1.03
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 124133 ns 121770.5 ns 1.02
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal 963958 ns 906625 ns 1.06
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU 72331 ns 75170 ns 0.96
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 12666 ns 12292 ns 1.03
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 12521 ns 13437.5 ns 0.93
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 12833 ns 12916 ns 0.99
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 11959 ns 12458 ns 0.96
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 558576.5 ns 553718.5 ns 1.01
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal 4777458 ns 3865125.5 ns 1.24
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU 342592 ns 341293 ns 1.00
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) 29729 ns 26354.5 ns 1.13
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) 34979 ns 30645.5 ns 1.14
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) 31541.5 ns 31541 ns 1.00
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) 1833 ns 1833 ns 1
batchedmm(2, Bsize=128)/forward/GPU/CUDA 16426 ns 16183 ns 1.02
batchedmm(2, Bsize=128)/forward/GPU/AMDGPU 81111 ns 81001 ns 1.00
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) 5625 ns 5209 ns 1.08
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) 4958 ns 5021 ns 0.99
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) 5416 ns 5417 ns 1.00
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) 6542 ns 6604 ns 0.99
batchedmm(2, Bsize=128)/zygote/GPU/CUDA 141297.5 ns 140577.5 ns 1.01
batchedmm(2, Bsize=128)/zygote/GPU/AMDGPU 371714 ns 370423.5 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 292 ns 250 ns 1.17
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 375 ns 375 ns 1
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 375 ns 250 ns 1.50
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 291 ns 291 ns 1
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 26334 ns 25697 ns 1.02
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal 487083 ns 465667 ns 1.05
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU 47261 ns 47180 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6417 ns 6125 ns 1.05
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6458 ns 6729 ns 0.96
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 6792 ns 6333 ns 1.07
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6250 ns 6312.5 ns 0.99
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 189443.5 ns 187721.5 ns 1.01
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal 6171125 ns 4952833.5 ns 1.25
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU 386868 ns 386429 ns 1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 2000 ns 1959 ns 1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 2083 ns 2042 ns 1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 2125 ns 2000 ns 1.06
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 1958 ns 1959 ns 1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 27204 ns 26463 ns 1.03
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal 471084 ns 479625 ns 0.98
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU 205931 ns 206252 ns 1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 16042 ns 16250 ns 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 16833 ns 16666 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 16791 ns 16208.5 ns 1.04
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 16646 ns 16417 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 277576.5 ns 276067 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal 6251000 ns 5326083 ns 1.17
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 699851 ns 700836 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 148375 ns 173875 ns 0.85
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 150437.5 ns 148750 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 153292 ns 155708 ns 0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 153125 ns 147458 ns 1.04
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 211229 ns 203847 ns 1.04
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1433375 ns 1561917 ns 0.92
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 214362 ns 232482 ns 0.92
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1324083 ns 1328917 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1324958 ns 1311771 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1328917 ns 1320791 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1318208 ns 1322500 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 915681 ns 909940.5 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 6666625 ns 7124333 ns 0.94
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1106330 ns 995559.5 ns 1.11
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 25708 ns 22958 ns 1.12
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 24833 ns 26833 ns 0.93
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 27541 ns 27625 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 26083 ns 24667 ns 1.06
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 237833 ns 234608.5 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 1183708 ns 576541 ns 2.05
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 114451 ns 116011 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 117687.5 ns 118166.5 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 125583 ns 122375 ns 1.03
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 136583.5 ns 158041.5 ns 0.86
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 128500 ns 123833.5 ns 1.04
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1088124 ns 1073695 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6499917 ns 6127166 ns 1.06
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 608510 ns 612925 ns 0.99
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 292 ns 250 ns 1.17
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 375 ns 375 ns 1
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 375 ns 291 ns 1.29
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 250 ns 250 ns 1
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 23244 ns 23160 ns 1.00
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal 492313 ns 478542 ns 1.03
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU 47000 ns 47471 ns 0.99
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6395.5 ns 6291 ns 1.02
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6750 ns 6833.5 ns 0.99
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 6958 ns 6458 ns 1.08
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6541 ns 6584 ns 0.99
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 206201.5 ns 204382.5 ns 1.01
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal 6167167 ns 5334937.5 ns 1.16
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 386698 ns 388703 ns 0.99
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 7000 ns 5208 ns 1.34
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 6208.5 ns 7021 ns 0.88
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 8250 ns 7458 ns 1.11
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 5958 ns 5667 ns 1.05
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 146258 ns 145933.5 ns 1.00
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal 457750 ns 753959 ns 0.61
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU 232042 ns 234802 ns 0.99
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 9917 ns 9583 ns 1.03
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 10333 ns 10375 ns 1.00
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10334 ns 10125 ns 1.02
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 9958 ns 10042 ns 0.99
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 909058 ns 903827 ns 1.01
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal 6298000 ns 5826479 ns 1.08
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 665070.5 ns 668457 ns 0.99
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 667 ns 667 ns 1
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 667 ns 709 ns 0.94
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 667 ns 625 ns 1.07
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 666 ns 625 ns 1.07
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA 22848 ns 22371 ns 1.02
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/Metal 325125 ns 208416 ns 1.56
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/AMDGPU 205392 ns 207552 ns 0.99
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 4791 ns 4584 ns 1.05
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 4625 ns 4833 ns 0.96
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 4750 ns 4666 ns 1.02
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 4542 ns 4584 ns 0.99
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA 231258 ns 228749 ns 1.01
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/Metal 1797833 ns 1654416.5 ns 1.09
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/AMDGPU 580615 ns 580735 ns 1.00
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 7812.5 ns 7750 ns 1.01
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 7521 ns 9166.5 ns 0.82
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 10250 ns 8834 ns 1.16
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 7604.5 ns 8291 ns 0.92
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 123509.5 ns 121959 ns 1.01
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal 806208 ns 827916 ns 0.97
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU 73451 ns 74011 ns 0.99
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8729.5 ns 8625 ns 1.01
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8833 ns 9041.5 ns 0.98
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 9042 ns 8583.5 ns 1.05
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8375 ns 8375 ns 1
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 596738 ns 591884.5 ns 1.01
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal 4941250 ns 4264875 ns 1.16
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU 340602 ns 342784 ns 0.99
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) 126458 ns 122750 ns 1.03
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) 129666 ns 96459 ns 1.34
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) 129958 ns 130187.5 ns 1.00
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) 180979.5 ns 180875 ns 1.00
batchedmm(128, Bsize=4)/forward/GPU/CUDA 46506 ns 45830 ns 1.01
batchedmm(128, Bsize=4)/forward/GPU/AMDGPU 101881 ns 101721 ns 1.00
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) 302250 ns 328000 ns 0.92
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) 315979.5 ns 166666 ns 1.90
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) 345021 ns 347541.5 ns 0.99
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) 586229 ns 608646 ns 0.96
batchedmm(128, Bsize=4)/zygote/GPU/CUDA 192939 ns 192063 ns 1.00
batchedmm(128, Bsize=4)/zygote/GPU/AMDGPU 509434 ns 505519.5 ns 1.01
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) 399292 ns 395916 ns 1.01
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) 287875 ns 214250 ns 1.34
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) 287875 ns 288167 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) 756125 ns 756500 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA 44157 ns 43676.5 ns 1.01
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/Metal 416625 ns 429792 ns 0.97
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/AMDGPU 80330.5 ns 82131 ns 0.98
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 1393583.5 ns 1458834 ns 0.96
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 1132333 ns 857583 ns 1.32
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 1134896 ns 1134333 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 2442729 ns 2441958.5 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA 251447 ns 249859 ns 1.01
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/Metal 1849937 ns 1909646 ns 0.97
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/AMDGPU 351723 ns 352903 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 657209 ns 616500 ns 1.07
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 644583 ns 598250 ns 1.08
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 647625 ns 648916.5 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 646500 ns 642667 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 203355 ns 200586.5 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1396542 ns 1363291 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 308307 ns 313733 ns 0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2493938 ns 2445375 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2448666 ns 2426917 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2455417 ns 2441500 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2437875 ns 2440750 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 995041 ns 994961 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 10855500 ns 9661291 ns 1.12
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1303071 ns 1307388 ns 1.00
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) 33167 ns 28521 ns 1.16
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) 37708 ns 34625 ns 1.09
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) 34395.5 ns 33916.5 ns 1.01
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) 792 ns 875 ns 0.91
batchedmm(2, Bsize=32)/forward/GPU/CUDA 15536 ns 15425.5 ns 1.01
batchedmm(2, Bsize=32)/forward/GPU/AMDGPU 78885.5 ns 79381 ns 0.99
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) 3187 ns 3062.5 ns 1.04
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) 3208.5 ns 3416 ns 0.94
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) 3375 ns 3208 ns 1.05
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) 3167 ns 3209 ns 0.99
batchedmm(2, Bsize=32)/zygote/GPU/CUDA 138978.5 ns 139741 ns 0.99
batchedmm(2, Bsize=32)/zygote/GPU/AMDGPU 337663 ns 338953 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 407167 ns 404500 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 407625 ns 402125 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 408917 ns 408334 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 420083 ns 422458 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 43841 ns 43145 ns 1.02
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1162041.5 ns 1128750.5 ns 1.03
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 242687 ns 239562 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 3888000 ns 3863292 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 3966812.5 ns 3971625 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4003187.5 ns 3996791 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 3753604.5 ns 3757979.5 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 242130 ns 242826 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 11627208 ns 11673750 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1434622 ns 1433229 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3917 ns 3959 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3959 ns 3917 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3958 ns 3916 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3875 ns 3917 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA 34345 ns 33968 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/Metal 265875 ns 167334 ns 1.59
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/AMDGPU 40000 ns 38620 ns 1.04
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 15458 ns 15666 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 16000 ns 15750 ns 1.02
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 15875 ns 15625 ns 1.02
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 15750 ns 15625 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA 255938 ns 255128 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/Metal 886916.5 ns 843520.5 ns 1.05
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/AMDGPU 177676.5 ns 169816.5 ns 1.05
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) 404750 ns 402625 ns 1.01
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) 295584 ns 220209 ns 1.34
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) 295625 ns 295959 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) 760584 ns 760791.5 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA 113304 ns 113239 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/Metal 453333.5 ns 348895.5 ns 1.30
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/AMDGPU 89391 ns 89300.5 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 1423000 ns 1474958.5 ns 0.96
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 1161417 ns 881146 ns 1.32
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 1159042 ns 1159083.5 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 2466333.5 ns 2461917 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA 240454.5 ns 241292 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/Metal 1919791 ns 1946459 ns 0.99
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/AMDGPU 331642 ns 354883 ns 0.93
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 500 ns 500 ns 1
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 542 ns 542 ns 1
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 584 ns 500 ns 1.17
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 500 ns 500 ns 1
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 26071 ns 25844 ns 1.01
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal 479125 ns 496709 ns 0.96
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU 205842 ns 209382 ns 0.98
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 7375 ns 7375 ns 1
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 7666.5 ns 8104.5 ns 0.95
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 7958 ns 7500 ns 1.06
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 7334 ns 7375 ns 0.99
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 211355 ns 217033.5 ns 0.97
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal 6383916.5 ns 5254333.5 ns 1.21
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 685985.5 ns 685977 ns 1.00
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) 831521 ns 825125.5 ns 1.01
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) 618542 ns 468584 ns 1.32
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) 621583 ns 621500 ns 1.00
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) 1545084 ns 1536542 ns 1.01
batchedmm(128, Bsize=32)/forward/GPU/CUDA 130366.5 ns 130845.5 ns 1.00
batchedmm(128, Bsize=32)/forward/GPU/AMDGPU 230342 ns 229862 ns 1.00
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) 2689875 ns 2661979 ns 1.01
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) 1997791 ns 1535250.5 ns 1.30
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) 2003375 ns 2000792 ns 1.00
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) 4944895.5 ns 4906416 ns 1.01
batchedmm(128, Bsize=32)/zygote/GPU/CUDA 259056 ns 242304 ns 1.07
batchedmm(128, Bsize=32)/zygote/GPU/AMDGPU 850477 ns 841449 ns 1.01
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 333 ns 292 ns 1.14
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 334 ns 375 ns 0.89
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 375 ns 250 ns 1.50
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 292 ns 291 ns 1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 32798 ns 32216 ns 1.02
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal 459854.5 ns 464375 ns 0.99
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU 47071 ns 47630 ns 0.99
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6208 ns 6125 ns 1.01
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6500 ns 6708 ns 0.97
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 6750 ns 6500 ns 1.04
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6084 ns 6375 ns 0.95
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 224304.5 ns 224154.5 ns 1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal 5499188 ns 4615291 ns 1.19
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 357363 ns 357793.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 2461458 ns 2392708 ns 1.03
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 2378542 ns 2371959 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 2403542 ns 2404416 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 2387833 ns 2370084 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 200880 ns 200035.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1494834 ns 1597041.5 ns 0.94
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 373593 ns 373933 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 4658209 ns 4648292 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4661750.5 ns 4644250 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4678791.5 ns 4636708 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4636334 ns 4642750 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 901278.5 ns 891890 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 6685687.5 ns 6938541.5 ns 0.96
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1397522 ns 1391633 ns 1.00
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) 6916.5 ns 7187.5 ns 0.96
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) 7292 ns 7542 ns 0.97
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) 7166 ns 7125 ns 1.01
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) 7354.5 ns 6875 ns 1.07
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA 23290 ns 23289 ns 1.00
bias_activation(512, act=relu)(512 x 128)/forward/GPU/Metal 278083 ns 243458.5 ns 1.14
bias_activation(512, act=relu)(512 x 128)/forward/GPU/AMDGPU 40180 ns 39800 ns 1.01
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 33146 ns 46396.5 ns 0.71
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 48833.5 ns 32917 ns 1.48
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 46125 ns 45875.5 ns 1.01
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 33958.5 ns 67312 ns 0.50
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA 216552 ns 214725 ns 1.01
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/Metal 2082666 ns 1121562 ns 1.86
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/AMDGPU 268892 ns 269102.5 ns 1.00
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) 21041.5 ns 19604.5 ns 1.07
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) 26625 ns 24021 ns 1.11
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) 24937.5 ns 23750 ns 1.05
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) 7334 ns 5084 ns 1.44
batchedmm(2, Bsize=512)/forward/GPU/CUDA 16735 ns 17227 ns 0.97
batchedmm(2, Bsize=512)/forward/GPU/AMDGPU 84431 ns 83741 ns 1.01
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) 11916 ns 11916 ns 1
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) 10395.5 ns 9354.5 ns 1.11
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) 10583 ns 10417 ns 1.02
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) 18021 ns 17958 ns 1.00
batchedmm(2, Bsize=512)/zygote/GPU/CUDA 227961 ns 225890 ns 1.01
batchedmm(2, Bsize=512)/zygote/GPU/AMDGPU 370163 ns 371753 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) 406167 ns 404000 ns 1.01
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) 297250 ns 222584 ns 1.34
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) 297333 ns 296875 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) 762625 ns 762667 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA 46476 ns 46288 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/Metal 449292 ns 358375 ns 1.25
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/AMDGPU 88291 ns 89491 ns 0.99
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 1437042 ns 1480896 ns 0.97
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 1162625 ns 888250 ns 1.31
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 1166208 ns 1164959 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 2472895.5 ns 2465417 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA 282746.5 ns 288016 ns 0.98
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/Metal 2110792 ns 2117375 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/AMDGPU 378543 ns 381744 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 434250 ns 432125 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 436625 ns 430333 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 436625 ns 436917 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 447208 ns 448604.5 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 54844 ns 54122.5 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1118291.5 ns 1059021 ns 1.06
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 235652 ns 234952 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 3915542 ns 3895042 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4019584 ns 4004458 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4030583.5 ns 4030291.5 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 3785166.5 ns 3789979 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 262526 ns 260055 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 10655166 ns 10349458.5 ns 1.03
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1364571 ns 1223712 ns 1.12
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 8791 ns 8750 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 7708 ns 6917 ns 1.11
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 7625 ns 7583 ns 1.01
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 12458 ns 12416 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA 23849 ns 23553.5 ns 1.01
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/Metal 231188 ns 214667 ns 1.08
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/AMDGPU 209912 ns 211142 ns 0.99
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 44833 ns 44958 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 45083 ns 45083 ns 1
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 45334 ns 45000 ns 1.01
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 44958 ns 44958 ns 1
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA 348178 ns 344550 ns 1.01
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/Metal 1901333 ns 1862458 ns 1.02
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/AMDGPU 655620 ns 659011.5 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 84187.5 ns 122729 ns 0.69
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 86833.5 ns 83521 ns 1.04
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 128541 ns 87354.5 ns 1.47
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 90229.5 ns 105375 ns 0.86
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 190096 ns 190055 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 2015000 ns 1972791.5 ns 1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 218242 ns 214447 ns 1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2025917 ns 2012458.5 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1983125 ns 1980000 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2025145.5 ns 2023917 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2020771 ns 2011645.5 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 533574 ns 529776 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 9232375 ns 9305500.5 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 961658 ns 1088680 ns 0.88

This comment was automatically generated by workflow using github-action-benchmark.
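
For readers scanning the table, the Ratio column appears to be the current timing divided by the previous one, rounded to two decimals, so values above 1 mean the current commit was slower. The sketch below recomputes that column for three rows copied from the table. The helper names and the 1.3 cutoff are illustrative assumptions, not settings taken from the benchmark workflow.

```python
# Minimal sketch: recompute the Ratio column from (current, previous) timings.
# `ratio`, `slower_rows`, and the 1.3 cutoff are hypothetical, chosen for illustration.

def ratio(current_ns: float, previous_ns: float) -> float:
    """Current time divided by previous time, rounded to two decimals."""
    return round(current_ns / previous_ns, 2)


def slower_rows(rows: dict[str, tuple[float, float]], cutoff: float = 1.3) -> list[str]:
    """Names of benchmarks whose ratio is at or above the cutoff."""
    return [name for name, (cur, prev) in rows.items() if ratio(cur, prev) >= cutoff]


# Three rows copied from the table above: (current ns, previous ns).
rows = {
    "dense(512, bias=true, act=identity)/zygote/CPU/4 thread(s)": (1132333, 857583),
    "bias_activation(512, act=relu)/zygote/CPU/1 thread(s)": (33958.5, 67312),
    "bias_activation(512, act=relu)/zygote/GPU/Metal": (2082666, 1121562),
}

for name, (cur, prev) in rows.items():
    print(f"{name}: {ratio(cur, prev)}")

print(slower_rows(rows))
```

Each row is a single comparison between two runs, so a ratio above the cutoff is a prompt to re-run the benchmark rather than proof of a regression.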
