This repository has been archived by the owner on Nov 4, 2024. It is now read-only.

fix: missing enzyme rules for matmuladd! (CUDA support) #159

Merged (4 commits) on Sep 15, 2024

Conversation

avik-pal
Member

Fixes #148. I am still seeing some failures in the end-to-end Lux case, but let's get part of the solution in for now.

Contributor

github-actions bot left a comment

LuxLib Benchmarks

Benchmark suite Current: df2bfd5 Previous: 987fce9 Ratio
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 5917 ns 6083 ns 0.97
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 5792 ns 6250 ns 0.93
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 8042 ns 8104 ns 0.99
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 6125 ns 5333 ns 1.15
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 116710 ns 127763 ns 0.91
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI 2852430 ns 2680722 ns 1.06
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal 861459 ns 817500 ns 1.05
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU 401163 ns 410844 ns 0.98
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 10375 ns 9771 ns 1.06
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 10000 ns 9958 ns 1.00
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 9687.5 ns 9834 ns 0.99
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 9979 ns 9958 ns 1.00
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 537342 ns 539870 ns 1.00
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI 17823755 ns 18273784 ns 0.98
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal 4963250 ns 2523292 ns 1.97
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 717326 ns 669947 ns 1.07
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) 1500 ns 2812.5 ns 0.53
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) 1812.5 ns 1416 ns 1.28
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) 1542 ns 1584 ns 0.97
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) 1625 ns 1333 ns 1.22
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA 21658 ns 21455 ns 1.01
bias_activation(32, act=relu)(32 x 128)/forward/GPU/oneAPI 1280349 ns 1323661 ns 0.97
bias_activation(32, act=relu)(32 x 128)/forward/GPU/Metal 201041 ns 216625 ns 0.93
bias_activation(32, act=relu)(32 x 128)/forward/GPU/AMDGPU 35810 ns 28950 ns 1.24
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 3291.5 ns 4458 ns 0.74
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 3750 ns 3375 ns 1.11
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 3937.5 ns 4167 ns 0.94
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 4000 ns 4000 ns 1
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA 144356.5 ns 142970.5 ns 1.01
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/oneAPI 8866089 ns 10240879 ns 0.87
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/Metal 1450646 ns 1524333 ns 0.95
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/AMDGPU 146231 ns 149491.5 ns 0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 57084 ns 57833 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 46708 ns 40417 ns 1.16
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 46875 ns 46375 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 83917 ns 83000 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 36339 ns 36725 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 818655.5 ns 558408 ns 1.47
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1057917 ns 1040458 ns 1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 80285.5 ns 81776 ns 0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2018875 ns 2036667 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2083417 ns 2086500 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2084333 ns 2090375 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2014417 ns 1993667 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 226229 ns 226490 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 8035998 ns 7533597 ns 1.07
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 5478875 ns 8034167 ns 0.68
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1440733 ns 986919 ns 1.46
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 170417 ns 146666 ns 1.16
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 148584 ns 151000 ns 0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 149583 ns 151062.5 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 160562.5 ns 194750 ns 0.82
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 166876 ns 166182 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 7797282 ns 7689190 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1583375 ns 1596770.5 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 203002 ns 209312 ns 0.97
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1102458.5 ns 1113896 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1102083 ns 1120062.5 ns 0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1109583 ns 1119104 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1116459 ns 1106542 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 696608 ns 695636.5 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 33320128.5 ns 34400023 ns 0.97
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 5999125 ns 7210396 ns 0.83
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1026614 ns 1024730 ns 1.00
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 4458 ns 5291 ns 0.84
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 4937 ns 4916 ns 1.00
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 6062.5 ns 6125 ns 0.99
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 5000 ns 4375 ns 1.14
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 92442 ns 91792.5 ns 1.01
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI 5499696 ns 5267805 ns 1.04
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal 449541 ns 474000 ns 0.95
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU 67831 ns 67381 ns 1.01
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 9000 ns 8750 ns 1.03
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8875 ns 8917 ns 1.00
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 9000 ns 8792 ns 1.02
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8916 ns 8687.5 ns 1.03
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 598617.5 ns 600359 ns 1.00
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI 33059926 ns 36489972 ns 0.91
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal 5980250 ns 5930125 ns 1.01
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 387213 ns 390114 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 18666.5 ns 17562.5 ns 1.06
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 18084 ns 17979 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 21416.5 ns 20812.5 ns 1.03
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 17000.5 ns 17750 ns 0.96
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 66296 ns 66076.5 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3090203 ns 3263389.5 ns 0.95
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 1290959 ns 1274334 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 76351 ns 76030 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 219000 ns 212792 ns 1.03
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 218999.5 ns 213000 ns 1.03
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 220916 ns 218292 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 221500 ns 254395.5 ns 0.87
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 352074 ns 351925 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 15166593 ns 15484392 ns 0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 5645208 ns 5673084 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 451749 ns 468334.5 ns 0.96
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) 583 ns 625 ns 0.93
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) 750 ns 708 ns 1.06
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) 875 ns 770.5 ns 1.14
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) 666.5 ns 666 ns 1.00
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA 20395 ns 20050 ns 1.02
bias_activation(2, act=relu)(2 x 128)/forward/GPU/oneAPI 1176380 ns 1150135.5 ns 1.02
bias_activation(2, act=relu)(2 x 128)/forward/GPU/Metal 301416 ns 295625 ns 1.02
bias_activation(2, act=relu)(2 x 128)/forward/GPU/AMDGPU 30950 ns 32420 ns 0.95
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 1334 ns 1459 ns 0.91
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 1541.5 ns 1520.5 ns 1.01
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 1584 ns 1459 ns 1.09
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 1583 ns 1500 ns 1.06
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA 123898 ns 122512.5 ns 1.01
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/oneAPI 8669000 ns 8913698.5 ns 0.97
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/Metal 1653417 ns 1644687.5 ns 1.01
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/AMDGPU 135441 ns 135591 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7416 ns 7334 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6042 ns 5417 ns 1.12
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6167 ns 6042 ns 1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10459 ns 10250 ns 1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 23631 ns 23888.5 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1190921 ns 1207370.5 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 601583.5 ns 446750 ns 1.35
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 47310 ns 47420 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 222083 ns 236834 ns 0.94
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 235583 ns 241875 ns 0.97
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 237708.5 ns 269875 ns 0.88
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 256042 ns 257687.5 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 188469 ns 191906.5 ns 0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 32610334.5 ns 32212683 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 9041499.5 ns 8558250.5 ns 1.06
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 644350.5 ns 645121 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) 4125 ns 4083 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) 4083 ns 4083 ns 1
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) 4125 ns 4042 ns 1.02
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) 4125 ns 4125 ns 1
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA 22829.5 ns 23307 ns 0.98
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/oneAPI 1921964 ns 2000762.5 ns 0.96
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/Metal 221334 ns 223875 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/AMDGPU 46730 ns 48080 ns 0.97
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16709 ns 16792 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16708 ns 16625 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 16958 ns 16792 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 17041 ns 16917 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA 192811 ns 191629 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/oneAPI 11018487 ns 10282963 ns 1.07
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/Metal 941667 ns 937125 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/AMDGPU 171082 ns 176282 ns 0.97
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 509417 ns 509292 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 404708 ns 332354.5 ns 1.22
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 405084 ns 404834 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 865041 ns 865333 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA 113085 ns 113483 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/oneAPI 405181 ns 392476 ns 1.03
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/Metal 493750 ns 487333 ns 1.01
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/AMDGPU 242012 ns 240773 ns 1.01
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 2330209 ns 2308770.5 ns 1.01
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 2027500 ns 1756875 ns 1.15
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 2035083.5 ns 2033625 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 3196083.5 ns 3270500 ns 0.98
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA 239374 ns 237569 ns 1.01
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/oneAPI 10368785 ns 11006777.5 ns 0.94
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/Metal 1964958 ns 2028666.5 ns 0.97
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/AMDGPU 739347 ns 739942 ns 1.00
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 6354.5 ns 6062.5 ns 1.05
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 7500 ns 6584 ns 1.14
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 8020.5 ns 8208.5 ns 0.98
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 6500 ns 6875 ns 0.95
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 90359 ns 91839.5 ns 0.98
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI 5333534 ns 5704966 ns 0.93
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal 870542 ns 776250 ns 1.12
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU 67591 ns 65360 ns 1.03
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 11833.5 ns 11041.5 ns 1.07
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 12000 ns 11875 ns 1.01
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 12250 ns 11125 ns 1.10
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 11625 ns 12187.5 ns 0.95
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 628260 ns 637048 ns 0.99
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI 37786795 ns 37465688 ns 1.01
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal 5940750 ns 5651896.5 ns 1.05
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 415614 ns 408644 ns 1.02
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) 500 ns 500 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) 500 ns 542 ns 0.92
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) 541 ns 542 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) 500 ns 541 ns 0.92
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA 22889 ns 22899 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/oneAPI 2223683 ns 1980954 ns 1.12
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/Metal 317083 ns 214375 ns 1.48
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/AMDGPU 48540 ns 49101 ns 0.99
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2083 ns 2084 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 2083 ns 2083 ns 1
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 2166 ns 2208 ns 0.98
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2167 ns 2125 ns 1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA 223807 ns 228216 ns 0.98
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/oneAPI 10821378 ns 11133138.5 ns 0.97
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/Metal 1968833 ns 2019750 ns 0.97
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/AMDGPU 176001.5 ns 180086.5 ns 0.98
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 9625 ns 9083 ns 1.06
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 9333 ns 8500 ns 1.10
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 10750 ns 10833.5 ns 0.99
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 8542 ns 8542 ns 1
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 100074.5 ns 108383.5 ns 0.92
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI 3132454 ns 3207332 ns 0.98
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal 922709 ns 816208 ns 1.13
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU 71491 ns 74171 ns 0.96
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 17625 ns 16875 ns 1.04
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 18125 ns 18792 ns 0.96
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 18500 ns 18250 ns 1.01
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 17917 ns 17812.5 ns 1.01
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 568177 ns 615805 ns 0.92
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI 16550550 ns 16767446 ns 0.99
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal 5440042 ns 5170312.5 ns 1.05
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 378703 ns 383838.5 ns 0.99
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 500 ns 500 ns 1
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 500 ns 500 ns 1
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 583 ns 666 ns 0.88
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 500 ns 500 ns 1
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 34894 ns 35553 ns 0.98
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI 1202690 ns 1192710.5 ns 1.01
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal 451834 ns 293146 ns 1.54
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU 45750 ns 46141 ns 0.99
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 8875 ns 8541.5 ns 1.04
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 9125 ns 8541 ns 1.07
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 9166.5 ns 9958.5 ns 0.92
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 8333 ns 9458.5 ns 0.88
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 257748.5 ns 264293 ns 0.98
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI 21823797 ns 18241947 ns 1.20
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal 5328916 ns 5274687.5 ns 1.01
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU 367713.5 ns 366223 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) 396833.5 ns 396958 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) 287875 ns 215500 ns 1.34
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) 288250 ns 287792 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) 756458 ns 755333 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA 111609 ns 110939.5 ns 1.01
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/oneAPI 329132 ns 326929 ns 1.01
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/Metal 395667 ns 365521 ns 1.08
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/AMDGPU 74991 ns 74351 ns 1.01
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 1449979 ns 1446854 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 1131042 ns 859125 ns 1.32
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 1133958 ns 1132854 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 2356500 ns 2436292 ns 0.97
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA 204709 ns 204467 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/oneAPI 10635074 ns 8967194.5 ns 1.19
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/Metal 1611750 ns 1574375 ns 1.02
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/AMDGPU 322063 ns 321063 ns 1.00
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 7104 ns 7187.5 ns 0.99
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 7958 ns 7270.5 ns 1.09
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 8958.5 ns 8541.5 ns 1.05
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 6562.5 ns 6979.5 ns 0.94
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 142105.5 ns 145872 ns 0.97
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI 5936342 ns 5766375 ns 1.03
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal 452000 ns 448125 ns 1.01
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU 66091 ns 65611 ns 1.01
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 12375 ns 14770.5 ns 0.84
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 15041 ns 16916.5 ns 0.89
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 15917 ns 15687.5 ns 1.01
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 12417 ns 15562.5 ns 0.80
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 942027 ns 956937.5 ns 0.98
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI 44977187 ns 42931711 ns 1.05
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal 5815708 ns 6186333 ns 0.94
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 435994 ns 421904 ns 1.03
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 26250 ns 25292 ns 1.04
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 27062 ns 25292 ns 1.07
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 30166.5 ns 28583.5 ns 1.06
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 27083 ns 30125 ns 0.90
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 197974.5 ns 198270.5 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 7858446 ns 7924119 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 679708 ns 654625 ns 1.04
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 116281 ns 113131 ns 1.03
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 111583 ns 157000 ns 0.71
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 152937.5 ns 118479 ns 1.29
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 153979.5 ns 118792 ns 1.30
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 149416 ns 145083.5 ns 1.03
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1069746 ns 1072793 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 42543095 ns 41512479 ns 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 5946875 ns 5879750 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 584775 ns 587055 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 75459 ns 76417 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 78583 ns 74917 ns 1.05
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 86542 ns 80458 ns 1.08
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 76458 ns 82834 ns 0.92
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 204304 ns 204563.5 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 7626592.5 ns 7289524 ns 1.05
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 525584 ns 532021 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 127951 ns 126591 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 305250 ns 263209 ns 1.16
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 309917 ns 316562 ns 0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 303687.5 ns 248479.5 ns 1.22
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 300459 ns 210125 ns 1.43
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1109450.5 ns 1111658.5 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 42357809 ns 39831914 ns 1.06
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6335041.5 ns 6266000 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 692446 ns 691997 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 16583 ns 16771 ns 0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 17209 ns 16791.5 ns 1.02
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 18583 ns 17542 ns 1.06
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 16250 ns 16750 ns 0.97
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 143978 ns 144759.5 ns 0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI 5682846.5 ns 5606829 ns 1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal 442583.5 ns 474208 ns 0.93
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU 231662.5 ns 232022 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 25771 ns 26895.5 ns 0.96
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 27895.5 ns 25167 ns 1.11
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 27437.5 ns 27333 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 25146 ns 24167 ns 1.04
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 970539 ns 972458 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI 41685019 ns 41939896 ns 0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal 5938208 ns 6295958 ns 0.94
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 688856 ns 695306.5 ns 0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 11771 ns 11209 ns 1.05
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 11000 ns 11333.5 ns 0.97
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 12625 ns 12416.5 ns 1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 10375 ns 11042 ns 0.94
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 123075.5 ns 122668.5 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI 3467941 ns 3386989.5 ns 1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal 923417 ns 858500 ns 1.08
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU 233902 ns 233942 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 22249.5 ns 21584 ns 1.03
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 21958 ns 22563 ns 0.97
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 22458.5 ns 22583 ns 0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 21417 ns 21291 ns 1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 700502.5 ns 697229 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI 21978852 ns 21507216 ns 1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal 5587958.5 ns 5485375 ns 1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 671596 ns 669687 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 66792 ns 63104 ns 1.06
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 64084 ns 66479 ns 0.96
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 67917 ns 66584 ns 1.02
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 62708 ns 64208.5 ns 0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 104530 ns 105012.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3393921 ns 3348443 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 1274250 ns 1297624.5 ns 0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 232402 ns 232172 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 445958.5 ns 440625 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 445104 ns 448937.5 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 474833 ns 440917 ns 1.08
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 487708 ns 438250 ns 1.11
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 511360 ns 511759 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 19945666 ns 20624860 ns 0.97
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6173209 ns 5921625 ns 1.04
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 714611 ns 713498 ns 1.00
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 7541.5 ns 7521 ns 1.00
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 8021 ns 8084 ns 0.99
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 8458 ns 8667 ns 0.98
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 6895.5 ns 7750 ns 0.89
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 143280 ns 143457 ns 1.00
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI 5507961.5 ns 5597779 ns 0.98
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal 760167 ns 446771 ns 1.70
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU 64870 ns 64960 ns 1.00
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 15167 ns 14875 ns 1.02
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 15624.5 ns 15709 ns 0.99
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 14791.5 ns 16542 ns 0.89
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14625 ns 15541.5 ns 0.94
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 937676 ns 938762 ns 1.00
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI 39900259 ns 39706040 ns 1.00
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal 5852708 ns 5775541 ns 1.01
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU 399123 ns 398045 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) 6153208.5 ns 6154854 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) 6372791 ns 3224917 ns 1.98
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) 6377771 ns 6376292 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) 11908042 ns 11902583 ns 1.00
batchedmm(512, Bsize=4)/forward/GPU/CUDA 347117 ns 347379 ns 1.00
batchedmm(512, Bsize=4)/forward/GPU/AMDGPU 326032 ns 297978.5 ns 1.09
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) 19111958.5 ns 19104063 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) 19917167 ns 11143020.5 ns 1.79
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) 20013437.5 ns 19964417 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) 36539583.5 ns 36518125 ns 1.00
batchedmm(512, Bsize=4)/zygote/GPU/CUDA 1010211 ns 1020967.5 ns 0.99
batchedmm(512, Bsize=4)/zygote/GPU/AMDGPU 1160155 ns 1158972 ns 1.00
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 958 ns 958 ns 1
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 917 ns 1000 ns 0.92
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 959 ns 1000 ns 0.96
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 959 ns 958 ns 1.00
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA 22953 ns 22897 ns 1.00
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/oneAPI 2055203.5 ns 2091957.5 ns 0.98
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/Metal 317500 ns 232500 ns 1.37
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/AMDGPU 206642 ns 206842 ns 1.00
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 3667 ns 3708 ns 0.99
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 3667 ns 3709 ns 0.99
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 3750 ns 3792 ns 0.99
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 3666 ns 3667 ns 1.00
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA 276264.5 ns 277378 ns 1.00
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/oneAPI 11103867 ns 11186074 ns 0.99
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/Metal 2080563 ns 2130584 ns 0.98
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/AMDGPU 626735 ns 626357 ns 1.00
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 8354.5 ns 7750 ns 1.08
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 8271.5 ns 7937.5 ns 1.04
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 10375 ns 9771 ns 1.06
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 8479.5 ns 7437.5 ns 1.14
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 119642 ns 119515 ns 1.00
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI 3410967 ns 3487658 ns 0.98
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal 806333 ns 816562.5 ns 0.99
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU 65401 ns 65701 ns 1.00
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 11916 ns 11208 ns 1.06
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 11959 ns 13416.5 ns 0.89
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 12334 ns 12834 ns 0.96
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 12062 ns 11584 ns 1.04
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 633604.5 ns 631148 ns 1.00
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI 24085045 ns 21438278 ns 1.12
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal 5097229 ns 5005375 ns 1.02
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 354493 ns 354774 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) 250 ns 291 ns 0.86
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) 250 ns 333 ns 0.75
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) 292 ns 292 ns 1
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) 292 ns 292 ns 1
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA 22391 ns 22106 ns 1.01
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/oneAPI 2130096.5 ns 2144977 ns 0.99
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/Metal 223604.5 ns 226937 ns 0.99
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/AMDGPU 46660 ns 46510 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2875 ns 2875 ns 1
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 2875 ns 3000 ns 0.96
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 3042 ns 2917 ns 1.04
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 3292 ns 2958 ns 1.11
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA 200381.5 ns 199810.5 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/oneAPI 9512287 ns 9182273 ns 1.04
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/Metal 1667166.5 ns 1664167 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/AMDGPU 154861.5 ns 161676.5 ns 0.96
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 11188 ns 11625 ns 0.96
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 11375 ns 11979 ns 0.95
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 13583 ns 13333 ns 1.02
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 11167 ns 11604.5 ns 0.96
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 120676.5 ns 120755 ns 1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI 3356710 ns 3560641 ns 0.94
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal 905416 ns 1031500 ns 0.88
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU 232962.5 ns 233163 ns 1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 20896 ns 20687.5 ns 1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 20541 ns 20583 ns 1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 24042 ns 23000 ns 1.05
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 20437.5 ns 20541.5 ns 0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 592980 ns 590597 ns 1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI 19762992.5 ns 20721086 ns 0.95
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal 4815437.5 ns 4786083 ns 1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 647055 ns 646557 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) 4417 ns 4375 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) 4375 ns 4417 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) 4375 ns 4375 ns 1
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) 4375 ns 4417 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA 23422 ns 23934 ns 0.98
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/oneAPI 2056687 ns 2235095.5 ns 0.92
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/Metal 223584 ns 221479.5 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/AMDGPU 47230 ns 47181 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16500 ns 16667 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16458 ns 16541 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 16833 ns 16709 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16250 ns 16708 ns 0.97
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA 328856 ns 326329 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/oneAPI 12215144.5 ns 12543391.5 ns 0.97
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/Metal 1709708 ns 1672458 ns 1.02
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/AMDGPU 205042 ns 204152 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 2042 ns 2084 ns 0.98
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 2125 ns 2125 ns 1
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 2125 ns 2083 ns 1.02
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 2083 ns 1958 ns 1.06
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 35411 ns 35852 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI 1204529 ns 1224950 ns 0.98
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal 468083 ns 293583 ns 1.59
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU 202926.5 ns 203142 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 16875.5 ns 18208 ns 0.93
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 17083.5 ns 17187.5 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 18312.5 ns 18041.5 ns 1.02
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 17812.5 ns 17021 ns 1.05
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 291655 ns 291174 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI 20862505 ns 21237766 ns 0.98
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal 5608145.5 ns 5676396 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 685221 ns 684357.5 ns 1.00
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) 60604.5 ns 60208.5 ns 1.01
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) 64375 ns 62042 ns 1.04
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) 66625 ns 65750 ns 1.01
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) 51125 ns 51250 ns 1.00
batchedmm(16, Bsize=512)/forward/GPU/CUDA 66758 ns 66352.5 ns 1.01
batchedmm(16, Bsize=512)/forward/GPU/AMDGPU 117881 ns 112971 ns 1.04
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) 203042 ns 188541.5 ns 1.08
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) 108479 ns 140250.5 ns 0.77
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) 153187.5 ns 124249.5 ns 1.23
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) 269187.5 ns 220125 ns 1.22
batchedmm(16, Bsize=512)/zygote/GPU/CUDA 213243 ns 213978 ns 1.00
batchedmm(16, Bsize=512)/zygote/GPU/AMDGPU 616430 ns 616297 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 83708 ns 84479 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 91041 ns 83666.5 ns 1.09
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 86250.5 ns 86167 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 83958 ns 125666 ns 0.67
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 193430.5 ns 193270.5 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 5188768 ns 5699293.5 ns 0.91
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1750270.5 ns 1963979.5 ns 0.89
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 217472 ns 204042 ns 1.07
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1915271 ns 1887292 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1908688 ns 1916521 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1884250 ns 1912333 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1925896 ns 1806250 ns 1.07
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 526075 ns 528167 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 18387490 ns 24408984.5 ns 0.75
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 9059833 ns 9102667 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1068719 ns 1064601.5 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) 291 ns 292 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) 292 ns 333 ns 0.88
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) 292 ns 292 ns 1
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) 291 ns 250 ns 1.16
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA 21289 ns 21230 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/oneAPI 2736431 ns 2190815.5 ns 1.25
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/Metal 344083 ns 367541.5 ns 0.94
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/AMDGPU 41530 ns 41291 ns 1.01
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 1792 ns 1792 ns 1
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 1792 ns 1834 ns 0.98
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 1833 ns 1875 ns 0.98
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 1792 ns 1792 ns 1
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA 249820 ns 249025 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/oneAPI 9441082 ns 10051558 ns 0.94
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/Metal 1569708 ns 1526271 ns 1.03
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/AMDGPU 178101 ns 182202 ns 0.98
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 8709 ns 8583 ns 1.01
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 8417 ns 9542 ns 0.88
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 11583 ns 10604 ns 1.09
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 9042 ns 8125 ns 1.11
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 116492 ns 117788.5 ns 0.99
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI 3343948 ns 3476276 ns 0.96
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal 864458 ns 921312.5 ns 0.94
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU 232712 ns 232182 ns 1.00
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 9125 ns 9000 ns 1.01
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 8541 ns 8958 ns 0.95
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 12333 ns 11292 ns 1.09
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 8750 ns 9145.5 ns 0.96
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 520944 ns 518629.5 ns 1.00
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI 20822847 ns 19406043 ns 1.07
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal 4608167 ns 4477584 ns 1.03
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 626775 ns 626986 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 57667 ns 57458 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 46209 ns 39875 ns 1.16
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 46917 ns 46750 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 83750 ns 82583 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 39308 ns 39259 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 1295650 ns 1309251 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1112083 ns 1121542 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 78771 ns 74341 ns 1.06
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1919478.5 ns 1867542 ns 1.03
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1959291.5 ns 1978791 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1979917 ns 1977229 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1905563 ns 1853979.5 ns 1.03
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 217725.5 ns 219172 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 34778309 ns 32964288 ns 1.06
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 11556353.5 ns 11253292 ns 1.03
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1038339 ns 1160142 ns 0.90
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 417958 ns 419229.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 418542 ns 435958 ns 0.96
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 421667 ns 420208 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 417917 ns 417291.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 207355 ns 208124 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 9387178.5 ns 8033766 ns 1.17
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 516354 ns 539333.5 ns 0.96
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 281432 ns 280723 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 752313 ns 718729.5 ns 1.05
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 756708.5 ns 670917 ns 1.13
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 755042 ns 681646 ns 1.11
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 714791.5 ns 671125 ns 1.07
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1039645 ns 1045689 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 48331709 ns 44612818 ns 1.08
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 7339229.5 ns 6579583 ns 1.12
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 900947 ns 909619.5 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 3463292 ns 3431646 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 3411041 ns 3418041.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 3450270.5 ns 3459666 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 3450729 ns 3424604 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 172028 ns 172982 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 8247685 ns 8225049 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1400042 ns 1418875 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 423798.5 ns 438875 ns 0.97
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 6207666.5 ns 6211958.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 6191917 ns 6239125 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 6203042 ns 6228166.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 6155750 ns 6164812.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 983454.5 ns 989377 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 49424156 ns 49957898 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 7415333.5 ns 7609083 ns 0.97
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1640284 ns 1545101 ns 1.06
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 472625 ns 470459 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 341958 ns 254333 ns 1.34
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 342229.5 ns 342000 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 901125 ns 901833 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA 46305 ns 45850.5 ns 1.01
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/oneAPI 387810 ns 874511 ns 0.44
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/Metal 488417 ns 485291 ns 1.01
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/AMDGPU 242132 ns 241413 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 2334541 ns 2331458 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 2033792 ns 1762250 ns 1.15
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 2039166 ns 2040791.5 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 3201625 ns 3281083 ns 0.98
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA 263325 ns 263882 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/oneAPI 8472401 ns 13135947 ns 0.64
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/Metal 2171166 ns 2243500 ns 0.97
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/AMDGPU 769296 ns 765467.5 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 57166 ns 57083 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 46000 ns 38854.5 ns 1.18
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 46209 ns 46125 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 83666 ns 82875 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 27911 ns 28162 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 1037054 ns 1368315 ns 0.76
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1128875 ns 1138958 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 74385.5 ns 74570.5 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2031375.5 ns 2033792 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2049062.5 ns 2094125 ns 0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2082791.5 ns 2089041.5 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2007792 ns 2003042 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 230367.5 ns 231932 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 36228924 ns 35712411 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 11447958 ns 11300791.5 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1195775 ns 1044461 ns 1.14
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 58062.5 ns 57500 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 46667 ns 39917 ns 1.17
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 46854.5 ns 46500 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 83583 ns 82625 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 49082 ns 48905 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 780496 ns 744836.5 ns 1.05
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1087542 ns 1117520.5 ns 0.97
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 73710 ns 64946 ns 1.13
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1920625 ns 1922750 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1979271 ns 1974334 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1967333.5 ns 1956833.5 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1873958 ns 1889708 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 235469 ns 239067 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 16628287 ns 16476478 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 9899750 ns 9755374.5 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1045729 ns 916609 ns 1.14
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 292 ns 291 ns 1.00
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 292 ns 333 ns 0.88
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 333 ns 292 ns 1.14
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 34659 ns 35081.5 ns 0.99
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI 1165228.5 ns 1290014 ns 0.90
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal 274666 ns 287438 ns 0.96
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU 48121 ns 45840 ns 1.05
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6583 ns 6541 ns 1.01
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6584 ns 6687.5 ns 0.98
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7125 ns 7000 ns 1.02
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6708 ns 6500 ns 1.03
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 205946 ns 205115.5 ns 1.00
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI 20242299 ns 20319441.5 ns 1.00
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal 5155666.5 ns 5303083 ns 0.97
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU 368933 ns 367174 ns 1.00
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) 291 ns 250 ns 1.16
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) 250 ns 292 ns 0.86
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) 292 ns 292 ns 1
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) 291 ns 291 ns 1
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA 31792 ns 31894 ns 1.00
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/oneAPI 1265965 ns 1192240 ns 1.06
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/Metal 256250 ns 254292 ns 1.01
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/AMDGPU 38370 ns 36310 ns 1.06
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 2875 ns 3334 ns 0.86
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 2833 ns 2958 ns 0.96
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 3042 ns 3167 ns 0.96
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 2959 ns 2958 ns 1.00
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA 186536 ns 185317.5 ns 1.01
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/oneAPI 7354138 ns 7518628 ns 0.98
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/Metal 1072208 ns 1115709 ns 0.96
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/AMDGPU 157402 ns 149472 ns 1.05
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 455958 ns 422083 ns 1.08
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 420603.5 ns 423833 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 425479.5 ns 427834 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 423104.5 ns 424937.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 137903 ns 137292 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 5954417 ns 5779699.5 ns 1.03
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 2040521 ns 2076458 ns 0.98
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 364623 ns 366143.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 3804583.5 ns 3813229.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 3812292 ns 3824249.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 3795875 ns 3788084 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 3812541.5 ns 3812042 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 705085 ns 705310 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 32654529 ns 31262641 ns 1.04
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 11097208 ns 10824937.5 ns 1.03
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1465927 ns 1464005 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) 49850208 ns 49892959 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) 35526250 ns 26011834 ns 1.37
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) 35528084 ns 35523145.5 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) 97085042 ns 97645833 ns 0.99
batchedmm(512, Bsize=32)/forward/GPU/CUDA 1598092 ns 1616287 ns 0.99
batchedmm(512, Bsize=32)/forward/GPU/AMDGPU 1049079 ns 1048102 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) 154749104.5 ns 154680021 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) 112480041.5 ns 88850291.5 ns 1.27
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) 112311583 ns 112398500 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) 295479854.5 ns 298306271 ns 0.99
batchedmm(512, Bsize=32)/zygote/GPU/CUDA 6475380 ns 6498761 ns 1.00
batchedmm(512, Bsize=32)/zygote/GPU/AMDGPU 5567482 ns 5545318 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) 20395.5 ns 19937.5 ns 1.02
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) 16708 ns 15167 ns 1.10
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) 17083.5 ns 17041.5 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) 15125 ns 14792 ns 1.02
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA 21066 ns 20017 ns 1.05
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/oneAPI 1101176 ns 1149888 ns 0.96
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/Metal 215250 ns 229541 ns 0.94
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/AMDGPU 25670 ns 27001 ns 0.95
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) 11166 ns 10417 ns 1.07
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) 9021 ns 7250 ns 1.24
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) 9271 ns 9104 ns 1.02
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) 17167 ns 17375 ns 0.99
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA 257676 ns 257217 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/oneAPI 10263846 ns 9674368 ns 1.06
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/Metal 1588958 ns 1641396 ns 0.97
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/AMDGPU 147461 ns 147861 ns 1.00
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 8750 ns 8063 ns 1.09
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 8916 ns 9125 ns 0.98
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 11167 ns 10667 ns 1.05
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 8250 ns 8917 ns 0.93
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 122695.5 ns 114750.5 ns 1.07
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI 3611838 ns 3651219 ns 0.99
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal 810334 ns 861125 ns 0.94
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU 234252 ns 233283 ns 1.00
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 9875 ns 9792 ns 1.01
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 10042 ns 10750 ns 0.93
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10500 ns 10917 ns 0.96
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 9666.5 ns 10271 ns 0.94
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 614564 ns 614307 ns 1.00
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI 22515968 ns 28192305 ns 0.80
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal 5300125.5 ns 5310750 ns 1.00
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 650845 ns 649747 ns 1.00
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 10375 ns 9708 ns 1.07
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 10208 ns 10000 ns 1.02
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 10937.5 ns 11541 ns 0.95
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 8958 ns 9584 ns 0.93
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 119909.5 ns 119206 ns 1.01
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI 3438435.5 ns 3481764 ns 0.99
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal 940042 ns 937459 ns 1.00
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU 71350 ns 72050 ns 0.99
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 15166 ns 17479.5 ns 0.87
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 17249.5 ns 14375 ns 1.20
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 16500 ns 15125 ns 1.09
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 15625 ns 14667 ns 1.07
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 588113 ns 586931 ns 1.00
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI 18956216 ns 19607421 ns 0.97
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal 4795145.5 ns 4735125 ns 1.01
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 345433 ns 343533 ns 1.01
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 500 ns 500 ns 1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 541 ns 584 ns 0.93
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 583 ns 584 ns 1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 458 ns 459 ns 1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 34662 ns 34228 ns 1.01
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI 1200054.5 ns 1215476 ns 0.99
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal 277750.5 ns 314188 ns 0.88
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU 204216.5 ns 203452 ns 1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7834 ns 9334 ns 0.84
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 8209 ns 8604.5 ns 0.95
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 9291.5 ns 9041 ns 1.03
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 8625 ns 8250 ns 1.05
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 231220.5 ns 230655 ns 1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI 21949133 ns 22072831 ns 0.99
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal 5504979 ns 5460541 ns 1.01
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 657786 ns 654892 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 17625 ns 17375 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 15458 ns 14792 ns 1.05
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 15667 ns 16000 ns 0.98
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 10375 ns 10458 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA 21715 ns 21718 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/oneAPI 1135978 ns 1102903 ns 1.03
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/Metal 207667 ns 208666 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/AMDGPU 185231 ns 184622 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 31895.5 ns 31542 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 32312.5 ns 32000 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 32187.5 ns 32208 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 32250 ns 32354.5 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA 272921 ns 271707 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/oneAPI 11269975 ns 10769694 ns 1.05
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/Metal 1796334 ns 1820875 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/AMDGPU 590985 ns 588176 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 487000 ns 452584 ns 1.08
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 446417 ns 441979.5 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 443208.5 ns 467167 ns 0.95
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 445125 ns 438521 ns 1.02
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 194886.5 ns 194827 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 5952118 ns 5920885 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1965209 ns 1997667 ns 0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 374993 ns 368184 ns 1.02
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 3824209 ns 3829250 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 3832854 ns 3838292 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 3826979.5 ns 3802021 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 3837854 ns 3830584 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 537990 ns 544632 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 27884609 ns 28778535 ns 0.97
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 9399354 ns 9720812.5 ns 0.97
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1357231 ns 1358284 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) 783420645.5 ns 831986833 ns 0.94
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) 542712125 ns 416264500 ns 1.30
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) 544393542 ns 543217708 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) 1506067228.5 ns 1509789750 ns 1.00
batchedmm(512, Bsize=512)/forward/GPU/CUDA 22742302 ns 22539644.5 ns 1.01
batchedmm(512, Bsize=512)/forward/GPU/AMDGPU 14706275 ns 14678121 ns 1.00
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) 2515795250 ns 3779013833 ns 0.67
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) 1789560125 ns 1885743917 ns 0.95
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) 3174171250 ns 1788587042 ns 1.77
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) 4755520667 ns 4810183875 ns 0.99
batchedmm(512, Bsize=512)/zygote/GPU/CUDA 365925509 ns 364565745 ns 1.00
batchedmm(512, Bsize=512)/zygote/GPU/AMDGPU 88872223 ns 88375525 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 78271 ns 75520.5 ns 1.04
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 77042 ns 76416.5 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 79750 ns 79958.5 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 76333 ns 78625 ns 0.97
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 206560.5 ns 207155.5 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 7729887 ns 7714255 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 524875 ns 534709 ns 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 118801 ns 106301.5 ns 1.12
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 278708.5 ns 235667 ns 1.18
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 193229.5 ns 283229.5 ns 0.68
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 282667 ns 247208 ns 1.14
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 193562 ns 210874.5 ns 0.92
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1040522 ns 1048818 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 42904324 ns 44375934 ns 0.97
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6113958 ns 6248084 ns 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 659445 ns 631246 ns 1.04
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) 199659188 ns 199488333 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) 138835417 ns 103922541.5 ns 1.34
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) 139066292 ns 139224666 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) 389188917 ns 393811292 ns 0.99
batchedmm(512, Bsize=128)/forward/GPU/CUDA 5820554 ns 5835255 ns 1.00
batchedmm(512, Bsize=128)/forward/GPU/AMDGPU 3565645.5 ns 3578582 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) 620296312.5 ns 620321291.5 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) 440668667 ns 354710917 ns 1.24
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) 438212625 ns 440219958 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) 1178525042 ns 1185414250 ns 0.99
batchedmm(512, Bsize=128)/zygote/GPU/CUDA 26711415.5 ns 26495134 ns 1.01
batchedmm(512, Bsize=128)/zygote/GPU/AMDGPU 22018976 ns 22065145 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7375 ns 7417 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6042 ns 5417 ns 1.12
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6250 ns 6292 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10042 ns 10145.5 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 27655.5 ns 27466 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1222067 ns 1213453.5 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 582458 ns 432833 ns 1.35
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 48381 ns 47620 ns 1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 213583 ns 213000 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 220395.5 ns 223041 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 220604 ns 220917 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 208021 ns 206896 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 219078 ns 223324 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 31953958 ns 31525343 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 9203250.5 ns 9133958 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 526564 ns 524095 ns 1.00
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 10375 ns 8854.5 ns 1.17
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 8625 ns 9312.5 ns 0.93
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 11271 ns 10583 ns 1.07
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 7375 ns 9625 ns 0.77
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 117396 ns 116401 ns 1.01
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI 3513776 ns 3333892 ns 1.05
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal 877687.5 ns 911750 ns 0.96
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU 69271 ns 69370 ns 1.00
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 9333 ns 7437.5 ns 1.25
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 11458 ns 8854 ns 1.29
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 8125 ns 7959 ns 1.02
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 9458.5 ns 9145.5 ns 1.03
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 518767 ns 515224 ns 1.01
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI 20215136 ns 18606821 ns 1.09
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal 4550083 ns 4708917 ns 0.97
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 317923 ns 318334 ns 1.00
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 417 ns 375 ns 1.11
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 459 ns 709 ns 0.65
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 542 ns 500 ns 1.08
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 625 ns 500 ns 1.25
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 26366 ns 25690 ns 1.03
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI 1195376 ns 1183861 ns 1.01
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal 477666.5 ns 493792 ns 0.97
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU 46620 ns 46791 ns 1.00
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 8459 ns 9000 ns 0.94
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 8770.5 ns 10791.5 ns 0.81
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 9749.5 ns 9854.5 ns 0.99
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 10500 ns 10042 ns 1.05
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 252367 ns 251338.5 ns 1.00
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI 23130848 ns 23713128.5 ns 0.98
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal 5955833.5 ns 6062250 ns 0.98
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU 388573 ns 386044 ns 1.01
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 107187 ns 107354.5 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 98312 ns 84667 ns 1.16
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 100917 ns 100375 ns 1.01
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 146541 ns 146729.5 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA 24461 ns 24618 ns 0.99
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/oneAPI 1106973 ns 1206806.5 ns 0.92
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/Metal 268041.5 ns 266292 ns 1.01
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/AMDGPU 190491 ns 190862 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 500833 ns 478500 ns 1.05
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 513042 ns 492271 ns 1.04
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 505270.5 ns 481000 ns 1.05
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 478625 ns 479145.5 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA 230476 ns 230580 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/oneAPI 11382686 ns 11914566 ns 0.96
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/Metal 2119250.5 ns 2188458.5 ns 0.97
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/AMDGPU 606815 ns 605276 ns 1.00
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) 5750 ns 6042 ns 0.95
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) 5770.5 ns 7000 ns 0.82
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) 7166 ns 7583 ns 0.95
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) 4062.5 ns 6000 ns 0.68
batchedmm(16, Bsize=32)/forward/GPU/CUDA 15861 ns 16947 ns 0.94
batchedmm(16, Bsize=32)/forward/GPU/AMDGPU 79281 ns 79345.5 ns 1.00
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) 11375 ns 12062.5 ns 0.94
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) 10791.5 ns 10542 ns 1.02
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) 11125 ns 10917 ns 1.02
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) 17375 ns 18208 ns 0.95
batchedmm(16, Bsize=32)/zygote/GPU/CUDA 210987.5 ns 212062.5 ns 0.99
batchedmm(16, Bsize=32)/zygote/GPU/AMDGPU 375968 ns 367674 ns 1.02
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) 39958 ns 39750 ns 1.01
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) 51000 ns 50708 ns 1.01
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) 52854.5 ns 52625 ns 1.00
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) 13625 ns 13750 ns 0.99
batchedmm(16, Bsize=128)/forward/GPU/CUDA 21619 ns 19888.5 ns 1.09
batchedmm(16, Bsize=128)/forward/GPU/AMDGPU 86201 ns 87991 ns 0.98
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) 36083 ns 36500 ns 0.99
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) 30937.5 ns 28959 ns 1.07
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) 30937 ns 31500 ns 0.98
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) 58145.5 ns 58583 ns 0.99
batchedmm(16, Bsize=128)/zygote/GPU/CUDA 191196.5 ns 190552 ns 1.00
batchedmm(16, Bsize=128)/zygote/GPU/AMDGPU 397084 ns 413955 ns 0.96
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) 1708 ns 1750 ns 0.98
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) 1833 ns 1937.5 ns 0.95
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) 2208 ns 2125 ns 1.04
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) 1708 ns 1792 ns 0.95
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA 20510 ns 20369 ns 1.01
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/oneAPI 1162226 ns 1137759 ns 1.02
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/Metal 310521 ns 312000 ns 1.00
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/AMDGPU 33601 ns 32711 ns 1.03
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) 2125 ns 2250 ns 0.94
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) 2084 ns 2396 ns 0.87
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) 2250 ns 2333 ns 0.96
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) 2125 ns 2250 ns 0.94
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA 202165 ns 201543.5 ns 1.00
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/oneAPI 9408568.5 ns 9195441 ns 1.02
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/Metal 1661458 ns 1575208 ns 1.05
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/AMDGPU 135986.5 ns 136711 ns 0.99
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 6187.5 ns 4562.5 ns 1.36
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 4708 ns 4708.5 ns 1.00
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 6125 ns 6834 ns 0.90
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 4917 ns 5125 ns 0.96
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 142748.5 ns 144149.5 ns 0.99
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI 5789328 ns 5753580 ns 1.01
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal 742959 ns 707854 ns 1.05
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU 70511 ns 69031 ns 1.02
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8333 ns 8167 ns 1.02
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8500 ns 9250 ns 0.92
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8542 ns 8667 ns 0.99
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 9375 ns 9209 ns 1.02
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 867834 ns 867994 ns 1.00
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI 39735735.5 ns 37396018.5 ns 1.06
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal 5582541.5 ns 5747500 ns 0.97
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU 384434 ns 386354 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 56709 ns 56917 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 57708 ns 56875 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 57667 ns 57833 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 58209 ns 58125 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 37190 ns 37109 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1219055 ns 1131214.5 ns 1.08
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 385729.5 ns 421167 ns 0.92
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 215301 ns 203222.5 ns 1.06
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 447916 ns 451020.5 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 463563 ns 475979 ns 0.97
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 464458.5 ns 465354 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 433979 ns 487041.5 ns 0.89
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 263643 ns 264507 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 26923755 ns 28501147 ns 0.94
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 8401250 ns 7943604 ns 1.06
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 823927 ns 830424 ns 0.99
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) 3319417 ns 3311000 ns 1.00
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) 2332542 ns 1770250 ns 1.32
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) 2335291 ns 2337729.5 ns 1.00
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) 6329458 ns 6302417 ns 1.00
batchedmm(128, Bsize=128)/forward/GPU/CUDA 205595 ns 204131.5 ns 1.01
batchedmm(128, Bsize=128)/forward/GPU/AMDGPU 208502 ns 211992 ns 0.98
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) 11418312.5 ns 11485250 ns 0.99
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) 8313229.5 ns 6571812.5 ns 1.26
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) 8312625.5 ns 8309250 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) 21175145.5 ns 21151875.5 ns 1.00
batchedmm(128, Bsize=128)/zygote/GPU/CUDA 735748 ns 735481 ns 1.00
batchedmm(128, Bsize=128)/zygote/GPU/AMDGPU 1053319 ns 1057071 ns 1.00
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 5333 ns 5125 ns 1.04
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 5145.5 ns 5375 ns 0.96
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 6770.5 ns 7125 ns 0.95
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 5250 ns 6208.5 ns 0.85
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 135858 ns 137212.5 ns 0.99
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI 5802378 ns 5624260 ns 1.03
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal 848125 ns 793500 ns 1.07
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU 56221 ns 56010 ns 1.00
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7375 ns 7000 ns 1.05
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7250 ns 7500 ns 0.97
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7292 ns 7458 ns 0.98
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7416 ns 9083 ns 0.82
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 750240.5 ns 754137 ns 0.99
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI 36356150 ns 34576213 ns 1.05
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal 5157167 ns 5244167 ns 0.98
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU 368223 ns 366813 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 102291 ns 103250 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 94459 ns 103875 ns 0.91
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 98875 ns 125291 ns 0.79
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 95417 ns 101042 ns 0.94
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 150934 ns 151348 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 6128379 ns 6050689.5 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 2030146 ns 2052375 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 203502 ns 203192 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2023916 ns 2018375 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2024250 ns 2029000 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2027646 ns 2023521 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2034083 ns 1991417 ns 1.02
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 701876 ns 703391 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 32467383 ns 31442085 ns 1.03
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 10818208 ns 11046312.5 ns 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1249360 ns 1250762 ns 1.00
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) 35292 ns 34667 ns 1.02
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) 35875 ns 34750 ns 1.03
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) 35083 ns 35041.5 ns 1.00
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) 791 ns 646 ns 1.22
batchedmm(2, Bsize=4)/forward/GPU/CUDA 15147 ns 15242 ns 0.99
batchedmm(2, Bsize=4)/forward/GPU/AMDGPU 79081 ns 79571 ns 0.99
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) 2645.5 ns 2729.5 ns 0.97
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) 2709 ns 2917 ns 0.93
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) 2875 ns 3000 ns 0.96
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) 2083 ns 2208 ns 0.94
batchedmm(2, Bsize=4)/zygote/GPU/CUDA 138748 ns 139866 ns 0.99
batchedmm(2, Bsize=4)/zygote/GPU/AMDGPU 340393 ns 342158.5 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7167 ns 7167 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5958 ns 5417 ns 1.10
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6042 ns 6084 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10041 ns 10042 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 36394 ns 36552 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1208263.5 ns 1221281.5 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 351562.5 ns 674708 ns 0.52
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 47980 ns 48261 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 212958.5 ns 213624.5 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 220083.5 ns 221166.5 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 220313 ns 220812.5 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 205916 ns 205833 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 241977 ns 243393.5 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 27639627 ns 25870086.5 ns 1.07
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 8048833 ns 7741583 ns 1.04
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 571310 ns 575566 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3917 ns 3958 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3917 ns 3959 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3917 ns 3958 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3958 ns 3958 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA 21257 ns 21563 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/oneAPI 2138204 ns 2027782.5 ns 1.05
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/Metal 243542 ns 250542 ns 0.97
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/AMDGPU 43251 ns 43640 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 14875 ns 14917 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 14834 ns 14791 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 14917 ns 14958 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 14667 ns 14917 ns 0.98
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA 307482 ns 306375 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/oneAPI 11303877 ns 11210297 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/Metal 1023584 ns 1037625 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/AMDGPU 191196 ns 194327 ns 0.98
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 107000 ns 105583 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 100833 ns 106167 ns 0.95
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 104000 ns 124875 ns 0.83
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 101417 ns 102583 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 146441 ns 139877 ns 1.05
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 6052419 ns 5810927 ns 1.04
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 2004875 ns 2048416 ns 0.98
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 204652 ns 208802 ns 0.98
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1922750 ns 1878500 ns 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1900416.5 ns 1927583.5 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1923479 ns 1867521 ns 1.03
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1924771 ns 1917937.5 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 688454 ns 684487.5 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 31437331 ns 30087516 ns 1.04
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 10713709 ns 10640458 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1150920 ns 1063341 ns 1.08
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 20792 ns 17583 ns 1.18
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 18688 ns 19500 ns 0.96
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 20833 ns 20708 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 17542 ns 18791 ns 0.93
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 108224.5 ns 109550 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3604915 ns 3331480 ns 1.08
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 1331375 ns 1318708 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 79140 ns 80701 ns 0.98
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 216291.5 ns 216271 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 221625 ns 222292 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 217000.5 ns 217916 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 215666.5 ns 216167 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 517432 ns 516519 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 18685063.5 ns 19724665.5 ns 0.95
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6219542 ns 6017791.5 ns 1.03
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 476044 ns 477585 ns 1.00
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) 26292 ns 26583 ns 0.99
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) 30000 ns 28770.5 ns 1.04
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) 28875 ns 29104 ns 0.99
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) 1292 ns 1334 ns 0.97
batchedmm(16, Bsize=4)/forward/GPU/CUDA 15962 ns 15984 ns 1.00
batchedmm(16, Bsize=4)/forward/GPU/AMDGPU 80911 ns 81921 ns 0.99
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) 4833.5 ns 4833.5 ns 1
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) 5000.5 ns 4833 ns 1.03
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) 5354.5 ns 5208.5 ns 1.03
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) 4209 ns 4333 ns 0.97
batchedmm(16, Bsize=4)/zygote/GPU/CUDA 205690.5 ns 206128 ns 1.00
batchedmm(16, Bsize=4)/zygote/GPU/AMDGPU 379593 ns 379654 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 305875 ns 305792 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 305541 ns 306042 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 307729 ns 306833 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 306916 ns 307083 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 227217 ns 227988.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 7859423.5 ns 7778230 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 633291.5 ns 1241125 ns 0.51
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 272362 ns 272793 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 534458 ns 535708 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 597417 ns 533084 ns 1.12
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 590709 ns 538208 ns 1.10
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 530417 ns 530917 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1072354 ns 1080430 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 42952925 ns 42644591.5 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6417187.5 ns 6182083 ns 1.04
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 865447 ns 851073.5 ns 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 21875 ns 19125 ns 1.14
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 20042 ns 20624.5 ns 0.97
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 22000 ns 21458 ns 1.03
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 19750 ns 20000 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 112832 ns 112864 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3590647 ns 3473281 ns 1.03
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 1450250 ns 1444854 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 79501 ns 80611 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 214292 ns 220167 ns 0.97
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 223625 ns 222791.5 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 215459 ns 214771 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 212459 ns 212625 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 736403 ns 737028 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 25495544 ns 25214419 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 7182729 ns 7109375 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 535594 ns 531685 ns 1.01
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 6667 ns 5916 ns 1.13
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 7375 ns 7083 ns 1.04
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 8125 ns 8604.5 ns 0.94
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 6333 ns 6500 ns 0.97
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 138851.5 ns 140088 ns 0.99
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI 5554452.5 ns 5562789 ns 1.00
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal 874521 ns 803937.5 ns 1.09
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU 65381 ns 64661 ns 1.01
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 9917 ns 10000 ns 0.99
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 10562.5 ns 10937.5 ns 0.97
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 10938 ns 10750 ns 1.02
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 10334 ns 10041 ns 1.03
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 820981 ns 822803 ns 1.00
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI 36828865 ns 36817844 ns 1.00
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal 5371520.5 ns 5484583 ns 0.98
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU 382893 ns 382033 ns 1.00
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 5917 ns 4334 ns 1.37
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 5042 ns 5291 ns 0.95
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 6041 ns 7333 ns 0.82
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 4792 ns 5584 ns 0.86
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 142980.5 ns 142901.5 ns 1.00
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI 5709761 ns 5758977.5 ns 0.99
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal 854000 ns 800458 ns 1.07
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU 68440 ns 66271 ns 1.03
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7375 ns 7208 ns 1.02
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7604.5 ns 7646 ns 0.99
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7833 ns 7750 ns 1.01
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7542 ns 7583 ns 0.99
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 780235.5 ns 782456.5 ns 1.00
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI 37248785 ns 39501262 ns 0.94
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal 5909667 ns 6034250 ns 0.98
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 394663.5 ns 392794 ns 1.00
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) 14461375 ns 14539375 ns 0.99
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) 10111500 ns 7723291.5 ns 1.31
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) 10115583 ns 10145625 ns 1.00
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) 27830500 ns 27763416 ns 1.00
batchedmm(128, Bsize=512)/forward/GPU/CUDA 529523 ns 554910 ns 0.95
batchedmm(128, Bsize=512)/forward/GPU/AMDGPU 393383 ns 393434 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) 46327458.5 ns 46429208.5 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) 33413250 ns 26609416 ns 1.26
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) 33418167 ns 33517458 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) 85714292 ns 85405667 ns 1.00
batchedmm(128, Bsize=512)/zygote/GPU/CUDA 2641351 ns 2664805 ns 0.99
batchedmm(128, Bsize=512)/zygote/GPU/AMDGPU 3305837.5 ns 3291838.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 67750 ns 66292 ns 1.02
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 67188 ns 67875 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 68625 ns 68250 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 66125 ns 65917 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 120297 ns 119249 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3401525.5 ns 3647654 ns 0.93
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 1454375 ns 1440312.5 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 233432 ns 232702 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 442208 ns 441250 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 440500 ns 441625 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 450791 ns 447167 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 440250 ns 441478.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 724096 ns 727144.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 27244425 ns 26208342 ns 1.04
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 7830291 ns 7477375 ns 1.05
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 787496 ns 793922.5 ns 0.99
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 500 ns 500 ns 1
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 500 ns 584 ns 0.86
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 625 ns 625 ns 1
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 542 ns 583 ns 0.93
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 31907 ns 31836 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI 1176105 ns 1180672 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal 456084 ns 286667 ns 1.59
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU 48991 ns 47841 ns 1.02
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 8833 ns 9458 ns 0.93
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 9166 ns 9271 ns 0.99
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 9792 ns 9750 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 9417 ns 9416 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 283657 ns 283587 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI 20890296.5 ns 22547365 ns 0.93
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal 5446750 ns 5502666.5 ns 0.99
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 375983 ns 374188.5 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 9792 ns 9792 ns 1
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 9833 ns 9833 ns 1
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 9833 ns 9875 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 9834 ns 9875 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA 22906 ns 22851 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/oneAPI 2026316 ns 2120178 ns 0.96
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/Metal 223375 ns 221333 ns 1.01
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/AMDGPU 208602 ns 207772 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 45917 ns 46167 ns 0.99
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 46125 ns 46083 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 46166 ns 46417 ns 0.99
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 45667 ns 46062.5 ns 0.99
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA 288163 ns 287950 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/oneAPI 11404659.5 ns 12273456 ns 0.93
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/Metal 1352041 ns 1033833.5 ns 1.31
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/AMDGPU 600205 ns 600566 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 56250 ns 56167 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 57042 ns 56875 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 57167 ns 57166 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 58000 ns 57875 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 28634 ns 28495 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1199631 ns 1157087.5 ns 1.04
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 678271 ns 660125 ns 1.03
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 202202 ns 202572 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 456521 ns 448229 ns 1.02
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 464562.5 ns 464979 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 473020.5 ns 472292 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 434104 ns 474437.5 ns 0.91
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 243054 ns 244496.5 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 32218901.5 ns 33157318.5 ns 0.97
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 9969250 ns 9248750 ns 1.08
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 881277 ns 888349 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 651521 ns 614125 ns 1.06
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 642167 ns 648750 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 608000 ns 652521 ns 0.93
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 619354.5 ns 642542 ns 0.96
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 204222 ns 208606.5 ns 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 8182516.5 ns 7841403 ns 1.04
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1372521 ns 1401250 ns 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 304733 ns 305493 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2224041.5 ns 2245937.5 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2229479 ns 2247291 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2229542 ns 2238062.5 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2244687 ns 2241541 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 967724 ns 971988 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 49487921 ns 48958299 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 7073541 ns 7597458.5 ns 0.93
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1316971.5 ns 1213901.5 ns 1.08
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 21333 ns 19333 ns 1.10
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 20895.5 ns 21646 ns 0.97
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 23000 ns 21833 ns 1.05
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 19042 ns 24291 ns 0.78
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 112093 ns 111706.5 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3575165.5 ns 3500994.5 ns 1.02
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 1370042 ns 1437895.5 ns 0.95
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 80870.5 ns 79141 ns 1.02
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 221250 ns 219459 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 253917 ns 219791.5 ns 1.16
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 228958 ns 222104.5 ns 1.03
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 219542 ns 219875 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 727914 ns 728212.5 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 27540388.5 ns 26675294 ns 1.03
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 7698208 ns 7278312 ns 1.06
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 552954 ns 555140 ns 1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 500 ns 500 ns 1
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 500 ns 584 ns 0.86
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 625 ns 667 ns 0.94
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 500 ns 583 ns 0.86
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 22957 ns 22972 ns 1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI 1275000 ns 1186538 ns 1.07
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal 478875.5 ns 461542 ns 1.04
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU 47341 ns 49541 ns 0.96
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 9792 ns 9750 ns 1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 9333 ns 9333.5 ns 1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 9291.5 ns 9896 ns 0.94
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 9104 ns 10000 ns 0.91
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 266060 ns 265448 ns 1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI 24079846.5 ns 24827341.5 ns 0.97
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal 6178000 ns 6076333 ns 1.02
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 397054 ns 415154 ns 0.96
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 9000 ns 7917 ns 1.14
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 8313 ns 10208 ns 0.81
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 11229.5 ns 10542 ns 1.07
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 7791 ns 9292 ns 0.84
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 119220 ns 118520 ns 1.01
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI 3267882.5 ns 3378687 ns 0.97
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal 897312 ns 891583 ns 1.01
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU 68721 ns 75371 ns 0.91
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7312.5 ns 7291.5 ns 1.00
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7292 ns 7875 ns 0.93
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7916 ns 7833.5 ns 1.01
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7437.5 ns 7708 ns 0.96
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 506373 ns 503824 ns 1.01
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI 18976858 ns 17507211 ns 1.08
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal 4417625 ns 4534375 ns 0.97
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU 319133 ns 318933 ns 1.00
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1583 ns 1437.5 ns 1.10
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1479.5 ns 1667 ns 0.89
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1958 ns 1917 ns 1.02
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1459 ns 1417 ns 1.03
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA 21067 ns 21272 ns 0.99
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/oneAPI 1168198 ns 1191094 ns 0.98
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/Metal 303833.5 ns 307229 ns 0.99
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/AMDGPU 188261 ns 189132 ns 1.00
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 3208 ns 3292 ns 0.97
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 3459 ns 3333 ns 1.04
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 3375 ns 3500 ns 0.96
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 3250 ns 3500 ns 0.93
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA 219575 ns 216668.5 ns 1.01
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/oneAPI 10655031 ns 10523301.5 ns 1.01
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/Metal 1792187.5 ns 1655750 ns 1.08
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/AMDGPU 578625 ns 579466 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) 148104 ns 148229.5 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) 127729 ns 106166.5 ns 1.20
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) 129479 ns 129250 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) 225625 ns 225167 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA 23937 ns 23640 ns 1.01
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/oneAPI 1210144 ns 1169047 ns 1.04
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/Metal 272875 ns 281229 ns 0.97
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/AMDGPU 39341 ns 40580 ns 0.97
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) 143270.5 ns 143125 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) 123208 ns 87375 ns 1.41
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) 110459 ns 112875.5 ns 0.98
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) 251895.5 ns 250792 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA 215563.5 ns 214898 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/oneAPI 10368835 ns 10468792 ns 0.99
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/Metal 2024375 ns 2056708 ns 0.98
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/AMDGPU 267812 ns 266232 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7209 ns 7208 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6041 ns 5375 ns 1.12
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6000 ns 6083 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10208 ns 10000 ns 1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 32658 ns 33010 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1187497 ns 1218913 ns 0.97
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 564146 ns 357271 ns 1.58
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 48321 ns 50911 ns 0.95
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 226938 ns 227938 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 227916 ns 228354.5 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 228125 ns 235708 ns 0.97
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 212958.5 ns 249729 ns 0.85
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 259894 ns 263220 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 27209302 ns 28851277 ns 0.94
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 8471583 ns 8089625 ns 1.05
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 594335 ns 591956 ns 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 14917 ns 15375 ns 0.97
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 15187 ns 14917 ns 1.02
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 16937.5 ns 16834 ns 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 14708 ns 15583 ns 0.94
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 138769.5 ns 138290 ns 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI 5609166 ns 5390404 ns 1.04
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal 857375 ns 805167 ns 1.06
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU 231592 ns 231372.5 ns 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 23625 ns 23333 ns 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 24229 ns 23438 ns 1.03
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 23625 ns 24459 ns 0.97
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 23417 ns 23666 ns 0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 859944 ns 863635.5 ns 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI 40379406 ns 39146915 ns 1.03
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal 6046000 ns 5702250 ns 1.06
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 690055 ns 683727 ns 1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 9834 ns 8875 ns 1.11
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 9478.5 ns 10041.5 ns 0.94
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 12145.5 ns 11750 ns 1.03
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 8459 ns 9917 ns 0.85
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 123004 ns 122685 ns 1.00
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI 3326674 ns 3570923 ns 0.93
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal 859125 ns 917271 ns 0.94
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU 76391 ns 75270 ns 1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 13459 ns 14166 ns 0.95
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 14166.5 ns 14458.5 ns 0.98
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 14042 ns 14979.5 ns 0.94
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 13791 ns 13542 ns 1.02
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 666299 ns 660959 ns 1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI 21583968 ns 21424061 ns 1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal 5388292 ns 5279979 ns 1.02
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU 375715 ns 365744 ns 1.03
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 10125 ns 8417 ns 1.20
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 9083 ns 10146 ns 0.90
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 11041.5 ns 12125 ns 0.91
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 9459 ns 9792 ns 0.97
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 121924 ns 121433.5 ns 1.00
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI 3403856.5 ns 3352559.5 ns 1.02
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal 933083 ns 952146 ns 0.98
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU 72331 ns 72460 ns 1.00
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 12167 ns 13166 ns 0.92
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 12083 ns 12938 ns 0.93
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 13125 ns 13125 ns 1
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 12104.5 ns 12916 ns 0.94
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 550706 ns 548948 ns 1.00
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI 19323121 ns 18645332 ns 1.04
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal 4728041 ns 4735063 ns 1.00
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU 341615 ns 340583 ns 1.00
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) 29770.5 ns 31125.5 ns 0.96
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) 34792 ns 31520.5 ns 1.10
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) 31646 ns 32333.5 ns 0.98
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) 1834 ns 1834 ns 1
batchedmm(2, Bsize=128)/forward/GPU/CUDA 15895 ns 16210 ns 0.98
batchedmm(2, Bsize=128)/forward/GPU/AMDGPU 80311 ns 80860 ns 0.99
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) 5354.5 ns 5229.5 ns 1.02
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) 5312.5 ns 4959 ns 1.07
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) 5333.5 ns 5250 ns 1.02
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) 6166 ns 6334 ns 0.97
batchedmm(2, Bsize=128)/zygote/GPU/CUDA 138354.5 ns 138594 ns 1.00
batchedmm(2, Bsize=128)/zygote/GPU/AMDGPU 385195 ns 388224 ns 0.99
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 250 ns 291 ns 0.86
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 250 ns 375 ns 0.67
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 291 ns 334 ns 0.87
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 26010 ns 25350 ns 1.03
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI 1215712 ns 1199368 ns 1.01
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal 456396 ns 478250.5 ns 0.95
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU 48931 ns 49490 ns 0.99
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6291 ns 6292 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6416.5 ns 6750 ns 0.95
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 6625 ns 6792 ns 0.98
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6292 ns 6584 ns 0.96
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 186334.5 ns 186417 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI 24541666 ns 23013025 ns 1.07
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal 5713208.5 ns 5920458 ns 0.96
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU 388596 ns 393209 ns 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 2000 ns 1958 ns 1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 1959 ns 2042 ns 0.96
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 2042 ns 2083 ns 0.98
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 2000 ns 2000 ns 1
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 26508.5 ns 25999.5 ns 1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI 1230942 ns 1183440.5 ns 1.04
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal 461166.5 ns 314229 ns 1.47
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU 205793 ns 206522 ns 1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 16125 ns 16583.5 ns 0.97
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 16625 ns 15958 ns 1.04
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 16875 ns 16854 ns 1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 16000 ns 16791.5 ns 0.95
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 274700 ns 272947 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI 25617930.5 ns 25132475.5 ns 1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal 6184792 ns 6200500 ns 1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 700900 ns 699897 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 195959 ns 158000 ns 1.24
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 148437.5 ns 152895.5 ns 0.97
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 156042 ns 179875 ns 0.87
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 146563 ns 175625 ns 0.83
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 201860 ns 205507.5 ns 0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 7750006.5 ns 8109426 ns 0.96
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1388208 ns 1459854.5 ns 0.95
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 223213 ns 213437 ns 1.05
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1322291.5 ns 1279667 ns 1.03
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1326083 ns 1336958 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1330937.5 ns 1276333 ns 1.04
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1337083 ns 1332729.5 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 903712 ns 907688 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 45331996 ns 46524861.5 ns 0.97
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 6766854.5 ns 6921834 ns 0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1095360 ns 1109576 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 25750 ns 25937.5 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 25875 ns 25750 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 27709 ns 27437.5 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 24125 ns 24042 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 234038 ns 236630 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 7956003 ns 7924614 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 976958 ns 1195645.5 ns 0.82
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 113361.5 ns 112891.5 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 117812.5 ns 117812.5 ns 1
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 176771 ns 125958 ns 1.40
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 132292 ns 130667 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 124833 ns 132625 ns 0.94
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1069728 ns 1078111.5 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 46049440 ns 48454865.5 ns 0.95
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6183167 ns 6291354 ns 0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 611759 ns 604836 ns 1.01
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 250 ns 250 ns 1
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 250 ns 375 ns 0.67
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 291 ns 334 ns 0.87
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 22728 ns 22703 ns 1.00
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI 1231861 ns 1228350.5 ns 1.00
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal 452750 ns 303875 ns 1.49
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU 46601 ns 47155.5 ns 0.99
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6625 ns 6333 ns 1.05
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6458 ns 6937.5 ns 0.93
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 6792 ns 6750 ns 1.01
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6395.5 ns 6687.5 ns 0.96
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 202592 ns 201918.5 ns 1.00
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI 25214681 ns 24022047 ns 1.05
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal 6171875 ns 6154291 ns 1.00
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 391445 ns 390799 ns 1.00
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 6167 ns 5584 ns 1.10
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 5791.5 ns 6729 ns 0.86
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 7583 ns 7834 ns 0.97
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 5916 ns 6333 ns 0.93
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 143756.5 ns 144556.5 ns 0.99
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI 5728741 ns 5802837 ns 0.99
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal 472791 ns 465083.5 ns 1.02
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU 232863 ns 231623 ns 1.01
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 10166.5 ns 9875 ns 1.03
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 10083 ns 10500 ns 0.96
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10083 ns 10250 ns 0.98
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 9958 ns 10084 ns 0.99
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 893713 ns 898422 ns 0.99
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI 42287345 ns 41540865 ns 1.02
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal 6328334 ns 5925625 ns 1.07
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 669549 ns 667721.5 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 625 ns 625 ns 1
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 625 ns 625 ns 1
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 666 ns 625 ns 1.07
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 667 ns 667 ns 1
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA 22019 ns 22281 ns 0.99
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/oneAPI 1940958 ns 2048848.5 ns 0.95
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/Metal 222667 ns 228500 ns 0.97
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/AMDGPU 206602.5 ns 205022 ns 1.01
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 4583 ns 4625 ns 0.99
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 4625 ns 4625 ns 1
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 4916 ns 4791 ns 1.03
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 4625 ns 4584 ns 1.01
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA 224705 ns 224113.5 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/oneAPI 10067778 ns 11648202 ns 0.86
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/Metal 1710291 ns 1667208 ns 1.03
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/AMDGPU 577498 ns 578966 ns 1.00
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 8167 ns 8604.5 ns 0.95
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 8041.5 ns 9500 ns 0.85
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 10500 ns 10125 ns 1.04
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 7833 ns 8125 ns 0.96
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 121515 ns 121216 ns 1.00
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI 3793308.5 ns 3493631.5 ns 1.09
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal 796583 ns 797562.5 ns 1.00
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU 73461 ns 73391 ns 1.00
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8292 ns 8166.5 ns 1.02
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8291.5 ns 9020.5 ns 0.92
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8541 ns 9292 ns 0.92
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8625 ns 8834 ns 0.98
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 586857 ns 585686 ns 1.00
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI 20229154 ns 21659888 ns 0.93
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal 5084312.5 ns 5138604.5 ns 0.99
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU 342285 ns 345673 ns 0.99
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) 126646 ns 128166 ns 0.99
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) 129146 ns 95895.5 ns 1.35
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) 129834 ns 130416 ns 1.00
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) 180625 ns 193500 ns 0.93
batchedmm(128, Bsize=4)/forward/GPU/CUDA 45632 ns 45829 ns 1.00
batchedmm(128, Bsize=4)/forward/GPU/AMDGPU 100411 ns 100941 ns 0.99
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) 316563 ns 335583 ns 0.94
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) 326709 ns 167167 ns 1.95
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) 329187.5 ns 354375 ns 0.93
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) 584417 ns 609249.5 ns 0.96
batchedmm(128, Bsize=4)/zygote/GPU/CUDA 190484 ns 190876 ns 1.00
batchedmm(128, Bsize=4)/zygote/GPU/AMDGPU 499046 ns 517555 ns 0.96
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) 398250 ns 397541 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) 288042 ns 215333 ns 1.34
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) 288458 ns 288458 ns 1
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) 756875 ns 756458 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA 43262 ns 43687 ns 0.99
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/oneAPI 1353668 ns 1356444.5 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/Metal 416333.5 ns 420167 ns 0.99
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/AMDGPU 80431 ns 80321 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 1450583 ns 1457000 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 1135583 ns 862125 ns 1.32
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 1135479.5 ns 1134520.5 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 2361583 ns 2444500 ns 0.97
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA 246229 ns 251807.5 ns 0.98
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/oneAPI 11102894 ns 10565821 ns 1.05
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/Metal 1848812.5 ns 1852750 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/AMDGPU 352430 ns 350374 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 627958 ns 683334 ns 0.92
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 648250 ns 650583 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 652979 ns 641791.5 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 649354 ns 653250 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 201989 ns 202465 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 8297683 ns 8364163.5 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1350375 ns 1384458 ns 0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 302004 ns 302773 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2448208 ns 2447209 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2442541 ns 2468625 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2446583 ns 2446166.5 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2461375 ns 2452188 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 986950 ns 992979 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 51627338 ns 51629265.5 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 12287584 ns 9882875 ns 1.24
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1385198 ns 1311863 ns 1.06
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) 34583 ns 34667 ns 1.00
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) 34604.5 ns 34291.5 ns 1.01
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) 34187 ns 35521 ns 0.96
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) 979 ns 875 ns 1.12
batchedmm(2, Bsize=32)/forward/GPU/CUDA 15115 ns 15660 ns 0.97
batchedmm(2, Bsize=32)/forward/GPU/AMDGPU 78861 ns 78941 ns 1.00
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) 3250 ns 3125 ns 1.04
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) 3083.5 ns 3458.5 ns 0.89
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) 3292 ns 3312.5 ns 0.99
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) 3000 ns 3084 ns 0.97
batchedmm(2, Bsize=32)/zygote/GPU/CUDA 136884.5 ns 137070.5 ns 1.00
batchedmm(2, Bsize=32)/zygote/GPU/AMDGPU 333775 ns 338254 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 406333.5 ns 406166 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 407959 ns 404458 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 408250 ns 408458 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 421625 ns 420458 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 42816 ns 42995 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 1337237.5 ns 1466063 ns 0.91
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1160458.5 ns 1144125 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 237128 ns 238192 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 3867166 ns 3877875 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 3999583 ns 3990896 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 3994541 ns 3992562.5 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 3753104 ns 3778146 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 240654 ns 240990 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 36363725.5 ns 36589646 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 11696917 ns 11933709 ns 0.98
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1245696 ns 1433854 ns 0.87
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3917 ns 3916 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3875 ns 3958 ns 0.98
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3916 ns 3917 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3917 ns 3917 ns 1
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA 33162 ns 33931 ns 0.98
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/oneAPI 1219454 ns 1232713.5 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/Metal 179458 ns 183709 ns 0.98
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/AMDGPU 37931 ns 38031 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 15625 ns 15708 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 15708 ns 15750 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 16000 ns 15958 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 15500 ns 15750 ns 0.98
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA 251974 ns 252887 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/oneAPI 10610220 ns 9179273 ns 1.16
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/Metal 863958.5 ns 893625 ns 0.97
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/AMDGPU 162802 ns 172862 ns 0.94
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) 404583 ns 404417 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) 295833 ns 221125 ns 1.34
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) 295625 ns 296500 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) 760708 ns 761125 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA 113013 ns 112867 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/oneAPI 1012749 ns 1050270.5 ns 0.96
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/Metal 433875 ns 406792 ns 1.07
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/AMDGPU 87531 ns 87471 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 1468208 ns 1471292 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 1161479 ns 884000 ns 1.31
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 1161334 ns 1160146 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 2383584 ns 2466083.5 ns 0.97
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA 236752 ns 238614 ns 0.99
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/oneAPI 10443178 ns 9255273 ns 1.13
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/Metal 1920167 ns 1932833 ns 0.99
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/AMDGPU 354105 ns 350549 ns 1.01
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 500 ns 500 ns 1
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 459 ns 583 ns 0.79
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 584 ns 583 ns 1.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 500 ns 583 ns 0.86
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 25752 ns 25487 ns 1.01
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI 1271548 ns 1217335.5 ns 1.04
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal 477750.5 ns 387333 ns 1.23
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU 205523 ns 206202 ns 1.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 7292 ns 7375 ns 0.99
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 7542 ns 8020.5 ns 0.94
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 7687.5 ns 7916 ns 0.97
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 7583 ns 7542 ns 1.01
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 216586.5 ns 209854.5 ns 1.03
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI 26033942 ns 25469136 ns 1.02
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal 6012333 ns 6294375 ns 0.96
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 686818.5 ns 684857 ns 1.00
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) 840000 ns 833124.5 ns 1.01
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) 615666 ns 467292 ns 1.32
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) 621333 ns 621750 ns 1.00
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) 1558000 ns 1543666 ns 1.01
batchedmm(128, Bsize=32)/forward/GPU/CUDA 133759 ns 130036 ns 1.03
batchedmm(128, Bsize=32)/forward/GPU/AMDGPU 234603 ns 230222 ns 1.02
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) 2678083 ns 2684437.5 ns 1.00
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) 2002000 ns 1538583 ns 1.30
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) 2002187.5 ns 2002583 ns 1.00
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) 4935667 ns 4933354 ns 1.00
batchedmm(128, Bsize=32)/zygote/GPU/CUDA 239631.5 ns 243369 ns 0.98
batchedmm(128, Bsize=32)/zygote/GPU/AMDGPU 807670 ns 836303.5 ns 0.97
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 292 ns 250 ns 1.17
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 250 ns 375 ns 0.67
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 292 ns 334 ns 0.87
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 31737 ns 31581 ns 1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI 1190684.5 ns 1181114.5 ns 1.01
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal 318125 ns 425666.5 ns 0.75
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU 45701 ns 49050 ns 0.93
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6125 ns 6291 ns 0.97
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6270.5 ns 6708.5 ns 0.93
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 6584 ns 6667 ns 0.99
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6083 ns 6375 ns 0.95
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 223270 ns 222549 ns 1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI 20675696 ns 20723673 ns 1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal 5388000 ns 5408500 ns 1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 363375 ns 364253.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 2402167 ns 2412916 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 2417416 ns 2399708 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 2383750 ns 2391250 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 2375125 ns 2406375 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 199863.5 ns 201130.5 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 8222543 ns 8039466.5 ns 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1463104 ns 1500813 ns 0.97
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 372574 ns 371169 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 4640208 ns 4645417 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4646500 ns 4666145.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4660209 ns 4648375 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4644979.5 ns 4646334 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 893947 ns 899895.5 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 51330419 ns 47712828 ns 1.08
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 6764792 ns 6893375 ns 0.98
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1345616.5 ns 1384804 ns 0.97
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) 7542 ns 7083 ns 1.06
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) 7083 ns 7000 ns 1.01
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) 7375 ns 7750 ns 0.95
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) 6833 ns 6792 ns 1.01
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA 23243 ns 23107 ns 1.01
bias_activation(512, act=relu)(512 x 128)/forward/GPU/oneAPI 1128838 ns 1160499 ns 0.97
bias_activation(512, act=relu)(512 x 128)/forward/GPU/Metal 261584 ns 282458 ns 0.93
bias_activation(512, act=relu)(512 x 128)/forward/GPU/AMDGPU 36650 ns 40431 ns 0.91
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 47854 ns 48667 ns 0.98
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 51771 ns 57125 ns 0.91
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 48833 ns 51042 ns 0.96
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 32833 ns 33354.5 ns 0.98
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA 215568 ns 215404 ns 1.00
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/oneAPI 10412929 ns 10709204 ns 0.97
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/Metal 1911875 ns 2066833 ns 0.93
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/AMDGPU 230092 ns 264313 ns 0.87
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) 22729.5 ns 22854 ns 0.99
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) 24646 ns 24375.5 ns 1.01
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) 23458 ns 24917 ns 0.94
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) 5250 ns 5209 ns 1.01
batchedmm(2, Bsize=512)/forward/GPU/CUDA 16364.5 ns 16790 ns 0.97
batchedmm(2, Bsize=512)/forward/GPU/AMDGPU 83781 ns 89191 ns 0.94
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) 12166 ns 12250 ns 0.99
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) 10375 ns 9375 ns 1.11
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) 10917 ns 10604.5 ns 1.03
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) 17750 ns 18083 ns 0.98
batchedmm(2, Bsize=512)/zygote/GPU/CUDA 226319.5 ns 225960 ns 1.00
batchedmm(2, Bsize=512)/zygote/GPU/AMDGPU 369134 ns 387419 ns 0.95
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) 406042 ns 406584 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) 297417 ns 223292 ns 1.33
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) 296958 ns 297000 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) 762792 ns 762667 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA 46179 ns 45879 ns 1.01
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/oneAPI 1332014 ns 1417981 ns 0.94
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/Metal 486500 ns 424354.5 ns 1.15
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/AMDGPU 89011 ns 89741 ns 0.99
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 1485666.5 ns 1486000.5 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 1167500 ns 892208.5 ns 1.31
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 1166000 ns 1169500 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 2386416 ns 2471625 ns 0.97
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA 280427 ns 279157 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/oneAPI 13856466.5 ns 13109750 ns 1.06
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/Metal 2074958 ns 2047333 ns 1.01
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/AMDGPU 379774 ns 376633 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 433667 ns 433500 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 436959 ns 430292 ns 1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 436959 ns 436292 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 448375 ns 446958 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 53930.5 ns 54004 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 1009165 ns 1003277 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1060375 ns 1090562.5 ns 0.97
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 233243 ns 236733 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 3904625 ns 3866292 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4021125 ns 4019812.5 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4027208 ns 4022583.5 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 3804375 ns 3812208.5 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 260750 ns 261348.5 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 31826505 ns 32496173.5 ns 0.98
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 10958041.5 ns 10504750 ns 1.04
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1342696 ns 1365148 ns 0.98
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 8750 ns 8708 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 7625 ns 6958 ns 1.10
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 7666 ns 7667 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 12417 ns 12417 ns 1
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA 23537 ns 23411 ns 1.01
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/oneAPI 2186463 ns 2120051 ns 1.03
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/Metal 224396 ns 229334 ns 0.98
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/AMDGPU 211122 ns 208012 ns 1.01
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 44916 ns 45583 ns 0.99
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 44916 ns 45291 ns 0.99
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 45375 ns 45416 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 44709 ns 45042 ns 0.99
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA 343620.5 ns 345424.5 ns 0.99
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/oneAPI 13641380 ns 13588599 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/Metal 1814145.5 ns 1751750 ns 1.04
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/AMDGPU 656018 ns 653876 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 86792 ns 113812.5 ns 0.76
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 125000 ns 90020.5 ns 1.39
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 88208 ns 88625 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 86000 ns 81000 ns 1.06
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 190101 ns 190227.5 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 6192741 ns 6167893 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1953792 ns 2705500 ns 0.72
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 217797.5 ns 221462 ns 0.98
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2013000 ns 1871229 ns 1.08
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2017875 ns 2028479 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2014000 ns 2015645.5 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2026416 ns 2020395.5 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 532007 ns 534895 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 27385258 ns 28188330 ns 0.97
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 9714708 ns 9724208 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1093783 ns 1078565.5 ns 1.01

This comment was automatically generated by workflow using github-action-benchmark.

@avik-pal avik-pal merged commit 7ba127a into main Sep 15, 2024
65 of 72 checks passed
@avik-pal avik-pal deleted the ap/enzyme_cuda branch September 15, 2024 21:46

Successfully merging this pull request may close these issues.

Enzyme rules for cuBLASLt