Training Fast Weight Programmers by backpropagating through the Delta Rule #9

proger · 2024-08-20T05:17:41Z

This PR introduces a kernel, developed with @ischlag, for training a fast weight programmer by backpropagating through the delta rule (online linear regression).

Improving on first-order recurrences with a scalar hidden state, this kernel uses vector-valued updates, as in the transformer, which allows the use of matrix multiplication hardware. Thanks to the delta rule, it also avoids saturating the capacity of the fast weight network.
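To make the capacity point concrete, here is a minimal pure-Python sketch of one delta rule step next to a purely additive outer-product update. It is not the kernel from this PR; the function names (`delta_rule_step`, `additive_step`, `read`) and the toy inputs are assumptions made for illustration only.

```python
def read(S, q):
    """Query the fast weight matrix S (list of rows) with vector q: returns S @ q."""
    return [sum(s * x for s, x in zip(row, q)) for row in S]


def delta_rule_step(S, k, v, beta):
    """Delta rule update: S + beta * (v - S k) k^T.

    The current prediction S k is subtracted from the target v, so writing
    to an already used key corrects the stored value instead of piling up.
    """
    pred = read(S, k)
    return [
        [S[i][j] + beta * (v[i] - pred[i]) * k[j] for j in range(len(k))]
        for i in range(len(S))
    ]


def additive_step(S, k, v):
    """Purely additive update: S + v k^T (no error correction)."""
    return [
        [S[i][j] + v[i] * k[j] for j in range(len(k))]
        for i in range(len(S))
    ]


# Toy example (hypothetical data): write two different values under the same key.
k = [1.0, 0.0]
v_old, v_new = [1.0, 2.0], [3.0, 4.0]
zeros = [[0.0, 0.0], [0.0, 0.0]]

S_delta = delta_rule_step(delta_rule_step(zeros, k, v_old, 1.0), k, v_new, 1.0)
S_add = additive_step(additive_step(zeros, k, v_old), k, v_new)

delta_readout = read(S_delta, k)  # the new value replaces the old one
additive_readout = read(S_add, k)  # the two values are summed
```

With a step size of 1 and a unit-norm key, the delta rule readout is exactly the most recent value, while the additive update returns the sum of everything ever written under that key.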

The implementation uses the chunking given by @sustcsonglin's equation 6 and currently performs best at a head dimension of 32, which fits perfectly into the registers of a warp on a 3090. The tensor core code uses ThunderKittens, which makes WMMA effortless.
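As a rough illustration of why chunking helps, the sketch below checks one standard reformulation of the delta rule: starting from a zero state, the state after a chunk equals a sum of outer products of "pseudo-values" u_t with the keys k_t, where u_t = beta_t * (v_t - sum over i < t of u_i * (k_i . k_t)). This does not reproduce the referenced equation 6 or the code in this PR; all function names and inputs here are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sequential_state(keys, values, betas):
    """Reference: apply S <- S + beta * (v - S k) k^T one step at a time."""
    d_v, d_k = len(values[0]), len(keys[0])
    S = [[0.0] * d_k for _ in range(d_v)]
    for k, v, beta in zip(keys, values, betas):
        pred = [dot(row, k) for row in S]
        S = [
            [S[i][j] + beta * (v[i] - pred[i]) * k[j] for j in range(d_k)]
            for i in range(d_v)
        ]
    return S


def pseudo_values(keys, values, betas):
    """u_t = beta_t * (v_t - sum_{i<t} u_i * (k_i . k_t)), computed in order."""
    us = []
    for t, (k, v, beta) in enumerate(zip(keys, values, betas)):
        corr = [0.0] * len(v)
        for u_i, k_i in zip(us, keys[:t]):
            w = dot(k_i, k)
            corr = [c + w * x for c, x in zip(corr, u_i)]
        us.append([beta * (x - c) for x, c in zip(v, corr)])
    return us


def state_from_pseudo_values(us, keys):
    """S = sum_t u_t k_t^T, which is a single matrix product U^T K."""
    d_v, d_k = len(us[0]), len(keys[0])
    S = [[0.0] * d_k for _ in range(d_v)]
    for u, k in zip(us, keys):
        for i in range(d_v):
            for j in range(d_k):
                S[i][j] += u[i] * k[j]
    return S


# Hypothetical chunk of three steps with two-dimensional keys and values.
keys = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
betas = [0.5, 1.0, 0.25]

S_seq = sequential_state(keys, values, betas)
S_chunk = state_from_pseudo_values(pseudo_values(keys, values, betas), keys)
max_diff = max(
    abs(a - b) for row_a, row_b in zip(S_seq, S_chunk) for a, b in zip(row_a, row_b)
)
```

Once the pseudo-values are known, the state update and the outputs reduce to dense matrix products over the chunk, which is presumably the part that maps well onto tensor cores.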

ptxas info : 19 bytes gmem
ptxas info : Compiling entry function '_Z20causal_attend_kernelIN3c108BFloat16ELi4ELi2ELi4EEviPKT_S4_S4_PKfPS2_' for 'sm_86'
ptxas info : Function properties for _Z20causal_attend_kernelIN3c108BFloat16ELi4ELi2ELi4EEviPKT_S4_S4_PKfPS2_
    72 bytes stack frame, 64 bytes spill stores, 116 bytes spill loads
ptxas info : Used 255 registers, 400 bytes cmem[0]

proger added 30 commits June 16, 2024 17:12

trying flash attention

adedad8

kitten: remove frontend scaffolding

5a4941c

load_inline: do not set -arch

e0d967c

pyproject: depend on ninja and matplotlib out of the box

06f1715

reference implementation of deltanet

f0e5418

decay_values_backward_kernel with only forward

805a8e9

minor cleanup

7abcb48

tile_layout is a notebook that shows how tiles are mapped to warp registers

3693fa0

decay_values_backward: compute d_{w_t} / d_{b_s}

4e38a84

deltacu: sketch du_t/db_s

b699acb

shrink WK and UK expressions a bit

688342c

i hear backpropagation is a nice algorithm

498706b

backpropagation through time is great

26b7320

deltanet: massage the code

7179aea

pull d_v and d_beta out of the loop

33115c4

deltacu cleanup

ae10e4a

decay_values_backward 16x16 kernel works

d68916e

delta.cu: bump dimension to 64

f3ca475

delta.cu: type and dimension dispatch

be78214

decay_values_forward

80d2626

fuse linear attention into decay_values

1df2cb6

almost fused

8781794

inline attention into decay_values

22894f4

when stitching forward don't need to recompute y

e950632

stitch backward: when uncomputing state you don't need to store all activations

d7ad825

stitch_backward: shave off a bit of computation on the boundary

1a48f7f

whitespace

bd9f2c4

massage to TK assembly

410b3f1

chunk size vis

25a36f7

proger added 27 commits July 20, 2024 23:39

add fla to benchmark

e35e284

loop_impl can avoid computing output

8d7af9e

make kitten bigger

160e701

update bench

770e8d7

focus on backward

9a74775

ref: implement decay_values without tensor cores

1f4fce0

add a test for gated_rnn

6f78a85

prepare to speed up backward

af39ca0

Refactor chunk_backward function to use shared memory for state and d_state

676aa85

backward: fuse forward and backward into one loop, works with only short sequences for now

d9f5fa7

move decay_values_backward out

951dad8

d_v is ok but the rest is not

66862e5

backward go brrr

abf0be6

api

ca9307b

no prints

4b4d2b0

start benchmarking backward

e1226b7

bench backward

c688170

backward uses less global memory

1c94d10

properly name benchmarks

9809086

backward uses less global memory and initializes well

85b8f2a

prepare values for dimension groups

a8827ea

store d_q through registers

5b9be61

try cudaLaunchCooperativeKernel for backward

602bcca

bench tweaks

d37d37c

more forgiving atol for backward

f5aca36

implement with locking

3152924

no need for reloading when no value groups

249b600

proger changed the title ~~Delta Rule Fast Weight Programmers~~ Training Fast Weight Programmers through Delta RUle Aug 20, 2024

proger changed the title ~~Training Fast Weight Programmers through Delta RUle~~ Training Fast Weight Programmers through Delta Rule Aug 20, 2024

proger changed the title ~~Training Fast Weight Programmers through Delta Rule~~ Training Fast Weight Programmers by backpropagating through the Delta Rule Aug 20, 2024
