I think this is a very rough estimate; the actual value should depend on the batch size, the sequence length, and the embedding size (i.e. the hidden dimension).

For example, take a 13B model with 40 layers, a sequence length of 4096, and an embedding size of 8192. With a batch size of 1 it needs 1 (batch size) * 8192 (embedding size) * 2 (bytes, FP16) * 4096 (sequence length) * 40 (layers) ≈ 2560 MB, i.e. about 0.625 MB per token.
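For reference, here is a minimal sketch of that back-of-the-envelope calculation in Python. The dimensions plugged in (40 layers, 8192 embedding size, 4096 sequence length, FP16) are the figures quoted above, used purely for illustration, and the helper name is my own:

```python
# Rough cache estimate: batch * seq_len * hidden_size * layers * bytes_per_value.
def rough_cache_bytes(batch_size, seq_len, n_layers, hidden_size, bytes_per_value=2):
    return batch_size * seq_len * n_layers * hidden_size * bytes_per_value

total = rough_cache_bytes(batch_size=1, seq_len=4096, n_layers=40, hidden_size=8192)
print(f"total:     {total / 2**20:.0f} MiB")           # ~2560 MiB
print(f"per token: {total / 4096 / 2**20:.3f} MiB")    # ~0.625 MiB
```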
I think the number here refers to the memory for the KV cache rather than the memory for activations.
The total number of values that need to be stored in the KV cache is $N = 2 \cdot B \cdot L \cdot l \cdot n_{head} \cdot dim_{head}$, where the leading 2 accounts for keys and values, $B$ is the batch size, $L$ is the sequence length, $l$ is the number of layers, $n_{head}$ is the number of attention heads, and $dim_{head}$ is the head dimension.

E.g. with llama-13B, which has 40 layers, a hidden size of 5120, and 40 attention heads (hence a head dimension of 128), loaded in FP16 and run with a batch size of 1 for a single token ($B = 1$, $L = 1$), the consumed memory is $M = (2 \cdot 1 \cdot 1 \cdot 40 \cdot 40 \cdot 128) \cdot 2\ \text{bytes} = 0.8\,\text{MB} \approx 1\,\text{MB}$.
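As a quick sanity check, a small sketch of that formula with the llama-13B shapes plugged in (the function name and argument layout are mine, not from any particular library):

```python
# KV cache size: N = 2 * B * L * l * n_head * dim_head values,
# times bytes per value (2 for FP16). The leading 2 is for keys and values.
def kv_cache_bytes(batch_size, seq_len, n_layers, n_heads, head_dim, bytes_per_value=2):
    n_values = 2 * batch_size * seq_len * n_layers * n_heads * head_dim
    return n_values * bytes_per_value

# llama-13B: 40 layers, 40 heads, head_dim = 5120 / 40 = 128, FP16, B = 1, L = 1
per_token = kv_cache_bytes(batch_size=1, seq_len=1, n_layers=40, n_heads=40, head_dim=128)
print(f"{per_token / 1e6:.2f} MB per token")  # ~0.82 MB, i.e. roughly 1 MB
```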
I am finding the calculation of ~1 MB of GPU RAM usage per token during inference a bit hard to understand, and it is also not what I am seeing in practice.
Any insights into how this number was computed?