
Commit

Update README.md
RahulSChand authored Oct 29, 2023
1 parent 8625954 commit 85d1d18
50 changes: 33 additions & 17 deletions README.md
# Can my GPU run this LLM? & at what token/s?

![Made with](https://img.shields.io/badge/logo-javascript-blue?logo=javascript)

Link: **https://rahulschand.github.io/gpu_poor/**



## Use case/Features

#### 1. Calculate vRAM memory requirement 💾

<img width="643" alt="image" src="https://github.com/RahulSChand/gpu_poor/assets/16897807/29577394-0efd-42fb-aaf4-282e9a45d5db">

#### 2. Calculate ~token/s you can get ⏱️

<img width="647" alt="image" src="https://github.com/RahulSChand/gpu_poor/assets/16897807/77627c9b-5fdd-44cf-8b7d-452ff0563a8a">

#### 3. Approximate time for finetuning (ms per iteration) ⌛️

<img width="764" alt="image" src="https://github.com/RahulSChand/gpu_poor/assets/16897807/e5fd08a1-abb9-4e00-ad45-ba9bb15ec546">


For memory, the output is the total vRAM & its breakdown (in MB). It looks like the example below:

```
{
    ...
}
```
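
For intuition, here is a rough back-of-the-envelope sketch (in JavaScript, since the calculator itself is JavaScript) of how such a breakdown can be estimated. This is only an illustration under common assumptions (fp16 KV cache, a fixed ~500MB CUDA overhead); it is not the calculator's actual code.

```javascript
// Rough inference vRAM estimate (illustrative only, not the site's exact formulas)
function estimateInferenceMemoryMB({ paramsB, bytesPerParam, numLayers, hiddenSize, seqLen, batchSize }) {
  const weightsMB = (paramsB * 1e9 * bytesPerParam) / 2 ** 20;            // model weights
  // KV cache: 2 (K and V) * layers * hiddenSize * seqLen * batch, fp16 => 2 bytes per value
  const kvCacheMB = (2 * numLayers * hiddenSize * seqLen * batchSize * 2) / 2 ** 20;
  const overheadMB = 500;                                                 // CUDA context etc. (assumed constant)
  const totalMB = weightsMB + kvCacheMB + overheadMB;
  return {
    "Total": Math.round(totalMB),
    "Model weights": Math.round(weightsMB),
    "KV cache": Math.round(kvCacheMB),
    "Overhead": overheadMB,
  };
}

// Example: llama-2-7b in fp16, context length 1000, batch size 1
console.log(estimateInferenceMemoryMB({
  paramsB: 7, bytesPerParam: 2, numLayers: 32, hiddenSize: 4096, seqLen: 1000, batchSize: 1,
}));
```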


For token/s, the output is the estimated token/s & some additional info. It looks like the example below:

```
{
    "Token per second": 50,
    "ms per token": 20,
    "Prompt process time (s)": 5,
    "memory or compute bound?": "Memory"
}
```
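
As a sanity check on the token/s number, a common first-order estimate is that single-batch decoding is memory-bandwidth bound: each generated token has to stream the model weights (plus KV cache) from GPU memory. Below is a minimal sketch using illustrative numbers (a ~3.5GB int4 model and ~1TB/s of bandwidth are assumptions); real throughput will be lower than this ceiling.

```javascript
// Memory-bound token/s ceiling: bandwidth divided by bytes read per generated token
function estimateTokensPerSecond({ modelSizeGB, kvCacheGB, bandwidthGBps }) {
  const gbReadPerToken = modelSizeGB + kvCacheGB;      // weights + KV cache streamed per token
  const tokensPerSecond = bandwidthGBps / gbReadPerToken;
  return {
    "Token per second": Math.round(tokensPerSecond),
    "ms per token": Math.round(1000 / tokensPerSecond),
    "memory or compute bound?": "Memory",
  };
}

// Example: llama-2-7b at int4 (~3.5 GB) + 0.5 GB KV cache on a GPU with ~1000 GB/s bandwidth
console.log(estimateTokensPerSecond({ modelSizeGB: 3.5, kvCacheGB: 0.5, bandwidthGBps: 1000 }));
```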

---


### Purpose

I made this to check whether you can run a particular LLM on your GPU. It is useful for figuring out the following:

1. What quantization will fit on my GPU?
2. What max context length & batch size can my GPU handle?
3. Which kind of finetuning? Full? LoRA? QLoRA?
4. What is consuming my GPU memory? What should I change to fit the LLM on my GPU?



### Can't we just look at the model size & figure this out?

Finding which LLMs your GPU can handle isn't as easy as looking at the model size, because during inference the KV cache takes a substantial amount of memory. For example, with a sequence length of 1000 on llama-2-7b it takes 1GB of extra memory (using Hugging Face's LlamaForCausalLM; with exLlama & vLLM this is ~500MB). During training, the KV cache, activations & quantization overhead all take a lot of memory. For example, llama-7b with bnb int8 quant is ~7.5GB in size, but it isn't possible to finetune it using LoRA on data with 1000 context length even on an RTX 4090 with 24GB, which means an additional 16GB+ of memory goes into quantization overhead, activations & gradient memory.
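
To make the KV cache point concrete, here is the standard KV cache formula as a small JavaScript sketch (an illustration under common assumptions, not this tool's exact code); for llama-2-7b at context length 1000 it reproduces the ~500MB figure quoted above.

```javascript
// KV cache = 2 (K and V) * numLayers * hiddenSize * seqLen * batchSize * bytesPerValue
function kvCacheMB(numLayers, hiddenSize, seqLen, batchSize, bytesPerValue) {
  return (2 * numLayers * hiddenSize * seqLen * batchSize * bytesPerValue) / 2 ** 20;
}

// llama-2-7b: 32 layers, hidden size 4096, fp16 (2 bytes), context length 1000, batch size 1
console.log(kvCacheMB(32, 4096, 1000, 1, 2).toFixed(0) + " MB"); // ≈ 500 MB
```
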
Sometimes the answers might be very wrong, in which case please open an issue here.
2. Updated config list with new Huggingface trending models (Llava/Mistral/Trismegistus etc.)

3. Fixed the bitsandbytes quantization overhead calculation (previously it scaled linearly with context length; it is now more accurate)

4. **Added token/s**
---

### TODO
1. Add support for exLlama
2. ~Add QLora~
3. ~Add way to measure approximate tokens/s you can get for a particular GPU~
4. ~Improve logic to get hyper-params from size~ (since hidden layer/intermediate size/number of layers can vary for a particular size) ✅
5. Add AWQ
