Add Max Token Limit for Generation #1078

Open: N8python wants to merge 2 commits into main

N8python
Contributor

Allows throttling generation to a maximum number of tokens per second, so the user can control roughly what share of their GPU power goes into LLM generation. Avoids thermal throttling.

@awni
Member

awni commented Oct 31, 2024

> Avoids thermal throttling.

Can you elaborate on that? What behavior is different if you set the max toks per sec?

@N8python
Contributor Author

The model cannot decode faster than the maximum tokens per second.
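
For readers following along, here is a minimal sketch of the kind of cap being described: a wrapper that sleeps between tokens so they are never emitted faster than a chosen rate. It is not the PR's actual diff. The names `throttle`, `max_tokens_per_sec`, `clock` and `sleep` are hypothetical, and the token source is assumed to be any Python iterable.

```python
import time
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def throttle(
    tokens: Iterable[T],
    max_tokens_per_sec: float,
    clock: Callable[[], float] = time.perf_counter,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[T]:
    """Yield items from `tokens` no faster than `max_tokens_per_sec`.

    Hypothetical helper for illustration; `clock` and `sleep` are
    injectable so the pacing can be checked without real waiting.
    """
    if max_tokens_per_sec <= 0:
        raise ValueError("max_tokens_per_sec must be positive")
    min_interval = 1.0 / max_tokens_per_sec
    next_allowed = clock()
    for token in tokens:
        # If the token was produced early, idle until its slot arrives.
        delay = next_allowed - clock()
        if delay > 0:
            sleep(delay)
        yield token
        # Schedule from whichever is later, so a slow step does not
        # earn a burst of faster-than-cap tokens afterwards.
        next_allowed = max(next_allowed, clock()) + min_interval
```

Wrapping a generator as `throttle(token_stream, 20.0)` would cap it at about 20 tokens per second. The GPU sits idle during each sleep, which is why a cap like this lowers average power draw.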

@awni
Member

awni commented Oct 31, 2024

I meant when you say "avoids thermal throttling" what are you referring to and how do you detect that it is being "avoided"?

@N8python
Contributor Author

When too much power is drawn, laptops with M-series chips drop to very low performance. Users can manually lower the model's throughput to prevent this.

@awni
Member

awni commented Oct 31, 2024

When you say "drop to very low performance", what does that look like? I'm just trying to understand what's happening here, because there may be a deeper issue and manually sleeping in the generation loop could be suboptimal.

@N8python
Contributor Author

Has this never happened to you? Set up a long generation, it draws ~30W, and then the computer overheats and drops to like ~1W of power draw for 2 minutes to cool down. Throttling helps.

@awni
Member

awni commented Oct 31, 2024

> Set up a long generation, it draws ~30W, and then the computer overheats and drops to like ~1W of power draw for 2 minutes to cool down.

🤔 No, it hasn't. I'd like to reproduce it. Roughly how long a generation, and with what size model, do you experience that?

@awni
Member

awni commented Oct 31, 2024

> the computer overheats and drops to like ~1W of power draw

Does that happen during the generation? Then it slows down?

@N8python
Contributor Author

Yes! It does - have you not experienced it?? I can provide a video!

(MLX generation for more than ~30 seconds at full throttle results in my 14-inch M3 Max throttling itself so aggressively the screen stutters)

@N8python
Contributor Author

This works for ANY model btw - as long as the computer is running full throttle!

@N8python
Contributor Author

N8python commented Nov 1, 2024

Thoughts?

@awni
Member

awni commented Nov 1, 2024

I just ran this (with no stop condition):

mlx_lm.generate --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit -m 30000 --prompt "What is generative AI"

It generated 30k tokens in about 11 mins. The fan was going full speed and the power draw was consistently 30-35 watts on an M3 Max.

Here are the stats:

Prompt: 15 tokens, 50.681 tokens-per-sec
Generation: 30000 tokens, 46.960 tokens-per-sec
Peak memory: 8.126 GB

@awni
Member

awni commented Nov 1, 2024

Now I'm wondering what we are doing differently?

@N8python
Contributor Author

N8python commented Nov 1, 2024

Is it a 14-inch or 16-inch M3 Max?

@awni
Member

awni commented Nov 1, 2024

16 inch
64 GB
OS 15.0.1

MLX on main
MLX LM on main

@awni
Member

awni commented Nov 1, 2024

I'm wondering how much RAM you have? Maybe it's swapping and that's what accounts for the cliff?

@N8python
Contributor Author

N8python commented Nov 1, 2024

64GB, 14-inch M3 Max, MLX LM (pretty much the latest version). It's thermal throttling that occurs in smaller Macs!

@N8python
Contributor Author

N8python commented Nov 1, 2024

I ran your exact command on my 14-inch. It has now dropped to ~1.8W and is stuttering horribly as it desperately tries to cool down.

@awni
Member

awni commented Nov 1, 2024

Huh, so what happens if you try to train on it? Does it hit the same perf cliff?

@N8python
Contributor Author

N8python commented Nov 1, 2024

Oh training does the exact same thing - LORAing always makes the computer throttle brutally.

@awni
Member

awni commented Nov 1, 2024

> Oh training does the exact same thing - LORAing always makes the computer throttle brutally.

Could you share some rough numbers on toks/sec pre and post throttling?

@N8python
Contributor Author

N8python commented Nov 1, 2024

I'm still waiting for the benchmark to complete. It runs at the same tok/sec you report when not throttled. But I'll report the avg tok/sec on the 30000-token generation task.

@awni
Member

awni commented Nov 1, 2024

Thanks! Also curious for LoRA fine-tuning if you have anything readily available. No worries if not.

@N8python
Contributor Author

N8python commented Nov 1, 2024

Here's the benchmark:

Prompt: 15 tokens, 138.944 tokens-per-sec
Generation: 30000 tokens, 16.352 tokens-per-sec
Peak memory: 8.346 GB

(The slowdown on LoRA is similar: throughput falls to roughly a third of the maximum.)
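
As a quick sanity check on the "roughly a third" figure, the two 30000-token runs reported in this thread can be compared directly. The sketch below assumes the 46.960 tok/sec run is a fair stand-in for this machine's unthrottled speed, which is what the earlier comment suggests. The variable and function names are just for illustration.

```python
TOKENS = 30_000
UNTHROTTLED_TPS = 46.960  # generation speed from the 16-inch M3 Max run
THROTTLED_TPS = 16.352    # average generation speed from the 14-inch run


def wall_minutes(tokens: int, tokens_per_sec: float) -> float:
    """Minutes needed to generate `tokens` at a steady `tokens_per_sec`."""
    return tokens / tokens_per_sec / 60.0


ratio = THROTTLED_TPS / UNTHROTTLED_TPS
fast_minutes = wall_minutes(TOKENS, UNTHROTTLED_TPS)
slow_minutes = wall_minutes(TOKENS, THROTTLED_TPS)

print(f"throttled / unthrottled throughput: {ratio:.2f}")
print(f"unthrottled run: {fast_minutes:.1f} min")
print(f"throttled run:   {slow_minutes:.1f} min")
```

The ratio comes out at about 0.35, and the same 30000 tokens take about 30.6 minutes instead of about 10.6, which is consistent with the "about 11 mins" reported for the unthrottled run.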

@N8python
Contributor Author

N8python commented Nov 2, 2024

So yeah - do you think this would be a welcome change?

@awni
Member

awni commented Nov 2, 2024

Let’s keep the PR open for now. I’m not done investigating this yet. We may or may not merge it, depending on what we find, but I appreciate you helping us figure out the underlying issue.

@N8python
Contributor Author

N8python commented Nov 2, 2024

Makes sense! Thanks for your openness in this investigation :D

@N8python
Contributor Author

N8python commented Nov 2, 2024

Example of what happens during LoRA (or in this case full) fine-tuning of SmolLM2 135M:

Iter 560: Train loss 2.217, Learning Rate 3.000e-05, It/sec 2.906, Tokens/sec 1507.738, Trained Tokens 391391, Peak mem 9.893 GB
Iter 570: Train loss 2.091, Learning Rate 3.000e-05, It/sec 1.618, Tokens/sec 1547.780, Trained Tokens 400957, Peak mem 9.893 GB
Iter 580: Train loss 1.820, Learning Rate 3.000e-05, It/sec 1.877, Tokens/sec 1077.392, Trained Tokens 406696, Peak mem 9.893 GB
Iter 590: Train loss 1.919, Learning Rate 3.000e-05, It/sec 1.490, Tokens/sec 1374.364, Trained Tokens 415923, Peak mem 9.893 GB
Iter 600: Train loss 2.071, Learning Rate 3.000e-05, It/sec 2.043, Tokens/sec 1326.902, Trained Tokens 422418, Peak mem 9.893 GB
Iter 600: Saved adapter weights to adapters/adapters.safetensors and adapters/0000600_adapters.safetensors.
Iter 610: Train loss 2.017, Learning Rate 3.000e-05, It/sec 1.077, Tokens/sec 791.562, Trained Tokens 429771, Peak mem 9.893 GB
Iter 620: Train loss 2.397, Learning Rate 3.000e-05, It/sec 1.575, Tokens/sec 985.628, Trained Tokens 436027, Peak mem 9.893 GB
Iter 630: Train loss 2.019, Learning Rate 3.000e-05, It/sec 1.401, Tokens/sec 817.147, Trained Tokens 441861, Peak mem 9.893 GB
Iter 640: Train loss 1.931, Learning Rate 3.000e-05, It/sec 1.037, Tokens/sec 708.138, Trained Tokens 448692, Peak mem 9.893 GB
Iter 650: Train loss 2.295, Learning Rate 3.000e-05, It/sec 0.938, Tokens/sec 483.847, Trained Tokens 453853, Peak mem 9.893 GB
Iter 660: Train loss 1.740, Learning Rate 3.000e-05, It/sec 0.540, Tokens/sec 368.923, Trained Tokens 460687, Peak mem 9.893 GB
Iter 670: Train loss 1.884, Learning Rate 3.000e-05, It/sec 0.218, Tokens/sec 176.933, Trained Tokens 468785, Peak mem 9.893 GB
Iter 680: Train loss 2.026, Learning Rate 3.000e-05, It/sec 0.264, Tokens/sec 206.046, Trained Tokens 476577, Peak mem 9.893 GB
Iter 690: Train loss 2.112, Learning Rate 3.000e-05, It/sec 0.230, Tokens/sec 211.577, Trained Tokens 485780, Peak mem 9.893 GB
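
To pin down where the slowdown starts in a log like the one above, a small script can pull out the Tokens/sec column and flag the first iteration that falls well below the best rate seen so far. This is only a sketch: `parse_throughput`, `first_slow_iter` and the 50% threshold are hypothetical choices, and the regex assumes the log line format shown above.

```python
import re
from typing import List, Optional, Tuple

# A few lines copied from the log above.
SAMPLE = """\
Iter 560: Train loss 2.217, Learning Rate 3.000e-05, It/sec 2.906, Tokens/sec 1507.738, Trained Tokens 391391, Peak mem 9.893 GB
Iter 570: Train loss 2.091, Learning Rate 3.000e-05, It/sec 1.618, Tokens/sec 1547.780, Trained Tokens 400957, Peak mem 9.893 GB
Iter 600: Saved adapter weights to adapters/adapters.safetensors and adapters/0000600_adapters.safetensors.
Iter 610: Train loss 2.017, Learning Rate 3.000e-05, It/sec 1.077, Tokens/sec 791.562, Trained Tokens 429771, Peak mem 9.893 GB
Iter 640: Train loss 1.931, Learning Rate 3.000e-05, It/sec 1.037, Tokens/sec 708.138, Trained Tokens 448692, Peak mem 9.893 GB
Iter 670: Train loss 1.884, Learning Rate 3.000e-05, It/sec 0.218, Tokens/sec 176.933, Trained Tokens 468785, Peak mem 9.893 GB
"""

# Matches training lines only; "Saved adapter weights" lines are skipped.
LINE_RE = re.compile(r"Iter (\d+): Train loss .*?Tokens/sec ([\d.]+)")


def parse_throughput(log: str) -> List[Tuple[int, float]]:
    """Return (iteration, tokens_per_sec) pairs for each training line."""
    return [(int(m.group(1)), float(m.group(2))) for m in LINE_RE.finditer(log)]


def first_slow_iter(
    samples: List[Tuple[int, float]], fraction: float = 0.5
) -> Optional[int]:
    """First iteration whose rate is below `fraction` of the peak so far."""
    peak = 0.0
    for iteration, tps in samples:
        peak = max(peak, tps)
        if tps < fraction * peak:
            return iteration
    return None


samples = parse_throughput(SAMPLE)
print(first_slow_iter(samples))
```

Comparing against the running peak rather than a fixed number keeps the check independent of model size. The threshold is arbitrary, and noisy logs may need smoothing before this is useful.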

@ivanfioravanti
Contributor

I can confirm that the 16" is not affected with the fan at max speed, while the 14" is badly impacted. This holds for both M3 Max and M4 Max models. Slowing down generation can help reduce the temperature.

@ivanfioravanti
Contributor

Correction: the 16" MBP is impacted too. Being able to slow down MLX would help avoid throttling and keep the Mac less noisy.

@fredrik-smedberg

fredrik-smedberg commented Dec 28, 2024

I can confirm that something odd, possibly throttling, is happening on my system as well.
Today I ran multiple fine-tuning tests of an 8-bit MLX quant I created of the Llama 3.1 8B model.

My system
14" M3 Max, 1 TB, 64 GB RAM
macOS 15.2 (24C101)
Python 3.12.1

Log
The laptop is either throttling or something else is going on. I think the low token throughput between iter 1 and iter 10 is because I started the test below right after another one finished, without letting the laptop cool down first. I've attached a second log below (Log 2) that shows the output of running config.yaml with iterations set to 6, starting with a cold laptop.

I've attached my config.yaml (as config.txt)

❯ time mlx_lm.lora --config config.yaml
Loading configuration file config.yaml
Loading pretrained model
Loading datasets
Training
Trainable parameters: 0.042% (3.408M/8030.261M)
Starting training..., iters: 100
Iter 1: Val loss 2.151, Val took 30.546s
Iter 10: Train loss 1.958, Learning Rate 1.000e-05, It/sec 0.081, Tokens/sec 159.269, Trained Tokens 19678, Peak mem 20.819 GB
Iter 20: Val loss 1.521, Val took 32.879s
Iter 20: Train loss 1.480, Learning Rate 1.000e-05, It/sec 0.789, Tokens/sec 1995.916, Trained Tokens 44973, Peak mem 25.704 GB
Iter 20: Saved adapter weights to adapters/adapters.safetensors and adapters/0000020_adapters.safetensors.
Iter 30: Train loss 1.405, Learning Rate 1.000e-05, It/sec 0.072, Tokens/sec 135.074, Trained Tokens 63820, Peak mem 25.704 GB
Iter 40: Val loss 1.314, Val took 27.889s
Iter 40: Train loss 1.224, Learning Rate 1.000e-05, It/sec 1.040, Tokens/sec 2538.290, Trained Tokens 88223, Peak mem 28.847 GB
Iter 40: Saved adapter weights to adapters/adapters.safetensors and adapters/0000040_adapters.safetensors.
Iter 50: Train loss 1.189, Learning Rate 1.000e-05, It/sec 0.071, Tokens/sec 155.063, Trained Tokens 110106, Peak mem 28.847 GB
Iter 60: Val loss 1.289, Val took 29.403s
Iter 60: Train loss 1.210, Learning Rate 1.000e-05, It/sec 0.397, Tokens/sec 625.982, Trained Tokens 125870, Peak mem 28.847 GB
Iter 60: Saved adapter weights to adapters/adapters.safetensors and adapters/0000060_adapters.safetensors.
Iter 70: Train loss 1.161, Learning Rate 1.000e-05, It/sec 0.091, Tokens/sec 153.289, Trained Tokens 142797, Peak mem 28.847 GB
Iter 80: Val loss 1.273, Val took 32.493s
Iter 80: Train loss 1.185, Learning Rate 1.000e-05, It/sec 0.332, Tokens/sec 809.305, Trained Tokens 167193, Peak mem 28.847 GB
Iter 80: Saved adapter weights to adapters/adapters.safetensors and adapters/0000080_adapters.safetensors.
Iter 90: Train loss 1.280, Learning Rate 1.000e-05, It/sec 0.052, Tokens/sec 136.213, Trained Tokens 193565, Peak mem 28.847 GB
Iter 100: Val loss 1.265, Val took 32.304s
Iter 100: Train loss 1.212, Learning Rate 1.000e-05, It/sec 0.453, Tokens/sec 1140.015, Trained Tokens 218748, Peak mem 28.847 GB
Iter 100: Saved adapter weights to adapters/adapters.safetensors and adapters/0000100_adapters.safetensors.
Saved final weights to adapters/adapters.safetensors.
mlx_lm.lora --config config.yaml  13.25s user 176.72s system 11% cpu 27:18.28 total

Log 2

Iter 1: Val loss 2.151, Val took 23.063s
Iter 6: Val loss 1.899, Val took 28.941s
Iter 6: Train loss 1.977, Learning Rate 1.000e-05, It/sec 2.087, Tokens/sec 2323.759, Trained Tokens 11134, Peak mem 20.819 GB
Saved final weights to adapters/adapters.safetensors.
mlx_lm.lora --config config.yaml  2.71s user 19.75s system 17% cpu 2:08.49 total
