Accelerating quantization computation by weight compression (octoml#45)
This PR enables weight compression on the GPU. Previously, weight compression ran on the CPU because the uncompressed weights are too large to fit in GPU memory, and the CPU path is quite slow in the fp16 case. We now switch to the GPU. To fit the uncompressed weights into GPU memory we use lazy loading: each weight is loaded right before its first use and freed immediately after its last use (see the sketch below).

In our tests, this PR reduces the quantization computation time for Vicuna-v1-7b under the `q3f16_0` quantization setting by **6 min** on a Linux machine with an **RTX 4090, Ryzen 3970X and 64GB of RAM**, and by 40 sec on a Mac Studio with 32 GB of memory. At this moment, building the vicuna-7b-v1 model on a Linux machine needs only a bit more than 28GB of memory in total (compared with over 50GB previously). We are continuing to work on reducing the memory requirement.
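For illustration, here is a minimal sketch of the lazy-loading pattern described above, written with PyTorch purely as an example; it is not the code from this PR, and the names `lazily_quantize`, `quantize_fn`, and `weight_files` are hypothetical.

```python
import torch

def lazily_quantize(weight_files, quantize_fn, device="cuda"):
    """Quantize weights one at a time on the GPU, keeping at most one
    uncompressed weight resident in GPU memory at any moment."""
    quantized = []
    for path in weight_files:
        # Load the uncompressed weight right before its first use.
        w = torch.load(path, map_location="cpu")
        w = w.to(device)              # move to GPU only for the compression step
        q = quantize_fn(w)            # run the quantization kernel on the GPU
        quantized.append(q.cpu())     # keep only the small compressed result
        # Free the large uncompressed tensor immediately after its last use,
        # so the next weight can reuse the same GPU memory.
        del w
        torch.cuda.empty_cache()
    return quantized
```

The key point is that GPU memory usage is bounded by the size of a single uncompressed weight plus the compressed outputs, rather than the full uncompressed model.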