Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Merge branch 'main' into pickle_scaling_tensor
Showing 44 changed files with 28,524 additions and 233 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,41 @@ | ||
name: GitHub Pages | ||
|
||
on: | ||
push: | ||
branches: | ||
- main | ||
pull_request: | ||
branches: | ||
- main | ||
- release/* | ||
|
||
jobs: | ||
docs-build: | ||
runs-on: ubuntu-latest | ||
steps: | ||
- name: Checkout | ||
uses: actions/checkout@v2 | ||
- name: Setup nodejs | ||
uses: actions/setup-node@v2 | ||
with: | ||
node-version: '14' | ||
- name: Test docs build | ||
run: | | ||
cd website | ||
npm ci | ||
npm run build | ||
- name: Prepare ssh key | ||
uses: webfactory/[email protected] | ||
if: ${{ github.event_name == 'push' }} | ||
with: | ||
ssh-private-key: ${{ secrets.GH_PAGES_KEY }} | ||
- name: Publish to GitHub Pages | ||
if: ${{ github.event_name == 'push' }} | ||
env: | ||
GIT_USER: ${{ secrets.GH_PAGES_USERNAME }} | ||
USE_SSH: true | ||
run: | | ||
git config --global user.email "[email protected]" | ||
git config --global user.name "GitHub Actions" | ||
cd website | ||
npm run deploy |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1 +1 @@ | ||
rules "~MD013", "~MD033", "~MD046" | ||
rules "~MD013", "~MD033", "~MD046", "~MD034" |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,191 +1,10 @@ | ||
# MS-AMP: Microsoft Automatic Mixed Precision | ||
|
||
MS-AMP is an automatic mixed precision package for deep learning developed by Microsoft. | ||
__MS-AMP__ is an automatic mixed precision package for deep learning developed by Microsoft. | ||
|
||
Features: | ||
📢 [v0.2.0](https://github.com/Azure/MS-AMP/releases/tag/v0.2.0) has been released! | ||
|
||
- Support O1 optimization: Apply FP8 to weights and weight gradients and support FP8 in communication. | ||
- Support O2 optimization: Support FP8 for two optimizers (Adam and AdamW). | ||
- Support O3 optimization: Support FP8 in DeepSpeed ZeRO optimizer. | ||
- Provide four training examples using FP8: Swin-Transformer, DeiT, RoBERTa and GPT-3. | ||
|
||
MS-AMP has the following benefits compared with Transformer Engine: | ||
|
||
- Support the new FP8 feature that is introduced by latest accelerators (e.g. H100). | ||
- Speed up math-intensive operations, such as linear layers, by using Tensor Cores. | ||
- Speed up memory-limited operations by accessing one byte compared to half or single-precision. | ||
- Reduce memory requirements for training models, enabling larger models or larger minibatches. | ||
- Speed up communication for distributed model by transmitting lower precision gradients. | ||
|
||
## Get started | ||
|
||
### Prerequisites | ||
|
||
- A recent version of Linux; Ubuntu 18.04 or later is highly encouraged. | ||
- An Nvidia GPU (e.g. V100/A100/H100) with compatible drivers installed correctly. | ||
The driver version can be checked by running `nvidia-smi`. | ||
- Python version 3.7 or later (which can be checked by running `python3 --version`). | ||
- Pip version 18.0 or later (which can be checked by running `python3 -m pip --version`). | ||
- CUDA version 11 or later (which can be checked by running `nvcc --version`). | ||
- PyTorch version 1.13 or later (which can be checked by running `python -c "import torch; print(torch.__version__)"`). | ||
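
The prerequisite list above can be checked in one pass. The following is a minimal sketch, not part of MS-AMP: the function name `check_prerequisites` is hypothetical, and finding `nvidia-smi` or `nvcc` on `PATH` is only taken as a hint that the driver and CUDA toolkit are installed. It does not verify their versions.

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 7), min_torch=(1, 13)):
    """Return a dict mapping each check name to True or False.

    Hypothetical helper: thresholds mirror the prerequisite list above.
    """
    results = {
        # Python 3.7 or later
        "python": sys.version_info[:2] >= min_python,
        # Presence on PATH only; versions are not inspected here
        "nvidia-smi": shutil.which("nvidia-smi") is not None,
        "nvcc": shutil.which("nvcc") is not None,
    }
    try:
        import torch

        # Version strings look like "1.14.0a0+410ce96"; keep major.minor only
        major, minor = (int(part) for part in torch.__version__.split(".")[:2])
        results["pytorch"] = (major, minor) >= min_torch
    except Exception:
        # PyTorch missing, broken, or reporting an unparsable version
        results["pytorch"] = False
    return results


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'missing or too old'}")
```

Pip and the exact CUDA version are left out of the sketch; the shell commands in the list remain the reference checks for those.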
|
||
We strongly recommend using the [PyTorch NGC Container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch). For example, to start a PyTorch 1.14 container, run the following commands: | ||
|
||
``` | ||
sudo docker run -it -d --name=msamp --privileged --net=host --ipc=host --gpus=all nvcr.io/nvidia/pytorch:22.12-py3 bash | ||
sudo docker exec -it msamp bash | ||
``` | ||
|
||
### Install MS-AMP | ||
|
||
You can clone the source from GitHub. | ||
|
||
```bash | ||
git clone https://github.com/Azure/MS-AMP.git | ||
cd MS-AMP | ||
git submodule update --init --recursive | ||
``` | ||
|
||
If you want to train a model on multiple GPUs, you need to install MSCCL to support FP8. | ||
|
||
```bash | ||
cd third_party/msccl | ||
|
||
# V100 | ||
make -j src.build NVCC_GENCODE="-gencode=arch=compute_70,code=sm_70" | ||
# A100 | ||
make -j src.build NVCC_GENCODE="-gencode=arch=compute_80,code=sm_80" | ||
# H100 | ||
make -j src.build NVCC_GENCODE="-gencode=arch=compute_90,code=sm_90" | ||
|
||
apt-get update | ||
apt install build-essential devscripts debhelper fakeroot | ||
make pkg.debian.build | ||
dpkg -i build/pkg/deb/libnccl2_*.deb | ||
dpkg -i build/pkg/deb/libnccl-dev_2*.deb | ||
|
||
cd - | ||
``` | ||
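
Only one of the three `make` lines in the block above is needed, depending on the GPU. As a rough illustration, the mapping from GPU model to `NVCC_GENCODE` value can be written down explicitly. The helper names here are hypothetical; the compute capabilities (70, 80, 90) are the ones used in the commands above.

```python
# Compute capabilities taken from the three build commands above.
COMPUTE_CAPABILITY = {"V100": 70, "A100": 80, "H100": 90}


def nvcc_gencode(gpu: str) -> str:
    """Return the NVCC_GENCODE value for a supported GPU model."""
    cc = COMPUTE_CAPABILITY[gpu.upper()]
    return f"-gencode=arch=compute_{cc},code=sm_{cc}"


def build_command(gpu: str) -> str:
    """Return the MSCCL build command line for the given GPU model."""
    return f'make -j src.build NVCC_GENCODE="{nvcc_gencode(gpu)}"'


if __name__ == "__main__":
    print(build_command("H100"))
```

An unknown GPU name raises `KeyError`, which seems preferable to silently building for the wrong architecture.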
|
||
Then, you can install MS-AMP from source. | ||
|
||
```bash | ||
python3 -m pip install --upgrade pip | ||
python3 -m pip install . | ||
make postinstall | ||
``` | ||
|
||
Before using MS-AMP, you need to preload the msampfp8 library and its dependencies: | ||
|
||
```bash | ||
NCCL_LIBRARY=/usr/lib/x86_64-linux-gnu/libnccl.so # Change as needed | ||
export LD_PRELOAD="/usr/local/lib/libmsamp_dist.so:${NCCL_LIBRARY}:${LD_PRELOAD}" | ||
``` | ||
|
||
After that, you can verify the installation by running: | ||
|
||
```bash | ||
python3 -c "import msamp; print(msamp.__version__)" | ||
``` | ||
|
||
### Usage | ||
|
||
Enabling MS-AMP is simple when training a model on a single GPU: add one line of code, `msamp.initialize(model, optimizer, opt_level)`, after defining the model and optimizer. | ||
|
||
Example: | ||
|
||
```python | ||
import torch | ||
import msamp | ||
|
||
# Declare model and optimizer as usual, with default (FP32) precision | ||
model = torch.nn.Linear(D_in, D_out).cuda() | ||
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3) | ||
|
||
# Allow MS-AMP to perform casts as required by the opt_level | ||
model, optimizer = msamp.initialize(model, optimizer, opt_level="O1") | ||
... | ||
``` | ||
|
||
For a distributed training job, add `optimizer.all_reduce_grads(model)` after the backward pass to reduce gradients across the process group. | ||
|
||
Example: | ||
|
||
```python | ||
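scaler = torch.cuda.amp.GradScaler() | ||
for batch_idx, (data, target) in enumerate(train_loader): | ||
    data, target = data.to(device), target.to(device) | ||
    optimizer.zero_grad() | ||
    output = model(data) | ||
    # loss_fn is the criterion; a distinct name avoids overwriting it with the loss value | ||
    loss = loss_fn(output, target) | ||
    scaler.scale(loss).backward() | ||
    # Reduce gradients across the process group after backward | ||
    optimizer.all_reduce_grads(model) | ||
    scaler.step(optimizer) | ||
    scaler.update() | ||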
``` | ||
|
||
To apply MS-AMP to DeepSpeed ZeRO, add a "msamp" section to the DeepSpeed config file: | ||
|
||
```json | ||
"msamp": { | ||
"enabled": true, | ||
"opt_level": "O3" | ||
} | ||
``` | ||
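
When the DeepSpeed config is built in code rather than edited by hand, the same section can be merged into an existing config dictionary. This is a minimal sketch: `enable_msamp` is a hypothetical helper, the `train_batch_size` value is a placeholder, and validating against O1/O2/O3 is an assumption based on the levels this README names (only `"O3"` is shown in the fragment above).

```python
import json


def enable_msamp(ds_config: dict, opt_level: str = "O3") -> dict:
    """Return a copy of ds_config with the "msamp" section added."""
    # Assumption: only the levels named in this README are accepted.
    if opt_level not in {"O1", "O2", "O3"}:
        raise ValueError(f"unknown opt_level: {opt_level!r}")
    updated = dict(ds_config)  # shallow copy; the caller's dict is untouched
    updated["msamp"] = {"enabled": True, "opt_level": opt_level}
    return updated


if __name__ == "__main__":
    # Placeholder setting standing in for an existing config
    config = enable_msamp({"train_batch_size": 32})
    print(json.dumps(config, indent=2))
```

Returning a copy keeps the original config reusable for a non-FP8 baseline run.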
|
||
Runnable, comprehensive examples demonstrating good practices can be found [here](./examples). | ||
For more examples, please go to [MS-AMP-Examples](https://github.com/Azure/MS-AMP-Examples). | ||
|
||
### Optimization Level | ||
|
||
Currently MS-AMP supports three optimization levels: O1, O2 and O3. Try them and see which gives the best speedup and accuracy for your model. | ||
|
||
- O1: We found that directly transitioning weight gradients from FP32 to FP8 in the Transformer Engine leads to a decrease in accuracy. However, this issue is resolved in O1 through the implementation of FP8 for weight gradients and AllReduce communication. This optimization also has the added benefits of saving GPU memory and reducing communication bandwidth. | ||
|
||
- O2: From O1 to O2, our main focus is on enabling the use of low-bit data formats for auxiliary tensors in the Adam/AdamW optimizer without any loss in accuracy. Specifically, we are able to maintain accuracy by representing the first-order optimizer state in FP8 and the second-order state in FP16. This optimization has the potential to save up to 62.5% of GPU memory for the optimizer when the model size is particularly large. | ||
|
||
- O3: This optimization level is specifically designed for the ZeRO optimizer in the distributed training framework DeepSpeed. ZeRO separates model weights into regular weights and master weights, with the former used for network forward/backward on each GPU, and the latter used for model updating in the optimizer. This separation allows us to use 8-bit data precision for regular weights and weight broadcasting, which reduces GPU memory and bandwidth usage even further. | ||
|
||
Here are details of different MS-AMP optimization levels: | ||
| Optimization Level | Computation(GEMM) | Comm | Weight | Master Weight | Weight Gradient | Optimizer States | | ||
| ------------------- | ----------- | ----- | ------ | ------------- | --------------- | ---------------- | | ||
| FP16 AMP | FP16 | FP32 | FP32 | N/A | FP32 | FP32+FP32 | | ||
| Nvidia TE | FP8 | FP32 | FP32 | N/A | FP32 | FP32+FP32 | | ||
| MS-AMP O1 | FP8 | FP8 | FP16 | N/A | FP8 | FP32+FP32 | | ||
| MS-AMP O2 | FP8 | FP8 | FP16 | N/A | FP8 | FP8+FP16 | | ||
| MS-AMP O3 | FP8 | FP8 | FP8 | FP16 | FP8 | FP8+FP16 | | ||
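
One way to read the "up to 62.5%" optimizer saving quoted for O2 is as simple byte arithmetic over the "Optimizer States" column of the table. The sketch below assumes the figure compares only the two Adam states per parameter (FP32+FP32 against FP8+FP16) and ignores scaling factors and other metadata. The function names are illustrative.

```python
# Bytes per element for each format in the table
BYTES = {"FP32": 4, "FP16": 2, "FP8": 1}

# "Optimizer States" column: (first-order state, second-order state)
OPTIMIZER_STATES = {
    "FP16 AMP": ("FP32", "FP32"),
    "Nvidia TE": ("FP32", "FP32"),
    "MS-AMP O1": ("FP32", "FP32"),
    "MS-AMP O2": ("FP8", "FP16"),
    "MS-AMP O3": ("FP8", "FP16"),
}


def optimizer_bytes_per_param(level: str) -> int:
    """Bytes of optimizer state stored per model parameter."""
    return sum(BYTES[fmt] for fmt in OPTIMIZER_STATES[level])


def saving_vs_baseline(level: str, baseline: str = "FP16 AMP") -> float:
    """Fraction of optimizer-state memory saved relative to the baseline."""
    base = optimizer_bytes_per_param(baseline)
    return 1 - optimizer_bytes_per_param(level) / base


if __name__ == "__main__":
    for level in OPTIMIZER_STATES:
        print(f"{level}: {optimizer_bytes_per_param(level)} bytes/param, "
              f"saving {saving_vs_baseline(level):.1%}")
```

Under these assumptions O2 stores 3 bytes per parameter instead of 8, which gives 1 - 3/8 = 62.5%.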
|
||
## Performance | ||
|
||
### Accuracy: no loss of accuracy | ||
|
||
We evaluated the training loss and validation performance of three typical models, Swin-Transformer, DeiT and RoBERTa, using both MS-AMP O2 and FP16 AMP. Our observations showed that the models trained with MS-AMP O2 mode achieved comparable performance to those trained using FP16 AMP. This demonstrates the effectiveness of the Mixed FP8 O2 mode in MS-AMP. | ||
|
||
Here are the results for Swin-T, DeiT-S and RoBERTa-B: | ||
|
||
![image](./docs/assets/performance.png) | ||
|
||
### Memory | ||
|
||
MS-AMP preserves 32-bit accuracy while using only a fraction of the memory footprint on a range of tasks, including the DeiT model and Swin Transformer for ImageNet classification. For example, compared with FP16 AMP, MS-AMP in O2 mode achieves a 44% memory saving for Swin-1.0B and a 26% memory saving for ViT-1.2B. The proportion of memory saved will be more obvious for larger models. | ||
|
||
Here are the results for Swin-1.0B and ViT-1.2B. | ||
|
||
![Image](./docs/assets/gpu-memory.png) | ||
|
||
For detailed settings and results, please go to [MS-AMP-Examples](https://github.com/Azure/MS-AMP-Examples). | ||
|
||
## Contributing | ||
|
||
This project welcomes contributions and suggestions. Most contributions require you to agree to a | ||
Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us | ||
the rights to use your contribution. For details, visit [CLA](https://cla.opensource.microsoft.com). | ||
|
||
When you submit a pull request, a CLA bot will automatically determine whether you need to provide | ||
a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions | ||
provided by the bot. You will only need to do this once across all repos using our CLA. | ||
|
||
This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/). | ||
For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or | ||
contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments. | ||
## _Check [aka.ms/msamp/doc](https://aka.ms/msamp/doc) for more details._ | ||
|
||
## Trademarks | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,43 @@ | ||
--- | ||
id: contributing | ||
--- | ||
|
||
# Contributing | ||
|
||
## Contributor License Agreement | ||
|
||
This project welcomes contributions and suggestions. Most contributions require you to agree to a | ||
Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us | ||
the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com. | ||
|
||
When you submit a pull request, a CLA bot will automatically determine whether you need to provide | ||
a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions | ||
provided by the bot. You will only need to do this once across all repos using our CLA. | ||
|
||
This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/). | ||
For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or | ||
contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments. | ||
|
||
## How to Contribute | ||
|
||
### Contribute New Feature | ||
|
||
MS-AMP is an open-source project. Your participation and contribution are highly appreciated. There are several important things you need to know before contributing a new feature to this project: | ||
|
||
#### What content can be added to MS-AMP | ||
|
||
1. Bug fixes for existing features. | ||
1. Performance improvement. | ||
1. New features such as support for new distributed training framework. | ||
|
||
If you would like to contribute a new feature to MS-AMP, please submit your proposal first: in [GitHub Issues](https://github.com/azure/MS-AMP/issues), choose `Enhancement Request` to file it. If the proposal is accepted, you can submit pull requests to the origin `main` branch. | ||
|
||
#### Contribution steps | ||
|
||
If you would like to contribute to the project, please follow the steps below for joint development on GitHub. | ||
|
||
1. `Fork` the repo to your personal GitHub account. | ||
1. Check out a branch from main for feature development. | ||
1. When you finish the feature, fetch the latest code from the origin repo, merge it into your branch and resolve any conflicts. | ||
1. Submit a pull request to the origin main branch. | ||
1. Reviewers may leave comments or questions; please update the pull request to address them. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,70 @@ | ||
--- | ||
id: development | ||
--- | ||
|
||
# Development | ||
|
||
If you want to develop a new feature, please follow the steps below to set up a development environment. | ||
|
||
We suggest using [Visual Studio Code](https://vscode.github.com/) and installing the recommended extensions for this project. | ||
You can also develop online with [GitHub Codespaces](https://github.com/codespaces). | ||
|
||
## Check Environment | ||
|
||
Follow [System Requirements](../getting-started/installation.mdx). | ||
|
||
## Set up | ||
|
||
Clone code. | ||
|
||
```bash | ||
git clone --recurse-submodules https://github.com/azure/MS-AMP | ||
cd MS-AMP | ||
``` | ||
|
||
Install MS-AMP. | ||
|
||
```bash | ||
python3 -m pip install --upgrade pip | ||
python3 -m pip install -e .[test] | ||
make postinstall | ||
``` | ||
|
||
Install MSCCL and preload the msamp_dist library. | ||
|
||
```bash | ||
cd third_party/msccl | ||
# H100 | ||
make -j src.build NVCC_GENCODE="-gencode=arch=compute_90,code=sm_90" | ||
apt-get update | ||
apt install build-essential devscripts debhelper fakeroot | ||
make pkg.debian.build | ||
dpkg -i build/pkg/deb/libnccl2_*.deb | ||
dpkg -i build/pkg/deb/libnccl-dev_2*.deb | ||
|
||
cd - | ||
NCCL_LIBRARY=/usr/lib/x86_64-linux-gnu/libnccl.so # Change as needed | ||
export LD_PRELOAD="/usr/local/lib/libmsamp_dist.so:${NCCL_LIBRARY}:${LD_PRELOAD}" | ||
``` | ||
|
||
## Lint and Test | ||
|
||
Format code using yapf. | ||
|
||
```bash | ||
python3 setup.py format | ||
``` | ||
|
||
Check code style with mypy and flake8. | ||
|
||
```bash | ||
python3 setup.py lint | ||
``` | ||
|
||
Run unit tests. | ||
|
||
```bash | ||
python3 setup.py test | ||
``` | ||
|
||
Open a pull request against the main branch on GitHub. |