# MoE-Infinity

MoE-Infinity is a cost-effective, fast, and easy-to-use library for Mixture-of-Experts (MoE) inference and serving.

MoE-Infinity is cost-effective yet fast:

- Offloading MoE's experts to host memory, allowing memory-constrained GPUs to serve MoE models.
- Minimizing the expert offloading overheads through several novel techniques: expert activation tracing, activation-aware expert prefetching, and activation-aware expert caching (a conceptual sketch follows this list).
- Supporting LLM acceleration techniques (such as [FlashAttention](https://github.com/Dao-AILab/flash-attention)).
- Supporting multi-GPU environments with numerous OS-level performance optimizations.
- Achieving SOTA latency and throughput performance when serving MoEs in a resource-constrained GPU environment (in comparison with HuggingFace [Accelerate](https://github.com/huggingface/accelerate), [DeepSpeed](https://github.com/microsoft/DeepSpeed), [Mixtral-Offloading](https://github.com/dvmazur/mixtral-offloading), and [Ollama/LLama.cpp](https://github.com/ollama/ollama)).
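
The caching idea above can be pictured with a tiny, self-contained sketch. This is purely conceptual and not MoE-Infinity's actual implementation; the class name and the least-recently-activated eviction heuristic are hypothetical illustrations of activation-aware caching:

```python
from collections import defaultdict

class ActivationAwareExpertCache:
    """Toy illustration: keep the most frequently activated experts on the GPU,
    evicting the least-activated resident expert when capacity is exceeded."""

    def __init__(self, capacity: int):
        self.capacity = capacity              # number of experts that fit in GPU memory
        self.resident = set()                 # expert ids currently on the GPU
        self.activations = defaultdict(int)   # recent activation trace per expert

    def access(self, expert_id: int) -> bool:
        """Record an activation; return True on a cache hit, False if the expert
        had to be fetched from host memory (possibly evicting a victim)."""
        self.activations[expert_id] += 1
        if expert_id in self.resident:
            return True
        if len(self.resident) >= self.capacity:
            # evict the resident expert with the fewest recent activations
            victim = min(self.resident, key=lambda e: self.activations[e])
            self.resident.remove(victim)      # real system: free or copy back the weights
        self.resident.add(expert_id)          # real system: prefetch weights from host memory
        return False
```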

MoE-Infinity is easy-to-use:

- HuggingFace model compatible, and HuggingFace programmer friendly.
- Supporting all available MoE checkpoints (including [Google Switch Transformers](https://huggingface.co/google/switch-large-128), [Meta NLLB-MoE](https://huggingface.co/facebook/nllb-moe-54b), and [Mixtral](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1)).

Note: the open-sourced MoE-Infinity has been redesigned to be friendly to HuggingFace users. It differs from the version reported in the paper, which treats extreme performance as the top priority. As a result, distributed inference is currently not supported in this open-sourced version.

## Contents
- [Performance](#performance)
- [Installation](#installation)
- [Prerequisites](#prerequisites)
- [Install from PyPI](#install-from-pypi)
- [Install from Source](#install-from-source)
- [Usage and Examples](#usage-and-examples)
- [Sample Code of Huggingface LLM Inference](#sample-code-of-huggingface-llm-inference)
- [Running Inference](#running-inference)
- [Release Plan](#release-plan)

## Performance

Single A5000 GPU (24GB memory), per-token latency (seconds) for generation with a mixed dataset drawn from [FLAN](https://huggingface.co/datasets/Muennighoff/flan), [BIG-Bench](https://huggingface.co/datasets/bigbench) and [MMLU](https://huggingface.co/datasets/lukaemon/mmlu).

| | switch-large-128 | NLLB-MoE-54B | Mixtral-8x7B |
| :---: | :---: | :---: | :---: |
| *MoE-Infinity* | *0.230* | *0.239* | *0.895* |
| Accelerate | 1.043 | 3.071 | 6.633 |
| DeepSpeed | 4.578 | 8.381 | 2.486 |
| Mixtral Offloading | X | X | 1.752 |
| Ollama | X | X | 0.903 |

Single A5000 GPU, throughput (tokens/s) for generation at batch size 32.

| | switch-large-128 | NLLB-MoE-54B | Mixtral-8x7B |
| :---: | :---: | :---: | :---: |
| *MoE-Infinity* | *69.105* | *30.300* | *12.579* |
| Accelerate | 5.788 | 4.344 | 1.245 |
| DeepSpeed | 7.416 | 4.334 | 7.727 |
| Mixtral Offloading | X | X | 7.684 |
| Ollama | X | X | 1.107 |

> The Mixtral Offloading experiment was carried out with a batch size of 16, as a batch size of 32 caused out-of-memory errors on the GPU.

## Installation

We recommend installing MoE-Infinity in a virtual environment. To install MoE-Infinity, you can either install it from PyPI or build it from source.

### Prerequisites
MoE-Infinity currently supports Linux only. Ensure the following dependencies are installed on your system:

```bash
# example of installing dependencies on ubuntu
sudo apt install build-essential curl libaio-dev libspdlog-dev
```

MoE-Infinity requires PyTorch (>=2.0), libstdcxx-ng (>=12.0), and Python (>=3.8). Please refer to [PyTorch](https://pytorch.org/get-started/locally/) for installation instructions.
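
A quick, optional sanity check of the interpreter and PyTorch versions (this assumes `python` points at the environment you will install into):

```bash
python --version                                     # expect Python 3.8 or newer
python -c "import torch; print(torch.__version__)"   # expect 2.0 or newer
```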

### Install from PyPI

```bash
pip install moe-infinity
```

### Install from Source

```bash
git clone https://github.com/TorchMoE/MoE-Infinity.git
cd MoE-Infinity
pip install -e .
```
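
Either installation route can be verified by importing the package; the module name `moe_infinity` matches the import used in the example below:

```bash
# confirm the package resolves in the current environment
python -c "import moe_infinity; print(moe_infinity.__file__)"
```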

## Usage and Examples

We provide a simple API for diverse setups, including single GPU, multiple GPUs, and multiple nodes. The following examples show how to use MoE-Infinity to run generation with a HuggingFace LLM.

### Sample Code of Huggingface LLM Inference

```python
import torch
import os
from transformers import AutoTokenizer
from moe_infinity import MoE

user_home = os.path.expanduser('~')

checkpoint = 'TheBloke/Mixtral-8x7B-v0.1-GPTQ'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

config = {
    "offload_path": os.path.join(user_home, "moe-infinity"),
    "device_memory_ratio": 0.75,  # use 75% of device memory for caching experts; lower this value if you hit out-of-memory errors
}

model = MoE(checkpoint, config)

input_text = "translate English to German: How old are you?"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda:0")

output_ids = model.generate(input_ids)
output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(output_text)
```
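
If the `MoE` wrapper forwards standard HuggingFace generation arguments (an assumption, not verified here), decoding can be tuned in the usual `transformers` style:

```python
# assumption: generate() accepts standard transformers generation kwargs
output_ids = model.generate(
    input_ids,
    max_new_tokens=64,   # cap the number of generated tokens
    do_sample=False,     # greedy decoding for reproducible output
)
```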

### Running Inference

This command runs the script on selected GPUs.
```bash
CUDA_VISIBLE_DEVICES=0,1 python script.py
```

We provide a simple example that runs inference on a HuggingFace LLM. The script downloads the model checkpoint and runs inference on the specified input text; the output is printed to the console.

```bash
CUDA_VISIBLE_DEVICES=0 python example/interface_example.py --model_name_or_path "mistralai/Mixtral-8x7B-Instruct-v0.1" --offload_dir <your local path on SSD>
```

## Release Plan

We plan to release the following features in the coming months:

* PyTorch is currently the default inference engine; we are working on supporting vLLM as an additional inference runtime, including support for KV-cache offloading.
* Supporting expert parallelism for distributed MoE inference.
* More (We welcome contributors to join us!)

## Citation

If you use MoE-Infinity for your research, please cite our [paper](https://arxiv.org/abs/2401.14361):
```bibtex
@article{moe-infinity2024,
  title={MoE-Infinity: Activation-Aware Expert Offloading for Efficient MoE Serving},
  author={Leyang Xue and Yao Fu and Zhan Lu and Luo Mai and Mahesh Marina},
  journal={arXiv preprint arXiv:2401.14361},
  year={2024}
}
```
