Upstream merge jan09 #151

Merged
merged 275 commits into from
Jan 9, 2024

masahi
Member

@masahi masahi commented Jan 9, 2024

@vinx13 Please verify that Mixtral support is not broken.

MasterJH5574 and others added 30 commits October 15, 2023 11:02
PR mlc-ai#1048 updated the signature of softmax in the built model library
and changed the temperature buffer shape in ChatModule. This left
some existing demos unable to run, since we had not done a round of model
library updates.

This PR reverts the ChatModule change, and adds back the softmax
function in non-batching case. With this PR, the regression should
be fixed.
…ai#1074)

This PR lifts the device string parsing (just a few lines)
to a standalone function, so that the serving side
can make use of this function as well.

Tested Python API and it does not seem to incur regression.
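
As a rough illustration of what such a standalone helper might look like, the sketch below splits a device string such as `cuda:0` into a device type and an index. The function name `parse_device_str` and the list of accepted device names are assumptions made for this example, not the actual MLC API.

```python
from typing import Tuple

# Hypothetical set of accepted device names; the real list may differ.
KNOWN_DEVICES = ("cuda", "rocm", "metal", "vulkan", "opencl", "cpu")


def parse_device_str(device: str) -> Tuple[str, int]:
    """Split "name" or "name:id" into (name, id); id defaults to 0."""
    name, sep, index = device.partition(":")
    if name not in KNOWN_DEVICES:
        raise ValueError(f"Unknown device type: {name!r}")
    if sep and not index.isdigit():
        raise ValueError(f"Invalid device index in {device!r}")
    return name, (int(index) if sep else 0)
```

Keeping the parsing in one function means the Python API and a serving entry point can share the same validation rules.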
The pass `fuse-split-rotary` assumes the compute dtype is fp16, which
it usually is, but in certain cases, e.g. `q0f32` and `q4f32_1`, the
compute is based on fp32 instead. This PR strengthens the check guard.
This PR establishes the compiler components in MLC-Chat Python API,
which currently includes two primary components: models and parameters.

The models are `nn.Module`-based definition of an LLM, which, as the
very first stab, contains only `LlamaForCasualLM`. It is decomposed into
three files:
- `llama_config.py`: common configurations for Llama, where we define
  relevant configurations of its architecture, as well as include
  standard config file for Llama2-7B/13B/70B for convenient testing;
- `llama.py`: the model architecture of Llama, based on the PyTorch-like
`nn.Module` API;
- `llama_parameter.py`: defines the mapping between MLC parameters and
  pytorch parameters.

The parameters component contains the basic functionality of parameter mapping,
and the loaders that effectively convert parameters from PyTorch to MLC
according to the mapping specified. Currently, only `HFTorchLoader` is
implemented, but loaders like SafeTensor, GPTQ or AWQ should be quite
straightforward according to the existing design.

On top of this PR, on-the-fly quantization could be defined as a loading
time transformation on MLC parameters, while pre-quantized parameter
loading is effectively parameter loading after MLC's `nn.Module` is
quantized.

Two unit tests exemplify how the infrastructure works:
- `./tests/python/model/test_llama.py` shows how to create an `nn.Module`
using the new infra, and then convert it to TVM IRModule;
- `./tests/python/parameter/hf_torch_loader.py` shows how to load
parameters from HuggingFace PyTorch format.

Besides, `mlc_chat.support` is established for utility functions, which
now contains two utils:
- `config.py` which supports reading configurations into dataclasses
from JSON file or Python dict. On top of Python dataclass, it throws
irrelevant fields into `cls.kwargs`, which is helpful when loading
HuggingFace configuration file;
- `tqdm.py` which contains tqdm-related utilities, primarily redirecting
logging and printing to work nicely with tqdm.
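
The "throw irrelevant fields into `kwargs`" behavior of `config.py` can be sketched with a plain dataclass. This is a minimal illustration of the idea only: the class `ExampleConfig` and its fields are hypothetical and are not the classes defined in `mlc_chat.support`.

```python
import dataclasses
from typing import Any, Dict


@dataclasses.dataclass
class ExampleConfig:
    """Hypothetical config; unknown fields are collected in `kwargs`."""

    hidden_size: int
    num_hidden_layers: int
    kwargs: Dict[str, Any] = dataclasses.field(default_factory=dict)

    @classmethod
    def from_dict(cls, source: Dict[str, Any]) -> "ExampleConfig":
        # Names the dataclass declares explicitly (everything except kwargs).
        known = {f.name for f in dataclasses.fields(cls)} - {"kwargs"}
        fields = {k: v for k, v in source.items() if k in known}
        extra = {k: v for k, v in source.items() if k not in known}
        return cls(**fields, kwargs=extra)
```

With this shape, a HuggingFace-style dictionary that carries extra keys (for instance `torch_dtype`) loads without errors, and the extra keys stay available for later inspection.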
…ages (mlc-ai#1086)

* Support lib_path option in C++ CLI. Disable ChatConfig.model_lib override in Python API. Improvements on helper messages and error messages

* Update docs

* Rename lib_path -> model_lib_path
[Format] Apply isort and black on `python/`

The commands I am using are:

```
isort --profile black python/
black python/
```

It is always recommended to format the code before submission, given we
don't have a linter CI yet.
This PR enables two Python formatters "black" and "isort" on the following directory:
- `./python/`
- `./tests/python/`

Enabling pylint and mypy is left for future work
Add pylint/mypy tooling into pyproject.toml

This PR establishes the initial Python tooling infra with Pylint and
Mypy. Currently only the newest modules, i.e. `mlc_chat.support` and
`mlc_chat.compiler` are covered, and we expect to cover the entire
package, as being tracked in mlc-ai#1101.
…1052)

Prior to this commit, `mlc_llm.transform.rewrite_attention` updated a
single function.  This commit modifies it to instead be a transform
operating on any pattern matches within an `IRModule`.
…#1056)

* [ParamManager] Use BundleModelParams for transform_quantize

Prior to this commit, `ParamManager.transform_quantize` function took
as input functions with separate parameters for each weight tensor,
and produced output functions with a tuple parameter for all weights.
Because `LiftTransformParams` had the same convention, neither could
be applied as part of the same build flow.

This commit updates `ParamManager.transform_quantize` pass to produce
outputs with separate tensor parameters, using the `BundleModelParams`
transform to later combine them into a single tuple parameter.  The
analogous change was also performed for `LiftTransformParams` as part
of apache/tvm#15657.

In addition, prior to this commit, the
`ParamManager.transform_dequantize` function operated directly on a
`IRModule` object.  As a result, any debug instrumentation
(e.g. before/after printouts for each pass, before/after verification
with `relax.analysis.well_formed`, etc.) did not apply to this
`transform_dequantize`.  This commit updates
`ParamManager.transform_dequantize` to return a `ir.transform.Pass`.

* Correct type annotation
fix error introduced by recent code changes

fixes mlc-ai#1116
…lc-ai#1119)

* Add doc for max and mean gen len, shift factor

* Update python docs for BuildArgs
mlc-ai#1120)

Revert "[ParamManager] Use BundleModelParams for transform_dequantize (mlc-ai#1056)"

This reverts commit e5927ce.

This causes a regression impacting all MLC LLM nightlies as it violates the existing calling convention in MLC Chat runtime. An example: mlc-ai#1060 (comment)
This PR removes an inaccurate warning from mlc-ai#1086, which warns about
`model_lib` overriding regardless of whether or not it's actually
overridden. With this commit, we only warn if its value is not None.
* add presence and frequency penalty

* Added support for passing conversation history in /v1/chat/completions endpoint

* Added support for RestAPI parameters max_gen_len, n, and stop_str

* add presence and frequency penalty to generation config
* refactor generation config

* Added documentation for parameters

* replace lib_path with model_lib_path in rest.py

* fixed black isort issues

* fix lib_path
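
The commit list above does not spell out how the two penalties are applied. The sketch below follows the commonly documented definition, in which each token that has already been generated has its logit reduced by a flat presence term plus a term proportional to its count. Treat it as an assumption about the behavior, not a copy of the MLC implementation; the function name is hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def apply_penalties(
    logits: Sequence[float],
    generated_token_ids: Sequence[int],
    presence_penalty: float = 0.0,
    frequency_penalty: float = 0.0,
) -> List[float]:
    """Return a penalized copy of `logits` (one entry per vocabulary id)."""
    counts = Counter(generated_token_ids)
    penalized = list(logits)
    for token_id, count in counts.items():
        # Frequency penalty scales with the number of occurrences;
        # presence penalty is a flat cost for having appeared at all.
        penalized[token_id] -= frequency_penalty * count + presence_penalty
    return penalized
```

Tokens that have not been generated yet keep their original logits, so both penalties only discourage repetition.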
…lc-ai#1127)

Prior to this commit, `ParamManager.transform_quantize` function took
as input functions with separate parameters for each weight tensor,
and produced output functions with a tuple parameter for all weights.
Because `LiftTransformParams` had the same convention, neither could
be applied as part of the same build flow.

This commit updates `ParamManager.transform_quantize` pass to produce
outputs with separate tensor parameters, using the `BundleModelParams`
transform to later combine them into a single tuple parameter.  The
analogous change was also performed for `LiftTransformParams` as part
of apache/tvm#15657.

In addition, prior to this commit, the
`ParamManager.transform_dequantize` function operated directly on a
`IRModule` object.  As a result, any debug instrumentation
(e.g. before/after printouts for each pass, before/after verification
with `relax.analysis.well_formed`, etc.) did not apply to this
`transform_dequantize`.  This commit updates
`ParamManager.transform_dequantize` to return a `ir.transform.Pass`.

This commit is a repeat of the reverted PR
mlc-ai#1056.  This PR resolves the bug
in the earlier implementation by removing the call to
`.without_attr("num_input")` in `ParamReplacer.rewrite_func`.  This
follows an analogous update in `LiftTransformParams`, preserving the
`"num_input"` attribute for use in `BundleModelParams`.
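
The calling-convention change can be pictured with a plain-Python analogy. This is not TVM code: `bundle_model_params` and `linear` are hypothetical stand-ins that only show why the `"num_input"` value is needed, namely to know where the runtime inputs end and the weight parameters begin.

```python
from typing import Any, Callable


def bundle_model_params(func: Callable[..., Any], num_input: int) -> Callable[..., Any]:
    """Wrap `func(*inputs, *weights)` as `bundled(*inputs, weights_tuple)`."""

    def bundled(*args: Any) -> Any:
        # The first `num_input` arguments are runtime inputs; the next one
        # is a single tuple that holds every weight.
        inputs, params = args[:num_input], args[num_input]
        return func(*inputs, *params)

    return bundled


def linear(x: float, weight: float, bias: float) -> float:
    # One runtime input (x) followed by two separate weight parameters.
    return x * weight + bias
```

For example, `bundle_model_params(linear, num_input=1)(2.0, (3.0, 1.0))` calls `linear(2.0, 3.0, 1.0)`. If the `num_input` information were dropped, the wrapper could not tell inputs from weights.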
The 32-bit version of the zstd.dll library was causing issues, so the doc is updated to be more specific and to download the 64-bit version.
CharlieFRuan and others added 29 commits December 29, 2023 12:08
Integrate fused rope into model gpt_neox and phi. 

Add an optional parameter `rotary_dim` to `llama_rope`. `rotary_dim` indicates the number of dimensions in the embedding that RoPE is applied to. By default `rotary_dim` is the same as `head_dim`. In model `Phi`, `rotary_dim` is set to a different number based on the config.
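
To make the role of `rotary_dim` concrete, here is a small pure-Python sketch that rotates only the first `rotary_dim` entries of a single head vector and passes the rest through unchanged. The pairing of entry `i` with entry `i + rotary_dim / 2` (the "rotate-half" convention) and the base `theta = 10000` are assumptions; the fused kernel in `llama_rope` may lay out the pairs differently.

```python
import math
from typing import List


def apply_partial_rope(
    vec: List[float], position: int, rotary_dim: int, theta: float = 10000.0
) -> List[float]:
    """Rotate the first `rotary_dim` entries of `vec`; pass the rest through."""
    assert rotary_dim % 2 == 0 and rotary_dim <= len(vec)
    half = rotary_dim // 2
    out = list(vec)
    for i in range(half):
        angle = position * theta ** (-2.0 * i / rotary_dim)
        cos, sin = math.cos(angle), math.sin(angle)
        # Pair entry i with entry i + half (assumed "rotate-half" layout).
        a, b = vec[i], vec[i + half]
        out[i] = a * cos - b * sin
        out[i + half] = a * sin + b * cos
    return out
```

When `rotary_dim` equals the length of the vector, this reduces to ordinary RoPE over the whole head, which corresponds to the default described above.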
This PR addresses a package name conflict issue introduced by mlc-ai#1502,
where `mlc_chat.operator` collides with python's official `operator`
library.

More details:
mlc-ai#1502 (comment).
A minor path fix in the Android Doc, as the file `prepare_libs.sh` is
under `library` folder.
…-ai#1522)

This PR introduces an environment variable `MLC_JIT_POLICY` as a
follow-up item to PR [mlc-ai#1508](mlc-ai#1508 (comment)).
It allows enabling or disabling the JIT behavior via:
- `OFF`: never JIT, and will throw an error if `model_lib`
is missing;
- `ON` (default): JIT whenever the model lib is missing and there's
a cache miss;
- `REDO`: whenever the model lib is missing, always do JIT
compilation even if cache hits;
- `READONLY`: never do JIT compilation but look up the JIT cache
whenever the model lib is missing.
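
The four policies can be summarized as a small decision function. This is a sketch, not the code from the PR: the function name and the returned action strings are hypothetical, and treating a `READONLY` cache miss as an error is an assumption, since the description above does not say what happens in that case.

```python
import os


def decide_jit_action(model_lib_present: bool, cache_hit: bool) -> str:
    """Return one of "use_lib", "use_cache", "compile", or "error"."""
    policy = os.environ.get("MLC_JIT_POLICY", "ON").upper()
    if policy not in ("ON", "OFF", "REDO", "READONLY"):
        raise ValueError(f"Unknown MLC_JIT_POLICY: {policy}")
    if model_lib_present:
        return "use_lib"  # JIT is only considered when the lib is missing
    if policy == "OFF":
        return "error"
    if policy == "REDO":
        return "compile"  # recompile even on a cache hit
    if cache_hit:
        return "use_cache"  # ON and READONLY both reuse cached libs
    # Cache miss: ON compiles; READONLY must not (assumed to be an error).
    return "compile" if policy == "ON" else "error"
```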

It also dissolves the newly-introduced `JITOption` into `ChatConfig` so
that it can be used more seamlessly with exactly the existing APIs.
By doing so, users can simply specify `context_window_size`,
`prefill_chunk_size` to control the VRAM used in each model without
having to recompile the model lib themselves.

Example: If one focuses on developing compiler/runtime rather than
quantization, we could simply run

```bash
MLC_JIT_POLICY=REDO python main.py
```

to test if the compiler/runtime work smoothly together, where `main.py`
is:

```python
from mlc_chat import ChatConfig, ChatModule, callback
from mlc_chat.support import logging

logging.enable_logging()
MODEL = "HF://junrushao/Llama-2-7b-chat-hf-q4f16_1-MLC"

cm = ChatModule(
    MODEL,
    device="cuda",
    chat_config=ChatConfig(
        context_window_size=1024,
        prefill_chunk_size=1024,
    ),
)
cm.generate(
    "What is the meaning of life?",
    progress_callback=callback.StreamToStdout(callback_interval=2),
)
```
* Add support for loading weights from a safetensor file

* Set pylint to ignore the import error

* Move pylint-disable line

Co-authored-by: Junru Shao <[email protected]>

---------

Co-authored-by: Junru Shao <[email protected]>
This PR introduces a command that reports the estimated upper-bound
memory usage based on the metadata section of an SLM-compiled model.

Example:

```bash
>> python -m mlc_chat.cli.model_metadata /path/to/model_lib.so --memory-only
[2023-12-31 18:40:43] INFO model_metadata.py:49: Parameter size: 3885.14 MB
[2023-12-31 18:40:43] INFO model_metadata.py:58: Temporary buffer size: 7184.15 MB
[2023-12-31 18:40:43] INFO model_metadata.py:71: KVCache size when context/sliding window size is 4096: 512.00 MB
[2023-12-31 18:40:43] INFO model_metadata.py:79: Total memory usage: 11581.29 MB
[2023-12-31 18:40:43] INFO model_metadata.py:84: Tweaking `prefill_chunk_size`, `context_window_size` and `sliding_window_size` to reduce memory usage
```
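
The total in the log is the sum of the three reported components. The sketch below reproduces that arithmetic and shows one way a KV cache estimate of this size can arise. The KV cache formula and the model shape (32 layers, 8 KV heads, head dimension 128, fp16) are assumptions chosen because they yield the 512.00 MB shown above; the log itself names neither the model nor the formula, and "MB" is assumed to mean 1024 * 1024 bytes.

```python
MB = 1024 * 1024


def kv_cache_bytes(
    num_layers: int, num_kv_heads: int, head_dim: int, window: int, dtype_bytes: int = 2
) -> int:
    """Keys plus values for every layer across the whole window."""
    return 2 * num_layers * num_kv_heads * head_dim * window * dtype_bytes


# Assumed shape: 32 layers, 8 KV heads, head_dim 128, fp16 (2 bytes).
kv_mb = kv_cache_bytes(32, 8, 128, window=4096) / MB

# Parameter and temporary-buffer sizes are copied from the log above.
total_mb = 3885.14 + 7184.15 + kv_mb
print(f"KVCache: {kv_mb:.2f} MB, total: {total_mb:.2f} MB")
# prints "KVCache: 512.00 MB, total: 11581.29 MB"
```

Because the KV cache term grows linearly with the window size, lowering `context_window_size` or `sliding_window_size` shrinks that component proportionally.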

Addresses both B1 and B2 in mlc-ai#1516 (comment).

Another demo using Python API:

```python
from mlc_chat import ChatConfig, ChatModule, callback
from mlc_chat.support import logging

logging.enable_logging()

MODEL="HF://junrushao/NeuralHermes-2.5-Mistral-7B-q4f16_1-MLC"

cm = ChatModule(
    MODEL,
    device="cuda",
    chat_config=ChatConfig(
        sliding_window_size=4096,
        prefill_chunk_size=1024,
        opt="O2",
    ),
)
cm.generate(
    "What is the meaning of life?",
    progress_callback=callback.StreamToStdout(callback_interval=2),
)
```

```bash
>>> MLC_JIT_POLICY=REDO python main.py
```

<img width="958" alt="image" src="https://github.com/mlc-ai/mlc-llm/assets/22515877/8fcf1fb2-53b3-4768-91b4-89f90712dea8">
1. support n-dimension tensor sharding
2. remove unnecessary `row`, `col` and `group` field
This PR turns on FlashInfer in O2 mode given it has been relatively
stable over the past few weeks.

This commit also brings a few misc improvements:
- Pass in scratch memory managed by RelaxVM's memory pool - this change
  depends on TVM's [PR #16327](apache/tvm#16327)
  and FlashInfer's [PR mlc-ai#43](flashinfer-ai/flashinfer#43)
- Enable FlashInfer for group size = 4, which is a setting used in
  Mistral models;
- Slightly shorten and clarify the log message on memory usage on model
  lib loading.
- Integrate FlashInfer into GPT-BigCode models.

With this PR, FlashInfer is integrated into Mistral, Llama, GPT-NeoX,
GPT-BigCode, and Phi. The only model left out is GPT2, which has a special flag
`scale_attn_by_inverse_layer_idx` which applies an elementwise
normalization term `1.0 / layer_id` to attn scores before masked
softmax.
This PR enables the FasterTransformer quantization of `q4f16_ft`.
This PR includes two minor fixes to support TinyLlama:

- Fix BF16 loading via SafeTensor - it was broken because numpy does not
  support bf16, which leads to an exception in safetensor internally.
- FlashInfer doesn't support `head_dim == 64`, which we skipped in this
  PR.
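
For background on the BF16 issue: a bfloat16 value is the upper 16 bits of an IEEE-754 float32, so it can be widened without loss even when numpy has no native bf16 dtype. The sketch below shows that conversion with the standard library only. It is not necessarily how this PR fixed the loader, and the helper name is hypothetical.

```python
import struct
from typing import List


def bf16_bytes_to_floats(raw: bytes) -> List[float]:
    """Decode little-endian bfloat16 values into Python floats."""
    count = len(raw) // 2
    halves = struct.unpack(f"<{count}H", raw)
    # bf16 is the upper 16 bits of an IEEE-754 float32, so padding the
    # lower 16 bits with zeros gives the exact float32 value.
    return [struct.unpack("<f", struct.pack("<I", h << 16))[0] for h in halves]
```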

After this PR, the following snippet runs TinyLlama pretty conveniently:

```python
from mlc_chat import ChatConfig, ChatModule, callback
from mlc_chat.support import logging

logging.enable_logging()

MODEL = "HF://junrushao/TinyLlama-1.1B-Chat-v1.0-q4f16_1-MLC"

def main():
    cm = ChatModule(
        MODEL,
        device="metal",
        chat_config=ChatConfig(context_window_size=1024),
    )
    cm.generate(
        "What is the meaning of life?",
        progress_callback=callback.StreamToStdout(callback_interval=2),
    )

if __name__ == "__main__":
    main()
```
```python
MODEL = "HF://junrushao/Mistral-7B-Instruct-v0.2-q4f16_1-MLC"
TP_SHARDS = 2

from mlc_chat import ChatConfig, ChatModule, callback
from mlc_chat.support import logging

logging.enable_logging()

cm = ChatModule(
    MODEL,
    device="cuda",
    chat_config=ChatConfig(
        context_window_size=1024,
        prefill_chunk_size=1024,
        tensor_parallel_shards=TP_SHARDS,
        opt="flashinfer=0;cublas_gemm=1;cudagraph=0",
    ),
)
cm.generate(
    "What is the meaning of life?",
    progress_callback=callback.StreamToStdout(callback_interval=2),
)
```
This PR introduces the batched llama modeling with Paged KV cache
in SLM flow.
This is a quick fix to mlc-ai#1547. Sorry for missing the init file
in the nn subpackage.
This PR enables FasterTransformer dequantize matmul epilogue fusion.
Introduce Mixtral MoE Model

This PR introduces support for Mixtral MoE models with MLC's latest SLM
quantization/compilation pipeline. It includes the following pieces of
changes:

**Operators.** We implemented a list of operators in TIR's TVMScript
format in two files `moe_misc` and `moe_matmul`. Those TIR kernels
implement "transpose indices" and "blocked-CSR-COO" as described in
MegaBlock [1].

`moe_misc.py` primarily concerns sparsity-related operators, including:
- `get_indices`, `get_indptr` and `scatter_output`: CSR-style index
  manipulation and array shuffling that makes the input ranges each
  expert has to deal with contiguous.
- `moe_sum`, `moe_cumsum`, `topk` which are standard operators but
  specialized for MoE usecases, e.g. #experts and #activated-experts are
  small.
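
The index manipulation can be illustrated in plain Python. The sketch below is a simplified stand-in for what the list above describes, not the TIR kernels themselves: it assumes one expert per token (the real model activates several), and the function name is hypothetical. It counts tokens per expert, turns the counts into a CSR-style row pointer with a cumulative sum, and reorders tokens so that each expert sees a contiguous range.

```python
from typing import List, Tuple


def group_tokens_by_expert(
    expert_ids: List[int], num_experts: int
) -> Tuple[List[int], List[int]]:
    """Return (indptr, order): tokens order[indptr[e]:indptr[e + 1]] go to expert e."""
    counts = [0] * num_experts
    for expert in expert_ids:
        counts[expert] += 1
    # Cumulative sum of the counts gives the CSR row pointer.
    indptr = [0]
    for count in counts:
        indptr.append(indptr[-1] + count)
    # Stable sort keeps the original order of tokens within each expert.
    order = sorted(range(len(expert_ids)), key=lambda i: expert_ids[i])
    return indptr, order
```

With `expert_ids = [2, 0, 2, 1, 0]` and three experts, tokens 1 and 4 form the contiguous range for expert 0, token 3 for expert 1, and tokens 0 and 2 for expert 2.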

`moe_matmul.py` includes non-quantized and quantized GEMV and group GEMM
operators used in MoE model serving. Typically, in single batch
decoding, GEMV operators should suffice, but group GEMM is a necessary
dependency in both prefilling and batched decoding.

**Model architecture.** We reuse the attention building block from
Mistral, and implement MLP MoE in `mixtral_model.py`. In Mixtral,
there are three groups of experts in each MLP, where `e1` and `e3` are
gate/up projections (project-in) and `e2` is down projection (project-out).
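
A minimal pure-Python sketch of this MLP structure follows. It assumes that `e1` is the gate projection and `e3` is the up projection, that the activation is SiLU, and that the selected experts are mixed with softmax weights over their router logits. These choices follow the publicly documented Mixtral architecture and are not taken from `mixtral_model.py`; all function names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def matvec(mat: Matrix, vec: List[float]) -> List[float]:
    return [sum(w * x for w, x in zip(row, vec)) for row in mat]


def silu(x: float) -> float:
    return x / (1.0 + math.exp(-x))


def expert_mlp(x: List[float], e1: Matrix, e2: Matrix, e3: Matrix) -> List[float]:
    """Gated MLP: e2 @ (silu(e1 @ x) * (e3 @ x))."""
    gate, up = matvec(e1, x), matvec(e3, x)
    hidden = [silu(g) * u for g, u in zip(gate, up)]
    return matvec(e2, hidden)


def moe_mlp(x, router_logits, experts, top_k=2):
    """Mix the top-k experts, weighted by a softmax over their router logits."""
    top = sorted(range(len(experts)), key=lambda i: -router_logits[i])[:top_k]
    weights = [math.exp(router_logits[i]) for i in top]
    total = sum(weights)
    out = [0.0] * len(x)
    for i, w in zip(top, weights):
        y = expert_mlp(x, *experts[i])  # each expert is an (e1, e2, e3) tuple
        out = [o + (w / total) * v for o, v in zip(out, y)]
    return out
```

Only the selected experts are evaluated for a given token, which is why the sparse index handling and group GEMM operators described earlier matter for performance.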

**Weight quantization.** We batch all experts of the same kind into a
single tensor, whose shape is `(Ne, N, K)`, where `Ne` is the total
number of experts, `N` is out features and `K` is in-features. Applying
group quantization, we compress along the `K` dimension as consistent
with the rest of the project.
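
As a generic illustration of group quantization along `K`, the sketch below quantizes one row (a single expert and output feature) in groups, storing one scale per group. The symmetric scheme, the shift that makes the codes non-negative, and the group size used in the example are assumptions; the actual `q4f16_1` storage layout and bit packing are not described here and may differ.

```python
from typing import List, Tuple


def group_quantize(
    row: List[float], group_size: int, bits: int = 4
) -> Tuple[List[int], List[float]]:
    """Quantize one row along K; returns (codes, one scale per group)."""
    assert len(row) % group_size == 0
    max_int = 2 ** (bits - 1) - 1  # 7 for 4-bit
    codes: List[int] = []
    scales: List[float] = []
    for start in range(0, len(row), group_size):
        group = row[start : start + group_size]
        # Fall back to 1.0 for an all-zero group to avoid dividing by zero.
        scale = max(abs(v) for v in group) / max_int or 1.0
        scales.append(scale)
        # Shift by max_int so every code is a non-negative integer.
        codes.extend(
            min(max(round(v / scale), -max_int), max_int) + max_int for v in group
        )
    return codes, scales


def group_dequantize(
    codes: List[int], scales: List[float], group_size: int, bits: int = 4
) -> List[float]:
    max_int = 2 ** (bits - 1) - 1
    return [(c - max_int) * scales[i // group_size] for i, c in enumerate(codes)]
```

Applying the same routine to every row of a `(Ne, N, K)` tensor compresses only the `K` dimension, leaving the expert and output-feature dimensions intact.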

**Performance.** The current TIR is highly optimized for non-tensor core
scenarios (Metal, WebGPU, non-TensorCore CUDA, AMD, etc) and tensor core
performance is left for a PR in the near future.

**Try out MLC's Mixtral Model.** The int4-quantized Mixtral model has
24.5 GB of parameters.

```python
from mlc_chat import ChatConfig, ChatModule, callback
from mlc_chat.support import logging
logging.enable_logging()

MODEL = "HF://junrushao/Mixtral-8x7B-Instruct-v0.1-q4f16_1-MLC"
NUM_GPU = 1

def main():
    cm = ChatModule(MODEL, device="cuda:0", chat_config=ChatConfig(
        sliding_window_size=1024,
        tensor_parallel_shards=NUM_GPU,
    ))
    cm.generate("What is the meaning of life?", progress_callback=callback.StreamToStdout(callback_interval=2))

if __name__ == "__main__":
    main()
```

Quantization formats:
- 3-bit (19.662 GB): ["HF://junrushao/Mixtral-8x7B-Instruct-v0.1-q3f16_1-MLC"](https://huggingface.co/junrushao/Mixtral-8x7B-Instruct-v0.1-q3f16_1-MLC)
- 4-bit (24.466 GB): ["HF://junrushao/Mixtral-8x7B-Instruct-v0.1-q4f16_1-MLC"](https://huggingface.co/junrushao/Mixtral-8x7B-Instruct-v0.1-q4f16_1-MLC)

The 3-bit version can be run comfortably using a 24G GPU (e.g. 4090,
3090Ti).

**Convert Mixtral to MLC format from scratch.** The following instructions
are only needed for advanced users to quantize Mixtral from scratch.

```bash
SRC_DIR=/path/to/Mixtral-8x7B-v0.1 # raw model downloaded from HuggingFace
MODEL_DIR=/mlc_models/mixtral-q4f16_1 # destination directory

mlc_chat gen_config $SRC_DIR -o $MODEL_DIR --quantization q4f16_1 \
  --conv-template LM  # "LM" (lang model) means no conversation template yet

mlc_chat convert_weight $SRC_DIR --quantization q4f16_1 -o $MODEL_DIR
```

[1] Gale, Trevor, Deepak Narayanan, Cliff Young, and Matei Zaharia.
"MegaBlocks: Efficient Sparse Training with Mixture-of-Experts."
Proceedings of MLSys 2023.

Co-authored-by: Junru Shao <[email protected]>
A follow-up of my previous PR (mlc-ai#1529).

This PR makes Mixtral work on the Metal GPUs that macOS comes with.
Honestly, not much change is needed, except that Metal doesn't
support fp64 data types.

A python script to run Mixtral:

```python
from mlc_chat import ChatConfig, ChatModule, callback
from mlc_chat.support import logging
logging.enable_logging()

MODEL = "HF://junrushao/Mixtral-8x7B-Instruct-v0.1-q4f16_1-MLC"
NUM_GPU = 1

def main():
    cm = ChatModule(MODEL, chat_config=ChatConfig(
        sliding_window_size=1024,
        tensor_parallel_shards=NUM_GPU,
    ))
    cm.generate("What is the meaning of life?", progress_callback=callback.StreamToStdout(callback_interval=2))

if __name__ == "__main__":
    main()
```

Quantization formats:
- 3-bit (19.662 GB): ["HF://junrushao/Mixtral-8x7B-Instruct-v0.1-q3f16_1-MLC"](https://huggingface.co/junrushao/Mixtral-8x7B-Instruct-v0.1-q3f16_1-MLC)
- 4-bit (24.466 GB): ["HF://junrushao/Mixtral-8x7B-Instruct-v0.1-q4f16_1-MLC"](https://huggingface.co/junrushao/Mixtral-8x7B-Instruct-v0.1-q4f16_1-MLC)
…i#1555)

We recently noticed that when FlashInfer is not built due to an
unsupported CUDA architecture or platform, running the single-sequence
ChatModule hits a VM function initialization error. The missing
function is used in `create_flashinfer_paged_kv_cache`, which
is never actually invoked in the single-sequence flow.

This happens because the Relax VM eagerly initializes all used PackedFuncs
at the initialization stage (instead of loading them lazily). Therefore, even
when `create_flashinfer_paged_kv_cache` is not invoked, its
PackedFuncs are still looked up, so the issue occurs whenever FlashInfer
is not available.

This PR adds a compiler pass which removes
`create_flashinfer_paged_kv_cache` (and also other similar functions
that may be introduced in the future) based on the target. This
pass can effectively address the issue.
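
Conceptually, the pass filters functions out of the module before the VM ever sees them. The sketch below models this with a plain dictionary standing in for the `IRModule`. The registry `FLASHINFER_ONLY`, the function `prune_unsupported`, and the boolean flag are hypothetical; the real pass works on TVM data structures and inspects the build target.

```python
from typing import Callable, Dict, Set

# Hypothetical registry of target-specific functions to drop.
FLASHINFER_ONLY: Set[str] = {"create_flashinfer_paged_kv_cache"}


def prune_unsupported(
    module: Dict[str, Callable], flashinfer_available: bool
) -> Dict[str, Callable]:
    """Return a copy of `module` without functions the target cannot support."""
    if flashinfer_available:
        return dict(module)
    return {name: fn for name, fn in module.items() if name not in FLASHINFER_ONLY}
```

Since the removed function no longer exists in the module, an eager lookup at VM initialization has nothing to resolve and therefore cannot fail on the missing FlashInfer symbols.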
@masahi masahi merged commit 5efaa53 into octoml:batch-serving Jan 9, 2024
1 check passed