Olive-ai 0.4.0
Examples
The following examples were added:
- Llama2 optimization with ONNX Runtime Tools #641
- Llama2 finetuning with QLoRA and optimization with ONNX Runtime Tools #703
- Llama2 shard to multiple GPUs #694
- DirectML Llama2 #701
- DirectML phi #693
- phi-1.5 finetuning with QLoRA #689
Passes (optimization techniques)
- OrtPerfTuning
  - Raises known failure exceptions to stop tuning immediately.
  - Default values for device and providers_list are based on the accelerator spec.
- OrtTransformersOptimization
  - Checks that model_type is provided in the pass config or available in the model attributes; None is invalid.
  - fp16-related arguments are better documented.
- Introduce LoRA pass for finetuning pytorch models with Low-Rank Adaptation
- Introduce OnnxMatMul4Quantizer pass to quantize onnx models to 4-bit integers.
- Introduce OnnxBnb4Quantization pass to quantize onnx models to 4-bit data types from bitsandbytes (FP4, NF4).
- Onnx external data configuration supports the size_threshold and convert_attribute parameters.
- LlamaPyTorchTensorParallel pass to split a Llama model into a tensor-parallel distributed pytorch model.
- OnnxConversion
  - Support DistributedPyTorchModel.
  - use_device and torch_dtype options to specify device ("cpu", "cuda") and data type ("float16", "float32") for the model before conversion.
- DeviceSpecificOnnxConversion removed in favor of the OnnxConversion pass with the use_device option.
- LoRA/QLoRA
  - Support training using ONNX Runtime Training.
  - Mixed-precision training when torch_dtype=float16, for numerical stability.
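The LoRA pass applies Low-Rank Adaptation: the pretrained weight W stays frozen and only a low-rank update, scaled by alpha / r, is trained. The plain-Python sketch below shows the forward computation of one adapted linear layer, y = W x + (alpha / r) B A x. It illustrates the technique only; the function names are hypothetical and this is not Olive's implementation.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, x: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]


def lora_linear(w: Matrix, a: Matrix, b: Matrix, x: Vector,
                alpha: float, r: int) -> Vector:
    """Hypothetical helper computing y = W x + (alpha / r) * B (A x).

    w: frozen pretrained weight, shape (d_out, d_in)
    a: trainable down-projection, shape (r, d_in)
    b: trainable up-projection, shape (d_out, r)
    """
    base = matvec(w, x)                 # frozen path
    update = matvec(b, matvec(a, x))    # low-rank path through rank r
    scale = alpha / r
    return [y0 + scale * u for y0, u in zip(base, update)]


if __name__ == "__main__":
    w = [[1.0, 0.0], [0.0, 1.0]]  # frozen 2x2 identity
    a = [[1.0, 1.0]]              # rank r = 1
    b = [[0.5], [0.0]]
    print(lora_linear(w, a, b, [2.0, 3.0], alpha=2.0, r=1))  # [7.0, 3.0]
```

Because B is commonly initialized to zeros, the adapted layer starts out identical to the frozen one, and only r × (d_in + d_out) parameters per layer are trained.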
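OnnxMatMul4Quantizer stores weights as 4-bit integers. A common way to do this is block-wise quantization: each block of consecutive weights shares one floating-point scale, and values are rounded into the signed 4-bit range [-8, 7]. The sketch below assumes symmetric scaling and a block size of 4 for readability; the block size, zero-point handling and bit packing used by the actual pass may differ, and the function names are hypothetical.

```python
from typing import List, Tuple


def quantize_int4_blockwise(weights: List[float], block_size: int = 4
                            ) -> Tuple[List[int], List[float]]:
    """Symmetric block-wise quantization to signed 4-bit integers.

    Each block of `block_size` weights gets one scale so that the block's
    largest magnitude maps to 7; results are clamped to [-8, 7].
    """
    q: List[int] = []
    scales: List[float] = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        max_abs = max(abs(v) for v in block)
        scale = max_abs / 7.0 if max_abs > 0 else 1.0  # avoid dividing by zero
        scales.append(scale)
        q.extend(max(-8, min(7, round(v / scale))) for v in block)
    return q, scales


def dequantize_int4_blockwise(q: List[int], scales: List[float],
                              block_size: int = 4) -> List[float]:
    """Recover approximate weights: w is roughly q times its block's scale."""
    return [v * scales[i // block_size] for i, v in enumerate(q)]
```

The rounding error per weight is at most half of its block's scale, which is why smaller blocks (more scales) usually trade a little storage for accuracy.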
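OnnxBnb4Quantization targets the bitsandbytes FP4 and NF4 data types, which store each weight as a 4-bit index into a fixed 16-value table after scaling each block of weights by its absolute maximum. The sketch below shows that absmax-plus-nearest-table-entry step, with the table passed in as a parameter. It does not reproduce the actual FP4 or NF4 tables (NF4's values are derived from quantiles of a normal distribution), and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def quantize_codebook_absmax(block: Sequence[float], codebook: Sequence[float]
                             ) -> Tuple[List[int], float]:
    """Normalize a block by its absolute maximum, then store each value as
    the index of the nearest codebook entry (16 entries fit in 4 bits)."""
    absmax = max(abs(v) for v in block) or 1.0  # all-zero block: keep scale 1
    indices = [
        min(range(len(codebook)), key=lambda i: abs(codebook[i] - v / absmax))
        for v in block
    ]
    return indices, absmax


def dequantize_codebook_absmax(indices: Sequence[int], codebook: Sequence[float],
                               absmax: float) -> List[float]:
    """Look each index up in the codebook and undo the absmax scaling."""
    return [codebook[i] * absmax for i in indices]
```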
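LlamaPyTorchTensorParallel splits a Llama model into a tensor-parallel distributed model. The underlying idea for a single linear layer can be illustrated without any GPU: split the weight's output rows across ranks, let each rank compute its slice of the output, and concatenate the slices (an all-gather in a real distributed run). This is a generic sketch with hypothetical helper names; it does not show how the pass actually partitions attention and MLP weights.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, x: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]


def shard_rows(w: Matrix, world_size: int) -> List[Matrix]:
    """Split the output dimension (rows of a (d_out, d_in) weight) evenly
    across ranks; each rank keeps one contiguous shard."""
    if len(w) % world_size != 0:
        raise ValueError("output dimension must be divisible by world_size")
    step = len(w) // world_size
    return [w[r * step:(r + 1) * step] for r in range(world_size)]


def parallel_forward(shards: List[Matrix], x: Vector) -> Vector:
    """Each rank computes its slice of y = W x from the same input x;
    concatenating the slices stands in for the all-gather."""
    partial_outputs = [matvec(shard, x) for shard in shards]  # one per rank
    return [y for part in partial_outputs for y in part]
```

Each rank holds only 1 / world_size of the layer's weights, yet the gathered output equals the unsharded result.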
Engine
- Make engine/evaluator config optional in the olive run config. This way, users can simply run optimization without search or evaluation using a minimal pass config.
- evaluate_input_model is optional in the engine config in no-search mode. It is forced to False when no evaluator is provided.
- ort_py_log_severity_level option to control the logging level for onnxruntime python logs.
- CLI option --tempdir to use a custom directory as the root directory for tempfile.
- IO-Binding:
  - New method to efficiently bind inputs and outputs to the session using either the CPU or GPU depending on the device.
  - shared_kv_buffer option to enable key value buffer sharing between input (past key values) and output (present key values).
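The --tempdir option changes the root directory for temporary files. In the Python standard library, that root is controlled by tempfile.tempdir, which tempfile.gettempdir() and tempfile.mkdtemp() consult when no explicit dir is passed. The sketch below shows that mechanism with a hypothetical helper; whether Olive uses exactly this approach internally is an assumption.

```python
import os
import tempfile


def set_temp_root(path: str) -> None:
    """Hypothetical helper: make `path` the root for all tempfile calls
    that do not pass an explicit `dir` argument."""
    os.makedirs(path, exist_ok=True)
    tempfile.tempdir = path  # read by tempfile.gettempdir(), mkdtemp(), etc.
```

After calling set_temp_root("/data/olive_tmp"), for example, tempfile.mkdtemp() creates its directories under /data/olive_tmp instead of the system default.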
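With shared_kv_buffer, the past and present key/value tensors share one preallocated buffer, so each decoding step writes its new entry in place instead of producing a new, larger present tensor that must be fed back as the next past input. The conceptual sketch below models a single key cache as a Python list of rows; the class is hypothetical and ignores attention heads, batching, device memory and the ONNX Runtime I/O binding calls themselves.

```python
from typing import List


class SharedKVBuffer:
    """Hypothetical key cache reused as both past input and present output."""

    def __init__(self, max_len: int, dim: int) -> None:
        self.dim = dim
        # Allocated once, sized for the maximum sequence length.
        self.buffer: List[List[float]] = [[0.0] * dim for _ in range(max_len)]
        self.length = 0  # number of positions filled so far

    def append(self, key: List[float]) -> None:
        """Write this step's key into the next free slot, in place."""
        if len(key) != self.dim:
            raise ValueError("key has the wrong dimension")
        if self.length == len(self.buffer):
            raise IndexError("KV buffer is full")
        self.buffer[self.length][:] = key  # mutate the existing row
        self.length += 1

    def past(self) -> List[List[float]]:
        """Keys from previous steps. The slice is shallow: it references the
        same row objects rather than copying their contents."""
        return self.buffer[:self.length]
```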
Model
- DistributedOnnxModel file structure updated to use resource paths. It can be saved from the cache to a destination directory.
- Introduce DistributedPyTorchModel that is analogous to DistributedOnnxModel for pytorch models.
- trust_remote_code added to HFConfig model_loading_args.
Metrics
- Option to provide kwargs to user_script functions through func_kwargs.
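func_kwargs lets a metric forward extra keyword arguments to a function defined in user_script. The sketch below shows the general forwarding pattern with a hypothetical dispatcher and a hypothetical accuracy function whose threshold is configurable; the actual positional arguments Olive passes to user_script functions are not shown here.

```python
from typing import Any, Callable, Dict, Optional, Sequence


def call_user_func(func: Callable[..., Any], *args: Any,
                   func_kwargs: Optional[Dict[str, Any]] = None) -> Any:
    """Hypothetical dispatcher: call a user function with fixed positional
    arguments plus any configured keyword arguments."""
    return func(*args, **(func_kwargs or {}))


# Hypothetical user_script function with an extra tunable parameter.
def accuracy(predictions: Sequence[float], labels: Sequence[int],
             threshold: float = 0.5) -> float:
    hits = sum((p >= threshold) == bool(y) for p, y in zip(predictions, labels))
    return hits / len(labels)
```

Without func_kwargs the function falls back to its default threshold; with func_kwargs={"threshold": 0.3} the same function is reused with a different setting.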
Dependencies
- Support onnxruntime 1.16.2