Releases: huggingface/optimum-habana
v1.7: Llama 2, Falcon, LoRA, Transformers v4.31, SynapseAI v1.11
Transformers v4.31
Transformers v4.31 (latest stable release) is fully supported.
SynapseAI v1.11
SynapseAI v1.11 (latest stable release) is fully supported.
Optimizations for Llama 2, Falcon, StarCoder, OPT, GPT-NeoX, CodeGen
- Added support for OPT-66B #285 @ZhaiFeiyue
- Llama #296 @yeonsily
- Improve Llama2 and gpt_neox performance with Habana fused RoPE and RMSNorm #321 @mandy-li
- Enable Falcon-7b #326 @schoi-habana
- Fix inference with Llama-2-70B #342 @regisss
- Add model optimizations for codegen and gpt_bigcode #322 @PhillipHoward
Torch Autocast
Torch Autocast is becoming the default for managing mixed-precision runs.
- Fix autocast for BERT-like models #287 @ANSHUMAN87
- Add support for autocast in gradient checkpointing #307 @regisss
Improved text-generation example
- Added constrained beam search #281 @vivekgoe
- Fix padding error #282 @sywangyi
- Various improvements for faster checkpoint downloading #284 #286 #294 @regisss
- Add deepspeed TP policy for llama #303 @sywangyi
- Add token and model_revision args for the text-generation example #331 @regisss
LoRA examples
Two new LoRA examples for fine-tuning and inference.
LDM3D
New Stable Diffusion pipeline that generates both images and depth maps.
- Support for Ldm3d #304 @estelleafl
Added support for Text Generation Inference (TGI)
TGI is now supported on Gaudi.
GaudiGenerationConfig
Transformers' GenerationConfig has been extended to be fully compatible with Gaudi. It adds two fields to better control generation with static shapes.
Various fixes and improvements
- Fix generation sampling when using repetition_penalty #301 @sywangyi
- Remove kv cache wa #302 @ZhaiFeiyue
- Fix T5 inference performance regression #310 @libinta
- Fix gptj HCCL issue that occurred in DDP #318 @sywangyi
- Revert partially Enable/Optimize flan t5 xxl on deepspeed z3 #320 @hsubramony
- Modify flan-t5 deepspeed configuration #328 @yeonsily
- Add commands for gptj and gptneox #325 @ankurhabana
- Disable FusedRMSNorm for training #343 @hsubramony
- Enable hpu rms fused kernel for t5 #344 @ZhaiFeiyue
- Remove two workarounds on esmfold #334 @bzhu-habana
v1.6: Fast DDP, Torch Autocast, SynapseAI v1.10 and various model optimizations
Fast DDP
A new distribution strategy is introduced. It is lighter, simpler and usually faster than Torch DDP. You can enable it in your runs with --distribution_strategy fast_ddp.
- Improve performance and scalability of BERT FT training #200 @mlapinski-habana
Torch Autocast
It is now possible to use Torch Autocast as the mixed-precision backend. You can enable it in your runs with --bf16 (i.e. exactly like in Transformers).
- Enable usage of PyTorch autocast on Gaudi during training #226 @jwieczorekhabana
- Add Torch autocast and full bf16 to GaudiStableDiffusionPipeline #278 @regisss
SynapseAI v1.10
This release is fully compatible with SynapseAI v1.10.0.
HPU graphs for training
You can now use HPU graphs for training your models.
- Improve performance and scalability of BERT FT training #200 @mlapinski-habana
Check out the documentation for more information.
Various model optimizations
- Update BLOOM modeling for SynapseAI 1.10 #277
- Optimize conv1d forward #231 @ZhaiFeiyue
- Add static key-value cache for OPT, GPT-J, GPT-NeoX #246 #248 #249 @ZhaiFeiyue
- Optimizations for running FLAN T5 with DeepSpeed ZeRO-3 #257 @libinta
Asynchronous data copy
You can now enable asynchronous data copy between the host and devices during training using --non_blocking_data_copy.
- Enable asynchronous data copy to get a better performance #211 @jychen-habana
Check out the documentation for more information.
Profiling
It is now possible to profile training runs that rely on GaudiTrainer. You will need to pass --profiling_steps N and --profiling_warmup_steps K.
- Enable profiling #250 @ZhaiFeiyue
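As a rough illustration, the v1.6 command-line flags mentioned in these notes can be assembled into one launch command. This is only a sketch: the script name run_example.py and the values chosen for N and K are placeholders, and the way processes are launched across several cards is left out.

```python
import shlex

# Placeholder script name: substitute the example script you actually run.
SCRIPT = "run_example.py"


def build_command(profiling_steps, profiling_warmup_steps):
    """Assemble a launch command that uses flags named in the v1.6 notes."""
    args = [
        "python", SCRIPT,
        "--distribution_strategy", "fast_ddp",  # Fast DDP
        "--bf16",                               # Torch Autocast
        "--non_blocking_data_copy",             # asynchronous host/device copy
        "--profiling_steps", str(profiling_steps),
        "--profiling_warmup_steps", str(profiling_warmup_steps),
    ]
    # shlex.join quotes each argument so the string is safe to paste in a shell.
    return shlex.join(args)


command = build_command(profiling_steps=3, profiling_warmup_steps=2)
print(command)
```

Building the command as a list first keeps each flag next to its value, which makes it easy to drop the flags you do not need.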
Adjusted throughput calculation
You can now let the GaudiTrainer compute the real throughput of your run (i.e. not counting the time spent while logging, evaluating and saving the model) with --adjust_throughput.
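The idea behind the adjusted figure can be sketched in a few lines. This is not the GaudiTrainer implementation, only an illustration of the arithmetic; the function name and the numbers are hypothetical.

```python
def adjusted_throughput(num_samples, total_seconds, excluded_seconds):
    """Samples per second, ignoring time spent outside of training steps."""
    training_seconds = total_seconds - excluded_seconds
    if training_seconds <= 0:
        raise ValueError("excluded time must be smaller than total time")
    return num_samples / training_seconds


# Hypothetical run: 6400 samples in 100 s, of which 20 s went to
# logging, evaluation and checkpoint saving.
raw = 6400 / 100.0
adjusted = adjusted_throughput(6400, 100.0, 20.0)
print(raw, adjusted)
```

With these made-up numbers the adjusted value is higher than the raw one, because the 20 seconds of bookkeeping no longer count against the training loop.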
Check SynapseAI version at import
A check is performed when importing optimum.habana to let you know if you are running the version of SynapseAI for which Optimum Habana has been tested.
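A check of this kind could look roughly like the sketch below. The function name, the comparison on major.minor only, and the warning text are all assumptions made for illustration; how the library actually detects the installed SynapseAI version is not shown here.

```python
import warnings


def check_synapse_version(installed, tested):
    """Warn if the installed major.minor version differs from the tested one.

    Returns True when the versions match and False otherwise.
    """
    installed_mm = tuple(installed.split(".")[:2])
    tested_mm = tuple(tested.split(".")[:2])
    if installed_mm != tested_mm:
        warnings.warn(
            f"Tested with SynapseAI v{tested}, but v{installed} is installed."
        )
        return False
    return True


print(check_synapse_version("1.10.0", "1.10.0"))
```

Warning instead of raising keeps the import usable on an untested version while still telling the user about the mismatch.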
Enhanced examples
Several examples have been added or improved. You can find them here.
v1.5: BLOOM(Z), SynapseAI v1.9.0 and various speedups
BLOOM(Z)
BLOOM is introduced in this release with HPU-optimized tweaks to perform fast inference using DeepSpeed. A text-generation example is provided here so that you can easily try it.
Check out the blog post we recently released for a benchmark comparing BLOOMZ performance on Gaudi2 and A100.
SynapseAI v1.9.0
This release is fully compatible with SynapseAI v1.9.0.
Transformers v4.28 and Diffusers v0.15
This release is fully compatible with the recently released Transformers v4.28 and Diffusers v0.15.
Improved data sampling for training in lazy mode
This release ensures that all batches have the same size in lazy mode, which prevents extra graph compilations.
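To see why equal batch sizes matter, consider the sketch below, which produces index batches of one fixed size. Dropping the trailing partial batch is only one possible strategy and is an assumption of this example, not a description of how the library's sampler works.

```python
def fixed_size_batches(num_samples, batch_size):
    """Yield index batches that all have exactly batch_size elements.

    A trailing partial batch is dropped so that every step sees the same
    input shape.
    """
    full = num_samples - num_samples % batch_size
    for start in range(0, full, batch_size):
        yield list(range(start, start + batch_size))


batches = list(fixed_size_batches(num_samples=10, batch_size=4))
print(batches)  # the last 2 samples do not form a full batch and are skipped
```

Since every batch has the same shape, a graph compiled for the first step can be reused for all the following ones.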
HPU graphs for distributed runs and generation
This release enables HPU graphs for distributed runs and text generation.
Recommend dataloader_num_workers for CV model training
The ViT and Swin examples have been updated to set the dataloader_num_workers argument, which speeds up training.
- Adding dataloader_num_workers into example command for better performance #188 @ZhaiFeiyue
Pipelining of forward and backward passes
The argument pipelining_fwd_bwd triggers the HPU computation of the forward pass while the CPU interprets the backward pass. This speeds up CV models.
- Add mark_step between fwd and bwd for better performance #189 @ZhaiFeiyue
More information in the documentation.
v1.4: multi-node training and inference mode
Multi-node training
This release adds support for multi-node training through DeepSpeed. This lets you scale out to thousands of nodes to speed up your training runs even more!
- Add support for multi-node training #116
Check out the documentation to get started.
Inference through HPU graphs
You can now perform inference faster on Gaudi with HPU graphs.
- Add support for inference through HPU graphs in GaudiTrainer #151
HPU graphs are currently only supported for single-device runs. Check out the documentation for more information.
SynapseAI 1.8
This release is fully compatible with SynapseAI 1.8.0, which is the latest version. Check out Habana's documentation for more information about the new features.
DeepSpeed's gradient checkpointing
DeepSpeed's gradient checkpointing is now automatically used when setting gradient_checkpointing=True in a DeepSpeed run.
- Enable DeepSpeed activation checkpointing #142
v1.3: Stable Diffusion and Wav2Vec2
Stable Diffusion
This release adds a new interface for the 🤗 Diffusers library that supports the Stable Diffusion pipeline for inference. You can now generate images from text on Gaudi while keeping the user-friendliness of 🤗 Diffusers.
- Add support for Stable Diffusion #131
Check out the documentation and this example for more information.
Wav2Vec2
After text and image models, a third modality is now supported with the addition of Wav2Vec2.
- Add support for Wav2Vec2 #120
Check out the audio classification and speech recognition examples to see how to use it.
SynapseAI 1.7
This release is fully compatible with SynapseAI 1.7.0, which is the latest version. Check out Habana's documentation for more information about the new features.
Memory stats
Memory stats are now logged every logging_steps steps to give more information about the memory consumption of HPUs.
- Memory stats #89
DeepSpeed demo notebook with GPT2-XL
This repository now has a notebook showing how to use DeepSpeed to pre-train/fine-tune GPT2-XL on Gaudi. You can find it here.
- Add DeepSpeed demo notebook #112
Fix gradient checkpointing for BERT/RoBERTa/ALBERT
An error used to be raised by PyTorch when running BERT-like models with gradient checkpointing. This has been fixed.
- Fix gradient checkpointing for BERT/RoBERTa/ALBERT #118
v1.2: DeepSpeed and CV Models
DeepSpeed
This release brings support for DeepSpeed. It is now possible to train bigger models on Gaudi with Optimum Habana!
- Add support for DeepSpeed #93
Check the documentation here to know how to use it.
Computer Vision Models
Two computer-vision models have been validated for performing image classification in both single- and multi-card configurations:
- ViT #80
- Swin
You can see how to use them in this example.
SynapseAI 1.6.0
This release is fully compatible with SynapseAI 1.6.0.
- Update to SynapseAI 1.6.0 #91
It is recommended to use SynapseAI 1.6.0 for optimal performance.
Documentation
Optimum Habana now has dedicated documentation. You can find it here.
It shows how to quickly make a Transformers-based script work with the library. It also contains guides explaining how to do distributed training, how to use DeepSpeed or how to make the most of HPUs to accelerate training.
Masked Language Modeling
A new example script has been added to perform masked language modeling. This is especially useful if you want to pretrain models such as BERT or RoBERTa.
- Add run_mlm.py in language-modeling examples #83
v1.1.2: Patch Release
This patch release fixes a bug where it is possible to initialize processes multiple times in distributed mode, leading to an error.
v1.1.1: Patch Release
This patch release fixes a bug where the loss is equal to NaN from the first training iteration with Transformers 4.21.0.
v1.1.0: GPT2, T5 and SynapseAI 1.5.0
GPT2
You can now train or fine-tune GPT2 for causal language modeling on up to 8 HPUs. An example of fine-tuning on WikiText-2 is provided here.
- Add support for language modeling (GPT2) #52
You can also use GPT2 for text generation in lazy mode.
- Accelerate generation #61
T5
Encoder-decoder architectures are now supported. In particular, examples relying on T5 for the following tasks are available:
- summarization, with an example of fine-tuning T5 on the CNN/DailyMail dataset,
- translation, with an example of fine-tuning T5 on the WMT16 dataset for translating English to Romanian.
You can also use T5 for text generation in lazy mode.
- Accelerate generation #61
Support for SynapseAI 1.5.0
The newly released SynapseAI 1.5.0 is now supported. You can find more information about it here.
- Add support for SynapseAI 1.5.0 #65
This is a breaking change: you should update your version of SynapseAI as described here in order to use this new release.
GaudiConfig instantiation is not mandatory anymore
If the name of your Gaudi configuration is given in the training arguments, you do not have to instantiate it and provide it to the trainer anymore. This will be automatically taken care of. You can still instantiate a Gaudi configuration and provide it to the trainer.
- Enable GaudiConfig instantiation from inside the trainer #55
Refined throughput computation in lazy mode
In lazy mode, the first two steps are warmup steps used for graph compilation. In order to discard them from the throughput computation, you can just add the following training argument: --throughput_warmup_steps 2.
- Add a new argument for taking warmup steps into account in throughput computation #48
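The effect of discarding warmup steps can be illustrated with a small sketch. The function and the per-step durations below are hypothetical and only show the arithmetic, not the trainer's actual code.

```python
def throughput(step_times, batch_size, warmup_steps=0):
    """Samples per second computed over the steps after the warmup ones."""
    measured = step_times[warmup_steps:]
    if not measured:
        raise ValueError("no steps left after discarding warmup steps")
    return batch_size * len(measured) / sum(measured)


# Hypothetical per-step durations in seconds: the first two steps are slow
# because they include graph compilation.
step_times = [5.0, 5.0, 0.5, 0.5, 0.5, 0.5]
with_warmup = throughput(step_times, batch_size=8)
without_warmup = throughput(step_times, batch_size=8, warmup_steps=2)
print(with_warmup, without_warmup)
```

In this made-up run, the two compilation steps dominate the total time, so leaving them in would understate the steady-state throughput.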
Optimum Habana v1
With this release, we enable easy and fast deployment of models from the Transformers library on Habana Gaudi Processors (HPU).
- The class GaudiTrainer is built on top of the original class Trainer and enables training and evaluating models from the Transformers library on HPUs.
- The class GaudiTrainingArguments is built on top of the original class TrainingArguments and adds 3 new arguments:
  - use_habana to deploy on HPU
  - use_lazy_mode to use lazy mode instead of eager mode
  - gaudi_config_name to specify the name of or the path to the Gaudi configuration file
- The class GaudiConfig specifies a configuration for deployment on HPU, such as the use of Habana Mixed Precision, the use of custom ops, etc.
- Multi-card deployment is enabled
- Examples are provided for question answering and text classification in both single- and multi-card settings.
- The following models have been validated:
  - BERT base/large
  - RoBERTa base/large
  - ALBERT large/XXL
  - DistilBERT
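As a minimal sketch, the three arguments added by GaudiTrainingArguments could be passed as shown below. The helper names and the placeholder values are hypothetical. The arguments are first gathered in a plain dictionary so that the snippet runs without the library; make_training_args is defined but not called here, since it needs optimum-habana to be installed.

```python
def gaudi_training_kwargs(output_dir, gaudi_config_name):
    """Keyword arguments that turn on HPU execution in lazy mode."""
    return {
        "output_dir": output_dir,  # standard TrainingArguments field
        "use_habana": True,        # deploy on HPU
        "use_lazy_mode": True,     # lazy mode instead of eager mode
        "gaudi_config_name": gaudi_config_name,
    }


def make_training_args(output_dir, gaudi_config_name):
    """Build GaudiTrainingArguments; needs optimum-habana to be installed."""
    from optimum.habana import GaudiTrainingArguments  # imported lazily

    return GaudiTrainingArguments(
        **gaudi_training_kwargs(output_dir, gaudi_config_name)
    )


# "./output" and "path/to/gaudi_config.json" are placeholder values.
kwargs = gaudi_training_kwargs("./output", "path/to/gaudi_config.json")
print(kwargs)
```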