Releases: huggingface/optimum-habana

v1.7: Llama 2, Falcon, LoRA, Transformers v4.31, SynapseAI v1.11

17 Aug 11:20

Transformers v4.31

Transformers v4.31 (latest stable release) is fully supported.

SynapseAI v1.11

SynapseAI v1.11 (latest stable release) is fully supported.

Optimizations for Llama 2, Falcon, StarCoder, OPT, GPT-NeoX, CodeGen

Torch Autocast

⚠️ Habana Mixed Precision is deprecated and will be removed in SynapseAI v1.12.
Torch Autocast is becoming the default for managing mixed-precision runs.

Improved text-generation example

LoRA examples

Two new LoRA examples for fine-tuning and inference.

LDM3D

New Stable Diffusion pipeline that generates both images and depth maps.

Added support for Text Generation Inference (TGI)

TGI is now supported on Gaudi.

GaudiGenerationConfig

Transformers' GenerationConfig has been extended to be fully compatible with Gaudi. It adds two fields to better control generation with static shapes.
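As a rough illustration of why static shapes matter during generation, the sketch below pads a growing token sequence to a fixed length so that every decoding step sees the same shape. It is not the GaudiGenerationConfig API: the function name, the pad token id and the maximum length are all hypothetical, and the two new fields are not named in these notes.

```python
def pad_to_static_length(token_ids, max_length, pad_token_id=0):
    """Return a copy of token_ids padded on the right to exactly max_length."""
    if len(token_ids) > max_length:
        raise ValueError("sequence is longer than the static length")
    return token_ids + [pad_token_id] * (max_length - len(token_ids))


# Hypothetical decoding loop: the real sequence grows by one token per step,
# but the padded buffer handed to the model always has the same length.
sequence = [101, 7592]
observed_lengths = set()
for new_token in [2088, 999, 102]:
    sequence.append(new_token)
    padded = pad_to_static_length(sequence, max_length=8)
    observed_lengths.add(len(padded))

print(observed_lengths)  # a single length means a single shape
```

Keeping one shape across steps is what avoids the extra graph compilations mentioned elsewhere in these notes for lazy mode.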

Various fixes and improvements

v1.6: Fast DDP, Torch Autocast, SynapseAI v1.10 and various model optimizations

26 Jun 09:41

Fast DDP

A new distribution strategy is introduced. It is lighter, simpler and usually faster than Torch DDP. You can enable it in your runs with --distribution_strategy fast_ddp.

Torch Autocast

It is now possible to use Torch Autocast as mixed precision backend. You can easily enable it in your runs with --bf16 (i.e. exactly like in Transformers).

SynapseAI v1.10

This release is fully compatible with SynapseAI v1.10.0.

HPU graphs for training

You can now use HPU graphs for training your models.

Check out the documentation for more information.

Various model optimizations

Asynchronous data copy

You can now enable asynchronous data copy between the host and devices during training using --non_blocking_data_copy.

  • Enable asynchronous data copy to get a better performance #211 @jychen-habana

Check out the documentation for more information.

Profiling

It is now possible to profile your training runs with GaudiTrainer. To do so, pass --profiling_steps N and --profiling_warmup_steps K.
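The following sketch shows one plausible reading of how the two arguments interact. It assumes that the first K steps are skipped as warmup and that the next N steps are recorded; the function name and the 0-indexed step convention are hypothetical and the actual scheduling in GaudiTrainer may differ.

```python
def is_profiled_step(step, warmup_steps, profiling_steps):
    """Return True if the 0-indexed step falls inside the profiled window.

    Assumption: warmup steps come first and are not recorded, then
    profiling_steps consecutive steps are recorded.
    """
    return warmup_steps <= step < warmup_steps + profiling_steps


# With K = 3 warmup steps and N = 2 profiled steps, steps 3 and 4 are recorded.
profiled = [s for s in range(10) if is_profiled_step(s, warmup_steps=3, profiling_steps=2)]
print(profiled)  # [3, 4]
```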

Adjusted throughput calculation

You can now let the GaudiTrainer compute the real throughput of your run (i.e. not counting the time spent while logging, evaluating and saving the model) with --adjust_throughput.

  • Added an option to remove save checkpoint time from throughput calculation #237 @libinta
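To make the difference concrete, here is a minimal sketch of the arithmetic, assuming the adjusted figure is simply the number of samples divided by the wall-clock time minus the time spent logging, evaluating and saving. The function name and the numbers are hypothetical and this is not the GaudiTrainer implementation.

```python
def adjusted_throughput(num_samples, total_seconds, excluded_seconds):
    """Samples per second over the time actually spent on training steps."""
    training_seconds = total_seconds - excluded_seconds
    if training_seconds <= 0:
        raise ValueError("excluded time must be smaller than total time")
    return num_samples / training_seconds


# Hypothetical run: 6400 samples in 100 s, of which 20 s went to
# logging, evaluation and checkpoint saving.
raw = 6400 / 100.0
adjusted = adjusted_throughput(6400, 100.0, 20.0)
print(raw, adjusted)  # 64.0 80.0
```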

Check SynapseAI version at import

A check is performed when importing optimum.habana to let you know if you are running the version of SynapseAI for which Optimum Habana has been tested.

  • Check Synapse version when optimum.habana is used #225 @regisss
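A check of this kind could look like the sketch below. It assumes that versions are compared at major.minor granularity and that a mismatch produces a warning rather than an error; both are assumptions, as is the function name, and the way the library detects the installed version is not shown.

```python
import warnings


def check_synapse_version(installed, tested):
    """Warn if the installed major.minor differs from the tested one.

    Returns True when the versions match at major.minor granularity.
    """
    installed_mm = tuple(installed.split(".")[:2])
    tested_mm = tuple(tested.split(".")[:2])
    if installed_mm != tested_mm:
        warnings.warn(
            f"Tested with SynapseAI v{tested} but v{installed} is installed; "
            "this may lead to undefined behavior."
        )
        return False
    return True


# A patch-level difference passes; a minor-level difference warns.
print(check_synapse_version("1.10.1", "1.10.0"))  # True
print(check_synapse_version("1.9.0", "1.10.0"))   # False, with a warning
```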

Enhanced examples

Several examples have been added or improved. You can find them here.

  • the text-generation example now supports sampling and beam search decoding, and full bf16 generation #218 #229 #238 #251 #258 #271
  • the contrastive image-text example now supports HPU-accelerated data loading #256
  • new Seq2Seq QA example #221
  • new protein folding example with ESMFold #235 #276

v1.5: BLOOM(Z), SynapseAI v1.9.0 and various speedups

17 Apr 18:07

BLOOM(Z)

BLOOM is introduced in this release with HPU-optimized tweaks to perform fast inference using DeepSpeed. A text-generation example is provided here so that you can easily try it.

  • Add text-generation example for BLOOM/BLOOMZ with DeepSpeed-inference #190 @regisss

Check out the blog post we recently released for a benchmark comparing BLOOMZ performance on Gaudi2 and A100.

SynapseAI v1.9.0

This release is fully compatible with SynapseAI v1.9.0.

Transformers v4.28 and Diffusers v0.15

This release is fully compatible with the recently released Transformers v4.28 and Diffusers v0.15.

Improved data sampling for training in lazy mode

This release ensures that all batches have the same size in lazy mode, which prevents extra graph compilations.

  • Improve data sampling for training in lazy mode #152 @regisss
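One simple way to guarantee equal batch sizes is to drop the trailing partial batch, as in the sketch below. This only illustrates the idea; the notes do not say which strategy the library uses, and the function name is hypothetical.

```python
def fixed_size_batches(indices, batch_size):
    """Yield only full batches, dropping the trailing partial batch."""
    full = len(indices) - len(indices) % batch_size
    for start in range(0, full, batch_size):
        yield indices[start:start + batch_size]


# 10 samples with a batch size of 4: the last 2 samples are dropped so that
# every batch has the same shape and no new graph has to be compiled.
batches = list(fixed_size_batches(list(range(10)), batch_size=4))
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```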

HPU graphs for distributed runs and generation

This release enables HPU graphs for distributed runs and text generation.

  • Enable HPU graphs for distributed runs and generation #179 @regisss

Recommend dataloader_num_workers for CV model training

The ViT and Swin examples have been updated to set dataloader_num_workers, which speeds up training.

  • Adding dataloader_num_workers into example command for better performance #188 @ZhaiFeiyue

Pipelining of forward and backward passes

The argument pipelining_fwd_bwd triggers the HPU computation of the forward pass while the CPU interprets the backward pass. This speeds up CV models.

  • Add mark_step between fwd and bwd for better performance #189 @ZhaiFeiyue

More information in the documentation.

v1.4: multi-node training and inference mode

12 Feb 23:05

Multi-node training

This release adds support for multi-node training through DeepSpeed. You can now scale out to thousands of nodes to speed up your training runs even more!

  • Add support for multi-node training #116

Check out the documentation to get started.

Inference through HPU graphs

You can now perform inference faster on Gaudi with HPU graphs.

  • Add support for inference through HPU graphs in GaudiTrainer #151

HPU graphs are currently only supported for single-device runs. Check out the documentation for more information.

SynapseAI 1.8

This release is fully compatible with SynapseAI 1.8.0, which is the latest version. Check out Habana's documentation for more information about the new features.

DeepSpeed's gradient checkpointing

DeepSpeed's gradient checkpointing is now automatically used when setting gradient_checkpointing=True in a DeepSpeed run.

  • Enable DeepSpeed activation checkpointing #142

v1.3: Stable Diffusion and Wav2Vec2

01 Dec 10:32

Stable Diffusion

This release adds a new interface for the 🤗 Diffusers library that supports the Stable Diffusion pipeline for inference. You can now generate images from text on Gaudi with the user-friendliness of 🤗 Diffusers.

  • Add support for Stable Diffusion #131

Check out the documentation and this example for more information.

Wav2Vec2

After text and image models, a third modality is now supported with the addition of Wav2Vec2.

  • Add support for Wav2Vec2 #120

Check out the audio classification and speech recognition examples to see how to use it.

SynapseAI 1.7

This release is fully compatible with SynapseAI 1.7.0, which is the latest version. Check out Habana's documentation for more information about the new features.

Memory stats

Memory stats are now logged every logging_steps steps to give more information about the memory consumption of HPUs.

  • Memory stats #89

DeepSpeed demo notebook with GPT2-XL

This repository now has a notebook showing how to use DeepSpeed to pre-train/fine-tune GPT2-XL on Gaudi. You can find it here.

  • Add DeepSpeed demo notebook #112

Fix gradient checkpointing for BERT/RoBERTa/ALBERT

An error used to be raised by PyTorch when running BERT-like models with gradient checkpointing. This has been fixed.

  • Fix gradient checkpointing for BERT/RoBERTa/ALBERT #118

v1.2: DeepSpeed and CV Models

12 Sep 09:19

DeepSpeed

This release brings support for DeepSpeed. It is now possible to train bigger models on Gaudi with Optimum Habana!

  • Add support for DeepSpeed #93

Check the documentation here to learn how to use it.

Computer Vision Models

Two computer-vision models have been validated for image classification in both single- and multi-card configurations.

You can see how to use them in this example.

SynapseAI 1.6.0

This release is fully compatible with SynapseAI 1.6.0.

  • Update to SynapseAI 1.6.0 #91

It is recommended to use SynapseAI 1.6.0 for optimal performance.

Documentation

Optimum Habana now has dedicated documentation. You can find it here.

It shows how to quickly make a Transformers-based script work with the library. It also contains guides explaining how to do distributed training, how to use DeepSpeed or how to make the most of HPUs to accelerate training.

Masked Language Modeling

A new example script has been added to perform masked language modeling. This is especially useful if you want to pretrain models such as BERT or RoBERTa.

  • Add run_mlm.py in language-modeling examples #83

v1.1.2: Patch Release

12 Aug 08:23

This patch release fixes a bug where processes could be initialized multiple times in distributed mode, leading to an error.

v1.1.1: Patch Release

02 Aug 07:44

This patch release fixes a bug where the loss is equal to NaN from the first training iteration with Transformers 4.21.0.

v1.1.0: GPT2, T5 and SynapseAI 1.5.0

15 Jul 10:33

GPT2

You can now train or fine-tune GPT2 for causal language modeling on up to 8 HPUs. An example of fine-tuning on WikiText-2 is provided here.

  • Add support for language modeling (GPT2) #52

You can also use GPT2 for text generation in lazy mode.

  • Accelerate generation #61

T5

Encoder-decoder architectures are now supported. In particular, examples relying on T5 for the following tasks are available:

  • summarization, with an example of fine-tuning T5 on the CNN/DailyMail dataset,
  • translation, with an example of fine-tuning T5 on the WMT16 dataset for translating English to Romanian.

You can also use T5 for text generation in lazy mode.

  • Accelerate generation #61

Support for SynapseAI 1.5.0

The newly released SynapseAI 1.5.0 is now supported. You can find more information about it here.

  • Add support for SynapseAI 1.5.0 #65

This is a breaking change: you must update your version of SynapseAI as described here in order to use this new release.

GaudiConfig instantiation is not mandatory anymore

If the name of your Gaudi configuration is given in the training arguments, you do not have to instantiate it and provide it to the trainer anymore. This will be automatically taken care of. You can still instantiate a Gaudi configuration and provide it to the trainer.

  • Enable GaudiConfig instantiation from inside the trainer #55

Refined throughput computation in lazy mode

In lazy mode, the first two steps are warmup steps used for graph compilation. In order to discard them from the throughput computation, you can just add the following training argument: --throughput_warmup_steps 2.

  • Add a new argument for taking warmup steps into account in throughput computation #48
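The effect of discarding warmup steps can be sketched as follows. The per-step durations and the function name are hypothetical and the library's actual bookkeeping may differ; the point is only that slow compilation steps are left out of both the sample count and the elapsed time.

```python
def throughput(step_durations, batch_size, warmup_steps=0):
    """Samples per second, ignoring the first warmup_steps steps."""
    measured = step_durations[warmup_steps:]
    if not measured:
        raise ValueError("no steps left after discarding warmup steps")
    return batch_size * len(measured) / sum(measured)


# Hypothetical run: the first two steps are slow because of graph compilation.
durations = [12.0, 8.0, 0.5, 0.5, 0.5, 0.5]
naive = throughput(durations, batch_size=32)
refined = throughput(durations, batch_size=32, warmup_steps=2)
print(round(naive, 2), refined)  # 8.73 64.0
```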

Optimum Habana v1

26 Apr 10:45
20e83cd

With this release, we enable easy and fast deployment of models from the Transformers library on Habana Gaudi Processors (HPU).

  • The class GaudiTrainer is built on top of the original class Trainer and lets you train and evaluate models from the Transformers library on HPUs.
  • The class GaudiTrainingArguments is built on top of the original class TrainingArguments and adds 3 new arguments:
    • use_habana to deploy on HPU
    • use_lazy_mode to use lazy mode instead of eager mode
    • gaudi_config_name to specify the name of or the path to the Gaudi configuration file
  • The class GaudiConfig lets you specify a configuration for deployment on HPU, such as the use of Habana Mixed Precision, the use of custom ops, etc.
  • Multi-card deployment is enabled
  • Examples are provided for question answering and text classification in both single- and multi-card settings.
  • The following models have been validated:
    • BERT base/large
    • RoBERTa base/large
    • ALBERT large/XXL
    • DistilBERT