This repository has been archived by the owner on Oct 25, 2024. It is now read-only.

Intel® Extension for Transformers v1.2 Release

@kevinintel kevinintel released this 26 Sep 18:53
· 1052 commits to main since this release
8fbcceb

Highlights
Features
Productivity
Examples
Bug Fixing
API Modification
Documentation

Highlights

  • NeuralChat was showcased in the Intel Innovation '23 keynote and at Google Cloud Next '23 to demonstrate GenAI/LLM capabilities on Intel Xeon Scalable Processors. The chatbot solution has been integrated into LLM as a service (LLMaaS), providing a smooth user experience for building GenAI/LLM applications on the latest Intel Xeon Scalable Processors and Gaudi2 from Intel Developer Cloud.
  • NeuralChat offers a comprehensive pipeline for building end-to-end chatbot applications, with a rich set of pluggable features such as speech cloning and interaction (EN/CN), knowledge retrieval, query caching, and security guardrails. These features let you create a custom chatbot from scratch within minutes, significantly improving chatbot development productivity.
  • LLM runtime extends the Transformers API to provide seamless weight-only low-precision inference for Hugging Face transformer-based models, including LLMs. This release improves LLM runtime with more comprehensive kernel support for low precisions (INT8/FP8/INT4/FP4/NF4) while keeping full compatibility with GGML. LLM runtime delivers 25 ms/token with INT4 Llama and 22 ms/token with INT4 GPT-J on Intel Xeon Scalable Processors, providing a complementary and highly optimized LLM runtime solution for Intel architectures.
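To put the per-token latency figures above in throughput terms, here is a small arithmetic sketch. It assumes the numbers describe a single generation stream, which the release notes do not state; the function name is illustrative only.

```python
def tokens_per_second(ms_per_token: float) -> float:
    """Convert a per-token latency in milliseconds to tokens per second."""
    return 1000.0 / ms_per_token


# Latency figures quoted in the highlights (single-stream is an assumption).
for name, latency_ms in [("INT4 Llama", 25.0), ("INT4 GPT-J", 22.0)]:
    rate = tokens_per_second(latency_ms)
    print(f"{name}: {latency_ms:.0f} ms/token is about {rate:.1f} tokens/s")
```

Under that assumption, 25 ms/token corresponds to 40 tokens per second, and 22 ms/token to roughly 45 tokens per second.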

Features

  • Neural Chat
    • Support ASR/TTS on CPU and HPU (fb619e5 56685a)
    • Add Docker support for the chatbot on Xeon SPR and Habana Gaudi (59fc92e ad2ee1)
    • Refine the chatbot workflow and use the NeuralChat API (53bed4 e95fc32)
    • Implement the Python SDK API, weight-only quantization, and AMP for NeuralChat (08ba5d85)
  • Model Optimization
    • Add GPTQ/TEQ/WOQ quantization with plenty of examples (b4b2fcc 1bcab14)
    • Enhance the ITREX quantization API and LLM Runtime; users can now obtain a quantized model using AutoModelForCausalLM.from_pretrained (be651b f4dc78)
    • Support GPT-J pruning (802ec0d2)
  • LLM Runtime
    • Enable FFN fusion for LLMs (277108)
    • Enable tensor parallelism for 4-bit GPT-J on 2 sockets (fe0d65c)
    • Implement AMX INT8/BF16 MHA (c314d6c)
    • Support asymmetric models in LLM Runtime (93ca55)
    • Jblas and Qbits support nf4 and s4-fullrange weight compression (ff7af86)
    • Enhance beam-search early-stopping mechanisms (cd4c33d)
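Several items above concern 4-bit weight-only quantization in symmetric and asymmetric forms. As a rough, framework-free illustration of the idea (not the ITREX, Jblas, or Qbits kernels, whose exact rounding and range conventions are not described in these notes), the sketch below quantizes each weight row, treated as one output channel, with its own scale. All function names are hypothetical.

```python
from typing import List, Tuple

Row = List[float]


def quantize_row(row: Row, asymmetric: bool = False) -> Tuple[List[int], float, int]:
    """Quantize one output channel (one weight row) to 4 bits.

    Returns (codes, scale, zero_point). Symmetric mode uses signed codes in
    [-8, 7] with zero_point 0; asymmetric mode uses unsigned codes in [0, 15].
    """
    if asymmetric:
        qmin, qmax = 0, 15
        # The represented range is widened to contain 0 so that 0.0 maps to a code.
        lo, hi = min(min(row), 0.0), max(max(row), 0.0)
        scale = (hi - lo) / (qmax - qmin) if hi > lo else 1.0
        zero_point = max(qmin, min(qmax, round(-lo / scale)))
    else:
        qmin, qmax = -8, 7
        max_abs = max(abs(w) for w in row)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        zero_point = 0
    codes = [max(qmin, min(qmax, round(w / scale) + zero_point)) for w in row]
    return codes, scale, zero_point


def dequantize_row(codes: List[int], scale: float, zero_point: int) -> Row:
    """Map 4-bit codes back to approximate floating-point weights."""
    return [(c - zero_point) * scale for c in codes]


def max_abs_error(row: Row, asymmetric: bool) -> float:
    """Largest reconstruction error after a quantize/dequantize round trip."""
    codes, scale, zero_point = quantize_row(row, asymmetric)
    restored = dequantize_row(codes, scale, zero_point)
    return max(abs(a - b) for a, b in zip(row, restored))


# Two toy output channels; each row gets its own scale (per-channel).
weights = [[0.7, -0.35, 0.1, 0.0], [0.2, 0.5, 0.8, 1.1]]
for i, row in enumerate(weights):
    sym = max_abs_error(row, asymmetric=False)
    asym = max_abs_error(row, asymmetric=True)
    print(f"channel {i}: symmetric err={sym:.4f}, asymmetric err={asym:.4f}")
```

Only the integer codes plus one scale (and, in asymmetric mode, one zero point) per channel need to be stored, which is where the memory saving of weight-only quantization comes from.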

Productivity

  • ITREX has moved to fully public development; contributions are welcome. (90ca31)
  • Support streaming mode for NeuralChat (f5892ec)
  • Support Direct Preference Optimization to improve accuracy (50b5b9)
  • Support query cache for chatbot (1b4463)
  • Weight-only support for the PyTorch framework (3a064fa)
  • Provide mixed INT8 and BF16 inference mode for Stable Diffusion (bd2973)
  • Support Stable Diffusion v1.4/v1.5/v2.1 and QAT inference on Linux and Windows (02cc59)
  • Update oneDNN to v3.3 (e6d8a4)
  • Weight-only kernels support INT8 quantization (6ce8b13)
  • Enable a flash-attention-like kernel in weight-only (0ef3942)
  • ISA-based dispatcher for weight-only kernels (ff7af86)
  • Support 4-bit per-channel quantization (4e164a8)

Examples

Bug Fixing

  • Fix issues reported by Cobalt, a third-party company hired by Intel to perform penetration testing (51a1b88)
  • Fix Windows compile issues (bffa1b0)
  • Fix ordinals and conjunctions in the TTS normalizer (0892f8a)
  • Fix Habana fine-tuning issues (2bbcf51)
  • Fix bugs in the RAG code for converting the prompt (bfad5c)
  • Fix normalizer: year, punctuation after number, end token (775a12)
  • Fix graph model quantization on AVX2-only platforms (3c84ec6)

API Modification

  • Update the input of the 'device' parameter in the NeuralChat fine-tuning API, changing it from 'habana' to 'hpu' (96dabb0)
  • Change default values of do_lm_eval, lora_all_linear and use_fast_tokenizer in ModelArguments from False to True. (52f9f74)
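For callers that still pass the old device name, a small compatibility shim in user code can ease the migration. The sketch below is purely illustrative: `normalize_device` is a hypothetical helper, not part of the NeuralChat API, and it only maps the renamed value described above.

```python
import warnings

# Old device names mapped to their replacements (only the rename noted above).
_RENAMED_DEVICES = {"habana": "hpu"}


def normalize_device(device: str) -> str:
    """Return the current device name, warning when an old name is used.

    Hypothetical helper for user code; not provided by the library.
    """
    key = device.strip().lower()
    if key in _RENAMED_DEVICES:
        new_name = _RENAMED_DEVICES[key]
        warnings.warn(
            f"device '{key}' has been renamed to '{new_name}'",
            DeprecationWarning,
            stacklevel=2,
        )
        return new_name
    return key


print(normalize_device("habana"))
print(normalize_device("hpu"))
```

Because do_lm_eval, lora_all_linear, and use_fast_tokenizer now default to True, code that relied on the previous defaults needs to set them to False explicitly.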

Documentation

Validated Configurations

  • Python 3.8, 3.9, 3.10
  • CentOS 8.4, Ubuntu 20.04, and Windows 10
  • Intel® Extension for TensorFlow 2.12.0, 2.11.0
  • PyTorch 2.0.0+cpu, 1.13.1+cpu
  • Intel® Extension for PyTorch 2.0.0+cpu, 1.13.100+cpu