This repository has been archived by the owner on Oct 25, 2024. It is now read-only.
Intel® Extension for Transformers v1.2 Release
Highlights
Features
Productivity
Examples
Bug Fixing
API Modification
Documentation
Highlights
- NeuralChat was showcased in the Intel Innovation’23 keynote and at Google Cloud Next '23 to demonstrate GenAI/LLM capabilities on Intel Xeon Scalable Processors. The chatbot solution has been integrated into LLM as a service (LLMaaS), providing a smooth user experience for building GenAI/LLM applications on Intel's latest Xeon Scalable Processors and Gaudi2 from Intel Developer Cloud.
- NeuralChat offers a comprehensive pipeline for building end-to-end chatbot applications, with a rich set of pluggable features such as speech cloning & interaction (EN/CN), knowledge retrieval, query caching, and security guardrails. These features let you create a custom chatbot from scratch within minutes, significantly improving chatbot development productivity.
- LLM runtime extends the Transformers API to provide seamless weight-only low-precision inference for Hugging Face transformer-based models, including LLMs. We improved LLM runtime with more comprehensive kernel support for low precisions (INT8/FP8/INT4/FP4/NF4) while keeping full compatibility with GGML. LLM runtime delivers 25 ms/token with INT4 Llama and 22 ms/token with INT4 GPT-J on Intel Xeon Scalable Processors, providing a complementary and highly optimized LLM runtime solution for Intel architectures.
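To put the per-token latencies quoted above in more familiar terms, the short sketch below converts them to throughput. The function name is illustrative only, and the numbers are simply the ones stated in the highlights, not new measurements.

```python
def tokens_per_second(ms_per_token: float) -> float:
    """Convert a per-token latency in milliseconds to tokens per second."""
    if ms_per_token <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / ms_per_token


# Latencies quoted in the highlights (INT4 models on Intel Xeon Scalable Processors).
quoted_latencies_ms = {"Llama": 25.0, "GPT-J": 22.0}

for model, latency in quoted_latencies_ms.items():
    print(f"{model}: {tokens_per_second(latency):.1f} tokens/s")
```

Under this conversion, 25 ms/token corresponds to 40 tokens/s and 22 ms/token to roughly 45 tokens/s.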
Features
- Neural Chat
- Model Optimization
- LLM Runtime
- Enable FFN fusion for LLMs (277108)
- Enable tensor parallelism for 4-bit GPT-J on 2 sockets (fe0d65c)
- Implement AMX INT8/BF16 MHA (c314d6c)
- Support asymmetric models in LLM Runtime (93ca55)
- Jblas and Qbits support nf4 and s4-fullrange weight compression (ff7af86)
- Enhance beam-search early-stopping mechanisms (cd4c33d)
Productivity
- ITREX has moved to fully public development; contributions are welcome. (90ca31)
- Support streaming mode for NeuralChat (f5892ec)
- Support Direct Preference Optimization to improve accuracy (50b5b9)
- Support query cache for chatbot (1b4463)
- Weight-only support for the PyTorch framework (3a064fa)
- Provide mixed INT8 & BF16 inference mode for Stable Diffusion (bd2973)
- Support Stable Diffusion v1.4/v1.5/v2.1 and QAT inference on Linux/Windows (02cc59)
- Update oneDNN to v3.3 (e6d8a4)
- Weight-only kernel supports INT8 quantization (6ce8b13)
- Enable flash-attention-like kernel in weight-only (0ef3942)
- Weight-only kernel ISA-based dispatcher (ff7af86)
- Support 4-bit per-channel quantization (4e164a8)
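For readers unfamiliar with the term, the following is a minimal sketch of what symmetric 4-bit per-channel weight quantization generally means: each output channel (row) gets its own scale. It is a generic illustration in plain Python, not the ITREX kernel implementation; the function names are hypothetical, and the library's actual rounding, range, and grouping choices may differ.

```python
from typing import List, Tuple


def quantize_per_channel_int4(
    weights: List[List[float]],
) -> Tuple[List[List[int]], List[float]]:
    """Symmetric 4-bit quantization with one scale per row (output channel)."""
    q_rows, scales = [], []
    for row in weights:
        max_abs = max(abs(w) for w in row)
        # Map the largest magnitude in the row to 7; an all-zero row uses scale 1.0.
        scale = max_abs / 7.0 if max_abs > 0 else 1.0
        # Round to the nearest integer and clamp to the signed 4-bit range [-8, 7].
        q_rows.append([max(-8, min(7, round(w / scale))) for w in row])
        scales.append(scale)
    return q_rows, scales


def dequantize(q_rows: List[List[int]], scales: List[float]) -> List[List[float]]:
    """Recover approximate float weights from integer codes and per-row scales."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]


# Two rows with very different magnitudes.
weights = [[0.7, -0.3, 0.1], [0.021, -0.012, 0.006]]
q, scales = quantize_per_channel_int4(weights)
restored = dequantize(q, scales)
print(q)
print(restored)
```

The example weights show why a per-row scale helps: with a single scale shared by the whole matrix (0.1 here), every entry of the second row would round to 0, whereas a per-channel scale preserves its structure.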
Examples
- Add Falcon, ChatGLM CLM examples (c3b196)
- Enable code-generation example with Docker and integrate bigcode/lm-eval (c569fd5 0b3450)
- Weight-only ChatGLM-V1/V2, BLOOM-7B, MPT-30B, Llama2-7B, Llama2-70B, Falcon-40B, Dolly-V2, Starcoder-15B and OPT series examples (9a2cfa 793629 ac5744f 96d424 d4fb27 2a82ee0 e4eb09f d4fb27 578162 f5df02)
- Support Intel/neural-chat-7b-v1-1 model in ChatBot (126d07b)
- Add fine-tuning for Text-to-Speech (TTS) task in NeuralChat (1dac9c6 e39fec90)
- Support GPT-J NeuralChat on Habana (9ef6ad8)
- Enable MPT PEFT LoRA fine-tuning on Gaudi (3dc184e)
- Add code-generation fine-tuning pipeline (c070a8)
- E2E Talking Bot example on Windows PC (be2a267)
Bug Fixing
- Fix issues reported by Cobalt, a third-party company hired by Intel to perform penetration testing (51a1b88)
- Fix Windows compile issues (bffa1b0)
- Fix ordinals and conjunctions in TTS normalizer (0892f8a)
- Fix Habana fine-tuning issues (2bbcf51)
- Fix bugs in RAG code for converting the prompt (bfad5c)
- Fix normalizer: year, punctuation after number, end token (775a12)
- Fix Graph Model quantization on AVX2-only platforms (3c84ec6)
API Modification
- Update the input of the 'device' parameter in the NeuralChat fine-tuning API, changing it from 'habana' to 'hpu' (96dabb0)
- Change default values of do_lm_eval, lora_all_linear and use_fast_tokenizer in ModelArguments from False to True. (52f9f74)
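Scripts written against earlier releases may still pass the old device string. As one possible way to handle the rename during migration, the sketch below maps the legacy value to the new one before it reaches the fine-tuning API. The helper `normalize_device` and the mapping table are hypothetical and are not part of the library.

```python
import warnings

# Hypothetical mapping for illustration: old device string -> new device string.
_LEGACY_DEVICE_NAMES = {"habana": "hpu"}


def normalize_device(device: str) -> str:
    """Translate a pre-v1.2 device string to the name expected from v1.2 on."""
    if device in _LEGACY_DEVICE_NAMES:
        new_name = _LEGACY_DEVICE_NAMES[device]
        warnings.warn(
            f"device={device!r} was renamed to {new_name!r} in v1.2",
            DeprecationWarning,
            stacklevel=2,
        )
        return new_name
    # Any other value (for example "cpu") is passed through unchanged.
    return device


print(normalize_device("hpu"))
```

The same caution applies to the changed defaults: a script that relied on the previous `False` values of `do_lm_eval`, `lora_all_linear`, or `use_fast_tokenizer` now needs to set them to `False` explicitly.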
Documentation
- Add notebooks for optimization of NeuralChat on SPR, HPU, and A100 (52f9f74 7218806 d156e9a)
- Add README and UT for NeuralChat (daff796 49336d3 9b81f05)
Validated Configurations
- Python 3.8, 3.9, 3.10
- CentOS 8.4, Ubuntu 20.04, and Windows 10
- Intel® Extension for TensorFlow 2.12.0, 2.11.0
- PyTorch 2.0.0+cpu, 1.13.1+cpu
- Intel® Extension for PyTorch 2.0.0+cpu, 1.13.100+cpu
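As a small convenience, a check like the one below can flag when an environment falls outside the Python versions listed above. It is only a sketch: the helper name is made up for illustration, and an unlisted version is not necessarily unsupported, just not validated for this release.

```python
import sys

# Python versions listed under "Validated Configurations" above.
VALIDATED_PYTHON = {(3, 8), (3, 9), (3, 10)}


def is_validated_python(version_info=sys.version_info) -> bool:
    """Return True if the (major, minor) version is in the validated list."""
    return tuple(version_info[:2]) in VALIDATED_PYTHON


if not is_validated_python():
    current = f"{sys.version_info.major}.{sys.version_info.minor}"
    print(f"Note: Python {current} is not in the validated list for v1.2")
```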