Releases: intel/intel-extension-for-transformers
Intel® Extension for Transformers v1.4.2 Release
Highlights
Improvements
Examples
Bug Fixing
Highlights
- Support vLLM CPU and IPEX CPU WOQ with Transformer-like API
- Support StreamingLLM on Habana Gaudi
Improvements
- Add true_sequential for WOQ GPTQ (091f564 )
- Refine GPU scripts to support OOB mode (f4b3a7b )
- QBits adapt to the latest BesTLA (c169bec )
- Improve CPU WOQ scheme setting (fd3ee5cf1 )
- Enhance voicechat API with multilang tts streaming support (98daf37d )
- Add bias internal conversion in qbits (7c29f6f1 )
- Add DynamicQuantConfig, QuantAwareTrainingConfig and StaticQuantConfig (6a15b48, e1f4666d)
Examples
- Integrate EAGLE with ITREX (e559929d )
Bug Fixing
- Fix for token latency (ae7a4ae )
- Fix phi3 quantization scripts (2af19c7a6 )
- Fix is_intel_gpu_available (47d5024 )
- Fix QLoRA CPU issue due to internal API change (699ffca )
- Add scale and weight dtype check for quantization config (307c1a8b )
- Fix tf autodistill bug of transformers>=4.37 (8116fbb2 )
Validated Configurations
- Python 3.10
- Ubuntu 22.04
- PyTorch 2.2.0+cpu
- Intel® Extension for PyTorch 2.2.0+cpu
Intel® Extension for Transformers v1.4.1 Release
Highlights
Improvements
Examples
Bug Fixing
Highlights
- Support Weight-only Quantization on MTL iGPU
- Upgrade lm-eval to 0.4.2
- Support Llama3
Improvements
- Support TPP for Xeon Tensor Parallel (5f0430f )
- Refine Model from_pretrained when use_neural_speed (39ecf38e )
Examples
- Add vision front-end demo (1c6550 )
- Add example for table extraction, and enabled multi-page table handling pipeline (db9e6fb )
- Adapted textual inversion distillation for quantization example to latest transformers and diffusers packages (0ec83b1 )
- Update NeuralChat Notebooks (83bb65a, 629b9d4 )
Bug Fixing
- Fix QBits actshuf buf overflow under large batch (a6f3ab3 )
- Fix TPP support for single socket (a690072 )
- Fix retrieval dependency (281b0a3 )
- Fix loading issue of woq model with parameters (37f9db25 )
Validated Configurations
- Python 3.10
- Ubuntu 22.04
- PyTorch 2.2.0+cpu
- Intel® Extension for PyTorch 2.2.0+cpu
Intel® Extension for Transformers v1.4 Release
Highlights
Features
Productivity
Examples
Bug Fixing
Highlights
- AutoRound is a SOTA weight-only quantization (WOQ) algorithm for low-bit LLM inference on typical LLMs. This release adds support for AutoRound quantization and for inference with INT4 models quantized by AutoRound.
Features
- LLM Workflow/Neural Chat
- Support Triton Serving/Deployment on HPU/GPU (4657036, c57c17e )
- Enable HF/TGI Endpoint (5b84e5, 525ea8, 34b3e9 )
- Enable RAG + ChatGPT flow (de8800 )
- [UI] Customized side by side (5835c3 )
- Support Multi-language TTS (260155a )
- Support language detection & translation for RAG chat (99df35d8 )
- Add file management in RAG API (b7fc01de )
- Support deepspeed for Textchat API (7b0b995 )
- Transformers Extension for LLM Optimization
Productivity
- Add bm25 algorithm into retrievers (a19467d0 )
- Add evaluation perplexity during training (2858ed1 )
- Enhance embedding to support jit model (588c60 )
- Update the character checking function to enable the Chinese character (0da63fe1 )
- Enlarge the context window for HPU graph recompile (dcaf17ac )
- Support IPEX bf16 & fp32 optimization for embedding model (b51552 )
- Enable lm_eval during training. (2de883 )
- Refine setup.py and requirements.txt (436847 )
- Improve WOQ model saving and loading (30d9d10, 1065d81c )
- Add layerwise for WOQ RTN & GPTQ (15a848f3 )
- Update sparseGPT example (3ae0cd0 )
- Changed regular expression to add support of the unicode characters (fd2516b )
- Check and convert contiguous tensor when model saving (d21bb3e )
- Support load model from modelscope using NeuralSpeed (20ae00 )
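The bm25 retriever item above refers to the Okapi BM25 ranking function. As a rough illustration of what such a retriever computes, here is a minimal, standard-library-only sketch. It is not the NeuralChat implementation: the function name `bm25_scores`, the whitespace tokenization, and the `k1`/`b` defaults are assumptions chosen for the example.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document in `docs` against the tokenized `query`."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        term_freq = Counter(d)
        score = 0.0
        for term in query:
            if term not in term_freq:
                continue
            # Smoothed IDF; the "+ 1.0" inside the log keeps it non-negative.
            idf = math.log(
                (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0
            )
            tf = term_freq[term]
            # Length normalization: longer-than-average documents are penalized.
            norm = tf + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical toy corpus, tokenized by whitespace.
docs = [
    "intel xeon runs int4 llm inference".split(),
    "gaudi2 supports lora finetuning".split(),
    "weight only quantization for llm inference on xeon".split(),
]
scores = bm25_scores("xeon inference".split(), docs)
# Indices of documents, best match first.
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

The first and third documents both contain the two query terms once, so the shorter one ranks higher because of length normalization; the second document shares no terms and scores zero.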
Examples
- Support microsoft/biogpt model (3e7e35 )
- Add finetuning example for gemma-2b on ARC. (ffa8f3c6 )
- Add example to use RAG+OpenAI LLM (3c5959 )
- Enable mistralai/Mixtral-8x7B-v0.1 LORA finetuning on Gaudi2 (7539c35 )
- Enable image2text finetuning example on CPU (ef94aeaa )
- Add LLaVA-NeXT (feff1ec0 )
Bug Fixing
- Fix CLM tasks when transformers >= 4.38.1 (98bfcf8 )
- Fix distilgpt2 TF signature issue (a7c15a9f )
- Add User input + max tokens requested exceeds model context window error response (ae91bf8 )
- Fix audio plugin sample code issue and provide a way to set tts/asr model path (db7da09 )
- Fix modeling_auto trust_remote_code issue (3a0987 )
- Fix lm-eval neuralspeed loading model (cd6e488 )
- Fixed weight-only config save issue (5c92fe31 )
- Fix index error in Child-parent retriever (8797cfe )
- Fix WOQ int8 unpack weight (edede4 )
- Fix gptq desc_act and static_group (528d7de )
- Fix request.client=None issue (494a571 )
- Fix WOQ huggingface model loading (01b1a44 )
- Fix SQ model restore loading (1e00f29 )
- Remove redundant parameters for WOQ saving config and fix GPTQ issue (ef0882f6 )
- Fixed example error for Intel GPU WOQ (8fdde06 )
- Fix woq autoround last layer quant issue (d21bb3e )
- Fix code-generation params (ab2fd05 )
Validated Configurations
- Python 3.8, 3.9, 3.10, 3.11
- Ubuntu 20.04 & Windows 10
- Intel® Extension for TensorFlow 2.13.0, 2.14.0
- PyTorch 2.2.0+cpu, 2.1.0+cpu
- Intel® ...
Intel® Extension for Transformers v1.3.2 Release
Highlights
- Support NeuralChat-TGI serving with Docker (8ebff39)
- Support NeuralChat-vLLM serving with Docker (1988dd)
- Support SQL generation in NeuralChat (098aca7)
- Enable llava mmmu evaluation on Gaudi2 (c30353f)
- Improve LLM INT4 inference on Intel GPUs
Improvements
- Minimize dependencies for running a chatbot (a0c9dfe)
- Remove redundant knowledge id in audio plugin API (9a7353)
- Update parameters for NeuralSpeed (19fec91)
- Integrate backend code of Askdoc (c5d4cd)
- Refine finetuning data preprocessing with static shape for Gaudi2 (3f62ceb)
- Sync RESTful API with latest OpenAI protocol (2e1c79)
- Support WOQ model save and load (1c8078f)
- Extend API for GGUF (7733d4)
- Enable OpenAI compatible audio API (d62ff9e)
- Add pack_weight info acquire interface (18d36ef)
- Add customized system prompts (04b2f8)
- Support WOQ scheme asym (c7f0b70)
- Update code_lm_eval to bigcode_eval (44f914e)
- Enable Retrieval PDF figure to text (d6a66b3)
- Enable retrieval then rerank pipeline (15feadf)
- Enable grammar check and query polish to enhance RAG performance (a63ec0)
Examples
- Add Rank-One Model Editing (ROME) implementation and example (8dcf0ea7)
- Support GPTQ, AWQ model in NeuralChat (5b08de)
- Add Neural Speed example scripts (6a97d15, 3385c42)
- Add langchain extension example and update notebook (d40e2f1)
- Support deepseek-coder models in NeuralChat (e7f5b1d)
- Add autoround examples (71f5e84)
- BGE embedding model finetuning (67bef24)
- Support DeciLM-7B and DeciLM-7B-instruct in NeuralChat (e6f87ab)
- Support GGUF model in NeuralChat (a53a33c)
Bug Fixing
- Add trust_remote_code args for lm_eval of WOQ example (9022eb)
- Fix CPU WOQ accuracy issue (e530f7)
- Change the default value for XPU weight-only quantization (4a78ba)
- Fix whisper forced_decoder_ids error (09ddad)
- Fix off by one error on masking (525076d)
- Fix backprop error for text only examples (9cff14a)
- Use unk token instead of eos token (6387a0)
- Fix errors in trainer save (ff501d0)
- Fix Qdrant bug caused by langchain_core upgrade (eb763e6)
- Set trainer.save_model state_dict format to safetensors (2eca8c)
- Fix text-generation example accuracy scripts (a2cfb80)
- Resolve WOQ quantization error when running neuralchat (6c0bd77)
- Fix response issue of model.predict (3068496)
- Fix pydub library import issues (c37dab)
- Fix chat history issue (7bb3314)
- Update gradio APP to sync with backend change (362b7af)
Validated Configurations
- Python 3.10
- Ubuntu 22.04
- Intel® Extension for TensorFlow 2.13.0
- PyTorch 2.1.0+cpu
- Intel® Extension for PyTorch 2.1.0+cpu
Intel® Extension for Transformers v1.3.1 Release
Highlights
Improvements
Examples
Bug Fixing
Validated Configurations
Highlights
- Support experimental INT4 inference on Intel GPU (ARC and PVC) with Intel Extension for PyTorch as backend
- Enhance LangChain to support new vectorstore (e.g., Qdrant)
Improvements
- Improve error code handling coverage (dd6dcb4 )
- NeuralChat document refine (aabb2fc )
- Improve Text-generation API (a4aba8 )
- Refactor transformers-like API to adapt to latest transformers version (4e6834a )
- NeuralChat integrate GGML INT4 (29bbd8 )
- Enable Qdrant vectorstore (f6b9e32 )
- Support llama series model for llava finetuning (d753cb )
Examples
- Support GGUF Q4_0, Q5_0 and Q8_0 models from Hugging Face (1383c7)
- Support GPTQ model inference on CPU (f4c58d0 )
- Support SOLAR-10.7B-Instruct-v1.0 model (77fb81 )
- Support magicoder model and refine load model (f29c1e )
- Support Mixtral-8x7b model (9729b6 )
- Support Phi-2 model (04f5ef6c )
- Evaluate Perplexity of NeuralSpeed (b0b381)
Bug Fixing
- Fix GPTQ loading issue (226e08)
- Fix tts crash with messy retrieval input and enhance normalizer (4d8d9a )
- Support compatible stats format (c0a89c5a )
- Fix RAG example for retrieval plugin parameter change (c35d2b )
- Fix magicoder tokenizer issue and streaming redundant end format (2758d4 )
Validated Configurations
- Python 3.10
- CentOS 8.4 & Ubuntu 22.04
- Intel® Extension for TensorFlow 2.13.0
- PyTorch 2.1.0+cpu
- Intel® Extension for PyTorch 2.1.0+cpu
Intel® Extension for Transformers v1.3 Release
Highlights
Publication
Features
Examples
Bug Fixing
Incompatible change
Highlights
- LLM Workflow/Neural Chat
- Achieved top-1 7B LLM on the Hugging Face Open LLM Leaderboard in Nov’23
- Released DPO dataset to Hugging Face Space for fine-tuning
- Published the blog and fine-tuning code on Gaudi2
- Supported fine-tuning and inference on Gaudi2 and Xeon
- Updated notebooks for chatbot development and deployment
- Provided customizable RAG-based chatbot applications
- Published INT4 chatbot on Hugging Face Space
- Transformer Extension for Low-bit Inference and Fine-tuning
- Supported INT4/NF4/FP4/FP8 LLM inference
- Improved StreamingLLM for efficient endless text generation
- Demonstrated up to 40x better performance than llama.cpp on Intel Xeon Scalable Processors
- Supported QLoRA fine-tuning on CPU
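The StreamingLLM item above concerns endless text generation with a bounded key/value cache. The published StreamingLLM idea is to keep a few initial "attention sink" tokens plus a sliding window of the most recent tokens, and to evict everything in between. A minimal sketch of that eviction policy follows; it is not the ITREX runtime code, it omits the re-indexing of positions inside the cache, and the names `evict_kv_cache`, `n_sink`, and `window` are assumptions made for illustration.

```python
def evict_kv_cache(cache, n_sink=4, window=8):
    """Return the cache entries kept under an attention-sink policy.

    `cache` is a list of per-token KV entries, oldest first. The first
    `n_sink` entries and the last `window` entries are kept.
    """
    if len(cache) <= n_sink + window:
        return list(cache)
    # Slice from an explicit index so that window=0 keeps no recent tokens.
    return cache[:n_sink] + cache[len(cache) - window:]


# Simulate generating 20 tokens with 2 sink slots and a 4-token window.
cache = []
for token_id in range(20):
    cache.append(token_id)  # stand-in for one token's key/value tensors
    cache = evict_kv_cache(cache, n_sink=2, window=4)
```

The cache never grows beyond `n_sink + window` entries, which is what makes generation length independent of memory use in this scheme.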
Publications
- NeurIPS'2023 on Efficient Natural Language and Speech Processing: Efficient LLM Inference on CPUs
- NeurIPS'2023 on Diffusion Models: Effective Quantization for Diffusion Models on CPUs
- Arxiv: TEQ: Trainable Equivalent Transformation for Quantization of LLMs
- Arxiv: Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs
Features
- LLM Workflow/Neural Chat
- Support Gaudi model parallelism serving (7f0090)
- Add PEFT model support in deepspeed sharded mode (370ca3)
- Support return error code (ea173a)
- Enhance NeuralChat security (ab43c7, 43e8b9, 6e0386)
- Support assisted generation for NeuralChat (5ba797)
- Add codegen restful API in NeuralChat (0c77b1)
- Support multi cards streaming inference on Gaudi (9ad75c)
- Support multi CPU restful API serving (fec4bb4)
- Support IPEX int8 model (e13363)
- Enable retrieval with URL as inputs (9d90e1d)
- Add NER plugin to NeuralChat (aa5d8a)
- Integrate PhotoAI backend into NeuralChat (da138c, d7a1d8)
- Support image to image plugin as service (12ad4c)
- Support optimized SadTalker to Video plugin in NeuralChat (7f24c79)
- Add askdoc retrieval API & example (89cf76)
- Add sidebyside UI (dbbcc2b)
- Transformer Extension for Low-bit Inference and Fine-tuning
- Support load_in_nbit in llm runtime (4423f7)
- Extend langchain embedding API (80a779)
- Support QLoRA on CPU device (adb109)
- Support PPO rl_training (936c2d2, 8543e2f)
- Support multi-model training (ecb448)
- Transformers Extension for Low-bit Inference Runtime support GPTQ models (8145e6)
- Enable Beam Search post-processing (958d04, ae95a2, ae95a2, 224656, 958d04, 6ea825)
- Add MX-Format (FP8_E5M2, FP8_E4M3, FP4_E2M1, NF4) (f49f2d, 9f96ae)
- Refactor Transformers Extension for Low-bit Inference Runtime based on the latest Jblas (43e30b)
- Support attention block TP and add jblas split weight interface (2c31dc, 22ceda4)
- Enabling streaming LLM for Runtime (ffc73bb5)
- Support starcoder MHA fusion (841b29a)
- SmoothQuantConfig support recipes (1e0d7e)
- Add SetFit API in ITREX (ffb7fd8)
- Support full parameters finetuning (2b541)
- Support SmoothQuant auto tune (2fde68c)
- Use python logging instead of print (60942e)
- Support Falcon and unify int8 API (0fb2da8)
- Support ipex.optimize_transformers feature (d2bd4d, ee855, 3f9ee42)
- Optimized dropout operator (7d276c)
- Add Script for PPL Evaluation (df40d5)
- Refine Python API (91511d, 6e32ca6)
- Allow CompileBF16 on GCC11 ([d9e95d](d9e95da382a9f2b81cace55825...
Intel® Extension for Transformers v1.2.2 Release
Bug Fixing & Improvements
- Replace test dataset with validation dataset when do_eval (e764bb5)
- Fix save issue of deepspeed zero3 (cf5ff82)
- Fix UT issues on Nvidia GPU (464962e)
- Fix NER nightly UT bug (9e5a6b3)
- Escape SQL string for SDL (43e8b9a)
- Fix added_tokens error (fd74a9a)
Validated Configurations
- Python 3.9, 3.10
- CentOS 8.4 & Ubuntu 22.04
- Intel® Extension for TensorFlow 2.13.0
- PyTorch 2.1.0+cpu
- Intel® Extension for PyTorch 2.1.0+cpu
- Transformers 4.34.1
Intel® Extension for Transformers v1.2.1 Release
- Examples
- Bug Fixing & Improvements
Examples
- Add docker for code-generation (dd3829 )
- Enable Qwen-7B-Chat for NeuralChat (698e58 )
- Enable Baichuan & Baichuan2 CPP inference (98e5f9 )
- Add sidebyside UI for NeuralChat (dbbcc2 )
- Support Falcon-180B CPP inference (900ebf )
- Support starcoder finetuning example (073bdd )
- Enable text-generation using qwen (8f41d4 )
- Add docker for neuralchat (a17d952 )
Bug Fixing & Improvements
- Fix bug for WOQ with AWQ caused by calib_iters not being set when calib_dataloader is not None (565ab4)
- Fix init issue of langchain chroma (fdefe2)
- Fix NeuralChat starcoder mha fusion issue (ce3d24)
- Fix setuptools version limitation for build (2cae32)
- Fix post process with topk topp of python api (7b4730)
- Fix msvc compile issues (87b00d)
- Refine notebook and fix restful api issues (d8cc11)
- Upgrade qbits backend (45e03b )
- Fix starcoder issues for IPEX int8 and Weight Only int4 (e88c7b )
- Fix ChatGLM2 model loading issue (4f2169 )
- Remove OneDNN graph env setting for BF16 inference (59ab03 )
- Improve database by escape sql string (be6790 )
- Fix qbits backend getting the wrong workspace malloc size (6dbd0b )
Validated Configurations
- Python 3.9, 3.10
- CentOS 8.4 & Ubuntu 22.04
- Intel® Extension for TensorFlow 2.13.0
- PyTorch 2.1.0+cpu
- Intel® Extension for PyTorch 2.1.0+cpu
- Transformers 4.34.1
Intel® Extension for Transformers v1.2 Release
Highlights
Features
Productivity
Examples
Bug Fixing
API Modification
Documentation
Highlights
- NeuralChat has been showcased in Intel Innovation’23 Keynote and Google Cloud Next '23 to demonstrate GenAI/LLM capabilities on Intel Xeon Scalable Processors. The chatbot solution has been integrated into LLM as a service (LLMaaS), providing a smooth user experience for building GenAI/LLM applications on Intel's latest Xeon Scalable Processors and Gaudi2 from Intel Developer Cloud.
- NeuralChat offers a comprehensive pipeline for building end-to-end chatbot applications with a rich set of pluggable features such as speech cloning & interaction (EN/CN), knowledge retrieval, query caching, and security guardrail. These features allow you to create a custom chatbot from scratch within minutes, significantly improving chatbot development productivity.
- LLM runtime extends the Transformers API to provide seamless weight-only low-precision inference for Hugging Face transformer-based models, including LLMs. We improved LLM runtime with more comprehensive kernel support for low precisions (INT8/FP8/INT4/FP4/NF4) while keeping full compatibility with GGML. LLM runtime delivers 25ms/token with int4 llama and 22ms/token with int4 GPT-J on Intel Xeon Scalable Processors, providing a complementary and highly optimized LLM runtime solution for Intel architectures.
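To make "weight-only low precision" concrete, the sketch below shows symmetric, group-wise round-to-nearest (RTN) quantization of weights to signed INT4, which is among the simplest weight-only schemes. It is an illustration only and does not reflect the LLM runtime's kernels or storage format; the function names, the group size of 4, and the sample weights are assumptions.

```python
def quantize_int4_sym(weights, group_size=4):
    """Round-to-nearest symmetric INT4 quantization, one scale per group."""
    q_values, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Map the largest magnitude to 7; use 1.0 for an all-zero group.
        scale = max_abs / 7 if max_abs > 0 else 1.0
        scales.append(scale)
        # Round, then clamp to the signed 4-bit range [-8, 7].
        q_values.extend(max(-8, min(7, round(w / scale))) for w in group)
    return q_values, scales


def dequantize_int4(q_values, scales, group_size=4):
    """Recover approximate float weights from INT4 values and group scales."""
    return [q * scales[i // group_size] for i, q in enumerate(q_values)]


# Hypothetical weights: two groups with very different magnitudes.
weights = [0.7, -0.32, 0.1, 0.0, 2.8, -1.3, 0.45, -2.8]
q, s = quantize_int4_sym(weights)
restored = dequantize_int4(q, s)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Using one scale per group rather than one per tensor keeps the rounding error of the small-magnitude group from being dominated by the large-magnitude one; the per-weight error is bounded by half of that group's scale.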
Features
- Neural Chat
- Model Optimization
- LLM Runtime
- Enable FFN fusion LLMs (277108 )
- Enabled Tensor Parallelism for 4bits GPT-J on 2 sockets (fe0d65c )
- Implement AMX INT8/BF16 MHA (c314d6c )
- Support asymmetric models in LLM Runtime (93ca55 )
- Jblas and Qbits support nf4 and s4-fullrange weight compression. (ff7af86 )
- Enhance Beam-search early-stopping mechanisms (cd4c33d )
Productivity
- ITREX moved to fully public development; contributions are welcome. (90ca31 )
- Support streaming mode for neuralchat (f5892ec )
- Support Direct Preference Optimization to improve accuracy. (50b5b9 )
- Support query cache for chatbot (1b4463 )
- Weight-only support for Pytorch Framework (3a064fa )
- Provide mix INT8 & BF16 inference mode for stable diffusion. (bd2973 )
- Supported Stable Diffusion v1.4/v1.5/v2.1 and QAT inference on Linux/windows (02cc59 )
- Update Onednn to v3.3 (e6d8a4 )
- Weight-only kernel support INT8 quantization (6ce8b13 )
- Enable flash attention like kernel in weight-only (0ef3942)
- Weight-only kernel ISA based dispatcher (ff7af86 )
- Support 4bits per-channel quantization (4e164a8 )
Examples
- Add Falcon, ChatGLM CLM examples (c3b196 )
- Enabled code-generation example with Docker and integrated bigcode/lm-eval (c569fd5 0b3450 )
- Weight-only ChatGLM-V1/V2, BLOOM-7B, MPT-30B, Llama2-7B, Llama2-70B, Falcon-40B, Dolly-V2, Starcoder-15B and OPT series examples (9a2cfa 793629 ac5744f 96d424 d4fb27 2a82ee0 e4eb09f d4fb27 578162 f5df02 )
- Support Intel/neural-chat-7b-v1-1 model in ChatBot (126d07b )
- Add fine-tuning for Text-to-Speech(TTS) task in NeuralChat (1dac9c6 e39fec90 )
- Support GPT-J NeuralChat in Habana (9ef6ad8 )
- Enable MPT peft LORA finetune in Gaudi (3dc184e )
- Add code-generation finetuning pipeline (c070a8 )
- E2E Talking Bot example on Windows PC (be2a267 )
Bug Fixing
- Fixed issues reported by Cobalt, the third-party company hired by Intel to do penetration testing. (51a1b88 )
- Fix windows compile issues (bffa1b0 )
- Fix ordinals and conjunctions in tts normalizer ([0892f8a](https://github.com/intel/intel-extension-for-transformers/...
Intel® Extension for Transformers v1.1.1 Release
- Highlights
- Bug Fixing & Improvements
- Tests & Tutorials
Highlights
In this release, we improved NeuralChat, a customizable chatbot framework under Intel® Extension for Transformers. NeuralChat is now available for you to create your own chatbot within minutes on multiple architectures.
Bug Fixing & Improvements
- Fix the code structure and the plugin in NeuralChat (commit 486e9e)
- Fix bug in retrieval chat (commit d2cee0)
- NeuralChat inference returns the correct input length, without padding, to the user (commit 18be4c)
- Fix MPT not supporting left padding (commit 24ae58)
- Fix double removal of dataset columns during concatenation (commit 67ce6e)
- Fix DeepSpeed and use cache issue (commit 4675d4)
- Fix bugs in predict_stream (commit e1da7e)
- Fix docker CPU issues (commit 8fa0dc)
- Fix read HuggingFaceH4/oasst1_en dataset issue (commit 76ee68)
- Modify Dockerfile for finetuning (commit 797aa2)
- Fix the perf of LLaMA2 by static_shape in optimum Habana (commit 481f38)
- Remove redundant code and hard-coded values in NeuralChat (commit 0e1e4d, 037ce8, 10af3c)
- Refined NeuralChat finetuning config (commit e372cf)
Tests & Tutorials
- Add inference test for LLaMA2 and MPT with HPU (commit 5c4f5e)
- Add inference test for LLaMA2 and MPT with Intel CPUs (commit ad4bec, 2f6188)
- Add finetuning test for MPT (commit 72d81e, 423242)
- Add GHA Unit Tests (commit 49336d)
- NeuralChat finetuning tutorial for LLaMA2 and MPT (commit d156e9)
- NeuralChat deployment tutorial for Intel CPU / Habana HPU / Nvidia (commit b36711)
Validated Configurations
- CentOS 8.4 & Ubuntu 22.04
- Python 3.9
- PyTorch 2.0.0
- TensorFlow 2.12.0
Acknowledgements
Thanks for the contributions from sywangyi, jiafuzha and itayariel. Thanks to all the participants in Intel Extension for Transformers.