Joint Unsupervised and Supervised Training for Automatic Speech Recognition via Bilevel Optimization
This repository contains the code for the paper "Joint Unsupervised and Supervised Training for Automatic Speech Recognition via Bilevel Optimization" by A F M Saif, Xiaodong Cui, Han Shen, Songtao Lu, Brian Kingsbury, and Tianyi Chen.
In this paper, we present a novel bilevel optimization-based training approach for acoustic models in automatic speech recognition (ASR) tasks, termed bi-level joint unsupervised and supervised training (BL-JUST). BL-JUST employs lower and upper level optimizations with unsupervised and supervised losses respectively, leveraging recent advances in penalty-based bilevel optimization to address this challenging ASR problem with manageable complexity and rigorous convergence guarantees. Extensive experiments on the LibriSpeech and TED-LIUM v2 datasets demonstrate that BL-JUST outperforms the commonly used pre-training followed by fine-tuning strategy.
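The penalty idea described above can be illustrated on a toy problem. The sketch below is not the repository's implementation: it assumes a single shared scalar `theta` (standing in for the acoustic encoder), a scalar `phi` (standing in for supervised-only parameters), two hypothetical quadratic losses, and a linear schedule for the penalty weight `gamma`. It only shows how minimizing the supervised loss plus `gamma` times the unsupervised loss lets the two objectives influence each other during a single training run.

```python
# Toy penalty-based bilevel training loop (illustrative only, not the repository's code).
# theta: shared parameter used by both losses (hypothetical scalar stand-in).
# phi:   parameter that appears only in the supervised loss.

def unsupervised_loss(theta):
    """Lower-level loss; its unique minimizer is theta = 1."""
    return (theta - 1.0) ** 2


def supervised_loss(theta, phi):
    """Upper-level loss; on its own it would pull theta toward 3."""
    return (theta - 3.0) ** 2 + (phi - theta) ** 2


def penalized_gradients(theta, phi, gamma):
    """Gradients of supervised_loss + gamma * unsupervised_loss."""
    d_theta = 2.0 * (theta - 3.0) - 2.0 * (phi - theta) + gamma * 2.0 * (theta - 1.0)
    d_phi = 2.0 * (phi - theta)
    return d_theta, d_phi


def train(num_epochs=50, steps_per_epoch=200, gamma_step=0.5):
    """Run gradient descent on the penalized objective with a growing penalty."""
    theta, phi, gamma = 0.0, 0.0, 0.0
    for epoch in range(num_epochs):
        gamma = gamma_step * epoch       # assumed linear penalty schedule
        lr = 1.0 / (6.0 + 2.0 * gamma)   # small enough for stable descent on this quadratic
        for _ in range(steps_per_epoch):
            d_theta, d_phi = penalized_gradients(theta, phi, gamma)
            theta -= lr * d_theta
            phi -= lr * d_phi
    return theta, phi, gamma


if __name__ == "__main__":
    theta, phi, gamma = train()
    print("gamma:", gamma, "theta:", theta, "phi:", phi)
    print("supervised:", supervised_loss(theta, phi))
    print("unsupervised:", unsupervised_loss(theta))
```

In this toy setting, a penalty weight of zero lets `theta` settle at the supervised optimum (3), while a larger `gamma` pulls it toward the minimizer of the unsupervised loss (1). For a fixed `gamma` the stationary point is `theta = (3 + gamma) / (1 + gamma)`.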
- BL-JUST Framework: Introduces a feedback loop between unsupervised and supervised training, unlike the conventional strategy of pre-training followed by fine-tuning (PT+FT).
- Bilevel Optimization: Utilizes penalty-based bilevel optimization for joint training with convergence guarantees.
- Empirical Results: Demonstrates superior performance on LibriSpeech and TED-LIUM v2 datasets, reducing word error rates (WERs) and improving training efficiency.
- Code: Implementation of the BL-JUST training framework.
- Experiments: Scripts and configurations for reproducing the experiments presented in the paper.
- Datasets: Instructions for downloading and preparing the LibriSpeech and TED-LIUM v2 datasets.
- Dependencies:
  - Python 3.9
  - PyTorch 2
- Installation: Step-by-step guide to setting up the environment.
    git clone https://github.com/afmsaif/Joint-self-supervised-and-supervised-training-for-speech-models.git
    cd Joint-self-supervised-and-supervised-training-for-speech-models
    pip install -r requirements.txt
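After installation, a quick check of the interpreter and PyTorch versions can catch environment mismatches early. The helper below is a hypothetical convenience script, not part of the repository. It assumes the listed dependencies mean Python 3.9 exactly and any PyTorch 2.x release, and it uses only the standard library, so it runs even if PyTorch is missing.

```python
import sys
from importlib import metadata


def check_environment(expected_python=(3, 9), expected_torch_major=2):
    """Report installed Python and PyTorch versions against the expected ones."""
    report = {
        "python": "{}.{}".format(*sys.version_info[:2]),
        "python_ok": tuple(sys.version_info[:2]) == expected_python,
    }
    try:
        # Reads the installed package metadata without importing torch itself.
        torch_version = metadata.version("torch")
    except metadata.PackageNotFoundError:
        torch_version = None
    report["torch"] = torch_version
    report["torch_ok"] = (
        torch_version is not None
        and torch_version.split(".")[0] == str(expected_torch_major)
    )
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(key, "=", value)
```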
- Running Experiments: Detailed instructions to run the training and evaluation scripts.
- Training: Example commands for training the ASR models using the BL-JUST framework.
- Evaluation: Commands to evaluate the trained models and reproduce the results from the paper.
- Performance Metrics: Summary of the ASR performance on different datasets.
- Comparative Analysis: Comparison with the traditional PT+FT approach.
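For readers unfamiliar with the metric, word error rate is conventionally the word-level edit distance (substitutions, deletions, and insertions) between a hypothesis and its reference transcript, divided by the number of reference words. The function below is a minimal sketch of that standard definition. It is not the evaluation script used in this repository, and it assumes whitespace tokenization with no text normalization.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```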
If you find this code useful in your research, please consider citing our paper:
@article{saif2024joint,
  title={Joint Unsupervised and Supervised Training for Automatic Speech Recognition via Bilevel Optimization},
  author={Saif, AFM and Cui, Xiaodong and Shen, Han and Lu, Songtao and Kingsbury, Brian and Chen, Tianyi},
  journal={arXiv preprint arXiv:2401.06980},
  year={2024}
}
This work was supported by the Rensselaer-IBM AI Research Collaboration, part of the IBM AI Horizons Network, and by a Cisco research grant.