Note

This project has moved to https://github.com/egorsmkv/asr-cc

ASR Corpus Creator

Overview

This app automatically creates a corpus for ASR (automatic speech recognition) systems using pseudo-labeling.
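To make the idea of pseudo-labeling concrete, here is a minimal sketch and not the app's actual code. It assumes a hypothetical `transcribe` callable standing in for one of the ASR backends; the `Utterance` type and `pseudo_label` function are names invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Utterance:
    audio_path: str
    text: str  # transcript produced by an ASR model, not by a human


def pseudo_label(paths: Iterable[str],
                 transcribe: Callable[[str], str]) -> List[Utterance]:
    """Label each audio file with the output of an existing ASR model."""
    corpus = []
    for path in paths:
        text = transcribe(path).strip()
        if text:  # skip files for which the model produced no text
            corpus.append(Utterance(audio_path=path, text=text))
    return corpus


def fake_asr(path: str) -> str:
    # Hypothetical stand-in for a real backend (Whisper, wav2vec2, NeMo).
    return "hello world" if path.endswith(".wav") else ""


corpus = pseudo_label(["a.wav", "notes.txt"], fake_asr)
print(corpus)
```

The resulting labels are only as good as the model that produced them, which is why the app also collects metadata that can be used to filter the corpus.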

Features

Send links of YouTube content or even an entire channel
Send direct links to video/audio from remote servers
Send local folders
Collect metadata
- Loudness
- Label language detection
- Audio language detection
- SRMR (speech-to-reverberation modulation energy ratio, a measure of reverberation)
- Audio type (Speech, Music, etc)
Export labeled data from the console
Use Whisper, wav2vec2, or NeMo as the ASR backend

Installing

Requirements

Rust compiler
CMake
Protobuf compiler
Docker
wget
youtube-dl or yt-dlp
ffmpeg
Python >= 3.8 (the pinned Django 4.1.3 does not support older versions)
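Before installing, it can help to confirm that the tools above are on your PATH. The following is a small sketch using only the standard library. The executable names (`rustc`, `cmake`, `protoc`, and so on) are assumptions about how these tools are usually installed, and `find_tools` and `missing_requirements` are hypothetical helpers, not part of this project.

```python
import shutil
from typing import Dict, List, Optional

# Executable names are assumptions; adjust them to your system.
REQUIRED = ["rustc", "cmake", "protoc", "docker", "wget", "ffmpeg"]
DOWNLOADERS = ["yt-dlp", "youtube-dl"]  # only one of these is needed


def find_tools(names: List[str]) -> Dict[str, Optional[str]]:
    """Map each tool name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


def missing_requirements() -> List[str]:
    """Return the requirements that could not be found on PATH."""
    missing = [name for name, path in find_tools(REQUIRED).items() if path is None]
    if not any(find_tools(DOWNLOADERS).values()):
        missing.append("youtube-dl or yt-dlp")
    return missing


print("Missing:", missing_requirements() or "nothing")
```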

Clone the project

git clone https://github.com/egorsmkv/asr-corpus-creator
cd asr-corpus-creator

Installation

On macOS

Install grpcio from a prebuilt wheel (the example below is for Python 3.10 on Apple Silicon):

pip install https://github.com/pietrodn/grpcio-mac-arm-build/releases/download/1.50.0/grpcio-1.50.0-cp310-cp310-macosx_11_0_arm64.whl

NOTE: See the releases page of https://github.com/pietrodn/grpcio-mac-arm-build for wheels built for other Python versions.

Install dependencies & activate virtual environment

# install pipenv if you do not use it already
pip install pipenv

# activate virtual environment
pipenv shell

# install Cython first
pip install Cython==0.29.32

# install dependencies
pipenv install

# or install the dependencies with pip to see the installation progress
pip install Django==4.1.3 django-bootstrap4==22.2 celery==5.2.7 redis==4.3.3 librosa==0.9.1 torch==1.13.0 torchaudio==0.13.0 pyzmq==23.1.0 transformers==4.20.1 loguru==0.6.0 psutil==5.9.1 pyctcdecode==0.3.0 'nemo-toolkit[asr]==1.12.0' git+https://github.com/openai/whisper.git lingua-language-detector==1.1.3 git+https://github.com/csteinmetz1/pyloudnorm

Apply migrations

python source/manage.py migrate

Create an account

python source/manage.py createsuperuser

Running

WebUI

python source/manage.py runserver

After starting the WebUI, open http://127.0.0.1:8000/ in your browser and log in with the credentials you created in the previous step.

Celery

Run the Redis broker via Docker:

docker run -d -p 6390:6379 redis
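The command above maps the container's Redis port 6379 to port 6390 on the host. If the worker later fails to connect, a quick reachability check can rule out the broker. This is a sketch with a hypothetical helper, `port_is_open`; it only tests that something accepts TCP connections on the port, not that it is actually Redis.

```python
import socket


def port_is_open(host: str = "127.0.0.1", port: int = 6390,
                 timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


print("Broker reachable:", port_is_open())
```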

Run the worker (with youtube-dl):

cd source
HF_TOKEN=xxxx WGET_PATH=/opt/homebrew/bin/wget YOUTUBE_DL=/opt/homebrew/bin/youtube-dl FFMPEG_PATH=/Users/yehorsmoliakov/opt/miniconda3/bin/ffmpeg celery -A app worker -l INFO --concurrency 1

Run the worker (with yt-dlp):

cd source
HF_TOKEN=xxxx WGET_PATH=/opt/homebrew/bin/wget YT_DLP=/opt/homebrew/bin/yt-dlp FFMPEG_PATH=/Users/yehorsmoliakov/opt/miniconda3/bin/ffmpeg celery -A app worker -l INFO --concurrency 1

Run the worker with audio language detection:

cd source
HF_TOKEN=xxxx DETECT_AUDIO_LANG=yes WGET_PATH=/opt/homebrew/bin/wget YOUTUBE_DL=/opt/homebrew/bin/youtube-dl FFMPEG_PATH=/Users/yehorsmoliakov/opt/miniconda3/bin/ffmpeg celery -A app worker -l INFO --concurrency 1

Increase the --concurrency value if you need higher throughput.
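The worker commands above hard-code machine-specific paths. As a sketch of one way to avoid that, the hypothetical helper below fills the same environment variables (`HF_TOKEN`, `WGET_PATH`, `FFMPEG_PATH`, `YT_DLP` or `YOUTUBE_DL`, `DETECT_AUDIO_LANG`) from whatever is found on PATH. Preferring yt-dlp over youtube-dl is an assumption made here, not something the project prescribes.

```python
import shutil
from typing import Callable, Dict, Optional


def worker_env(hf_token: str, detect_audio_lang: bool = False,
               which: Callable[[str], Optional[str]] = shutil.which) -> Dict[str, str]:
    """Build the environment variables the Celery worker commands use."""
    env = {"HF_TOKEN": hf_token}
    for var, tool in (("WGET_PATH", "wget"), ("FFMPEG_PATH", "ffmpeg")):
        path = which(tool)
        if path is None:
            raise FileNotFoundError(f"{tool} not found on PATH")
        env[var] = path
    # Assumption: prefer yt-dlp and fall back to youtube-dl.
    for var, tool in (("YT_DLP", "yt-dlp"), ("YOUTUBE_DL", "youtube-dl")):
        path = which(tool)
        if path is not None:
            env[var] = path
            break
    else:
        raise FileNotFoundError("neither yt-dlp nor youtube-dl found on PATH")
    if detect_audio_lang:
        env["DETECT_AUDIO_LANG"] = "yes"
    return env
```

To start the worker with it, merge the result into the current environment, for example `subprocess.run(["celery", "-A", "app", "worker", "-l", "INFO", "--concurrency", "1"], cwd="source", env={**os.environ, **worker_env("xxxx")})`.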

ZeroMQ server

Choose one server.

Whisper

WHISPER_LANG=en WHISPER_MODEL=base python zmq_server_whisper.py

wav2vec2

python zmq_server_w2v2.py

NeMo

python zmq_server_nemo.py

With a language model

Download unigrams.txt and lm.binary files.

Then, run as:

USE_LM=yes LM_UNIGRAMS_FILE=unigrams.txt LM_FILE=lm.binary NEMO_MODEL=theodotus/stt_uk_squeezeformer_ctc_ml python zmq_server_nemo.py

Push new files from a folder to processing

Structure of the command:

python source/manage.py push_to_processing <collection_key> <lang> <folder_path>

Example:

python source/manage.py push_to_processing cv10 uk /Users/yehorsmoliakov/Downloads/test-folder

You can set up a cron job that runs the above command to push new files periodically.
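As an illustration of such a scheduled push, the sketch below builds the `push_to_processing` command and a matching crontab line. `push_command` and `crontab_line` are hypothetical helpers, and the schedule and project directory in the usage example are placeholders. In a real crontab you would likely point `python` at the interpreter of your virtual environment.

```python
import shlex
from typing import List


def push_command(collection_key: str, lang: str, folder_path: str,
                 python: str = "python") -> List[str]:
    """Build the argv for the push_to_processing management command."""
    return [python, "source/manage.py", "push_to_processing",
            collection_key, lang, folder_path]


def crontab_line(schedule: str, project_dir: str, argv: List[str]) -> str:
    """Format a crontab entry that runs argv from the project directory."""
    command = " ".join(shlex.quote(arg) for arg in argv)  # quote spaces safely
    return f"{schedule} cd {shlex.quote(project_dir)} && {command}"


argv = push_command("cv10", "uk", "/data/new files")
print(crontab_line("*/10 * * * *", "/opt/asr-corpus-creator", argv))
```

Quoting each argument with `shlex.quote` keeps folder paths that contain spaces intact when cron passes the line to the shell.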

Task monitoring

To monitor running tasks, install Flower:

pip install flower

Then run it as follows:

cd source
nohup celery -A app flower -l INFO --concurrency 1 --port=5566 > /tmp/flower.log 2> /tmp/flower_errors.log < /dev/null &

Then open http://localhost:5566 to browse Flower.

Export the data

FILES_DIR=/Users/yehorsmoliakov/Work/asr-corpus-creator/source/content/media/audios/ python source/manage.py export_utterances_as_jsonl test_it2 > records.jsonl

test_it2 is the collection_key argument.
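The export is written as JSON Lines, one JSON object per line. A minimal sketch for reading it back follows. `read_jsonl` is a hypothetical helper, and it makes no assumption about which fields each record contains.

```python
import json
from typing import Any, Dict, Iterator, TextIO


def read_jsonl(stream: TextIO) -> Iterator[Dict[str, Any]]:
    """Yield one parsed record per non-empty line of a JSON Lines stream."""
    for line_no, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_no} is not valid JSON") from exc
```

For example, `with open("records.jsonl", encoding="utf-8") as f: records = list(read_jsonl(f))` loads the whole export into memory, while iterating over `read_jsonl(f)` directly processes it record by record.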

Classify utterances to get content audio type

python source/manage.py classify_utterances <collection_type> <device_id>

# for example:
python source/manage.py classify_utterances yt-split7 cpu
python source/manage.py classify_utterances yt-split7 cuda:0

Or to classify all utterances:

python source/manage.py classify_all_utterances <device_id>

# for example:
python source/manage.py classify_all_utterances cpu
python source/manage.py classify_all_utterances cuda:0

These commands classify utterances with the AST (Audio Spectrogram Transformer) model to determine the content type (Speech, Music, Bicycle bell, etc.) of each audio file.
