To generate a dataset for voice synthesis, we need labelled audio clips of the target speaker. A good starting point for this data is an audiobook: the audio is clean, and we can typically extract several hours of speech from a single book.
Steps:
- Gather audio/text from an audiobook
- Force align the text and audio (a sketch of this step follows the list)
- Generate clips
- Analyse dataset (optional)
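Forced alignment is the step that does most of the work: given the full audio and the full text, it estimates the start and end time of each sentence. create_dataset.py handles this for you, but as a rough illustration of the technique, here is a minimal sketch using the aeneas library (an assumed choice for illustration, not necessarily what the script uses internally):

```python
from aeneas.executetask import ExecuteTask
from aeneas.task import Task

# Align book.wav against book.txt (one sentence per line) and write a
# JSON sync map with begin/end timestamps for each line.
config = u"task_language=eng|is_text_type=plain|os_task_file_format=json"
task = Task(config_string=config)
task.audio_file_path_absolute = "book.wav"
task.text_file_path_absolute = "book.txt"
task.sync_map_file_path_absolute = "alignment.json"

ExecuteTask(task).execute()
task.output_sync_map_file()
```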
First, we need to get an audiobook and extract its audio and text. The two best sources I've found for audiobooks are Audible and LibriVox.
Audible books are licensed by Audible and need to be purchased before use. For this project you will need to look for Kindle books with audio narration.
Once you have a book, we need to convert the Audible AAX audio into a WAV file. To do this, find where the Audible app has saved the file and then use a tool such as AaxAudioConverter to convert it.
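Alternatively, if you prefer a scriptable route, ffmpeg can decode AAX files when given the activation bytes for your Audible account. A minimal sketch, assuming ffmpeg is on your PATH and you have already extracted your own activation bytes (the hex value below is a placeholder):

```python
import subprocess

# Decode the AAX file to WAV; "1a2b3c4d" is a placeholder for your
# account's activation bytes, which you must extract yourself.
subprocess.run(
    [
        "ffmpeg",
        "-activation_bytes", "1a2b3c4d",
        "-i", "book.aax",
        "book.wav",
    ],
    check=True,  # raise if ffmpeg fails, e.g. wrong activation bytes
)
```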
Then extract the text using the Chrome extension in the 'extension' folder. Steps on how to use this can be found in step 1 of the app.
Whilst LibriVox is free to use, its quality is generally less consistent. However, if you find a book with both audio and text, you can use it in exactly the same way as the Audible method.
Once we have the text and audio of an audiobook, we need to align the two and produce snippets of speech with labels. To do this, run create_dataset.py:

```
python create_dataset.py --audio_path book.wav --text_path book.txt --output_path wavs --label_path metadata.csv
```
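Conceptually, this step takes each aligned sentence, cuts the matching span out of the audio, and appends a line to the label file. The sketch below is a simplified, hypothetical version of that loop; it assumes an aeneas-style alignment.json as produced above and a simple filename|transcript label format, and the real script's internals and output layout may differ:

```python
import json
import os

from pydub import AudioSegment

audio = AudioSegment.from_wav("book.wav")
os.makedirs("wavs", exist_ok=True)

with open("alignment.json") as f:
    fragments = json.load(f)["fragments"]

with open("metadata.csv", "w", encoding="utf-8") as labels:
    for i, frag in enumerate(fragments):
        # aeneas reports begin/end in seconds; pydub slices in milliseconds
        start_ms = int(float(frag["begin"]) * 1000)
        end_ms = int(float(frag["end"]) * 1000)
        name = f"{i:04d}.wav"
        audio[start_ms:end_ms].export(os.path.join("wavs", name), format="wav")
        labels.write(f"{name}|{' '.join(frag['lines'])}\n")
```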
To see a breakdown of the key stats of your dataset, run:

```
python analysis.py --wavs wavs --metadata metadata_clean.csv
```
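For a quick sense of what the analysis covers, the headline numbers can be computed in a few lines. A minimal sketch, assuming pipe-delimited filename|transcript metadata and WAV clips in the wavs folder (if your metadata stores bare ids, append ".wav" to the name first):

```python
import csv
import os
import wave

durations = []
char_counts = []

with open("metadata_clean.csv", encoding="utf-8") as f:
    for row in csv.reader(f, delimiter="|", quoting=csv.QUOTE_NONE):
        name, text = row[0], row[1]
        # Clip duration = frames / sample rate
        with wave.open(os.path.join("wavs", name), "rb") as w:
            durations.append(w.getnframes() / w.getframerate())
        char_counts.append(len(text))

print(f"Clips:           {len(durations)}")
print(f"Total duration:  {sum(durations) / 3600:.2f} h")
print(f"Mean clip:       {sum(durations) / len(durations):.2f} s")
print(f"Mean transcript: {sum(char_counts) / len(char_counts):.0f} chars")
```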