# Building the dataset

To generate a dataset for voice synthesis, we need labelled audio clips of the target speaker. A good starting point is an audiobook, as the audio is clear and we can typically extract several hours of data.

Steps:

  1. Gather audio/text from an audiobook
     - Audible method
     - LibriVox method
  2. Force align the text and audio
  3. Generate clips
  4. Analyse the dataset (optional)

## Gather audio/text

First, we need to get an audiobook and extract its audio and text. The two best sources I've found for audiobooks are Audible and LibriVox.

### Audible

Audible books are licensed by Audible and need to be purchased before use. For this project, look for Kindle books with audio narration.

Once you have one, convert the Audible AAX audio into a WAV file. To do this, find where the Audible app has saved the file, then use a tool such as AaxAudioConverter to convert it.
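If you prefer a scriptable route, ffmpeg can also decrypt AAX files given your account's Audible activation bytes. A minimal sketch of building that command (the `"1a2b3c4d"` activation bytes and the 22050 Hz mono output settings here are placeholder assumptions, not values from this project):

```python
import subprocess

def aax_to_wav_cmd(aax_path, wav_path, activation_bytes):
    """Build an ffmpeg command that decrypts an Audible AAX file to WAV.

    activation_bytes is the per-account Audible key (8 hex characters);
    ffmpeg's AAX demuxer needs it to decrypt the container.
    """
    return [
        "ffmpeg",
        "-activation_bytes", activation_bytes,
        "-i", aax_path,
        "-ar", "22050",   # assumed sample rate, common for TTS datasets
        "-ac", "1",       # downmix to mono
        wav_path,
    ]

cmd = aax_to_wav_cmd("book.aax", "book.wav", "1a2b3c4d")
# subprocess.run(cmd, check=True)  # uncomment once ffmpeg is installed
print(" ".join(cmd))
```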

Then extract the text using the Chrome extension in the 'extension' folder. Steps on how to use it can be found in step 1 of the app.

### LibriVox

Whilst LibriVox is open source, its quality is generally less consistent. However, if you find a book with both audio and text, you can use it in the same way as the Audible method.

## Force align text and audio

Once we have the text and audio of an audiobook, we need to align the two and produce labelled snippets of speech. To do this, run create_dataset.py:

```
python create_dataset.py --audio_path book.wav --text_path book.txt --output_path wavs --label_path metadata.csv
```
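The clip-generation part of this step boils down to cutting the WAV at the aligned boundaries and writing a pipe-delimited label file. This is a minimal sketch of that idea, not the actual internals of create_dataset.py: it assumes alignment is already available as (start_sec, end_sec, text) triples, and the `write_clips` helper and the synthetic silent WAV are made up for illustration.

```python
import csv
import os
import tempfile
import wave

def write_clips(audio_path, alignment, output_dir, label_path):
    """Cut a WAV into one clip per (start_sec, end_sec, text) alignment
    entry and write a pipe-delimited metadata file (clip_id|text)."""
    os.makedirs(output_dir, exist_ok=True)
    rows = []
    with wave.open(audio_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        for i, (start, end, text) in enumerate(alignment):
            src.setpos(int(start * rate))
            frames = src.readframes(int((end - start) * rate))
            clip_id = f"clip_{i:04d}"
            clip_path = os.path.join(output_dir, clip_id + ".wav")
            with wave.open(clip_path, "wb") as dst:
                dst.setparams(params)  # same rate/width/channels as source
                dst.writeframes(frames)
            rows.append((clip_id, text))
    with open(label_path, "w", newline="") as f:
        csv.writer(f, delimiter="|").writerows(rows)
    return rows

# Demo on a synthetic 2-second silent mono WAV (22050 Hz, 16-bit).
workdir = tempfile.mkdtemp()
book_wav = os.path.join(workdir, "book.wav")
with wave.open(book_wav, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(22050)
    w.writeframes(b"\x00\x00" * 22050 * 2)

alignment = [(0.0, 1.0, "hello there"), (1.0, 2.0, "general kenobi")]
rows = write_clips(book_wav, alignment, os.path.join(workdir, "wavs"),
                   os.path.join(workdir, "metadata.csv"))
```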

## Optional: Analyse dataset

To see a breakdown of the key stats of your dataset, run:

```
python analysis.py --wavs wavs --metadata metadata_clean.csv
```
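The kind of stats this step reports can be computed from the clip directory and metadata file alone. A minimal sketch of such an analysis (the `dataset_stats` helper and the synthetic demo data are assumptions for illustration, not analysis.py itself):

```python
import os
import statistics
import tempfile
import wave

def dataset_stats(wav_dir, metadata_path):
    """Summarise a clip dataset: clip count, total audio duration,
    and mean text length from a pipe-delimited metadata file."""
    durations = []
    for name in sorted(os.listdir(wav_dir)):
        if name.endswith(".wav"):
            with wave.open(os.path.join(wav_dir, name), "rb") as w:
                durations.append(w.getnframes() / w.getframerate())
    with open(metadata_path) as f:
        texts = [line.split("|", 1)[1].strip() for line in f if "|" in line]
    return {
        "clips": len(durations),
        "total_seconds": sum(durations),
        "mean_clip_seconds": statistics.mean(durations) if durations else 0.0,
        "mean_text_chars": statistics.mean(len(t) for t in texts) if texts else 0.0,
    }

# Demo on a tiny synthetic dataset: two silent clips plus matching labels.
workdir = tempfile.mkdtemp()
wav_dir = os.path.join(workdir, "wavs")
os.makedirs(wav_dir)
for i, secs in enumerate([1.0, 2.0]):
    with wave.open(os.path.join(wav_dir, f"clip_{i}.wav"), "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(22050)
        w.writeframes(b"\x00\x00" * int(22050 * secs))
with open(os.path.join(workdir, "metadata.csv"), "w") as f:
    f.write("clip_0|ab\nclip_1|abcd\n")

stats = dataset_stats(wav_dir, os.path.join(workdir, "metadata.csv"))
```

Useful things to look for in the real output include clips that are unusually long or short and labels that are empty, both of which tend to hurt training.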