T-FOLEY: A Controllable Waveform-Domain Diffusion Model for Temporal-Event-Guided Foley Sound Synthesis
Yoonjin Chung*, Junwon Lee*, Juhan Nam
This repository contains the implementation of the paper, T-FOLEY: A Controllable Waveform-Domain Diffusion Model for Temporal-Event-Guided Foley Sound Synthesis, accepted at ICASSP 2024.
In our paper, we propose T-Foley, a temporal-event-guided waveform generation model for Foley sound synthesis, which generates high-quality audio conditioned on both the sound class and the temporal events that specify when the sound should occur.
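For intuition, the temporal event condition can be thought of as an envelope-like feature computed from a reference sound. Below is a minimal sketch that computes an RMS-style energy envelope with PyTorch; the frame and hop sizes are illustrative assumptions, not the values used in this repository.

```python
# Sketch: RMS-style energy envelope of a mono waveform.
# frame_size/hop_size are illustrative, not the repository's settings.
import torch

def rms_envelope(wav: torch.Tensor, frame_size: int = 512, hop_size: int = 128) -> torch.Tensor:
    """wav: (num_samples,) mono waveform in [-1, 1]."""
    frames = wav.unfold(0, frame_size, hop_size)  # (num_frames, frame_size)
    return frames.pow(2).mean(dim=-1).sqrt()      # (num_frames,)

wav = torch.randn(22050)        # stand-in for a 1-second reference sound
print(rms_envelope(wav).shape)  # one energy value per frame
```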
To get started, prepare the code and Python environment:
- Clone this repository:

  ```bash
  $ git clone https://github.com/YoonjinXD/T-foley.git
  $ cd ./T-foley
  ```
- Install the required dependencies:

  ```bash
  # (Optional) Create a conda virtual environment
  $ conda create -n tfoley python=3.8.0
  $ conda activate tfoley
  # Install dependencies with pip. Choose the appropriate CUDA version.
  $ pip install torch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 --index-url https://download.pytorch.org/whl/cu118
  $ pip install -r requirements.txt
  ```
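After installing, you can optionally verify that the pinned PyTorch build sees your GPU:

```python
# Optional sanity check for the installed versions and CUDA visibility.
import torch
import torchaudio

print(torch.__version__, torchaudio.__version__)
print("CUDA available:", torch.cuda.is_available())
```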
To train and evaluate our model, we used the DCASE 2023 Challenge Task 7 dataset, which was constructed for Foley sound synthesis. To evaluate our model, we also used subsets of VocalImitationSet and VocalSketch. These vocal-imitation sets consist of vocal recordings that mimic event-based or environmental sounds. Click the links above to download the corresponding datasets.
To perform inference using our model, follow these steps:
- Download the pre-trained model weights and configuration from the following link: pretrained.zip.

  ```bash
  $ wget https://zenodo.org/records/10826692/files/pretrained.zip
  ```
- Unzip and place the downloaded model weights and config JSON file in the `./pretrained` directory:

  ```bash
  $ unzip pretrained.zip
  ```
- Run the inference script:

  ```bash
  $ python inference.py --class_name "DogBark"
  ```

  The `class_name` must be one of the class names in the DCASE 2023 Task 7 dataset: `"DogBark"`, `"Footstep"`, `"GunShot"`, `"Keyboard"`, `"MovingMotorVehicle"`, `"Rain"`, `"Sneeze_Cough"`. (A sketch for batch-generating all classes follows these steps.)
- The generated samples will be saved in the `./results` directory.
- For FAD evaluation, we used this toolkit: FAD toolkit.
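As referenced above, a hypothetical batch run that generates samples for every class via the same `inference.py` interface:

```python
# Hypothetical batch run over all seven DCASE 2023 Task 7 classes.
import subprocess

CLASSES = ["DogBark", "Footstep", "GunShot", "Keyboard",
           "MovingMotorVehicle", "Rain", "Sneeze_Cough"]
for cls in CLASSES:
    subprocess.run(["python", "inference.py", "--class_name", cls], check=True)
```

For FAD, here is a minimal sketch assuming the commonly used `frechet-audio-distance` pip package; the linked toolkit may expose a different API, so consult its README. The reference directory path is a placeholder:

```python
# Minimal FAD sketch; assumes `pip install frechet-audio-distance`.
from frechet_audio_distance import FrechetAudioDistance

fad = FrechetAudioDistance(model_name="vggish", use_pca=False,
                           use_activation=False, verbose=False)
# Compare a reference (background) set against the generated samples.
score = fad.score("path/to/reference_wavs", "./results")
print("FAD:", score)
```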
To train the T-Foley model, follow these steps:
- Download and unzip the DCASE 2023 Task 7 dataset. Due to a mismatch between the provided CSV and the actual data files, build valid filelists (.txt) with the provided scripts:

  ```bash
  $ wget http://zenodo.org/records/8091972/files/DCASE_2023_Challenge_Task_7_Dataset.tar.gz
  $ tar -zxvf DCASE_2023_Challenge_Task_7_Dataset.tar.gz
  $ sh rename_dirs.sh
  $ sh make_filelist.sh
  ```
  If you use another dataset, prepare a file path list of your training data in `.txt` format and point `params.py` to it (a hypothetical helper sketch follows these steps).
- Run the training:

  ```bash
  $ python train.py
  ```

  This starts the training process and saves the trained model weights in the `logs/` directory. To monitor training with TensorBoard, run:

  ```bash
  $ tensorboard --logdir logs/
  ```
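As referenced above, a hypothetical helper for building a filelist from a custom dataset; the dataset root and output filename are assumptions:

```python
# Hypothetical filelist builder for a custom dataset.
# Point params.py at the .txt file this script writes.
from pathlib import Path

data_dir = Path("my_dataset")  # assumed root directory of your .wav files
paths = sorted(str(p) for p in data_dir.rglob("*.wav"))
Path("custom_filelist.txt").write_text("\n".join(paths) + "\n")
print(f"Wrote {len(paths)} paths to custom_filelist.txt")
```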
If you use this work, please cite:

```bibtex
@inproceedings{t-foley,
  title={T-FOLEY: A Controllable Waveform-Domain Diffusion Model for Temporal-Event-Guided Foley Sound Synthesis},
  author={Chung, Yoonjin and Lee, Junwon and Nam, Juhan},
  booktitle={ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2024},
  organization={IEEE}
}
```
This project is licensed under the MIT License. See the LICENSE file for more information.