Source of Dataset: Free Spoken Digit Dataset (FSDD)
The construction of ObfNet involves three phases:
-
Preparation phase
The coordinator trains an inference model using public data samples or it can simply get a well-trained inference deep model from elsewhere. -
Training phase
The coordinator trains the concatenation of the ObfNet and the pre-trained inference model,during which the parameters of the pre-trained inference model are fixed. The coordinator can train multiple ObfNets and send them all to each participant. -
Inference phase
Each participant randomly can choose one ObfNet,execute it and then send the obfuscated results to the coordinator for inference.
In the FSDD dataset, there are 2000 audio samples (wav files). As for the data processing, we firstly use Mel-Frequency Cepstrum Coefficient (MFCC) to convert the wav files to 2-D images with the size 20*45. We suppose that the inference model at the coordinator takes the MFCC representations (images) as the input. ObfNet is then used to obfuscate the original MFCC representations (images). We split the data as 90% train-10% test. We tried two different structures of ObfNet when using two distinct inference models:
-
The structures of ObfNets:
i. ObfNet_conv: it includes two convolutional layers, one maxpooling layer and one dense layer with ReLu activation.
ii. ObfNet_fc: it includes two dense layers with ReLu activation. -
The structures of inference models:
i. infmodel_conv: it includes three convolutional layers, one maxpooling layer and three dense layers with ReLu activation.
ii. infmodel_fc: it includes five dense layers with ReLu activation.
In order to make a comparison between the original MFCC representations and the obfuscated ones, we do the MFCC inverse using the Python package LibROSA to convert MFCC representations back to audio (wav files). The audios converted from the original MFCC representations can still be recognized by humans despite some noise. In contrast, the audios converted from the obfuscated MFCC representations are unrecognizable by humans. Here, the MFCC inversed audios of 20 ramdom original MFCC representations in the test set and the corresponding obfuscated MFCC representations are shown.
For the inversed audios from original MFCC representations, files are named in the following format:mfcc_inverse_test_{index} _ label _ {digit label}.wav. They are in the folder mfcc_inverse_original.
For the inversed audios from obfuscated MFCC representations, files are named in the following format: inverse_obfnet_{conv/fc}+infmodel_{conv/fc}_ test _ {index} _ label _ {digit label}.wav.
For example, inverse_obfnet_conv+infmodel_conv_test_0_label_8.wav means that:
a) The involved ObfNet has the structure of 'ObfNet_conv';
b) In the training phase, the ObfNet is concatenated with the inference model with the structure of 'infmodel_conv';
c) The index of the MFCC representation in the test set is 0;
d) The true label of the MFCC representation is 8;
The obfuscated results of the four obfNets are shown in the following four different folders:
The audio inversed from the original MFCC representations:
mfcc_inverse_test_0_label_8.wav
In the beginning part of the audio inversed from the raw MFCC representations, we can hear the content "eight". Note that there are also noises thereafter due to the necessary constant padding to the MFCC representations in the data processing step in order to unify the sizes of all the MFCC representations.
The audios inversed from the corresponding obfuscated MFCC representations by the four ObfNets:
inverse_obfnet_conv+infmodel_conv_test_0_label_8.wav
inverse_obfnet_conv+infmodel_fc_test_0_label_8.wav
inverse_obfnet_fc+infmodel_conv_test_0_label_8.wav
inverse_obfnet_fc+infmodel_fc_test_0_label_8.wav
These audios inversed from the corresponding obfuscated MFCC representations are totally unrecognizable by humans.