This project trained a Long Short-Term Memory (LSTM) recurrent neural network (RNN) on 54 hours of speech from 6 different languages to classify speech samples by language.
- To determine which language is being spoken in a speech sample
- Humans recognize a language through perceptual processes inherent in the auditory system
- The aim is to replicate this human ability through computational means
- How can the world's diverse spoken languages be distinguished scientifically so that speech samples are classified correctly?
- LSTM RNNs are a good choice for classifying speech samples because they can effectively exploit temporal dependencies in acoustic data, and they reportedly perform better than deep neural networks (DNNs)
- Subset of the Mozilla Common Voice speech dataset
- MP3 files converted to 16 kHz waveforms
- Volume normalized to -3 dBFS
- 54 hours of speech
- 6 different languages
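The volume normalization step can be illustrated with a short sketch. It assumes peak normalization (the source does not say whether the -3 dBFS target refers to peak or RMS level) and a floating-point waveform in which an amplitude of 1.0 corresponds to 0 dBFS. The function name `normalize_peak` and the test tone are hypothetical.

```python
import numpy as np

def normalize_peak(signal, target_dbfs=-3.0):
    """Scale a float waveform so its peak sits at target_dbfs (0 dBFS = amplitude 1.0)."""
    signal = np.asarray(signal, dtype=np.float64)
    peak = np.max(np.abs(signal))
    if peak == 0:
        return signal  # silent input: nothing to scale
    target_amplitude = 10.0 ** (target_dbfs / 20.0)  # -3 dBFS is roughly 0.708
    return signal * (target_amplitude / peak)

# Illustrative input: one second of a quiet 440 Hz tone sampled at 16 kHz
rate = 16000
t = np.arange(rate) / rate
tone = 0.1 * np.sin(2 * np.pi * 440 * t)
normalized_tone = normalize_peak(tone)
```

After scaling, the loudest sample of `normalized_tone` has an absolute value of about 0.708, whatever the level of the input was.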
- Step 1: Data preprocessing and feature extraction using MFCC
- Step 2: Classifier training using CNN and LSTM
- Step 3: Model Evaluation
- Python
- Keras with TensorFlow
- Split .wav files into equal-length clips of 3 seconds
- Generate MFCC features with a 1-second sliding window
- Normalize the data using a min-max scaler
- Label the audio clips
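#Assumed imports and right-hand sides; the file name is a hypothetical placeholder
from scipy.io import wavfile
from python_speech_features import mfcc
from sklearn.preprocessing import MinMaxScaler
(rate, sig) = wavfile.read("clip.wav")
mfcc_feat = mfcc(sig, rate)
scaler = MinMaxScaler().fit(mfcc_feat)
normalized = scaler.transform(mfcc_feat)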
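The splitting step listed above can be read in more than one way. One possible reading is that each recording is cut into 3-second clips whose start points advance by 1 second. The sketch below follows that assumption. The function name `split_into_windows`, its parameters, and the all-zero placeholder signal are hypothetical, and any trailing audio shorter than a full window is dropped.

```python
import numpy as np

def split_into_windows(signal, rate, window_s=3, hop_s=1):
    """Return equal-length clips of window_s seconds, starting every hop_s seconds."""
    window = int(window_s * rate)  # clip length in samples
    hop = int(hop_s * rate)        # distance between clip starts in samples
    clips = []
    for start in range(0, len(signal) - window + 1, hop):
        clips.append(signal[start:start + window])
    return clips

rate = 16000
signal = np.zeros(5 * rate)  # placeholder for 5 seconds of audio
clips = split_into_windows(signal, rate)
```

With 5 seconds of audio this yields clips starting at 0, 1, and 2 seconds, each 48,000 samples long. A recording shorter than 3 seconds yields no clips.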
- Built using Keras and TensorFlow
- One LSTM layer with 200 units
- One dense layer with 6 units
- Softmax activation
- Learning rate: 0.001
- Early stopping using val_loss
#Setup constants
EPOCHS = 100
SAMPLES_PER_EPOCH = len(range(0, len(X_train_array), BATCH_SIZE))
#Declaring callbacks for early stopping
callbacks = [
    EarlyStopping(monitor='val_loss', min_delta=0.01, patience=5, mode='min')
]
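#Creating an LSTM with 200 units followed by a 6-unit softmax layer
model = Sequential()
model.add(LSTM(200, input_shape=(None, 13)))  #assumed input shape: 13 features per frame, inferred from the parameter count
model.add(Dense(6, activation='softmax'))
model.compile(optimizer=Adam(amsgrad=True, lr=0.001), loss='categorical_crossentropy', metrics=['accuracy'])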
#Print the model summary
model.summary()
#Fit the model (model.fit is assumed; the training arguments are not shown)
history = model.fit(
    ...,
    callbacks=callbacks,
    validation_data=(X_test, Y_test),
)
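The early stopping settings used here (`monitor='val_loss'`, `min_delta=0.01`, `patience=5`) mean that training stops once the validation loss has failed to improve by more than 0.01 for 5 consecutive epochs. The sketch below is a simplified pure-Python model of that rule, not Keras's actual implementation. The function name `early_stopping_epoch` and the loss values are made up for illustration.

```python
def early_stopping_epoch(val_losses, min_delta=0.01, patience=5):
    """Return the 1-based epoch at which training would stop, or None if it never stops."""
    best = float("inf")
    wait = 0  # epochs since the last sufficient improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            # Improvement larger than min_delta: record it and reset the counter
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Hypothetical validation losses: the loss plateaus after epoch 3
losses = [1.00, 0.80, 0.70, 0.695, 0.70, 0.695, 0.70, 0.71, 0.72]
stop_epoch = early_stopping_epoch(losses)
```

In this made-up run the last sufficient improvement happens at epoch 3, so the five non-improving epochs that follow end training at epoch 8, well before the `EPOCHS = 100` limit.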
#Model Summary
Layer (type)          Output Shape     Param #
lstm_12 (LSTM)        (None, 200)      171200
dense_11 (Dense)      (None, 6)        1206
Total params: 172,406
Trainable params: 172,406
Non-trainable params: 0
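The parameter counts in the summary can be checked by hand. A standard LSTM layer has four gates, each with an input weight matrix, a recurrent weight matrix, and a bias vector. A dense layer has one weight matrix and one bias vector. The sketch below reproduces the reported numbers under the assumption of 13 input features per frame. That value is inferred from the counts themselves and is not stated in the text. The helper function names are hypothetical.

```python
def lstm_param_count(units, input_dim):
    # Four gates, each with input weights, recurrent weights, and a bias vector
    return 4 * (units * input_dim + units * units + units)

def dense_param_count(inputs, outputs):
    # One weight per input-output pair plus one bias per output
    return inputs * outputs + outputs

lstm_params = lstm_param_count(units=200, input_dim=13)   # 171200
dense_params = dense_param_count(inputs=200, outputs=6)   # 1206
total_params = lstm_params + dense_params                 # 172406
```

The three values match the `lstm_12`, `dense_11`, and `Total params` lines of the summary.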
An accuracy of up to 83% was achieved.
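For a softmax classifier trained with one-hot labels, accuracy is the fraction of samples whose highest-probability class matches the true class. A minimal sketch of that calculation follows. The function name `accuracy` and the small arrays are made-up illustrations with 3 classes, not the project's data or results.

```python
import numpy as np

def accuracy(probabilities, one_hot_labels):
    """Fraction of rows where the most probable class equals the labeled class."""
    predicted = np.argmax(probabilities, axis=1)
    actual = np.argmax(one_hot_labels, axis=1)
    return float(np.mean(predicted == actual))

# Made-up softmax outputs and one-hot labels for 4 samples and 3 classes
probabilities = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.4, 0.3],
    [0.2, 0.2, 0.6],
])
labels = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
])
acc = accuracy(probabilities, labels)  # 3 of 4 correct
```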
This project was done together with @alvarorgaz for a course called Automatic Speech Recognition at Aalto University.