# Realignment of Spjallromur

The dataset was realigned using scripts in Kaldi and an acoustic model from Tiro ehf.
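The alignment scripts themselves are Kaldi recipes and are not reproduced here. As a rough illustration of what they produce, Kaldi forced alignment is typically exported as a word-level CTM file, where each line is `utterance-id channel start duration word`. A minimal sketch of reading such a file into per-utterance word timings (the file name `ali.ctm` is an assumption, not a path from this repository):

```python
from collections import defaultdict

def parse_ctm(path):
    """Read a Kaldi word-level CTM file, where each line is
    `utt_id channel start duration word`, into per-utterance timings."""
    timings = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            utt_id, _channel, start, duration, word = line.split()[:5]
            begin = float(start)
            timings[utt_id].append((word, begin, begin + float(duration)))
    return timings

# Print word-level start/end times for every utterance in the file.
for utt_id, words in parse_ctm("ali.ctm").items():
    for word, begin, end in words:
        print(f"{utt_id}\t{begin:.2f}\t{end:.2f}\t{word}")
```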

## Evaluation of alignments

The data was evaluated in two ways: first by decoding it with an ASR system, and second by manually listening to a few hundred segments.

### Evaluation by decoding

We used a fine-tuned Whisper model, [whisper-large-icelandic-30k-steps-1000h](https://huggingface.co/language-and-voice-lab/whisper-large-icelandic-30k-steps-1000h).

The average word error rate (WER) over the entire dataset is 38.63%. This model performs well on other test sets, so this is quite a high WER. Per-speaker results are in the table below; a sketch of this decoding step follows the table. There are some noticeable outliers in the data. For example, the speaker b_9a67bc98_30-39_f speaks Icelandic as a second language and has a non-native accent.

| Speaker ID | WER (%) |
| --- | ---: |
| b_9a67bc98_30-39_f | 163.713 |
| b_24c0c1b3_20-29_m | 70.577 |
| a_ebbc5293_20-29_m | 69.91 |
| a_2c1b4416_40-49_f | 65.217 |
| b_2a07b3a7_20-29_f | 62.906 |
| b_5f55950e_20-29_f | 62.162 |
| a_92a95e84_30-39_m | 61.454 |
| a_eda6925c_30-39_m | 60.44 |
| a_2ecf1db5_20-29_m | 58.198 |
| a_1939f519_20-29_m | 58.067 |
| b_92a95e84_30-39_m | 57.94 |
| a_b6ad9f96_20-29_f | 56.503 |
| b_dc0967ee_30-39_m | 54.291 |
| b_eda6925c_30-39_m | 52.725 |
| a_2f1655ff_40-49_f | 52.416 |
| a_01119679_40-49_f | 52.263 |
| b_81b2b35e_30-39_m | 49.907 |
| a_2a07b3a7_20-29_m | 49.237 |
| b_ebbc5293_20-29_m | 49.057 |
| a_0f2c315c_30-39_m | 47.826 |
| a_66ccf3bc_20-29_m | 45.914 |
| a_ad46e29b_20-29_m | 45.81 |
| a_05b30647_30-39_m | 43.48 |
| a_c3a7fbe9_20-29_m | 43.014 |
| b_aad7caab_30-39_m | 42.371 |
| b_1939f519_20-29_m | 41.837 |
| a_69079ee1_30-39_m | 41.693 |
| a_7faf84e8_30-39_m | 40.616 |
| a_44d73360_30-39_m | 40.37 |
| b_05b30647_30-39_f | 40.35 |
| a_de3b604f_20-29_m | 39.735 |
| b_a56ed5af_60-69_f | 39.606 |
| b_5331448b_30-39_m | 39.585 |
| b_188092d3_20-29_m | 39.216 |
| a_aad7caab_20-29_m | 39.211 |
| b_ccd0f1a6_30-39_f | 39.152 |
| b_deb42548_20-29_f | 38.751 |
| a_e1e7765a_20-29_f | 38.606 |
| b_de3b604f_20-29_m | 38.322 |
| a_879325a8_20-29_m | 37.622 |
| b_2ecf1db5_20-29_o | 37.089 |
| b_826b4d3d_40-49_m | 37.069 |
| b_2f1655ff_40-49_f | 36.728 |
| a_8edc23bf_20-29_m | 36.504 |
| b_8af8f246_20-29_m | 36.266 |
| b_2c1b4416_30-39_f | 36.173 |
| a_389f0bb5_20-29_f | 36.082 |
| a_b107d272_30-39_m | 35.953 |
| b_b107d272_20-29_m | 35.697 |
| b_44d73360_30-39_m | 35.394 |
| a_dc0967ee_30-39_m | 35.342 |
| a_ccd0f1a6_30-39_m | 35.044 |
| b_81dd3246_30-39_f | 35.007 |
| b_8edc23bf_30-39_m | 34.806 |
| a_8c25247b_20-29_m | 34.697 |
| b_bcb44230_30-39_m | 34.472 |
| b_2d219d50_30-39_m | 34.32 |
| a_a56ed5af_20-29_f | 34.271 |
| b_c3a7fbe9_20-29_m | 33.028 |
| b_e25bc38d_20-29_m | 32.798 |
| a_2d219d50_30-39_m | 32.659 |
| a_188092d3_30-39_f | 32.47 |
| a_2a139f9b_20-29_m | 32.451 |
| b_8c25247b_30-39_m | 32.444 |
| b_01119679_50-59_m | 32.237 |
| a_826b4d3d_30-39_m | 31.758 |
| b_69079ee1_30-39_m | 31.299 |
| a_deb42548_20-29_m | 31.108 |
| a_bbc3f248_20-29_m | 31.073 |
| b_879325a8_20-29_m | 31.041 |
| a_5f55950e_20-29_f | 30.853 |
| a_54ddefa8_30-39_f | 30.416 |
| a_3ac74ae1_40-49_f | 30.063 |
| b_389f0bb5_20-29_f | 29.851 |
| b_66ccf3bc_50-59_m | 29.564 |
| a_2284fd64_40-49_f | 29.095 |
| b_0f2c315c_30-39_f | 28.894 |
| b_54ddefa8_30-39_m | 28.075 |
| b_3ce6563e_20-29_m | 27.937 |
| a_f123a375_20-29_f | 27.273 |
| a_caa6301e_20-29_m | 26.718 |
| a_24c0c1b3_20-29_f | 26.46 |
| b_2284fd64_40-49_m | 25.86 |
| a_5331448b_30-39_m | 25.76 |
| a_50d1de3c_20-29_o | 25.136 |
| a_45eebf55_30-39_f | 24.843 |
| b_7faf84e8_30-39_f | 24.684 |
| a_81b2b35e_30-39_m | 24.47 |
| a_bcb44230_20-29_m | 24.108 |
| b_45eebf55_50-59_m | 24.051 |
| b_ad46e29b_20-29_m | 23.427 |
| a_997d4fe0_30-39_m | 23.231 |
| b_50d1de3c_20-29_f | 22.377 |
| a_3ce6563e_50-59_m | 22.217 |
| a_8af8f246_20-29_m | 22.178 |
| a_81dd3246_30-39_f | 20.583 |
| b_997d4fe0_30-39_m | 18.182 |
| b_bbc3f248_20-29_m | 17.854 |
| a_9a67bc98_80-89_m | 15.834 |
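The decoding script itself is not included here, but a minimal sketch of this kind of evaluation, assuming the aligned segments are available as WAV files with reference transcripts (the `segments` list below is a placeholder), could combine the `transformers` ASR pipeline with `jiwer`:

```python
from transformers import pipeline
import jiwer

# Load the fine-tuned Icelandic Whisper model from the Hugging Face Hub.
asr = pipeline(
    "automatic-speech-recognition",
    model="language-and-voice-lab/whisper-large-icelandic-30k-steps-1000h",
)

# Placeholder: (audio file, reference transcript) pairs for the segments.
segments = [
    ("segments/a_2c1b4416_40-49_f_0_18.86.wav", "reference transcript here"),
]

references, hypotheses = [], []
for wav_path, reference in segments:
    hypotheses.append(asr(wav_path)["text"])
    references.append(reference)

# Aggregate WER over all decoded segments. Text normalisation (casing,
# punctuation) is not shown here but strongly affects the reported numbers.
print(f"WER: {100 * jiwer.wer(references, hypotheses):.2f}%")
```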

### Manual evaluation by listening

For 60 of the 99 speakers, ~5 segments per speaker were randomly selected and their alignments evaluated manually (a sketch of such a sampling step follows the results table). Due to time constraints, not all 99 speakers were covered. In total, we evaluated 312 samples. The evaluation was very simple: the evaluator listened to each segment and marked the alignment as one of "Good" (all words in the audio match all words in the transcript), "Missing end" (the end of the segment was cut off), or "Missing beginning" (the beginning of the segment was cut off). The results show an average accuracy of 94.6%, and most of the alignment issues were minor.

| Evaluation | Count |
| --- | ---: |
| Good | 296 |
| Missing end | 12 |
| Missing beginning | 4 |
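The exact sampling procedure is not specified beyond the counts above; one way such a sample could be drawn (the function name, parameters, and seed are illustrative, not the script that was used):

```python
import random

def sample_for_listening(segments_by_speaker, n_speakers=60, per_speaker=5, seed=42):
    """Pick `n_speakers` speakers at random, then up to `per_speaker`
    random segments from each, for manual alignment checks."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(segments_by_speaker), n_speakers)
    sample = []
    for speaker in chosen:
        segments = segments_by_speaker[speaker]
        sample.extend(rng.sample(segments, min(per_speaker, len(segments))))
    return sample
```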

## Conclusion

The manual evaluation gives us confidence in the alignment, so something else seems to be causing the high WER of the Whisper model. The most obvious explanation is that the model was not trained on this kind of data. Conversational data differs a lot from traditional scripted ASR data: the speakers are in some cases not speaking clearly, the audio quality varies, the word choice is spontaneous, and the utterances are often fragmented thoughts rather than long, cohesive sentences. Some examples of this can be seen in the following samples.

**a_2c1b4416_40-49_f_0_18.86**

> `<UNK>`. Mér finnst þessi rauði þarna, þetta er dálítið villandi, við hliðina á. Þess vegna erum við ekki að senda þetta út á trilljón manns, við erum enn þá laga einhverja bögga.

Approximate translation: "`<UNK>`. I think this red one there, it's a bit misleading, next to. That's why we're not sending this out to a trillion people, we're still fixing some bugs."

**b_aad7caab_30-39_m_64_19.64**

> Já, já, já, já, nákvæmlega. Og þurfti að fara, ég man eftir því að þurfti að þú veist actually fara út og tékka á miklu meira dóti sko. Eitthvað svona lítil mission bara eitthvað. Þetta er svona. Ég veit það ekki.

Approximate translation: "Yes, yes, yes, yes, exactly. And had to go, I remember that [I] had to, you know, actually go out and check on much more stuff, you know. Some little mission or something. It's like this. I don't know."