Data Augmentation for Arabic Speech Recognition Based on End-to-End Deep Learning
International Journal of Intelligent Computing and Information Sciences
Article 4, Volume 21, Issue 2, July 2021, Pages 50-64
Document Type: Original Article
DOI: 10.21608/ijicis.2021.73581.1086
Authors
Hamzah Alsayadi 1; Abdelaziz Abdelhamid 2; Islam Hegazy 3; Zaki Taha 4
1 Department of Computer Science, Faculty of Computer and Information Sciences, Ain Shams University, Cairo, Egypt
2 Department of Computer Science, Faculty of Computer and Information Sciences, Ain Shams University, Cairo, Egypt
3 Faculty of Computer and Information Sciences
4 Faculty of Computers and Information Sciences, Ain Shams University
Abstract
End-to-end deep learning has greatly enhanced the performance of speech recognition systems. However, even with deep learning techniques, overfitting remains the main problem when little training data is available. Data augmentation is a suitable solution to the overfitting problem: it is adopted to increase the quantity of training data and to enhance the robustness of the models. In this paper, we investigate a data augmentation method for enhancing Arabic automatic speech recognition (ASR) based on end-to-end deep learning. Data augmentation is applied to the original corpus to increase the amount of training data by applying noise adaptation, pitch-shifting, and speed transformation. A CNN-LSTM and an attention-based encoder-decoder are used in building the acoustic model and in the decoding phase. This method is considered state-of-the-art in end-to-end deep learning, and to the best of our knowledge, no prior research has employed data augmentation with a CNN-LSTM and attention-based model in Arabic ASR systems. In addition, the language model is built using the RNN-LM and LSTM-LM methods. The Standard Arabic Single Speaker Corpus (SASSC) without diacritics is used as the original corpus. Experimental results show that applying data augmentation improves the word error rate (WER) compared with the same approach without data augmentation. The achieved average reduction in WER is 4.55%.
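Two of the augmentation operations named in the abstract, noise addition and speed transformation, can be sketched on a raw waveform with plain NumPy. This is an illustrative sketch only, not the authors' implementation; function names, the SNR parameter, and the interpolation-based resampling are our own assumptions. Pitch-shifting is omitted because doing it properly requires a phase vocoder (e.g. as provided by audio libraries such as librosa) rather than simple resampling.

```python
import numpy as np

def add_noise(signal, snr_db=20.0, rng=None):
    # Add white Gaussian noise at a target signal-to-noise ratio in dB.
    # (Illustrative stand-in for the paper's "noise adaptation" step.)
    rng = np.random.default_rng(0) if rng is None else rng
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise

def change_speed(signal, rate=1.1):
    # Speed transformation by linear-interpolation resampling:
    # rate > 1 shortens (speeds up) the clip, rate < 1 lengthens it.
    # Note this also shifts pitch; production systems use a phase
    # vocoder to change speed and pitch independently.
    n_out = int(len(signal) / rate)
    old_idx = np.arange(len(signal))
    new_idx = np.linspace(0, len(signal) - 1, n_out)
    return np.interp(new_idx, old_idx, signal)

# Example: augment a 1-second 440 Hz tone sampled at 16 kHz.
sr = 16000
tone = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
noisy = add_noise(tone, snr_db=20.0)   # same length, noise added
faster = change_speed(tone, rate=2.0)  # half the original length
```

Each transformed copy is added to the training set alongside the original utterance, which is how such augmentation increases the effective quantity of training data.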
Keywords
Arabic Speech Recognition; Data Augmentation; CNN-LSTM; RNN-LM; Attention-based Model