Speech Recognition Using Historian Multimodal Approach
The Egyptian Journal of Language Engineering
Article 4, Volume 6, Issue 2, September 2019, Pages 44-58
Document Type: Original Article
DOI: 10.21608/ejle.2019.59164
Authors
Eslam Eid Elmaghraby 1; Amr Refaat Gody 2; Mohamed Hashem Farouk 3
1Communication and Electronics Engineering Department, Faculty of Engineering, Fayoum University
2Faculty of Engineering, Fayoum University
3Engineering Mathematics & Physics Department, Faculty of Engineering, Cairo University
Abstract
This paper proposes an Audio-Visual Speech Recognition (AVSR) model that uses both audio and visual speech information to improve recognition accuracy in clean and noisy environments. Mel-frequency cepstral coefficients (MFCC) and the Discrete Cosine Transform (DCT) are used to extract effective features from the audio and visual speech signals, respectively. Classification is performed on the combined feature vector by one of the main Deep Neural Network (DNN) architectures, the Bidirectional Long Short-Term Memory (BiLSTM) network, in contrast to traditional Hidden Markov Models (HMMs). The effectiveness of the proposed model is demonstrated on a multi-speaker AVSR benchmark dataset named GRID. The experimental results show that early integration of audio and visual features yields a clear enhancement in recognition accuracy, and that BiLSTM is more effective than HMM as a classification technique. With integrated audio-visual features, the model achieved a highest recognition accuracy of 99.07% on clean data, an enhancement of up to 9.28% over audio-only recognition. On noisy data, the highest recognition accuracy with integrated audio-visual features is 98.47%, an enhancement of up to 12.05% over audio-only recognition. The main reason for BiLSTM's effectiveness is that it takes the sequential characteristics of the speech signal into account. The obtained results surpass the highest audio-visual recognition accuracies previously reported on GRID and demonstrate the robustness of our AVSR model (BiLSTM-AVSR).
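The early integration described in the abstract can be sketched as a per-frame concatenation of the audio and visual feature vectors before classification. The sketch below is illustrative only and is not the authors' code: the frame count and feature dimensions (13 MFCCs per frame, 64 DCT coefficients per frame) are assumptions, and random arrays stand in for real MFCC/DCT features, which in practice would come from an audio front end and a mouth-region image transform.

```python
import numpy as np

def early_integration(audio_feats, visual_feats):
    """Early (feature-level) fusion: concatenate each frame's audio and
    visual feature vectors into one combined vector per frame."""
    assert audio_feats.shape[0] == visual_feats.shape[0], "frame counts must match"
    return np.concatenate([audio_feats, visual_feats], axis=1)

# Illustrative stand-ins for real features (dimensions are assumptions):
rng = np.random.default_rng(0)
mfcc = rng.standard_normal((75, 13))   # e.g. 13 MFCCs per audio frame
dct = rng.standard_normal((75, 64))    # e.g. 64 DCT coefficients per video frame

fused = early_integration(mfcc, dct)
print(fused.shape)  # (75, 77): one 77-dimensional combined vector per frame
```

The fused sequence of combined vectors is what a sequence classifier such as a BiLSTM would then consume, one time step per frame.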
Keywords
DCT; MFCC; HMM; BiLSTM; GRID