An Efficient Speaker Diarization Pipeline for Conversational Speech
Benha Journal of Applied Sciences
Article 16, Volume 9, Issue 5, May 2024, Pages 141-146
Document Type: Original Research Papers
DOI: 10.21608/bjas.2024.284482.1414
Authors
Wael Ali Sultan
1 Department of Basic Engineering Sciences, Benha Faculty of Engineering, Benha University, Benha, Egypt
2 Information Technology Department, Faculty of Artificial Intelligence, Cairo University, Cairo, Egypt
Abstract
Accurate and efficient diarization of conversational speech remains challenging, particularly under significant speaker overlap and diverse acoustic conditions. This paper introduces a speaker diarization pipeline that improves both accuracy and efficiency on conversational speech. The pipeline comprises Voice Activity Detection (VAD), Speaker Overlap Detection (SOD), speaker separation, robust speaker embedding, clustering, and post-processing. VAD first discriminates between speech and non-speech segments, which reduces processing overhead for the later stages. SOD then identifies segments that contain overlapping speakers, and a speaker separation model separates the overlapping speech into distinct streams. A key enhancement is the integration of robust speaker embedding and clustering techniques, which capture speaker-specific characteristics to improve the grouping of speech segments. Finally, post-processing refines the segments to ensure temporal consistency and improve overall diarization accuracy. We evaluated the pipeline on multiple benchmark datasets and observed significant reductions in Diarization Error Rate (DER) compared to existing methods. The results affirm the effectiveness of incorporating detailed speaker embeddings and clustering in a diarization system, particularly for real-world conversational speech. The pipeline benefits applications that require accurate speaker attribution, such as automated transcription services, meeting analysis, and assistive communication technologies.
Keywords
speaker diarization; speaker separation; voice activity detection; optimization
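The abstract reports results in terms of Diarization Error Rate (DER) without restating how the metric is computed. As a reminder, DER is conventionally the sum of missed speech, false-alarm speech, and speaker confusion, divided by the total reference speech time. The sketch below is only an illustration of that conventional definition at the frame level, not the authors' evaluation code: the function name `diarization_error_rate`, the frame-based representation (one set of speaker labels per frame), and the brute-force label matching are all assumptions made for this example. Evaluation toolkits typically work on timed segments, apply a forgiveness collar, and use an assignment solver instead of enumerating permutations.

```python
from itertools import permutations


def diarization_error_rate(reference, hypothesis):
    """Frame-level DER for two equally long lists of speaker-label sets.

    Each list element is the set of speakers active in one frame, so
    overlapping speech is a set with more than one label. Labels are
    assumed to be strings. Hypothetical helper, for illustration only.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same number of frames")
    total_ref = sum(len(r) for r in reference)
    if total_ref == 0:
        raise ValueError("reference contains no speech")

    frames = list(zip(reference, hypothesis))
    # Missed speech and false alarms depend only on speaker counts per frame.
    missed = sum(max(0, len(r) - len(h)) for r, h in frames)
    false_alarm = sum(max(0, len(h) - len(r)) for r, h in frames)

    ref_labels = sorted(set().union(*reference))
    hyp_labels = sorted(set().union(*hypothesis))
    # Padding with None lets a hypothesis speaker stay unmatched.
    candidates = ref_labels + [None] * len(hyp_labels)

    # Try every one-to-one mapping of hypothesis labels onto reference labels
    # and keep the one with the least confusion. This is exponential in the
    # number of speakers, so it only suits small toy examples.
    best_confusion = None
    for assignment in permutations(candidates, len(hyp_labels)):
        mapping = dict(zip(hyp_labels, assignment))
        confusion = 0
        for r, h in frames:
            correct = sum(1 for s in h if mapping[s] in r)
            confusion += min(len(r), len(h)) - correct
        if best_confusion is None or confusion < best_confusion:
            best_confusion = confusion

    return {
        "missed": missed,
        "false_alarm": false_alarm,
        "confusion": best_confusion,
        "der": (missed + false_alarm + best_confusion) / total_ref,
    }


# Toy example: five frames, one overlapped frame, one frame of silence.
reference = [{"A"}, {"A"}, {"A", "B"}, {"B"}, set()]
hypothesis = [{"x"}, {"x"}, {"x"}, {"y"}, {"y"}]
result = diarization_error_rate(reference, hypothesis)
print(result)
```

In this toy example the system misses one speaker in the overlapped frame and hallucinates speech in the silent frame, so the DER is (1 + 1 + 0) / 5 = 0.4. The overlapped frame shows why overlap detection and separation matter for the metric: a system that emits a single speaker per frame is charged a miss for every overlapped reference frame.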