An Efficient Speaker Diarization Pipeline for Conversational Speech

Sultan, Wael Ali; Semary, Mourad Samir; Abdou, Sherif Mahdy

doi:10.21608/bjas.2024.284482.1414

Benha Journal of Applied Sciences
Article 16, Volume 9, Issue 5, May 2024, Pages 141-146
Document Type: Original Research Papers
Authors
Wael Ali Sultan¹*; Mourad Samir Semary¹; Sherif Mahdy Abdou²
¹Department of Basic Engineering Sciences, Benha Faculty of Engineering, Benha University, Benha, Egypt
²Information Technology Department, Faculty of Artificial Intelligence, Cairo University, Cairo, Egypt
Abstract
Accurate and efficient diarization of conversational speech remains a challenging task in audio signal processing, particularly when speakers overlap heavily and acoustic conditions vary. This paper introduces a speaker diarization pipeline that substantially improves both performance and efficiency in processing conversational speech. The pipeline comprises several key components: Voice Activity Detection (VAD), Speaker Overlap Detection (SOD), speaker separation, robust speaker embedding, clustering, and post-processing. VAD first discriminates between speech and non-speech segments, which reduces processing overhead. SOD then identifies segments that contain overlapping speakers, and a speaker separation model separates the overlapping speech into distinct streams. A pivotal enhancement in our pipeline is the integration of robust speaker embedding and clustering techniques, which capture speaker-specific characteristics to improve the grouping of speech segments. Finally, the post-processing stage refines these segments to ensure temporal consistency and improve overall diarization accuracy. We evaluated the pipeline on multiple benchmark datasets, demonstrating significant reductions in Diarization Error Rate (DER) compared with existing methods. The results affirm the effectiveness of incorporating detailed speaker embeddings and clustering in a diarization system, particularly for real-world conversational speech. The pipeline benefits applications that require accurate speaker attribution, such as automated transcription services, meeting analysis, and assistive communication technologies.
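To make the post-processing stage more concrete, the following Python sketch shows one common way to enforce temporal consistency: merging consecutive segments that carry the same speaker label and are separated by only a short pause. The abstract does not specify the refinement rule actually used, so the `Segment` type, the `merge_segments` function, and the 0.3 s `max_gap` threshold are all illustrative assumptions rather than the authors' method.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    start: float   # segment start time in seconds
    end: float     # segment end time in seconds
    speaker: str   # cluster label assigned by the clustering stage


def merge_segments(segments: List[Segment], max_gap: float = 0.3) -> List[Segment]:
    """Merge consecutive same-speaker segments separated by at most max_gap seconds.

    Hypothetical smoothing rule: max_gap is an assumed tuning parameter,
    not a value taken from the paper.
    """
    merged: List[Segment] = []
    for seg in sorted(segments, key=lambda s: s.start):
        last = merged[-1] if merged else None
        if last and last.speaker == seg.speaker and seg.start - last.end <= max_gap:
            # Extend the previous segment instead of starting a new one.
            last.end = max(last.end, seg.end)
        else:
            # Copy so the caller's input segments are never mutated.
            merged.append(Segment(seg.start, seg.end, seg.speaker))
    return merged


example = [
    Segment(0.0, 1.0, "A"),
    Segment(1.2, 2.0, "A"),   # 0.2 s pause: merged with the previous "A" segment
    Segment(2.1, 3.0, "B"),
    Segment(3.9, 4.5, "B"),   # 0.9 s pause: kept as a separate segment
]
smoothed = merge_segments(example)
```

With the example input, the two "A" segments collapse into a single segment from 0.0 s to 2.0 s, while the two "B" segments stay separate because their gap exceeds the threshold. A larger `max_gap` yields fewer, longer speaker turns at the risk of bridging genuine pauses.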
Keywords
speaker diarization; speaker separation; voice activity detection; optimization
