Improving speaker diarization for CHIL lecture meetings
Abstract
Speaker diarization is often performed before automatic speech recognition (ASR) to label speaker segments. In this paper we present two simple schemes for improving speaker diarization performance. The first iteratively refines the GMM speaker models through frame-level re-labeling and smoothing of the decision likelihood. The second uses word-level alignment information from the ASR process. We focus on the CHIL lecture meeting data. Our experiments on the NIST RT06 evaluation data show that these simple methods are quite effective in improving our baseline diarization system: the alignment information provides a 1% absolute reduction in diarization error rate (DER), and the re-labeling with smoothing provides an additional 3.51% absolute reduction in DER. The overall system achieves a DER that is 6.8% relative lower than that of the top-performing system from the RT06 evaluation.
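The frame-level re-labeling step with likelihood smoothing can be illustrated with a minimal sketch. The abstract does not specify the smoothing function or window, so the centered moving average, the `half_window` parameter, and the function names `smooth` and `relabel` are assumptions made purely for illustration. The retraining of the GMM speaker models between iterations is omitted, and the per-frame log-likelihoods are taken as given.

```python
def smooth(scores, half_window):
    """Centered moving average of per-frame scores (window truncated at the edges).

    The moving average is an assumed smoothing choice, not the paper's stated method.
    """
    n = len(scores)
    out = []
    for t in range(n):
        lo = max(0, t - half_window)
        hi = min(n, t + half_window + 1)
        out.append(sum(scores[lo:hi]) / (hi - lo))
    return out


def relabel(loglik, half_window):
    """Assign each frame to the speaker with the highest smoothed log-likelihood.

    loglik[s][t] is the log-likelihood of frame t under the model of speaker s
    (hypothetical input; in practice it would come from the GMM speaker models).
    """
    smoothed = [smooth(row, half_window) for row in loglik]
    n_frames = len(loglik[0])
    n_speakers = len(loglik)
    return [
        max(range(n_speakers), key=lambda s: smoothed[s][t])
        for t in range(n_frames)
    ]


# Toy data: speaker 0 dominates, except for a one-frame spike at frame 3.
loglik = [
    [-1.0, -1.0, -1.0, -3.0, -1.0, -1.0, -1.0],  # speaker 0
    [-2.0, -2.0, -2.0, -1.5, -2.0, -2.0, -2.0],  # speaker 1
]

raw_labels = relabel(loglik, half_window=0)       # no smoothing
smoothed_labels = relabel(loglik, half_window=1)  # 3-frame window
```

Without smoothing, the isolated frame is assigned to speaker 1. With a 3-frame window, the spurious one-frame speaker change disappears. In an iterative scheme of the kind the abstract describes, the new labels would then be used to re-estimate the speaker models before the next pass.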