Abstract:
Speech segmentation is an important component of speech separation systems and plays a key role in applications such as source estimation and automatic speech recognition in multi-speaker environments and multi-source target tracking. Segmenting overlapping speech has long been a focus of research in this area. In practice, speech signals captured by microphones in a room usually contain reverberation and noise, which degrade the quality of the received speech and reduce the accuracy of the estimated direction of arrival (DOA) features, and this in turn degrades the segmentation of multi-source overlapping speech. To address the poor robustness of existing multi-source segmentation methods to noise and reverberation, a pre-processing method is proposed that removes clearly abnormal noise and reverberant components from the speech signal. The method processes the original speech signal with a generalized parametric phase canceller followed by a Wiener post-filter, suppressing reverberation and noise; this improves speech quality and, in turn, yields more accurate estimates of the DOA features. Segmentation is then performed by using multi-hypothesis tracking to track each speaker's fundamental frequency and DOA features simultaneously. A statistical analysis of multi-source overlapping speech in 16 meeting recordings from the AMI corpus shows that the average hit rate (HIT) improves by 2.10% compared with the method without pre-processing.
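To make the post-filtering step concrete, the following is a minimal sketch of a generic spectral Wiener gain applied to short-time spectra. The abstract does not specify the form of the Wiener filter, so everything here is an assumption for illustration: the function names `wiener_gain` and `wiener_postfilter`, the gain floor, and the estimate of the noise power from the first few frames are hypothetical and are not taken from the proposed method.

```python
import numpy as np


def wiener_gain(noisy_psd, noise_psd, gain_floor=0.1):
    """Per-bin Wiener gain G = xi / (1 + xi).

    The a priori SNR xi is estimated by power subtraction,
    xi = max(P_noisy / P_noise - 1, 0). This is an assumed estimator,
    chosen for simplicity.
    """
    eps = 1e-12  # avoids division by zero in silent bins
    xi = np.maximum(noisy_psd / (noise_psd + eps) - 1.0, 0.0)
    gain = xi / (1.0 + xi)
    # A gain floor limits musical noise; 0.1 is an arbitrary example value.
    return np.maximum(gain, gain_floor)


def wiener_postfilter(frames_fft, noise_frames=5, gain_floor=0.1):
    """Apply the gain to complex short-time spectra.

    frames_fft: complex array of shape (n_frames, n_bins), for example
    the short-time spectra of a beamformer output.
    Assumption: the first `noise_frames` frames contain noise only.
    """
    psd = np.abs(frames_fft) ** 2
    noise_psd = psd[:noise_frames].mean(axis=0)
    return wiener_gain(psd, noise_psd, gain_floor) * frames_fft
```

Because the gain is real-valued and never exceeds 1, this sketch only attenuates each time-frequency bin and leaves its phase unchanged. In a real system the noise power would normally be tracked over time rather than fixed from the opening frames.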