Speaker verification system based on feature fusion and time-delay neural networks

何青怡; 曾庆宁; 赵学军

doi:10.16725/j.1673-808X.202467

Abstract

Single feature parameters often fail to capture a speaker's distinguishing information fully, and the recognition rate of traditional voiceprint recognition systems drops in complex noisy environments. To address both problems, a speaker verification system combining feature fusion with a time-delay neural network is proposed. First, MFCC (mel-frequency cepstral coefficient) and GFCC (gammatone frequency cepstral coefficient) features are analyzed: MFCC features are widely used because they effectively represent the mel-frequency characteristics of speech signals, while GFCC features are more robust in noisy environments because they model the auditory properties of the human cochlea. Second, the two feature types and their dynamic characteristics are combined into a high-dimensional feature set. Third, to improve processing efficiency, principal component analysis (PCA) reduces the dimensionality of the combined set, and the k-means clustering algorithm then groups the reduced features, yielding a new hybrid feature parameter. Finally, the hybrid feature parameter is used to train and test ECAPA-TDNN, a time-delay neural network model with a channel attention mechanism. Experimental results show that, in the specified −5 dB noise environment, the proposed system improves the recognition rate by 27.21% over a traditional single-feature system, demonstrating good recognition performance.
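The fusion, dimensionality reduction, and clustering steps named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The frame counts, feature dimensions, number of principal components, number of clusters, and the helper names (`delta`, `fuse`, `pca_reduce`, `kmeans`) are all assumptions chosen for illustration, and random arrays stand in for real MFCC and GFCC frames. The abstract does not say how the cluster assignments are turned into the final hybrid feature, so the sketch stops at the clustering step.

```python
import numpy as np


def delta(feats):
    """First-order dynamic features as a central difference along time (an assumed, simple form)."""
    padded = np.pad(feats, ((1, 1), (0, 0)), mode="edge")
    return (padded[2:] - padded[:-2]) / 2.0


def fuse(mfcc, gfcc):
    """Concatenate both static feature sets and their deltas frame by frame."""
    return np.hstack([mfcc, delta(mfcc), gfcc, delta(gfcc)])


def pca_reduce(x, n_components):
    """Project centred frames onto the top principal directions, computed via SVD."""
    centred = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ vt[:n_components].T


def assign(x, centroids):
    """Index of the nearest centroid (Euclidean distance) for every frame."""
    dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)


def kmeans(x, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means; returns final labels and centroids."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(n_iter):
        labels = assign(x, centroids)
        for j in range(k):
            members = x[labels == j]
            if len(members):  # leave an empty cluster's centroid unchanged
                centroids[j] = members.mean(axis=0)
    return assign(x, centroids), centroids


# Placeholder data: 200 frames of 13-dimensional MFCC and GFCC (hypothetical sizes).
rng = np.random.default_rng(0)
mfcc = rng.normal(size=(200, 13))
gfcc = rng.normal(size=(200, 13))

fused = fuse(mfcc, gfcc)                  # 200 frames x 52 dimensions
reduced = pca_reduce(fused, 20)           # 200 frames x 20 dimensions
labels, centroids = kmeans(reduced, k=8)  # one cluster index per frame
```

In a real pipeline the placeholder arrays would be replaced by features extracted from speech, and the reduced or clustered features would then be fed to the ECAPA-TDNN model.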
