Abstract:
Single-feature parameters in speaker recognition often fail to capture the full speaker-specific information, and the recognition rate of traditional voiceprint recognition systems declines in complex noisy environments. To address both problems, a speaker verification system combining feature fusion with a time-delay neural network is proposed. The system first analyzes MFCC (mel-frequency cepstral coefficient) and GFCC (gammatone frequency cepstral coefficient) features. MFCC features are widely used because they represent speech signals effectively on the mel-frequency scale, while GFCC features are more robust in noisy environments because they mimic the auditory properties of the human cochlea. These two feature types and their dynamic characteristics are then combined into a high-dimensional feature set. To improve processing efficiency, principal component analysis (PCA) is used to reduce the dimensionality of the combined feature set, and the k-means clustering algorithm is then applied to group the reduced features, yielding a new hybrid feature parameter. Finally, this hybrid feature parameter is fed to ECAPA-TDNN, a time-delay neural network with a channel attention mechanism, for training and testing. Experimental results show that, in the specified −5 dB noise environment, the proposed method improves the recognition rate by 27.21% over traditional single features.
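The fusion steps described above (static plus dynamic features, PCA, then k-means) can be illustrated with a minimal NumPy sketch. Everything specific in it is an assumption made for illustration and is not taken from the paper: the random arrays stand in for real MFCC and GFCC matrices, and the 13 coefficients per feature, the delta window of 2, the 20 principal components, and the 8 clusters are placeholder values. The abstract does not state how the clustering result is turned into the hybrid feature parameter, so the sketch stops at the centroids and per-frame cluster labels.

```python
import numpy as np

def delta(feats, width=2):
    """First-order dynamic (delta) features via the usual regression formula.
    feats has shape (frames, coeffs); edges are handled by repeating end frames."""
    feats = np.asarray(feats, dtype=float)
    n_frames = feats.shape[0]
    padded = np.pad(feats, ((width, width), (0, 0)), mode="edge")
    denom = 2 * sum(n * n for n in range(1, width + 1))
    out = np.zeros_like(feats)
    for n in range(1, width + 1):
        ahead = padded[width + n: width + n + n_frames]
        behind = padded[width - n: width - n + n_frames]
        out += n * (ahead - behind)
    return out / denom

def fuse(mfcc, gfcc):
    """Concatenate static and delta features of both types, frame by frame."""
    return np.hstack([mfcc, delta(mfcc), gfcc, delta(gfcc)])

def pca_reduce(x, n_components):
    """Project centered data onto its top principal components (via SVD)."""
    centered = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

def assign(x, centroids):
    """Index of the nearest centroid for every row of x."""
    dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans(x, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(n_iter):
        labels = assign(x, centroids)
        # Keep the old centroid if a cluster happens to be empty.
        updated = np.array([
            x[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return centroids, assign(x, centroids)

# Placeholder inputs: 200 frames, 13 coefficients each (assumed sizes).
rng = np.random.default_rng(0)
mfcc = rng.normal(size=(200, 13))
gfcc = rng.normal(size=(200, 13))

fused = fuse(mfcc, gfcc)                   # shape (200, 52)
reduced = pca_reduce(fused, 20)            # shape (200, 20)
centroids, labels = kmeans(reduced, k=8)   # shapes (8, 20) and (200,)
```

In practice the placeholder arrays would be replaced by features extracted from speech frames, and the number of retained components and clusters would be tuned on held-out data.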