Abstract:
In end-to-end continuous speech recognition, the Transformer model, based entirely on the self-attention mechanism, improves accuracy over the traditional hybrid model. The Conformer model extends the Transformer with a convolution module well suited to extracting local features, and this model serves as the encoder of the recognition system; the decoder uses an attention mechanism. Because the attention model is suited only to short-sentence recognition and its training becomes unstable when the data set contains noise, the sequence-alignment property of the CTC model is added as an auxiliary training objective to help the model converge faster. Since single-channel decoding leaves room to further improve recognition accuracy, a dual-channel decoding model combining CTC and attention is proposed and verified against CTC-only and attention-only decoding. The results show that dual-channel decoding is more effective, improving recognition performance by 1%. To address the drop in recognition accuracy in noisy environments, a method of adding a language model to the end-to-end network is proposed, and an N-gram language model is added to the network for verification. The results show that in a high-noise environment with a signal-to-noise ratio of 10 dB, the language model reduces the word error rate by 3.5%, improving the robustness of the speech recognition system.
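The dual-channel (CTC + attention) decoding and the language-model addition described above can be sketched as a log-linear combination of hypothesis scores. This is a minimal illustration, not the paper's implementation: the weight values, function names, and candidate transcripts below are all assumptions made for the example.

```python
import math

def combined_score(ctc_logp, att_logp, lm_logp,
                   ctc_weight=0.3, lm_weight=0.2):
    # Dual-channel decoding with N-gram LM fusion, sketched as a
    # weighted log-linear combination of the three log-probabilities.
    # The weights are illustrative hyperparameters, not values from
    # the paper.
    return (ctc_weight * ctc_logp
            + (1.0 - ctc_weight) * att_logp
            + lm_weight * lm_logp)

# Rank two hypothetical candidate transcripts: the second is
# acoustically plausible but penalized by the language model.
hyps = {
    "recognize speech": combined_score(
        math.log(0.5), math.log(0.6), math.log(0.3)),
    "wreck a nice beach": combined_score(
        math.log(0.4), math.log(0.5), math.log(0.01)),
}
best = max(hyps, key=hyps.get)
```

In this sketch the CTC score rewards hypotheses whose frame-level alignment is consistent, the attention score captures label dependencies, and the LM score steers the choice toward fluent text, which is what makes the combination more robust in noise.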