Citation: ZHU Yang, ZENG Qingning, ZHAO Xuejun. End-to-end continuous speech recognition with dual-channel decoding[J]. Journal of Guilin University of Electronic Technology, 2024, 44(2): 167-173. DOI: 10.16725/j.1673-808X.2023223


End-to-end continuous speech recognition with dual-channel decoding

  • Abstract: In end-to-end continuous speech recognition, the Transformer model, built entirely on the self-attention mechanism, achieves higher accuracy than traditional hybrid models. The Conformer model extends the Transformer with a convolution module that is good at extracting local features, and it serves as the encoder of the recognition system, while the decoder uses an attention mechanism. Because the attention model is only suited to short-utterance recognition and its training becomes unstable when the data set is noisy, the sequence-alignment property of the CTC model is added as an auxiliary training objective to speed up convergence. To improve the recognition accuracy attainable with single-channel decoding, a dual-channel decoding model combining CTC and attention was proposed and compared against CTC-only decoding and attention-only decoding; the results show that dual-channel decoding improves recognition performance by 1%. To address the degraded recognition in noisy environments, a method of adding a language model to the end-to-end network was proposed. Adding an N-gram language model was verified to reduce the character error rate by 3.5% in a high-noise environment with a signal-to-noise ratio of 10 dB, improving the robustness of the speech recognition system.
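
To make the dual-channel decoding idea concrete, the sketch below shows one common way to combine the two channels: each hypothesis from beam search is scored by interpolating the log-probabilities of the CTC branch and the attention decoder, with an external N-gram language model optionally added by shallow fusion. The function names, interpolation weights, and toy hypotheses are illustrative assumptions, not the authors' implementation; the actual weights would be tuned on a development set.

```python
# Minimal sketch of CTC/attention dual-channel rescoring with optional
# N-gram shallow fusion. Names, weights, and scores below are assumptions
# for illustration only, not the implementation described in the paper.

def joint_score(log_p_ctc, log_p_att, log_p_lm=0.0,
                ctc_weight=0.3, lm_weight=0.3):
    """Interpolate the CTC and attention log-probabilities of one
    hypothesis, then add a weighted language-model score."""
    acoustic = ctc_weight * log_p_ctc + (1.0 - ctc_weight) * log_p_att
    return acoustic + lm_weight * log_p_lm


def pick_best(hypotheses):
    """hypotheses: list of dicts with 'text', 'log_p_ctc', 'log_p_att'
    and, optionally, 'log_p_lm' from an external N-gram model."""
    return max(
        hypotheses,
        key=lambda h: joint_score(h["log_p_ctc"], h["log_p_att"],
                                  h.get("log_p_lm", 0.0)),
    )


if __name__ == "__main__":
    # Toy n-best list; the second candidate ends in a wrong character,
    # which the language model penalizes heavily.
    nbest = [
        {"text": "今天天气很好", "log_p_ctc": -12.1, "log_p_att": -10.4, "log_p_lm": -8.0},
        {"text": "今天天气很号", "log_p_ctc": -13.5, "log_p_att": -10.2, "log_p_lm": -15.0},
    ]
    print(pick_best(nbest)["text"])  # -> 今天天气很好
```

In a real joint-decoding system these weights would typically be applied inside the beam search rather than to a finished n-best list, but the scoring rule is the same.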

     
