Research on stemming and relevance ranking optimization for retrieval services

朱艳; 张敬伟; 杨青; 胡晓丽; 单美静

Abstract

The rise of a new generation of information technology and the rapid development of the internet industry have led to explosive growth in data volume. To let billions of users obtain useful information quickly from massive data, improving the retrieval quality and query efficiency of search engines is of great significance, but it also poses challenges. On the one hand, user queries are becoming increasingly complex, and morphological variation in language makes search terms more diverse, while existing stemming algorithms generally suffer from under-stemming and low stemming accuracy. On the other hand, retrieving documents that satisfy a user's query from massive data is very time-consuming, and existing methods that partition documents across multiple servers to reduce query latency often suffer from tail latency. To address these problems, in the text preprocessing stage we design the word-form normalization algorithm APS (advanced Porter stemmer), which re-encodes the rule functions and improves feature-word extraction. In the relevance ranking stage, we design the anytime ranking algorithm SAR (SAAT anytime ranking), based on the score-at-a-time (SAAT) query processing strategy. SAR can terminate query processing early once a given time budget is exhausted or a specified number of inverted-index segments has been processed, which controls query latency effectively and greatly reduces query evaluation time. Experiments on multiple real-world datasets verify the effectiveness of APS in improving stemming accuracy and of SAR in controlling query latency.
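The early-termination idea behind score-at-a-time processing can be illustrated with a minimal sketch. This is not the authors' SAR algorithm, whose details the abstract does not give. It assumes a hypothetical toy index in which each term's postings are grouped into segments of equal integer impact, and every name below (`saat_anytime`, `INDEX`, `max_segments`, `time_budget_s`) is an assumption made for illustration.

```python
import heapq
import time
from collections import defaultdict

# Hypothetical toy index: term -> list of (impact, [doc_ids]) segments.
INDEX = {
    "stem": [(3, [1, 4]), (1, [2, 5])],
    "rank": [(2, [1, 2]), (1, [3, 4])],
}


def saat_anytime(index, query_terms, k=2, max_segments=None, time_budget_s=None):
    """Score-at-a-time traversal that may stop early.

    Segments of all query terms are visited in decreasing impact order,
    so the largest score contributions are accumulated first. Traversal
    stops when either budget (segment count or wall-clock time) is used up.
    Returns (top-k list of (doc_id, score), number of segments processed).
    """
    segments = [seg for term in query_terms for seg in index.get(term, [])]
    segments.sort(key=lambda seg: seg[0], reverse=True)

    accumulators = defaultdict(int)
    start = time.monotonic()
    processed = 0
    for impact, doc_ids in segments:
        # Budgets are checked between segments, never inside one.
        if max_segments is not None and processed >= max_segments:
            break
        if time_budget_s is not None and time.monotonic() - start >= time_budget_s:
            break
        for doc_id in doc_ids:
            accumulators[doc_id] += impact
        processed += 1

    # Highest score first; ties are broken by the smaller document id.
    top = heapq.nlargest(k, accumulators.items(), key=lambda kv: (kv[1], -kv[0]))
    return top, processed


full, _ = saat_anytime(INDEX, ["stem", "rank"])
partial, _ = saat_anytime(INDEX, ["stem", "rank"], max_segments=2)
```

On this toy index, stopping after two of the four segments returns the same two documents in the same order as the full traversal, with a lower partial score for the second one. That is the trade-off an anytime strategy relies on: high-impact segments decide most of the ranking, so a budget-limited run often approximates the exhaustive result.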
