Application of Oozie Workflow in Mahout Distributed Data Mining
Abstract: Existing data mining software does not let users customize parallel data mining algorithms on demand in an ordered, dynamic way, and it cannot make full use of a computing cluster. To address this, the paper analyzes Hadoop and several of its data processing components, and proposes a method for building data mining workflows in Hadoop using the Mahout distributed data mining algorithm library and Oozie workflow technology. A clustering workflow instance is designed and implemented. The experimental results show that the method is simple and organizes the data mining process effectively.
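The abstract does not reproduce the workflow definition itself. As an illustration only, the following Python sketch generates an Oozie workflow definition that chains Mahout steps into a clustering pipeline. Everything specific in it is an assumption rather than the paper's design: the choice of k-means, the three steps (seqdirectory, seq2sparse, kmeans), the schema version, the workflow name, the parameter names such as ${inputDir}, and the values of k and the iteration limit. Each step is run through Oozie's java action with Mahout's MahoutDriver as the main class.

```python
import xml.etree.ElementTree as ET

# Assumed Oozie workflow schema version; adjust to the deployed Oozie release.
OOZIE_NS = "uri:oozie:workflow:0.4"
MAHOUT_MAIN = "org.apache.mahout.driver.MahoutDriver"

# Hypothetical pipeline: (action name, Mahout command-line arguments).
# The ${...} placeholders are Oozie job parameters chosen for this sketch.
STEPS = [
    ("seqdirectory", ["seqdirectory", "-i", "${inputDir}", "-o", "${workDir}/seq"]),
    ("seq2sparse", ["seq2sparse", "-i", "${workDir}/seq", "-o", "${workDir}/vectors"]),
    ("kmeans", ["kmeans",
                "-i", "${workDir}/vectors/tfidf-vectors",
                "-c", "${workDir}/initial-clusters",
                "-o", "${outputDir}",
                "-k", "10",   # number of clusters (illustrative value)
                "-x", "20",   # maximum iterations (illustrative value)
                "-cl"]),      # also assign input points to clusters
]


def add_mahout_action(root, name, args, ok_to, error_to="fail"):
    """Append one <action> that runs a Mahout step via Oozie's <java> action."""
    action = ET.SubElement(root, "action", name=name)
    java = ET.SubElement(action, "java")
    ET.SubElement(java, "job-tracker").text = "${jobTracker}"
    ET.SubElement(java, "name-node").text = "${nameNode}"
    ET.SubElement(java, "main-class").text = MAHOUT_MAIN
    for arg in args:
        ET.SubElement(java, "arg").text = arg
    ET.SubElement(action, "ok", to=ok_to)        # transition on success
    ET.SubElement(action, "error", to=error_to)  # transition on failure
    return action


def build_clustering_workflow(steps, name="mahout-clustering-wf"):
    """Build a linear workflow: start -> step 1 -> ... -> step n -> end."""
    root = ET.Element("workflow-app", name=name, xmlns=OOZIE_NS)
    ET.SubElement(root, "start", to=steps[0][0])
    for i, (step_name, args) in enumerate(steps):
        next_node = steps[i + 1][0] if i + 1 < len(steps) else "end"
        add_mahout_action(root, step_name, args, ok_to=next_node)
    kill = ET.SubElement(root, "kill", name="fail")
    ET.SubElement(kill, "message").text = (
        "Step failed: ${wf:errorMessage(wf:lastErrorNode())}"
    )
    ET.SubElement(root, "end", name="end")
    return root


workflow = build_clustering_workflow(STEPS)
print(ET.tostring(workflow, encoding="unicode"))
```

The printed XML could be saved as workflow.xml in the application's HDFS directory and submitted with the Oozie command-line client. Because the steps are plain data, reordering them or swapping in another Mahout algorithm only requires editing the STEPS list, which is the kind of on-demand customization the abstract describes.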