Citation: QIN Qin, YANG Yue, CHEN Mingsong, WANG Xin. Improved SMOTE for Oversampling[J]. Journal of Guilin University of Electronic Technology, 2022, 42(1): 53-59.

Improved SMOTE for Oversampling

Abstract: For the classification of imbalanced datasets, existing oversampling algorithms mainly address between-class imbalance and neglect within-class imbalance in the minority class. They also do not select which samples to oversample, do not remove noise, and tend to produce overlapping samples and a "marginalized" sample distribution during synthesis. To address these problems, AGNES-SMOTE, an oversampling algorithm based on hierarchical clustering and an improved SMOTE, is proposed. The algorithm first performs hierarchical clustering on the majority-class and minority-class samples separately and divides the minority class into clusters according to the majority clusters obtained; the distribution of the majority samples is taken into account during merging to avoid generating overlapping samples. Next, a sampling weight is determined from the number of samples in each minority cluster, a probability distribution is computed for each minority cluster from the distances between its minority samples and their neighboring majority samples, and the two are combined to select "seed samples". Finally, a centroid-based method is used during sampling to restrict the region in which synthetic samples are generated. AGNES-SMOTE is combined with a classifier to handle the classification of imbalanced datasets. Comparative experiments on UCI datasets against related algorithms from the literature show that AGNES-SMOTE performs well in the overall synthesis of new samples, achieves higher G-mean, F-measure, and AUC values, and effectively improves classifier performance on imbalanced datasets.
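The last two stages described in the abstract (weighting minority clusters by size, then synthesizing samples inside a centroid-limited region) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the formulas, so the inverse-size weighting, the interpolation toward the centroid of the seed's nearest neighbors, and all function names below are assumptions chosen for illustration. The hierarchical clustering step and the distance-based seed probabilities are omitted, and the minority clusters are assumed to be given.

```python
import math
import random


def cluster_weights(sizes):
    """Assumed rule: smaller minority clusters receive larger sampling weight."""
    inverse = [1.0 / s for s in sizes]
    total = sum(inverse)
    return [v / total for v in inverse]


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


def k_nearest(seed, points, k):
    """Up to k points of the cluster closest to the seed, excluding the seed itself."""
    others = [p for p in points if p is not seed]
    others.sort(key=lambda p: math.dist(seed, p))
    return others[:k]


def synthesize(cluster, k, rng):
    """Assumed centroid rule: interpolate between a seed and the centroid of
    its nearest neighbors, so the new point stays inside the cluster."""
    seed = rng.choice(cluster)  # uniform seed choice; the paper uses a distance-based distribution
    neighbours = k_nearest(seed, cluster, k)
    if not neighbours:  # single-point cluster: nothing to interpolate with
        return list(seed)
    target = centroid(neighbours)
    u = rng.random()
    return [s + u * (t - s) for s, t in zip(seed, target)]


def oversample(clusters, n_new, k=3, random_state=0):
    """Draw n_new synthetic points, picking clusters according to their weights."""
    rng = random.Random(random_state)
    weights = cluster_weights([len(c) for c in clusters])
    picks = rng.choices(range(len(clusters)), weights=weights, k=n_new)
    return [(i, synthesize(clusters[i], k, rng)) for i in picks]


# Two hypothetical minority clusters of different sizes.
clusters = [
    [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)],
    [(5.0, 5.0), (6.0, 5.0)],
]
new_points = oversample(clusters, n_new=10)
```

Because each synthetic point is a convex combination of samples from a single cluster, it cannot fall outside that cluster's convex hull, which is one simple way to limit the generation region and avoid placing points among the majority class.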

     
