An oversampling algorithm for imbalanced datasets

张文辉; 罗鸿豪

doi:10.3969/1673-808X.202305

Abstract

Traditional oversampling algorithms mitigate class imbalance by synthesizing minority-class samples, but they do not account for the noise points they may generate or for uneven sample distribution. To address these problems, SK-SMOTE, an oversampling algorithm based on clustering and an improved SMOTE, is proposed. Before clustering, the algorithm synthesizes a portion of the minority samples to increase the size of the minority class. Each synthesized sample is assigned a weight according to the classes and distances of its neighboring samples, and it is retained only if the weight sum exceeds a set threshold. After the minority class has been enlarged, the KMeans algorithm is used for clustering, and the clusters containing more minority samples are retained. Oversampling is then performed within these clusters, with relatively sparse clusters receiving more synthesized minority samples. In experiments on imbalanced datasets from the UCI and KEEL repositories, SK-SMOTE is compared with several classic SMOTE algorithms using SVM, RF, and KNN as classifiers. The results show that SK-SMOTE effectively balances imbalanced datasets and, on datasets with higher imbalance ratios, achieves better results than traditional oversampling algorithms.
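The abstract gives no formulas for the neighbor weights or for cluster sparsity, so the following is only a minimal sketch of the two ideas under stated assumptions, not the authors' implementation. The weighting scheme (a minority neighbor adds 1/(1+d), a majority neighbor subtracts it), the threshold of 0, the density inputs, and all function names are hypothetical choices made for illustration.

```python
import math
import random


def smote_point(a, b, rng):
    """Standard SMOTE interpolation: a random point on the segment a-b."""
    t = rng.random()
    return [x + t * (y - x) for x, y in zip(a, b)]


def neighbor_weight(point, samples, labels, k=5, minority=1):
    """Sum signed, distance-based weights over the k nearest samples.

    Assumed scheme (not from the paper): a minority neighbor at distance d
    contributes +1/(1+d), a majority neighbor contributes -1/(1+d).
    """
    nearest = sorted(
        (math.dist(point, s), lab) for s, lab in zip(samples, labels)
    )[:k]
    total = 0.0
    for d, lab in nearest:
        w = 1.0 / (1.0 + d)
        total += w if lab == minority else -w
    return total


def keep_synthetic(point, samples, labels, k=5, threshold=0.0):
    """Retain a synthetic sample only if its weight sum exceeds the threshold."""
    return neighbor_weight(point, samples, labels, k) > threshold


def allocate_by_sparsity(cluster_densities, n_new):
    """Split n_new synthetic samples across clusters, inversely to density.

    Density is an assumed input here (for example, minority samples per unit
    of mean pairwise distance); sparser clusters receive a larger share.
    """
    inverse = [1.0 / d for d in cluster_densities]
    total = sum(inverse)
    return [round(n_new * v / total) for v in inverse]


if __name__ == "__main__":
    samples = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
    labels = [1, 1, 1, 0, 0, 0]  # 1 = minority, 0 = majority
    candidate = smote_point(samples[0], samples[1], random.Random(0))
    print(keep_synthetic(candidate, samples, labels, k=3))
    print(allocate_by_sparsity([1.0, 3.0], 8))
```

In a full pipeline the clustering step would sit between the two stages: the filtered minority set is clustered (for instance with a KMeans implementation), minority-dominated clusters are kept, and `allocate_by_sparsity` decides how many points `smote_point` generates inside each one.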
