Abstract:
Traditional oversampling algorithms mitigate class imbalance by synthesizing minority-class samples, but they do not address problems such as the generation of noisy samples and uneven sample distribution. To address these problems, SK-SMOTE, a clustering-based oversampling algorithm that improves on SMOTE, is proposed. Before clustering, the algorithm synthesizes a portion of minority-class samples to increase their number. Each synthesized sample is assigned weights according to the classes and distances of its neighboring samples, and it is retained only if the sum of these weights exceeds a preset threshold. After the minority class has been enlarged in this way, the KMeans algorithm is used for clustering, and clusters with more minority-class samples are retained. Oversampling is then performed within the retained clusters, with sparser clusters synthesizing more minority-class samples. SK-SMOTE is compared with several classic SMOTE algorithms, using SVM, RF, and KNN as classifiers, on imbalanced datasets from the UCI and KEEL repositories. The experimental results show that SK-SMOTE effectively balances imbalanced datasets and achieves better results than traditional oversampling algorithms, especially on datasets with higher imbalance ratios.
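
The abstract does not give the weight formula or the sparsity measure, so the following is only a minimal sketch of two of the steps it describes: filtering synthesized samples by a neighbor-based weight sum, and giving sparser clusters a larger share of new samples. Everything specific here is an assumption made for illustration and not the authors' definition: the function names `retain_mask` and `allocate_by_sparsity`, the inverse-distance weighting, the default `k` and `threshold`, and the use of mean pairwise distance as the sparsity measure. The KMeans step itself is omitted, and the clusters are taken as given.

```python
import numpy as np


def retain_mask(X, y, synthetic, minority_label=1, k=5, threshold=0.5):
    """Return a boolean mask over `synthetic` samples.

    Assumed rule (not from the paper): each of the k nearest original
    samples gets a normalized inverse-distance weight, and a synthetic
    sample is kept when the weight carried by minority-class neighbors
    is greater than `threshold`.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    synthetic = np.asarray(synthetic, dtype=float)

    # Euclidean distances, shape (n_synthetic, n_original).
    dists = np.linalg.norm(synthetic[:, None, :] - X[None, :, :], axis=2)

    k = min(k, X.shape[0])
    nn_idx = np.argsort(dists, axis=1)[:, :k]
    nn_dist = np.take_along_axis(dists, nn_idx, axis=1)

    # Closer neighbors weigh more; weights sum to 1 per synthetic sample.
    weights = 1.0 / (nn_dist + 1e-12)
    weights /= weights.sum(axis=1, keepdims=True)

    minority_share = (weights * (y[nn_idx] == minority_label)).sum(axis=1)
    return minority_share > threshold


def allocate_by_sparsity(clusters, n_total):
    """Split `n_total` new samples across clusters.

    Assumed sparsity measure (not from the paper): the mean pairwise
    distance between the minority samples of a cluster. Shares are
    rounded down, so a few samples may be left unallocated.
    """
    sparsity = []
    for points in clusters:
        points = np.asarray(points, dtype=float)
        n = len(points)
        if n < 2:
            sparsity.append(0.0)
            continue
        d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
        sparsity.append(d.sum() / (n * (n - 1)))
    sparsity = np.asarray(sparsity)

    if sparsity.sum() == 0:
        # No usable sparsity information: fall back to an equal split.
        return np.full(len(clusters), n_total // len(clusters), dtype=int)
    return np.floor(sparsity / sparsity.sum() * n_total).astype(int)


# Toy data: three minority points near the origin, three majority points far away.
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
y = np.array([1, 1, 1, 0, 0, 0])
candidates = np.array([[0.2, 0.2], [10.2, 10.2]])

mask = retain_mask(X, y, candidates, minority_label=1, k=3, threshold=0.5)
counts = allocate_by_sparsity([[[0, 0], [0, 1]], [[0, 0], [0, 3]]], n_total=8)
```

In this toy setup, the candidate surrounded by minority samples is kept and the one surrounded by majority samples is discarded, while the cluster whose points lie farther apart receives the larger share of the eight new samples.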