Abstract:
In the classification of imbalanced datasets, existing oversampling algorithms mainly address between-class imbalance and neglect within-class imbalance. They also ignore several problems: the samples to be oversampled are not selected, noise is not removed, sample overlap arises in the synthesis process, and synthetic samples are distributed "marginally". To solve these problems, an improved oversampling method for imbalanced data based on hierarchical clustering, AGNES-SMOTE, is presented. The key of the algorithm is to perform hierarchical clustering on the majority and minority samples, divide the minority clusters according to the obtained majority clusters, and consider the distribution of the majority samples during the merging process to avoid generating overlapping synthetic samples. Then, sampling weights are determined according to the sample size of each minority subcluster, the probability distribution of each minority subcluster is calculated according to the distance between the minority samples and their neighboring majority samples, and the two are combined to select the "seed samples" for oversampling. Finally, the centroid method is used to limit the regions in which synthetic samples are generated during sampling. AGNES-SMOTE is combined with a classifier to handle the classification of imbalanced datasets and is compared with related algorithms from the literature on UCI datasets. AGNES-SMOTE performs well on the minority class samples, achieves higher G-mean, F-measure, and AUC values, and yields better classification performance on imbalanced datasets.
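The seed-selection and centroid steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact formulas, so the weighting rule (smaller subclusters receive larger weight), the seed probability (inversely proportional to the distance to the nearest majority sample), and the centroid rule (interpolating between the seed and its subcluster centroid) are all assumptions, and every function name is hypothetical. The hierarchical clustering step is taken as already done, so the minority subclusters are passed in directly.

```python
import math
import random


def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def cluster_weights(clusters):
    """ASSUMED rule: weight each minority subcluster by the inverse of its size,
    normalised to sum to 1, so sparse subclusters are sampled more often."""
    inv = [1.0 / len(c) for c in clusters]
    total = sum(inv)
    return [w / total for w in inv]


def seed_probabilities(cluster, majority):
    """ASSUMED rule: a minority sample closer to its nearest majority sample
    gets a higher probability of being picked as the seed sample."""
    inv = [1.0 / (min(math.dist(x, m) for m in majority) + 1e-12) for x in cluster]
    total = sum(inv)
    return [p / total for p in inv]


def synthesize(clusters, majority, n_new, rng):
    """Generate n_new synthetic minority samples.

    Each sample is drawn by (1) picking a subcluster by its weight,
    (2) picking a seed inside it by its probability, and (3) interpolating
    between the seed and the subcluster centroid, which keeps the new
    point inside the convex hull of that subcluster (ASSUMED centroid rule).
    """
    weights = cluster_weights(clusters)
    synthetic = []
    for _ in range(n_new):
        k = rng.choices(range(len(clusters)), weights=weights)[0]
        cluster = clusters[k]
        probs = seed_probabilities(cluster, majority)
        seed = rng.choices(cluster, weights=probs)[0]
        c = centroid(cluster)
        t = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(s + t * (ci - s) for s, ci in zip(seed, c)))
    return synthetic
```

Calling `synthesize(clusters, majority, n_new, random.Random(0))` with two small lists of 2-D tuples returns `n_new` points, each lying between a seed and the centroid of its own subcluster, so under these assumptions no synthetic sample can land outside the region spanned by that subcluster.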