Abstract:
Convolutional neural network (CNN)-based self-supervised visual representation learning methods have been widely adopted recently due to their ability to exploit unlabeled data. However, slow convergence and an inability to capture detailed characteristics lead to high computational cost and poor performance. To address these issues, a self-supervised method that is both efficient and effective is proposed. Specifically, Lego-block-style sampling is first introduced to construct samples from small patches, increasing the number of samples while keeping the computational cost low. In addition, an information retainer projection head (IRPH) is introduced to further balance detailed inconsistency against semantic consistency. Finally, multiple loss functions are used to jointly optimize the model. The effectiveness of our method is verified on CIFAR and three fine-grained classification datasets. Experiments demonstrate that this method captures more global and detailed information than MoCo and other methods, achieves higher linear classification accuracy and detection AP on downstream tasks, and produces better visualization results.