Abstract:
In horizontal-slicing-based person re-identification methods, the receptive fields of block features overlap, which limits the number of blocks that can be used. To address this limitation, a Transformer-based person re-identification network, CNN with INOUT_Transformer (CIT), is proposed. First, an intermediate feature map is extracted from the input image by a CNN; the feature map is divided into blocks, and each block is further cut into pixel-level token vectors. Next, each pixel-level token vector is flattened, position encodings and a global token vector are added, and the result is fed into the TransformerIN encoder. The global token vectors are then combined with a classification token vector and position encodings and fed into the TransformerOUT encoder to obtain the final encoder output. Finally, the classification token vector is passed through a fully connected layer, and pedestrians are classified using Softmax with cross-entropy loss. Experimental results on the Market-1501 and DukeMTMC-reID datasets show that the proposed method extracts finer-grained features and, by exploiting the Transformer's global modeling ability, allows more slices to be used while improving classification accuracy.
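
The block-and-token step described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names (`split_into_strips`, `strip_to_pixel_tokens`, `tokenize`), the toy sizes, and the assumption that the blocks are equal-height horizontal strips are all hypothetical, and the CNN backbone, position encodings, global and classification tokens, and both Transformer encoders are omitted.

```python
# Hypothetical sketch of the strip-and-token step; names and sizes are assumptions.

def split_into_strips(feature_map, num_strips):
    """Split an H x W x C feature map (nested lists) into horizontal strips."""
    height = len(feature_map)
    if height % num_strips != 0:
        raise ValueError("height must be divisible by num_strips")
    rows_per_strip = height // num_strips
    return [
        feature_map[i * rows_per_strip:(i + 1) * rows_per_strip]
        for i in range(num_strips)
    ]


def strip_to_pixel_tokens(strip):
    """Flatten one strip into pixel-level tokens (one C-dimensional vector per pixel)."""
    return [list(pixel) for row in strip for pixel in row]


def tokenize(feature_map, num_strips):
    """Return, for each strip, the list of its pixel-level token vectors."""
    strips = split_into_strips(feature_map, num_strips)
    return [strip_to_pixel_tokens(strip) for strip in strips]


# Toy feature map with H=4, W=2, C=3; each pixel stores (row, col, 0).
H, W, C = 4, 2, 3
feature_map = [[[r, c, 0] for c in range(W)] for r in range(H)]

tokens = tokenize(feature_map, num_strips=2)
# Number of strips, tokens per strip, and channels per token.
print(len(tokens), len(tokens[0]), len(tokens[0][0]))  # 2 4 3
```

In this sketch each strip yields its own token sequence, which is the form of input an inner encoder would process per block, while an outer encoder would operate on one summary token per strip.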