Abstract:
Visual question answering (VQA) is a multimedia understanding task in which a computer is given an image and a natural language question about the image content and must produce a correct answer. Early VQA models often overlooked the emotional information in images, leading to poor performance on emotion-related questions. Existing emotion-integrated VQA models, on the other hand, do not make full use of key image regions and textual keywords, so they lack a deep understanding of fine-grained questions and achieve low overall answer accuracy. To fully incorporate image emotional information into VQA and use it to strengthen question answering, we propose an emotion-enhanced visual question answering model (IEVQA). The model builds on a large-scale pre-trained framework and adds an emotion module that improves its ability to answer emotion-related questions. Experiments on a VQA benchmark dataset show that IEVQA outperforms the comparison methods on the comprehensive metrics, validating the effectiveness of using emotional information to assist VQA models.