University of Macau Library | UML Digital Resources Hub

Title: Hybrid-spatial transformer for image captioning
English Abstract:

In recent years, transformer-based models have achieved great success in many tasks, such as machine translation. This encoder-decoder architecture has proved useful for image captioning as well. We propose a novel Hybrid-Spatial Transformer model for image captioning. In this work, we combine the global and local information of an image, extracted by VGG16 and Faster R-CNN respectively, as the input to the encoder. To further improve the performance of the model, we add spatial information to the attention layer by incorporating geometry features into the attention weights. Moreover, the queries Q, keys K, and values V differ slightly from those of the standard transformer in two respects: the positional encoding or embedding is not added to the values V in either the encoder or the decoder, and the positional embedding is added to the keys K in cross-attention. The experimental results show that our model achieves state-of-the-art performance on CIDEr-D, METEOR, and BLEU-1 on the MS-COCO dataset.
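The attention variant described in the abstract can be illustrated with a minimal single-head sketch. This is not the thesis implementation: the function name `attention`, the omission of learned projections, the addition of positional embeddings to the queries, and the additive form of the geometry term `geo_bias` are all assumptions made for illustration. The abstract states only that geometry features are incorporated into the attention weights and that positional information is kept out of the values V.

```python
import math

def softmax(row):
    # Numerically stable softmax over one list of logits.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def add(a, b):
    # Element-wise sum of two equally shaped matrices (lists of rows).
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def attention(x_q, x_kv, pos_q, pos_k, geo_bias):
    """Hypothetical single-head attention; learned projections omitted.

    x_q:      n_q x d query-side features
    x_kv:     n_k x d key/value-side features
    pos_q:    n_q x d positional embeddings (adding them to Q is an assumption)
    pos_k:    n_k x d positional embeddings added to the keys
    geo_bias: n_q x n_k geometry-derived scores (additive form is an assumption)
    """
    q = add(x_q, pos_q)   # queries: content plus position
    k = add(x_kv, pos_k)  # keys: content plus position
    v = x_kv              # values: content only, no positional term
    d = len(q[0])
    out = []
    for i, qi in enumerate(q):
        logits = [
            sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) + geo_bias[i][j]
            for j, kj in enumerate(k)
        ]
        w = softmax(logits)
        out.append([sum(w[j] * v[j][c] for j in range(len(v)))
                    for c in range(len(v[0]))])
    return out
```

Because V carries no positional term, the output is always a weighted mix of content features alone; position and geometry influence only how the weights are distributed.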
Issue date: 2022.
Author: Zheng, Jin Cheng
Faculty: Faculty of Science and Technology
Department: Department of Computer and Information Science
Degree: M.Sc.
Subject: Image processing -- Digital techniques
Supervisor: Pun, Chi Man
Files In This Item: Full-text (Intranet only)
Location: 1/F Zone C
Library URL: 991010196478106306