UM E-Theses Collection (澳門大學電子學位論文庫)
- Title
-
Hybrid-spatial transformer for image captioning
- English Abstract
-
In recent years, transformer-based models have achieved great success in many tasks such as machine translation. This encoder-decoder architecture has proven useful for image captioning as well. We propose a novel Hybrid-Spatial Transformer model for image captioning. In this work, we combine the global and local information of an image, extracted by VGG16 and Faster R-CNN respectively, as the input of the encoder. To further improve the performance of the model, we add spatial information to the attention layer by incorporating geometry features into the attention weights. Moreover, the queries Q, keys K, and values V differ slightly from the standard transformer in the following aspects: the positional encoding or embedding is not added to the values V in either the encoder or the decoder, and the positional embedding is added to the keys K in the cross-attention. The experimental results illustrate that our model achieves state-of-the-art performance on CIDEr-D, METEOR, and BLEU-1 on the MS-COCO dataset.
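The attention scheme the abstract describes can be sketched as follows. This is a minimal, hypothetical illustration (not the thesis's actual implementation): the positional embedding is added to the queries and keys but not to the values, and the geometry features enter as an additive bias on the attention logits. All function and variable names here are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def geometry_biased_attention(q, k, v, pos, geo_bias):
    """Single-head scaled dot-product attention sketch.

    q, k, v:  (n, d) query/key/value matrices
    pos:      (n, d) positional embedding, added to q and k only
              (per the abstract, values V receive no positional signal)
    geo_bias: (n, n) additive bias derived from geometry features
    """
    qp = q + pos          # positional embedding on queries
    kp = k + pos          # positional embedding on keys
    d = q.shape[-1]
    scores = qp @ kp.T / np.sqrt(d)
    # Geometry features are incorporated into the attention weights
    # as an additive bias before the softmax.
    weights = softmax(scores + geo_bias)
    return weights @ v    # values are mixed without positional embedding
```

With a zero bias this reduces to ordinary scaled dot-product attention; a strongly negative bias on all but one key collapses each output row onto that key's value, which is how a geometry prior can steer attention toward spatially related regions.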
- Issue date
-
2022.
- Author
-
Zheng, Jin Cheng
- Faculty
-
Faculty of Science and Technology
- Department
-
Department of Computer and Information Science
- Degree
-
M.Sc.
- Subject
-
Image processing -- Digital techniques
- Supervisor
-
Pun, Chi Man
- Files In This Item
- Location
- 1/F Zone C
- Library URL
- 991010196478106306