UM E-Theses Collection (澳門大學電子學位論文庫)

Title

Hybrid-spatial transformer for image captioning

English Abstract

In recent years, transformer-based models have achieved great success in many tasks, such as machine translation. This encoder-decoder architecture has proved useful for image captioning as well. We propose a novel Hybrid-Spatial Transformer model for image captioning. In this work, we combine the global and local information of an image, extracted by VGG16 and Faster R-CNN respectively, as the input to the encoder. To further improve the model's performance, we add spatial information to the attention layer by incorporating geometry features into the attention weights. Moreover, the queries Q, keys K, and values V differ slightly from those of the standard transformer in two respects: the positional encoding or embedding is not added to the values V in either the encoder or the decoder, and the positional embedding is added to the keys K in cross-attention. The experimental results show that our model achieves state-of-the-art performance on CIDEr-D, METEOR, and BLEU-1 on the MS-COCO dataset.
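
The attention variant described in the abstract can be illustrated with a minimal sketch. This is not the thesis implementation: it assumes a single attention head with no learned projections, a self-attention setting over region features, and a hypothetical geometry term (the negative distance between box centres) added to the attention logits. The names `spatial_attention` and `geometry_bias`, and the toy inputs, are illustrative assumptions. The sketch shows only the two ideas stated in the abstract: positional information enters the queries and keys but not the values, and a geometry feature modifies the attention weights.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def add(a, b):
    """Element-wise sum of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def geometry_bias(box_i, box_j):
    """Hypothetical geometry term: negative centre distance between
    two boxes given as (cx, cy, w, h). Closer boxes get a larger bias."""
    return -math.hypot(box_i[0] - box_j[0], box_i[1] - box_j[1])


def spatial_attention(feats, pos, boxes):
    """Single-head attention where positions are added to Q and K only,
    and a geometry bias is added to the logits (an assumed formulation)."""
    d = len(feats[0])
    queries = [add(f, p) for f, p in zip(feats, pos)]
    keys = [add(f, p) for f, p in zip(feats, pos)]
    values = feats  # values stay free of positional information
    outputs, weights = [], []
    for i, q in enumerate(queries):
        logits = [
            dot(q, k) / math.sqrt(d) + geometry_bias(boxes[i], boxes[j])
            for j, k in enumerate(keys)
        ]
        w = softmax(logits)
        out = [sum(w[j] * values[j][c] for j in range(len(values)))
               for c in range(d)]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Toy inputs: three regions with 2-d features, positions, and boxes.
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pos = [[0.1, 0.0], [0.0, 0.1], [0.1, 0.1]]
boxes = [(0.0, 0.0, 1.0, 1.0), (1.0, 0.0, 1.0, 1.0), (5.0, 5.0, 1.0, 1.0)]
outputs, weights = spatial_attention(feats, pos, boxes)
```

Because the values carry no positional term, each output row is a weighted average of the raw region features. The geometry bias shifts weight towards spatially nearby regions. A full model would also apply learned projections and multiple heads.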

Issue date

2022.

Author

Zheng, Jin Cheng

Faculty

Faculty of Science and Technology

Department

Department of Computer and Information Science

Degree

M.Sc.

Subject

Image processing -- Digital techniques

Supervisor

Pun, Chi Man

Location

1/F Zone C

Library URL

991010196478106306