TY - GEN
T1 - Constructing Phrase-level Semantic Labels to Form Multi-Grained Supervision for Image-Text Retrieval
AU - Fan, Zhihao
AU - Wei, Zhongyu
AU - Li, Zejun
AU - Wang, Siyuan
AU - Shan, Haijun
AU - Huang, Xuanjing
AU - Fan, Jianqing
N1 - Publisher Copyright:
© 2022 ACM.
PY - 2022/6/27
Y1 - 2022/6/27
N2 - Existing research on image-text retrieval mainly relies on sentence-level supervision to distinguish matched and mismatched sentences for a query image. However, semantic mismatch between an image and a sentence usually occurs at a finer grain, i.e., the phrase level. In this paper, we explore introducing additional phrase-level supervision for better identification of mismatched units in the text. In practice, multi-grained semantic labels are automatically constructed for a query image at both the sentence level and the phrase level. We construct text scene graphs for the matched sentences and extract entities and triples as the phrase-level labels. To integrate sentence-level and phrase-level supervision, we propose the Semantic Structure Aware Multimodal Transformer (SSAMT) for multi-modal representation learning. Inside the SSAMT, we utilize different kinds of attention mechanisms to enforce interactions of multi-grained semantic units on both the vision and language sides. For training, we propose multi-scale matching from both global and local perspectives, and penalize mismatched phrases. Experimental results on MS-COCO and Flickr30K show the effectiveness of our approach compared to state-of-the-art models.
AB - Existing research on image-text retrieval mainly relies on sentence-level supervision to distinguish matched and mismatched sentences for a query image. However, semantic mismatch between an image and a sentence usually occurs at a finer grain, i.e., the phrase level. In this paper, we explore introducing additional phrase-level supervision for better identification of mismatched units in the text. In practice, multi-grained semantic labels are automatically constructed for a query image at both the sentence level and the phrase level. We construct text scene graphs for the matched sentences and extract entities and triples as the phrase-level labels. To integrate sentence-level and phrase-level supervision, we propose the Semantic Structure Aware Multimodal Transformer (SSAMT) for multi-modal representation learning. Inside the SSAMT, we utilize different kinds of attention mechanisms to enforce interactions of multi-grained semantic units on both the vision and language sides. For training, we propose multi-scale matching from both global and local perspectives, and penalize mismatched phrases. Experimental results on MS-COCO and Flickr30K show the effectiveness of our approach compared to state-of-the-art models.
KW - fine-grained supervision
KW - image-text retrieval
KW - phrase modeling
UR - http://www.scopus.com/inward/record.url?scp=85134020146&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85134020146&partnerID=8YFLogxK
U2 - 10.1145/3512527.3531368
DO - 10.1145/3512527.3531368
M3 - Conference contribution
AN - SCOPUS:85134020146
T3 - ICMR 2022 - Proceedings of the 2022 International Conference on Multimedia Retrieval
SP - 137
EP - 145
BT - ICMR 2022 - Proceedings of the 2022 International Conference on Multimedia Retrieval
PB - Association for Computing Machinery, Inc
T2 - 2022 International Conference on Multimedia Retrieval, ICMR 2022
Y2 - 27 June 2022 through 30 June 2022
ER -