Constructing Phrase-level Semantic Labels to Form Multi-Grained Supervision for Image-Text Retrieval

Zhihao Fan, Zhongyu Wei, Zejun Li, Siyuan Wang, Haijun Shan, Xuanjing Huang, Jianqing Fan

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

7 Scopus citations

Abstract

Existing research on image-text retrieval mainly relies on sentence-level supervision to distinguish matched from mismatched sentences for a query image. However, semantic mismatch between an image and a sentence usually occurs at a finer grain, i.e., the phrase level. In this paper, we explore introducing additional phrase-level supervision to better identify mismatched units in the text. In practice, multi-grained semantic labels are automatically constructed for a query image at both the sentence level and the phrase level. We construct text scene graphs for the matched sentences and extract entities and triples as the phrase-level labels. To integrate sentence-level and phrase-level supervision, we propose the Semantic Structure Aware Multimodal Transformer (SSAMT) for multi-modal representation learning. Inside SSAMT, we use different kinds of attention mechanisms to enforce interactions among multi-grained semantic units on both the vision and language sides. For training, we propose multi-scale matching from both global and local perspectives and penalize mismatched phrases. Experimental results on MS-COCO and Flickr30K show the effectiveness of our approach compared to some state-of-the-art models.
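The label construction and training signal described in the abstract can be illustrated with a small sketch. This is not the SSAMT implementation: the `SceneGraph` container, the `phrase_labels` and `multi_grained_loss` helpers, the hinge form of the sentence-level term, and the way mismatched phrases are penalized are all assumptions made for illustration, and the similarity scores are hard-coded rather than produced by a model.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Set, Tuple


@dataclass(frozen=True)
class SceneGraph:
    """Hypothetical text scene graph for one caption."""
    entities: Tuple[str, ...]
    triples: Tuple[Tuple[str, str, str], ...]  # (subject, relation, object)


def phrase_labels(graphs: Iterable[SceneGraph]) -> Set[str]:
    """Collect entities and triples of the matched captions as positive phrases."""
    phrases: Set[str] = set()
    for graph in graphs:
        phrases.update(graph.entities)
        phrases.update(" ".join(triple) for triple in graph.triples)
    return phrases


def multi_grained_loss(
    sent_pos: float,
    sent_neg: float,
    phrase_scores: Dict[str, float],
    positives: Set[str],
    margin: float = 0.2,
    weight: float = 1.0,
) -> float:
    """Sentence-level hinge loss plus a penalty on mismatched phrases.

    Assumption: a phrase counts as mismatched when it is absent from the
    positive phrase set, and its image-phrase score is penalized if positive.
    """
    sentence_term = max(0.0, margin - sent_pos + sent_neg)
    phrase_term = sum(
        max(0.0, score)
        for phrase, score in phrase_scores.items()
        if phrase not in positives
    )
    return sentence_term + weight * phrase_term


# Toy usage: one matched caption, one negative caption containing "cat".
graph = SceneGraph(
    entities=("dog", "frisbee"),
    triples=(("dog", "catch", "frisbee"),),
)
labels = phrase_labels([graph])
loss = multi_grained_loss(
    sent_pos=0.8,
    sent_neg=0.5,
    phrase_scores={"dog": 0.9, "cat": 0.4},
    positives=labels,
)
```

In this toy case the sentence-level term is already satisfied by the margin, so the loss comes entirely from the mismatched phrase "cat", which is the kind of fine-grained signal the abstract argues sentence-level supervision alone does not provide.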

Original language: English (US)
Title of host publication: ICMR 2022 - Proceedings of the 2022 International Conference on Multimedia Retrieval
Publisher: Association for Computing Machinery, Inc
Pages: 137-145
Number of pages: 9
ISBN (Electronic): 9781450392389
State: Published - Jun 27 2022
Event: 2022 International Conference on Multimedia Retrieval, ICMR 2022 - Newark, United States
Duration: Jun 27 2022 to Jun 30 2022

Publication series

Name: ICMR 2022 - Proceedings of the 2022 International Conference on Multimedia Retrieval

Conference

Conference: 2022 International Conference on Multimedia Retrieval, ICMR 2022
Country/Territory: United States
City: Newark
Period: 6/27/22 to 6/30/22

All Science Journal Classification (ASJC) codes

  • Computer Graphics and Computer-Aided Design
  • Human-Computer Interaction
  • Software

Keywords

  • fine-grained supervision
  • image-text retrieval
  • phrase modeling
