SiRi: A Simple Selective Retraining Mechanism for Transformer-Based Visual Grounding

Mengxue Qu, Yu Wu, Wu Liu, Qiqi Gong, Xiaodan Liang, Olga Russakovsky, Yao Zhao, Yunchao Wei

Research output: Chapter in Book/Report/Conference proceedingConference contribution

6 Scopus citations


In this paper, we investigate how to achieve better visual grounding with modern vision-language transformers, and propose a simple yet powerful Selective Retraining (SiRi) mechanism for this challenging task. Particularly, SiRi conveys a significant principle to the research of visual grounding, i.e., a better initialized vision-language encoder would help the model converge to a better local minimum, advancing the performance accordingly. In specific, we continually update the parameters of the encoder as the training goes on, while periodically re-initialize rest of the parameters to compel the model to be better optimized based on an enhanced encoder. SiRi can significantly outperform previous approaches on three popular benchmarks. Specifically, our method achieves 83.04% Top1 accuracy on RefCOCO+ testA, outperforming the state-of-the-art approaches (training from scratch) by more than 10.21%. Additionally, we reveal that SiRi performs surprisingly superior even with limited training data. We also extend it to transformer-based visual grounding models and other vision-language tasks to verify the validity. Code is available at

Original languageEnglish (US)
Title of host publicationComputer Vision – ECCV 2022 - 17th European Conference, Proceedings
EditorsShai Avidan, Gabriel Brostow, Moustapha Cissé, Giovanni Maria Farinella, Tal Hassner
PublisherSpringer Science and Business Media Deutschland GmbH
Number of pages17
ISBN (Print)9783031198328
StatePublished - 2022
Event17th European Conference on Computer Vision, ECCV 2022 - Tel Aviv, Israel
Duration: Oct 23 2022Oct 27 2022

Publication series

NameLecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume13695 LNCS
ISSN (Print)0302-9743
ISSN (Electronic)1611-3349


Conference17th European Conference on Computer Vision, ECCV 2022
CityTel Aviv

All Science Journal Classification (ASJC) codes

  • Theoretical Computer Science
  • General Computer Science


  • Generalization
  • Transformer
  • Visual grounding


Dive into the research topics of 'SiRi: A Simple Selective Retraining Mechanism for Transformer-Based Visual Grounding'. Together they form a unique fingerprint.

Cite this