TY - GEN
T1 - SiRi: A Simple Selective Retraining Mechanism for Transformer-Based Visual Grounding
T2 - 17th European Conference on Computer Vision, ECCV 2022
AU - Qu, Mengxue
AU - Wu, Yu
AU - Liu, Wu
AU - Gong, Qiqi
AU - Liang, Xiaodan
AU - Russakovsky, Olga
AU - Zhao, Yao
AU - Wei, Yunchao
N1 - Publisher Copyright:
© 2022, The Author(s), under exclusive license to Springer Nature Switzerland AG.
PY - 2022
Y1 - 2022
N2 - In this paper, we investigate how to achieve better visual grounding with modern vision-language transformers and propose a simple yet powerful Selective Retraining (SiRi) mechanism for this challenging task. Particularly, SiRi conveys a significant principle to the research of visual grounding: a better-initialized vision-language encoder helps the model converge to a better local minimum, advancing performance accordingly. Concretely, we continually update the parameters of the encoder as training goes on, while periodically re-initializing the rest of the parameters to compel the model to be better optimized based on the enhanced encoder. SiRi significantly outperforms previous approaches on three popular benchmarks. Specifically, our method achieves 83.04% Top-1 accuracy on RefCOCO+ testA, outperforming state-of-the-art approaches (trained from scratch) by more than 10.21%. Additionally, we reveal that SiRi performs surprisingly well even with limited training data. We also extend it to transformer-based visual grounding models and other vision-language tasks to verify its validity. Code is available at https://github.com/qumengxue/siri-vg.git.
AB - In this paper, we investigate how to achieve better visual grounding with modern vision-language transformers and propose a simple yet powerful Selective Retraining (SiRi) mechanism for this challenging task. Particularly, SiRi conveys a significant principle to the research of visual grounding: a better-initialized vision-language encoder helps the model converge to a better local minimum, advancing performance accordingly. Concretely, we continually update the parameters of the encoder as training goes on, while periodically re-initializing the rest of the parameters to compel the model to be better optimized based on the enhanced encoder. SiRi significantly outperforms previous approaches on three popular benchmarks. Specifically, our method achieves 83.04% Top-1 accuracy on RefCOCO+ testA, outperforming state-of-the-art approaches (trained from scratch) by more than 10.21%. Additionally, we reveal that SiRi performs surprisingly well even with limited training data. We also extend it to transformer-based visual grounding models and other vision-language tasks to verify its validity. Code is available at https://github.com/qumengxue/siri-vg.git.
KW - Generalization
KW - Transformer
KW - Visual grounding
UR - http://www.scopus.com/inward/record.url?scp=85144486883&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85144486883&partnerID=8YFLogxK
U2 - 10.1007/978-3-031-19833-5_32
DO - 10.1007/978-3-031-19833-5_32
M3 - Conference contribution
AN - SCOPUS:85144486883
SN - 9783031198328
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 546
EP - 562
BT - Computer Vision – ECCV 2022 – 17th European Conference, Proceedings
A2 - Avidan, Shai
A2 - Brostow, Gabriel
A2 - Cissé, Moustapha
A2 - Farinella, Giovanni Maria
A2 - Hassner, Tal
PB - Springer Science and Business Media Deutschland GmbH
Y2 - 23 October 2022 through 27 October 2022
ER -