TY - GEN
T1 - Run-time Observation Interventions Make Vision-Language-Action Models More Visually Robust
AU - Hancock, Asher J.
AU - Ren, Allen Z.
AU - Majumdar, Anirudha
N1 - Publisher Copyright:
© 2025 IEEE.
PY - 2025
Y1 - 2025
N2 - Vision-language-action (VLA) models trained on large-scale internet data and robot demonstrations have the potential to serve as generalist robot policies. However, despite their large-scale training, VLAs are often brittle to task-irrelevant visual details such as distractor objects or background colors. We introduce Bring Your Own VLA (BYOVLA): a run-time intervention scheme that (1) dynamically identifies regions of the input image that the model is sensitive to, and (2) minimally alters task-irrelevant regions to reduce the model's sensitivity using automated image editing tools. Our approach is compatible with any off-the-shelf VLA without model fine-tuning or access to the model's weights. Hardware experiments on language-instructed manipulation tasks demonstrate that BYOVLA enables state-of-the-art VLA models to nearly retain their nominal performance in the presence of distractor objects and backgrounds, which otherwise degrade task success rates by up to 60%. Website with additional information, videos, and code: https://aasherh.github.io/byovla/.
AB - Vision-language-action (VLA) models trained on large-scale internet data and robot demonstrations have the potential to serve as generalist robot policies. However, despite their large-scale training, VLAs are often brittle to task-irrelevant visual details such as distractor objects or background colors. We introduce Bring Your Own VLA (BYOVLA): a run-time intervention scheme that (1) dynamically identifies regions of the input image that the model is sensitive to, and (2) minimally alters task-irrelevant regions to reduce the model's sensitivity using automated image editing tools. Our approach is compatible with any off-the-shelf VLA without model fine-tuning or access to the model's weights. Hardware experiments on language-instructed manipulation tasks demonstrate that BYOVLA enables state-of-the-art VLA models to nearly retain their nominal performance in the presence of distractor objects and backgrounds, which otherwise degrade task success rates by up to 60%. Website with additional information, videos, and code: https://aasherh.github.io/byovla/.
UR - https://www.scopus.com/pages/publications/105016630291
UR - https://www.scopus.com/pages/publications/105016630291#tab=citedBy
U2 - 10.1109/ICRA55743.2025.11128017
DO - 10.1109/ICRA55743.2025.11128017
M3 - Conference contribution
AN - SCOPUS:105016630291
T3 - Proceedings - IEEE International Conference on Robotics and Automation
SP - 9499
EP - 9506
BT - 2025 IEEE International Conference on Robotics and Automation, ICRA 2025
A2 - Ott, Christian
A2 - Admoni, Henny
A2 - Behnke, Sven
A2 - Bogdan, Stjepan
A2 - Bolopion, Aude
A2 - Choi, Youngjin
A2 - Ficuciello, Fanny
A2 - Gans, Nicholas
A2 - Gosselin, Clement
A2 - Harada, Kensuke
A2 - Kayacan, Erdal
A2 - Kim, H. Jin
A2 - Leutenegger, Stefan
A2 - Liu, Zhe
A2 - Maiolino, Perla
A2 - Marques, Lino
A2 - Matsubara, Takamitsu
A2 - Mavrommati, Anastasia
A2 - Minor, Mark
A2 - O'Kane, Jason
A2 - Park, Hae Won
A2 - Park, Hae-Won
A2 - Rekleitis, Ioannis
A2 - Renda, Federico
A2 - Ricci, Elisa
A2 - Riek, Laurel D.
A2 - Sabattini, Lorenzo
A2 - Shen, Shaojie
A2 - Sun, Yu
A2 - Wieber, Pierre-Brice
A2 - Yamane, Katsu
A2 - Yu, Jingjin
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2025 IEEE International Conference on Robotics and Automation, ICRA 2025
Y2 - 19 May 2025 through 23 May 2025
ER -