TY - GEN
T1 - What's in a question
T2 - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017
AU - Ganju, Siddha
AU - Russakovsky, Olga
AU - Gupta, Abhinav
N1 - Publisher Copyright:
© 2017 IEEE.
PY - 2017/11/6
Y1 - 2017/11/6
N2 - Collecting fully annotated image datasets is challenging and expensive. Many types of weak supervision have been explored: weak manual annotations, web search results, temporal continuity, ambient sound and others. We focus on one particular unexplored mode: visual questions that are asked about images. The key observation that inspires our work is that the question itself provides useful information about the image (even without the answer being available). For instance, the question “what is the breed of the dog?” informs the AI that the animal in the scene is a dog and that there is only one dog present. We make three contributions: (1) providing an extensive qualitative and quantitative analysis of the information contained in human visual questions, (2) proposing two simple but surprisingly effective modifications to the standard visual question answering models that allow them to make use of weak supervision in the form of unanswered questions associated with images and (3) demonstrating that a simple data augmentation strategy inspired by our insights results in a 7.1% improvement on the standard VQA benchmark.
AB - Collecting fully annotated image datasets is challenging and expensive. Many types of weak supervision have been explored: weak manual annotations, web search results, temporal continuity, ambient sound and others. We focus on one particular unexplored mode: visual questions that are asked about images. The key observation that inspires our work is that the question itself provides useful information about the image (even without the answer being available). For instance, the question “what is the breed of the dog?” informs the AI that the animal in the scene is a dog and that there is only one dog present. We make three contributions: (1) providing an extensive qualitative and quantitative analysis of the information contained in human visual questions, (2) proposing two simple but surprisingly effective modifications to the standard visual question answering models that allow them to make use of weak supervision in the form of unanswered questions associated with images and (3) demonstrating that a simple data augmentation strategy inspired by our insights results in a 7.1% improvement on the standard VQA benchmark.
UR - http://www.scopus.com/inward/record.url?scp=85044529112&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85044529112&partnerID=8YFLogxK
U2 - 10.1109/CVPR.2017.680
DO - 10.1109/CVPR.2017.680
M3 - Conference contribution
AN - SCOPUS:85044529112
T3 - Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017
SP - 6422
EP - 6431
BT - Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 21 July 2017 through 26 July 2017
ER -