TY - GEN
T1 - Towards Unique and Informative Captioning of Images
AU - Wang, Zeyu
AU - Feng, Berthy
AU - Narasimhan, Karthik
AU - Russakovsky, Olga
N1 - Publisher Copyright:
© 2020, Springer Nature Switzerland AG.
PY - 2020
Y1 - 2020
N2 - Despite considerable progress, state of the art image captioning models produce generic captions, leaving out important image details. Furthermore, these systems may even misrepresent the image in order to produce a simpler caption consisting of common concepts. In this paper, we first analyze both modern captioning systems and evaluation metrics through empirical experiments to quantify these phenomena. We find that modern captioning systems return higher likelihoods for incorrect distractor sentences compared to ground truth captions, and that evaluation metrics like SPICE can be ‘topped’ using simple captioning systems relying on object detectors. Inspired by these observations, we design a new metric (SPICE-U) by introducing a notion of uniqueness over the concepts generated in a caption. We show that SPICE-U is better correlated with human judgements compared to SPICE, and effectively captures notions of diversity and descriptiveness. Finally, we also demonstrate a general technique to improve any existing captioning model – by using mutual information as a re-ranking objective during decoding. Empirically, this results in more unique and informative captions, and improves three different state-of-the-art models on SPICE-U as well as average score over existing metrics (Code is available at https://github.com/princetonvisualai/SPICE-U).
AB - Despite considerable progress, state of the art image captioning models produce generic captions, leaving out important image details. Furthermore, these systems may even misrepresent the image in order to produce a simpler caption consisting of common concepts. In this paper, we first analyze both modern captioning systems and evaluation metrics through empirical experiments to quantify these phenomena. We find that modern captioning systems return higher likelihoods for incorrect distractor sentences compared to ground truth captions, and that evaluation metrics like SPICE can be ‘topped’ using simple captioning systems relying on object detectors. Inspired by these observations, we design a new metric (SPICE-U) by introducing a notion of uniqueness over the concepts generated in a caption. We show that SPICE-U is better correlated with human judgements compared to SPICE, and effectively captures notions of diversity and descriptiveness. Finally, we also demonstrate a general technique to improve any existing captioning model – by using mutual information as a re-ranking objective during decoding. Empirically, this results in more unique and informative captions, and improves three different state-of-the-art models on SPICE-U as well as average score over existing metrics (Code is available at https://github.com/princetonvisualai/SPICE-U).
UR - http://www.scopus.com/inward/record.url?scp=85097370356&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85097370356&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-58571-6_37
DO - 10.1007/978-3-030-58571-6_37
M3 - Conference contribution
AN - SCOPUS:85097370356
SN - 9783030585709
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 629
EP - 644
BT - Computer Vision – ECCV 2020 - 16th European Conference, 2020, Proceedings
A2 - Vedaldi, Andrea
A2 - Bischof, Horst
A2 - Brox, Thomas
A2 - Frahm, Jan-Michael
PB - Springer Science and Business Media Deutschland GmbH
T2 - 16th European Conference on Computer Vision, ECCV 2020
Y2 - 23 August 2020 through 28 August 2020
ER -