TY - GEN
T1 - An automated end-to-end pipeline for fine-grained video annotation using deep neural networks
AU - Vandersmissen, Baptist
AU - Sterckx, Lucas
AU - Demeester, Thomas
AU - Jalalvand, Azarakhsh
AU - De Neve, Wesley
AU - Van de Walle, Rik
N1 - Publisher Copyright:
© 2016 ACM.
PY - 2016/6/6
Y1 - 2016/6/6
N2 - The searchability of video content is often limited to the descriptions authors and/or annotators care to provide. The level of description can range from absolutely nothing to fine-grained annotations at the level of frames. Based on these annotations, certain parts of the video content are more searchable than others. Within the context of the STEAMER project, we developed an innovative end-to-end system that attempts to tackle the problem of unsupervised retrieval of news video content, leveraging multiple information streams and deep neural networks. In particular, we extracted keyphrases and named entities from transcripts, subsequently refining these keyphrases and named entities based on their visual appearance in the news video content. Moreover, to allow for fine-grained frame-level annotations, we temporally located high-confidence keyphrases in the news video content. To that end, we had to tackle challenges such as the automatic construction of training sets and the automatic assessment of keyphrase imageability. In this paper, we discuss the main components of our end-to-end system, capable of transforming textual and visual information into fine-grained video annotations.
AB - The searchability of video content is often limited to the descriptions authors and/or annotators care to provide. The level of description can range from absolutely nothing to fine-grained annotations at the level of frames. Based on these annotations, certain parts of the video content are more searchable than others. Within the context of the STEAMER project, we developed an innovative end-to-end system that attempts to tackle the problem of unsupervised retrieval of news video content, leveraging multiple information streams and deep neural networks. In particular, we extracted keyphrases and named entities from transcripts, subsequently refining these keyphrases and named entities based on their visual appearance in the news video content. Moreover, to allow for fine-grained frame-level annotations, we temporally located high-confidence keyphrases in the news video content. To that end, we had to tackle challenges such as the automatic construction of training sets and the automatic assessment of keyphrase imageability. In this paper, we discuss the main components of our end-to-end system, capable of transforming textual and visual information into fine-grained video annotations.
KW - Deep neural networks
KW - Fine-grained video annotation
KW - Video retrieval
UR - http://www.scopus.com/inward/record.url?scp=84978646636&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84978646636&partnerID=8YFLogxK
U2 - 10.1145/2911996.2912028
DO - 10.1145/2911996.2912028
M3 - Conference contribution
AN - SCOPUS:84978646636
T3 - ICMR 2016 - Proceedings of the 2016 ACM International Conference on Multimedia Retrieval
SP - 409
EP - 412
BT - ICMR 2016 - Proceedings of the 2016 ACM International Conference on Multimedia Retrieval
PB - Association for Computing Machinery, Inc
T2 - 6th ACM International Conference on Multimedia Retrieval, ICMR 2016
Y2 - 6 June 2016 through 9 June 2016
ER -