End-to-End Learning of Action Detection from Frame Glimpses in Videos

Serena Yeung, Olga Russakovsky, Greg Mori, Li Fei-Fei

Research output: Chapter in Book/Report/Conference proceedingConference contribution

217 Scopus citations

Abstract

In this work we introduce a fully end-to-end approach for action detection in videos that learns to directly predict the temporal bounds of actions. Our intuition is that the process of detecting actions is naturally one of observation and refinement: observing moments in video, and refining hypotheses about when an action is occurring. Based on this insight, we formulate our model as a recurrent neural network-based agent that interacts with a video over time. The agent observes video frames and decides both where to look next and when to emit a prediction. Since backpropagation is not adequate in this non-differentiable setting, we use REINFORCE to learn the agent's decision policy. Our model achieves state-of-the-art results on the THUMOS'14 and ActivityNet datasets while observing only a fraction (2% or less) of the video frames.

Original languageEnglish (US)
Title of host publicationProceedings - 29th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016
PublisherIEEE Computer Society
Pages2678-2687
Number of pages10
ISBN (Electronic)9781467388504
DOIs
StatePublished - Dec 9 2016
Externally publishedYes
Event29th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016 - Las Vegas, United States
Duration: Jun 26 2016Jul 1 2016

Publication series

NameProceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
Volume2016-December
ISSN (Print)1063-6919

Conference

Conference29th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016
CountryUnited States
CityLas Vegas
Period6/26/167/1/16

All Science Journal Classification (ASJC) codes

  • Software
  • Computer Vision and Pattern Recognition

Fingerprint Dive into the research topics of 'End-to-End Learning of Action Detection from Frame Glimpses in Videos'. Together they form a unique fingerprint.

Cite this