TY - GEN
T1 - D3D: Distilled 3D Networks for Video Action Recognition
T2 - 2020 IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2020
AU - Stroud, Jonathan C.
AU - Ross, David A.
AU - Sun, Chen
AU - Deng, Jia
AU - Sukthankar, Rahul
N1 - Publisher Copyright:
© 2020 IEEE.
PY - 2020/3
Y1 - 2020/3
AB - State-of-the-art methods for action recognition commonly use two networks: the spatial stream, which takes RGB frames as input, and the temporal stream, which takes optical flow as input. In recent work, both streams are 3D Convolutional Neural Networks, which use spatiotemporal filters. These filters can respond to motion, and therefore should allow the network to learn motion representations, removing the need for optical flow. However, we still see significant benefits in performance by feeding optical flow into the temporal stream, indicating that the spatial stream is "missing" some of the signal that the temporal stream captures. In this work, we first investigate whether motion representations are indeed missing in the spatial stream, and show that there is significant room for improvement. Second, we demonstrate that these motion representations can be improved using distillation, that is, by tuning the spatial stream to mimic the temporal stream, effectively combining both models into a single stream. Finally, we show that our Distilled 3D Network (D3D) achieves performance on par with the two-stream approach, with no need to compute optical flow during inference.
UR - http://www.scopus.com/inward/record.url?scp=85085504852&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85085504852&partnerID=8YFLogxK
U2 - 10.1109/WACV45572.2020.9093274
DO - 10.1109/WACV45572.2020.9093274
M3 - Conference contribution
AN - SCOPUS:85085504852
T3 - Proceedings - 2020 IEEE Winter Conference on Applications of Computer Vision, WACV 2020
SP - 614
EP - 623
BT - Proceedings - 2020 IEEE Winter Conference on Applications of Computer Vision, WACV 2020
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 1 March 2020 through 5 March 2020
ER -