TY - JOUR
T1 - Text-based Editing of Talking-head Video
AU - Fried, Ohad
AU - Tewari, Ayush
AU - Zollhöfer, Michael
AU - Finkelstein, Adam
AU - Shechtman, Eli
AU - Goldman, Dan B.
AU - Genova, Kyle
AU - Jin, Zeyu
AU - Theobalt, Christian
AU - Agrawala, Maneesh
N1 - Funding Information:
This work was supported by the Brown Institute for Media Innovation, the Max Planck Center for Visual Computing and Communications, ERC Consolidator Grant 4DRepLy (770784), Adobe Systems, and the Office of the Dean for Research at Princeton University.
Publisher Copyright:
© 2019 Copyright held by the owner/author(s). Publication rights licensed to ACM.
PY - 2019/7
Y1 - 2019/7
AB - Editing talking-head video to change the speech content or to remove filler words is challenging. We propose a novel method to edit talking-head video based on its transcript to produce a realistic output video in which the dialogue of the speaker has been modified, while maintaining a seamless audio-visual flow (i.e., no jump cuts). Our method automatically annotates an input talking-head video with phonemes, visemes, 3D face pose and geometry, reflectance, expression, and scene illumination per frame. To edit a video, the user only has to edit the transcript, and an optimization strategy then chooses segments of the input corpus as base material. The annotated parameters corresponding to the selected segments are seamlessly stitched together and used to produce an intermediate video representation in which the lower half of the face is rendered with a parametric face model. Finally, a recurrent video generation network transforms this representation into a photorealistic video that matches the edited transcript. We demonstrate a large variety of edits, such as the addition, removal, and alteration of words, as well as convincing language translation and full sentence synthesis.
KW - Dubbing
KW - Face parameterization
KW - Face tracking
KW - Neural rendering
KW - Talking heads
KW - Text-based video editing
KW - Visemes
UR - http://www.scopus.com/inward/record.url?scp=85073889946&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85073889946&partnerID=8YFLogxK
U2 - 10.1145/3306346.3323028
DO - 10.1145/3306346.3323028
M3 - Article
AN - SCOPUS:85073889946
SN - 0730-0301
VL - 38
JO - ACM Transactions on Graphics
JF - ACM Transactions on Graphics
IS - 4
M1 - 68
ER -