TY - GEN
T1 - The Implicit Values of A Good Hand Shake
T2 - 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022
AU - Chugunov, Ilya
AU - Zhang, Yuxuan
AU - Xia, Zhihao
AU - Zhang, Xuaner
AU - Chen, Jiawen
AU - Heide, Felix
N1 - Funding Information:
We show that with a modern smartphone, it's possible to reconstruct a high-fidelity depth map from just a snapshot of a textured “tabletop” object. We quantitatively validate that our technique outperforms several recent baselines and qualitatively compare to a dedicated depth camera. Rolling Shutter. There is a delay of tens of milliseconds between when we record the first and last rows of pixels from the camera sensor [31], during which time the position of the phone could slightly shift. Given accurate shutter timings, one may incorporate a model of rolling shutter similar to [23] directly into the implicit depth model. Training Time. Although our training time is practical for offline processing and opens the potential for easy collection of a large-scale training corpus, our method may be further accelerated with an adaptive sampling scheme that takes pose, color, and depth information into account to select the most useful samples for network training. Additional Sensors. We hope in the future to gain access to raw phone LiDAR samples, whose photon time tags could provide an additional sparse, high-trust supervision signal. Modern phones now also come with multiple cameras with different focal properties. If synchronously acquired, their video streams could expand the overall effective baseline of our setup and provide additional geometric information for depth reconstruction – towards snapshot smartphone depth imaging that exploits all available sensor modalities. Acknowledgments. Ilya Chugunov was supported by an NSF Graduate Research Fellowship. Felix Heide was supported by an NSF CAREER Award (2047359), a Sony Young Faculty Award, and a Project X Innovation Award.
Publisher Copyright:
© 2022 IEEE.
PY - 2022
Y1 - 2022
AB - Modern smartphones can continuously stream multi-megapixel RGB images at 60 Hz, synchronized with high-quality 3D pose information and low-resolution LiDAR-driven depth estimates. During a snapshot photograph, the natural unsteadiness of the photographer's hands offers millimeter-scale variation in camera pose, which we can capture along with RGB and depth in a circular buffer. In this work we explore how, from a bundle of these measurements acquired during viewfinding, we can combine dense micro-baseline parallax cues with kilopixel LiDAR depth to distill a high-fidelity depth map. We take a test-time optimization approach and train a coordinate MLP to output photometrically and geometrically consistent depth estimates at the continuous coordinates along the path traced by the photographer's natural hand shake. With no additional hardware, artificial hand motion, or user interaction beyond the press of a button, our proposed method brings high-resolution depth estimates to point-and-shoot 'table-top' photography - textured objects at close range.
KW - 3D from multi-view and sensors
KW - 3D from single images
KW - Machine learning
KW - RGBD sensors and analytics
UR - http://www.scopus.com/inward/record.url?scp=85132250640&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85132250640&partnerID=8YFLogxK
U2 - 10.1109/CVPR52688.2022.00287
DO - 10.1109/CVPR52688.2022.00287
M3 - Conference contribution
AN - SCOPUS:85132250640
T3 - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
SP - 2842
EP - 2852
BT - Proceedings - 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022
PB - IEEE Computer Society
Y2 - 19 June 2022 through 24 June 2022
ER -