TY - JOUR
T1 - Dirty Pixels
T2 - Towards End-to-end Image Processing and Perception
AU - Diamond, Steven
AU - Sitzmann, Vincent
AU - Julca-Aguilar, Frank
AU - Boyd, Stephen
AU - Wetzstein, Gordon
AU - Heide, Felix
N1 - Publisher Copyright:
© 2021 Association for Computing Machinery.
PY - 2021/6
Y1 - 2021/6
N2 - Real-world, imaging systems acquire measurements that are degraded by noise, optical aberrations, and other imperfections that make image processing for human viewing and higher-level perception tasks challenging. Conventional cameras address this problem by compartmentalizing imaging from high-level task processing. As such, conventional imaging involves processing the RAW sensor measurements in a sequential pipeline of steps, such as demosaicking, denoising, deblurring, tone-mapping, and compression. This pipeline is optimized to obtain a visually pleasing image. High-level processing, however, involves steps such as feature extraction, classification, tracking, and fusion. While this silo-ed design approach allows for efficient development, it also dictates compartmentalized performance metrics without knowledge of the higher-level task of the camera system. For example, today's demosaicking and denoising algorithms are designed using perceptual image quality metrics but not with domain-specific tasks such as object detection in mind. We propose an end-to-end differentiable architecture that jointly performs demosaicking, denoising, deblurring, tone-mapping, and classification (see Figure 1). The architecture does not require any intermediate losses based on perceived image quality and learns processing pipelines whose outputs differ from those of existing ISPs optimized for perceptual quality, preserving fine detail at the cost of increased noise and artifacts. We show that state-of-the-art ISPs discard information that is essential in corner cases, such as extremely low-light conditions, where conventional imaging and perception stacks fail. We demonstrate on captured and simulated data that our model substantially improves perception in low light and other challenging conditions, which is imperative for real-world applications such as autonomous driving, robotics, and surveillance. Finally, we found that the proposed model also achieves state-of-the-art accuracy when optimized for image reconstruction in low-light conditions, validating the architecture itself as a potentially useful drop-in network for reconstruction and analysis tasks beyond the applications demonstrated in this work. Our proposed models, datasets, and calibration data are available at https://github.com/princeton-computational-imaging/DirtyPixels.
AB - Real-world, imaging systems acquire measurements that are degraded by noise, optical aberrations, and other imperfections that make image processing for human viewing and higher-level perception tasks challenging. Conventional cameras address this problem by compartmentalizing imaging from high-level task processing. As such, conventional imaging involves processing the RAW sensor measurements in a sequential pipeline of steps, such as demosaicking, denoising, deblurring, tone-mapping, and compression. This pipeline is optimized to obtain a visually pleasing image. High-level processing, however, involves steps such as feature extraction, classification, tracking, and fusion. While this silo-ed design approach allows for efficient development, it also dictates compartmentalized performance metrics without knowledge of the higher-level task of the camera system. For example, today's demosaicking and denoising algorithms are designed using perceptual image quality metrics but not with domain-specific tasks such as object detection in mind. We propose an end-to-end differentiable architecture that jointly performs demosaicking, denoising, deblurring, tone-mapping, and classification (see Figure 1). The architecture does not require any intermediate losses based on perceived image quality and learns processing pipelines whose outputs differ from those of existing ISPs optimized for perceptual quality, preserving fine detail at the cost of increased noise and artifacts. We show that state-of-the-art ISPs discard information that is essential in corner cases, such as extremely low-light conditions, where conventional imaging and perception stacks fail. We demonstrate on captured and simulated data that our model substantially improves perception in low light and other challenging conditions, which is imperative for real-world applications such as autonomous driving, robotics, and surveillance. Finally, we found that the proposed model also achieves state-of-the-art accuracy when optimized for image reconstruction in low-light conditions, validating the architecture itself as a potentially useful drop-in network for reconstruction and analysis tasks beyond the applications demonstrated in this work. Our proposed models, datasets, and calibration data are available at https://github.com/princeton-computational-imaging/DirtyPixels.
KW - Computational photography
KW - Machine learning
UR - http://www.scopus.com/inward/record.url?scp=85119111949&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85119111949&partnerID=8YFLogxK
U2 - 10.1145/3446918
DO - 10.1145/3446918
M3 - Article
AN - SCOPUS:85119111949
SN - 0730-0301
VL - 40
JO - ACM Transactions on Graphics
JF - ACM Transactions on Graphics
IS - 3
M1 - 23
ER -