TY - GEN
T1 - Outlier Detection in Large Radiological Datasets Using UMAP
AU - Islam, Mohammad Tariqul
AU - Fleischer, Jason W.
N1 - Publisher Copyright:
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2025.
PY - 2025
Y1 - 2025
N2 - The success of machine learning algorithms heavily relies on the quality of samples and the accuracy of their corresponding labels. However, building and maintaining large, high-quality datasets is an enormous task. This is especially true for biomedical data and for meta-sets that are compiled from smaller ones, as variations in image quality, labeling, reports, and archiving can lead to errors, inconsistencies, and repeated samples. Here, we show that the uniform manifold approximation and projection (UMAP) algorithm can find these anomalies essentially by forming independent clusters that are distinct from the main (“good”) data but similar to other points with the same error type. As a representative example, we apply UMAP to discover outliers in the publicly available ChestX-ray14, CheXpert, and MURA datasets. While the results are archival and retrospective and focus on radiological images, the graph-based methods work for any data type and will prove equally beneficial for curation at the time of dataset creation.
KW - data curation
KW - data visualization
KW - neighbor embedding
KW - x-ray
UR - http://www.scopus.com/inward/record.url?scp=85207645437&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85207645437&partnerID=8YFLogxK
U2 - 10.1007/978-3-031-73967-5_11
DO - 10.1007/978-3-031-73967-5_11
M3 - Conference contribution
AN - SCOPUS:85207645437
SN - 9783031739668
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 111
EP - 121
BT - Topology- and Graph-Informed Imaging Informatics - 1st International Workshop, TGI3 2024, Held in Conjunction with MICCAI 2024, Proceedings
A2 - Chen, Chao
A2 - Singh, Yash
A2 - Hu, Xiaoling
PB - Springer Science and Business Media Deutschland GmbH
T2 - 1st Workshop on Topology- and Graph-Informed Imaging Informatics, TGI3 2024, held in conjunction with the 27th International Conference on Medical Image Computing and Computer Assisted Intervention, MICCAI 2024
Y2 - 10 October 2024 through 10 October 2024
ER -