TY - JOUR
T1 - Missing value estimation methods for DNA microarrays
AU - Troyanskaya, Olga
AU - Cantor, Michael
AU - Sherlock, Gavin
AU - Brown, Pat
AU - Hastie, Trevor
AU - Tibshirani, Robert
AU - Botstein, David
AU - Altman, Russ B.
N1 - Funding Information:
We would like to thank Soumya Raychaudhuri and Joshua Stuart for thoughtful comments on the manuscript and discussions, and Orly Alter and Mike Liang for helpful suggestions. O.T. is supported by a Howard Hughes Medical Institute predoctoral fellowship and by a Stanford Graduate Fellowship. M.C. is supported by NIH training grant LM-07033. T.H. is partially supported by NSF grant DMS-9803645 and NIH grant R01-CA-72028-01. R.T. is supported by NIH grant 2 R01 CA72028 and NSF grant DMS-9971405. D.B. is partially supported by CA 77097 from the NCI. R.B.A. is supported by NIH-GM61374, NIH-LM06244, NSF DBI-9600637, SUN Microsystems and a grant from the Burroughs-Wellcome Foundation.
PY - 2001
Y1 - 2001
N2 - Motivation: Gene expression microarray experiments can generate data sets with multiple missing expression values. Unfortunately, many algorithms for gene expression analysis require a complete matrix of gene array values as input. For example, methods such as hierarchical clustering and K-means clustering are not robust to missing data, and may lose effectiveness even with a few missing values. Methods for imputing missing data are needed, therefore, to minimize the effect of incomplete data sets on analyses, and to increase the range of data sets to which these algorithms can be applied. In this report, we investigate automated methods for estimating missing data. Results: We present a comparative study of several methods for the estimation of missing values in gene microarray data. We implemented and evaluated three methods: a Singular Value Decomposition (SVD) based method (SVDimpute), weighted K-nearest neighbors (KNNimpute), and row average. We evaluated the methods using a variety of parameter settings and over different real data sets, and assessed the robustness of the imputation methods to the amount of missing data over the range of 1-20% missing values. We show that KNNimpute appears to provide a more robust and sensitive method for missing value estimation than SVDimpute, and both SVDimpute and KNNimpute surpass the commonly used row average method (as well as filling missing values with zeros). We report results of the comparative experiments and provide recommendations and tools for accurate estimation of missing microarray data under a variety of conditions.
UR - http://www.scopus.com/inward/record.url?scp=0034960264&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=0034960264&partnerID=8YFLogxK
U2 - 10.1093/bioinformatics/17.6.520
DO - 10.1093/bioinformatics/17.6.520
M3 - Article
C2 - 11395428
AN - SCOPUS:0034960264
SN - 1367-4803
VL - 17
SP - 520
EP - 525
JO - Bioinformatics
JF - Bioinformatics
IS - 6
ER -