TY - JOUR
T1 - Aneuploidy prediction and tumor classification with heterogeneous hidden conditional random fields
AU - Barutcuoglu, Zafer
AU - Airoldi, Edoardo M.
AU - Dumeaux, Vanessa
AU - Schapire, Robert E.
AU - Troyanskaya, Olga G.
N1 - Funding: National Science Foundation (IIS-0513552); National Institutes of Health (R01 GM071966); NSF CAREER award DBI-0546275; NIGMS Center of Excellence (P50 GM071508).
PY - 2009/5
Y1 - 2009/5
N2 - Motivation: The heterogeneity of cancer cannot always be recognized by tumor morphology, but may be reflected by the underlying genetic aberrations. Array comparative genome hybridization (array-CGH) methods provide high-throughput data on genetic copy numbers, but determining the clinically relevant copy number changes remains a challenge. Conventional classification methods for linking recurrent alterations to clinical outcome ignore sequential correlations in selecting relevant features. Conversely, existing sequence classification methods can only model overall copy number instability, without regard to any particular position in the genome. Results: Here, we present the heterogeneous hidden conditional random field, a new integrated array-CGH analysis method for jointly classifying tumors, inferring copy numbers and identifying clinically relevant positions in recurrent alteration regions. By capturing the sequentiality as well as the locality of changes, our integrated model provides better noise reduction, and achieves more relevant gene retrieval and more accurate classification than existing methods. We provide an efficient L1-regularized discriminative training algorithm, which notably selects a small set of candidate genes most likely to be clinically relevant and driving the recurrent amplicons of importance. Our method thus provides unbiased starting points in deciding which genomic regions and which genes in particular to pursue for further examination. Our experiments on synthetic data and real genomic cancer prediction data show that our method is superior, both in prediction accuracy and relevant feature discovery, to existing methods. We also demonstrate that it can be used to generate novel biological hypotheses for breast cancer.
UR - http://www.scopus.com/inward/record.url?scp=65549083895&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=65549083895&partnerID=8YFLogxK
U2 - 10.1093/bioinformatics/btn585
DO - 10.1093/bioinformatics/btn585
M3 - Article
C2 - 19052061
AN - SCOPUS:65549083895
SN - 1367-4803
VL - 25
SP - 1307
EP - 1313
JO - Bioinformatics
JF - Bioinformatics
IS - 10
ER -