TY - JOUR
T1 - Mining Massive Amounts of Genomic Data
T2 - A Semiparametric Topic Modeling Approach
AU - Fang, Ethan X.
AU - Li, Min Dian
AU - Jordan, Michael I.
AU - Liu, Han
N1 - Author affiliations:
(a) Department of Statistics, Department of Industrial and Manufacturing Engineering, Pennsylvania State University, University Park, PA; (b) Department of Genetics and Complex Diseases and Sabri Ülker Center, Harvard T.H. Chan School of Public Health, Boston, MA; (c) Department of EECS and Statistics, University of California, Berkeley, CA; (d) Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ
Funding Information:
This work was supported in part by the Office of Naval Research MURI program. H. Liu is partially supported by grants NSF DMS1454377-CAREER, NSF IIS1546482-BIGDATA, NIH R01MH102339, NSF IIS1408910, NSF IIS1332109, and NIH R01GM083084.
Publisher Copyright:
© 2017 American Statistical Association.
PY - 2017/7/3
Y1 - 2017/7/3
AB - Characterizing the functional relevance of transcription factors (TFs) in different biological contexts is pivotal in systems biology. Given the massive amount of genomic data, computational identification of TFs is emerging as a useful approach to bridge functional genomics with disease risk loci. In this article, we use large-scale gene expression and chromatin immunoprecipitation (ChIP) data corpora to conduct high-throughput TF-biological context association analysis. This work makes two contributions: (i) From a methodological perspective, we propose a unified topic modeling framework for exploring and analyzing large and complex genomic datasets. Under this framework, we develop new statistical optimization algorithms and semiparametric theoretical analysis, which are also applicable to a variety of large-scale data analyses. (ii) From an experimental perspective, our method generates an informative list of tumor-related TFs and the tumor types they may affect. Our data-driven analysis of 38 TFs in 68 tumor biological contexts identifies functional signatures of epigenetic regulators, such as SUZ12 and SET-DB1, and of nuclear receptors in many tumor types. In particular, the TF signature of SUZ12 is present in a broad range of tumor types, many of which have not been reported before. In summary, our work establishes a robust method to identify associations between TFs and biological contexts. Given the limited number of genome-wide binding profiles of TFs and the massive number of expression profiles, our work provides a useful tool to deconvolute the gene regulatory network for tumors and other biological contexts. Supplementary materials for this article are available online.
KW - Association study
KW - Genomic data
KW - Semiparametric modeling
KW - Topic modeling
UR - http://www.scopus.com/inward/record.url?scp=85032479953&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85032479953&partnerID=8YFLogxK
U2 - 10.1080/01621459.2016.1256812
DO - 10.1080/01621459.2016.1256812
M3 - Article
AN - SCOPUS:85032479953
SN - 0162-1459
VL - 112
SP - 921
EP - 932
JO - Journal of the American Statistical Association
JF - Journal of the American Statistical Association
IS - 519
ER -