TY - JOUR
T1 - M are better than one
T2 - An ensemble-based motif finder and its application to regulatory element prediction
AU - Yanover, Chen
AU - Singh, Mona
AU - Zaslavsky, Elena
N1 - Funding: Fred Hutchinson Cancer Research Center, Seattle, WA (in part); NIH Center of Excellence at Princeton University (P50 GM071508, in part); National Science Foundation (DGE-9972930, in part); National Institutes of Health award (HHSN266200500021C, in part); National Science Foundation (IIS-061223 to M.S.); National Institutes of Health (GM076275 to M.S.).
PY - 2009
Y1 - 2009
N2 - Motivation: Identifying regulatory elements in genomic sequences is a key component in understanding the control of gene expression. Computationally, this problem is often addressed by motif discovery, where the goal is to find a set of mutually similar subsequences within a collection of input sequences. Though motif discovery is widely studied and many approaches to it have been suggested, it remains a challenging and as yet unresolved problem. Results: We introduce SAMF (Solution-Aggregating Motif Finder), a novel approach for motif discovery. SAMF is based on a Markov Random Field formulation, and its key idea is to uncover and aggregate multiple statistically significant solutions to the given motif finding problem. In contrast to many earlier methods, SAMF does not require prior estimates on the number of motif instances present in the data, is not limited by motif length, and allows motifs to overlap. Though SAMF is broadly applicable, these features make it particularly well suited for addressing the challenges of prokaryotic regulatory element detection. We test SAMF's ability to find transcription factor binding sites in an Escherichia coli dataset and show that it outperforms previous methods. Additionally, we uncover a number of previously unidentified binding sites in this data, and provide evidence that they correspond to actual regulatory elements.
AB - Motivation: Identifying regulatory elements in genomic sequences is a key component in understanding the control of gene expression. Computationally, this problem is often addressed by motif discovery, where the goal is to find a set of mutually similar subsequences within a collection of input sequences. Though motif discovery is widely studied and many approaches to it have been suggested, it remains a challenging and as yet unresolved problem. Results: We introduce SAMF (Solution-Aggregating Motif Finder), a novel approach for motif discovery. SAMF is based on a Markov Random Field formulation, and its key idea is to uncover and aggregate multiple statistically significant solutions to the given motif finding problem. In contrast to many earlier methods, SAMF does not require prior estimates on the number of motif instances present in the data, is not limited by motif length, and allows motifs to overlap. Though SAMF is broadly applicable, these features make it particularly well suited for addressing the challenges of prokaryotic regulatory element detection. We test SAMF's ability to find transcription factor binding sites in an Escherichia coli dataset and show that it outperforms previous methods. Additionally, we uncover a number of previously unidentified binding sites in this data, and provide evidence that they correspond to actual regulatory elements.
UR - http://www.scopus.com/inward/record.url?scp=63549134757&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=63549134757&partnerID=8YFLogxK
U2 - 10.1093/bioinformatics/btp090
DO - 10.1093/bioinformatics/btp090
M3 - Article
C2 - 19223448
AN - SCOPUS:63549134757
SN - 1367-4803
VL - 25
SP - 868
EP - 874
JO - Bioinformatics
JF - Bioinformatics
IS - 7
ER -