TY - JOUR

T1 - A cost-aggregating integer linear program for motif finding

AU - Kingsford, Carl

AU - Zaslavsky, Elena

AU - Singh, Mona

N1 - Funding Information:
An earlier version of this paper appeared in [10] and was presented at CPM 2006, Barcelona, Spain, July 5–7. We thank Dan Gusfield for helpful comments on that manuscript. C.K. thanks the NSF for grants 0849899 and 0812111 . M.S. thanks the NIH for grant GM076275 . This research has also been partially supported by the NIH Center of Excellence grant P50 GM071508 , NSF grant PECASE MCB-0093399 and DARPA award MDA972-00-1-0031 .

PY - 2011/12

Y1 - 2011/12

N2 - In the motif finding problem one seeks a set of mutually similar substrings within a collection of biological sequences. This is an important and widely-studied problem, as such shared motifs in DNA often correspond to regulatory elements. We study a combinatorial framework where the goal is to find substrings of a given length such that the sum of their pairwise distances is minimized. We describe a novel integer linear program for the problem, which uses the fact that distances between substrings come from a limited set of possibilities allowing for aggregate consideration of sequence position pairs with the same distances. We show how to tighten its linear programming relaxation by adding an exponential set of constraints and give an efficient separation algorithm that can find violated constraints, thereby showing that the tightened linear program can still be solved in polynomial time. We apply our approach to find optimal solutions for the motif finding problem and show that it is effective in practice in uncovering known transcription factor binding sites.

AB - In the motif finding problem one seeks a set of mutually similar substrings within a collection of biological sequences. This is an important and widely-studied problem, as such shared motifs in DNA often correspond to regulatory elements. We study a combinatorial framework where the goal is to find substrings of a given length such that the sum of their pairwise distances is minimized. We describe a novel integer linear program for the problem, which uses the fact that distances between substrings come from a limited set of possibilities allowing for aggregate consideration of sequence position pairs with the same distances. We show how to tighten its linear programming relaxation by adding an exponential set of constraints and give an efficient separation algorithm that can find violated constraints, thereby showing that the tightened linear program can still be solved in polynomial time. We apply our approach to find optimal solutions for the motif finding problem and show that it is effective in practice in uncovering known transcription factor binding sites.

KW - Computational biology

KW - Integer linear programming

KW - Sequence motif finding

UR - http://www.scopus.com/inward/record.url?scp=80755147152&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=80755147152&partnerID=8YFLogxK

U2 - 10.1016/j.jda.2011.04.001

DO - 10.1016/j.jda.2011.04.001

M3 - Article

C2 - 22081765

AN - SCOPUS:80755147152

SN - 1570-8667

VL - 9

SP - 326

EP - 334

JO - Journal of Discrete Algorithms

JF - Journal of Discrete Algorithms

IS - 4

ER -