Whole-genome sequence assemblies provide a rich resource for the in silico identification and characterization of regulatory DNAs, particularly enhancers, in different animal groups, including Caenorhabditis elegans, Drosophila melanogaster and Mus musculus. There are two major methods for the recognition of regulatory DNAs within complex genome assemblies: (i) clustering of combinations of sequence motifs that correspond to known binding sites for defined transcription factors and (ii) phylogenetic analyses that identify sequence conservation in noncoding regions among two or more related species. We describe here the first method, the identification of clusters of binding sites for multiple transcription factors; the second method is described in the accompanying protocol. Clustering methods require extensive prior knowledge of the binding preferences of known transcription factors. In ideal cases, the process under study relies on the activities of two or more well-defined transcription factors. Even in such optimal cases, however, there is a high incidence of false positives. A 'hit rate' of 30-50% is the limit of precision that can be obtained with these methods. Nonetheless, they are considerably more efficient than the identification of enhancers via 'blind' functional assays, whereby random genomic DNA fragments near or within a given gene are analyzed for regulatory activities. Clustering methods have been used to identify many new enhancers engaged in common developmental processes, permitting the construction of genomic regulatory networks. Clustering methods were first developed nearly a decade ago; however, there is still no 'best' technique or 'universal' software (comparable to BLAST, the Basic Local Alignment Search Tool). Instead, new techniques are constantly being developed, and some even merge clustering methods with phylogenetic analysis. Despite the flourishing diversity of methods, most use common strategies that we describe here in a sequence of steps, providing a format for the identification of functionally related enhancers and coregulated genes in animal genomes.
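As an illustration of the clustering idea described above, the following is a minimal sketch and not the protocol's own software. It assumes that binding preferences are given as IUPAC consensus strings (real tools often use position weight matrices instead), and the factor names `TF_A` and `TF_B`, their consensus strings, the 400-bp window and the site-count thresholds are all hypothetical placeholders chosen for the example. The sketch scans both strands for matches and reports windows that contain enough sites from enough distinct factors.

```python
import re
from collections import Counter

# IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(motif):
    """Reverse complement of an IUPAC consensus string."""
    return motif.translate(COMPLEMENT)[::-1]


def find_sites(sequence, motifs):
    """Return sorted (position, factor) pairs for matches on either strand.

    Positions are 0-based starts on the forward strand. A lookahead is
    used so that overlapping matches are also reported.
    """
    sequence = sequence.upper()
    sites = set()  # a set, so palindromic motifs are not counted twice
    for factor, consensus in motifs.items():
        for oriented in {consensus, reverse_complement(consensus)}:
            pattern = "(?=(" + "".join(IUPAC[b] for b in oriented) + "))"
            for match in re.finditer(pattern, sequence):
                sites.add((match.start(), factor))
    return sorted(sites)


def find_clusters(sites, window=400, min_sites=3, min_factors=2):
    """Anchor a window at each site and keep windows that pass both thresholds.

    Returns (first_site_start, last_site_start, per-factor counts) tuples.
    Accepted windows do not overlap: the scan resumes after the last member.
    """
    clusters = []
    i = 0
    while i < len(sites):
        start = sites[i][0]
        j = i
        while j < len(sites) and sites[j][0] - start < window:
            j += 1
        members = sites[i:j]
        counts = Counter(factor for _, factor in members)
        if len(members) >= min_sites and len(counts) >= min_factors:
            clusters.append((start, members[-1][0], dict(counts)))
            i = j
        else:
            i += 1
    return clusters


if __name__ == "__main__":
    # Hypothetical consensus motifs and a synthetic test sequence.
    motifs = {"TF_A": "GGGWTTCC", "TF_B": "CAYATG"}
    seq = ("T" * 50 + "GGGATTCC" + "T" * 20 + "CACATG" + "T" * 20
           + "GGAATCCC" + "T" * 500 + "CATATG" + "T" * 50)
    for cluster in find_clusters(find_sites(seq, motifs)):
        print(cluster)
```

Candidate windows returned by such a scan are predictions only; given the high false-positive rate noted above, each would still need to be tested in a functional assay.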
All Science Journal Classification (ASJC) codes
- Molecular Biology
- Cell Biology