Motivation: Segmental duplications are common in mammalian genomes, but their evolutionary origins remain mysterious. A major difficulty in analyzing segmental duplications is that many duplications are complex mosaics of fragments of numerous other segmental duplications.

Results: We introduce a novel measure called duplication distance that describes the minimum number of duplications necessary to create a target string by repeated insertions of fragments of a source string. We derive an efficient algorithm to compute duplication distance, and we use the algorithm to analyze segmental duplications in the human genome. Our analysis reveals possible ancestral relationships between segmental duplications including numerous examples of duplications that contain multiple, nested insertions of fragments from one or more other duplications. Using duplication distance, we also identify a small number of segmental duplications that appear to have seeded many other duplications in the genome, lending support to a two-step model of segmental duplication in the genome.
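
To make the definition of duplication distance concrete, the following is a minimal brute-force sketch. It is not the efficient algorithm referred to in the abstract. It assumes a simplified model that the abstract does not spell out: the target starts empty, and each duplication copies one contiguous fragment of the source and inserts it at any position in the string built so far. Under that assumption the most recently inserted fragment is still contiguous in the final string, so the sketch works backwards by deleting such fragments. The function name `duplication_distance` is hypothetical, and the search is exponential, so it is only suitable for very short strings.

```python
from functools import lru_cache
from typing import Optional


def duplication_distance(source: str, target: str) -> Optional[int]:
    """Brute-force duplication distance under the assumed insertion model.

    Returns the minimum number of fragment insertions needed to build
    `target` from the empty string, or None if `target` cannot be built.
    """
    # Every non-empty contiguous fragment of the source string.
    fragments = {
        source[i:j]
        for i in range(len(source))
        for j in range(i + 1, len(source) + 1)
    }

    @lru_cache(maxsize=None)
    def solve(current: str) -> Optional[int]:
        if not current:
            return 0  # the empty string needs no duplications
        best: Optional[int] = None
        n = len(current)
        # Undo the last insertion: it must be a contiguous piece of
        # `current` that is also a fragment of the source.
        for i in range(n):
            for j in range(i + 1, n + 1):
                if current[i:j] in fragments:
                    rest = solve(current[:i] + current[j:])
                    if rest is not None and (best is None or rest + 1 < best):
                        best = rest + 1
        return best

    return solve(target)
```

For example, with source `"abcde"`, the target `"abcbcd"` can be built by inserting `"abcd"` and then inserting `"bc"` inside it, so `duplication_distance("abcde", "abcbcd")` returns 2. A target containing a character absent from the source returns `None`.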
All Science Journal Classification (ASJC) codes
- Statistics and Probability
- Molecular Biology
- Computer Science Applications
- Computational Theory and Mathematics
- Computational Mathematics