TY - GEN
T1 - Identifying Evolutionary Origins of Repeat Domains in Protein Families
AU - Aluru, Chaitanya
AU - Singh, Mona
N1 - Publisher Copyright:
© 2020 ACM.
PY - 2020/9/21
Y1 - 2020/9/21
N2 - Arrays of repeat domains are critical to the proper function of a significant fraction of protein families. These repeats are easily identified in sequence, and are thought to have arisen primarily through the simultaneous duplication of multiple domains. However, for most repeat domain protein families, very little is typically known about the specific domain duplication events that occurred in their evolutionary histories. Here we extend existing reconciliation formulations that use domain trees and sequence trees to infer domain duplication and loss events to additionally consider simultaneous domain duplications under arbitrary cost models. We develop a novel integer linear programming (ILP) solution to this reconciliation problem, and demonstrate the accuracy and robustness of our approach on simulated datasets. Finally, as proof of principle, we apply our approach to an orthogroup containing the C2H2 zinc finger repeat domain, and identify simultaneous domain duplications that occurred at the onset of the primate lineage. Simulation and ILP code is available at https://github.com/Singh-Lab/treeSim.
AB - Arrays of repeat domains are critical to the proper function of a significant fraction of protein families. These repeats are easily identified in sequence, and are thought to have arisen primarily through the simultaneous duplication of multiple domains. However, for most repeat domain protein families, very little is typically known about the specific domain duplication events that occurred in their evolutionary histories. Here we extend existing reconciliation formulations that use domain trees and sequence trees to infer domain duplication and loss events to additionally consider simultaneous domain duplications under arbitrary cost models. We develop a novel integer linear programming (ILP) solution to this reconciliation problem, and demonstrate the accuracy and robustness of our approach on simulated datasets. Finally, as proof of principle, we apply our approach to an orthogroup containing the C2H2 zinc finger repeat domain, and identify simultaneous domain duplications that occurred at the onset of the primate lineage. Simulation and ILP code is available at https://github.com/Singh-Lab/treeSim.
KW - duplications
KW - phylogenetics
KW - protein domains
KW - reconciliation
UR - http://www.scopus.com/inward/record.url?scp=85096984774&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85096984774&partnerID=8YFLogxK
U2 - 10.1145/3388440.3412416
DO - 10.1145/3388440.3412416
M3 - Conference contribution
AN - SCOPUS:85096984774
T3 - Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, BCB 2020
BT - Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, BCB 2020
PB - Association for Computing Machinery, Inc
T2 - 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, BCB 2020
Y2 - 21 September 2020 through 24 September 2020
ER -