TY - JOUR
T1 - Using context to improve protein domain identification
AU - Ochoa, Alejandro
AU - Llinás, Manuel
AU - Singh, Mona
N1 - Funding Information:
We thank all members of the Singh and Llinás groups for helpful discussions about this work. We additionally thank Tao Yue, Erandi De Silva, Hani Goodarzi, and our reviewers for their feedback on this manuscript. This work was supported by the National Science Foundation [Graduate Research Fellowship DGE 0646086 to AO]; and the National Institutes of Health [1 R21-AI085415 to MS and ML, Center of Excellence P50 GM071508 to the Lewis-Sigler Institute].
PY - 2011
Y1 - 2011
N2 - Background: Identifying domains in protein sequences is an important step in protein structural and functional annotation. Existing domain recognition methods typically evaluate each domain prediction independently of the rest. However, the majority of proteins are multidomain, and pairwise domain co-occurrences are highly specific and non-transitive. Results: Here, we demonstrate how to exploit domain co-occurrence to boost weak domain predictions that appear in previously observed combinations, while penalizing higher confidence domains if such combinations have never been observed. Our framework, Domain Prediction Using Context (dPUC), incorporates pairwise "context" scores between domains, along with traditional domain scores and thresholds, and improves domain prediction across a variety of organisms from bacteria to protozoa and metazoa. Among the genomes we tested, dPUC is most successful at improving predictions for the poorly-annotated malaria parasite Plasmodium falciparum, for which over 38% of the genome is currently unannotated. Our approach enables high-confidence annotations in this organism and the identification of orthologs to many core machinery proteins conserved in all eukaryotes, including those involved in ribosomal assembly and other RNA processing events, which surprisingly had not been previously known. Conclusions: Overall, our results demonstrate that this new context-based approach will provide significant improvements in domain and function prediction, especially for poorly understood genomes for which the need for additional annotations is greatest. Source code for the algorithm is available under a GPL open source license at http://compbio.cs.princeton.edu/dpuc/. Pre-computed results for our test organisms and a web server are also available at that location.
AB - Background: Identifying domains in protein sequences is an important step in protein structural and functional annotation. Existing domain recognition methods typically evaluate each domain prediction independently of the rest. However, the majority of proteins are multidomain, and pairwise domain co-occurrences are highly specific and non-transitive. Results: Here, we demonstrate how to exploit domain co-occurrence to boost weak domain predictions that appear in previously observed combinations, while penalizing higher confidence domains if such combinations have never been observed. Our framework, Domain Prediction Using Context (dPUC), incorporates pairwise "context" scores between domains, along with traditional domain scores and thresholds, and improves domain prediction across a variety of organisms from bacteria to protozoa and metazoa. Among the genomes we tested, dPUC is most successful at improving predictions for the poorly-annotated malaria parasite Plasmodium falciparum, for which over 38% of the genome is currently unannotated. Our approach enables high-confidence annotations in this organism and the identification of orthologs to many core machinery proteins conserved in all eukaryotes, including those involved in ribosomal assembly and other RNA processing events, which surprisingly had not been previously known. Conclusions: Overall, our results demonstrate that this new context-based approach will provide significant improvements in domain and function prediction, especially for poorly understood genomes for which the need for additional annotations is greatest. Source code for the algorithm is available under a GPL open source license at http://compbio.cs.princeton.edu/dpuc/. Pre-computed results for our test organisms and a web server are also available at that location.
UR - http://www.scopus.com/inward/record.url?scp=79953168457&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=79953168457&partnerID=8YFLogxK
U2 - 10.1186/1471-2105-12-90
DO - 10.1186/1471-2105-12-90
M3 - Article
C2 - 21453511
AN - SCOPUS:79953168457
SN - 1471-2105
VL - 12
JO - BMC Bioinformatics
JF - BMC Bioinformatics
M1 - 90
ER -