Protein family classification using sparse Markov transducers

Eleazar Eskin, William Stafford Noble, Yoram Singer

Research output: Contribution to journal › Article

14 Scopus citations

Abstract

We present a method for classifying proteins into families based on short subsequences of amino acids using a new probabilistic model called sparse Markov transducers (SMT). We classify a protein by estimating probability distributions over subsequences of amino acids from the protein. Sparse Markov transducers, similar to probabilistic suffix trees, estimate a probability distribution conditioned on an input sequence. SMTs generalize probabilistic suffix trees by allowing for wild-cards in the conditioning sequences. Since substitutions of amino acids are common in protein families, incorporating wild-cards into the model significantly improves classification performance. We present two models for building protein family classifiers using SMTs. As protein databases become larger, data-driven learning algorithms for probabilistic models such as SMTs will require vast amounts of memory. We therefore describe and use efficient data structures to improve the memory usage of SMTs. We evaluate SMTs by building protein family classifiers using the Pfam and SCOP databases and compare our results to previously published results and state-of-the-art protein homology detection methods. SMTs outperform previous probabilistic suffix tree methods and, under certain conditions, perform comparably to state-of-the-art protein homology methods.
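To make the role of wild-cards in the conditioning sequence concrete, here is a minimal toy sketch in Python. It is not the SMT algorithm from the paper: it only counts which residue follows a fixed-length context pattern, where `*` matches any residue, and smooths the counts with a pseudocount. The function names, the `*` symbol, the pseudocount value, and the three toy sequences are all assumptions made for illustration.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
WILDCARD = "*"                        # hypothetical wild-card symbol


def matches(pattern, context):
    """Return True if context fits pattern; WILDCARD matches any residue."""
    return len(pattern) == len(context) and all(
        p == WILDCARD or p == c for p, c in zip(pattern, context)
    )


def conditional_distribution(sequences, pattern, pseudocount=1.0):
    """Estimate P(next residue | preceding residues match pattern).

    Counts are smoothed with a pseudocount (an assumption of this sketch).
    """
    counts = Counter()
    k = len(pattern)
    for seq in sequences:
        for i in range(k, len(seq)):
            if matches(pattern, seq[i - k:i]):
                counts[seq[i]] += 1
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS}


# Toy "family": the middle residue of the context is substituted (C, G, S).
toy_family = ["ACDKL", "AGDKL", "ASDKM"]

exact = conditional_distribution(toy_family, "ACD")   # no wild-card
sparse = conditional_distribution(toy_family, "A*D")  # wild-card in the middle

print(f"P(K | ACD) = {exact['K']:.3f}")
print(f"P(K | A*D) = {sparse['K']:.3f}")
```

In this toy data the exact context `ACD` occurs once, while the wild-card context `A*D` pools all three sequences, so the estimate for `K` rests on more evidence. This is the intuition behind allowing substitutions in the conditioning sequence; the paper's actual model and its memory-efficient data structures are more involved than this sketch.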

Original language: English (US)
Pages (from-to): 187-213
Number of pages: 27
Journal: Journal of Computational Biology
Volume: 10
Issue number: 2
State: Published - Jun 7 2003
Externally published: Yes

All Science Journal Classification (ASJC) codes

  • Modeling and Simulation
  • Molecular Biology
  • Genetics
  • Computational Mathematics
  • Computational Theory and Mathematics

Keywords

  • Machine learning
  • Probabilistic suffix trees
  • Protein family classification
