Protein family classification using sparse Markov transducers

Eleazar Eskin, William Stafford Noble, Yoram Singer

Research output: Contribution to journal, Article, peer-review

14 Scopus citations


We present a method for classifying proteins into families based on short subsequences of amino acids using a new probabilistic model called sparse Markov transducers (SMTs). We classify a protein by estimating probability distributions over subsequences of amino acids from the protein. Sparse Markov transducers, like probabilistic suffix trees, estimate a probability distribution conditioned on an input sequence. SMTs generalize probabilistic suffix trees by allowing wild-cards in the conditioning sequences. Since substitutions of amino acids are common in protein families, incorporating wild-cards into the model significantly improves classification performance. We present two models for building protein family classifiers using SMTs. As protein databases become larger, data-driven learning algorithms for probabilistic models such as SMTs will require vast amounts of memory. We therefore describe and use efficient data structures to improve the memory usage of SMTs. We evaluate SMTs by building protein family classifiers using the Pfam and SCOP databases and compare our results to previously published results and state-of-the-art protein homology detection methods. SMTs outperform previous probabilistic suffix tree methods and under certain conditions perform comparably to state-of-the-art protein homology methods.
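The role of wild-cards in the conditioning sequence can be illustrated with a small sketch. This is not the SMT algorithm from the paper: it is a toy model, assuming a single fixed-length context with fixed wild-card positions and simple pseudocount smoothing, whereas the abstract describes a richer model. The names `ToySparseModel`, `mask`, `wild_positions`, and `pseudocount` are hypothetical and chosen only for illustration.

```python
import math
from collections import defaultdict

WILDCARD = "*"
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def mask(context, wild_positions):
    """Replace the chosen positions of a context string with the wild-card."""
    return "".join(
        WILDCARD if i in wild_positions else c for i, c in enumerate(context)
    )


class ToySparseModel:
    """Toy next-residue model conditioned on a context with wild-cards.

    Illustrative assumption: one fixed context length and one fixed set of
    wild-card positions, not the model described in the paper.
    """

    def __init__(self, context_len=3, wild_positions=(), pseudocount=1.0):
        self.k = context_len
        self.wild = frozenset(wild_positions)
        self.pseudocount = pseudocount
        # counts[masked_context][next_symbol] -> number of occurrences
        self.counts = defaultdict(lambda: defaultdict(float))

    def train(self, sequences):
        for seq in sequences:
            for i in range(self.k, len(seq)):
                key = mask(seq[i - self.k:i], self.wild)
                self.counts[key][seq[i]] += 1.0

    def prob(self, context, symbol):
        """Smoothed estimate of P(symbol | context) over ALPHABET."""
        row = self.counts.get(mask(context, self.wild), {})
        total = sum(row.values()) + self.pseudocount * len(ALPHABET)
        return (row.get(symbol, 0.0) + self.pseudocount) / total

    def log_likelihood(self, seq):
        """Sum of log-probabilities of each residue given its context."""
        return sum(
            math.log(self.prob(seq[i - self.k:i], seq[i]))
            for i in range(self.k, len(seq))
        )


# Two training sequences that differ by one substitution (C vs. G).
train = ["ACDEACDE", "AGDEAGDE"]

sparse = ToySparseModel(context_len=3, wild_positions=(1,))  # context "A*D"
dense = ToySparseModel(context_len=3, wild_positions=())     # exact context
sparse.train(train)
dense.train(train)

# "AKD" never occurs in training; only the wild-card model can match it.
p_sparse = sparse.prob("AKD", "E")
p_dense = dense.prob("AKD", "E")
```

In this sketch the context `AKD` is unseen, so the exact-context model falls back to the uniform smoothed estimate, while the wild-card model pools the counts from `ACD` and `AGD` and assigns `E` a higher probability. A classifier along these lines could train one such model per family and assign a protein to the family whose model gives the highest `log_likelihood`, although the two classification models presented in the paper may differ from this.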

Original language: English (US)
Pages (from-to): 187-213
Number of pages: 27
Journal: Journal of Computational Biology
Issue number: 2
State: Published - 2003

All Science Journal Classification (ASJC) codes

  • Computational Mathematics
  • Genetics
  • Molecular Biology
  • Computational Theory and Mathematics
  • Modeling and Simulation


Keywords

  • Machine learning
  • Probabilistic suffix trees
  • Protein family classification

