Using substitution matrices to estimate probability distributions for biological sequences

Eleazar Eskin, William Stafford Noble, Yoram Singer

Research output: Contribution to journal (Article)

1 Scopus citation

Abstract

Accurately estimating probabilities from observations is important for probabilistic-based approaches to problems in computational biology. In this paper we present a biologically-motivated method for estimating probability distributions over discrete alphabets from observations using a mixture model of common ancestors. The method is an extension of substitution matrix-based probability estimation methods. In contrast to previous such methods, our method has a simple Bayesian interpretation and has the advantage over Dirichlet mixtures that it is both effective and simple to compute for large alphabets. The method is applied to estimate amino acid probabilities based on observed counts in an alignment and is shown to perform comparably to previous methods. The method is also applied to estimate probability distributions over protein families and improves protein classification accuracy.
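The abstract does not give the estimator's formulas, so the following is only a minimal sketch of one way a "mixture model of common ancestors" with a Bayesian reading could look, not the authors' actual method. It assumes that all observations descend from a single unknown ancestor symbol, that a substitution matrix supplies P(symbol | ancestor), and that the estimate is the posterior-weighted mixture of those rows. The function name, the three-letter alphabet, the uniform prior, and every matrix value are hypothetical and chosen purely for illustration.

```python
import math


def ancestor_mixture_estimate(counts, prior, subst):
    """Posterior-predictive distribution under a single-ancestor mixture.

    counts: dict symbol -> observed count
    prior:  dict ancestor -> P(ancestor)
    subst:  dict ancestor -> dict symbol -> P(symbol | ancestor)
    (All three structures are illustrative assumptions.)
    """
    # Log-posterior (up to a constant) of each candidate ancestor.
    log_post = {}
    for anc, p_anc in prior.items():
        log_post[anc] = math.log(p_anc) + sum(
            n * math.log(subst[anc][sym])
            for sym, n in counts.items()
            if n > 0
        )

    # Normalise with the max subtracted for numerical stability.
    top = max(log_post.values())
    weights = {anc: math.exp(lp - top) for anc, lp in log_post.items()}
    total = sum(weights.values())
    post = {anc: w / total for anc, w in weights.items()}

    # Mix the per-ancestor rows of the substitution matrix.
    symbols = list(next(iter(subst.values())).keys())
    return {
        sym: sum(post[anc] * subst[anc][sym] for anc in post)
        for sym in symbols
    }


# Hypothetical toy alphabet and substitution probabilities (rows sum to 1).
PRIOR = {"A": 1 / 3, "B": 1 / 3, "C": 1 / 3}
SUBST = {
    "A": {"A": 0.80, "B": 0.15, "C": 0.05},
    "B": {"A": 0.15, "B": 0.70, "C": 0.15},
    "C": {"A": 0.05, "B": 0.15, "C": 0.80},
}

estimate = ancestor_mixture_estimate({"A": 3, "B": 1, "C": 0}, PRIOR, SUBST)
print(estimate)
```

In this toy setting the unobserved symbol C still receives nonzero probability, because every candidate ancestor can substitute to it. The cost is one pass over the alphabet per candidate ancestor, which is in the spirit of the abstract's remark about remaining simple to compute for large alphabets, though that connection is an interpretation and not a claim taken from the paper.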

Original language: English (US)
Pages (from-to): 775-791
Number of pages: 17
Journal: Journal of Computational Biology
Volume: 9
Issue number: 6
State: Published - Dec 1 2002
Externally published: Yes

All Science Journal Classification (ASJC) codes

  • Modeling and Simulation
  • Molecular Biology
  • Genetics
  • Computational Mathematics
  • Computational Theory and Mathematics

Keywords

  • Amino acid probabilities
  • Common ancestors
  • Dirichlet mixtures
  • Multinomial estimation
  • Protein families
  • Protein homology
