TY - JOUR

T1 - Probability estimation in the rare-events regime

AU - Wagner, Aaron B.

AU - Viswanath, Pramod

AU - Kulkarni, Sanjeev R.

N1 - Funding Information:
Manuscript received November 14, 2009; revised October 22, 2010; accepted October 28, 2010. Date of current version May 25, 2011. This work was supported in part by the National Science Foundation under Grant CCF-0830496 and in part by the Science & Technology Center under Grant CCF-0939370. A. B. Wagner is with the School of Electrical and Computer Engineering, Cornell University, Ithaca, NY 14853 USA (e-mail: [email protected]). P. Viswanath is with the Department of Electrical and Computer Engineering and the Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, Urbana, IL 61801 USA (e-mail: [email protected]). S. R. Kulkarni is with the Department of Electrical Engineering, Princeton University, Princeton, NJ 08540 USA (e-mail: [email protected]). Communicated by I. Kontoyiannis, Associate Editor for Shannon Theory. Digital Object Identifier 10.1109/TIT.2011.2137210

PY - 2011/6

Y1 - 2011/6

N2 - We address the problem of estimating the probability of an observed string that is drawn i.i.d. from an unknown distribution. Motivated by models of natural language, we consider the regime in which the length of the observed string and the size of the underlying alphabet are comparably large. In this regime, the maximum likelihood distribution tends to overestimate the probability of the observed letters, so the Good-Turing probability estimator is typically used instead. We show that when used to estimate the sequence probability, the Good-Turing estimator is not consistent in this regime. We then introduce a novel sequence probability estimator that is consistent. This estimator also yields consistent estimators for other quantities of interest and a consistent universal classifier.

AB - We address the problem of estimating the probability of an observed string that is drawn i.i.d. from an unknown distribution. Motivated by models of natural language, we consider the regime in which the length of the observed string and the size of the underlying alphabet are comparably large. In this regime, the maximum likelihood distribution tends to overestimate the probability of the observed letters, so the Good-Turing probability estimator is typically used instead. We show that when used to estimate the sequence probability, the Good-Turing estimator is not consistent in this regime. We then introduce a novel sequence probability estimator that is consistent. This estimator also yields consistent estimators for other quantities of interest and a consistent universal classifier.

KW - Classification

KW - entropy estimation

KW - large alphabets

KW - large number of rare events (LNRE)

KW - natural language

KW - probability estimation

UR - http://www.scopus.com/inward/record.url?scp=79957659601&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=79957659601&partnerID=8YFLogxK

U2 - 10.1109/TIT.2011.2137210

DO - 10.1109/TIT.2011.2137210

M3 - Article

AN - SCOPUS:79957659601

SN - 0018-9448

VL - 57

SP - 3207

EP - 3229

JO - IEEE Transactions on Information Theory

JF - IEEE Transactions on Information Theory

IS - 6

M1 - 5773059

ER -