TY - GEN
T1 - Knowing the unseen
T2 - Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and 4th International Joint Conference on Natural Language Processing of the AFNLP, ACL-IJCNLP 2009
AU - Bhat, Suma
AU - Sproat, Richard
PY - 2009
Y1 - 2009
N2 - Empirical studies on corpora involve making measurements of several quantities for the purpose of comparing corpora, creating language models or to make generalizations about specific linguistic phenomena in a language. Quantities such as average word length are stable across sample sizes and hence can be reliably estimated from large enough samples. However, quantities such as vocabulary size change with sample size. Thus measurements based on a given sample will need to be extrapolated to obtain their estimates over larger unseen samples. In this work, we propose a novel nonparametric estimator of vocabulary size. Our main result is to show the statistical consistency of the estimator - the first of its kind in the literature. Finally, we compare our proposal with the state of the art estimators (both parametric and nonparametric) on large standard corpora; apart from showing the favorable performance of our estimator, we also see that the classical Good-Turing estimator consistently underestimates the vocabulary size.
AB - Empirical studies on corpora involve making measurements of several quantities for the purpose of comparing corpora, creating language models or to make generalizations about specific linguistic phenomena in a language. Quantities such as average word length are stable across sample sizes and hence can be reliably estimated from large enough samples. However, quantities such as vocabulary size change with sample size. Thus measurements based on a given sample will need to be extrapolated to obtain their estimates over larger unseen samples. In this work, we propose a novel nonparametric estimator of vocabulary size. Our main result is to show the statistical consistency of the estimator - the first of its kind in the literature. Finally, we compare our proposal with the state of the art estimators (both parametric and nonparametric) on large standard corpora; apart from showing the favorable performance of our estimator, we also see that the classical Good-Turing estimator consistently underestimates the vocabulary size.
UR - http://www.scopus.com/inward/record.url?scp=84859899929&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84859899929&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:84859899929
SN - 9781617382581
T3 - ACL-IJCNLP 2009 - Joint Conf. of the 47th Annual Meeting of the Association for Computational Linguistics and 4th Int. Joint Conf. on Natural Language Processing of the AFNLP, Proceedings of the Conf.
SP - 108
EP - 117
BT - ACL-IJCNLP 2009 - Joint Conf. of the 47th Annual Meeting of the Association for Computational Linguistics and 4th Int. Joint Conf. on Natural Language Processing of the AFNLP, Proceedings of the Conf.
Y2 - 2 August 2009 through 7 August 2009
ER -