Contextual dependencies in unsupervised word segmentation

Sharon Goldwater, Thomas L. Griffiths, Mark Johnson

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

151 Scopus citations

Abstract

Developing better methods for segmenting continuous text into words is important for improving the processing of Asian languages, and may shed light on how humans learn to segment speech. We propose two new Bayesian word segmentation methods that assume unigram and bigram models of word dependencies respectively. The bigram model greatly outperforms the unigram model (and previous probabilistic models), demonstrating the importance of such dependencies for word segmentation. We also show that previous probabilistic models rely crucially on suboptimal search procedures.
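To make the intuition behind the unigram approach concrete, the following is a minimal sketch (not the paper's exact model) of how a unigram lexicon model prefers segmentations that reuse words. Each word's predictive probability mixes its count in the lexicon so far with a character-level base distribution, in the spirit of a Dirichlet-process unigram model; the alphabet size and concentration parameter below are assumed values for illustration only.

```python
import math

ALPHABET_SIZE = 26  # assumed alphabet size for the toy base distribution
ALPHA = 1.0         # assumed concentration parameter

def base_prob(word):
    # Base distribution over novel words: uniform over characters,
    # so longer words are exponentially less likely a priori.
    return (1.0 / ALPHABET_SIZE) ** len(word)

def log_prob(segmentation):
    """Log probability of a word sequence under a Chinese-restaurant-
    process style predictive distribution, updating counts as we go."""
    counts = {}
    total = 0
    lp = 0.0
    for w in segmentation:
        p = (counts.get(w, 0) + ALPHA * base_prob(w)) / (total + ALPHA)
        lp += math.log(p)
        counts[w] = counts.get(w, 0) + 1
        total += 1
    return lp

# A segmentation that reuses words scores higher than one that
# fragments the same character string into one-off chunks.
reused = log_prob(["the", "dog", "the", "dog"])
fragmented = log_prob(["th", "edo", "gth", "edog"])
print(reused > fragmented)  # reuse wins under the unigram lexicon model
```

A bigram model would additionally condition each word's probability on the previous word, which is what lets it capture the contextual dependencies the paper shows are crucial.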

Original language: English (US)
Title of host publication: COLING/ACL 2006 - 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference
Publisher: Association for Computational Linguistics (ACL)
Pages: 673-680
Number of pages: 8
ISBN (Print): 1932432655, 9781932432657
DOIs
State: Published - 2006
Externally published: Yes
Event: 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, COLING/ACL 2006 - Sydney, NSW, Australia
Duration: Jul 17 2006 – Jul 21 2006

Publication series

Name: COLING/ACL 2006 - 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference
Volume: 1

Other

Other: 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, COLING/ACL 2006
Country/Territory: Australia
City: Sydney, NSW
Period: 7/17/06 – 7/21/06

All Science Journal Classification (ASJC) codes

  • Language and Linguistics
  • Linguistics and Language
