Baselines and bigrams: Simple, good sentiment and topic classification

Sida Wang, Christopher D. Manning

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

670 Scopus citations

Abstract

Variants of Naive Bayes (NB) and Support Vector Machines (SVM) are often used as baseline methods for text classification, but their performance varies greatly depending on the model variant, features used, and task/dataset. We show that: (i) the inclusion of word bigram features gives consistent gains on sentiment analysis tasks; (ii) for short snippet sentiment tasks, NB actually does better than SVMs (while for longer documents the opposite result holds); (iii) a simple but novel SVM variant using NB log-count ratios as feature values consistently performs well across tasks and datasets. Based on these observations, we identify simple NB and SVM variants which outperform most published results on sentiment analysis datasets, sometimes providing a new state-of-the-art performance level.
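The SVM variant mentioned in point (iii) rescales each feature by its Naive Bayes log-count ratio before training an ordinary linear SVM. A minimal NumPy sketch of that feature transform is below; the function name, smoothing default, and toy data are illustrative assumptions, not taken from the paper itself.

```python
import numpy as np

def nb_log_count_ratio(X, y, alpha=1.0):
    # X: binarized document-term matrix (n_docs, n_features)
    # y: binary labels (1 = positive class, 0 = negative class)
    # alpha: add-alpha smoothing applied to both count vectors
    p = alpha + X[y == 1].sum(axis=0)  # smoothed positive-class counts
    q = alpha + X[y == 0].sum(axis=0)  # smoothed negative-class counts
    # Log ratio of the L1-normalized smoothed count vectors.
    return np.log((p / p.sum()) / (q / q.sum()))

# Toy example: 3 unigram features over 4 tiny "documents".
X = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 1, 0],
              [0, 1, 1]])
y = np.array([1, 1, 0, 0])
r = nb_log_count_ratio(X, y)
# Features seen mostly in positive documents get r > 0,
# features seen mostly in negative documents get r < 0.
```

In the full method, the element-wise product `r * x` replaces the raw binary features, and a standard linear SVM (e.g. scikit-learn's `LinearSVC`, not used in this sketch) is then trained on the rescaled vectors.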

Original language: English (US)
Title of host publication: 50th Annual Meeting of the Association for Computational Linguistics, ACL 2012 - Proceedings of the Conference
Pages: 90-94
Number of pages: 5
State: Published - Dec 1 2012
Event: 50th Annual Meeting of the Association for Computational Linguistics, ACL 2012 - Jeju Island, Korea, Republic of
Duration: Jul 8 2012 - Jul 14 2012

Publication series

Name: 50th Annual Meeting of the Association for Computational Linguistics, ACL 2012 - Proceedings of the Conference
Volume: 2

Other

Other: 50th Annual Meeting of the Association for Computational Linguistics, ACL 2012
Country: Korea, Republic of
City: Jeju Island
Period: 7/8/12 - 7/14/12

All Science Journal Classification (ASJC) codes

  • Computational Theory and Mathematics
  • Software

