TY - JOUR
T1 - BATS
T2 - A Spectral Biclustering Approach to Single Document Topic Modeling and Segmentation
AU - Wu, Qiong
AU - Hare, Adam
AU - Wang, Sirui
AU - Tu, Yuwei
AU - Liu, Zhenming
AU - Brinton, Christopher G.
AU - Li, Yanhua
N1 - Funding Information:
Christopher G. Brinton was supported in part by the Charles Koch Foundation. Qiong Wu and Zhenming Liu are supported by NSF Grants No. NSF-2008557, No. NSF-1835821, and No. NSF-1755769. Yanhua Li was supported in part by NSF Grants No. IIS-1942680 (CAREER), No. CNS-1952085, No. CMMI-1831140, and No. DGE-2021871. Authors’ addresses: Q. Wu, S. Wang, and Z. Liu, Department of Computer Science, College of William and Mary, 200 Stadium Dr, Williamsburg, Virginia, 23185; emails: {qwu05, swang23}@email.wm.edu, zliu20@wm.edu; A. Hare and Y. Tu, Zoomi Inc., 325 Sentry Parkway, Suite 200, Blue Bell, Pennsylvania, 19422; emails: {adam.hare, yuwei.tu}@zoomi.ai; C. G. Brinton, School of Electrical and Computer Engineering, Purdue University, 610 Purdue Mall, West Lafayette, Indiana, 47907; email: cgb@purdue.edu; Y. Li, Department of Computer Science, Worcester Polytechnic Institute, 100 Institute Rd, Worcester, Massachusetts, 01609; email: yli15@wpi.edu. © 2021 Association for Computing Machinery. 2157-6904/2021/10-ART54 $15.00 https://doi.org/10.1145/3468268
Publisher Copyright:
© 2021 Association for Computing Machinery.
PY - 2021/10
Y1 - 2021/10
AB - Existing topic modeling and text segmentation methodologies generally require large datasets for training, limiting their capabilities when only small collections of text are available. In this work, we reexamine the inter-related problems of "topic identification" and "text segmentation" for sparse document learning, when there is a single new text of interest. In developing a methodology to handle single documents, we face two major challenges. First is sparse information: with access to only one document, we cannot train traditional topic models or deep learning algorithms. Second is significant noise: a considerable portion of words in any single document will produce only noise and not help discern topics or segments. To tackle these issues, we design an unsupervised, computationally efficient methodology called Biclustering Approach to Topic modeling and Segmentation (BATS). BATS leverages three key ideas to simultaneously identify topics and segment text: (i) a new mechanism that uses word order information to reduce sample complexity, (ii) a statistically sound graph-based biclustering technique that identifies latent structures of words and sentences, and (iii) a collection of effective heuristics that remove noise words and award important words to further improve performance. Experiments on six datasets show that our approach outperforms several state-of-the-art baselines when considering topic coherence, topic diversity, segmentation, and runtime comparison metrics.
KW - Biclustering
KW - text segmentation
KW - topic modeling
UR - http://www.scopus.com/inward/record.url?scp=85119310864&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85119310864&partnerID=8YFLogxK
U2 - 10.1145/3468268
DO - 10.1145/3468268
M3 - Article
AN - SCOPUS:85119310864
VL - 12
JO - ACM Transactions on Intelligent Systems and Technology
JF - ACM Transactions on Intelligent Systems and Technology
SN - 2157-6904
IS - 5
M1 - 3468268
ER -