A domain-independent text segmentation method for educational course content

Yuwei Tu, Ying Xiong, Weiyu Chen, Christopher Brinton

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

3 Scopus citations

Abstract

In this study, we propose a domain-independent text segmentation algorithm that is particularly useful for online educational courses. Text segmentation has been shown to improve the readability of large corpora of documents, which is essential in educational settings. While existing domain-dependent text segmentation methods outperform domain-independent methods in most cases, only domain-independent methods are applicable to the sparse training content found in educational settings. Unlike domain-dependent text segmentation methods, our method does not require heavy training on prior documents; it needs to train only on the current corpus of documents, using topic distributions and word vector representations. The proposed method identifies boundaries between small text units in three steps. First, we compute input text features via topic distributions (latent Dirichlet allocation) and word embeddings (GloVe). Second, we calculate similarity values between these textual features and detect distribution changes among the similarities. Finally, we cluster the similarities and detect sub-topic boundaries via cluster differences. We test our method on two datasets: one from an online education course and one popular public dataset, the Choi dataset. The results demonstrate that our method outperforms other state-of-the-art domain-independent text segmentation approaches while achieving performance comparable to a few domain-dependent algorithms.
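The similarity and clustering steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes each text unit already has a feature vector (for example, an LDA topic distribution concatenated with an averaged GloVe vector), it skips the distribution-change detection step, and it substitutes a simple one-dimensional 2-means clustering for whatever clustering the paper uses. The function names `cosine`, `two_means_1d`, and `segment_boundaries` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def two_means_1d(values, max_iters=50):
    """Split scalars into a low (0) and a high (1) cluster.

    Assumption: a plain 2-means stand-in, not the paper's clustering.
    Returns None when all values are identical (no meaningful split).
    """
    lo, hi = min(values), max(values)
    if lo == hi:
        return None
    labels = []
    for _ in range(max_iters):
        # Assign each value to the nearer of the two centroids.
        labels = [0 if abs(v - lo) <= abs(v - hi) else 1 for v in values]
        low = [v for v, lab in zip(values, labels) if lab == 0]
        high = [v for v, lab in zip(values, labels) if lab == 1]
        new_lo, new_hi = sum(low) / len(low), sum(high) / len(high)
        if new_lo == lo and new_hi == hi:
            break
        lo, hi = new_lo, new_hi
    return labels


def segment_boundaries(features):
    """Return indices i such that a boundary falls after text unit i."""
    sims = [cosine(features[i], features[i + 1])
            for i in range(len(features) - 1)]
    if not sims:
        return []
    labels = two_means_1d(sims)
    if labels is None:
        return []
    # A low-similarity gap between adjacent units is treated as a boundary.
    return [i for i, lab in enumerate(labels) if lab == 0]


# Toy features: units 0-2 lean toward topic A, units 3-5 toward topic B.
toy = [[0.9, 0.1], [0.8, 0.2], [0.9, 0.1],
       [0.1, 0.9], [0.2, 0.8], [0.1, 0.9]]
print(segment_boundaries(toy))  # prints [2]
```

In this toy input the only low-similarity gap lies between units 2 and 3, so a single boundary is reported there. Real feature vectors would come from models trained on the current corpus, as the abstract describes.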

Original language: English (US)
Title of host publication: Proceedings - 18th IEEE International Conference on Data Mining Workshops, ICDMW 2018
Editors: Hanghang Tong, Zhenhui Li, Feida Zhu, Jeffrey Yu
Publisher: IEEE Computer Society
Pages: 320-327
Number of pages: 8
ISBN (Electronic): 9781538692882
State: Published - Jul 2 2018
Externally published: Yes
Event: 18th IEEE International Conference on Data Mining Workshops, ICDMW 2018 - Singapore, Singapore
Duration: Nov 17 2018 to Nov 20 2018

Publication series

Name: IEEE International Conference on Data Mining Workshops, ICDMW
Volume: 2018-November
ISSN (Print): 2375-9232
ISSN (Electronic): 2375-9259

Conference

Conference: 18th IEEE International Conference on Data Mining Workshops, ICDMW 2018
Country/Territory: Singapore
City: Singapore
Period: 11/17/18 to 11/20/18

All Science Journal Classification (ASJC) codes

  • Computer Science Applications
  • Software

Keywords

  • Latent Dirichlet Allocation
  • Semantic Information
  • Text Segmentation
  • Topic Modeling
  • Word Embedding
