TY - GEN
T1 - A domain-independent text segmentation method for educational course content
AU - Tu, Yuwei
AU - Xiong, Ying
AU - Chen, Weiyu
AU - Brinton, Christopher
N1 - Publisher Copyright:
© 2018 IEEE.
PY - 2018/7/2
Y1 - 2018/7/2
N2 - In this study, we propose a domain-independent text segmentation algorithm that is particularly useful for online educational courses. Text segmentation has proven helpful in improving the readability of large corpora of documents, which is essential in educational scenarios. While existing domain-dependent text segmentation methods perform much better than domain-independent methods in most cases, only domain-independent methods are applicable to the sparse training content available in educational settings. Unlike domain-dependent text segmentation methods, our method does not require heavy training on prior documents; it trains only on the current corpus of documents, using topic distributions and word vector representations. The proposed method identifies text boundaries between small text units in three steps. We first compute input text features via topic distributions (latent Dirichlet allocation) and word embeddings (GloVe). We then compute similarity values between these textual features and detect distribution changes across the similarities. Finally, we cluster the similarities and detect sub-topic boundaries via cluster differences. We test our method on two datasets: one from an online educational course and one from a popular public benchmark, the Choi dataset. The results demonstrate that our method outperforms other state-of-the-art domain-independent text segmentation approaches while achieving performance comparable to several domain-dependent algorithms.
AB - In this study, we propose a domain-independent text segmentation algorithm that is particularly useful for online educational courses. Text segmentation has proven helpful in improving the readability of large corpora of documents, which is essential in educational scenarios. While existing domain-dependent text segmentation methods perform much better than domain-independent methods in most cases, only domain-independent methods are applicable to the sparse training content available in educational settings. Unlike domain-dependent text segmentation methods, our method does not require heavy training on prior documents; it trains only on the current corpus of documents, using topic distributions and word vector representations. The proposed method identifies text boundaries between small text units in three steps. We first compute input text features via topic distributions (latent Dirichlet allocation) and word embeddings (GloVe). We then compute similarity values between these textual features and detect distribution changes across the similarities. Finally, we cluster the similarities and detect sub-topic boundaries via cluster differences. We test our method on two datasets: one from an online educational course and one from a popular public benchmark, the Choi dataset. The results demonstrate that our method outperforms other state-of-the-art domain-independent text segmentation approaches while achieving performance comparable to several domain-dependent algorithms.
KW - Latent Dirichlet Allocation
KW - Semantic Information
KW - Text Segmentation
KW - Topic Modeling
KW - Word Embedding
UR - http://www.scopus.com/inward/record.url?scp=85062865867&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85062865867&partnerID=8YFLogxK
U2 - 10.1109/ICDMW.2018.00053
DO - 10.1109/ICDMW.2018.00053
M3 - Conference contribution
AN - SCOPUS:85062865867
T3 - IEEE International Conference on Data Mining Workshops, ICDMW
SP - 320
EP - 327
BT - Proceedings - 18th IEEE International Conference on Data Mining Workshops, ICDMW 2018
A2 - Tong, Hanghang
A2 - Li, Zhenhui
A2 - Zhu, Feida
A2 - Yu, Jeffrey
PB - IEEE Computer Society
T2 - 18th IEEE International Conference on Data Mining Workshops, ICDMW 2018
Y2 - 17 November 2018 through 20 November 2018
ER -