TY - GEN
T1 - Euphemistic Phrase Detection by Masked Language Model
AU - Zhu, Wanzheng
AU - Bhat, Suma
N1 - Publisher Copyright: © 2021 Association for Computational Linguistics.
PY - 2021
Y1 - 2021
N2 - It is a well-known approach for fringe groups and organizations to use euphemisms - ordinary-sounding and innocent-looking words with a secret meaning - to conceal what they are discussing. For instance, drug dealers often use "pot" for marijuana and "avocado" for heroin. From a social media content moderation perspective, though recent advances in NLP have enabled the automatic detection of such single-word euphemisms, no existing work is capable of automatically detecting multi-word euphemisms, such as "blue dream" (marijuana) and "black tar" (heroin). Our paper tackles the problem of euphemistic phrase detection without human effort for the first time, as far as we are aware. We first perform phrase mining on a raw text corpus (e.g., social media posts) to extract quality phrases. Then, we utilize word embedding similarities to select a set of euphemistic phrase candidates. Finally, we rank those candidates by a masked language model - SpanBERT. Compared to strong baselines, we report 20-50% higher detection accuracies using our algorithm for detecting euphemistic phrases.
AB - It is a well-known approach for fringe groups and organizations to use euphemisms - ordinary-sounding and innocent-looking words with a secret meaning - to conceal what they are discussing. For instance, drug dealers often use "pot" for marijuana and "avocado" for heroin. From a social media content moderation perspective, though recent advances in NLP have enabled the automatic detection of such single-word euphemisms, no existing work is capable of automatically detecting multi-word euphemisms, such as "blue dream" (marijuana) and "black tar" (heroin). Our paper tackles the problem of euphemistic phrase detection without human effort for the first time, as far as we are aware. We first perform phrase mining on a raw text corpus (e.g., social media posts) to extract quality phrases. Then, we utilize word embedding similarities to select a set of euphemistic phrase candidates. Finally, we rank those candidates by a masked language model - SpanBERT. Compared to strong baselines, we report 20-50% higher detection accuracies using our algorithm for detecting euphemistic phrases.
UR - http://www.scopus.com/inward/record.url?scp=85129140039&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85129140039&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85129140039
T3 - Findings of the Association for Computational Linguistics, Findings of ACL: EMNLP 2021
SP - 163
EP - 168
BT - Findings of the Association for Computational Linguistics, Findings of ACL
A2 - Moens, Marie-Francine
A2 - Huang, Xuanjing
A2 - Specia, Lucia
A2 - Yih, Scott Wen-Tau
PB - Association for Computational Linguistics (ACL)
T2 - 2021 Findings of the Association for Computational Linguistics, Findings of ACL: EMNLP 2021
Y2 - 7 November 2021 through 11 November 2021
ER -