Euphemistic Phrase Detection by Masked Language Model

Wanzheng Zhu, Suma Bhat

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

13 Scopus citations

Abstract

It is a well-known approach for fringe groups and organizations to use euphemisms - ordinary-sounding and innocent-looking words with a secret meaning - to conceal what they are discussing. For instance, drug dealers often use "pot" for marijuana and "avocado" for heroin. From a social media content moderation perspective, though recent advances in NLP have enabled the automatic detection of such single-word euphemisms, no existing work is capable of automatically detecting multi-word euphemisms, such as "blue dream" (marijuana) and "black tar" (heroin). Our paper tackles the problem of euphemistic phrase detection without human effort for the first time, as far as we are aware. We first perform phrase mining on a raw text corpus (e.g., social media posts) to extract quality phrases. Then, we utilize word embedding similarities to select a set of euphemistic phrase candidates. Finally, we rank those candidates by a masked language model - SpanBERT. Compared to strong baselines, we report 20-50% higher detection accuracies using our algorithm for detecting euphemistic phrases.
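The last two stages described in the abstract (candidate selection by embedding similarity, then ranking) can be illustrated with a minimal sketch. This is not the authors' implementation: the three-dimensional vectors are made-up toy values, the function names `select_candidates` and `rank_candidates` are hypothetical, and the `fill_score` callable is only a stand-in for whatever score a masked language model such as SpanBERT would assign to a phrase in a masked context.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_candidates(phrase_vecs, seed_vecs, top_k):
    """Keep the top_k mined phrases by best similarity to any seed term."""
    scored = []
    for phrase, vec in phrase_vecs.items():
        best = max(cosine(vec, seed) for seed in seed_vecs.values())
        scored.append((phrase, best))
    scored.sort(key=lambda item: item[1], reverse=True)
    return [phrase for phrase, _ in scored[:top_k]]


def rank_candidates(candidates, masked_contexts, fill_score):
    """Order candidates by their mean fill score over the masked contexts.

    fill_score(context, phrase) is an assumed interface; in practice it
    would wrap a masked language model.
    """
    def mean_score(phrase):
        total = sum(fill_score(ctx, phrase) for ctx in masked_contexts)
        return total / len(masked_contexts)

    return sorted(candidates, key=mean_score, reverse=True)


# Toy embeddings (assumed values, for illustration only).
PHRASE_VECS = {
    "blue dream": [0.9, 0.1, 0.0],
    "black tar": [0.8, 0.3, 0.1],
    "credit card": [0.0, 0.2, 0.9],
}
SEED_VECS = {"marijuana": [1.0, 0.0, 0.0], "heroin": [0.7, 0.4, 0.0]}

candidates = select_candidates(PHRASE_VECS, SEED_VECS, top_k=2)

# Stand-in scores; a real system would query a masked language model here.
TOY_SCORES = {"blue dream": 0.4, "black tar": 0.6}
ranked = rank_candidates(
    candidates,
    ["selling [MASK] cheap"],
    lambda ctx, phrase: TOY_SCORES[phrase],
)
```

With these toy values, "credit card" is filtered out in the similarity step, and the ranking step orders the remaining phrases by the stand-in scores. Separating the two functions keeps the cheap similarity filter independent of the more expensive model-based ranking.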

Original language: English (US)
Title of host publication: Findings of the Association for Computational Linguistics, Findings of ACL
Subtitle of host publication: EMNLP 2021
Editors: Marie-Francine Moens, Xuanjing Huang, Lucia Specia, Scott Wen-Tau Yih
Publisher: Association for Computational Linguistics (ACL)
Pages: 163-168
Number of pages: 6
ISBN (Electronic): 9781955917100
State: Published - 2021
Externally published: Yes
Event: 2021 Findings of the Association for Computational Linguistics, Findings of ACL: EMNLP 2021 - Punta Cana, Dominican Republic
Duration: Nov 7 2021 to Nov 11 2021

Publication series

Name: Findings of the Association for Computational Linguistics, Findings of ACL: EMNLP 2021

Conference

Conference: 2021 Findings of the Association for Computational Linguistics, Findings of ACL: EMNLP 2021
Country/Territory: Dominican Republic
City: Punta Cana
Period: 11/7/21 to 11/11/21

All Science Journal Classification (ASJC) codes

  • Language and Linguistics
  • Linguistics and Language
