TY - GEN
T1 - Enhancing Interpretability using Human Similarity Judgements to Prune Word Embeddings
AU - Manrique, Natalia Flechas
AU - Bao, Wanqian
AU - Herbelot, Aurelie
AU - Hasson, Uri
N1 - Publisher Copyright:
©2023 Association for Computational Linguistics.
PY - 2023
Y1 - 2023
N2 - Interpretability methods in NLP aim to provide insights into the semantics underlying specific system architectures. Focusing on word embeddings, we present a supervised-learning method that, for a given domain (e.g., sports, professions), identifies a subset of model features (columns of the embedding space) that strongly improve prediction of human similarity judgments. We show this method keeps only 20-40% of the original embeddings, for 8 independent semantic domains, and that it retains different feature sets across domains. We then present two approaches for interpreting the semantics of the retained features. The first obtains the scores of the domain words (co-hyponyms) on the first principal component of the retained embeddings, and extracts terms whose co-occurrence with the co-hyponyms tracks these scores’ profile. This analysis reveals that humans differentiate e.g. sports based on how gender-inclusive and international they are. The second approach uses the retained sets as variables in a probing task that predicts values along 65 semantically annotated dimensions for a dataset of 535 words. The features retained for professions are best at predicting cognitive, emotional and social dimensions, whereas features retained for fruits or vegetables best predict the gustation (taste) dimension. We discuss implications for alignment between AI systems and human knowledge.
UR - http://www.scopus.com/inward/record.url?scp=85184805621&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85184805621&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85184805621
T3 - BlackboxNLP 2023 - Analyzing and Interpreting Neural Networks for NLP, Proceedings of the 6th Workshop
SP - 169
EP - 179
BT - BlackboxNLP 2023 - Analyzing and Interpreting Neural Networks for NLP, Proceedings of the 6th Workshop
A2 - Belinkov, Yonatan
A2 - Hao, Sophie
A2 - Jumelet, Jaap
A2 - Kim, Najoung
A2 - McCarthy, Arya
A2 - Mohebbi, Hosein
PB - Association for Computational Linguistics (ACL)
T2 - 6th Workshop on Analyzing and Interpreting Neural Networks for NLP, BlackboxNLP 2023
Y2 - 7 December 2023
ER -