Multilanguage Word Embeddings for Social Scientists: Estimation, Inference, and Validation Resources for 157 Languages

Elisa M. Wirsching, Pedro L. Rodriguez, Arthur Spirling, Brandon Michael Stewart

Research output: Contribution to journal › Article › peer-review

Abstract

Word embeddings are now a vital resource for social science research. However, obtaining high-quality training data for non-English languages can be difficult, and fitting embeddings therein may be computationally expensive. In addition, social scientists typically want to make statistical comparisons and do hypothesis tests on embeddings, yet this is nontrivial with current approaches. We provide three new data resources designed to ameliorate the union of these issues: (1) a new version of fastText model embeddings, (2) a multilanguage "a la carte" (ALC) embedding version of the fastText model, and (3) a multilanguage ALC embedding version of the well-known GloVe model. All three are fit to Wikipedia corpora. These materials are aimed at "low-resource" settings where the analysts lack access to large corpora in their language of interest or to the computational resources required to produce high-quality vector representations. We make these resources available for 40 languages, along with a code pipeline for another 117 languages available from Wikipedia corpora. We extensively validate the materials via reconstruction tests and other proofs-of-concept. We also conduct human crowdworker tests for our embeddings for Arabic, French, (traditional Mandarin) Chinese, Japanese, Korean, Russian, and Spanish. Finally, we offer some advice to practitioners using our resources.

Original language: English (US)
Journal: Political Analysis
State: Accepted/In press - 2024

All Science Journal Classification (ASJC) codes

  • Sociology and Political Science
  • Political Science and International Relations

Keywords

  • machine learning
  • natural language processing
  • text as data
  • word embeddings
