A theoretical analysis of contrastive unsupervised representation learning

Sanjeev Arora, Hrishikesh Khandeparkar, Mikhail Khodak, Orestis Plevrakis, Nikunj Saunshi

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

110 Scopus citations

Abstract

Recent empirical works have successfully used unlabeled data to learn feature representations that are broadly useful in downstream classification tasks. Several of these methods are reminiscent of the well-known word2vec embedding algorithm: leveraging the availability of pairs of semantically "similar" data points and "negative samples," the learner forces the inner products of representations of similar pairs to be higher on average than inner products with negative samples. The current paper uses the term contrastive learning for such algorithms and presents a theoretical framework for analyzing them by introducing latent classes and hypothesizing that semantically similar points are sampled from the same latent class. This framework allows us to show provable guarantees on the performance of the learned representations on the average classification task consisting of a subset of the same set of latent classes. Our generalization bound also shows that learned representations can reduce (labeled) sample complexity on downstream tasks. We conduct controlled experiments in both the text and image domains to support the theory. This work is supported by NSF, ONR, the Simons Foundation, the Schmidt Foundation, Mozilla Research, Amazon Research, DARPA, and SRC. We thank Michael Collins, Rong Ge, Elad Hazan, Sham Kakade, Karthik Narasimhan, Karan Singh, and Yi Zhang for helpful discussions and suggestions.
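To make the objective described in the abstract concrete, below is a minimal sketch of a contrastive loss of this form: for each anchor point x with a semantically similar point x+ and a negative sample x-, the inner product of the learned representations of x and x+ is pushed above the inner product with the representation of x-. The logistic form and the function names here are illustrative assumptions, not a reproduction of the paper's exact setup.

```python
# Minimal sketch (assumed logistic form) of a contrastive objective:
# encourage <f(x), f(x+)> to exceed <f(x), f(x-)> on average.
import numpy as np

def contrastive_logistic_loss(f_x, f_pos, f_neg):
    """Average logistic contrastive loss over a batch.

    f_x, f_pos, f_neg: arrays of shape (batch, dim) holding the learned
    representations of anchors, similar points, and negative samples.
    """
    pos_sim = np.sum(f_x * f_pos, axis=1)   # <f(x), f(x+)>
    neg_sim = np.sum(f_x * f_neg, axis=1)   # <f(x), f(x-)>
    # log(1 + exp(neg - pos)) is small when the similar pair scores higher
    return np.mean(np.log1p(np.exp(neg_sim - pos_sim)))

# Toy usage with random stand-ins for learned representations
rng = np.random.default_rng(0)
f_x, f_pos, f_neg = (rng.normal(size=(8, 16)) for _ in range(3))
print(contrastive_logistic_loss(f_x, f_pos, f_neg))
```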

Original language: English (US)
Title of host publication: 36th International Conference on Machine Learning, ICML 2019
Publisher: International Machine Learning Society (IMLS)
Pages: 9904-9923
Number of pages: 20
ISBN (Electronic): 9781510886988
State: Published - 2019
Event: 36th International Conference on Machine Learning, ICML 2019 - Long Beach, United States
Duration: Jun 9, 2019 - Jun 15, 2019

Publication series

Name: 36th International Conference on Machine Learning, ICML 2019
Volume: 2019-June

Conference

Conference: 36th International Conference on Machine Learning, ICML 2019
Country/Territory: United States
City: Long Beach
Period: 6/9/19 - 6/15/19

All Science Journal Classification (ASJC) codes

  • Education
  • Computer Science Applications
  • Human-Computer Interaction
