GR0: SELF-SUPERVISED GLOBAL REPRESENTATION LEARNING FOR ZERO-SHOT VOICE CONVERSION

Yunyun Wang, Jiaqi Su, Adam Finkelstein, Zeyu Jin

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Research in generative self-supervised learning (SSL) has largely focused on local embeddings for tokenized sequences. We introduce a generative SSL framework that learns a global representation that is disentangled from local embeddings. We apply this technique to jointly learn a global speaker embedding and a zero-shot voice converter. The converter modifies recorded speech to sound as if it were spoken by a different person while preserving the content, using only a short reference clip unavailable to the model during training. Listening experiments conducted on an unseen dataset show that our models significantly outperform SOTA baselines in both quality and speaker similarity for various datasets and unseen languages.

Original language: English (US)
Title of host publication: 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 10786-10790
Number of pages: 5
ISBN (Electronic): 9798350344851
State: Published - 2024
Event: 49th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 - Seoul, Korea, Republic of
Duration: Apr 14 2024 → Apr 19 2024

Publication series

Name: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
ISSN (Print): 1520-6149

Conference

Conference: 49th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024
Country/Territory: Korea, Republic of
City: Seoul
Period: 4/14/24 → 4/19/24

All Science Journal Classification (ASJC) codes

  • Software
  • Signal Processing
  • Electrical and Electronic Engineering

Keywords

  • cross-lingual zero-shot voice conversion
  • generative self-supervised global representation learning
