CONTROLLABLE SPEECH REPRESENTATION LEARNING VIA VOICE CONVERSION AND AIC LOSS

Yunyun Wang, Jiaqi Su, Adam Finkelstein, Zeyu Jin

Research output: Chapter in Book/Report/Conference proceedingConference contribution

3 Scopus citations

Abstract

Speech representation learning transforms speech into features that are suitable for downstream tasks, e.g. speech recognition, phoneme classification, or speaker identification. For such recognition tasks, a representation can be lossy (non-invertible), which is typical of BERT-like self-supervised models. However, when used for synthesis tasks, we find these lossy representations prove to be insufficient to plausibly reconstruct the input signal. This paper introduces a method for invertible and controllable speech representation learning based on disentanglement. The representation can be decoded into a signal perceptually identical to the original. Moreover, its disentangled components (content, pitch, speaker identity, and energy) can be controlled independently to alter the synthesis result. Our model builds upon a zero-shot voice conversion model AutoVC-F0, in which we introduce alteration invariant content loss (AIC loss) and adversarial training (GAN). Through objective measures and subjective tests, we show that our formulation offers significant improvement in voice conversion sound quality as well as more precise control over the disentangled features.

Original languageEnglish (US)
Title of host publication2022 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2022 - Proceedings
PublisherInstitute of Electrical and Electronics Engineers Inc.
Pages6682-6686
Number of pages5
ISBN (Electronic)9781665405409
DOIs
StatePublished - 2022
Event47th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2022 - Virtual, Online, Singapore
Duration: May 23 2022May 27 2022

Publication series

NameICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
Volume2022-May
ISSN (Print)1520-6149

Conference

Conference47th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2022
Country/TerritorySingapore
CityVirtual, Online
Period5/23/225/27/22

All Science Journal Classification (ASJC) codes

  • Software
  • Signal Processing
  • Electrical and Electronic Engineering

Keywords

  • representation learning
  • voice conversion

Fingerprint

Dive into the research topics of 'CONTROLLABLE SPEECH REPRESENTATION LEARNING VIA VOICE CONVERSION AND AIC LOSS'. Together they form a unique fingerprint.

Cite this