HiFi-GAN-2: Studio-Quality Speech Enhancement via Generative Adversarial Networks Conditioned on Acoustic Features

Jiaqi Su, Zeyu Jin, Adam Finkelstein

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

1 Scopus citation

Abstract

Modern speech content creation tasks such as podcasts, video voice-overs, and audiobooks require studio-quality audio with full bandwidth and balanced equalization (EQ). These goals pose a challenge for conventional speech enhancement methods, which typically focus on removing significant acoustic degradation such as noise and reverb so as to improve speech clarity and intelligibility. We present HiFi-GAN-2, a waveform-to-waveform enhancement method that improves the quality of real-world consumer-grade recordings, with moderate noise, reverb and EQ distortion, to sound like studio recordings. HiFi-GAN-2 has three components. First, given a noisy reverberant recording as input, a recurrent network predicts the acoustic features (MFCCs) of a clean signal. Second, given the same noisy input, and conditioned on the MFCCs output by the first network, a feed-forward WaveNet (modeled via multi-domain multi-scale adversarial training) generates a clean 16 kHz signal. Third, a pre-trained bandwidth extension network generates the final 48 kHz studio-quality signal from the 16 kHz output of the second network. The complete pipeline is trained via simulation of noise, reverb and EQ added to studio-quality speech. Objective and subjective evaluations show that the proposed method outperforms state-of-the-art baselines on both conventional denoising and joint dereverberation and denoising tasks. Listening tests also show that our method achieves close to studio quality on real-world speech content (TED Talks and the VoxCeleb dataset).
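
The three-stage data flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`enhance`, `toy_predict_mfcc`, `toy_generate_clean`, `toy_extend_bandwidth`) are hypothetical, and the toy stages are stand-ins that only reproduce the shapes of the signals, not any learned behavior. The 16 kHz input rate, the hop size of 160 samples, and the 13 coefficients per frame are assumptions made for illustration; the abstract states only the 16 kHz and 48 kHz output rates.

```python
from typing import Callable, List

Waveform = List[float]        # mono samples
Features = List[List[float]]  # one feature vector per frame


def enhance(
    noisy: Waveform,
    predict_mfcc: Callable[[Waveform], Features],
    generate_clean: Callable[[Waveform, Features], Waveform],
    extend_bandwidth: Callable[[Waveform], Waveform],
) -> Waveform:
    """Chain the three stages in the order the abstract gives them."""
    mfcc = predict_mfcc(noisy)               # stage 1: features of the clean signal
    clean_16k = generate_clean(noisy, mfcc)  # stage 2: same input, conditioned on MFCCs
    return extend_bandwidth(clean_16k)       # stage 3: 16 kHz -> 48 kHz


# Hypothetical stand-ins, used only to check signal shapes.
def toy_predict_mfcc(x: Waveform, hop: int = 160, n_coeffs: int = 13) -> Features:
    # Assumed framing: one feature vector per 160 samples (10 ms at 16 kHz).
    return [[0.0] * n_coeffs for _ in range(len(x) // hop)]


def toy_generate_clean(x: Waveform, mfcc: Features) -> Waveform:
    # Identity placeholder; a real generator would use both inputs.
    return list(x)


def toy_extend_bandwidth(x: Waveform, factor: int = 3) -> Waveform:
    # Sample repetition, since 48000 / 16000 = 3. This adds no new
    # high-frequency content; a real network would have to synthesize it.
    return [s for s in x for _ in range(factor)]


one_second = [0.0] * 16000
output = enhance(one_second, toy_predict_mfcc, toy_generate_clean, toy_extend_bandwidth)
print(len(output))
```

Passing the stages in as callables keeps the sketch independent of any particular model; each placeholder could be swapped for a trained network with the same input and output shapes.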

Original language: English (US)
Title of host publication: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, WASPAA 2021
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 166-170
Number of pages: 5
ISBN (Electronic): 9781665448703
State: Published - 2021
Event: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, WASPAA 2021 - New Paltz, United States
Duration: Oct 17 2021 to Oct 20 2021

Publication series

Name: IEEE Workshop on Applications of Signal Processing to Audio and Acoustics
Volume: 2021-October
ISSN (Print): 1931-1168
ISSN (Electronic): 1947-1629

Conference

Conference: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, WASPAA 2021
Country/Territory: United States
City: New Paltz
Period: 10/17/21 to 10/20/21

All Science Journal Classification (ASJC) codes

  • Electrical and Electronic Engineering
  • Computer Science Applications

Keywords

  • acoustic features
  • denoising
  • dereverberation
  • generative adversarial networks
  • speech enhancement
