We introduce AnyEnhance, a unified generative model for voice enhancement that processes both speech and singing voices. Built on a masked generative model, AnyEnhance supports a wide range of enhancement tasks, including denoising, dereverberation, declipping, super-resolution, and target speaker extraction, all simultaneously and without task-specific fine-tuning. AnyEnhance introduces a prompt-guidance mechanism for in-context learning that allows the model to natively accept a reference speaker's timbre; this boosts enhancement performance when reference audio is available and enables target speaker extraction without altering the underlying architecture. Moreover, we introduce a self-critic mechanism into the generative process of masked generative models, yielding higher-quality outputs through iterative self-assessment and refinement. Extensive experiments on various enhancement tasks demonstrate that AnyEnhance outperforms existing methods in terms of both objective metrics and subjective listening tests.
AnyEnhance handles both speech and singing voices across denoising, dereverberation, declipping, super-resolution, and target speaker extraction, all within a single model and without fine-tuning (Figure adapted from the URGENT Challenge). Below are audio examples of the tasks AnyEnhance can handle, covering General Speech Restoration (GSR), Speech Enhancement (SE), and Target Speaker Extraction (TSE).
Built on a masked generative model, AnyEnhance operates in two stages: a semantic enhancement stage and an acoustic enhancement stage. In the semantic stage, the encoder extracts clean semantic features from the distorted input audio, using representations aligned with pre-trained features. In the acoustic stage, the decoder predicts the masked acoustic tokens conditioned on these semantic features and the acoustic tokens already generated.
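To make the two-stage design concrete, below is a minimal PyTorch sketch of such a pipeline. Everything here is an illustrative assumption rather than the released AnyEnhance code: the module names (`SemanticEncoder`, `AcousticDecoder`), the 1024-entry acoustic codebook, the reserved `MASK_ID`, and the layer sizes are all hypothetical.

```python
import torch
import torch.nn as nn

MASK_ID = 1024      # hypothetical id reserved for "masked" acoustic tokens
VOCAB_SIZE = 1025   # 1024 codebook entries + the mask token

class SemanticEncoder(nn.Module):
    """Stage 1: map features of the distorted audio to clean semantic features."""
    def __init__(self, dim=512):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.net = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, distorted_feats):       # (B, T, dim)
        return self.net(distorted_feats)      # semantic features, (B, T, dim)

class AcousticDecoder(nn.Module):
    """Stage 2: predict masked acoustic tokens from semantics + known tokens."""
    def __init__(self, dim=512):
        super().__init__()
        self.tok_emb = nn.Embedding(VOCAB_SIZE, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.net = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(dim, VOCAB_SIZE - 1)   # logits over real tokens only

    def forward(self, acoustic_tokens, semantic_feats):
        x = self.tok_emb(acoustic_tokens) + semantic_feats
        return self.head(self.net(x))                # (B, T, 1024)
```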
The prompt-guidance mechanism lets the model accept a reference speaker's timbre in context, boosting enhancement quality when a reference recording is available and enabling target speaker extraction without any architectural change. The self-critic mechanism iteratively assesses and refines the model's own predictions during generation, yielding higher-quality outputs. For more details, please refer to the paper.
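As a rough illustration of how the two mechanisms could interact at inference time, here is a hedged sketch of an iterative decoding loop, reusing the hypothetical modules and `MASK_ID` from the sketch above. Prompt guidance is modeled by prepending the reference speaker's tokens so the decoder sees the target timbre in context; the self-critic is stubbed with the model's own token confidence, which decides what to commit and what to re-mask at each step. The paper's actual self-critic design may differ from this confidence-based stand-in.

```python
import torch

@torch.no_grad()
def enhance(encoder, decoder, distorted_feats, prompt_tokens=None, steps=8):
    """Iterative masked decoding with optional prompt guidance (sketch)."""
    B, T, D = distorted_feats.shape
    semantics = encoder(distorted_feats)
    tokens = torch.full((B, T), MASK_ID, dtype=torch.long)   # start fully masked

    if prompt_tokens is not None:
        # Prompt guidance: prepend the reference speaker's clean acoustic
        # tokens so the decoder can attend to the target timbre in-context.
        P = prompt_tokens.shape[1]
        tokens = torch.cat([prompt_tokens, tokens], dim=1)
        semantics = torch.cat([semantics.new_zeros(B, P, D), semantics], dim=1)

    for step in range(steps):
        masked = tokens == MASK_ID
        remaining = int(masked[0].sum())      # same for every batch item here
        if remaining == 0:
            break
        k = -(-remaining // (steps - step))   # ceil: fill everything by the end

        logits = decoder(tokens, semantics)   # (B, L, 1024)
        conf, pred = logits.softmax(dim=-1).max(dim=-1)

        # Self-critic-style refinement: commit only the k most confident
        # predictions this round and re-mask the rest for another pass.
        conf = conf.masked_fill(~masked, -1.0)    # never overwrite committed tokens
        accept = torch.zeros_like(masked)
        accept.scatter_(1, conf.topk(k, dim=-1).indices, True)
        tokens = torch.where(accept & masked, pred, tokens)

    # Drop the prompt region before returning the enhanced token sequence.
    return tokens[:, -T:]
```

In a real system, the returned acoustic tokens would then be decoded back to a waveform by the codec's decoder.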
General Speech Restoration (GSR) aims to solve a wide range of speech enhancement tasks, including denoising, dereverberation, declipping, and super-resolution.
Librivox GSR testset:
Noisy | Clean | Enhanced (w/o prompt) | Enhanced (w/ prompt) | Prompt |
---|---|---|---|---|
CCMusic GSR testset:
Noisy | Clean | Enhanced (w/o prompt) | Enhanced (w/ prompt) | Prompt |
---|---|---|---|---|
Speech Enhancement (SE) aims to improve the quality of speech signals by removing noise and reverberation.
DNS With Reverb testset:
Noisy | Clean* | Enhanced |
---|---|---|
DNS No Reverb testset:
Noisy | Clean | Enhanced |
---|---|---|
Target Speaker Extraction (TSE) aims to extract the target speaker's voice from a mixture of multiple speakers. AnyEnhance handles the TSE task both with and without background noise and reverberation.
Libri2Mix testset:
Noisy | Clean | Extracted | Prompt |
---|---|---|---|
VCTK Noisy TSE testset:
Noisy | Clean | Extracted | Prompt |
---|---|---|---|