In debating, rebuttal is one of the most critical stages, in which a speaker addresses the arguments presented by the opposing side. During this process, the speaker synthesizes their own persuasive articulation given the context from the opposing side. This work proposes Debatts, a novel zero-shot text-to-speech synthesis system for rebuttal. Debatts takes two speech prompts: one from the opposing side (i.e., the opponent) and one from the speaker. The prompt from the opponent provides debating-style prosody, while the prompt from the speaker provides identity information. In particular, we pretrain the Debatts system on an in-the-wild dataset and integrate an additional reference encoder that takes the debating prompt for style. In addition, we create a debating dataset to develop Debatts. In this setting, Debatts can generate debating-style rebuttal speech for any voice. Experimental results confirm the effectiveness of the proposed system in comparison with classic zero-shot TTS systems.
The Debatts-Data
dataset is constructed from a vast collection of professional Mandarin speech data sourced from diverse
video platforms and podcasts on the Internet. The in-the-wild collection approach ensures real and natural
rebuttal speech. This is the first Mandarin rebuttal speech dataset for expressive text-to-speech synthesis.
In addition, the dataset contains annotations of transcription, duration, and style embedding.
The table and chart below provide statistics for the dataset.
Dataset | Lang | Num of Spks | Duration (hrs) | Text/Speech | SR (kHz) | Wild/Studio
---|---|---|---|---|---|---
VivesDebate[1] | EN | - | 24 (est.) | T | - | W |
Record[2] | EN | 10 | 6 (est.) | T+S | 44.1 | S |
Rebuttal[3] | EN | 14 | 27 (est.) | T+S | - | S |
DBates[4] | EN | 140 | 70 (est.) | T+S | 16 | W |
Debatts-Data | ZH | 2350 | 111.9 | T+S | 16 | W |
[1] Ruiz-Dolz, R., Nofre, M., Taulé, M., Heras, S., & García-Fornes, A. (2021). Vivesdebate: A new annotated multilingual corpus of argumentation in a debate tournament. Applied Sciences, 11(15), 7160.
[2] Mirkin, S., Jacovi, M., Lavee, T., Kuo, H. K., Thomas, S., Sager, L., ... & Slonim, N. (2017). A recorded debating dataset. arXiv preprint arXiv:1709.06438.
[3] Orbach, M., Bilu, Y., Gera, A., Kantor, Y., Dankin, L., Lavee, T., ... & Slonim, N. (2019). A dataset of general-purpose rebuttal. arXiv preprint arXiv:1909.00393.
[4] Sen, T. K., Naven, G., Gerstner, L., Bagley, D., Baten, R. A., Rahman, W., ... & Hoque, E. (2021). Dbates: Dataset for discerning benefits of audio, textual, and facial expression features in competitive debate speeches. IEEE Transactions on Affective Computing, 14(2), 1028-1043.
To better understand the performance of the pipeline as well as the diversity and quality of the rebuttal dataset, we provide a few sampled speech examples below for preview. The figure below illustrates a rebuttal.
Rebuttal Subject | Opponent Speech | Speech
---|---|---
{{ example.subject }}<br>{{ example.subject_translation }} | {{ example.opponentSpeeches[0].text }}<br>{{ example.opponentSpeeches[0].translation }} | {{ example.speeches[0].text }}<br>{{ example.speeches[0].translation }}
 | {{ example.opponentSpeeches[i].text }}<br>{{ example.opponentSpeeches[i].translation }} | {{ example.speeches[i].text }}<br>{{ example.speeches[i].translation }}
Debatts-Data Pipe
is the first open-source preprocessing pipeline designed to transform in-the-wild
professional debating speech data into annotated, high-quality rebuttal data for text-to-speech generation.
It encompasses five key steps: moderator detection, rebuttal session extraction, speaker diarization,
overlap deletion and merging, and speech enhancement with metadata extraction.
The diagram below outlines the Debatts-Data Pipe workflow. Initially, the pipeline introduces a precise
methodology
for detecting moderators within extensive competitive speech datasets, facilitating the extraction of relevant
sections
based on moderator cues. This approach is adaptable to other languages with minimal modifications and can be
extended to any context involving moderator-led discussions.
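The overlap-deletion and merging step above can be sketched as a simple interval operation over diarized segments. The segment format and the merge-gap threshold below are illustrative assumptions, not the released implementation:

```python
def clean_segments(segments, max_gap=0.5):
    """Drop regions where two different speakers overlap, then merge
    consecutive same-speaker segments separated by at most `max_gap` s.

    `segments` is a list of dicts {"spk", "start", "end"} sorted by start.
    """
    # 1. Overlap deletion: discard any segment that overlaps in time
    #    with a segment from a different speaker.
    kept = []
    for seg in segments:
        overlapped = any(
            other is not seg
            and other["spk"] != seg["spk"]
            and other["start"] < seg["end"]
            and seg["start"] < other["end"]
            for other in segments
        )
        if not overlapped:
            kept.append(dict(seg))

    # 2. Merging: fuse adjacent segments from the same speaker.
    merged = []
    for seg in kept:
        if (merged
                and merged[-1]["spk"] == seg["spk"]
                and seg["start"] - merged[-1]["end"] <= max_gap):
            merged[-1]["end"] = max(merged[-1]["end"], seg["end"])
        else:
            merged.append(seg)
    return merged


segs = [
    {"spk": "A", "start": 0.0, "end": 2.0},
    {"spk": "A", "start": 2.2, "end": 4.0},
    {"spk": "B", "start": 5.0, "end": 6.0},
    {"spk": "B", "start": 5.5, "end": 7.0},
    {"spk": "A", "start": 8.0, "end": 9.0},  # cross-speaker overlap with B
    {"spk": "B", "start": 8.5, "end": 9.5},  # -> both dropped
]
out = clean_segments(segs)
```

Here the two consecutive A segments merge into a single 0.0–4.0 s segment, while the region where A and B speak simultaneously is removed entirely.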
After processing, Debatts-Data Pipe outputs the speech data in JSON and WAV formats. The JSON file contains metadata such as language, transcription, style embedding, and conversational context path, while the WAV file contains the speech data. The JSON file is structured as in the last step of the pipeline.
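A metadata record of this shape might look as follows; the key names and file paths below are assumptions for demonstration, not the released schema:

```python
import json

# Illustrative metadata record for one processed utterance; all key names
# and paths are hypothetical.
record = {
    "language": "zh",
    "transcription": "<rebuttal transcript>",
    "duration": 7.42,                           # seconds
    "style_embedding": "embeds/000123.npy",     # path to the style embedding
    "context_wav": "wavs/opponent_000122.wav",  # conversational context path
    "wav": "wavs/000123.wav",                   # the utterance itself
}

# Serialize and reload to confirm the record round-trips through JSON.
serialized = json.dumps(record, ensure_ascii=False, indent=2)
loaded = json.loads(serialized)
```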
Debatts
is the first zero-shot debating text-to-speech (TTS) system that utilizes the opponent's
speech as a style prompt, alongside the target speaker's speech as a speaker identification prompt. It features
a two-stage model architecture comprising a text-to-semantic stage and a semantic-to-acoustic stage. In the
first stage, the model predicts target semantic tokens by integrating the semantic tokens from both the opponent
and the target speaker, along with the text tokens. In the second stage, it generates speech with a debating
style based on the concatenated target speaker's and predicted semantic tokens, combined with the target
speaker's acoustic tokens. This approach enables the generation of natural and expressive rebuttal speech for
any voice.
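The two-stage wiring described above can be sketched with plain integer lists standing in for learned tokens; the real system uses trained tokenizers and sequence models, so every name here is illustrative:

```python
def stage1_input(opponent_sem, speaker_sem, text_tokens):
    """Text-to-semantic stage: the model predicts target semantic tokens
    conditioned on the opponent's (style) and speaker's (identity) semantic
    tokens plus the text tokens."""
    return opponent_sem + speaker_sem + text_tokens


def stage2_input(speaker_sem, predicted_sem, speaker_acoustic):
    """Semantic-to-acoustic stage: the speaker's semantic tokens are
    concatenated with the predicted ones, with the speaker's acoustic
    tokens serving as the timbre prompt."""
    return {
        "semantic": speaker_sem + predicted_sem,
        "acoustic_prompt": speaker_acoustic,
    }


# Dummy token sequences standing in for tokenizer outputs.
opponent_sem = [11, 12, 13]   # style prompt (opponent)
speaker_sem = [21, 22]        # identity prompt (target speaker)
text_tokens = [31, 32, 33, 34]

prompt = stage1_input(opponent_sem, speaker_sem, text_tokens)
predicted_sem = [41, 42, 43]  # stand-in for the stage-1 prediction
acoustic_in = stage2_input(speaker_sem, predicted_sem, [51, 52])
```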
In this section, we demonstrate the zero-shot TTS performance of Debatts compared to the baseline. Note that the translated text is provided for illustration only; the generated speech is entirely in Mandarin.
This video is an example of Debatts’ synthesized speech interacting with a human voice, highlighting the SpeechLab's work at The Chinese University of Hong Kong, Shenzhen. The female voice is human, while the male voice is generated by Debatts, showcasing Debatts' natural, fluent, and highly expressive synthesized speech.