NaturalL2S: End-to-End High-quality Multispeaker Lip-to-Speech Synthesis with Differential Digital Signal Processing

Yifan Liang, Fangkun Liu, Andong Li, Xiaodong Li, Chengshi Zheng
Institute of Acoustics, Chinese Academy of Sciences, Beijing

Abstract

Recently, lip-to-speech synthesis has made great progress thanks to advances in visual speech recognition (VSR). A pre-trained VSR model can provide valuable semantic information to the lip-to-speech model, significantly improving the intelligibility of the synthesized speech. This explains the promising results achieved by existing cascade models, which either combine pseudo-VSR with pseudo text-to-speech (TTS) or implicitly use the text transcribed by VSR for the subsequent TTS stage. These methods typically generate a mel-spectrogram as an intermediate feature, which may cause an inevitable mismatch during vocoder inference. This paper proposes an end-to-end lip-to-speech framework named Natural Lip-to-Speech (NaturalL2S). Our goal is to produce more expressive and natural speech from silent videos by leveraging acoustic inductive biases. Specifically, a fundamental frequency predictor is introduced so that the generated speech contains more natural pauses and pitch variations, and the predicted fundamental frequency is then used as an input feature for the proposed Differentiable Digital Signal Processing (DDSP) synthesizer module. The output of the DDSP synthesizer serves as prior knowledge to alleviate the unnaturalness of the synthesized speech. Additionally, instead of using a reference speaker embedding as an auxiliary input for waveform generation, the speaker embedding is learned directly from the lip movements. Both objective and subjective evaluation results demonstrate that the proposed model effectively improves the quality of the synthesized speech compared with other state-of-the-art methods.
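To make the role of the fundamental frequency concrete, the following is a minimal NumPy sketch of a harmonic-plus-noise signal driven by a frame-level F0 contour, the general idea behind DDSP-style synthesizers. It is an illustration under stated assumptions, not the NaturalL2S implementation: the function name `harmonic_plus_noise`, the fixed 1/k harmonic amplitudes, the unfiltered white noise, and all parameter values are hypothetical, whereas a learned synthesizer would typically predict such quantities from the input features.

```python
import numpy as np


def harmonic_plus_noise(f0_frames, sample_rate=16000, hop=160,
                        n_harmonics=8, noise_gain=0.003, seed=0):
    """Toy harmonic-plus-noise synthesis from a frame-level F0 contour.

    f0_frames: F0 in Hz per frame; 0 marks an unvoiced frame (assumption).
    Returns (harmonic, noise, mix) as sample-level arrays.
    """
    f0_frames = np.asarray(f0_frames, dtype=np.float64)

    # Upsample the frame-level F0 to sample level by simple repetition.
    f0 = np.repeat(f0_frames, hop)
    voiced = f0 > 0

    # Instantaneous phase is the running sum of the angular frequency.
    phase = 2.0 * np.pi * np.cumsum(f0 / sample_rate)

    harmonic = np.zeros_like(f0)
    for k in range(1, n_harmonics + 1):
        # Skip harmonics at or above the Nyquist frequency to avoid aliasing.
        alias_free = (k * f0) < (sample_rate / 2.0)
        # Fixed 1/k amplitudes are an illustrative assumption.
        harmonic += alias_free * np.sin(k * phase) / k

    # Silence the harmonic branch in unvoiced regions and scale it down.
    harmonic *= voiced
    harmonic *= 0.1

    # Noise branch: plain white noise (a real system would shape it).
    rng = np.random.default_rng(seed)
    noise = noise_gain * rng.standard_normal(f0.shape[0])

    return harmonic, noise, harmonic + noise


if __name__ == "__main__":
    # 10 unvoiced frames followed by 10 voiced frames at 200 Hz.
    contour = [0.0] * 10 + [200.0] * 10
    harmonic, noise, mix = harmonic_plus_noise(contour)
    print(mix.shape)
```

In this sketch, pauses correspond to frames where F0 is zero and only the noise branch is active, while pitch variation follows directly from the F0 contour. The separate harmonic and noise outputs mirror the "harmonic signal" and "noise signal" columns listed in the ablation samples.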

LRS2 Samples

Ground Truth | NaturalL2S | DiffV2S | Lip2speech-Unit | Multi-Task | VCA-GAN | Text

LRS3 Samples

Ground Truth | NaturalL2S | DiffV2S | Lip2speech-Unit | Multi-Task | VCA-GAN | Text

Ablation Study on LRS2

Ground Truth | NaturalL2S | NaturalL2S harmonic signal | NaturalL2S noise signal | NaturalL2S-wo-DDSP | NaturalL2S-wo-e2e | Text