Recently, lip-to-speech synthesis has made great progress thanks to advances in visual speech recognition (VSR). This is because a pre-trained VSR model can provide valuable semantic information to the lip-to-speech model, significantly improving the intelligibility of the synthesized speech. It also explains the promising results achieved by existing cascaded models, which either combine a pseudo-VSR stage with a pseudo text-to-speech (TTS) stage or implicitly use the text transcribed by VSR for the subsequent TTS. However, these existing methods typically generate a mel-spectrogram as an intermediate feature, which may cause a mismatch during vocoder inference. This paper proposes an end-to-end lip-to-speech framework named Natural Lip-to-Speech (NaturalL2S). Our goal is to produce more expressive and natural speech from silent videos by leveraging acoustic inductive biases.
Specifically, a fundamental frequency predictor is introduced to encourage the generated speech to contain more natural pauses and pitch variations, and the predicted fundamental frequency is then used as an input feature for the proposed Differentiable Digital Signal Processing (DDSP) synthesizer module. The output of the DDSP synthesizer serves as prior knowledge to alleviate the unnaturalness of the synthesized speech.
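To make the role of the DDSP synthesizer concrete, the following is a minimal sketch of a harmonic-plus-noise synthesizer driven by a frame-level fundamental frequency contour, in the general spirit of DDSP. The function name, control inputs, and hyper-parameters (sample rate, hop size, number of harmonics) are illustrative assumptions, not the settings used in NaturalL2S.

```python
# Illustrative sketch only: a DDSP-style harmonic-plus-noise synthesizer
# driven by a predicted frame-level F0 contour (assumed interface).
import torch
import torch.nn.functional as F

def ddsp_harmonic_plus_noise(f0, harm_amps, noise_mag, sample_rate=16000, hop=200):
    """f0: (B, T) frame-level F0 in Hz; harm_amps: (B, T, K) per-harmonic
    amplitudes; noise_mag: (B, T) noise level. Returns a (B, T*hop) waveform."""
    B, T = f0.shape
    K = harm_amps.shape[-1]
    n_samples = T * hop

    # Upsample frame-level controls to sample rate by linear interpolation.
    f0_up = F.interpolate(f0.unsqueeze(1), size=n_samples,
                          mode="linear", align_corners=False).squeeze(1)
    amps_up = F.interpolate(harm_amps.transpose(1, 2), size=n_samples,
                            mode="linear", align_corners=False)          # (B, K, N)
    noise_up = F.interpolate(noise_mag.unsqueeze(1), size=n_samples,
                             mode="linear", align_corners=False).squeeze(1)

    # Instantaneous phase of each harmonic: integrate 2*pi*k*f0 / sr over time.
    harmonics = torch.arange(1, K + 1, device=f0.device).view(1, K, 1)
    omega = 2 * torch.pi * f0_up.unsqueeze(1) * harmonics / sample_rate   # (B, K, N)
    phase = torch.cumsum(omega, dim=-1)

    # Zero out harmonics above the Nyquist frequency to avoid aliasing.
    antialias = (f0_up.unsqueeze(1) * harmonics < sample_rate / 2).float()
    harmonic_part = (torch.sin(phase) * amps_up * antialias).sum(dim=1)   # (B, N)

    # Simple noise branch: white noise scaled by the predicted magnitude.
    noise_part = (torch.rand(B, n_samples, device=f0.device) * 2 - 1) * noise_up

    return harmonic_part + noise_part  # waveform prior for the later stages

# Toy usage with random control signals.
f0 = torch.full((1, 50), 150.0)           # 50 frames of a 150 Hz contour
harm_amps = torch.rand(1, 50, 64) * 0.01  # 64 harmonics
noise_mag = torch.rand(1, 50) * 0.003
audio = ddsp_harmonic_plus_noise(f0, harm_amps, noise_mag)
print(audio.shape)  # torch.Size([1, 10000])
```

In such a design, the synthesizer output is not the final speech; it acts as an acoustically grounded prior that a neural generator can refine, which is how the prior-knowledge role described above can be realized.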
Additionally, instead of using a reference speaker embedding as an auxiliary input for waveform generation, the speaker embedding is learned directly from the lip movements.
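As a rough illustration of this conditioning choice, the sketch below pools frame-level lip-region features over time into a fixed-size speaker embedding that could condition a waveform generator. The module, feature dimensions, and mean-pooling strategy are assumptions for illustration, not the paper's architecture.

```python
# Illustrative sketch only: deriving a speaker embedding from lip-movement
# features instead of a reference utterance (assumed module and dimensions).
import torch
import torch.nn as nn
import torch.nn.functional as F

class LipSpeakerEncoder(nn.Module):
    def __init__(self, visual_dim=512, embed_dim=256):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(visual_dim, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, visual_feats):          # (B, T, visual_dim) lip features
        pooled = visual_feats.mean(dim=1)     # temporal average pooling
        embed = self.proj(pooled)
        return F.normalize(embed, dim=-1)     # unit-norm speaker embedding

speaker_embed = LipSpeakerEncoder()(torch.randn(2, 75, 512))
print(speaker_embed.shape)  # torch.Size([2, 256])
```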
Both objective and subjective evaluation results demonstrate that the proposed model effectively improves the quality of the synthesized speech compared with other state-of-the-art methods.