RevComm Inc. / Senior Research Engineer
Speech-to-Face Conversion Using Denoising Diffusion Probabilistic Models
Speech-to-face conversion is the task of generating face images from speech signals. Many studies have addressed this task and achieved good performance. In this paper, we introduce denoising diffusion probabilistic models (DDPMs) to generate face images, in place of the generative adversarial networks (GANs) or autoencoders used in most prior studies. Moreover, unlike prior studies, several components of our system are designed to use high-resolution face image datasets instead of audio-visual paired data. As a result, our system can generate high-resolution face images from speech signals with an architecture that is simpler and more flexible than those used in prior studies. In addition, introducing DDPMs enables us to utilize, in succeeding studies, techniques that control the outputs of DDPMs or improve their performance.