Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis
Humans involuntarily tend to infer parts of the conversation from lip
movements when the speech is absent or corrupted by external noise. In this
work, we explore the task of lip to speech synthesis, i.e., learning to
generate natural speech given only the lip movements of a speaker.
Acknowledging the importance of contextual and speaker-specific cues for
accurate lip-reading, we take a different path from existing works. We focus on
learning accurate lip-to-speech mappings for individual speakers in
unconstrained, large vocabulary settings. To this end, we collect and release a
large-scale benchmark dataset, the first of its kind, specifically to train and
evaluate the single-speaker lip to speech task in natural settings. We propose
a novel approach with key design choices to achieve accurate, natural lip to
speech synthesis in such unconstrained scenarios for the first time. Extensive
evaluation using quantitative, qualitative metrics and human evaluation shows
that our method is four times more intelligible than previous works in this
space. Please check out our demo video for a quick overview of the paper,
method, and qualitative results.
https://www.youtube.com/watch?v=HziA-jmlk_4&feature=youtu.be
Comment: 10 pages (including references), 5 figures, Accepted in CVPR, 202
Compressing Video Calls using Synthetic Talking Heads
We leverage the modern advancements in talking head generation to propose an
end-to-end system for talking head video compression. Our algorithm transmits
pivot frames intermittently while the rest of the talking head video is
generated by animating them. We use a state-of-the-art face reenactment network
to detect key points in the non-pivot frames and transmit them to the receiver.
A dense flow is then calculated to warp a pivot frame to reconstruct the
non-pivot ones. Transmitting key points instead of full frames leads to
significant compression. We propose a novel algorithm to adaptively select the
best-suited pivot frames at regular intervals to provide a smooth experience.
We also propose a frame interpolator at the receiver's end to improve the
compression levels further. Finally, a face enhancement network improves
reconstruction quality, significantly improving several aspects like the
sharpness of the generations. We evaluate our method both qualitatively and
quantitatively on benchmark datasets and compare it with multiple compression
techniques. We release a demo video and additional information at
https://cvit.iiit.ac.in/research/projects/cvit-projects/talking-video-compression.
Comment: British Machine Vision Conference (BMVC), 202
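As a rough illustration of why transmitting key points instead of full frames compresses so well, consider raw byte counts (illustrative numbers only; the actual system also entropy-codes the pivot frames and key points, and the pivot interval is chosen adaptively):

```python
# Back-of-the-envelope bandwidth comparison: full frames vs. pivot
# frames plus per-frame key points. Frame size, key-point count, and
# pivot interval below are hypothetical, not the paper's settings.

def frame_bytes(width, height, channels=3):
    """Raw size of one uncompressed frame."""
    return width * height * channels

def keypoint_bytes(num_keypoints, bytes_per_coord=4):
    """Size of one frame's key points: an (x, y) float pair per point."""
    return num_keypoints * 2 * bytes_per_coord

def stream_bytes(num_frames, pivot_interval, w, h, k):
    """Pivot frames are sent in full; every other frame as key points."""
    pivots = (num_frames + pivot_interval - 1) // pivot_interval
    non_pivots = num_frames - pivots
    return pivots * frame_bytes(w, h) + non_pivots * keypoint_bytes(k)

full = 300 * frame_bytes(256, 256)                          # send every frame raw
ours = stream_bytes(300, pivot_interval=30, w=256, h=256, k=10)
ratio = full / ours                                         # roughly 30x smaller
```

Even with one full pivot frame every 30 frames, the key-point stream is an order of magnitude smaller than sending raw frames, which is where the headroom for the reenactment-based codec comes from.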
A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild
In this work, we investigate the problem of lip-syncing a talking face video
of an arbitrary identity to match a target speech segment. Current works excel
at producing accurate lip movements on a static image or videos of specific
people seen during the training phase. However, they fail to accurately morph
the lip movements of arbitrary identities in dynamic, unconstrained talking
face videos, resulting in significant parts of the video being out-of-sync with
the new audio. We identify the key reasons for this failure and resolve
them by learning from a powerful lip-sync discriminator. Next, we propose new,
rigorous evaluation benchmarks and metrics to accurately measure lip
synchronization in unconstrained videos. Extensive quantitative evaluations on
our challenging benchmarks show that the lip-sync accuracy of the videos
generated by our Wav2Lip model is almost as good as real synced videos. We
provide a demo video clearly showing the substantial impact of our Wav2Lip
model and evaluation benchmarks on our website:
\url{cvit.iiit.ac.in/research/projects/cvit-projects/a-lip-sync-expert-is-all-you-need-for-speech-to-lip-generation-in-the-wild}.
The code and models are released at this GitHub repository:
\url{github.com/Rudrabha/Wav2Lip}. You can also try out the interactive demo at
this link: \url{bhaasha.iiit.ac.in/lipsync}.
Comment: 9 pages (including references), 3 figures, Accepted in ACM Multimedia, 202
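The core of such a lip-sync discriminator can be sketched in a few lines: score the agreement between a video-window embedding and an audio-window embedding, and penalize in-sync pairs with low scores. This is a minimal SyncNet-style sketch with toy embedding vectors, not the authors' actual network or training code:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def sync_loss(video_emb, audio_emb, eps=1e-7):
    """Binary-cross-entropy-style loss for a pair labelled in-sync:
    cosine similarity is mapped to a probability in [0, 1], and the
    loss is small only when the two embeddings agree."""
    p = (cosine(video_emb, audio_emb) + 1) / 2
    return -math.log(p + eps)
```

During training, real (in-sync) audio-video windows are pushed toward low loss and deliberately offset windows toward high loss; the generator is then trained against this frozen expert.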
DualLip: A System for Joint Lip Reading and Generation
Lip reading aims to recognize text from talking lips, while lip generation
aims to synthesize talking lips from text; lip generation is a key component of
talking face generation and is the dual task of lip reading. In this paper, we
develop DualLip, a system that jointly improves lip reading and generation by
leveraging the task duality and using unlabeled text and lip video data. The
key ideas of DualLip are: 1) Generate lip video from unlabeled text
with a lip generation model, and use the pseudo pairs to improve lip reading;
2) Generate text from unlabeled lip video with a lip reading model, and use the
pseudo pairs to improve lip generation. We further extend DualLip to talking
face generation with two additionally introduced components: lip to face
generation and text to speech generation. Experiments on GRID and TCD-TIMIT
demonstrate the effectiveness of DualLip on improving lip reading, lip
generation, and talking face generation by utilizing unlabeled data.
Specifically, the lip generation model in our DualLip system trained with
only 10% of the paired data surpasses the performance of the model trained with
the whole paired data. On the GRID lip reading benchmark, we achieve a 1.16%
character error rate and a 2.71% word error rate, outperforming
state-of-the-art models that use the same amount of paired data.
Comment: Accepted by ACM Multimedia 202
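The two pseudo-pair directions described above can be sketched as a single training round. The `lip_reader` and `lip_generator` callables here are hypothetical stand-ins for the two trained models, shown only to make the data flow of the duality concrete:

```python
# Toy sketch of DualLip's dual pseudo-labelling round. Each model
# labels the other's unlabeled data, producing extra training pairs.

def dual_training_round(lip_reader, lip_generator,
                        paired, unlabeled_text, unlabeled_video):
    # 1) text -> synthetic lip video: extra (video, text) pairs
    #    that augment the training set of the lip READER.
    pseudo_for_reader = [(lip_generator(t), t) for t in unlabeled_text]
    # 2) lip video -> pseudo text: extra (video, text) pairs
    #    that augment the training set of the lip GENERATOR.
    pseudo_for_generator = [(v, lip_reader(v)) for v in unlabeled_video]
    return paired + pseudo_for_reader, paired + pseudo_for_generator
```

In the real system, each model is then retrained on its augmented set and the round repeats, so improvements in one direction feed the other.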
Revisiting Low Resource Status of Indian Languages in Machine Translation
Indian language machine translation performance is hampered due to the lack
of large scale multi-lingual sentence aligned corpora and robust benchmarks.
Through this paper, we provide and analyse an automated framework to obtain
such a corpus for Indian language neural machine translation (NMT) systems. Our
pipeline consists of a baseline NMT system, a retrieval module, and an
alignment module that is used to work with publicly available websites such as
press releases by the government. Our main contribution is an incremental
method that uses this pipeline to iteratively grow the corpus while improving
each component of the system. We also evaluate design choices such as the
choice of pivot language and the effect of incrementally increasing the corpus
size. Besides providing an automated framework, our work produces a corpus
larger than the existing corpora available for Indian languages. This corpus
yields substantially improved results on the publicly available WAT evaluation
benchmark and other standard benchmarks.
Comment: 10 pages, few figures, Preprint under review
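One pass of the retrieval-and-alignment stage can be sketched as score-and-threshold sentence mining. The `score` function here is a hypothetical placeholder for the paper's actual alignment module (which uses the baseline NMT system to compare candidates):

```python
# Toy sketch of one corpus-mining pass: for each source sentence,
# retrieve the best-scoring target candidate and keep the pair only
# if it clears an alignment threshold. Mined pairs then retrain the
# baseline system, and the pass repeats with better scores.

def mine_parallel(score, src_sents, tgt_sents, threshold=0.8):
    """Greedy one-pass mining of sentence pairs above `threshold`."""
    pairs = []
    for s in src_sents:
        best = max(tgt_sents, key=lambda t: score(s, t))
        if score(s, best) >= threshold:
            pairs.append((s, best))
    return pairs
```

The incremental aspect of the framework comes from looping this: a larger mined corpus yields a stronger baseline NMT system, which in turn scores candidates more reliably on the next pass.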
Visual speech enhancement without a real visual stream
In this work, we re-think the task of speech enhancement in unconstrained real-world environments. Current state-of-the-art methods use only the audio stream and are limited in their performance in a wide range of real-world noises. Recent works using lip movements as additional cues improve the quality of generated speech over "audio-only" methods. But these methods cannot be used in several applications where the visual stream is unreliable or completely absent. We propose a new paradigm for speech enhancement by exploiting recent breakthroughs in speech-driven lip synthesis. Using one such model as a teacher network, we train a robust student network to produce accurate lip movements that mask away the noise, thus acting as a "visual noise filter". The intelligibility of the speech enhanced by our pseudo-lip approach is comparable (< 3% difference) to the case of using real lips. This implies that we can exploit the advantages of using lip movements even in the absence of a real video stream. We rigorously evaluate our model using quantitative metrics as well as human evaluations. Additional ablation studies and a demo video on our website containing qualitative comparisons and results clearly illustrate the effectiveness of our approach.
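The teacher-student step described above reduces, at its core, to a regression loss between the two models' lip outputs. This is a minimal sketch over flat lists of lip-landmark coordinates (a hypothetical representation; the actual networks operate on video frames and spectrograms):

```python
# Distillation sketch: the student sees noisy audio, the teacher sees
# clean speech; the student is trained to match the teacher's lips.

def lip_distillation_loss(student_lips, teacher_lips):
    """Mean absolute error between student-predicted lip coordinates
    (from noisy audio) and teacher-predicted ones (from clean audio)."""
    n = len(student_lips)
    return sum(abs(s - t) for s, t in zip(student_lips, teacher_lips)) / n
```

Driving this loss to zero is what lets the pseudo-lips stand in for a real visual stream at enhancement time.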
NTIRE 2018 Challenge on Spectral Reconstruction from RGB Images
This paper reviews the first challenge on spectral image reconstruction from RGB images, i.e., the recovery of whole-scene hyperspectral (HS) information from a 3-channel RGB image. The challenge was divided into 2 tracks: the “Clean” track sought HS recovery from noiseless RGB images obtained from a known response function (representing a spectrally-calibrated camera), while the “Real World” track challenged participants to recover HS cubes from JPEG-compressed RGB images generated by an unknown response function. To facilitate the challenge, the BGU Hyperspectral Image Database was extended to provide participants with 256 natural HS training images, and 5+10 additional images for validation and testing, respectively. The “Clean” and “Real World” tracks had 73 and 63 registered participants respectively, with 12 teams competing in the final testing phase. Proposed methods and their corresponding results are reported in this review.
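The forward model that both tracks invert is a linear projection of each hyperspectral pixel through the camera response. A toy per-pixel sketch (using 3 bands for brevity; the real data has tens of spectral bands, and the response matrix is known only in the “Clean” track):

```python
# Forward model of HS -> RGB: each RGB channel is a response-weighted
# sum over the spectral bands of one pixel. Spectral reconstruction
# is the (ill-posed) inverse of this mapping.

def hs_to_rgb(hs_pixel, response):
    """Project one hyperspectral pixel (a list of N band intensities)
    to RGB using a 3xN camera response matrix."""
    return [sum(r * h for r, h in zip(row, hs_pixel)) for row in response]
```

Because a 3xN matrix collapses N bands to 3 numbers, many spectra map to the same RGB triple, which is why learned priors over natural spectra are needed to recover the HS cube.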
NTIRE 2019 Challenge on Video Super-Resolution: Methods and Results
This paper reviews the first NTIRE challenge on video super-resolution (restoration of rich details in low-resolution video frames) with focus on proposed solutions and results. A new REalistic and Diverse Scenes dataset (REDS) was employed. The challenge was divided into 2 tracks. Track 1 employed standard bicubic downscaling setup while Track 2 had realistic dynamic motion blurs.
The two tracks had 124 and 104 registered participants, respectively, and 14 teams competed in the final testing phase. The results gauge the state-of-the-art in video super-resolution.
NTIRE 2018 Challenge on Single Image Super-Resolution: Methods and Results
This paper reviews the 2nd NTIRE challenge on single image super-resolution (restoration of rich details in a low-resolution image) with focus on proposed solutions and results. The challenge had 4 tracks. Track 1 employed the standard bicubic downscaling setup, while Tracks 2, 3 and 4 had realistic unknown downgrading operators simulating the camera image acquisition pipeline. The operators were learnable through provided pairs of low- and high-resolution training images. The tracks had 145, 114, 101, and 113 registered participants, respectively, and 31 teams competed in the final testing phase. The results gauge the state-of-the-art in single image super-resolution.
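Super-resolution challenges like these rank entries primarily by fidelity metrics such as PSNR. A minimal implementation over flattened pixel lists (the official evaluations typically also crop image borders and report SSIM alongside PSNR):

```python
import math

def psnr(ref, out, peak=255.0):
    """Peak signal-to-noise ratio between a reference image and a
    reconstruction, both given as flat lists of pixel values."""
    mse = sum((a - b) ** 2 for a, b in zip(ref, out)) / len(ref)
    if mse == 0:
        return float("inf")  # identical images
    return 10 * math.log10(peak * peak / mse)
```

Higher is better; a one-intensity-level average error on 8-bit images already corresponds to roughly 48 dB, which is why leaderboard gaps of fractions of a dB are meaningful.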