325 research outputs found
Hierarchical Cross-Modal Talking Face Generation with Dynamic Pixel-Wise Loss
We devise a cascade GAN approach to generate talking face video, which is
robust to different face shapes, view angles, facial characteristics, and noisy
audio conditions. Instead of learning a direct mapping from audio to video
frames, we propose first to transfer audio to high-level structure, i.e., the
facial landmarks, and then to generate video frames conditioned on the
landmarks. Compared to a direct audio-to-image approach, our cascade approach
avoids fitting spurious correlations between audiovisual signals that are
irrelevant to the speech content. We humans are sensitive to temporal
discontinuities and subtle artifacts in video. To avoid such pixel-jittering
problems and to force the network to focus on audiovisual-correlated regions,
we propose a novel dynamically adjustable pixel-wise loss with an attention
mechanism. Furthermore, to generate a sharper image with well-synchronized
facial movements, we propose a novel regression-based discriminator structure,
which considers sequence-level information along with frame-level information.
Thorough experiments on several datasets and real-world samples demonstrate
significantly better results obtained by our method than the state-of-the-art
methods in both quantitative and qualitative comparisons.
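The dynamically adjustable pixel-wise loss described above can be sketched as an attention-weighted L1 reconstruction loss: errors in regions the attention map marks as audiovisual-correlated (e.g. the mouth) dominate the loss, while static background pixels are down-weighted. This is a minimal numpy sketch under that assumption, not the paper's exact formulation; the shapes, normalisation, and toy frames below are illustrative.

```python
import numpy as np

def attention_pixel_loss(pred, target, attention):
    """Pixel-wise L1 loss reweighted by a per-pixel attention map.

    The attention map is assumed to lie in [0, 1] and to be high where
    audio and video are strongly correlated (e.g. the mouth region), so
    errors there dominate while background pixels contribute little.
    """
    weights = attention / (attention.sum() + 1e-8)  # normalise to a distribution
    return float((weights * np.abs(pred - target)).sum())

# Toy 4x4 frames: the prediction is wrong only in a "mouth" region.
target = np.ones((4, 4))
pred = np.ones((4, 4))
pred[2:, 1:3] = 0.0                      # error confined to the mouth patch

att_mouth = np.zeros((4, 4))
att_mouth[2:, 1:3] = 1.0                 # attention concentrated on the mouth
att_flat = np.ones((4, 4))               # uniform attention, for comparison

print(attention_pixel_loss(pred, target, att_mouth))  # ~1.0
print(attention_pixel_loss(pred, target, att_flat))   # 0.25
```

With attention concentrated on the erroneous region the loss is four times larger than with uniform weighting, which is the intended effect: the generator is penalised most where synchronization matters.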
Alternative visual units for an optimized phoneme-based lipreading system
Lipreading is understanding speech from observed lip movements. An observed series of lip motions is an ordered sequence of visual lip gestures. These gestures are commonly known, but as yet not formally defined, as 'visemes'. In this article, we describe a structured approach which allows us to create speaker-dependent viseme sets with a fixed number of visemes in each set. We create sets of visemes of sizes two to 45. Each set of visemes is based upon clustering phonemes, so each set has a unique phoneme-to-viseme mapping. We first present an experiment using these maps and the Resource Management Audio-Visual (RMAV) dataset which shows the effect of changing the viseme map size in speaker-dependent machine lipreading, and demonstrate that word recognition with phoneme classifiers is possible. Furthermore, we show that there are intermediate units between visemes and phonemes which are better still. Second, we present a novel two-pass training scheme for phoneme classifiers. In the first pass, this approach trains classifiers on the new intermediary visual units from our first experiment; then, using the phoneme-to-viseme maps, we retrain these as phoneme classifiers. This method significantly improves on previous lipreading results with RMAV speakers.
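The phoneme-to-viseme mapping at the heart of this abstract can be illustrated with a small sketch. The cluster assignments below are made up for the example (the article derives its maps per speaker by clustering phonemes); it only shows the many-to-one structure that makes viseme-based lipreading lossy.

```python
# Illustrative phoneme-to-viseme map: several phonemes that look alike
# on the lips collapse to one viseme class (labels V1..V4 are invented).
PHONEME_TO_VISEME = {
    "p": "V1", "b": "V1", "m": "V1",          # bilabials share one lip gesture
    "f": "V2", "v": "V2",                     # labiodentals
    "t": "V3", "d": "V3", "s": "V3", "z": "V3",
    "aa": "V4", "ae": "V4",
}

def to_visemes(phonemes):
    """Collapse a phoneme sequence to its viseme sequence.

    Distinct phonemes mapping to the same viseme is exactly the
    ambiguity a lipreading system must cope with.
    """
    return [PHONEME_TO_VISEME[p] for p in phonemes]

# 'bat' and 'mat' become visually identical after the mapping:
print(to_visemes(["b", "ae", "t"]))  # ['V1', 'V4', 'V3']
print(to_visemes(["m", "ae", "t"]))  # ['V1', 'V4', 'V3']
```

Varying the number of clusters from two to 45, as the article does, moves the map along a spectrum from very coarse (few, highly ambiguous visemes) to nearly one-to-one with phonemes.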
Slanted Stixels: A way to represent steep streets
This work presents and evaluates a novel compact scene representation based
on Stixels that infers geometric and semantic information. Our approach
overcomes the previous rather restrictive geometric assumptions for Stixels by
introducing a novel depth model to account for non-flat roads and slanted
objects. Both semantic and depth cues are used jointly to infer the scene
representation in a sound global energy minimization formulation.
Furthermore, a novel approximation scheme is introduced in order to
significantly reduce the computational complexity of the Stixel algorithm, and
then achieve real-time computation capabilities. The idea is to first perform
an over-segmentation of the image, discarding the unlikely Stixel cuts, and
apply the algorithm only on the remaining Stixel cuts. This work presents a
novel over-segmentation strategy based on a Fully Convolutional Network (FCN),
which outperforms an approach based on using local extrema of the disparity
map.
We evaluate the proposed methods in terms of semantic and geometric accuracy
as well as run-time on four publicly available benchmark datasets. Our approach
maintains accuracy on flat road scene datasets while improving substantially on
a novel non-flat road dataset.
Comment: Journal preprint (published in IJCV 2019: https://link.springer.com/article/10.1007/s11263-019-01226-9). arXiv admin note: text overlap with arXiv:1707.0539
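The approximation scheme in this abstract, pruning unlikely Stixel cuts before running the expensive optimisation, can be sketched as simple thresholding of per-row cut scores. The scores and threshold below are illustrative stand-ins for the FCN-based over-segmentation, not values from the paper.

```python
import numpy as np

def prune_stixel_cuts(cut_scores, threshold=0.5):
    """Keep only the row indices whose cut score passes a threshold.

    cut_scores stands in for the FCN over-segmentation output for one
    image column: the likelihood that a Stixel boundary lies at each row.
    The Stixel energy minimisation then runs only over the survivors,
    rather than over every row of the column.
    """
    return np.flatnonzero(cut_scores >= threshold)

# Per-row "cut likelihood" for a 10-row column.
scores = np.array([0.9, 0.1, 0.05, 0.8, 0.2, 0.1, 0.95, 0.1, 0.6, 0.1])
kept = prune_stixel_cuts(scores)
print(kept)                              # [0 3 6 8]
print(len(kept), "of", len(scores), "candidate cuts kept")
```

Because the Stixel dynamic program scales superlinearly in the number of candidate cuts, discarding most rows up front is what makes the real-time computation claimed in the abstract plausible.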