78 research outputs found
Hierarchical Cross-Modal Talking Face Generation with Dynamic Pixel-Wise Loss
We devise a cascade GAN approach to generate talking face video, which is
robust to different face shapes, view angles, facial characteristics, and noisy
audio conditions. Instead of learning a direct mapping from audio to video
frames, we propose first to transfer audio to high-level structure, i.e., the
facial landmarks, and then to generate video frames conditioned on the
landmarks. Compared to a direct audio-to-image approach, our cascade approach
avoids fitting spurious correlations between audiovisual signals that are
irrelevant to the speech content. Humans are sensitive to temporal
discontinuities and subtle artifacts in video. To avoid such pixel-jittering
problems and to force the network to focus on audiovisual-correlated regions,
we propose a novel dynamically adjustable pixel-wise loss with an attention
mechanism. Furthermore, to generate a sharper image with well-synchronized
facial movements, we propose a novel regression-based discriminator structure,
which considers sequence-level information along with frame-level information.
Thorough experiments on several datasets and real-world samples demonstrate
that our method yields significantly better results than state-of-the-art
methods in both quantitative and qualitative comparisons.
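The dynamically adjustable pixel-wise loss described above can be sketched as an attention-weighted L1 reconstruction loss. The weighting scheme, array shapes, and function name below are illustrative assumptions, not the authors' exact formulation:

```python
import numpy as np

def attention_pixel_loss(pred, target, attention):
    """Attention-weighted pixel-wise L1 loss (illustrative sketch).

    pred, target: float arrays of shape (H, W, C), values in [0, 1]
    attention:    float array of shape (H, W); assumed to be higher where
                  pixels correlate with the audio (e.g. the mouth region)
    """
    # Normalize attention so the weights sum to 1 over the image.
    weights = attention / (attention.sum() + 1e-8)
    # Per-pixel L1 error, averaged over color channels.
    per_pixel = np.abs(pred - target).mean(axis=-1)
    return float((weights * per_pixel).sum())
```

With a uniform attention map this reduces to ordinary mean L1; concentrating the attention on the mouth region penalizes lip-sync errors more heavily than background jitter.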
On the Feasibility of Real-Time 3D Hand Tracking using Edge GPGPU Acceleration
This paper presents the case study of a non-intrusive porting of a monolithic
C++ library for real-time 3D hand tracking, to the domain of edge-based
computation. Towards a proof of concept, the case study considers a pair of
workstations, a computationally powerful and a computationally weak one. By
wrapping the C++ library in a Java container and by capitalizing on a Java-based
offloading infrastructure that supports both CPU and GPGPU computations, we are
able to automatically establish the server-client workflow that best
addresses the resource-allocation problem when executing from the weak
workstation. As a result, the weak workstation performs well at the
task, despite lacking sufficient hardware to do the required computations
locally. This is achieved by offloading the computations that rely on GPGPU
to the powerful workstation across the network that connects them. We show the
edge-based computation challenges associated with the information flow of the
ported algorithm, demonstrate how we cope with them, and identify what needs to
be improved for achieving even better performance.Comment: 6 pages, 5 figure
Generic Tubelet Proposals for Action Localization
We develop a novel framework for action localization in videos. We propose
the Tube Proposal Network (TPN), which can generate generic, class-independent,
video-level tubelet proposals in videos. The generated tubelet proposals can be
utilized in various video analysis tasks, including recognizing and localizing
actions in videos. In particular, we integrate these generic tubelet proposals
into a unified temporal deep network for action classification. Compared with
other methods, our generic tubelet proposal method is accurate, general, and
fully differentiable under a smooth L1 loss function. We demonstrate the
performance of our algorithm on the standard UCF-Sports, J-HMDB21, and UCF-101
datasets. Our class-independent TPN outperforms other tubelet generation
methods, and our unified temporal deep network achieves state-of-the-art
localization results on all three datasets.
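The smooth L1 loss under which the proposals are differentiable is the standard Huber-style loss used for box regression: quadratic near zero (stable gradients) and linear for large errors (robust to outliers). The `beta` parameterization below is a common convention, assumed here rather than taken from the paper:

```python
import numpy as np

def smooth_l1(x, beta=1.0):
    """Smooth L1 loss: 0.5*x^2/beta for |x| < beta, |x| - 0.5*beta otherwise.

    Differentiable everywhere (including at |x| = beta), which is what allows
    proposal regression to be trained end-to-end by gradient descent.
    """
    ax = np.abs(x)
    return np.where(ax < beta, 0.5 * ax ** 2 / beta, ax - 0.5 * beta)
```

For example, an error of 0.5 incurs the quadratic branch (0.125) while an error of 2.0 incurs the linear branch (1.5), so large localization errors grow only linearly in the loss.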
Slanted Stixels: A way to represent steep streets
This work presents and evaluates a novel compact scene representation based
on Stixels that infers geometric and semantic information. Our approach
overcomes the previous rather restrictive geometric assumptions for Stixels by
introducing a novel depth model to account for non-flat roads and slanted
objects. Both semantic and depth cues are used jointly to infer the scene
representation in a sound global energy minimization formulation.
Furthermore, a novel approximation scheme is introduced that significantly
reduces the computational complexity of the Stixel algorithm, achieving
real-time computation. The idea is to first perform an over-segmentation of
the image, discard the unlikely Stixel cuts, and apply the algorithm only to
the remaining cuts. This work presents a
novel over-segmentation strategy based on a Fully Convolutional Network (FCN),
which outperforms an approach based on using local extrema of the disparity
map.
We evaluate the proposed methods in terms of semantic and geometric accuracy
as well as run-time on four publicly available benchmark datasets. Our approach
maintains accuracy on flat road scene datasets while improving substantially on
a novel non-flat road dataset.
Comment: Journal preprint (published in IJCV 2019:
https://link.springer.com/article/10.1007/s11263-019-01226-9). arXiv admin
note: text overlap with arXiv:1707.0539
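The approximation scheme above amounts to pruning the set of candidate cut rows per image column before running the (otherwise expensive) Stixel optimization. A minimal sketch of that pruning step, with an invented function name and a per-column score vector standing in for the FCN or disparity-extrema output:

```python
import numpy as np

def prune_cuts(scores, keep=8):
    """Keep only the most promising Stixel cut rows in one image column.

    scores: per-row cut likelihood for a column (e.g. from an FCN, or from
            local extrema of the disparity map).
    The full Stixel dynamic program scales with the number of candidate
    rows, so restricting it to the top-`keep` candidates is the source of
    the speed-up described above. Illustrative only.
    """
    keep = min(keep, len(scores))
    # Indices of the `keep` highest-scoring rows, returned in row order.
    idx = np.argpartition(scores, -keep)[-keep:]
    return np.sort(idx)
```

The downstream energy minimization then considers only these rows as possible Stixel boundaries instead of every image row.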
- …