30 research outputs found
Hierarchical Cross-Modal Talking Face Generation with Dynamic Pixel-Wise Loss
We devise a cascade GAN approach to generate talking face video, which is
robust to different face shapes, view angles, facial characteristics, and noisy
audio conditions. Instead of learning a direct mapping from audio to video
frames, we propose first to transfer audio to high-level structure, i.e., the
facial landmarks, and then to generate video frames conditioned on the
landmarks. Compared to a direct audio-to-image approach, our cascade approach
avoids fitting spurious correlations between audiovisual signals that are
irrelevant to the speech content. Humans are sensitive to temporal
discontinuities and subtle artifacts in video. To avoid such pixel-jittering
problems and to force the network to focus on audiovisual-correlated regions,
we propose a novel dynamically adjustable pixel-wise loss with an attention
mechanism. Furthermore, to generate a sharper image with well-synchronized
facial movements, we propose a novel regression-based discriminator structure,
which considers sequence-level information along with frame-level information.
Thorough experiments on several datasets and real-world samples demonstrate
that our method obtains significantly better results than state-of-the-art
methods in both quantitative and qualitative comparisons.
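The dynamically adjustable pixel-wise loss can be sketched as an attention-weighted L1 penalty; the normalization scheme and the toy attention map below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def attention_pixel_loss(pred, target, attention):
    """Pixel-wise L1 loss reweighted by an attention map so that
    audiovisual-correlated regions (e.g. the mouth) dominate the loss."""
    # Normalize the attention map into a weighting over pixels.
    w = attention / (attention.sum() + 1e-8)
    return float(np.sum(w * np.abs(pred - target)))

# Toy example: the lower face (mouth region) gets higher attention.
pred = np.zeros((4, 4))
target = np.ones((4, 4))
attention = np.ones((4, 4))
attention[2:, :] = 4.0  # emphasize the bottom half
loss = attention_pixel_loss(pred, target, attention)
```

With this weighting, an error of the same magnitude costs four times as much in the high-attention region as in the low-attention one, which is the intended focusing effect.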
DocDeshadower: Frequency-aware Transformer for Document Shadow Removal
The presence of shadows significantly impacts the visual quality of scanned
documents. However, the existing traditional techniques and deep learning
methods used for shadow removal have several limitations. These methods either
rely heavily on heuristics, resulting in suboptimal performance, or require
large datasets to learn shadow-related features. In this study, we propose the
DocDeshadower, a multi-frequency Transformer-based model built on the Laplacian
pyramid. DocDeshadower is designed to remove shadows at different frequencies
in a coarse-to-fine manner. To achieve this, we decompose the shadow image into
different frequency bands using a Laplacian pyramid. In addition, we introduce
two novel components to this model: the Attention-Aggregation Network and the
Gated Multi-scale Fusion Transformer. The Attention-Aggregation Network is
designed to remove shadows in the low-frequency part of the image, whereas the
Gated Multi-scale Fusion Transformer refines the entire image at a global scale
with its large receptive field. Our extensive experiments demonstrate that
DocDeshadower outperforms the current state-of-the-art methods in both
qualitative and quantitative terms.
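The Laplacian pyramid decomposition that underlies this coarse-to-fine design can be sketched as follows; the simple 2x2 average-pooling downsampler and nearest-neighbour upsampler are stand-ins for the Gaussian filtering a production pipeline would use.

```python
import numpy as np

def downsample(img):
    # 2x2 average pooling as a simple low-pass filter plus decimation.
    h, w = img.shape
    return img[:h // 2 * 2, :w // 2 * 2].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample(img, shape):
    # Nearest-neighbour expansion back to the target size.
    up = img.repeat(2, axis=0).repeat(2, axis=1)
    return up[:shape[0], :shape[1]]

def laplacian_pyramid(img, levels):
    """Decompose an image into band-pass levels plus a low-frequency residual."""
    bands, cur = [], img
    for _ in range(levels):
        low = downsample(cur)
        bands.append(cur - upsample(low, cur.shape))  # high-frequency band
        cur = low
    bands.append(cur)  # low-frequency residual
    return bands

def reconstruct(bands):
    """Invert the decomposition by adding each band back in, coarse to fine."""
    cur = bands[-1]
    for band in reversed(bands[:-1]):
        cur = upsample(cur, band.shape) + band
    return cur

img = np.arange(64, dtype=float).reshape(8, 8)
bands = laplacian_pyramid(img, 2)
rec = reconstruct(bands)
```

Because each band stores exactly the detail lost by its downsampling step, the reconstruction is lossless, which is what lets a model edit the low-frequency band (where soft shadows live) and reassemble the image without degrading fine text.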
Lip Reading for Low-resource Languages by Learning and Combining General Speech Knowledge and Language-specific Knowledge
This paper proposes a novel lip reading framework, especially for
low-resource languages, which have not been well addressed in the previous
literature. Because low-resource languages lack enough video-text paired
data to train a model powerful enough to capture both lip movements and
language, developing lip reading models for them is regarded as challenging.
To mitigate this challenge, we try to learn
general speech knowledge, the ability to model lip movements, from a
high-resource language through the prediction of speech units. It is known that
different languages partially share common phonemes, and thus general speech
knowledge learned from one language can be extended to other languages. Then,
we try to learn language-specific knowledge, the ability to model language, by
proposing Language-specific Memory-augmented Decoder (LMDecoder). LMDecoder
saves language-specific audio features into memory banks and can be trained on
audio-text paired data which is more easily accessible than video-text paired
data. Therefore, with LMDecoder, we can transform the input speech units into
language-specific audio features and translate them into texts by utilizing the
learned rich language knowledge. Finally, by combining general speech knowledge
and language-specific knowledge, we can efficiently develop lip reading models
even for low-resource languages. Through extensive experiments on five
languages (English, Spanish, French, Italian, and Portuguese), the
effectiveness of the proposed method is evaluated.
Comment: Accepted at ICCV 202
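The memory read in a memory-augmented decoder like LMDecoder can be sketched as scaled dot-product attention over stored audio features; the slot count, feature dimension, and random contents below are illustrative placeholders, not the trained memory banks.

```python
import numpy as np

rng = np.random.default_rng(0)

class MemoryBank:
    """Toy sketch of a language-specific memory: a speech-unit query attends
    over stored audio-feature values. Shapes and contents are illustrative."""
    def __init__(self, slots, dim):
        self.keys = rng.normal(size=(slots, dim))
        self.values = rng.normal(size=(slots, dim))

    def read(self, query):
        # Scaled dot-product attention over memory slots.
        scores = self.keys @ query / np.sqrt(query.size)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ self.values  # retrieved language-specific audio feature

bank = MemoryBank(slots=16, dim=8)
feat = bank.read(rng.normal(size=8))
```

The practical point of the design is visible here: the memory is keyed and filled from audio-text pairs, so video-text data is only needed for the (shared) speech-unit predictor, not for learning the language itself.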
Efficient Region-Aware Neural Radiance Fields for High-Fidelity Talking Portrait Synthesis
This paper presents ER-NeRF, a novel conditional Neural Radiance Fields
(NeRF) based architecture for talking portrait synthesis that can concurrently
achieve fast convergence, real-time rendering, and state-of-the-art performance
with small model size. Our idea is to explicitly exploit the unequal
contribution of spatial regions to guide talking portrait modeling.
Specifically, to improve the accuracy of dynamic head reconstruction, a compact
and expressive NeRF-based Tri-Plane Hash Representation is introduced by
pruning empty spatial regions with three planar hash encoders. For speech
audio, we propose a Region Attention Module to generate a region-aware condition
feature via an attention mechanism. Unlike existing methods that
utilize an MLP-based encoder to learn the cross-modal relation implicitly, the
attention mechanism builds an explicit connection between audio features and
spatial regions to capture the priors of local motions. Moreover, a direct and
fast Adaptive Pose Encoding is introduced to optimize the head-torso separation
problem by mapping the complex transformation of the head pose into spatial
coordinates. Extensive experiments demonstrate that our method renders
high-fidelity, audio-lip-synchronized talking portrait videos with realistic
details and higher efficiency than previous methods.
Comment: Accepted by ICCV 202
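The idea of an explicit audio-to-region connection can be sketched as per-region gating of the audio feature; the dot-product scoring, sigmoid gate, and all shapes below are illustrative assumptions rather than ER-NeRF's actual module.

```python
import numpy as np

def region_aware_condition(audio_feat, region_feats):
    """Sketch of a region attention idea: each spatial region's feature scores
    the audio feature, and the resulting gate modulates the audio condition
    per region, instead of one implicit shared MLP encoding for all space."""
    scores = region_feats @ audio_feat / np.sqrt(audio_feat.size)
    gates = 1.0 / (1.0 + np.exp(-scores))          # per-region gate in (0, 1)
    return gates[:, None] * audio_feat[None, :]    # (regions, dim) condition

rng = np.random.default_rng(0)
cond = region_aware_condition(rng.normal(size=16), rng.normal(size=(6, 16)))
```

The output gives each spatial region its own scaled copy of the audio condition, so regions whose motion is uncorrelated with speech (e.g. the torso) can learn to suppress it.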
PL-UNeXt: Per-stage Edge Detail and Line Feature Guided Segmentation for Power Line Detection
Power line detection is a critical inspection task for electricity companies
and is also useful in avoiding drone obstacles. Accurately separating power
lines from the surrounding area in the aerial image is still challenging due to
the intricate background and low pixel ratio. To properly exploit the guidance
of the spatial edge-detail prior and line features, we propose PL-UNeXt, a
power line segmentation model with a booster training strategy. We design edge
detail heads that compute the loss in edge space to guide lower-level detail
learning, and line feature heads that generate auxiliary segmentation masks to
supervise higher-level line feature learning. Benefiting from this design, our
model reaches a 70.6 F1 score (+1.9%) on TTPLA and 68.41 mIoU (+5.2%) on VITL
(without utilizing IR images), while preserving real-time performance thanks to
its small number of inference parameters.
Comment: Accepted to IEEE ICIP 202
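A booster objective of this shape, where auxiliary heads add supervised losses during training only, can be sketched as a weighted sum; the binary-cross-entropy terms and the loss weights are illustrative assumptions, not PL-UNeXt's exact formulation.

```python
import numpy as np

def bce(pred, target, eps=1e-7):
    """Binary cross-entropy over probability maps."""
    pred = np.clip(pred, eps, 1 - eps)
    return float(-np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred)))

def booster_loss(seg_pred, seg_gt, edge_pred, edge_gt, aux_pred,
                 w_edge=0.5, w_aux=0.5):
    """Sketch of a booster training objective: the main segmentation loss plus
    an edge-detail loss (low-level supervision) and an auxiliary-mask loss
    (high-level line-feature supervision). Weights are placeholders."""
    return (bce(seg_pred, seg_gt)
            + w_edge * bce(edge_pred, edge_gt)
            + w_aux * bce(aux_pred, seg_gt))

seg_gt = np.array([[1.0, 0.0], [0.0, 1.0]])
edge_gt = np.array([[0.0, 1.0], [1.0, 0.0]])
loss = booster_loss(seg_gt, seg_gt, edge_gt, edge_gt, seg_gt)
```

At inference only the main segmentation head runs, which is why the auxiliary heads add supervision without adding inference parameters.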
An Implementation of Multimodal Fusion System for Intelligent Digital Human Generation
With the rapid development of artificial intelligence (AI), digital humans
have attracted more and more attention and are expected to achieve a wide range
of applications in several industries. However, most existing digital
humans still rely on manual modeling by designers, which is a cumbersome
process and has a long development cycle. Therefore, facing the rise of digital
humans, there is an urgent need for a digital human generation system combined
with AI to improve development efficiency. In this paper, an implementation
scheme of an intelligent digital human generation system with multimodal fusion
is proposed. Specifically, text, speech and image are taken as inputs, and
interactive speech is synthesized using a large language model (LLM), voiceprint
extraction, and text-to-speech techniques. The input image is then
age-transformed and a suitable image is selected as the driving image. Next,
the modification and generation of digital human video content is realized by
digital human driving, novel view synthesis, and intelligent dressing
techniques. Finally, we enhance the user experience through style transfer,
super-resolution, and quality evaluation. Experimental results show that the
system can effectively realize digital human generation. The related code is
released at https://github.com/zyj-2000/CUMT_2D_PhotoSpeaker.
SDR-GAIN: A High Real-Time Occluded Pedestrian Pose Completion Method for Autonomous Driving
To mitigate the challenges arising from partial occlusion in human pose
keypoint-based pedestrian detection methods, we present a novel pedestrian
pose keypoint completion method called separation and dimensionality
reduction-based generative adversarial imputation networks (SDR-GAIN).
Firstly, we utilize OpenPose to estimate pedestrian poses in images. Then, we
isolate the head and torso keypoints of pedestrians with incomplete keypoints
due to occlusion or other factors and perform dimensionality reduction to
enhance features and further unify feature distribution. Finally, we introduce
two generative models based on the generative adversarial networks (GAN)
framework, which incorporate Huber loss, residual structure, and L1
regularization to generate missing parts of the incomplete head and torso pose
keypoints of partially occluded pedestrians, resulting in pose completion. Our
experiments on the MS COCO and JAAD datasets demonstrate that SDR-GAIN
outperforms the basic GAIN framework, the interpolation methods PCHIP and
makima, and the machine learning methods k-NN and MissForest on the pose
completion task. In addition, the runtime of SDR-GAIN is approximately 0.4 ms,
demonstrating high real-time performance and significant application value in
the field of autonomous driving.
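The GAIN-style completion step can be sketched as masked imputation, where observed coordinates are kept and only missing ones are filled by a generator; the column-mean `mean_generator` below is a hypothetical stand-in for the trained GAN generator, and the Huber term mirrors the loss named in the abstract.

```python
import numpy as np

def huber(x, delta=1.0):
    """Elementwise Huber penalty: quadratic near zero, linear in the tails."""
    a = np.abs(x)
    return np.where(a <= delta, 0.5 * x ** 2, delta * (a - 0.5 * delta))

def impute(keypoints, mask, generator):
    """GAIN-style completion sketch: observed coordinates (mask == 1) are
    kept; missing ones (mask == 0) are replaced by the generator's output."""
    filled = generator(np.where(mask > 0, keypoints, 0.0), mask)
    return mask * keypoints + (1 - mask) * filled

def mean_generator(x, mask):
    # Hypothetical generator: per-coordinate mean of the observed entries,
    # standing in for the trained adversarial generator.
    col_sum = (x * mask).sum(axis=0)
    col_cnt = np.maximum(mask.sum(axis=0), 1)
    return np.tile(col_sum / col_cnt, (x.shape[0], 1))

kps = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])   # 3 keypoints, (x, y)
mask = np.array([[1.0, 1.0], [1.0, 0.0], [1.0, 1.0]])  # one y is occluded
completed = impute(kps, mask, mean_generator)
obs_loss = float(huber((completed - kps) * mask).sum())  # observed parts intact
```

The mask-based blend guarantees the observed keypoints pass through unchanged, so the reconstruction loss on observed entries is exactly zero and training pressure falls entirely on the imputed values.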
Deep Learning Model Implementation Using Convolutional Neural Network Algorithm for Default P2P Lending Prediction
Peer-to-peer (P2P) lending is a fintech innovation that offers microloan services through online channels without intermediaries. P2P lending facilitates the lending and borrowing process between borrowers and lenders, but it also exposes lenders to a serious threat: default. Defaults on P2P lending platforms cause significant losses for lenders and threaten the overall efficiency of the P2P lending system, so understanding such risk management methods is essential. However, hand-designing feature extractors over highly complex information about borrowers and loan products takes considerable effort. In this study, we present a deep convolutional neural network (CNN) architecture for predicting default in P2P lending, with the goal of extracting features automatically and improving performance. A CNN is a deep learning technique for classifying complex information that automatically extracts discriminative features from input data using convolutional operations. The dataset used is the Lending Club dataset from P2P lending platforms in America, containing 9,578 records. The model achieved an accuracy of 85.43% in the performance evaluation. This study shows reasonably decent results in predicting P2P lending defaults with a CNN. This research is expected to contribute to the development of new, more complex and effective deep learning methods for predicting risk on P2P lending platforms.
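The forward pass of a 1-D CNN over tabular borrower features can be sketched as conv, ReLU, global max pooling, and a logistic output; all weights below are random placeholders, since a real model would be trained on the Lending Club data.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d(x, kernels):
    """Valid 1-D convolution of a feature vector with a bank of kernels."""
    k = kernels.shape[1]
    windows = np.stack([x[i:i + k] for i in range(len(x) - k + 1)])
    return windows @ kernels.T  # (positions, n_kernels)

def predict_default(features, kernels, w, b):
    """Tiny 1-D CNN forward pass: conv -> ReLU -> global max pooling ->
    logistic output giving a default probability. Weights are untrained
    placeholders for illustration only."""
    h = np.maximum(conv1d(features, kernels), 0.0)   # ReLU activation
    pooled = h.max(axis=0)                           # global max pooling
    return 1.0 / (1.0 + np.exp(-(pooled @ w + b)))   # sigmoid

features = rng.normal(size=20)     # borrower/loan attributes as a 1-D signal
kernels = rng.normal(size=(4, 3))  # 4 convolutional kernels of width 3
w, b = rng.normal(size=4), 0.0
p = predict_default(features, kernels, w, b)
```

The convolutional kernels slide across adjacent feature columns, which is how the architecture extracts discriminative feature combinations automatically instead of relying on hand-designed extractors.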