Visual Speech Enhancement
When video is shot in a noisy environment, the voice of a speaker seen in the
video can be enhanced using the visible mouth movements, reducing background
noise. While most existing methods use audio-only inputs, improved performance
is obtained with our visual speech enhancement, based on an audio-visual neural
network. We include in the training data videos to which we added the voice of
the target speaker as background noise. Since the audio input is not sufficient
to separate the voice of a speaker from his own voice, the trained model better
exploits the visual input and generalizes well to different noise types. The
proposed model outperforms prior audio-visual methods on two public lipreading
datasets. It is also the first to be demonstrated on a dataset not designed for
lipreading, such as the weekly addresses of Barack Obama.
Comment: Accepted to Interspeech 2018. Supplementary video:
https://www.youtube.com/watch?v=nyYarDGpcY
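The training-data idea in this abstract, adding the target speaker's own voice as background noise, can be illustrated with a minimal sketch. It is not the authors' pipeline: the function name `mix_at_snr`, the `snr_db` parameter, and the synthetic tones standing in for two utterances of the same speaker are all assumptions made for illustration.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Add `noise` to `clean`, scaled so the mixture has the requested SNR in dB.

    Hypothetical helper: the abstract does not state how mixtures are scaled.
    """
    # Repeat or truncate the noise so it matches the clean signal's length.
    noise = np.resize(noise, clean.shape)
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12  # guard against silent noise
    gain = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + gain * noise

# Synthetic stand-ins: in a real setup both signals would be recordings of
# the SAME speaker, one aligned with the video and one used as "noise".
sr = 16000
t = np.arange(sr) / sr
target = np.sin(2 * np.pi * 220 * t)
same_speaker_noise = np.sin(2 * np.pi * 330 * t[: sr // 2])

mixture = mix_at_snr(target, same_speaker_noise, snr_db=0.0)
```

Because the interfering signal comes from the same speaker, an audio-only cue cannot tell the two apart, which is the property the abstract relies on to push the model toward the visual input.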
Layered Controllable Video Generation
We introduce layered controllable video generation, where we, without any
supervision, decompose the initial frame of a video into foreground and
background layers, with which the user can control the video generation process
by simply manipulating the foreground mask. The key challenges are the
unsupervised foreground-background separation, which is ambiguous, and the ability
to anticipate user manipulations with access to only raw video sequences. We
address these challenges by proposing a two-stage learning procedure. In the
first stage, with a rich set of losses and a dynamic foreground size prior, we
learn how to separate the frame into foreground and background layers and,
conditioned on these layers, how to generate the next frame using a VQ-VAE
generator. In the second stage, we fine-tune this network to anticipate edits
to the mask, by fitting (parameterized) control to the mask from the future frame.
We demonstrate the effectiveness of this learning and the more granular control
mechanism, while illustrating state-of-the-art performance on two benchmark
datasets. We provide a video abstract as well as some video results on
https://gabriel-huang.github.io/layered_controllable_video_generation
Comment: This paper has been accepted to ECCV 2022 as an Oral paper
Project RISE: Recognizing Industrial Smoke Emissions
Industrial smoke emissions pose a significant concern to human health. Prior
works have shown that using Computer Vision (CV) techniques to identify smoke
as visual evidence can influence the attitude of regulators and empower
citizens to pursue environmental justice. However, existing datasets are of
neither sufficient quality nor quantity to train the robust CV models needed to support
air quality advocacy. We introduce RISE, the first large-scale video dataset
for Recognizing Industrial Smoke Emissions. We adopted a citizen science
approach to collaborate with local community members to annotate whether a
video clip has smoke emissions. Our dataset contains 12,567 clips from 19
distinct views from cameras that monitored three industrial facilities. These
daytime clips span 30 days over two years, including all four seasons. We ran
experiments using deep neural networks to establish a strong performance
baseline and reveal smoke recognition challenges. Our survey study discussed
community feedback, and our data analysis displayed opportunities for
integrating citizen scientists and crowd workers into the application of
Artificial Intelligence for social good.
Comment: Technical report
Moving Target Detection Based on an Adaptive Low-Rank Sparse Decomposition
For the exact detection of moving targets in video processing, this paper proposes an adaptive low-rank sparse decomposition algorithm. The background model and the solved frame vector are first used to construct an augmented matrix; robust principal component analysis (RPCA) is then used to perform a low-rank sparse decomposition on the augmented matrix. The separated low-rank part and sparse noise correspond to the background and the moving foreground of the video frame, respectively. The incremental singular value decomposition method and the current background vector are then used to update the background model. The experimental results show that the algorithm handles complex scenes such as lighting changes and background motion better, and that it effectively reduces delay and memory consumption.
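The low-rank plus sparse split at the core of this abstract can be sketched with a plain batch RPCA solver. This is a generic principal component pursuit iteration (alternating singular value thresholding and soft thresholding), not the paper's adaptive variant: it does not build the augmented matrix from a background model or update anything with incremental SVD. The function names, default parameters, and the synthetic 10x10 "video" are assumptions for illustration.

```python
import numpy as np

def shrink(X, tau):
    """Elementwise soft thresholding."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svd_threshold(X, tau):
    """Soft-threshold the singular values of X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * shrink(s, tau)) @ Vt

def rpca(M, lam=None, mu=None, tol=1e-7, max_iter=1000):
    """Split M into low-rank L and sparse S with M ~= L + S."""
    m, n = M.shape
    lam = 1.0 / np.sqrt(max(m, n)) if lam is None else lam
    mu = 0.25 * m * n / (np.abs(M).sum() + 1e-12) if mu is None else mu
    L = np.zeros_like(M)
    S = np.zeros_like(M)
    Y = np.zeros_like(M)  # dual variable for the constraint M = L + S
    norm_M = np.linalg.norm(M)
    for _ in range(max_iter):
        L = svd_threshold(M - S + Y / mu, 1.0 / mu)
        S = shrink(M - L + Y / mu, lam / mu)
        R = M - L - S
        Y = Y + mu * R
        if np.linalg.norm(R) <= tol * norm_M:
            break
    return L, S

# Synthetic video: each column is one flattened 10x10 frame.
rng = np.random.default_rng(0)
h, w, n_frames = 10, 10, 30
background = rng.uniform(0.0, 1.0, size=h * w)
frames = np.tile(background[:, None], (1, n_frames))  # static background
mask = np.zeros((h * w, n_frames), dtype=bool)
for t in range(n_frames):
    fg = np.zeros((h, w), dtype=bool)
    r, c = t % (h - 1), (3 * t) % (w - 2)
    fg[r:r + 2, c:c + 2] = True  # a 2x2 moving target
    mask[:, t] = fg.ravel()
frames[mask] += 1.5

L, S = rpca(frames)  # L: background estimate, S: moving foreground
```

In this toy setup the columns of `L` approximate the static background and the large entries of `S` mark the moving block. A batch solver like this recomputes a full SVD on every iteration, which is the kind of cost an incremental update scheme is meant to avoid.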
- …