Project RISE: Recognizing Industrial Smoke Emissions
Industrial smoke emissions pose a significant concern to human health. Prior
works have shown that using Computer Vision (CV) techniques to identify smoke
as visual evidence can influence the attitude of regulators and empower
citizens to pursue environmental justice. However, existing datasets are of neither sufficient quality nor sufficient quantity to train the robust CV models needed to support air quality advocacy. We introduce RISE, the first large-scale video dataset
for Recognizing Industrial Smoke Emissions. We adopted a citizen science
approach to collaborate with local community members to annotate whether a
video clip has smoke emissions. Our dataset contains 12,567 clips from 19
distinct views from cameras that monitored three industrial facilities. These
daytime clips span 30 days over two years, including all four seasons. We ran
experiments using deep neural networks to establish a strong performance
baseline and reveal smoke recognition challenges. Our survey study gathered community feedback, and our data analysis revealed opportunities for integrating citizen scientists and crowd workers into the application of Artificial Intelligence for social good.
Comment: Technical report
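The abstract does not specify which deep neural networks were used for the baseline, so the following is only a minimal sketch of the underlying task, binary smoke recognition on short video clips, using an off-the-shelf 3D ResNet from torchvision. The model choice, clip shape, and label convention are assumptions for illustration, not the authors' baseline.

```python
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

# Off-the-shelf 3D ResNet-18 with a two-class head (smoke vs. no smoke).
model = r3d_18()
model.fc = nn.Linear(model.fc.in_features, 2)

# Hypothetical clip: batch of 1, RGB, 36 frames, 180x180 pixels
# (shape: batch, channels, time, height, width).
clip = torch.randn(1, 3, 36, 180, 180)
logits = model(clip)                 # -> (1, 2)
pred = logits.argmax(dim=1)          # assumed convention: 1 = smoke, 0 = no smoke
```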
Cross-View Image Synthesis using Conditional GANs
Learning to generate natural scenes has always been a challenging task in
computer vision. It is even more painstaking when the generation is conditioned
on images with drastically different views. This is mainly because
understanding, corresponding, and transforming appearance and semantic
information across the views is not trivial. In this paper, we attempt to solve
the novel problem of cross-view image synthesis, aerial to street-view and vice
versa, using conditional generative adversarial networks (cGAN). Two new
architectures called Crossview Fork (X-Fork) and Crossview Sequential (X-Seq)
are proposed to generate scenes with resolutions of 64x64 and 256x256 pixels.
X-Fork architecture has a single discriminator and a single generator. The
generator hallucinates both the image and its semantic segmentation in the
target view. X-Seq architecture utilizes two cGANs. The first one generates the
target image which is subsequently fed to the second cGAN for generating its
corresponding semantic segmentation map. The feedback from the second cGAN
helps the first cGAN generate sharper images. Both of our proposed
architectures learn to generate natural images as well as their semantic
segmentation maps. The proposed methods capture and maintain the true semantics of objects in the source and target views better than the traditional image-to-image translation method, which considers only the visual appearance of the scene. Extensive qualitative and quantitative evaluations demonstrate the effectiveness of our frameworks, compared to two state-of-the-art methods, for natural scene generation across drastically different views.
Comment: Accepted at CVPR 2018
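As a rough illustration of the forking idea described above, a single generator whose decoder forks into an image head and a segmentation head in the target view, here is a heavily simplified PyTorch sketch. The layer counts, channel widths, and 64x64 resolution are placeholders, not the published architecture, and the discriminator and adversarial training loop are omitted.

```python
import torch
import torch.nn as nn

class XForkGenerator(nn.Module):
    """Simplified X-Fork-style generator: one encoder, a shared decoder trunk,
    then a fork into an image head and a semantic-segmentation head."""
    def __init__(self, in_ch=3, seg_classes=4, base=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, base, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(base, base * 2, 4, stride=2, padding=1),
            nn.BatchNorm2d(base * 2), nn.LeakyReLU(0.2),
            nn.Conv2d(base * 2, base * 4, 4, stride=2, padding=1),
            nn.BatchNorm2d(base * 4), nn.LeakyReLU(0.2),
        )
        self.trunk = nn.Sequential(
            nn.ConvTranspose2d(base * 4, base * 2, 4, stride=2, padding=1),
            nn.BatchNorm2d(base * 2), nn.ReLU(),
            nn.ConvTranspose2d(base * 2, base, 4, stride=2, padding=1),
            nn.BatchNorm2d(base), nn.ReLU(),
        )
        self.image_head = nn.Sequential(
            nn.ConvTranspose2d(base, 3, 4, stride=2, padding=1), nn.Tanh())
        self.seg_head = nn.ConvTranspose2d(base, seg_classes, 4, stride=2, padding=1)

    def forward(self, source_view):
        h = self.trunk(self.encoder(source_view))
        return self.image_head(h), self.seg_head(h)

gen = XForkGenerator()
aerial = torch.randn(1, 3, 64, 64)        # hypothetical aerial-view input
street_img, street_seg = gen(aerial)      # target-view image + segmentation logits
```

In the X-Seq variant, by contrast, the generated target image would be passed to a second, separately trained cGAN that produces the segmentation map.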
SRMAE: Masked Image Modeling for Scale-Invariant Deep Representations
Due to the prevalence of scale variance in natural images, we propose to use
image scale as a self-supervised signal for Masked Image Modeling (MIM). Our
method involves selecting random patches from the input image and downsampling
them to a low-resolution format. Our framework utilizes the latest advances in
super-resolution (SR) to design the prediction head, which reconstructs the
input from low-resolution clues and other patches. After 400 epochs of
pre-training, our Super Resolution Masked Autoencoders (SRMAE) get an accuracy
of 82.1% on the ImageNet-1K task. The image-scale signal also allows our SRMAE to capture scale-invariant representations. For the very low resolution (VLR)
recognition task, our model achieves the best performance, surpassing DeriveNet
by 1.3%. Our method also achieves an accuracy of 74.84% on the task of
recognizing low-resolution facial expressions, surpassing the current
state-of-the-art FMD by 9.48%.
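The SR-style prediction head is the part of the pipeline the abstract describes most concretely, so the sketch below illustrates only that pretext signal: a selected high-resolution patch is downsampled, and a small sub-pixel-convolution (ESPCN-style) head reconstructs the original pixels. The patch size, scale factor, and layer widths are assumptions, and the masking and transformer-encoder machinery of the full MAE pipeline is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

patch, scale = 16, 4                      # assumed patch size and downsampling factor

class SRHead(nn.Module):
    """Maps a low-resolution patch back to full resolution via PixelShuffle."""
    def __init__(self, ch=32, scale=4):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, 3 * scale * scale, 3, padding=1),
            nn.PixelShuffle(scale),       # (3*s*s, h, w) -> (3, h*s, w*s)
        )

    def forward(self, lr_patch):
        return self.body(lr_patch)

hr_patch = torch.randn(8, 3, patch, patch)    # hypothetical batch of image patches
lr_patch = F.interpolate(hr_patch, scale_factor=1 / scale,
                         mode="bicubic", align_corners=False)
recon = SRHead(scale=scale)(lr_patch)          # reconstruct 16x16 from 4x4 clues
loss = F.mse_loss(recon, hr_patch)             # self-supervised reconstruction target
```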
Going Deeper into Action Recognition: A Survey
Understanding human actions in visual data is tied to advances in
complementary research areas including object recognition, human dynamics,
domain adaptation and semantic segmentation. Over the last decade, human action
analysis evolved from earlier schemes that are often limited to controlled
environments to nowadays advanced solutions that can learn from millions of
videos and apply to almost all daily activities. Given the broad range of
applications from video surveillance to human-computer interaction, scientific
milestones in action recognition are achieved more rapidly, quickly rendering once-leading approaches obsolete. This motivated us to
provide a comprehensive review of the notable steps taken towards recognizing
human actions. To this end, we start our discussion with the pioneering methods
that use handcrafted representations, and then, navigate into the realm of deep
learning based approaches. We aim to remain objective throughout this survey,
touching upon encouraging improvements as well as inevitable fallbacks, in the
hope of raising fresh questions and motivating new research directions for the
reader.
PoSynDA: Multi-Hypothesis Pose Synthesis Domain Adaptation for Robust 3D Human Pose Estimation
Existing 3D human pose estimators face challenges in adapting to new datasets
due to the lack of 2D-3D pose pairs in training sets. To overcome this issue,
we propose the Multi-Hypothesis Pose Synthesis Domain Adaptation (PoSynDA) framework to bridge this data disparity gap in the target domain. Specifically, PoSynDA uses a
diffusion-inspired structure to simulate 3D pose distribution in the target
domain. By incorporating a multi-hypothesis network, PoSynDA generates diverse
pose hypotheses and aligns them with the target domain. To do this, it first
utilizes target-specific source augmentation to obtain the target domain
distribution data from the source domain by decoupling the scale and position
parameters. The process is then further refined through the teacher-student
paradigm and low-rank adaptation. With extensive comparison of benchmarks such
as Human3.6M and MPI-INF-3DHP, PoSynDA demonstrates competitive performance,
even comparable to the target-trained MixSTE model (Zhang et al., 2022). This
work paves the way for the practical application of 3D human pose estimation in
unseen domains. The code is available at https://github.com/hbing-l/PoSynDA.
Comment: Accepted to ACM Multimedia 2023; 10 pages, 4 figures, 8 tables.
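As a hedged reading of the target-specific source augmentation step (decoupling scale and position), the sketch below re-scales a source 3D pose and re-positions its root to match assumed target-domain statistics. The function name, joint layout, and crude scale estimate are illustrative only and are not taken from the released code.

```python
import torch

def augment_source_pose(src_pose, tgt_scale, tgt_root, root_idx=0):
    """src_pose: (J, 3) source 3D pose; tgt_scale: scalar; tgt_root: (3,) position."""
    root = src_pose[root_idx]
    centered = src_pose - root                      # remove the source position
    src_scale = centered.norm(dim=-1).mean()        # crude per-pose scale estimate
    rescaled = centered * (tgt_scale / src_scale)   # match the target scale
    return rescaled + tgt_root                      # place at the target position

src = torch.randn(17, 3)                            # hypothetical 17-joint pose
aug = augment_source_pose(src, tgt_scale=torch.tensor(0.9),
                          tgt_root=torch.tensor([0.1, 0.0, 4.5]))
```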
Learning to Recover Spectral Reflectance from RGB Images
This paper tackles spectral reflectance recovery (SRR) from RGB images. Since
capturing ground-truth spectral reflectance and camera spectral sensitivity are
challenging and costly, most existing approaches are trained on synthetic
images and utilize the same parameters for all unseen testing images. This is suboptimal, especially when the trained models are tested on real images, because the models never exploit the internal information of the testing images. To address
this issue, we adopt a self-supervised meta-auxiliary learning (MAXL) strategy
that fine-tunes the well-trained network parameters with each testing image to
combine external with internal information. To the best of our knowledge, this
is the first work that successfully adapts the MAXL strategy to this problem.
Instead of relying on naive end-to-end training, we also propose a novel
architecture that integrates the physical relationship between the spectral
reflectance and the corresponding RGB images into the network based on our
mathematical analysis. Besides, since the spectral reflectance of a scene is independent of its illumination while the corresponding RGB images are not, we recover the spectral reflectance of a scene from its RGB images captured under multiple illuminations to further reduce the unknowns. Qualitative and
quantitative evaluations demonstrate the effectiveness of our proposed network
and of the MAXL. Our code and data are available at
https://github.com/Dong-Huo/SRR-MAXL
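To make the per-image fine-tuning idea concrete, below is a generic test-time adaptation loop in the spirit of MAXL: a copy of the pretrained network is briefly fine-tuned on each test image with a self-supervised loss before predicting. The auxiliary objective used here (re-rendering RGB from the predicted reflectance through an assumed camera sensitivity matrix S) and the placeholder network are illustrative assumptions, not the paper's exact formulation.

```python
import copy
import torch
import torch.nn.functional as F

def adapt_and_predict(pretrained_net, rgb, S, steps=5, lr=1e-4):
    net = copy.deepcopy(pretrained_net)      # keep the base weights intact
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(steps):
        reflectance = net(rgb)               # (N, bands, H, W) predicted reflectance
        # Re-render RGB with an assumed 3 x bands sensitivity matrix S.
        rgb_hat = torch.einsum("cb,nbhw->nchw", S, reflectance)
        loss = F.mse_loss(rgb_hat, rgb)      # internal, self-supervised signal
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        return net(rgb)                      # prediction from the adapted copy

# Hypothetical usage with a stand-in network and random data:
net = torch.nn.Conv2d(3, 31, 3, padding=1)   # placeholder for the real SRR network
S = torch.rand(3, 31)                        # assumed camera sensitivity matrix
pred = adapt_and_predict(net, torch.rand(1, 3, 64, 64), S)
```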
Refined Temporal Pyramidal Compression-and-Amplification Transformer for 3D Human Pose Estimation
Estimating the 3D pose of humans in video sequences requires both accuracy and a well-structured architecture. With the success of transformers,
we introduce the Refined Temporal Pyramidal Compression-and-Amplification
(RTPCA) transformer. Exploiting the temporal dimension, RTPCA extends
intra-block temporal modeling via its Temporal Pyramidal
Compression-and-Amplification (TPCA) structure and refines inter-block feature
interaction with a Cross-Layer Refinement (XLR) module. In particular, the TPCA block exploits a temporal pyramid paradigm, reinforcing key and value representation capabilities and seamlessly extracting spatial semantics from motion sequences. We stitch these TPCA blocks together with XLR, which promotes rich semantic representation through continuous interaction of queries, keys, and values. This strategy combines early-stage information with the current flow, addressing the deficits in detail and stability typically seen in other
transformer-based methods. We demonstrate the effectiveness of RTPCA by
achieving state-of-the-art results on Human3.6M, HumanEva-I, and MPI-INF-3DHP
benchmarks with minimal computational overhead. The source code is available at
https://github.com/hbing-l/RTPCA.
Comment: 11 pages, 5 figures.
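The following is only a hedged sketch of the temporal-pyramid intuition, keys and values drawn from the sequence at several temporal pooling rates while queries stay at the full frame rate; it is not the published TPCA block or the XLR module, and the dimensions, pooling rates, and use of nn.MultiheadAttention are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalPyramidAttention(nn.Module):
    def __init__(self, dim=256, heads=8, pool_rates=(2, 4)):
        super().__init__()
        self.pool_rates = pool_rates
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                        # x: (B, T, dim) per-frame features
        # Keys/values: the full-rate sequence plus temporally compressed copies.
        pyramid = [x] + [
            F.avg_pool1d(x.transpose(1, 2), r, stride=r).transpose(1, 2)
            for r in self.pool_rates
        ]
        kv = torch.cat(pyramid, dim=1)
        out, _ = self.attn(x, kv, kv)            # queries stay at full temporal rate
        return out

feats = torch.randn(2, 81, 256)                  # hypothetical 81-frame pose sequence
out = TemporalPyramidAttention()(feats)          # -> (2, 81, 256)
```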