Facial Expression Recognition in the Wild using Rich Deep Features
Facial Expression Recognition is an active area of research in computer
vision with a wide range of applications. Several approaches have been
developed to solve this problem for different benchmark datasets. However,
Facial Expression Recognition in the wild remains an area where much work is
still needed to serve real-world applications. To this end, in this paper we
present a novel approach towards facial expression recognition. We fuse rich
deep features with domain knowledge by encoding discriminant facial
patches. We conduct experiments on two of the most popular benchmark datasets:
CK and TFE. Moreover, we present a novel dataset that, unlike its predecessors,
consists of natural - not acted - expression images. Experimental results show
that our approach achieves state-of-the-art results over standard benchmarks
and our own dataset.
Comment: in International Conference on Image Processing, 201
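The fusion step described above - combining a global deep descriptor with features extracted from discriminant facial patches - can be made concrete with a minimal sketch. Everything here (feature dimensions, patch count, classifier choice, the toy data) is a hypothetical stand-in, not the authors' implementation:

```python
# Hypothetical sketch: fuse a global CNN descriptor with per-patch
# descriptors by concatenation, then train a linear classifier.
import numpy as np
from sklearn.svm import LinearSVC

def fuse_features(global_feat, patch_feats):
    """Concatenate a global descriptor with per-patch descriptors."""
    return np.concatenate([global_feat] + list(patch_feats))

# toy data: 100 faces, one 512-d global feature and four 128-d patch
# features each (dimensions are illustrative assumptions)
rng = np.random.default_rng(0)
X = np.stack([
    fuse_features(rng.normal(size=512), [rng.normal(size=128) for _ in range(4)])
    for _ in range(100)
])
y = rng.integers(0, 7, size=100)   # 7 basic expression classes

clf = LinearSVC().fit(X, y)        # classify the fused descriptors
print(clf.score(X, y))
```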
Learning Instance Segmentation by Interaction
We present an approach for building an active agent that learns to segment
its visual observations into individual objects by interacting with its
environment in a completely self-supervised manner. The agent uses its current
segmentation model to infer pixels that constitute objects and refines the
segmentation model by interacting with these pixels. The model learned from
over 50K interactions generalizes to novel objects and backgrounds. To deal
with the noisy training signal for segmenting objects obtained through
self-supervised interactions, we propose a robust set loss. A dataset of the
robot's interactions, along with a few human-labeled examples, is provided as a
benchmark for future
research. We test the utility of the learned segmentation model by providing
results on a downstream vision-based control task of rearranging multiple
objects into target configurations from visual inputs alone. Videos, code, and
robotic interaction dataset are available at
https://pathak22.github.io/seg-by-interaction/
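The robust set loss is the paper's own contribution; as one plausible illustration of how a segmentation loss can discount a noisy minority of self-supervised pixel labels, the PyTorch sketch below averages per-pixel cross-entropy only over the best-predicted fraction of pixels. The `slack` fraction and the trimming strategy are assumptions, not the paper's exact formulation:

```python
# Sketch of a set-level robust loss: ignore the worst `slack` fraction
# of pixels so a noisy minority of labels cannot dominate the gradient.
import torch
import torch.nn.functional as F

def robust_set_loss(logits, target, slack=0.25):
    """logits: (B, C, H, W); target: (B, H, W) integer labels."""
    pixel_loss = F.cross_entropy(logits, target, reduction="none")  # (B, H, W)
    pixel_loss = pixel_loss.flatten(1)                              # (B, H*W)
    keep = int(pixel_loss.shape[1] * (1.0 - slack))
    # keep the lowest-loss pixels; the rest are treated as label noise
    lowest, _ = torch.topk(pixel_loss, keep, dim=1, largest=False)
    return lowest.mean()

logits = torch.randn(2, 2, 64, 64, requires_grad=True)
target = torch.randint(0, 2, (2, 64, 64))
robust_set_loss(logits, target).backward()
```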
The ApolloScape Open Dataset for Autonomous Driving and its Application
Autonomous driving has attracted tremendous attention especially in the past
few years. The key techniques for a self-driving car include solving tasks like
3D map construction, self-localization, parsing the driving road and
understanding objects, which enable vehicles to reason and act. However,
large-scale datasets for training and system evaluation are still a bottleneck
for developing robust perception models. In this paper, we present the
ApolloScape dataset [1] and its applications for autonomous driving. Compared
with existing public datasets from real scenes, e.g. KITTI [2] or Cityscapes
[3], ApolloScape contains much larger and richer labelling, including holistic
semantic dense point clouds for each site, stereo, per-pixel semantic
labelling, lanemark labelling, instance segmentation, 3D car instances, and
highly accurate location for every frame in various driving videos from
multiple sites, cities, and times of day. For each task, it contains at least
15x more images than SOTA datasets. To label such a complete dataset, we
develop various tools and algorithms specific to each task to accelerate the
labelling process, such as 3D-2D segment labelling tools, active labelling in
videos, etc. Building on ApolloScape, we are able to develop algorithms that
jointly consider the learning and inference of multiple tasks. In this paper,
we provide a sensor fusion
scheme integrating camera videos, consumer-grade motion sensors (GPS/IMU), and
a 3D semantic map in order to achieve robust self-localization and semantic
segmentation for autonomous driving. We show that practically, sensor fusion
and joint learning of multiple tasks are beneficial to achieve a more robust
and accurate system. We expect our dataset and the proposed algorithms to
support and motivate researchers toward further development of multi-sensor
fusion and multi-task learning in the field of computer vision.
Comment: Version 4: Accepted by TPAMI. Version 3: 17 pages, 10 tables, 11
figures, added the application (DeLS-3D) based on the ApolloScape Dataset.
Version 2: 7 pages, 6 figures, added comparison with the BDD100K dataset
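To give a feel for the sensor-fusion scheme, the toy sketch below combines a noisy GPS/IMU pose with a camera-based estimate via a precision-weighted (Kalman-style) update. The real system fuses full 6-DoF poses against the 3D semantic map; this illustration, with made-up variances, only shows the weighting principle:

```python
# Toy sketch: precision-weighted fusion of two pose estimates.
import numpy as np

def fuse_pose(gps_pose, gps_var, cam_pose, cam_var):
    """Precision-weighted average of two pose estimates."""
    k = gps_var / (gps_var + cam_var)   # trust the camera more when GPS is noisy
    return gps_pose + k * (cam_pose - gps_pose)

gps = np.array([10.0, 4.0, 0.0])   # metres, from consumer-grade GPS/IMU
cam = np.array([10.4, 3.8, 0.1])   # from localization against the 3D map
print(fuse_pose(gps, 4.0, cam, 0.25))
```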
RGBD Datasets: Past, Present and Future
Since the launch of the Microsoft Kinect, scores of RGBD datasets have been
released. These have propelled advances in areas from reconstruction to gesture
recognition. In this paper we explore the field, reviewing datasets across
eight categories: semantics, object pose estimation, camera tracking, scene
reconstruction, object tracking, human actions, faces and identification. By
extracting relevant information in each category we help researchers to find
appropriate data for their needs, and we consider which datasets have succeeded
in driving computer vision forward and why.
Finally, we examine the future of RGBD datasets. We identify key areas which
are currently underexplored, and suggest that future directions may include
synthetic data and dense reconstructions of static and dynamic scenes.
Comment: 8 pages excluding references (CVPR style)
Three-dimensional Backbone Network for 3D Object Detection in Traffic Scenes
The task of detecting 3D objects in traffic scenes has a pivotal role in many
real-world applications. However, the performance of 3D object detection is
lower than that of 2D object detection due to the lack of powerful 3D feature
extraction methods. To address this issue, this study proposes a 3D backbone
network to acquire comprehensive 3D feature maps for 3D object detection. It
primarily consists of sparse 3D convolutional neural network operations in the
point cloud. The 3D backbone network can inherently learn 3D features from the
raw data without compressing the point cloud into multiple 2D images. The
sparse 3D convolutional neural network takes full advantage of the sparsity in
the 3D point cloud to accelerate computation and save memory, which makes the
3D backbone network feasible for real-world applications. Empirical experiments
were conducted on the KITTI benchmark, and results comparable to the
state-of-the-art performance for 3D object detection were obtained.
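The memory argument for sparse 3D convolution is easy to see by voxelising a point cloud and counting occupied cells. The NumPy sketch below, with synthetic points and an assumed voxel size, illustrates the sparsity that a sparse 3D CNN exploits; it is not the paper's implementation:

```python
# Voxelise a point cloud and store only occupied voxels; sparse 3D
# convolutions operate on exactly this kind of sparse set.
import numpy as np

def voxelize(points, voxel_size=0.2):
    """Map each occupied voxel index to the mean of its points."""
    idx = np.floor(points / voxel_size).astype(np.int64)
    voxels = {}
    for i, p in zip(map(tuple, idx), points):
        voxels.setdefault(i, []).append(p)
    return {i: np.mean(ps, axis=0) for i, ps in voxels.items()}

pts = np.random.uniform(0, 40, size=(20000, 3))   # synthetic point cloud
occ = voxelize(pts)
dense_cells = int((40 / 0.2) ** 3)
print(f"occupied: {len(occ)} of {dense_cells} cells "
      f"({100 * len(occ) / dense_cells:.2f}%)")
```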
Saliency Prediction in the Deep Learning Era: Successes, Limitations, and Future Challenges
Visual saliency models have enjoyed a big leap in performance in recent
years, thanks to advances in deep learning and large scale annotated data.
Despite enormous effort and huge breakthroughs, however, models still fall
short of reaching human-level accuracy. In this work, I explore the landscape
of the field, emphasizing new deep saliency models, benchmarks, and datasets.
A large number of image and video saliency models are reviewed and compared
over two image benchmarks and two large scale video datasets. Further, I
identify factors that contribute to the gap between models and humans and
discuss remaining issues that need to be addressed to build the next generation
of more powerful saliency models. Some specific questions that are addressed
include: in what ways current models fail, how to remedy them, what can be
learned from cognitive studies of attention, how explicit saliency judgments
relate to fixations, how to conduct fair model comparison, and what are the
emerging applications of saliency models.
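One of the standard metrics such benchmarks rely on is Normalized Scanpath Saliency (NSS): z-score the predicted map and average it at the fixated pixels. A minimal NumPy sketch with synthetic data:

```python
# NSS: standardise the saliency map, then read it out at fixations.
import numpy as np

def nss(saliency, fixations):
    """saliency: (H, W) prediction; fixations: (H, W) binary fixation map."""
    s = (saliency - saliency.mean()) / (saliency.std() + 1e-8)
    return float(s[fixations.astype(bool)].mean())

pred = np.random.rand(48, 64)
fix = np.zeros((48, 64))
fix[20, 30] = fix[10, 50] = 1
print(nss(pred, fix))   # > 0 means fixated pixels are above-average salient
```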
BubbleNets: Learning to Select the Guidance Frame in Video Object Segmentation by Deep Sorting Frames
Semi-supervised video object segmentation has made significant progress on
real and challenging videos in recent years. The current paradigm for
segmentation methods and benchmark datasets is to segment objects in video
provided a single annotation in the first frame. However, we find that
segmentation performance across the entire video varies dramatically when
selecting an alternative frame for annotation. This paper addresses the problem
of learning to suggest the single best frame across the video for user
annotation; this is, in fact, never the first frame of the video. We achieve
this by introducing BubbleNets, a novel deep sorting network that learns to
select frames using a performance-based loss function, which enables the
conversion of expansive amounts of training examples from already existing
datasets. Using
BubbleNets, we are able to achieve an 11% relative improvement in segmentation
performance on the DAVIS benchmark without any changes to the underlying method
of segmentation.
Comment: CVPR 201
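The "deep sorting" idea can be sketched as a bubble-sort pass driven by a learned pairwise comparator that predicts which of two frames would make the better annotation frame. In the sketch below, `predict_better` is a stub standing in for the network, and the rule it applies is arbitrary:

```python
# Bubble the predicted-best annotation frame to the front of the list
# using a learned pairwise comparator (stubbed out here).
def predict_better(frame_a, frame_b):
    """Stub: True if frame_a is predicted to be the better annotation
    frame. A real model would compare frame features."""
    return frame_a % 7 < frame_b % 7           # arbitrary stand-in rule

def bubble_select(frames):
    """One backward bubble pass carries the best frame to index 0."""
    frames = list(frames)
    for i in range(len(frames) - 1, 0, -1):
        if predict_better(frames[i], frames[i - 1]):
            frames[i], frames[i - 1] = frames[i - 1], frames[i]
    return frames[0]

video = list(range(60))                        # frame indices of a clip
print("suggest annotating frame", bubble_select(video))
```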
Salient Object Detection: A Benchmark
We extensively compare, qualitatively and quantitatively, 40 state-of-the-art
models (28 salient object detection, 10 fixation prediction, 1 objectness, and
1 baseline) over 6 challenging datasets for the purpose of benchmarking salient
object detection and segmentation methods. From the results obtained so far,
our evaluation shows a consistent rapid progress over the last few years in
terms of both accuracy and running time. The top contenders in this benchmark
significantly outperform the models identified as the best in the previous
benchmark conducted just two years ago. We find that the models designed
specifically for salient object detection generally work better than models in
closely related areas, which in turn provides a precise definition and suggests
an appropriate treatment of this problem that distinguishes it from other
problems. In particular, we analyze the influence of center bias and scene
complexity on model performance, which, along with the hard cases for
state-of-the-art models, provide useful hints towards constructing more
challenging large scale datasets and better saliency models. Finally, we
propose probable solutions for tackling several open problems such as
evaluation scores and dataset bias, which also suggest future research
directions in the rapidly growing field of salient object detection.
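A typical accuracy score in this benchmark family is the F-measure over a binarised saliency map, conventionally with beta^2 = 0.3 to weight precision over recall. A minimal sketch with synthetic data (the single fixed threshold is an illustrative simplification; benchmarks usually sweep thresholds):

```python
# F-measure of a binarised saliency map against a ground-truth mask.
import numpy as np

def f_measure(saliency, gt, thresh=0.5, beta2=0.3):
    pred = saliency >= thresh
    gt = gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    precision = tp / max(pred.sum(), 1)
    recall = tp / max(gt.sum(), 1)
    denom = beta2 * precision + recall
    return float((1 + beta2) * precision * recall / denom) if denom else 0.0

sal = np.random.rand(100, 100)
mask = np.zeros((100, 100), bool)
mask[30:70, 30:70] = True
print(f_measure(sal, mask))
```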
Joint Multi-view Face Alignment in the Wild
The de facto algorithm for facial landmark estimation involves running a face
detector with a subsequent deformable model fitting on the bounding box. This
encompasses two basic problems: i) the detection and deformable fitting steps
are performed independently, so the detector might not provide the best-suited
initialisation for the fitting step; ii) the face appearance varies hugely
across different poses, which makes deformable face fitting very challenging,
so distinct models have to be used (e.g., one for profile and
one for frontal faces). In this work, we propose the first, to the best of our
knowledge, joint multi-view convolutional network to handle large pose
variations across faces in-the-wild, and elegantly bridge face detection and
facial landmark localisation tasks. Existing joint face detection and landmark
localisation methods focus only on a very small set of landmarks. By contrast,
our method can detect and align a large number of landmarks for semi-frontal
(68 landmarks) and profile (39 landmarks) faces. We evaluate our model on a
plethora of datasets including standard static image datasets such as IBUG,
300W, COFW, and the latest Menpo Benchmark for both semi-frontal and profile
faces. Significant improvement over state-of-the-art methods on deformable face
tracking is witnessed on the 300VW benchmark. We also demonstrate
state-of-the-art results for face detection on the FDDB and MALF datasets.
Comment: Submitted to IEEE Transactions on Image Processing
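Results on 68-point benchmarks such as 300W are commonly reported as the normalised mean error (NME): mean landmark distance divided by a normalising term such as the inter-ocular distance. A minimal sketch (the exact normalisation varies across papers):

```python
# NME for one face under the 68-point markup; indices 36 and 45 are
# the outer eye corners used for inter-ocular normalisation.
import numpy as np

def nme(pred, gt):
    """pred, gt: (68, 2) landmark arrays for one face."""
    inter_ocular = np.linalg.norm(gt[36] - gt[45])
    per_point = np.linalg.norm(pred - gt, axis=1)
    return float(per_point.mean() / inter_ocular)

gt = np.random.rand(68, 2) * 200
pred = gt + np.random.normal(scale=2.0, size=gt.shape)
print(f"NME: {nme(pred, gt):.4f}")
```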
A Dataset for Developing and Benchmarking Active Vision
We present a new public dataset with a focus on simulating robotic vision
tasks in everyday indoor environments using real imagery. The dataset includes
20,000+ RGB-D images and 50,000+ 2D bounding boxes of object instances densely
captured in 9 unique scenes. We train a fast object category detector for
instance detection on our data. Using the dataset we show that, although
increasingly accurate and fast, the state of the art for object detection is
still severely impacted by object scale, occlusion, and viewing direction, all
of which matter for robotics applications. We next validate the dataset for
simulating active vision, and use the dataset to develop and evaluate a
deep-network-based system for next best move prediction for object
classification using reinforcement learning. Our dataset is available for
download at cs.unc.edu/~ammirato/active_vision_dataset_website/.
Comment: To appear at ICRA 201
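As a flavour of the next-best-move component, the sketch below selects among discrete camera motions epsilon-greedily over predicted action values. `q_network` is a hypothetical stub for the deep network the paper trains with reinforcement learning, and the action set is made up:

```python
# Epsilon-greedy choice of the next camera move from stubbed Q-values.
import random

ACTIONS = ["forward", "backward", "left", "right", "rotate_cw", "rotate_ccw"]

def q_network(observation):
    """Stub: predicted value of each move for improving classification."""
    random.seed(hash(observation) % 2**32)
    return [random.random() for _ in ACTIONS]

def next_move(observation, epsilon=0.1):
    """Epsilon-greedy selection over the predicted action values."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)          # explore
    values = q_network(observation)
    return ACTIONS[values.index(max(values))]  # exploit

print(next_move("rgbd_frame_0042"))
```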