Weighted Bilinear Coding over Salient Body Parts for Person Re-identification
Deep convolutional neural networks (CNNs) have demonstrated dominant
performance in person re-identification (Re-ID). Existing CNN based methods
utilize global average pooling (GAP) to aggregate intermediate convolutional
features for Re-ID. However, this strategy only considers the first-order
statistics of local features and treats local features at different locations
as equally important, leading to sub-optimal feature representation. To deal with
these issues, we propose a novel weighted bilinear coding (WBC) framework for
local feature aggregation in CNN networks to pursue more representative and
discriminative feature representations, which can adapt to other
state-of-the-art methods and improve their performance. Specifically, bilinear
coding is used to encode the channel-wise feature correlations to capture
richer feature interactions. Meanwhile, a weighting scheme is applied on the
bilinear coding to adaptively adjust the weights of local features at different
locations based on their importance in recognition, further improving the
discriminability of feature aggregation. To handle the spatial misalignment
issue, we use a salient part net (spatial attention module) to derive salient
body parts, and apply the WBC model on each part. The final representation,
formed by concatenating the WBC encoded features of each part, is both
discriminative and resistant to spatial misalignment. Experiments on three
benchmarks, including Market-1501, DukeMTMC-reID and CUHK03, demonstrate the
favorable performance of our method against other outstanding methods.
Comment: 22 pages
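The difference between first-order pooling and weighted bilinear aggregation described in this abstract can be illustrated with a minimal sketch. It assumes the aggregated descriptor is a weighted sum of outer products of the local feature vectors; the function names and the example weights are hypothetical, and the abstract describes the weights as adaptive rather than fixed by hand as they are here.

```python
def global_average_pooling(features):
    """First-order aggregation: the mean of the local feature vectors."""
    n = len(features)
    dim = len(features[0])
    return [sum(f[c] for f in features) / n for c in range(dim)]


def weighted_bilinear_pooling(features, weights):
    """Second-order aggregation: weighted sum of outer products x_i x_i^T.

    Returns the dim x dim matrix of channel-wise correlations, flattened
    row by row into a single vector.
    """
    dim = len(features[0])
    pooled = [[0.0] * dim for _ in range(dim)]
    for x, w in zip(features, weights):
        for a in range(dim):
            for b in range(dim):
                pooled[a][b] += w * x[a] * x[b]
    return [v for row in pooled for v in row]


# Three local features (spatial locations) with two channels each.
local_features = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
# Hypothetical per-location importance weights (assumed, not from the paper).
location_weights = [0.5, 0.25, 0.25]

gap = global_average_pooling(local_features)
wbc = weighted_bilinear_pooling(local_features, location_weights)
print(gap)
print(wbc)
```

The bilinear descriptor grows quadratically with the number of channels, which is the price of capturing channel-wise correlations that the mean alone discards.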
Playing hard exploration games by watching YouTube
Deep reinforcement learning methods traditionally struggle with tasks where
environment rewards are particularly sparse. One successful method of guiding
exploration in these domains is to imitate trajectories provided by a human
demonstrator. However, these demonstrations are typically collected under
artificial conditions, i.e. with access to the agent's exact environment setup
and the demonstrator's action and reward trajectories. Here we propose a
two-stage method that overcomes these limitations by relying on noisy,
unaligned footage without access to such data. First, we learn to map unaligned
videos from multiple sources to a common representation using self-supervised
objectives constructed over both time and modality (i.e. vision and sound).
Second, we embed a single YouTube video in this representation to construct a
reward function that encourages an agent to imitate human gameplay. This method
of one-shot imitation allows our agent to convincingly exceed human-level
performance on the infamously hard exploration games Montezuma's Revenge,
Pitfall! and Private Eye for the first time, even if the agent is not presented
with any environment rewards.
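The second stage, turning one embedded demonstration video into a reward function, might be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the embedding is taken as given, and the checkpoint stride, the cosine-similarity test, the threshold and the reward value are all hypothetical choices.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


class CheckpointReward:
    """Rewards the agent for reaching demonstration checkpoints in order."""

    def __init__(self, demo_embeddings, stride=2, threshold=0.9):
        # Keep every `stride`-th demonstration frame as a checkpoint (assumed).
        self.checkpoints = demo_embeddings[::stride]
        self.threshold = threshold
        self.next_index = 0

    def __call__(self, agent_embedding):
        if self.next_index >= len(self.checkpoints):
            return 0.0  # demonstration fully matched, nothing left to reward
        target = self.checkpoints[self.next_index]
        if cosine_similarity(agent_embedding, target) >= self.threshold:
            self.next_index += 1
            return 1.0
        return 0.0


# Toy 2-D embeddings standing in for the learned representation.
demo = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
reward_fn = CheckpointReward(demo, stride=2, threshold=0.9)
agent_frames = [[1.0, 0.05], [0.5, 0.5], [0.05, 1.0]]
rewards = [reward_fn(e) for e in agent_frames]
print(rewards)
```

Because the reward depends only on embeddings, the sketch needs neither the demonstrator's actions nor the environment's own reward signal, which matches the setting the abstract describes.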
Facial Landmark Machines: A Backbone-Branches Architecture with Progressive Representation Learning
Facial landmark localization plays a critical role in face recognition and
analysis. In this paper, we propose a novel cascaded backbone-branches fully
convolutional neural network (BB-FCN) for rapidly and accurately localizing
facial landmarks in unconstrained and cluttered settings. Our proposed BB-FCN
generates facial landmark response maps directly from raw images without any
preprocessing. BB-FCN follows a coarse-to-fine cascaded pipeline, which
consists of a backbone network for roughly detecting the locations of all
facial landmarks and one branch network for each type of detected landmark for
further refining their locations. Furthermore, to facilitate the facial
landmark localization under unconstrained settings, we propose a large-scale
benchmark named SYSU16K, which contains 16000 faces with large variations in
pose, expression, illumination and resolution. Extensive experimental
evaluations demonstrate that our proposed BB-FCN can significantly outperform
the state-of-the-art under both constrained (i.e., within detected facial
regions only) and unconstrained settings. We further confirm that high-quality
facial landmarks localized with our proposed network can also improve the
precision and recall of face detection.
Rainfall Advection using Velocimetry by Multiresolution Viscous Alignment
An algorithm to estimate motion from satellite imagery is presented. Dense
displacement fields are computed from time-separated images of significant
convective activity using a Bayesian formulation of the motion estimation
problem. Ordinarily this motion estimation problem is ill-posed; there are far
more degrees of freedom than necessary to represent the motion. Therefore,
some form of regularization becomes necessary and by imposing smoothness and
non-divergence as desirable properties of the estimated displacement vector
field, excellent solutions are obtained. Our approach provides a marked
improvement over other methods in conventional use. In contrast to correlation
based approaches, the displacement fields produced by our method are dense,
spatial consistency of the displacement vector field is implicit, and
higher-order and small-scale deformations can be easily handled. In contrast
with optic-flow algorithms, we can produce solutions at large separations of
mesoscale features between large time-steps or where the deformation is rapidly
evolving.
What is a salient object? A dataset and a baseline model for salient object detection
Salient object detection or salient region detection models, diverging from
fixation prediction models, have traditionally been dealing with locating and
segmenting the most salient object or region in a scene. While the notion of
most salient object is sensible when multiple objects exist in a scene, current
datasets for evaluation of saliency detection approaches often have scenes with
only one single object. We introduce three main contributions in this paper:
First, we take an in-depth look at the problem of salient object detection by
studying the relationship between where people look in scenes and what they
choose as the most salient object when they are explicitly asked. Based on the
agreement between fixations and saliency judgments, we then suggest that the
most salient object is the one that attracts the highest fraction of fixations.
Second, we provide two new less biased benchmark datasets containing scenes
with multiple objects that challenge existing saliency models. Indeed, we
observed a severe drop in performance of 8 state-of-the-art models on our
datasets (40% to 70%). Third, we propose a very simple yet powerful model based
on superpixels to be used as a baseline for model evaluation and comparison.
While on par with the best models on MSRA-5K dataset, our model wins over other
models on our data highlighting a serious drawback of existing models, which is
conflating the processes of locating the most salient object and its
segmentation. We also provide a review and statistical analysis of some labeled
scene datasets that can be used for evaluating salient object detection models.
We believe that our work can greatly help remedy the over-fitting of models to
existing biased datasets and open new avenues for future research in this
fast-evolving field.
Comment: IEEE Transactions on Image Processing, 201
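The proposed definition, that the most salient object is the one attracting the highest fraction of fixations, can be made concrete with a small sketch. It assumes objects are given as axis-aligned bounding boxes and fixations as (x, y) points; both representations and all names are illustrative assumptions, not details taken from the paper.

```python
def fixation_fractions(fixations, objects):
    """Fraction of all fixations that fall inside each object's box."""
    total = len(fixations)
    fractions = {}
    for name, (x0, y0, x1, y1) in objects.items():
        hits = sum(1 for x, y in fixations if x0 <= x <= x1 and y0 <= y <= y1)
        fractions[name] = hits / total if total else 0.0
    return fractions


def most_salient_object(fixations, objects):
    """Name of the object that attracts the largest share of fixations."""
    fractions = fixation_fractions(fixations, objects)
    return max(fractions, key=fractions.get)


# Hypothetical scene: two objects as (x0, y0, x1, y1) boxes, five fixations.
objects = {"dog": (0, 0, 10, 10), "ball": (20, 20, 30, 30)}
fixations = [(2, 3), (5, 5), (25, 22), (8, 1), (50, 50)]

fractions = fixation_fractions(fixations, objects)
winner = most_salient_object(fixations, objects)
print(fractions, winner)
```

Fixations that land on no object, such as the last one here, still count toward the total, so the fractions need not sum to one.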
Nasal Patches and Curves for Expression-robust 3D Face Recognition
The potential of the nasal region for expression robust 3D face recognition
is thoroughly investigated by a novel five-step algorithm. First, the nose tip
location is coarsely detected and the face is segmented, aligned and the nasal
region cropped. Then, a very accurate and consistent nasal landmarking
algorithm detects seven keypoints on the nasal region. In the third step, a
feature extraction algorithm based on the surface normals of Gabor-wavelet
filtered depth maps is utilised and, then, a set of spherical patches and
curves are localised over the nasal region to provide the feature descriptors.
The last step applies a genetic algorithm-based feature selector to detect the
most stable patches and curves over different facial expressions. The algorithm
provides the highest reported nasal region-based recognition ranks on the FRGC,
Bosphorus and BU-3DFE datasets. The results are comparable with, and in many
cases better than, many state-of-the-art 3D face recognition algorithms, which
use the whole facial domain. The proposed method does not rely on sophisticated
alignment or denoising steps, is very robust when only one sample per subject
is used in the gallery, and does not require a training step for the
landmarking algorithm. https://github.com/mehryaragha/NoseBiometric
Hand-guided 3D surface acquisition by combining simple light sectioning with real-time algorithms
Precise 3D measurements of rigid surfaces are desired in many fields of
application like quality control or surgery. Often, views from all around the
object have to be acquired for a full 3D description of the object surface. We
present a sensor principle called "Flying Triangulation" which avoids an
elaborate "stop-and-go" procedure. It combines a low-cost classical
light-section sensor with an algorithmic pipeline. A hand-guided sensor
captures a continuous movie of 3D views while being moved around the object.
The views are automatically aligned and the acquired 3D model is displayed in
real time. In contrast to most existing sensors no bandwidth is wasted for
spatial or temporal encoding of the projected lines. Nor is an expensive color
camera necessary for 3D acquisition. The achievable measurement uncertainty and
lateral resolution of the generated 3D data are limited only by physics. An
alternating projection of vertical and horizontal lines guarantees the
existence of corresponding points in successive 3D views. This enables a
precise registration without surface interpolation. For registration, a variant
of the iterative closest point algorithm - adapted to the specific nature of
our 3D views - is introduced. Furthermore, data reduction and smoothing without
losing lateral resolution as well as the acquisition and mapping of a color
texture are presented. The precision and applicability of the sensor are
demonstrated by simulation and measurement results.
Comment: 19 pages, 22 figures
Fast Localization of Facial Landmark Points
Localization of salient facial landmark points, such as eye corners or the
tip of the nose, is still considered a challenging computer vision problem
despite recent efforts. This is especially evident in unconstrained
environments, i.e., in the presence of background clutter and large head pose
variations. Most methods that achieve state-of-the-art accuracy are slow, and,
thus, have limited applications. We describe a method that can accurately
estimate the positions of relevant facial landmarks in real-time even on
hardware with limited processing power, such as mobile devices. This is
achieved with a sequence of estimators based on ensembles of regression trees.
The trees use simple pixel intensity comparisons in their internal nodes and
this makes them able to process image regions very fast. We test the developed
system on several publicly available datasets and analyse its processing speed
on various devices. Experimental results show that our method has practical
value.
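A sequence of tree-based estimators whose internal nodes compare pixel intensities can be sketched in a few lines. The nested-dictionary tree layout, the fixed pixel coordinates and the leaf offsets below are hypothetical; a trained system would learn the tests and offsets from data and would typically index pixels relative to the current landmark estimate.

```python
def pixel_test(image, p1, p2):
    """Binary node test: is the pixel at p1 brighter than the pixel at p2?"""
    (r1, c1), (r2, c2) = p1, p2
    return image[r1][c1] > image[r2][c2]


def tree_predict(image, node):
    """Walk a tree of nested dicts until a leaf holding an offset is reached."""
    while "offset" not in node:
        branch = "right" if pixel_test(image, node["p1"], node["p2"]) else "left"
        node = node[branch]
    return node["offset"]


def refine_landmark(image, start, trees):
    """Apply the trees in sequence, each one nudging the current estimate."""
    x, y = start
    for tree in trees:
        dx, dy = tree_predict(image, tree)
        x, y = x + dx, y + dy
    return x, y


# Toy 2x2 grayscale image and two hand-written depth-1 trees (assumed values).
image = [[10, 200], [50, 90]]
tree_a = {"p1": (0, 0), "p2": (0, 1),
          "left": {"offset": (1.0, 0.0)}, "right": {"offset": (-1.0, 0.0)}}
tree_b = {"p1": (1, 1), "p2": (1, 0),
          "left": {"offset": (0.0, 0.5)}, "right": {"offset": (0.0, -0.5)}}

landmark = refine_landmark(image, (3.0, 4.0), [tree_a, tree_b])
print(landmark)
```

Each node costs two memory reads and one comparison, which is why such trees stay fast on hardware with limited processing power.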
Deep Visual Attention Prediction
In this work, we aim to predict human eye fixation with view-free scenes
based on an end-to-end deep learning architecture. Although Convolutional
Neural Networks (CNNs) have made substantial improvement on human attention
prediction, CNN-based attention models can still be improved by
efficiently leveraging multi-scale features. Our visual attention network is
proposed to capture hierarchical saliency information from deep, coarse layers
with global saliency information to shallow, fine layers with local saliency
response. Our model is based on a skip-layer network structure, which predicts
human attention from multiple convolutional layers with various receptive
fields. Final saliency prediction is achieved via the cooperation of those
global and local predictions. Our model is learned in a deep supervision
manner, where supervision is directly fed into multi-level layers, instead of
previous approaches of providing supervision only at the output layer and
propagating this supervision back to earlier layers. Our model thus
incorporates multi-level saliency predictions within a single network, which
significantly decreases the redundancy of previous approaches of learning
multiple network streams with different input scales. Extensive experimental
analysis on various challenging benchmark datasets demonstrates that our method
yields state-of-the-art performance with competitive inference time.
Comment: W. Wang and J. Shen. Deep visual attention prediction. IEEE TIP,
27(5):2368-2378, 2018. Code and results can be found in
https://github.com/wenguanwang/deepattentio
Attended End-to-end Architecture for Age Estimation from Facial Expression Videos
The main challenges of age estimation from facial expression videos lie not
only in the modeling of the static facial appearance, but also in the capturing
of the temporal facial dynamics. Traditional approaches to this problem focus
on constructing handcrafted features to explore the discriminative information
contained in facial appearance and dynamics separately. This relies on
sophisticated feature-refinement and framework-design. In this paper, we
present an end-to-end architecture for age estimation, called Spatially-Indexed
Attention Model (SIAM), which is able to simultaneously learn both the
appearance and dynamics of age from raw videos of facial expressions.
Specifically, we employ convolutional neural networks to extract effective
latent appearance representations and feed them into recurrent networks to
model the temporal dynamics. More importantly, we propose to leverage attention
models for salience detection in both the spatial domain for each single image
and the temporal domain for the whole video as well. We design a specific
spatially-indexed attention mechanism among the convolutional layers to extract
the salient facial regions in each individual image, and a temporal attention
layer to assign attention weights to each frame. This two-pronged approach not
only improves the performance by allowing the model to focus on informative
frames and facial areas, but it also offers an interpretable correspondence
between the spatial facial regions as well as temporal frames, and the task of
age estimation. We demonstrate the strong performance of our model in
experiments on a large, gender-balanced database with 400 subjects with ages
spanning from 8 to 76 years. Experiments reveal that our model exhibits
significant superiority over the state-of-the-art methods given sufficient
training data.
Comment: Accepted by Transactions on Image Processing (TIP)
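The temporal attention layer, which assigns a weight to each frame before the video is summarized, can be illustrated with a generic softmax-attention sketch. It assumes the per-frame scores are already available (in the described model they would come from learned layers) and shows only the weighting and pooling step; the names and numbers are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax: positive weights that sum to one."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_over_frames(frame_features, frame_scores):
    """Weight each frame's feature vector by its attention weight and sum."""
    weights = softmax(frame_scores)
    dim = len(frame_features[0])
    pooled = [sum(w * f[c] for w, f in zip(weights, frame_features))
              for c in range(dim)]
    return pooled, weights


# Three frames with two-dimensional features; the middle frame scores highest.
frame_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
frame_scores = [0.0, 2.0, 0.0]  # assumed scores, not model outputs

video_feature, attention = attend_over_frames(frame_features, frame_scores)
print(attention)
print(video_feature)
```

The attention weights double as an interpretability signal: inspecting them shows which frames contributed most to the pooled video representation.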
- …