Large-Scale Long-Tailed Recognition in an Open World
Real-world data often have a long-tailed and open-ended distribution. A
practical recognition system must classify among majority and minority classes,
generalize from a few known instances, and acknowledge novelty upon
encountering a never-seen instance. We define Open Long-Tailed Recognition
(OLTR) as learning from such naturally distributed data and optimizing
classification accuracy over a balanced test set that includes head, tail, and
open classes. OLTR must
handle imbalanced classification, few-shot learning, and open-set recognition
handle imbalanced classification, few-shot learning, and open-set recognition
in one integrated algorithm, whereas existing classification approaches focus
only on one aspect and deliver poorly over the entire class spectrum. The key
challenges are how to share visual knowledge between head and tail classes and
how to reduce confusion between tail and open classes. We develop an integrated
OLTR algorithm that maps an image to a feature space such that visual concepts
can easily relate to each other based on a learned metric that respects the
closed-world classification while acknowledging the novelty of the open world.
Our so-called dynamic meta-embedding combines a direct image feature and an
associated memory feature, with the feature norm indicating familiarity with
known classes. On three large-scale OLTR datasets we curate from object-centric
ImageNet, scene-centric Places, and face-centric MS1M data, our method
consistently outperforms the state-of-the-art. Our code, datasets, and models
enable future OLTR research and are publicly available at
https://liuziwei7.github.io/projects/LongTail.html.
Comment: To appear in CVPR 2019 as an oral presentation. Code, datasets, and
models are available at https://liuziwei7.github.io/projects/LongTail.html
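To make the dynamic meta-embedding concrete, here is a minimal PyTorch sketch of the idea as stated in the abstract: a direct feature is combined with a memory feature built from learned class centroids, and a reachability term derived from centroid distances down-weights unfamiliar (open-set) inputs. The names (`DynamicMetaEmbedding`, `concept_selector`) and the exact gating are illustrative assumptions, not the authors' released code.

```python
# Hypothetical sketch of a dynamic meta-embedding layer (not the OLTR code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicMetaEmbedding(nn.Module):
    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        # One memory slot (centroid) per known class.
        self.centroids = nn.Parameter(torch.randn(num_classes, feat_dim))
        # Lightweight gate deciding how much memory to mix in (assumption).
        self.concept_selector = nn.Linear(feat_dim, feat_dim)

    def forward(self, v_direct: torch.Tensor) -> torch.Tensor:
        # Attention of the direct feature over class centroids -> memory feature.
        attn = F.softmax(v_direct @ self.centroids.t(), dim=1)   # (B, C)
        v_memory = attn @ self.centroids                          # (B, D)
        # "Reachability": inverse distance to the nearest centroid; a small
        # value flags unfamiliar (open-set) inputs and shrinks the embedding.
        dist = torch.cdist(v_direct, self.centroids).min(dim=1).values
        reachability = (1.0 / (dist + 1e-6)).unsqueeze(1)
        gate = torch.tanh(self.concept_selector(v_direct))
        return reachability * (v_direct + gate * v_memory)

x = torch.randn(4, 128)                    # direct image features
emb = DynamicMetaEmbedding(128, 1000)(x)   # meta-embeddings, shape (4, 128)
```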
Dynamic Filtering with Large Sampling Field for ConvNets
We propose a dynamic filtering strategy with a large sampling field for
ConvNets (LS-DFN), where position-specific kernels learn not only from the
same position but also from multiple sampled neighbor regions. During sampling,
residual learning is introduced to ease training and an attention mechanism is
applied to fuse features from different samples. Such multiple samples enlarge
the kernels' receptive fields significantly without requiring more parameters.
While LS-DFN inherits the advantages of DFN, namely avoiding feature-map
blurring through position-wise kernels while keeping translation invariance, it
also efficiently alleviates the overfitting caused by having far more
parameters
than normal CNNs. Our model is efficient and can be trained end-to-end via
standard back-propagation. We demonstrate the merits of our LS-DFN on both
sparse and dense prediction tasks involving object detection, semantic
segmentation, and flow estimation. Our results show that LS-DFN enjoys stronger
recognition abilities in object detection and semantic segmentation on the
VOC benchmark, and sharper responses in flow estimation on the FlyingChairs
dataset, compared to strong baselines.
Comment: ECCV 2018
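A heavily simplified sketch of the sampling idea, under the assumption that the position-specific kernels reduce to channel-wise 1x1 weights and neighbor sampling to a few fixed shifts; the attention fusion and residual connection follow the abstract, everything else is illustrative.

```python
# Toy rendering of the large-sampling-field idea (not the LS-DFN code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyLSDFN(nn.Module):
    def __init__(self, channels, shifts=((0, 0), (2, 0), (0, 2), (-2, 0), (0, -2))):
        super().__init__()
        self.shifts = shifts
        # Branch predicting a per-position channel-wise dynamic kernel.
        self.kernel_gen = nn.Conv2d(channels, channels, 3, padding=1)
        # Branch predicting attention over the sampled neighbor positions.
        self.attn = nn.Conv2d(channels, len(shifts), 3, padding=1)

    def forward(self, x):
        kernels = self.kernel_gen(x)                       # (B, C, H, W)
        attn = F.softmax(self.attn(x), dim=1)              # (B, S, H, W)
        out = 0.0
        for i, (dy, dx) in enumerate(self.shifts):
            # Kernel sampled at a shifted neighbor position, applied here.
            shifted = torch.roll(kernels, shifts=(dy, dx), dims=(2, 3))
            out = out + attn[:, i:i + 1] * (shifted * x)
        return x + out                                     # residual eases training

y = TinyLSDFN(16)(torch.randn(2, 16, 32, 32))
```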
Hierarchical Feature Embedding for Attribute Recognition
Attribute recognition is a crucial but challenging task due to viewpoint
changes, illumination variations, appearance diversities, etc. Most previous
work considers only attribute-level feature embedding, which can perform
poorly in complicated heterogeneous conditions. To address this
problem, we propose a hierarchical feature embedding (HFE) framework, which
learns a fine-grained feature embedding by combining attribute and ID
information. In HFE, we maintain the inter-class and intra-class feature
embedding simultaneously. Not only samples with the same attribute but also
samples with the same ID are gathered more closely, which constrains the
feature embedding of visually hard samples with respect to attributes and
improves robustness to varying conditions. We establish this hierarchical
structure with an HFE loss consisting of attribute-level and ID-level
constraints. We also introduce an absolute boundary regularization and a
dynamic loss weight as supplementary components to help build up the feature
embedding. Experiments show that our method achieves state-of-the-art
results on two pedestrian attribute datasets and a facial attribute dataset.
Comment: CVPR 2020
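The hierarchical constraint lends itself to a triplet-style rendering: same-ID pairs are pulled closer than same-attribute pairs, which in turn are pulled closer than pairs differing in the attribute, plus an absolute boundary term. The margins and exact form below are assumptions, not the paper's loss.

```python
# Hedged sketch of a hierarchical (attribute + ID) embedding loss.
import torch
import torch.nn.functional as F

def hfe_like_loss(a, p_id, p_attr, n, m_id=0.2, m_attr=0.4, bound=0.5):
    """a/p_id share the ID; a/p_attr share only the attribute; n differs."""
    d = lambda u, v: F.pairwise_distance(u, v)
    d_id, d_attr, d_neg = d(a, p_id), d(a, p_attr), d(a, n)
    id_level   = F.relu(d_id - d_attr + m_id).mean()    # same ID closest
    attr_level = F.relu(d_attr - d_neg + m_attr).mean() # same attribute next
    abs_reg    = F.relu(d_id - bound).mean()            # absolute boundary term
    return id_level + attr_level + abs_reg

feats = [torch.randn(8, 64) for _ in range(4)]   # anchor, pos-ID, pos-attr, neg
loss = hfe_like_loss(*feats)
```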
Attended End-to-end Architecture for Age Estimation from Facial Expression Videos
The main challenges of age estimation from facial expression videos lie not
only in the modeling of the static facial appearance, but also in the capturing
of the temporal facial dynamics. Traditional techniques for this problem focus
on constructing handcrafted features to explore the discriminative information
contained in facial appearance and dynamics separately, which relies on
sophisticated feature refinement and framework design. In this paper, we
present an end-to-end architecture for age estimation, called Spatially-Indexed
Attention Model (SIAM), which is able to simultaneously learn both the
appearance and dynamics of age from raw videos of facial expressions.
Specifically, we employ convolutional neural networks to extract effective
latent appearance representations and feed them into recurrent networks to
model the temporal dynamics. More importantly, we propose to leverage attention
models for salience detection in both the spatial domain, for each single
image, and the temporal domain, for the whole video. We design a specific
spatially-indexed attention mechanism among the convolutional layers to extract
the salient facial regions in each individual image, and a temporal attention
layer to assign attention weights to each frame. This two-pronged approach not
only improves performance by allowing the model to focus on informative
frames and facial areas, but also offers an interpretable correspondence
between spatial facial regions and temporal frames on the one hand, and the
task of
age estimation. We demonstrate the strong performance of our model in
experiments on a large, gender-balanced database with 400 subjects with ages
spanning from 8 to 76 years. Experiments reveal that our model exhibits
significant superiority over the state-of-the-art methods given sufficient
training data.
Comment: Accepted by Transactions on Image Processing (TIP)
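A minimal sketch of the two-pronged attention described above: a per-pixel spatial map over CNN features for each frame, then per-frame temporal weights over recurrent states before regression. Layer choices and sizes are illustrative assumptions, not the SIAM architecture itself.

```python
# Toy spatially-indexed + temporal attention pipeline (sizes are assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinySIAM(nn.Module):
    def __init__(self, c=32, d=64):
        super().__init__()
        self.cnn = nn.Conv2d(3, c, 3, padding=1)
        self.spatial_attn = nn.Conv2d(c, 1, 1)     # per-pixel salience
        self.rnn = nn.GRU(c, d, batch_first=True)
        self.temporal_attn = nn.Linear(d, 1)       # per-frame salience
        self.head = nn.Linear(d, 1)                # age regressor

    def forward(self, video):                      # (B, T, 3, H, W)
        B, T = video.shape[:2]
        f = self.cnn(video.flatten(0, 1))          # (B*T, C, H, W)
        a = torch.sigmoid(self.spatial_attn(f))    # spatial attention map
        frame = (f * a).mean(dim=(2, 3)).view(B, T, -1)
        h, _ = self.rnn(frame)                     # temporal dynamics
        w = F.softmax(self.temporal_attn(h), dim=1)
        return self.head((w * h).sum(dim=1)).squeeze(1)

age = TinySIAM()(torch.randn(2, 5, 3, 32, 32))    # predicted ages, shape (2,)
```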
Dynamic Curriculum Learning for Imbalanced Data Classification
Human attribute analysis is a challenging task in the field of computer
vision, since the data are largely imbalanced. Common techniques such as
re-sampling and cost-sensitive learning require prior knowledge to train the
system. To address this problem, we propose a unified framework called Dynamic
Curriculum Learning (DCL) to adaptively adjust the sampling strategy and loss
weighting online within a single batch, resulting in better generalization and
discrimination. Inspired by curriculum learning, DCL consists of two curriculum
schedulers: (1) a sampling scheduler that manages the data distribution not
only from imbalanced to balanced but also from easy to hard; (2) a loss
scheduler that controls the learning importance between the classification and
metric-learning losses. With these two schedulers, our DCL framework achieves
new state-of-the-art performance on the widely used face attribute dataset
CelebA and the pedestrian attribute dataset RAP.
1D-Convolutional Capsule Network for Hyperspectral Image Classification
Recently, convolutional neural networks (CNNs) have achieved excellent
performance in many computer vision tasks. For hyperspectral image (HSI)
classification in particular, CNNs often require very complex structures due
to the high dimensionality of HSIs, which makes training prohibitively
expensive. Moreover, labeled samples are commonly scarce in HSI
classification, which degrades the accuracy of CNNs. In this work, we develop
an easy-to-implement capsule network to alleviate these problems: the
1D-convolution capsule network (1D-ConvCapsNet). First, 1D-ConvCapsNet
extracts spatial and spectral information separately in the spatial and
spectral domains, which is more lightweight than 3D convolution because it
requires fewer parameters. Second,
1D-ConvCapsNet utilizes the capsule-wise constraint window method to reduce
the parameter count and computational complexity of the conventional capsule
network.
Finally, 1D-ConvCapsNet obtains accurate predictions for input samples via
dynamic routing. The effectiveness of 1D-ConvCapsNet is verified on three
representative HSI datasets. Experimental results demonstrate that
1D-ConvCapsNet is superior to state-of-the-art methods in both accuracy and
training effort.
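The dynamic routing step the abstract relies on is the standard routing-by-agreement of capsule networks; a compact sketch follows, with dimensions chosen for illustration.

```python
# Routing-by-agreement (Sabour et al.), which 1D-conv capsules build upon.
import torch
import torch.nn.functional as F

def squash(s, dim=-1):
    n2 = (s ** 2).sum(dim=dim, keepdim=True)
    return (n2 / (1 + n2)) * s / torch.sqrt(n2 + 1e-8)

def dynamic_routing(u_hat, iters=3):
    """u_hat: (B, in_caps, out_caps, D) prediction vectors."""
    b = torch.zeros(u_hat.shape[:3], device=u_hat.device)   # routing logits
    for _ in range(iters):
        c = F.softmax(b, dim=2).unsqueeze(-1)                # coupling coeffs
        v = squash((c * u_hat).sum(dim=1))                   # (B, out_caps, D)
        b = b + (u_hat * v.unsqueeze(1)).sum(dim=-1)         # agreement update
    return v

v = dynamic_routing(torch.randn(2, 32, 10, 16))              # class capsules
```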
Dynamic Computational Time for Visual Attention
We propose a dynamic computational time model to accelerate the average
processing time of the recurrent visual attention model (RAM). Rather than
attending for a fixed number of steps on each input image, the model learns to
decide when to stop on the fly. To achieve this, we add an additional
continue/stop action per time step to RAM and use reinforcement learning to
learn both the optimal attention policy and the stopping policy. The
modification is simple but can dramatically reduce the average computational
time while keeping the same recognition performance as RAM. Experimental
results on the CUB-200-2011 and Stanford Cars datasets demonstrate that the
dynamic computational time model works effectively for fine-grained image
recognition.
The source code of this paper can be obtained from
https://github.com/baidu-research/DT-RA
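A sketch of the added continue/stop decision: a stop head emits a Bernoulli halt action at each glimpse, trained with REINFORCE alongside the attention policy. The reward shaping noted in the comments is an assumption.

```python
# Hypothetical stop head for a RAM-style loop (not the DT-RAM code).
import torch
import torch.nn as nn

class StopHead(nn.Module):
    def __init__(self, hidden=256):
        super().__init__()
        self.fc = nn.Linear(hidden, 1)

    def forward(self, h):
        p_stop = torch.sigmoid(self.fc(h)).squeeze(-1)
        dist = torch.distributions.Bernoulli(probs=p_stop)
        a = dist.sample()                   # 1 = stop now, 0 = keep attending
        return a, dist.log_prob(a)

h = torch.randn(4, 256)                     # recurrent state after a glimpse
action, logp = StopHead()(h)
# REINFORCE (assumed shaping): reward = classification correctness minus a
# per-step time cost; the loss contribution would be -(reward * logp).mean()
```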
Pixel-wise Attentional Gating for Parsimonious Pixel Labeling
To achieve parsimonious inference in per-pixel labeling tasks with a limited
computational budget, we propose a Pixel-wise Attentional Gating unit
(PAG) that learns to selectively process a subset of spatial locations
at each layer of a deep convolutional network. PAG is a generic,
architecture-independent, problem-agnostic mechanism that can be readily
"plugged in" to an existing model with fine-tuning. We utilize PAG in two ways:
1) learning spatially varying pooling fields that improve model performance
without the extra computation cost associated with multi-scale pooling, and 2)
learning a dynamic computation policy for each pixel to decrease total
computation while maintaining accuracy.
We extensively evaluate PAG on a variety of per-pixel labeling tasks,
including semantic segmentation, boundary detection, monocular depth and
surface normal estimation. We demonstrate that PAG allows competitive or
state-of-the-art performance on these tasks. Our experiments show that PAG
learns dynamic spatial allocation of computation over the input image which
provides better performance trade-offs compared to related approaches (e.g.,
truncating deep models or dynamically skipping whole layers). Generally, we
observe that PAG can reduce computation without noticeable loss in accuracy,
and performance degrades gracefully when imposing stronger computational
constraints.
Comment: https://www.ics.uci.edu/~skong2/PAG.htm
Self-Attention Capsule Networks for Object Classification
We propose a novel architecture for object classification, called
Self-Attention Capsule Networks (SACN). SACN is the first model that
incorporates the Self-Attention mechanism as an integral layer within the
Capsule Network (CapsNet). While the Self-Attention mechanism captures
long-range dependencies, selecting the more dominant image regions
to focus on, the CapsNet analyzes the relevant features and their spatial
correlations inside these regions only. The features are extracted in the
convolutional layer. Then, the Self-Attention layer learns to suppress
irrelevant regions based on features analysis and highlights salient features
useful for a specific task. The attention map is then fed into the CapsNet
primary layer that is followed by a classification layer. The proposed SACN
model was designed to solve two main limitations of the baseline CapsNet -
analysis of complex data and significant computational load. In this work, we
use a shallow CapsNet architecture and compensate for the absence of a deeper
network by using the Self-Attention module to significantly improve the
results. The proposed Self-Attention CapsNet architecture was extensively
evaluated on six different datasets: three medical datasets in addition to the
natural-image MNIST, SVHN, and CIFAR10. The model was able to classify
images and their patches with diverse and complex backgrounds better than the
baseline CapsNet. As a result, the proposed Self-Attention CapsNet
significantly improved classification performance within and across the
different datasets and outperformed the baseline CapsNet, ResNet-18, and
DenseNet-40, not only in classification accuracy but also in robustness.
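The composition described above can be sketched as a SAGAN-style self-attention layer over conv features whose output would feed the primary capsule layer; the channel sizes and the residual gamma gate are illustrative assumptions.

```python
# Self-attention over conv features, in the style used before capsule layers.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention2d(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.q = nn.Conv2d(c, c // 8, 1)
        self.k = nn.Conv2d(c, c // 8, 1)
        self.v = nn.Conv2d(c, c, 1)
        self.gamma = nn.Parameter(torch.zeros(1))   # learned residual gate

    def forward(self, x):
        B, C, H, W = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)    # (B, HW, C//8)
        k = self.k(x).flatten(2)                    # (B, C//8, HW)
        attn = F.softmax(q @ k, dim=-1)             # long-range dependencies
        v = self.v(x).flatten(2)                    # (B, C, HW)
        out = (v @ attn.transpose(1, 2)).view(B, C, H, W)
        return x + self.gamma * out

feats = torch.randn(2, 64, 14, 14)
attended = SelfAttention2d(64)(feats)   # would feed the primary capsule layer
```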
Embryo staging with weakly-supervised region selection and dynamically-decoded predictions
To optimize clinical outcomes, fertility clinics must strategically select
which embryos to transfer. Common selection heuristics are formulas expressed
in terms of the durations required to reach various developmental milestones,
quantities historically annotated manually by experienced embryologists based
on time-lapse EmbryoScope videos. We propose a new method for automatic embryo
staging that exploits several sources of structure in this time-lapse data.
First, noting that in each image the embryo occupies a small subregion, we
jointly train a region proposal network with the downstream classifier to
isolate the embryo. Notably, because we lack ground-truth bounding boxes, we
weakly supervise the region proposal network, optimizing its parameters via
reinforcement learning to improve the downstream classifier's loss. Moreover,
noting that embryos reaching the blastocyst stage progress monotonically
through earlier stages, we develop a dynamic-programming-based decoder that
post-processes our predictions to select the most likely monotonic sequence of
developmental stages. Our methods outperform vanilla residual networks and
rival the best numbers in contemporary papers, as measured by both per-frame
accuracy and transition prediction error, despite operating on less data
than many competing methods.
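The monotonic decoding step is a small dynamic program: given per-frame stage log-probabilities, recover the highest-scoring non-decreasing stage sequence. The scoring details may differ from the paper's; this is the generic DP.

```python
# Generic DP decoder for a monotonically non-decreasing stage sequence.
import numpy as np

def monotonic_decode(log_probs):
    """log_probs: (T, S) per-frame stage scores. Returns one stage per frame."""
    T, S = log_probs.shape
    dp = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    dp[0] = log_probs[0]
    for t in range(1, T):
        # Best predecessor stage <= current stage (monotonic progression).
        best = np.maximum.accumulate(dp[t - 1])
        arg = np.zeros(S, dtype=int)
        for s in range(1, S):
            arg[s] = s if dp[t - 1, s] >= dp[t - 1, arg[s - 1]] else arg[s - 1]
        dp[t] = best + log_probs[t]
        back[t] = arg
    path = [int(dp[-1].argmax())]
    for t in range(T - 1, 0, -1):           # backtrack through predecessors
        path.append(back[t, path[-1]])
    return path[::-1]

stages = monotonic_decode(np.log(np.random.dirichlet(np.ones(4), size=10)))
```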