Gradually Updated Neural Networks for Large-Scale Image Recognition
Depth is one of the keys that make neural networks succeed in the task of
large-scale image recognition. The state-of-the-art network architectures
usually increase the depths by cascading convolutional layers or building
blocks. In this paper, we present an alternative method to increase the depth.
Our method introduces computation orderings to the channels within
convolutional layers or blocks, based on which we gradually compute the outputs
in a channel-wise manner. The added orderings not only increase the depths and
the learning capacities of the networks without any additional computation
costs, but also eliminate the overlap singularities so that the networks
converge faster and perform better. Experiments show that the networks
based on our method achieve state-of-the-art performance on the CIFAR and
ImageNet datasets.
BoxSup: Exploiting Bounding Boxes to Supervise Convolutional Networks for Semantic Segmentation
Recent leading approaches to semantic segmentation rely on deep convolutional
networks trained with human-annotated, pixel-level segmentation masks. Such
pixel-accurate supervision demands expensive labeling effort and limits the
performance of deep networks that usually benefit from more training data. In
this paper, we propose a method that achieves competitive accuracy but only
requires easily obtained bounding box annotations. The basic idea is to iterate
between automatically generating region proposals and training convolutional
networks. These two steps gradually recover segmentation masks for improving
the networks, and vice versa. Our method, called BoxSup, produces competitive
results supervised by boxes only, on par with strong baselines fully supervised
by masks under the same setting. By leveraging a large amount of bounding
boxes, BoxSup further unleashes the power of deep convolutional networks and
yields state-of-the-art results on PASCAL VOC 2012 and PASCAL-CONTEXT.
Human Pose Estimation with Spatial Contextual Information
We explore the importance of spatial contextual information in human pose
estimation. Most state-of-the-art pose networks are trained in a multi-stage
manner and produce several auxiliary predictions for deep supervision. With
this principle, we present two conceptually simple yet computationally
efficient modules, namely Cascade Prediction Fusion (CPF) and Pose Graph Neural
Network (PGNN), to exploit underlying contextual information. Cascade
prediction fusion accumulates prediction maps from previous stages to extract
informative signals. The resulting maps also function as a prior to guide
prediction at following stages. To promote spatial correlation among joints,
our PGNN learns a structured representation of human pose as a graph. Direct
message passing between different joints is enabled and spatial relation is
captured. These two modules add very little computational overhead.
Experimental results demonstrate that our method consistently outperforms
previous methods on the MPII and LSP benchmarks.
Deep Clustering with a Dynamic Autoencoder: From Reconstruction towards Centroids Construction
In unsupervised learning, there is no apparent straightforward cost function
that can capture the significant factors of variations and similarities. Since
natural systems have smooth dynamics, an opportunity is lost if an unsupervised
objective function remains static during the training process. The absence of
concrete supervision suggests that smooth dynamics should be integrated.
Compared to classical static cost functions, dynamic objective functions make
better use of the gradual and uncertain knowledge acquired through
pseudo-supervision. In this paper, we propose Dynamic Autoencoder (DynAE), a
novel model for deep clustering that overcomes a clustering-reconstruction
trade-off, by gradually and smoothly eliminating the reconstruction objective
function in favor of a construction one. Experimental evaluations on benchmark
datasets show that our approach achieves state-of-the-art results compared to
the most relevant deep clustering methods.
Learning to Segment Human by Watching YouTube
An intuition about human segmentation is that when a human is moving in a video,
the video-context (e.g., appearance and motion cues) can be used to infer
reasonable mask information for the whole human body. Inspired by this, based
on popular deep convolutional neural networks (CNN), we explore a very-weakly
supervised learning framework for the human segmentation task, where only an
imperfect human detector is available along with massive weakly-labeled YouTube
videos. In our solution, the video-context guided human mask inference and CNN
based segmentation network learning iterate to mutually enhance each other
until no further improvement is gained. In the first step, each video is decomposed
into supervoxels by the unsupervised video segmentation. The superpixels within
the supervoxels are then classified as human or non-human by graph optimization
with unary energies from the imperfect human detection results and the
predicted confidence maps by the CNN trained in the previous iteration. In the
second step, the video-context derived human masks are used as direct labels to
train the CNN. Extensive experiments on the challenging PASCAL VOC 2012 semantic
segmentation benchmark demonstrate that the proposed framework achieves
results superior to those of all previous weakly-supervised methods with
object class or bounding box annotations. In addition, by augmenting with the
annotated masks from PASCAL VOC 2012, our method reaches a new state-of-the-art
performance on the human segmentation task.
Comment: Very-weakly supervised learning framework. New state-of-the-art
performance on the human segmentation task! (Published in T-PAMI 2017)
Building Fast and Compact Convolutional Neural Networks for Offline Handwritten Chinese Character Recognition
Like other problems in computer vision, offline handwritten Chinese character
recognition (HCCR) has achieved impressive results using convolutional neural
network (CNN)-based methods. However, larger and deeper networks are needed to
deliver state-of-the-art results in this domain. Such networks intuitively
appear to incur high computational cost, and require the storage of a large
number of parameters, which renders them infeasible for deployment in portable
devices. To address these problems of speed and storage capacity, we propose a
Global Supervised Low-rank Expansion (GSLRE) method and an Adaptive Drop-weight
(ADW) technique. We design a nine-layer CNN for HCCR
with 3,755 classes, and devise an algorithm that can reduce the
network's computational cost by nine times and compress the network to 1/18 of
the original size of the baseline model, with only a 0.21% drop in accuracy. In
tests, the proposed algorithm surpassed the best single-network performance
reported thus far in the literature while requiring only 2.3 MB for storage.
Furthermore, when integrated with our effective forward implementation, the
recognition of an offline character image took only 9.7 ms on a CPU. Compared
with the state-of-the-art CNN model for HCCR, our approach is approximately 30
times faster, yet 10 times more cost efficient.
Comment: 15 pages, 7 figures, 5 tables
Regularized Binary Network Training
There is a significant performance gap between Binary Neural Networks (BNNs)
and floating point Deep Neural Networks (DNNs). We propose to improve the
binary training method by introducing a new regularization function that
encourages training weights around binary values. In addition, we add trainable
scaling factors to our regularization functions. We also introduce an improved
approximation of the derivative of the sign activation function in the backward
computation. These modifications are based on linear operations that are easily
implemented in the binary training framework. Experimental results on
ImageNet show that our method outperforms the traditional BNN method and XNOR-net.
Comment: NeurIPS19 Workshop on Energy Efficient Machine Learning and Cognitive
Computing (2019)
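The abstract above does not give the form of its regularization function. As a minimal sketch, assume a penalty of the form |alpha - |w||^p, which vanishes exactly when every weight equals +alpha or -alpha; here `alpha` stands in for the trainable scaling factor, and the function name and the `power` parameter are hypothetical, chosen for illustration only.

```python
def binary_regularizer(weights, alpha=1.0, power=1):
    """Sum of |alpha - |w||**power over all weights.

    Assumed form, not taken from the abstract: the penalty is zero only
    when every weight sits at +alpha or -alpha, so minimizing it pulls
    weights toward (scaled) binary values.
    """
    if power not in (1, 2):
        raise ValueError("power must be 1 (L1-style) or 2 (L2-style)")
    return sum(abs(alpha - abs(w)) ** power for w in weights)


# Weights already at +/-alpha incur no penalty.
penalty_binary = binary_regularizer([1.0, -1.0], alpha=1.0)

# Weights away from +/-alpha are penalized in proportion to the gap.
penalty_l1 = binary_regularizer([0.5, -1.5], alpha=1.0, power=1)
penalty_l2 = binary_regularizer([0.5, -1.5], alpha=1.0, power=2)
```

In a real training loop this term would be added to the task loss with a weighting coefficient, and `alpha` would be updated by gradient descent along with the weights.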
Cost-effective Object Detection: Active Sample Mining with Switchable Selection Criteria
Though quite challenging, leveraging large-scale unlabeled or partially
labeled data in learning systems (e.g., model/classifier training) has
attracted increasing attention due to its fundamental importance. To address
this problem, many active learning (AL) methods have been proposed that employ
up-to-date detectors to retrieve representative minority samples according to
predefined confidence or uncertainty thresholds. However, these AL methods
cause the detectors to ignore the remaining majority samples (i.e., those with
low uncertainty or high prediction confidence). In this work, by developing a
principled active sample mining (ASM) framework, we demonstrate that
cost-effectively mining samples from these unlabeled majority data is key to
training more powerful object detectors while minimizing user effort.
Specifically, our ASM framework involves a switchable sample selection
mechanism for determining whether an unlabeled sample should be manually
annotated via AL or automatically pseudo-labeled via a novel self-learning
process. The proposed process is compatible with mini-batch based training
(i.e., using a batch of unlabeled or partially labeled data as a one-time
input) for object detection. In addition, a few samples with low-confidence
predictions are selected and annotated via AL. Notably, our method is suitable
for object categories that are not seen in the unlabeled data during the
learning process. Extensive experiments clearly demonstrate that our ASM
framework can achieve performance comparable to that of alternative methods but
with significantly fewer annotations.
Comment: Automatically determining whether an unlabeled sample should be
manually annotated or pseudo-labeled via a novel self-learning process
(Accepted by TNNLS 2018). The source code is available at
http://kezewang.com/codes/ASM_ver1.zi
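The switchable selection described in the abstract above can be illustrated with a small routing function. This is a sketch under stated assumptions, not the authors' criteria: the confidence threshold, the annotation budget, and all names are hypothetical. High-confidence samples are routed to pseudo-labeling, and only the few lowest-confidence samples are sent for manual annotation.

```python
def split_samples(confidences, pseudo_threshold=0.9, annotation_budget=2):
    """Route unlabeled samples by prediction confidence.

    Returns (to_annotate, to_pseudo_label) as sorted lists of indices.
    Threshold and budget are illustrative assumptions.
    """
    # Visit samples from least to most confident.
    order = sorted(range(len(confidences)), key=lambda i: confidences[i])
    # The least confident samples go to a human annotator, up to the budget.
    low_confidence = [i for i in order if confidences[i] < pseudo_threshold]
    to_annotate = low_confidence[:annotation_budget]
    # Confident predictions are kept as pseudo-labels at no annotation cost.
    to_pseudo_label = [i for i in order if confidences[i] >= pseudo_threshold]
    return sorted(to_annotate), sorted(to_pseudo_label)


annotate, pseudo = split_samples([0.95, 0.2, 0.6, 0.99, 0.4])
```

Samples that fall below the threshold but outside the budget are left unlabeled in this sketch and could be reconsidered in a later round.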
Looking Fast and Slow: Memory-Guided Mobile Video Object Detection
With a single eye fixation lasting a fraction of a second, the human visual
system is capable of forming a rich representation of a complex environment,
reaching a holistic understanding which facilitates object recognition and
detection. This phenomenon is known as recognizing the "gist" of the scene and
is accomplished by relying on relevant prior knowledge. This paper addresses
the analogous question of whether using memory in computer vision systems can
not only improve the accuracy of object detection in video streams, but also
reduce the computation time. By interleaving conventional feature extractors
with extremely lightweight ones which only need to recognize the gist of the
scene, we show that minimal computation is required to produce accurate
detections when temporal memory is present. In addition, we show that the
memory contains enough information for deploying reinforcement learning
algorithms to learn an adaptive inference policy. Our model achieves
state-of-the-art performance among mobile methods on the ImageNet VID 2015
dataset, while running at speeds of up to 70+ FPS on a Pixel 3 phone.
Learning to Align Images using Weak Geometric Supervision
Image alignment tasks require accurate pixel correspondences, which are
usually recovered by matching local feature descriptors. Such descriptors are
often derived using supervised learning on existing datasets with ground truth
correspondences. However, the cost of creating such datasets is usually
prohibitive. In this paper, we propose a new approach to align two images
related by an unknown 2D homography where the local descriptor is learned from
scratch from the images and the homography is estimated simultaneously. Our key
insight is that a siamese convolutional neural network can be trained jointly
while iteratively updating the homography parameters by optimizing a single
loss function. Our method is currently weakly supervised because the input
images need to be roughly aligned.
We have used this method to align images of different modalities such as RGB
and near-infrared (NIR) without using any prior labeled data. Images
automatically aligned by our method were then used to train descriptors that
generalize to new images. We also evaluated our method on RGB images. On the
HPatches benchmark, our method achieves comparable accuracy to deep local
descriptors that were trained offline in a supervised setting.
Comment: Accepted in 3DV 201