1,873 research outputs found
Cascaded Pyramid Mining Network for Weakly Supervised Temporal Action Localization
Weakly supervised temporal action localization, which aims at temporally
locating action instances in untrimmed videos using only video-level class
labels during training, is an important yet challenging problem in video
analysis. Many current methods adopt the "localization by classification"
framework: first classify the video, then locate the temporal regions that
contribute most to the result. However, this framework fails to locate
entire action instances and gives little consideration to local context. In
this paper, we present a novel architecture called Cascaded Pyramid Mining
Network (CPMN) to address these issues using two effective modules. First, to
discover the entire temporal interval of a specific action, we design a two-stage
cascaded module with the proposed Online Adversarial Erasing (OAE) mechanism, where
new and complementary regions are mined by feeding the erased feature maps
of discovered regions back to the system. Second, to exploit hierarchical
contextual information in videos and reduce missing detections, we design a
pyramid module which produces a scale-invariant attention map by combining
the feature maps from different levels. Finally, we aggregate the results of the two
modules to perform action localization by locating high-score areas in the
temporal Class Activation Sequence (CAS). Extensive experiments conducted on the
THUMOS14 and ActivityNet-1.3 datasets demonstrate the effectiveness of our
method.
Comment: Accepted at ACCV 201
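As a minimal sketch of the final step described in the abstract above (locating high-score areas in a temporal Class Activation Sequence), the Python below thresholds a 1-D score sequence and returns contiguous segments. The function name `localize_from_cas`, the fixed threshold, and the toy scores are assumptions made for illustration; this is not the authors' implementation.

```python
def localize_from_cas(cas, threshold=0.5):
    """Return (start, end) index pairs of contiguous runs where the
    score is >= threshold. `end` is exclusive.

    `cas` is a hypothetical 1-D list of per-snippet scores for one class.
    """
    segments = []
    start = None
    for t, score in enumerate(cas):
        if score >= threshold and start is None:
            start = t                      # a high-score run begins
        elif score < threshold and start is not None:
            segments.append((start, t))    # the run ended at the previous snippet
            start = None
    if start is not None:                  # close a run that reaches the end
        segments.append((start, len(cas)))
    return segments


if __name__ == "__main__":
    toy_cas = [0.1, 0.7, 0.9, 0.2, 0.6, 0.8]
    print(localize_from_cas(toy_cas, threshold=0.5))
```

A fixed threshold is the simplest possible choice here; the abstract does not say how the high-score areas are selected.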
DeCaFA: Deep Convolutional Cascade for Face Alignment In The Wild
Face alignment is an active computer vision domain that consists of
localizing a number of facial landmarks that vary across datasets.
State-of-the-art face alignment methods either perform end-to-end
regression or refine the shape in a cascaded manner, starting from an
initial guess. In this paper, we introduce DeCaFA, an end-to-end deep
convolutional cascade architecture for face alignment. DeCaFA uses
fully-convolutional stages to keep full spatial resolution throughout the
cascade. Between cascade stages, DeCaFA uses multiple chained transfer
layers with spatial softmax to produce landmark-wise attention maps for each of
several landmark alignment tasks. Weighted intermediate supervision and
efficient feature fusion between the stages allow the network to progressively
refine the attention maps in an end-to-end manner. We show experimentally that
DeCaFA significantly outperforms existing approaches on 300W, CelebA and WFLW
databases. In addition, we show that DeCaFA can learn fine alignment with
reasonable accuracy from very few images using coarsely annotated data.
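The abstract above mentions spatial softmax as the operation that turns a landmark heatmap into an attention map. The sketch below shows that operation on a small 2-D grid in plain Python. The helper `expected_location` (a soft-argmax readout) is an added illustration and an assumption, not something the abstract describes, and all names are hypothetical.

```python
import math


def spatial_softmax(heatmap):
    """Normalize a 2-D score grid so that all entries are positive and sum to 1."""
    peak = max(max(row) for row in heatmap)          # subtract the max for stability
    exps = [[math.exp(v - peak) for v in row] for row in heatmap]
    total = sum(sum(row) for row in exps)
    return [[e / total for e in row] for row in exps]


def expected_location(attention):
    """Illustrative soft-argmax: attention-weighted mean (x, y) coordinate."""
    y = sum(i * p for i, row in enumerate(attention) for p in row)
    x = sum(j * p for row in attention for j, p in enumerate(row))
    return x, y


if __name__ == "__main__":
    toy_heatmap = [[0.0, 0.0, 0.0],
                   [0.0, 0.0, 20.0],
                   [0.0, 0.0, 0.0]]
    attention = spatial_softmax(toy_heatmap)
    print(expected_location(attention))
```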
ProNet: Learning to Propose Object-specific Boxes for Cascaded Neural Networks
This paper aims to classify and locate objects accurately and efficiently,
without using bounding box annotations. This is challenging because objects in the
wild can appear at arbitrary locations and at different scales. In this
paper, we propose a novel classification architecture, ProNet, based on
convolutional neural networks. It uses computationally efficient neural
networks to propose image regions that are likely to contain objects, and
applies more powerful but slower networks on the proposed regions. The basic
building block is a multi-scale fully-convolutional network which assigns
object confidence scores to boxes at different locations and scales. We show
that such networks can be trained effectively using image-level annotations,
and can be connected into cascades or trees for efficient object
classification. ProNet significantly outperforms the previous state of the art on
the PASCAL VOC 2012 and MS COCO datasets for object classification and point-based
localization.
Comment: CVPR 2016 (fixed reference issue
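The cascade idea in the abstract above (a cheap network proposes regions, and a slower, stronger network is applied only to those proposals) can be sketched generically as follows. The function `cascade_classify`, the `keep` parameter, and the two scoring callables are hypothetical stand-ins for the networks; this is not the ProNet architecture itself.

```python
def cascade_classify(boxes, cheap_score, expensive_score, keep=2):
    """Score every box with the cheap model, then re-score only the top `keep`.

    `cheap_score` and `expensive_score` are hypothetical callables that map
    a box to a confidence value.
    """
    # Stage 1: rank all candidate boxes with the cheap scorer.
    proposals = sorted(boxes, key=cheap_score, reverse=True)[:keep]
    # Stage 2: apply the expensive scorer to the surviving proposals only.
    rescored = [(expensive_score(box), box) for box in proposals]
    best_score, best_box = max(rescored, key=lambda pair: pair[0])
    return best_box, best_score


if __name__ == "__main__":
    toy_boxes = [(0, 0, 10, 10), (5, 5, 20, 20), (2, 2, 4, 4), (1, 1, 30, 30)]
    by_area = lambda b: b[2] * b[3]                       # stand-in cheap scorer
    strong = lambda b: 0.9 if b == (5, 5, 20, 20) else 0.3  # stand-in strong scorer
    print(cascade_classify(toy_boxes, by_area, strong, keep=2))
```

The point of the design is that the expensive scorer runs `keep` times rather than once per candidate box.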
Soft Proposal Networks for Weakly Supervised Object Localization
Weakly supervised object localization, where only image labels rather than
bounding boxes are available during training, remains challenging. Object proposal
is an effective component in localization, but it is often computationally expensive
and incapable of joint optimization with the remaining modules. In this
paper, to the best of our knowledge, we integrate weakly supervised object
proposal into convolutional neural networks (CNNs) for the first time, in an
end-to-end learning manner. We design a network component, Soft Proposal (SP),
to be plugged into any standard convolutional architecture to introduce the
nearly cost-free object proposal, orders of magnitude faster than
state-of-the-art methods. In the SP-augmented CNNs, referred to as Soft
Proposal Networks (SPNs), iteratively evolved object proposals are generated
based on the deep feature maps then projected back, and further jointly
optimized with network parameters, with image-level supervision only. Through
the unified learning process, SPNs learn better object-centric filters,
discover more discriminative visual evidence, and suppress background
interference, significantly boosting both weakly supervised object localization
and classification performance. We report the best results on popular
benchmarks, including PASCAL VOC, MS COCO, and ImageNet.
Comment: ICCV 201
Global Weighted Average Pooling Bridges Pixel-level Localization and Image-level Classification
In this work, we first tackle the problem of simultaneous pixel-level
localization and image-level classification with only image-level labels for
fully convolutional network training. We investigate the global pooling method,
which plays a vital role in this task. Classical global max pooling and average
pooling methods struggle to indicate the precise regions of objects. Therefore,
we revisit the global weighted average pooling (GWAP) method for this task and
propose the class-agnostic GWAP module and the class-specific GWAP module in
this paper. We evaluate the classification and pixel-level localization ability
on the ILSVRC benchmark dataset. Experimental results show that the proposed
GWAP module can better capture the regions of the foreground objects. We
further explore the knowledge transfer between the image classification task
and the region-based object detection task. We propose a multi-task framework
that combines our class-specific GWAP module with R-FCN. The framework is
trained with few ground truth bounding boxes and large-scale image-level
labels. We evaluate this framework on the PASCAL VOC dataset. Experimental results
show that this framework can use the data with only image-level labels to
improve the generalization of the object detection model.
Comment: technical repor
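To make the pooling idea in the abstract above concrete, here is a minimal sketch of one common formulation of global weighted average pooling: a feature map is averaged with per-location weights, so plain average pooling is the special case of uniform weights. The exact class-agnostic and class-specific modules of the paper are not given in the abstract; the formula, the function name, and the `eps` term are assumptions.

```python
def global_weighted_average_pool(feature_map, weight_map, eps=1e-8):
    """Pool a 2-D feature map to a scalar using non-negative location weights.

    Computes sum(w * f) / (sum(w) + eps). `weight_map` stands in for a
    learned spatial weighting; here it is simply supplied by the caller.
    """
    weighted_sum = sum(
        f * w
        for f_row, w_row in zip(feature_map, weight_map)
        for f, w in zip(f_row, w_row)
    )
    weight_total = sum(w for w_row in weight_map for w in w_row)
    return weighted_sum / (weight_total + eps)   # eps guards against all-zero weights


if __name__ == "__main__":
    features = [[1.0, 2.0], [3.0, 4.0]]
    uniform = [[1.0, 1.0], [1.0, 1.0]]           # reduces to average pooling
    focused = [[0.0, 0.0], [0.0, 1.0]]           # attends to a single location
    print(global_weighted_average_pool(features, uniform))
    print(global_weighted_average_pool(features, focused))
```

Because the pooled score depends on the weight map, the weights themselves indicate which locations drive the image-level prediction, which is what links classification to localization in this kind of scheme.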
Hand Pose Estimation through Semi-Supervised and Weakly-Supervised Learning
We propose a method for hand pose estimation based on a deep regressor
trained on two different kinds of input. Raw depth data is fused with an
intermediate representation in the form of a segmentation of the hand into
parts. This intermediate representation contains important topological
information and provides useful cues for reasoning about joint locations. The
mapping from raw depth to segmentation maps is learned in a
semi/weakly-supervised way from two different datasets: (i) a synthetic dataset
created through a rendering pipeline including densely labeled ground truth
(pixelwise segmentations); and (ii) a dataset with real images for which ground
truth joint positions are available, but not dense segmentations. The loss for
training on real images is generated from a patch-wise restoration process,
which aligns tentative segmentation maps with a large dictionary of synthetic
poses. The underlying premise is that the domain shift between synthetic and
real data is smaller in the intermediate representation, where labels carry
geometric and topological meaning, than in the raw input domain. Experiments on
the NYU dataset show that the proposed training method reduces joint error
by 15.7% compared with direct regression of joints from depth data.
Comment: 13 pages, 10 figures, 4 table
Weakly Supervised Object Discovery by Generative Adversarial & Ranking Networks
Deep generative adversarial networks (GANs) have recently been shown to be
promising for various computer vision applications, such as image editing,
synthesizing high-resolution images, and generating videos. These networks and
the corresponding learning scheme can handle various visual space mappings.
We approach GANs with a novel training method and learning objective to
discover multiple object instances in three cases: 1) synthesizing a picture
of a specific object within a cluttered scene; 2) localizing different
categories in images for weakly supervised object detection; and 3) improving
object discovery in object detection pipelines. A crucial advantage of our
method is that it learns a new deep similarity metric to distinguish multiple
objects in one image. We demonstrate that the network can act as an
encoder-decoder generating parts of an image which contain an object, or as a
modified deep CNN to represent images for object detection in supervised and
weakly supervised schemes. Our ranking GAN offers a novel way to search through
images for object-specific patterns. We have conducted experiments for
different scenarios and demonstrate the method's performance for object
synthesis and weakly supervised object detection and classification using
the MS-COCO and PASCAL VOC datasets.
Weakly Supervised Medical Diagnosis and Localization from Multiple Resolutions
Diagnostic imaging often requires the simultaneous identification of a
multitude of findings of varied size and appearance. Beyond global indication
of said findings, the prediction and display of localization information
improves trust in and understanding of results when augmenting clinical
workflow. Medical training data rarely includes more than global image-level
labels as segmentations are time-consuming and expensive to collect. We
introduce an approach to managing these practical constraints by applying a
novel architecture which learns at multiple resolutions while generating
saliency maps with weak supervision. Further, we parameterize the Log-Sum-Exp
pooling function with a learnable lower-bounded adaptation (LSE-LBA) to build
in a sharpness prior and better handle localizing abnormalities of different
sizes using only image-level labels. Applying this approach to interpreting
chest x-rays, we set the state of the art on 9 abnormalities in the NIH's CXR14
dataset while generating saliency maps with the highest resolution to date.
Comment: submitted to ECCV 201
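The Log-Sum-Exp pooling mentioned in the abstract above interpolates between average pooling (small sharpness r) and max pooling (large r). The sketch below uses the standard form LSE_r(x) = (1/r) log((1/n) sum_i exp(r x_i)). The lower-bounded parameterization r = r0 + exp(beta), with beta as the learnable value, is an assumption about what "learnable lower-bounded adaptation" means here, and all names are hypothetical.

```python
import math


def lower_bounded_sharpness(beta, r0=0.0):
    """Assumed parameterization: r = r0 + exp(beta), so r always exceeds r0."""
    return r0 + math.exp(beta)


def lse_pool(values, r):
    """Log-Sum-Exp pooling of a flat list of scores with sharpness r > 0.

    The maximum is factored out before exponentiating for numerical stability.
    """
    m = max(values)
    mean_exp = sum(math.exp(r * (v - m)) for v in values) / len(values)
    return m + math.log(mean_exp) / r


if __name__ == "__main__":
    scores = [0.0, 1.0, 2.0, 3.0]
    print(lse_pool(scores, 1e-3))   # close to the mean of the scores
    print(lse_pool(scores, 50.0))   # close to the max of the scores
    print(lse_pool(scores, lower_bounded_sharpness(beta=0.0, r0=0.5)))
```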
A Generic Deep Architecture for Single Image Reflection Removal and Image Smoothing
This paper proposes a deep neural network structure that exploits edge
information in addressing representative low-level vision tasks such as layer
separation and image filtering. Unlike most other deep learning strategies
applied in this context, our approach tackles these challenging problems by
estimating edges and reconstructing images using only cascaded convolutional
layers arranged such that no handcrafted or application-specific
image-processing components are required. We apply the resulting transferable
pipeline to two different problem domains that are both sensitive to edges,
namely, single image reflection removal and image smoothing. For the former,
using a mild reflection smoothness assumption and a novel synthetic data
generation method that acts as a type of weak supervision, our network is able
to solve much more difficult reflection cases that cannot be handled by
previous methods. For the latter, we also exceed the state-of-the-art
quantitative and qualitative results by wide margins. In all cases, the
proposed framework is simple, fast, and easy to transfer across disparate
domains.
Comment: Appeared at ICCV'17 (International Conference on Computer Vision
Collaborative Learning for Weakly Supervised Object Detection
Weakly supervised object detection has recently received much attention,
since it only requires image-level labels instead of the bounding-box labels
consumed in strongly supervised learning. Nevertheless, the savings in labeling
expense usually come at the cost of model accuracy. In this paper, we propose a
simple but effective weakly supervised collaborative learning framework to
resolve this problem, which trains a weakly supervised learner and a strongly
supervised learner jointly by enforcing partial feature sharing and prediction
consistency. For object detection, taking a WSDDN-like architecture as the weakly
supervised detector sub-network and a Faster-RCNN-like architecture as the strongly
supervised detector sub-network, we propose an end-to-end Weakly Supervised
Collaborative Detection Network. As there is no strong supervision available to
train the Faster-RCNN-like sub-network, a new prediction consistency loss is
defined to enforce consistency of predictions between the two sub-networks as
well as within the Faster-RCNN-like sub-network. At the same time, the two
detectors are designed to partially share features to further guarantee
model consistency at the perceptual level. Extensive experiments on the PASCAL VOC 2007
and 2012 datasets demonstrate the effectiveness of the proposed
framework.