Where are the Blobs: Counting by Localization with Point Supervision
Object counting is an important task in computer vision due to its growing
demand in applications such as surveillance, traffic monitoring, and counting
everyday objects. State-of-the-art methods use regression-based optimization
where they explicitly learn to count the objects of interest. These often
perform better than detection-based methods that need to learn the more
difficult task of predicting the location, size, and shape of each object.
However, we propose a detection-based method that does not need to estimate the
size and shape of the objects and that outperforms regression-based methods.
Our contributions are three-fold: (1) we propose a novel loss function that
encourages the network to output a single blob per object instance using
point-level annotations only; (2) we design two methods for splitting large
predicted blobs between object instances; and (3) we show that our method
achieves new state-of-the-art results on several challenging datasets including
the Pascal VOC and the Penguins dataset. Our method even outperforms those that
use stronger supervision such as depth features, multi-point annotations, and
bounding-box labels.
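To make the localization-based counting concrete, below is a minimal sketch of how a count can be read off a predicted blob mask together with point annotations. The paper's loss and splitting methods differ in detail; here a marker-seeded watershed stands in for the blob-splitting step, and the function names are illustrative.

```python
import numpy as np
from scipy import ndimage
from skimage.segmentation import watershed

def count_objects(pred_mask, points):
    """Count instances from a predicted blob mask plus point annotations.

    pred_mask : (H, W) bool array of foreground blobs output by the network.
    points    : (N, 2) array of (row, col) point annotations.
    """
    # One watershed marker per annotated point.
    markers = np.zeros(pred_mask.shape, dtype=int)
    for i, (r, c) in enumerate(points, start=1):
        markers[r, c] = i

    # Split any blob covering several points: flood from the markers over
    # the inverted distance transform, restricted to the predicted mask.
    distance = ndimage.distance_transform_edt(pred_mask)
    segments = watershed(-distance, markers, mask=pred_mask)

    # Each nonzero segment corresponds to one object instance.
    return int((np.unique(segments) != 0).sum())
```

Counting segments rather than regressing a density keeps the per-object localization that pure regression-based counters give up.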
Improving Object Detection with Deep Convolutional Networks via Bayesian Optimization and Structured Prediction
Object detection systems based on deep convolutional neural networks (CNNs)
have recently made ground-breaking advances on several object detection
benchmarks. While the features learned by these high-capacity neural networks
are discriminative for categorization, inaccurate localization is still a major
source of error for detection. Building upon high-capacity CNN architectures,
we address the localization problem by 1) using a search algorithm based on
Bayesian optimization that sequentially proposes candidate regions for an
object bounding box, and 2) training the CNN with a structured loss that
explicitly penalizes localization inaccuracy. In experiments, we demonstrate
that each of the proposed methods improves detection performance over the
baseline method on the PASCAL VOC 2007 and 2012 datasets. Furthermore, the two
methods are complementary and, combined, significantly outperform the previous
state of the art.
Comment: CVPR 2015
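As a rough illustration of the first ingredient, the sketch below treats the detector score as a black-box function of the four box coordinates and proposes new candidate boxes with a Gaussian-process surrogate and an expected-improvement rule. This is a generic Bayesian-optimization loop under assumed names (score_box is a hypothetical detector-scoring callable), not the paper's exact procedure.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def bayes_opt_box_search(score_box, seed_boxes, n_iters=20, n_candidates=500):
    """Sequentially propose boxes that maximize a black-box detector score.

    score_box  : callable mapping a box (x1, y1, x2, y2) to a CNN detection
                 score (hypothetical stand-in for the trained detector).
    seed_boxes : (N, 4) array of initial candidate boxes.
    """
    X = np.asarray(seed_boxes, dtype=float)
    y = np.array([score_box(b) for b in X])
    gp = GaussianProcessRegressor(normalize_y=True)

    for _ in range(n_iters):
        gp.fit(X, y)
        # Sample candidates around boxes seen so far and pick the one with
        # the highest expected improvement over the best score to date.
        cands = X[np.random.randint(len(X), size=n_candidates)]
        cands = cands + np.random.randn(n_candidates, 4) * 10.0
        mu, sigma = gp.predict(cands, return_std=True)
        best = y.max()
        z = (mu - best) / np.maximum(sigma, 1e-9)
        ei = (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)
        nxt = cands[np.argmax(ei)]
        X = np.vstack([X, nxt])
        y = np.append(y, score_box(nxt))

    return X[np.argmax(y)]  # highest-scoring box found
```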
Perceiving Physical Equation by Observing Visual Scenarios
Inferring universal laws of the environment is an important ability of human
intelligence as well as a symbol of general AI. In this paper, we take a step
toward this goal by introducing the new and challenging problem of inferring
an invariant physical equation from visual scenarios: for instance, teaching a
machine to automatically derive the gravitational acceleration formula by
watching a free-falling object. To tackle this challenge, we present a novel
pipeline comprised of an Observer Engine and a Physicist Engine by respectively
imitating the actions of an observer and a physicist in the real world.
Generally, the Observer Engine watches the visual scenarios and extracts the
physical properties of objects. The Physicist Engine analyses these data and
summarizes the inherent laws of object dynamics. Specifically, the
learned laws are expressed by mathematical equations such that they are more
interpretable than the results given by common probabilistic models.
Experiments on synthetic videos have shown that our pipeline is able to
discover physical equations on various physical worlds with different visual
appearances.
Comment: NIPS 2018 Workshop on Modeling the Physical World
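As a toy version of the free-falling example, the sketch below plays both roles on ideal observations: the "Observer" output is a tracked height trajectory, and the "Physicist" step fits y(t) = y0 - (1/2) g t^2 by least squares and reads off g. The data are synthetic and the fitting method only illustrates the equation-summarization step.

```python
import numpy as np

# Synthetic "Observer Engine" output: height of a free-falling object
# tracked over time (ideal, noise-free observations).
g_true = 9.81
t = np.linspace(0.0, 1.0, 50)
y = 100.0 - 0.5 * g_true * t**2

# "Physicist Engine" step: fit y(t) = a*t^2 + b*t + c and read off g = -2a.
a, b, c = np.polyfit(t, y, deg=2)
print(f"recovered g = {-2.0 * a:.2f} m/s^2")  # ~9.81
```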
Attend Refine Repeat: Active Box Proposal Generation via In-Out Localization
Computing category-agnostic bounding box proposals is a core component of many
computer vision tasks and has thus lately attracted a lot of attention. In this
work we propose a new approach to this problem based on an active box-generation
strategy: starting from a set of seed boxes uniformly distributed on the image,
it progressively moves its attention to the promising image areas where it is
more likely to discover well-localized bounding box proposals. We call our
approach AttractioNet; a core component of it is a CNN-based category-agnostic
object location refinement module that is capable of yielding accurate and
robust bounding box predictions regardless of the object category.
We extensively evaluate AttractioNet on several image datasets (COCO, PASCAL,
ImageNet detection, and NYU-Depth V2), reporting state-of-the-art results on
all of them that surpass the previous work in the field by a significant
margin, and we provide strong empirical evidence that our approach is capable
of generalizing to unseen categories. Furthermore, we
evaluate our AttractioNet proposals in the context of the object detection task
using a VGG16-Net based detector and the achieved detection performance on COCO
manages to significantly surpass all other VGG16-Net based detectors while even
being competitive with a heavily tuned ResNet-101 based detector. Code as well
as box proposals computed for several datasets are available at:
https://github.com/gidariss/AttractioNet.
Comment: Technical report. Code and box proposals are available at:
https://github.com/gidariss/AttractioNet
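The active strategy can be summarized as a simple attend-refine-repeat loop, sketched below. Both refine and objectness are hypothetical callables standing in for the CNN modules; the seeding grid and iteration count are illustrative.

```python
import numpy as np

def attend_refine_repeat(refine, objectness, image_size, n_steps=5,
                         grid=10, top_k=200):
    """Active box-proposal generation, sketched after AttractioNet.

    refine     : callable (N, 4) -> (N, 4), stands in for the CNN-based
                 category-agnostic localization module (hypothetical).
    objectness : callable (N, 4) -> (N,), box objectness scores (hypothetical).
    """
    H, W = image_size
    # Seed boxes uniformly distributed over the image.
    xs = np.linspace(0, W * 0.8, grid)
    ys = np.linspace(0, H * 0.8, grid)
    boxes = np.array([(x, y, x + W / grid, y + H / grid)
                      for x in xs for y in ys])

    proposals = [boxes]
    for _ in range(n_steps):
        # Attend: keep the most promising boxes.
        scores = objectness(boxes)
        boxes = boxes[np.argsort(scores)[::-1][:top_k]]
        # Refine: move each box toward a better-localized position.
        boxes = refine(boxes)
        proposals.append(boxes)

    return np.vstack(proposals)  # candidate proposals across all steps
```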
Subjects and Their Objects: Localizing Interactees for a Person-Centric View of Importance
Understanding images with people often entails understanding their
interactions with other objects or people. As such, given a novel image,
a vision system ought to infer which other objects/people play an important
role in a given person's activity. However, existing methods are limited to
learning action-specific interactions (e.g., how the pose of a tennis player
relates to the position of his racquet when serving the ball) for improved
recognition, making them unequipped to reason about novel interactions with
actions or objects unobserved in the training data.
We propose to predict the "interactee" in novel images, that is, to localize
the object of a person's action. Given an arbitrary image with a
detected person, the goal is to produce a saliency map indicating the most
likely positions and scales where that person's interactee would be found. To
that end, we explore ways to learn the generic, action-independent connections
between (a) representations of a person's pose, gaze, and scene cues and (b)
the interactee object's position and scale. We provide results on a newly
collected UT Interactee dataset spanning more than 10,000 images from SUN,
PASCAL, and COCO. We show that the proposed interaction-informed saliency
metric has practical utility for four tasks: contextual object detection, image
retargeting, predicting object importance, and data-driven natural language
scene description. All four scenarios reveal the value in linking the subject
to its object in order to understand the story of an image.
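As a simplified sketch of the prediction step, one can regress the interactee's relative position and scale from cue features and render the result as a Gaussian saliency map. The model, features, and targets below are placeholders (random data, a small MLP), not the paper's actual learning setup.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Hypothetical training data: person-centric cue features (pose, gaze and
# scene descriptors concatenated) -> interactee (dx, dy, log-scale) targets.
rng = np.random.default_rng(0)
cue_feats = rng.normal(size=(1000, 32))
targets = rng.normal(size=(1000, 3))

model = MLPRegressor(hidden_layer_sizes=(64,), max_iter=500)
model.fit(cue_feats, targets)

def interactee_saliency(cues, person_xy, shape=(240, 320), base_sigma=15.0):
    """Render the predicted interactee location/scale as a Gaussian map."""
    dx, dy, log_scale = model.predict(cues[None])[0]
    cx, cy = person_xy[0] + dx, person_xy[1] + dy
    sigma = base_sigma * np.exp(log_scale)  # larger objects -> wider peak
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

saliency = interactee_saliency(cue_feats[0], person_xy=(160.0, 120.0))
```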
Count-ception: Counting by Fully Convolutional Redundant Counting
Counting objects in digital images is a process that should be replaced by
machines. This tedious task is time-consuming and prone to errors due to the
fatigue of human annotators. The goal is to have a system that takes an image
as input and returns a count of the objects inside and a justification for the
prediction in the form of object localization. We repose a problem, originally
posed by Lempitsky and Zisserman, to instead predict a count map which contains
redundant counts based on the receptive field of a smaller regression network.
The regression network predicts a count of the objects that exist inside this
frame. By processing the image in a fully convolutional way, each pixel is
accounted for multiple times: once for every window that includes it, which
equals the window area (e.g., 32x32 = 1024 times). To recover the true
count we take the average over the redundant predictions. Our contribution is
redundant counting instead of predicting a density map in order to average over
errors. We also propose a novel deep neural network architecture adapted from
the Inception family of networks called the Count-ception network. Together our
approach results in a 20% relative improvement (2.9 to 2.3 MAE) over the state
of the art method by Xie, Noble, and Zisserman in 2016.
Comment: Under Review
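The recovery step is simple arithmetic: each object is seen by receptive_field² windows, so dividing the summed count map by the window area averages out the redundancy. A minimal sketch with a toy check:

```python
import numpy as np

def recover_count(count_map, receptive_field=32):
    """Recover the true object count from a redundant count map.

    Each object is counted once by every window covering it, i.e.
    receptive_field**2 times (32x32 = 1024), so averaging the redundant
    predictions amounts to dividing the summed map by the window area.
    """
    return count_map.sum() / float(receptive_field**2)

# Toy check: a single point object is "seen" by a 32x32 block of windows,
# so the count map sums to 1024 and the recovered count is 1.
count_map = np.zeros((128, 128))
count_map[40:72, 50:82] = 1.0
print(recover_count(count_map))  # -> 1.0
```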
Flash Photography for Data-Driven Hidden Scene Recovery
Vehicles, search and rescue personnel, and endoscopes use flash lights to
locate, identify, and view objects in their surroundings. Here we show the
first steps of how all these tasks can be done around corners with consumer
cameras. Recent techniques for non-line-of-sight (NLOS) imaging with consumer
cameras have not been able to both localize and identify the hidden object. We
introduce a
method that couples traditional geometric understanding and data-driven
techniques. To avoid the limitation of large dataset gathering, we train the
data-driven models on rendered samples to computationally recover the hidden
scene on real data. The method has three independent operating modes: 1) a
regression output to localize a hidden object in 2D, 2) an identification
output to identify the object type or pose, and 3) a generative network to
reconstruct the hidden scene from a new viewpoint. The method is able to
localize 12 cm wide hidden objects in 2D with 1.7 cm accuracy. The method also
identifies the hidden object class with 87.7% accuracy (compared to 33.3%
random accuracy). This paper also provides an analysis on the distribution of
information that encodes the occluded object in the accessible scene. We show
that, unlike previously thought, the area that extends beyond the corner is
essential for accurate object localization and identification.
Tree-Structured Reinforcement Learning for Sequential Object Localization
Existing object proposal algorithms usually search for possible object regions
over multiple locations and scales separately, ignoring the interdependency
among different objects and deviating from the human perception procedure. To
incorporate global interdependency between objects into object
localization, we propose an effective Tree-structured Reinforcement Learning
(Tree-RL) approach to sequentially search for objects by fully exploiting both
the current observation and historical search paths. The Tree-RL approach
learns multiple searching policies by maximizing the long-term reward that
reflects localization accuracies over all the objects. Starting with taking the
entire image as a proposal, the Tree-RL approach allows the agent to
sequentially discover multiple objects via a tree-structured traversing scheme.
Allowing multiple near-optimal policies, Tree-RL offers more diversity in
search paths and is able to find multiple objects with a single feed-forward
pass. Therefore, Tree-RL can better cover objects of various scales, which is
quite appealing in the context of object proposals. Experiments on PASCAL VOC
2007 and 2012 validate the effectiveness of Tree-RL, which achieves recalls
comparable to current object proposal algorithms with far fewer candidate
windows.
Comment: Advances in Neural Information Processing Systems 2016
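A stripped-down version of the traversal is sketched below: each node is a box, expanding a node applies two groups of actions (scaling and translation, chosen here for illustration), and a hypothetical value function stands in for the learned policies when selecting which children to follow.

```python
import numpy as np

def tree_rl_proposals(value_fn, image_size, depth=4):
    """Tree-structured box search, sketched after Tree-RL.

    value_fn : callable box -> estimated long-term reward; a stand-in for
               the learned policy/value network (hypothetical).
    """
    H, W = image_size
    root = np.array([0.0, 0.0, W, H])  # start from the whole image
    proposals, frontier = [root], [root]

    for _ in range(depth):
        next_frontier = []
        for box in frontier:
            x1, y1, x2, y2 = box
            w, h = x2 - x1, y2 - y1
            # Two action groups: scale down, and translate within the image.
            children = [
                np.array([x1, y1, x1 + 0.75 * w, y1 + 0.75 * h]),   # scale
                np.array([x1 + 0.25 * w, y1 + 0.25 * h, x2, y2]),   # scale
                np.clip([x1 + 0.25 * w, y1, x2 + 0.25 * w, y2],
                        0, [W, H, W, H]),                            # shift
                np.clip([x1, y1 + 0.25 * h, x2, y2 + 0.25 * h],
                        0, [W, H, W, H]),                            # shift
            ]
            # Follow the most promising children (near-optimal policies).
            children.sort(key=value_fn, reverse=True)
            next_frontier.extend(children[:2])
        proposals.extend(next_frontier)
        frontier = next_frontier

    return np.array(proposals)  # every visited box is a proposal
```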
HAR-Net: Joint Learning of Hybrid Attention for Single-stage Object Detection
Object detection has been a challenging task in computer vision. Although
significant progress has been made in object detection with deep neural
networks, attention mechanisms remain underexplored for this task. In this
paper, we
propose the hybrid attention mechanism for single-stage object detection.
First, we present the modules of spatial attention, channel attention and
aligned attention for single-stage object detection. In particular, stacked
dilated convolution layers with symmetrically fixed rates are constructed to
learn spatial attention. Channel attention is built from cross-level group
normalization and a squeeze-and-excitation module. Aligned attention is
constructed with organized deformable filters. Second, the three kinds of
attention are unified to construct the hybrid attention mechanism. We then
embed the hybrid attention into Retina-Net and propose the efficient
single-stage HAR-Net for object detection. The attention modules and the
proposed HAR-Net are evaluated on the COCO detection dataset. Experiments
demonstrate that hybrid attention can significantly improve detection accuracy
and that HAR-Net achieves a state-of-the-art 45.8% mAP, outperforming existing
single-stage object detectors.
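For two of the three modules, a minimal PyTorch sketch is given below: squeeze-and-excitation style channel attention and a spatial-attention branch built from stacked dilated convolutions with symmetric rates. Channel counts, dilation rates, and layer arrangement are illustrative, not the paper's exact configuration (the cross-level grouping and aligned attention are omitted).

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                       # squeeze
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),                                  # excitation
        )

    def forward(self, x):
        return x * self.fc(x)

class SpatialAttention(nn.Module):
    """Spatial attention from stacked dilated convolutions."""
    def __init__(self, channels, rates=(1, 2, 4, 2, 1)):
        super().__init__()
        layers = []
        for r in rates:  # symmetrically fixed dilation rates
            layers += [nn.Conv2d(channels, channels, 3, padding=r, dilation=r),
                       nn.ReLU(inplace=True)]
        layers += [nn.Conv2d(channels, 1, 1), nn.Sigmoid()]
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return x * self.net(x)  # per-location attention weights

# Toy usage on a feature map from a detector backbone.
feat = torch.randn(1, 256, 32, 32)
out = SpatialAttention(256)(ChannelAttention(256)(feat))
```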
LocNet: Improving Localization Accuracy for Object Detection
We propose a novel object localization methodology with the purpose of
boosting the localization accuracy of state-of-the-art object detection
systems. Our model, given a search region, aims at returning the bounding box
of an object of interest inside this region. To accomplish its goal, it relies
on assigning conditional probabilities to each row and column of this region,
where these probabilities provide useful information regarding the location of
the boundaries of the object inside the search region and allow the accurate
inference of the object bounding box under a simple probabilistic framework.
For implementing our localization model, we make use of a convolutional
neural network architecture that is properly adapted for this task, called
LocNet. We show experimentally that LocNet achieves a very significant
improvement in mAP at high IoU thresholds on the PASCAL VOC2007 test set and
that it can easily be coupled with recent state-of-the-art object detection
systems, helping them to boost their performance. Finally, we
demonstrate that our detection approach can achieve high detection accuracy
even when it is given as input a set of sliding windows, thus proving that it
is independent of box proposal methods.
Comment: Extended technical report; short version to appear as an oral paper at
CVPR 2016. Code: https://github.com/gidariss/LocNet
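A minimal sketch of the probabilistic inference, assuming the model outputs per-row and per-column "inside the box" probabilities (one of the output forms LocNet supports): maximizing the likelihood of an interval reduces to a maximum-sum subarray over per-element log-odds.

```python
import numpy as np

def best_interval(p_inside, eps=1e-6):
    """Maximum-likelihood interval from per-row (or per-column) probabilities.

    p_inside[i] is the predicted probability that row/column i lies inside
    the object box. The interval [s, e] maximizes
        sum_{i in [s, e]} log p[i] + sum_{i outside} log(1 - p[i]).
    """
    p = np.clip(p_inside, eps, 1 - eps)
    # Relative to "everything outside", each inside element contributes
    # log p - log(1 - p), so we need the maximum-sum subarray (Kadane).
    gain = np.log(p) - np.log(1 - p)
    best, cur, s, e, cur_s = -np.inf, 0.0, 0, 0, 0
    for i, g in enumerate(gain):
        if cur <= 0:
            cur, cur_s = g, i
        else:
            cur += g
        if cur > best:
            best, s, e = cur, cur_s, i
    return s, e

def infer_box(p_rows, p_cols):
    """Combine independent row and column intervals into a bounding box."""
    y1, y2 = best_interval(p_rows)
    x1, x2 = best_interval(p_cols)
    return x1, y1, x2, y2
```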