Spatial-Aware Object Embeddings for Zero-Shot Localization and Classification of Actions
We aim for zero-shot localization and classification of human actions in
video. Where traditional approaches rely on global attribute or object
classification scores for their zero-shot knowledge transfer, our main
contribution is a spatial-aware object embedding. To arrive at spatial
awareness, we build our embedding on top of freely available actor and object
detectors. Relevance of objects is determined in a word embedding space and
further enforced with estimated spatial preferences. Besides local object
awareness, we also embed global object awareness into our embedding to maximize
actor and object interaction. Finally, we exploit the object positions and
sizes in the spatial-aware embedding to demonstrate a new spatio-temporal
action retrieval scenario with composite queries. Action localization and
classification experiments on four contemporary action video datasets support
our proposal. Apart from state-of-the-art results in the zero-shot localization
and classification settings, our spatial-aware embedding is even competitive
with recent supervised action localization alternatives.
Comment: ICC
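The abstract above says that object relevance is determined in a word embedding space. A minimal sketch of that idea follows, assuming cosine similarity as the relevance score and made-up 3-dimensional vectors; the function names, the toy vectors, and the top-k selection are illustrative assumptions, not the authors' implementation, and the spatial preferences are not modeled here.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_relevant_objects(action_vec, object_vecs, k=2):
    """Rank object names by similarity to the action embedding (hypothetical helper)."""
    scores = {name: cosine_similarity(action_vec, vec)
              for name, vec in object_vecs.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]

# Toy embeddings, invented for illustration only (real word embeddings
# would come from a pretrained model and have hundreds of dimensions).
action = np.array([1.0, 0.2, 0.0])  # stands in for an action such as "riding horse"
objects = {
    "horse": np.array([0.9, 0.3, 0.1]),
    "saddle": np.array([0.7, 0.1, 0.4]),
    "keyboard": np.array([0.0, 0.1, 1.0]),
}

ranked = top_relevant_objects(action, objects, k=2)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

In a zero-shot setting, scores of this kind could weight the detections of each object class when scoring an unseen action, since no training videos of that action are available.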
Counting with Focus for Free
This paper aims to count arbitrary objects in images. The leading counting
approaches start from point annotations per object from which they construct
density maps. Then, their training objective transforms input images to density
maps through deep convolutional networks. We posit that the point annotations
serve more supervision purposes than just constructing density maps. We
introduce ways to repurpose the points for free. First, we propose supervised
focus from segmentation, where points are converted into binary maps. The
binary maps are combined with a network branch and accompanying loss function
to focus on areas of interest. Second, we propose supervised focus from global
density, where the ratio of point annotations to image pixels is used in
another branch to regularize the overall density estimation. To assist both the
density estimation and the focus from segmentation, we also introduce an
improved kernel size estimator for the point annotations. Experiments on six
datasets show that all our contributions reduce the counting error, regardless
of the base network, resulting in state-of-the-art accuracy using only a single
network. Finally, we are the first to count on WIDER FACE, allowing us to show
the benefits of our approach in handling varying object scales and crowding
levels. Code is available at
https://github.com/shizenglin/Counting-with-Focus-for-Free
Comment: ICCV, 201
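The two ways of reusing point annotations described in the abstract above can be sketched as follows. This is a simplified illustration, not the released code: the fixed square neighborhood (`radius`) is an assumption that stands in for the paper's kernel size estimator, and the function names are hypothetical.

```python
import numpy as np

def points_to_binary_map(points, height, width, radius=1):
    """Turn (row, col) point annotations into a binary focus map.

    Each point marks a square neighborhood as foreground. The fixed radius
    is an assumption made here for simplicity.
    """
    mask = np.zeros((height, width), dtype=np.uint8)
    for row, col in points:
        r0, r1 = max(0, row - radius), min(height, row + radius + 1)
        c0, c1 = max(0, col - radius), min(width, col + radius + 1)
        mask[r0:r1, c0:c1] = 1
    return mask

def global_density(points, height, width):
    """Ratio of point annotations to image pixels."""
    return len(points) / float(height * width)

# Two annotated objects in a 10x10 image (toy data).
points = [(2, 2), (7, 5)]
mask = points_to_binary_map(points, 10, 10, radius=1)
density = global_density(points, 10, 10)
print(mask.sum(), density)
```

The binary map could serve as the target of a segmentation-style branch, and the scalar density as the target of a global regularization branch; how those branches and losses are wired into the network is left to the paper and its repository.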
Crossover trimers connecting continuous and discrete scaling regimes
For a system of two identical fermions and one distinguishable particle
interacting via a short-range potential with a large s-wave scattering length,
the Efimov trimers and Kartavtsev-Malykh trimers exist in different regimes of
the mass ratio. The Efimov trimers are known to exhibit a discrete scaling
invariance, while the Kartavtsev-Malykh trimers feature a continuous scaling
invariance. We point out that a third type of trimer, the "crossover trimer",
exists universally regardless of short-range details of the potential. These
crossover trimers have neither discrete nor continuous scaling invariance.
We show that the crossover trimers continuously connect the discrete and
continuous scaling regimes as the mass ratio and the scattering length are
varied. We identify the regions for the Kartavtsev-Malykh trimers, Efimov
trimers, crossover trimers, and non-universal trimers as a function of the mass
ratio and the s-wave scattering length by investigating the scaling property
and model-independence of the trimers.
Comment: 14 pages, 9 figure
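For readers unfamiliar with the two scaling invariances named in the abstract above, they are commonly written as below. These are standard textbook forms rather than formulas quoted from this abstract; $s_0$ denotes a parameter that depends on the mass ratio, $a$ the s-wave scattering length, and $m$ a particle mass, and the proportionality constant in the second relation also depends on the mass ratio.

```latex
% Discrete scaling invariance (Efimov trimers) at unitarity, 1/a = 0:
% the spectrum maps onto itself only under rescaling by the fixed factor e^{\pi/s_0}.
\frac{E_{n+1}}{E_{n}} = e^{-2\pi/s_0}

% Continuous scaling invariance (Kartavtsev-Malykh trimers):
% the scattering length is the only length scale, so for any \lambda > 0
E(\lambda a) = \lambda^{-2} E(a), \qquad E(a) \propto -\frac{\hbar^{2}}{m a^{2}}
```

Crossover trimers, as described in the abstract, satisfy neither relation.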
Localizing Actions from Video Labels and Pseudo-Annotations
The goal of this paper is to determine the spatio-temporal location of
actions in video. Where training from hard-to-obtain box annotations is the
norm, we propose an intuitive and effective algorithm that localizes actions
from their class label only. We are inspired by recent work showing that
unsupervised action proposals selected with human point-supervision perform as
well as using expensive box annotations. Rather than asking users to provide
point supervision, we propose fully automatic visual cues that replace manual
point annotations. We call the cues pseudo-annotations, introduce five of them,
and propose a correlation metric for automatically selecting and combining
them. Thorough evaluation on challenging action localization datasets shows
that we reach results comparable to results with full box supervision. We also
show that pseudo-annotations can be leveraged during testing to improve weakly-
and strongly-supervised localizers.
Comment: BMV
No Spare Parts: Sharing Part Detectors for Image Categorization
This work aims for image categorization using a representation of distinctive
parts. Different from existing part-based work, we argue that parts are
naturally shared between image categories and should be modeled as such. We
motivate our approach with a quantitative and qualitative analysis by
backtracking where selected parts come from. Our analysis shows that in
addition to the category parts defining the class, the parts coming from the
background context and parts from other image categories improve categorization
performance. Part selection should not be done separately for each category,
but instead be shared and optimized over all categories. To incorporate part
sharing between categories, we present an algorithm based on AdaBoost to
jointly optimize part sharing and selection, as well as fusion with the global
image representation. We achieve results competitive with the state of the art on
object, scene, and action categories, further improving over deep convolutional
neural networks.