Online Metric-Weighted Linear Representations for Robust Visual Tracking
In this paper, we propose a visual tracker based on a metric-weighted linear
representation of appearance. In order to capture the interdependence of
different feature dimensions, we develop two online distance metric learning
methods using proximity comparison information and structured output learning.
The learned metric is then incorporated into a linear representation of
appearance.
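As a concrete illustration of the metric-weighted idea, the sketch below implements a squared Mahalanobis distance d_M(x, y) = (x - y)^T M (x - y) and one online update from a proximity comparison (a triplet stating that x should be closer to x_sim than to x_dis). The hinge loss, learning rate, and PSD projection are generic choices for this family of methods, not the paper's exact formulation.

```python
import numpy as np

def mahalanobis_sq(x, y, M):
    """Squared Mahalanobis distance d_M(x, y) = (x - y)^T M (x - y)."""
    d = x - y
    return d @ M @ d

def proximity_update(M, x, x_sim, x_dis, lr=0.01, margin=1.0):
    """One online metric update from a proximity comparison: take a
    gradient step on the triplet hinge loss
    max(0, margin + d_M(x, x_sim) - d_M(x, x_dis)),
    then project M back onto the PSD cone so it remains a valid metric."""
    loss = margin + mahalanobis_sq(x, x_sim, M) - mahalanobis_sq(x, x_dis, M)
    if loss > 0:
        ds, dd = x - x_sim, x - x_dis
        M = M - lr * (np.outer(ds, ds) - np.outer(dd, dd))
        w, V = np.linalg.eigh(M)               # M stays symmetric here
        M = (V * np.clip(w, 0.0, None)) @ V.T  # clip negative eigenvalues
    return M
```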
We show that online distance metric learning significantly improves the
robustness of the tracker, especially on those sequences exhibiting drastic
appearance changes. In order to bound growth in the number of training samples,
we design a time-weighted reservoir sampling method.
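The sampler itself is not spelled out in the abstract; a minimal sketch, assuming the standard Efraimidis-Spirakis weighted reservoir scheme with a hypothetical exponential time weight w_t = exp(lam * t) that favors recent frames, might look like this:

```python
import heapq
import math
import random

def reservoir_update(reservoir, item, t, k, lam=0.05):
    """Time-weighted reservoir sampling step (Efraimidis-Spirakis A-Res).
    Each item gets key u**(1/w) with u ~ Uniform(0, 1) and time weight
    w = exp(lam * t); keeping the k largest keys yields a sample biased
    toward recent items while bounding memory at k."""
    w = math.exp(lam * t)                      # assumed weight form
    key = random.random() ** (1.0 / w)
    entry = (key, t, item)                     # t breaks ties in the heap
    if len(reservoir) < k:
        heapq.heappush(reservoir, entry)
    elif key > reservoir[0][0]:
        heapq.heapreplace(reservoir, entry)    # evict the smallest key
    return reservoir
```

Calling this once per frame keeps at most k training samples in memory, with newer frames progressively more likely to survive.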
Moreover, we enable our tracker to automatically perform object
identification during tracking by introducing a collection of static
template samples belonging to several object classes of interest. Object
identification results for an entire video sequence are then obtained by
systematically combining the tracking information with per-frame visual
recognition. Experimental results on challenging video sequences
demonstrate the effectiveness of the method for both inter-frame tracking and
object identification.
Comment: 51 pages. Appearing in IEEE Transactions on Pattern Analysis and Machine Intelligence.
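The combination rule is not given in the abstract; one simple, hypothetical way to fuse per-frame recognition into a sequence-level identity is to sum log-posteriors over frames:

```python
import numpy as np

def identify_sequence(frame_probs):
    """Combine per-frame class posteriors (a T x C array whose rows sum
    to 1) into a single sequence-level decision by summing
    log-probabilities, i.e. treating frames as independent observations."""
    log_scores = np.log(np.clip(frame_probs, 1e-12, 1.0)).sum(axis=0)
    return int(np.argmax(log_scores))
```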
Representation Learning for Scale-free Networks
Network embedding aims to learn low-dimensional representations of the
vertexes in a network while preserving the network's structure and inherent
properties. Existing network embedding works primarily focus on preserving
the microscopic structure, such as the first- and second-order proximity of
vertexes, while the macroscopic scale-free property is largely ignored.
The scale-free property refers to the fact that vertex degrees follow a
heavy-tailed distribution (i.e., only a few vertexes have high degrees) and
is a critical property of real-world networks, such as social networks. In this paper, we
study the problem of learning representations for scale-free networks. We first
theoretically analyze the difficulty of embedding and reconstructing a
scale-free network in the Euclidean space, by converting our problem to the
sphere packing problem. Then, we propose the "degree penalty" principle for
designing scale-free-property-preserving network embedding algorithms:
penalize proximity between high-degree vertexes. We introduce two
implementations of our principle, using spectral techniques and a skip-gram
model, respectively. Extensive experiments on six datasets show that our
algorithms not only reconstruct the heavy-tailed degree distribution but
also outperform state-of-the-art embedding models on various network mining
tasks, such as vertex classification and link prediction.
Comment: 8 figures; accepted by AAAI 201
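A minimal spectral sketch of the degree penalty principle, assuming the penalty takes the form w_ij / (d_i * d_j)^beta (the paper's exact formulation may differ):

```python
import numpy as np

def degree_penalized_embedding(A, dim=2, beta=1.0):
    """Down-weight edges between high-degree vertexes by dividing each
    weight by (d_i * d_j)**beta, then run a Laplacian-eigenmaps-style
    embedding on the penalized graph. A is a dense adjacency matrix."""
    deg = A.sum(axis=1)
    W = A / np.clip(np.outer(deg, deg) ** beta, 1e-12, None)
    L = np.diag(W.sum(axis=1)) - W            # unnormalized Laplacian
    vals, vecs = np.linalg.eigh(L)            # eigenvalues ascending
    return vecs[:, 1:dim + 1]                 # skip the trivial eigenvector
```

Larger beta penalizes hub-hub proximity more aggressively, spreading high-degree vertexes apart in the embedding space.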
Zero-Shot Recognition using Dual Visual-Semantic Mapping Paths
Zero-shot recognition aims to accurately recognize objects of unseen classes
by using a shared visual-semantic mapping between the image feature space and
the semantic embedding space. This mapping is learned on training data of seen
classes and is expected to have transfer ability to unseen classes. In this
paper, we tackle this problem by exploiting the intrinsic relationship between
the semantic space manifold and the transfer ability of the visual-semantic
mapping. We formalize their connection and cast zero-shot recognition as a
joint optimization problem. Motivated by this, we propose a novel framework for
zero-shot recognition, which contains dual visual-semantic mapping paths. Our
analysis shows that this framework can not only apply prior semantic
knowledge to infer the underlying semantic manifold in the image feature
space, but also generate an optimized semantic embedding space, which
enhances the transfer ability of the visual-semantic mapping to unseen
classes. The proposed method is evaluated for zero-shot recognition on four
benchmark datasets, achieving outstanding results.
Comment: Accepted as a full paper at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 201
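For context, a common single-path baseline for the visual-semantic mapping discussed here (not the paper's dual-path method) is a closed-form ridge-regression map from image features to class embeddings, followed by nearest-embedding classification over unseen classes:

```python
import numpy as np

def fit_visual_semantic_map(X, S, lam=1.0):
    """Ridge regression W mapping image features X (n x d) to the
    semantic embeddings S (n x k) of their labels:
    W = (X^T X + lam * I)^{-1} X^T S."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ S)

def zero_shot_predict(x, W, unseen_embeddings):
    """Project a test feature into semantic space and return the index
    of the nearest unseen-class embedding by cosine similarity."""
    s = x @ W
    E = unseen_embeddings
    sims = (E @ s) / (np.linalg.norm(E, axis=1) * np.linalg.norm(s) + 1e-12)
    return int(np.argmax(sims))
```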
Urban Dreams of Migrants: A Case Study of Migrant Integration in Shanghai
Unprecedented human mobility has driven the rapid urbanization around the
world. In China, the fraction of the population dwelling in cities increased from
17.9% to 52.6% between 1978 and 2012. Such large-scale migration poses
challenges for policymakers and important questions for researchers. To
investigate the process of migrant integration, we employ a complete
one-month dataset of telecommunication metadata in Shanghai, covering 54 million users and 698
million call logs. We find systematic differences between locals and migrants
in their mobile communication networks and geographical locations. For
instance, migrants have more diverse contacts and move around the city with a
larger radius than locals after they settle down. By distinguishing new
migrants (who recently moved to Shanghai) from settled migrants (who have been
in Shanghai for a while), we demonstrate the integration process of new
migrants in their first three weeks. Moreover, we formulate classification
problems to predict whether a person is a migrant. Our classifier is able to
achieve an F1-score of 0.82 when distinguishing settled migrants from locals,
but it remains challenging to identify new migrants because of class imbalance.
This classification setup holds promise for identifying new migrants who
will successfully integrate into the local population (i.e., new migrants
who are misclassified as locals).
Comment: A modified version. The paper was accepted by AAAI 201
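The movement "radius" mentioned above is conventionally measured as the radius of gyration; a sketch assuming planar coordinates (a real call-log study would first project latitude/longitude):

```python
import numpy as np

def radius_of_gyration(locations):
    """r_g = sqrt(mean ||r_i - r_mean||^2) over a user's visited
    locations (an n x 2 array of planar coordinates): a standard
    measure of how widely a person moves."""
    r = np.asarray(locations, dtype=float)
    centroid = r.mean(axis=0)
    return float(np.sqrt(((r - centroid) ** 2).sum(axis=1).mean()))
```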
Video Question Answering via Attribute-Augmented Attention Network Learning
Video question answering is a challenging problem in visual information
retrieval: the answer to a question must be provided according to the
referenced video content. However, existing visual question answering
approaches mainly tackle static-image questions and may be ineffective for
video question answering because they insufficiently model the temporal
dynamics of video content. In this paper, we study the problem of video
question answering by modeling its temporal dynamics with a frame-level
attention mechanism. We propose an attribute-augmented attention network
learning framework that enables joint frame-level attribute detection and
unified video representation learning for video question answering. We then
incorporate a multi-step reasoning process into the proposed attention
network to further improve performance. We construct a large-scale video
question answering dataset and conduct experiments on both multiple-choice
and open-ended video question answering tasks to show the effectiveness of
the proposed method.
Comment: Accepted for SIGIR 201
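A generic frame-level attention of the kind described, with assumed names and dimensions rather than the paper's exact architecture, can be sketched as follows: each frame feature is scored against the question vector, the scores are softmax-normalized, and the frames are averaged under those weights.

```python
import numpy as np

def frame_attention(frames, question, Wf, Wq, v):
    """Additive frame-level attention: e_t = v^T tanh(Wf f_t + Wq q),
    alpha = softmax(e), output = sum_t alpha_t f_t.
    frames: (T, d) frame features; question: (q,) question encoding;
    Wf: (h, d), Wq: (h, q), v: (h,) are learned parameters."""
    scores = np.tanh(frames @ Wf.T + question @ Wq.T) @ v   # (T,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                                # softmax
    return weights @ frames, weights                        # (d,), (T,)
```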