Going Deeper for Multilingual Visual Sentiment Detection
This technical report details several improvements to the visual concept
detector banks built on images from the Multilingual Visual Sentiment Ontology
(MVSO). The detector banks are trained to detect a total of 9,918
sentiment-biased visual concepts from six major languages: English, Spanish,
Italian, French, German and Chinese. In the original MVSO release,
adjective-noun pair (ANP) detectors were trained for the six languages using an
AlexNet-styled architecture by fine-tuning from DeepSentiBank. Here, through a
more extensive set of experiments, parameter tuning, and training runs, we
detail and release higher-accuracy models for detecting ANPs across the six
languages. These models are trained on the same image pool and in the same
setting as the original release but use a more modern architecture, GoogLeNet,
providing comparable or better performance at a reduced network parameter cost.
In addition, since the image pool in MVSO can be corrupted by user noise from
social interactions, we partitioned out a sub-corpus of MVSO images based on
tag-restricted queries to obtain higher-fidelity labels. We show that, with
these higher-fidelity labels, higher-performing AlexNet-styled ANP detectors
can be trained on the tag-restricted image subset than on the full corpus. We
release all of these newly trained models for public research use, along with
the list of tag-restricted images from the MVSO dataset.
Comment: technical report, 7 pages
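As a hedged illustration of the training setup described above, the sketch below fine-tunes a GoogLeNet backbone for ANP classification in PyTorch. The released MVSO models were trained in Caffe, so this modern analogue, including the learning rates and the NUM_ANPS constant, is an assumption for illustration rather than the authors' recipe.

    import torch
    import torch.nn as nn
    from torchvision import models

    NUM_ANPS = 9918  # total ANP classes across the six MVSO languages

    # Start from an ImageNet-pretrained GoogLeNet and swap in an ANP classifier head.
    model = models.googlenet(weights=models.GoogLeNet_Weights.IMAGENET1K_V1)
    model.fc = nn.Linear(model.fc.in_features, NUM_ANPS)

    # Fine-tune: small learning rate for pretrained layers, larger for the new head.
    params = [
        {"params": [p for n, p in model.named_parameters()
                    if not n.startswith("fc")], "lr": 1e-4},
        {"params": model.fc.parameters(), "lr": 1e-3},
    ]
    optimizer = torch.optim.SGD(params, momentum=0.9, weight_decay=1e-4)
    criterion = nn.CrossEntropyLoss()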
On the Difficulty of Nearest Neighbor Search
Fast approximate nearest neighbor (NN) search in large databases is becoming
popular. Several powerful learning-based formulations have been proposed
recently. However, not much attention has been paid to a more fundamental
question: how difficult is (approximate) nearest neighbor search in a given
data set? And which data properties affect the difficulty of nearest neighbor
search and how? This paper introduces the first concrete measure called
Relative Contrast that can be used to evaluate the influence of several crucial
data characteristics such as dimensionality, sparsity, and database size
simultaneously in arbitrary normed metric spaces. Moreover, we present a
theoretical analysis proving how the difficulty measure (relative contrast)
governs the complexity of Locality-Sensitive Hashing, a popular
approximate NN search method. Relative contrast also provides an explanation
for a family of heuristic hashing algorithms with good practical performance
based on PCA. Finally, we show that most previous work on measuring the
meaningfulness or difficulty of NN search can be derived as special asymptotic
cases of the proposed measure for dense vectors.
Comment: Appears in Proceedings of the 29th International Conference on Machine Learning (ICML 2012)
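The Relative Contrast measure admits a direct empirical estimate under the reading given above: the ratio between a query's mean distance to the database and its nearest-neighbor distance, averaged over queries. The minimal NumPy sketch below follows that reading; the estimator details (sampling, the L_p norm parameter) are assumptions.

    import numpy as np

    def relative_contrast(data, queries, p=2):
        """Estimate relative contrast: mean over queries of D_mean / D_min."""
        ratios = []
        for q in queries:
            dists = np.linalg.norm(data - q, ord=p, axis=1)  # L_p distances to all points
            d_min = dists.min()
            if d_min > 0:
                ratios.append(dists.mean() / d_min)
        return float(np.mean(ratios))

    # A ratio near 1 means distances concentrate and NN search is hard, as
    # happens for dense high-dimensional data; larger ratios mean easier search.
    rng = np.random.default_rng(0)
    print(relative_contrast(rng.normal(size=(10000, 128)),
                            rng.normal(size=(50, 128))))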
CamSwarm: Instantaneous Smartphone Camera Arrays for Collaborative Photography
Camera arrays (CamArrays) are widely used in commercial filming projects to
achieve special visual effects such as the bullet-time effect, but they are
very expensive to set up. We propose CamSwarm, a low-cost and lightweight
alternative to professional CamArrays for consumer applications. It allows the
construction of a collaborative photography platform from multiple mobile
devices anywhere and anytime, enabling new capturing and editing experiences
that a single camera cannot provide. Our system allows easy team formation;
uses real-time visualization and feedback to guide camera positioning; provides
a mechanism for synchronized capturing; and finally allows the user to
efficiently browse and edit the captured imagery. Our user study suggests that
CamSwarm is easy to use, the provided real-time guidance is helpful, and the
full system achieves high-quality results that are promising for
non-professional use.
A demo video is provided at https://www.youtube.com/watch?v=LgkHcvcyTTM
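The abstract does not spell out the synchronization mechanism; one plausible sketch is for a server to broadcast a shared wall-clock deadline so every phone fires at the same instant despite differing network latency. Everything below (the message format, CAPTURE_DELAY) is illustrative, not the paper's protocol.

    import time
    import threading

    CAPTURE_DELAY = 0.5  # seconds of headroom for the broadcast to reach all clients

    def broadcast_capture(clients):
        """clients: callables that deliver a message to each device (transport stubs)."""
        fire_at = time.time() + CAPTURE_DELAY
        for send in clients:
            send({"cmd": "capture", "fire_at": fire_at})

    def on_message(msg, take_picture):
        """Client side: schedule the shutter for the shared deadline."""
        if msg["cmd"] == "capture":
            delay = max(0.0, msg["fire_at"] - time.time())
            threading.Timer(delay, take_picture).start()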
Temporal Action Localization in Untrimmed Videos via Multi-stage CNNs
We address temporal action localization in untrimmed long videos. This is
important because videos in real applications are usually unconstrained and
contain multiple action instances plus video content of background scenes or
other activities. To address this challenging issue, we exploit the
effectiveness of deep networks in temporal action localization via three
segment-based 3D ConvNets: (1) a proposal network identifies candidate segments
in a long video that may contain actions; (2) a classification network learns a
one-vs-all action classification model to serve as initialization for the
localization network; and (3) a localization network fine-tunes the learned
classification network to localize each action instance. We propose a novel
loss function for the localization network to explicitly consider temporal
overlap and therefore achieve high temporal localization accuracy. Only the
proposal network and the localization network are used during prediction. On
two large-scale benchmarks, our approach achieves significantly superior
performances compared with other state-of-the-art systems: mAP increases from
1.7% to 7.4% on MEXaction2 and increases from 15.0% to 19.0% on THUMOS 2014,
when the overlap threshold for evaluation is set to 0.5.
Comment: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016
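As a hedged sketch of an overlap-aware loss in the spirit described, the PyTorch snippet below adds to the usual cross-entropy a term that penalizes confident predictions on segments with low temporal IoU. The exact functional form in the paper may differ; alpha, lam, and the background-is-class-0 convention are assumptions.

    import torch
    import torch.nn.functional as F

    def localization_loss(logits, labels, tiou, alpha=0.25, lam=1.0):
        """logits: (N, C) segment scores; labels: (N,); tiou: (N,) IoU with ground truth."""
        ce = F.cross_entropy(logits, labels)
        probs = F.softmax(logits, dim=1)
        p_true = probs.gather(1, labels.unsqueeze(1)).squeeze(1)  # confidence on true class
        fg = (labels > 0).float()  # assume class 0 is background; only foreground counts
        # Confident predictions on low-overlap segments are pushed down.
        overlap = 0.5 * ((p_true ** 2) / tiou.clamp(min=1e-3) ** alpha - 1.0)
        return ce + lam * (overlap * fg).sum() / fg.sum().clamp(min=1.0)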
Event Specific Multimodal Pattern Mining with Image-Caption Pairs
In this paper we describe a novel framework and algorithms for discovering
image patch patterns from a large corpus of weakly supervised image-caption
pairs generated from news events. Whereas current pattern mining techniques
attempt to find patterns that are representative and discriminative, we
stipulate that our discovered patterns must also be recognizable by humans,
preferably with
meaningful names. We propose a new multimodal pattern mining approach that
leverages the descriptive captions often accompanying news images to learn
semantically meaningful image patch patterns. The multimodal patterns are then
named using words mined from the associated image captions for each pattern. A
novel evaluation framework is provided that demonstrates our patterns are 26.2%
more semantically meaningful than those discovered by a state-of-the-art
vision-only pipeline, and that we can provide tags for the discovered image
patches with 54.5% accuracy with no direct supervision. Our methods also
discover named patterns beyond those covered by the existing image datasets
like ImageNet. To the best of our knowledge, this is the first algorithm
developed to automatically mine image patch patterns that have strong semantic
meaning specific to high-level news events, and then to evaluate these patterns
by those criteria.
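As a simplified, hedged illustration of the naming step, one can rank the caption words attached to each pattern cluster by TF-IDF and take the top term as the name. The paper's mining pipeline is richer, so treat this scikit-learn sketch as an assumption-laden toy.

    from sklearn.feature_extraction.text import TfidfVectorizer

    def name_patterns(pattern_captions):
        """pattern_captions: one string per pattern, concatenating its image captions."""
        vec = TfidfVectorizer(stop_words="english")
        tfidf = vec.fit_transform(pattern_captions)
        vocab = vec.get_feature_names_out()
        # Name each pattern by its highest-weighted caption word.
        return [vocab[row.toarray().argmax()] for row in tfidf]

    print(name_patterns([
        "soldiers patrol the street soldiers in uniform",
        "protesters wave flags protesters march downtown",
    ]))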
Generic Instance Search and Re-identification from One Example via Attributes and Categories
This paper aims for generic instance search from one example where the
instance can be an arbitrary object like shoes, not just near-planar and
one-sided instances like buildings and logos. First, we evaluate
state-of-the-art instance search methods on this problem. We observe that what
works for buildings loses its generality on shoes. Second, we propose to use
automatically learned category-specific attributes to address the large
appearance variations present in generic instance search. Searching among
instances from the same category as the query, the category-specific attributes
outperform existing approaches by a large margin on shoes and cars and perform
on par with the state-of-the-art on buildings. Third, we treat person
re-identification as a special case of generic instance search. On the popular
VIPeR dataset, we reach state-of-the-art performance with the same method.
Fourth, we extend our method to search objects without restriction to the
specifically known category. We show that the combination of category-level
information and the category-specific attributes is superior to the alternative
method combining category-level information with low-level features such as
Fisher vectors.
Comment: This technical report is an extended version of our previous conference paper 'Attributes and Categories for Generic Instance Search from One Example' (CVPR 2015)
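A minimal sketch of the core retrieval step, under simplifying assumptions of this write-up rather than the paper's exact pipeline: represent the query and every gallery instance by their predicted category-specific attribute scores and rank the gallery by cosine similarity.

    import numpy as np

    def rank_by_attributes(query_attrs, gallery_attrs):
        """query_attrs: (A,) attribute scores; gallery_attrs: (N, A). Returns ranked indices."""
        q = query_attrs / np.linalg.norm(query_attrs)
        g = gallery_attrs / np.linalg.norm(gallery_attrs, axis=1, keepdims=True)
        return np.argsort(-(g @ q))  # gallery indices, best match first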
Building A Large Concept Bank for Representing Events in Video
Concept-based video representation has proven to be effective in complex
event detection. However, existing methods either manually design concepts or
directly adopt concept libraries not specifically designed for events. In this
paper, we propose to build Concept Bank, the largest concept library consisting
of 4,876 concepts specifically designed to cover 631 real-world events. To
construct the Concept Bank, we first gather a comprehensive event collection
from WikiHow, a collaborative writing project that aims to build the world's
largest manual for any possible How-To event. For each event, we then search
Flickr and discover relevant concepts from the tags of the returned images. We
train a Multiple Kernel Linear SVM for each discovered concept as a concept
detector in Concept Bank. We organize the concepts into a five-layer tree
structure, in which the higher-level nodes correspond to the event categories
while the leaf nodes are the event-specific concepts discovered for each event.
Based on such tree ontology, we develop a semantic matching method to select
relevant concepts for each textual event query, and then apply the
corresponding concept detectors to generate concept-based video
representations. Using the open-source event definitions and videos of TRECVID
Multimedia Event Detection 2013 and Columbia Consumer Video as our test sets,
we show very promising results on two video event detection tasks: event
modeling
over concept space and zero-shot event retrieval. To the best of our knowledge,
this is the largest concept library covering the largest number of real-world
events.
Comment: 25 pages, 9 figures
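The concept-selection step could look roughly like the sketch below: embed the textual event query and every concept name, then keep the top-k concepts by similarity and fire their detectors. The flat (non-tree) matching and the embed callable are assumptions; the paper matches against its five-layer ontology.

    import numpy as np

    def select_concepts(query, concept_names, embed, k=10):
        """embed: assumed callable mapping a string to a unit-norm vector."""
        q = embed(query)
        scores = np.array([q @ embed(name) for name in concept_names])
        top = np.argsort(-scores)[:k]
        return [concept_names[i] for i in top]  # concept detectors to apply for this query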
Learning to Hash for Indexing Big Data - A Survey
The explosive growth in big data has attracted much attention in designing
efficient indexing and search methods recently. In many critical applications
such as large-scale search and pattern matching, finding the nearest neighbors
to a query is a fundamental research problem. However, the straightforward
solution using exhaustive comparison is infeasible due to the prohibitive
computational complexity and memory requirement. In response, Approximate
Nearest Neighbor (ANN) search based on hashing techniques has become popular
due to its promising performance in both efficiency and accuracy. Prior
randomized hashing methods, e.g., Locality-Sensitive Hashing (LSH), explore
data-independent hash functions with random projections or permutations.
Although randomized hashing enjoys elegant theoretical guarantees on search
quality in certain metric spaces, its performance has proven insufficient in
many real-world applications. As a remedy, new approaches that incorporate
data-driven learning methods into the development of advanced hash functions
have emerged. Such learning-to-hash methods exploit information such as data
distributions or class labels when optimizing the hash codes or functions.
Importantly, the learned hash codes preserve, in the hash code space, the
proximity that neighboring data exhibit in the original feature space. The
goal of this paper is to provide readers with a systematic understanding of
the insights, pros, and cons of the emerging techniques. We provide a comprehensive
survey of the learning to hash framework and representative techniques of
various types, including unsupervised, semi-supervised, and supervised. In
addition, we summarize recent hashing approaches that utilize deep learning
models. Finally, we discuss future directions and trends of research in this
area.
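For concreteness, the data-independent baseline that the survey contrasts with learned hashing can be written in a few lines: each bit of a random-projection LSH code is the sign of a random hyperplane projection. This minimal NumPy sketch is illustrative, not any particular system.

    import numpy as np

    class RandomProjectionLSH:
        def __init__(self, dim, n_bits, seed=0):
            rng = np.random.default_rng(seed)
            self.planes = rng.normal(size=(n_bits, dim))  # one random hyperplane per bit

        def hash(self, x):
            return (self.planes @ x > 0).astype(np.uint8)  # n_bits binary code

    lsh = RandomProjectionLSH(dim=128, n_bits=64)
    code = lsh.hash(np.random.default_rng(1).normal(size=128))
    # Learning-to-hash methods replace self.planes with projections optimized on
    # data distributions or class labels so that neighbors receive nearby codes.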
PPR-FCN: Weakly Supervised Visual Relation Detection via Parallel Pairwise R-FCN
We aim to tackle a novel vision task called Weakly Supervised Visual Relation
Detection (WSVRD) to detect "subject-predicate-object" relations in an image
with object relation ground truth available only at the image level. This is
motivated by the fact that it is extremely expensive to label the combinatorial
relations between objects at the instance level. Compared to the extensively
studied problem of Weakly Supervised Object Detection (WSOD), WSVRD is more
challenging because it must examine a large set of region pairs, which is
computationally prohibitive and more likely to get stuck in a poor local
optimum, such as one induced by wrong spatial context. To this end, we present a
Parallel, Pairwise Region-based, Fully Convolutional Network (PPR-FCN) for
WSVRD. It uses a parallel FCN architecture that simultaneously performs pair
selection and classification of single regions and region pairs for object and
relation detection, while sharing almost all computation over the entire
image. In particular, we propose a novel position-role-sensitive score map with
pairwise RoI pooling to efficiently capture the crucial context associated with
a pair of objects. We demonstrate the superiority of PPR-FCN over all baselines
in solving the WSVRD challenge through extensive experiments on two visual
relation benchmarks.
Comment: To appear in International Conference on Computer Vision (ICCV) 2017, Venice, Italy
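As a heavily simplified, hedged sketch of pooling pairwise context, the snippet below extracts features for a subject box, an object box, and their union region with standard RoI align. PPR-FCN's actual position-role-sensitive score maps and pair selection are more involved than this.

    import torch
    import torchvision.ops as ops

    def pairwise_roi_features(score_maps, subj_box, obj_box):
        """score_maps: (1, C, H, W) conv features; boxes: float tensors (x1, y1, x2, y2)."""
        union = torch.cat([torch.minimum(subj_box[:2], obj_box[:2]),
                           torch.maximum(subj_box[2:], obj_box[2:])]).unsqueeze(0)
        rois = torch.cat([subj_box.unsqueeze(0), obj_box.unsqueeze(0), union])
        feats = ops.roi_align(score_maps, [rois], output_size=(7, 7))
        return feats.mean(dim=(2, 3))  # (3, C): subject, object, and joint context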
PanoSwarm: Collaborative and Synchronized Multi-Device Panoramic Photography
Taking a picture has traditionally been a one-person task. In this paper we
present a novel system that allows multiple mobile devices to work
collaboratively in a synchronized fashion to capture a panorama of a highly
dynamic scene, creating an entirely new photography experience that encourages
social interactions and teamwork. Our system contains two components: a client
app that runs on all participating devices, and a server program that monitors
and communicates with each device. In a capture session, the server collects
the viewfinder images of all devices in real time and stitches them on the fly
into a panorama preview, which is then streamed to all devices as visual
guidance. The system also allows one camera to be the host and to send direct
visual instructions to others to guide camera adjustment. When ready, all
devices take pictures at the same time for panorama stitching. Our preliminary
study suggests that the proposed system can help users capture high quality
panoramas with an enjoyable teamwork experience.
A demo video of the system in action is provided at
http://youtu.be/PwQ6k_ZEQSs
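The server-side preview step could be prototyped with OpenCV's high-level stitcher, as in this hedged sketch; the real system presumably uses a faster on-the-fly method tuned for streaming viewfinder frames.

    import cv2

    def stitch_preview(frames):
        """frames: list of BGR images (numpy arrays), one viewfinder frame per device."""
        stitcher = cv2.Stitcher_create(cv2.Stitcher_PANORAMA)
        status, pano = stitcher.stitch(frames)
        return pano if status == cv2.Stitcher_OK else None  # None if stitching failed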