A Comprehensive Survey on Cross-modal Retrieval
In recent years, cross-modal retrieval has drawn much attention due to the
rapid growth of multimodal data. It takes one type of data as the query to
retrieve relevant data of another type. For example, a user can use a text query to
retrieve relevant pictures or videos. Since the query and its retrieved results
can be of different modalities, how to measure the content similarity between
different modalities of data remains a challenge. Various methods have been
proposed to deal with such a problem. In this paper, we first review a number
of representative methods for cross-modal retrieval and classify them into two
main groups: 1) real-valued representation learning, and 2) binary
representation learning. Real-valued representation learning methods aim to
learn real-valued common representations for different modalities of data. To speed up cross-modal retrieval, a number of binary representation learning methods have been proposed to map different modalities of data into a common Hamming
space. Then, we introduce several multimodal datasets in the community, and
show the experimental results on two commonly used multimodal datasets. The comparison reveals the characteristics of different kinds of cross-modal retrieval methods, which is expected to benefit both practical applications and future research. Finally, we discuss open problems and future research directions.
Comment: 20 pages, 11 figures, 9 tables
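As a minimal illustration of the two families surveyed above, the NumPy sketch below embeds text and image features into a shared space and retrieves either by cosine similarity (real-valued representation) or by Hamming distance after sign-thresholding (binary representation). The projection matrices are random placeholders standing in for learned ones (e.g., learned by CCA or a deep network); all names and dimensions are illustrative assumptions, not any particular method from the survey.

import numpy as np

rng = np.random.default_rng(0)
d_img, d_txt, d_common = 512, 300, 64

# Placeholder "learned" projections into a shared space (random here).
W_img = rng.standard_normal((d_img, d_common))
W_txt = rng.standard_normal((d_txt, d_common))

def embed(x, W):
    # Project into the common space and L2-normalize.
    z = x @ W
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

def to_binary(z):
    # Binary representation: sign-threshold the common embedding.
    return (z > 0).astype(np.uint8)

# A text query against an image database.
query_txt = rng.standard_normal(d_txt)
images = rng.standard_normal((1000, d_img))

q = embed(query_txt, W_txt)
db = embed(images, W_img)

# Real-valued retrieval: rank by cosine similarity.
ranking_real = np.argsort(-(db @ q))

# Binary retrieval: rank by Hamming distance in the common Hamming space.
ranking_bin = np.argsort((to_binary(db) != to_binary(q)).sum(axis=1))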
Tracking Large-Scale Video Remix in Real-World Events
Social information networks, such as YouTube, contain traces of both explicit online interactions (such as liking, leaving a comment, or subscribing to a video feed) and latent interactions (such as quoting or remixing parts of a video). We propose visual memes, or frequently re-posted short video
segments, for tracking such latent video interactions at scale. Visual memes
are extracted with high accuracy by scalable detection algorithms that we develop. We further augment visual memes with text, via a statistical model of
latent topics. We model content interactions on YouTube with visual memes,
defining several measures of influence and building predictive models for meme
popularity. Experiments are carried out on over 2 million video shots from
more than 40,000 videos on two prominent news events in 2009: the election in
Iran and the swine flu epidemic. In these two events, a high percentage of
videos contain remixed content, and it is apparent that traditional news media
and citizen journalists have different roles in disseminating remixed content.
We perform two quantitative evaluations for annotating visual memes and
predicting their popularity. The joint statistical model of visual memes and
words outperforms a concurrence model, and the average error is ~2% for predicting meme volume and ~17% for their lifespan.
Comment: 11 pages, accepted for journal publication
Hinge-Loss Markov Random Fields and Probabilistic Soft Logic
A fundamental challenge in developing high-impact machine learning
technologies is balancing the need to model rich, structured domains with the
ability to scale to big data. Many important problem areas are both richly
structured and large scale, from social and biological networks, to knowledge
graphs and the Web, to images, video, and natural language. In this paper, we
introduce two new formalisms for modeling structured data, and show that they
can both capture rich structure and scale to big data. The first, hinge-loss
Markov random fields (HL-MRFs), is a new kind of probabilistic graphical model
that generalizes different approaches to convex inference. We unite three
approaches from the randomized algorithms, probabilistic graphical models, and
fuzzy logic communities, showing that all three lead to the same inference
objective. We then define HL-MRFs by generalizing this unified objective. The
second new formalism, probabilistic soft logic (PSL), is a probabilistic
programming language that makes HL-MRFs easy to define using a syntax based on
first-order logic. We introduce an algorithm for inferring most-probable
variable assignments (MAP inference) that is much more scalable than
general-purpose convex optimization methods, because it uses message passing to
take advantage of sparse dependency structures. We then show how to learn the
parameters of HL-MRFs. The learned HL-MRFs are as accurate as analogous
discrete models, but much more scalable. Together, these algorithms enable
HL-MRFs and PSL to model rich, structured data at scales not previously possible.
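For reference, the HL-MRF density as it is commonly written in the literature (the notation below is ours, added for concreteness, and may differ from the paper's):

\[
P(\mathbf{y} \mid \mathbf{x}) \;=\; \frac{1}{Z(\mathbf{w}, \mathbf{x})}
\exp\!\Bigl(-\sum_{j=1}^{m} w_j \bigl(\max\{\ell_j(\mathbf{y}, \mathbf{x}),\, 0\}\bigr)^{p_j}\Bigr),
\qquad \mathbf{y} \in [0,1]^n,\;\; w_j \ge 0,\;\; p_j \in \{1, 2\},
\]

where each \(\ell_j\) is a linear function of the continuous variables. Because every potential is a hinge of a linear function, MAP inference over \([0,1]^n\) is a convex problem, which is what the unified inference objective described above exploits.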
Improved Search in Hamming Space using Deep Multi-Index Hashing
Similarity-preserving hashing is a widely-used method for nearest neighbour
search in large-scale image retrieval tasks. There has been considerable
research on generating efficient image representations via deep-network-based hashing methods. However, the issue of efficient searching
in the deep representation space remains largely unsolved. To this end, we
propose a simple yet efficient deep-network-based multi-index hashing method that simultaneously learns a powerful image representation and supports efficient search. To achieve these two goals, we introduce the multi-index hashing
(MIH) mechanism into the proposed deep architecture, which divides the binary
codes into multiple substrings. Because non-uniformly distributed codes result in inefficient search, we add two balance constraints at the feature level and the instance level, respectively. Extensive evaluations on
several benchmark image retrieval datasets show that the learned balanced
binary codes bring dramatic speedups and achieve performance comparable to the existing baselines.
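For intuition, here is a minimal sketch of the classical multi-index hashing lookup that the proposed architecture builds on, with binary codes stored as bit tuples. The data layout and radius enumeration are simplified for brevity and are assumptions of this sketch, not the paper's implementation.

from collections import defaultdict
from itertools import combinations

def split(code, m):
    # Split a bit-tuple into m equal-length substrings.
    k = len(code) // m
    return [code[i*k:(i+1)*k] for i in range(m)]

def build_index(codes, m):
    # One hash table per substring position.
    tables = [defaultdict(list) for _ in range(m)]
    for idx, code in enumerate(codes):
        for table, sub in zip(tables, split(code, m)):
            table[sub].append(idx)
    return tables

def neighbors_within(sub, radius):
    # All substrings within Hamming distance `radius` of `sub`.
    out = []
    for d in range(radius + 1):
        for flips in combinations(range(len(sub)), d):
            s = list(sub)
            for i in flips:
                s[i] ^= 1
            out.append(tuple(s))
    return out

def query(tables, codes, q, m, r):
    # Pigeonhole: any code within Hamming radius r of q matches at least
    # one of q's m substrings within radius r // m, so probing only those
    # buckets cannot miss a true neighbor.
    cand = set()
    for table, sub in zip(tables, split(q, m)):
        for s in neighbors_within(sub, r // m):
            cand.update(table.get(s, []))
    # Exact filter on the candidate set.
    return [i for i in cand if sum(a != b for a, b in zip(codes[i], q)) <= r]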
A Survey on Learning to Hash
Nearest neighbor search is the problem of finding the data points in a database whose distances to the query point are the smallest.
Learning to hash is one of the major solutions to this problem and has been
widely studied recently. In this paper, we present a comprehensive survey of
the learning to hash algorithms and categorize them by how they preserve similarity into pairwise similarity preserving, multiwise similarity preserving, implicit similarity preserving, and quantization, and we discuss their relations. We treat quantization separately from pairwise similarity preserving because its objective function is very different, although quantization, as we show, can be derived from preserving pairwise similarities. In addition, we present the evaluation protocols and a general performance analysis, and point out that the quantization algorithms are superior in terms of
search accuracy, search time cost, and space cost. Finally, we introduce a few
emerging topics.
Comment: To appear in IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)
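As one representative instance of the pairwise similarity preserving category (a spectral-hashing-style objective; this is a common form chosen here for illustration, not the survey's unified notation):

\[
\min_{\mathbf{h}_1, \ldots, \mathbf{h}_n \in \{-1, 1\}^b} \;\sum_{i,j} S_{ij}\,\lVert \mathbf{h}_i - \mathbf{h}_j \rVert^2
\quad \text{s.t.} \quad \sum_i \mathbf{h}_i = \mathbf{0}, \qquad \frac{1}{n}\sum_i \mathbf{h}_i \mathbf{h}_i^{\top} = \mathbf{I},
\]

where \(S_{ij}\) is the similarity of items \(i\) and \(j\) in the input space, and the constraints encourage balanced, uncorrelated bits.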
The Sloan Digital Sky Survey and its Archive
The next-generation astronomy archives will cover most of the universe at
fine resolution in many wavelengths. One of the first of these projects, the Sloan Digital Sky Survey (SDSS), will create a 5-wavelength catalog over 10,000
square degrees of the sky. The 200 million objects in the multi-terabyte
database will have mostly numerical attributes, defining a space of 100+
dimensions. Points in this space have highly correlated distributions. The
archive will enable astronomers to explore the data interactively. Data access
will be aided by multidimensional spatial indices. The data will be partitioned
in many ways. Small tag objects consisting of the most popular attributes speed
up frequent searches. Splitting the data among multiple servers enables
parallel, scalable I/O. Hashing techniques allow efficient clustering and
pairwise comparison algorithms. Randomly sampled subsets allow debugging
otherwise large queries at the desktop. Central servers will operate a data
pump that supports sweeping searches that touch most of the data.
Comment: 10 pages, ADASS '99 conference
Joint learning of interpretation and distillation
The extra trust brought by model interpretation has made it an indispensable part of machine learning systems. But to explain a distilled model's prediction, one may either work with the student model itself or turn to its teacher model. This leads to a more fundamental question: should a distilled model give a similar prediction, for a similar reason, as its teacher model on the same input? This question becomes even more crucial when the two models have dramatically different structures, as with GBDT2NN for example. This paper conducts an empirical study on a new approach to explaining each prediction of GBDT2NN, and on how imitating the explanation can further improve
the distillation process as an auxiliary learning task. Experiments on several
benchmarks show that the proposed methods achieve better performance on both
explanations and predictions.
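A minimal PyTorch sketch of the idea of imitating explanations as an auxiliary distillation task. The choice of explanation (input gradients), the loss forms, and the weight lam are illustrative assumptions of this sketch, not the paper's method.

import torch
import torch.nn.functional as F

def input_gradient(model, x, create_graph=False):
    # One simple notion of "explanation": the gradient of the output
    # with respect to the input (a saliency map).
    x = x.detach().clone().requires_grad_(True)
    (g,) = torch.autograd.grad(model(x).sum(), x, create_graph=create_graph)
    return g

def joint_loss(student, teacher, x, lam=0.1):
    # Standard distillation term: match the teacher's predictions.
    pred_loss = F.mse_loss(student(x), teacher(x).detach())
    # Auxiliary term: imitate the teacher's explanation as well.
    expl_s = input_gradient(student, x, create_graph=True)
    expl_t = input_gradient(teacher, x).detach()
    return pred_loss + lam * F.mse_loss(expl_s, expl_t)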
Q-STAR: A Perceptual Video Quality Model Considering Impact of Spatial, Temporal, and Amplitude Resolutions
In this paper, we investigate the impact of spatial, temporal and amplitude
resolution (STAR) on the perceptual quality of a compressed video. Subjective
quality tests were carried out on a mobile device. Seven source sequences are
included in the tests, and for each source sequence we have 27 test configurations generated by the JSVM encoder (3 QP levels, 3 spatial resolutions, and 3 temporal resolutions), resulting in a total of 189 processed video sequences
(PVSs). Videos coded at different spatial resolutions are displayed at the full
screen size of the mobile platform. Subjective data reveal that the impact of
spatial resolution (SR), temporal resolution (TR) and quantization stepsize
(QS) can each be captured by a function with a single content-dependent
parameter. The joint impact of SR, TR and QS can be accurately modeled by the
product of these three functions with only three parameters. We further find that the quality decay rates with SR and with QS are each independent of TR, and likewise the decay rate with TR is independent of SR and of QS. However, there is a significant interaction between the effects
of SR and QS. The overall quality model is further validated on five other
datasets with very high accuracy. The complete model correlates well with the
subjective ratings with a Pearson Correlation Coefficient (PCC) of 0.991.
Comment: 13 pages
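In symbols (notation ours, following the abstract's description), the model takes the product form

\[
Q(s, t, q) \;=\; Q_{\max}\; Q_S(s)\, Q_T(t)\, Q_Q(q),
\]

where \(Q_S\), \(Q_T\), and \(Q_Q\) are the normalized quality factors for spatial resolution, temporal resolution, and quantization stepsize, each captured by a single content-dependent parameter and equal to 1 at the highest SR and TR and the smallest QS.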
Scalable Similarity Learning using Large Margin Neighborhood Embedding
Classifying large-scale image data into object categories is an important
problem that has received increasing research attention. Given the huge amount
of data, non-parametric approaches such as nearest neighbor classifiers have
shown promising results, especially when they are underpinned by a learned
distance or similarity measure. Although metric learning has been well
studied in the past decades, most existing algorithms are impractical to handle
large-scale data sets. In this paper, we present an image similarity learning
method that can scale well in both the number of images and the dimensionality
of image descriptors. To this end, similarity comparison is restricted to each
sample's local neighbors and a discriminative similarity measure is induced
from large margin neighborhood embedding. We also exploit the ensemble of
projections so that high-dimensional features can be processed in a set of
lower-dimensional subspaces in parallel without much performance compromise.
The similarity function is learned online using a stochastic gradient descent
algorithm in which the triplet sampling strategy is customized for quick
convergence of classification performance. The effectiveness of our proposed
model is validated on several data sets with scales varying from tens of
thousands to one million images. Recognition accuracies competitive with the
state-of-the-art performance are achieved with much higher efficiency and
scalability.
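A minimal NumPy sketch of the online large-margin triplet update described above. The bilinear similarity s(x, y) = (Wx)·(Wy), the margin, and the learning rate are illustrative assumptions; the paper's neighborhood-restricted triplet sampling and projection ensemble are omitted for brevity.

import numpy as np

def sgd_triplet_step(W, a, p, n, margin=1.0, lr=1e-3):
    # One SGD step on the hinge loss max(0, margin - s(a, p) + s(a, n)),
    # where a is the anchor, p a same-class local neighbor, n an impostor.
    za, zp, zn = W @ a, W @ p, W @ n
    loss = margin - za @ zp + za @ zn
    if loss > 0:
        # Gradient of s(x, y) = (Wx)^T (Wy) w.r.t. W is (Wy) x^T + (Wx) y^T.
        grad = (np.outer(zn, a) + np.outer(za, n)) \
             - (np.outer(zp, a) + np.outer(za, p))
        W = W - lr * grad
    return W, max(float(loss), 0.0)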
Image Provenance Analysis at Scale
Prior art has shown it is possible to estimate, through image processing and
computer vision techniques, the types and parameters of transformations that
have been applied to the content of individual images to obtain new images.
Given a large corpus of images and a query image, an interesting further step
is to retrieve the set of original images whose content is present in the query
image, as well as the detailed sequences of transformations that yield the
query image given the original images. This problem has recently been given the name of image provenance analysis. In these times of public media
manipulation (e.g., fake news and meme sharing), obtaining the history of
image transformations is relevant for fact checking and authorship
verification, among many other applications. This article presents an
end-to-end processing pipeline for image provenance analysis, which works at
real-world scale. It employs a cutting-edge image filtering solution that is
custom-tailored for the problem at hand, as well as novel techniques for
obtaining the provenance graph that expresses how the images, as nodes, are
ancestrally connected. A comprehensive set of experiments for each stage of the
pipeline is provided, comparing the proposed solution with state-of-the-art
results, employing previously published datasets. In addition, this work
introduces a new dataset of real-world provenance cases from the social media
site Reddit, along with baseline results.
Comment: 13 pages, 6 figures
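As a rough sketch of the graph-building stage, the snippet below connects the filtered images into a provenance tree via a Kruskal-style maximum spanning tree over pairwise similarity. This is a common baseline formulation assumed here for illustration; the paper's actual edge construction and graph refinement are more elaborate.

import numpy as np

def provenance_tree(sim):
    # Maximum spanning tree over an n x n symmetric similarity matrix,
    # returned as a list of undirected edges (i, j): images are nodes,
    # and strong pairwise similarity suggests an ancestral link.
    n = sim.shape[0]
    parent = list(range(n))

    def find(i):
        # Union-find with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted(((sim[i, j], i, j)
                    for i in range(n) for j in range(i + 1, n)),
                   reverse=True)  # strongest links first
    tree = []
    for _, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:  # adding this edge creates no cycle
            parent[ri] = rj
            tree.append((i, j))
    return tree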