    A Comprehensive Survey on Cross-modal Retrieval

    In recent years, cross-modal retrieval has drawn much attention due to the rapid growth of multimodal data. It takes one type of data as the query to retrieve relevant data of another type. For example, a user can use a text query to retrieve relevant pictures or videos. Since the query and its retrieved results can be of different modalities, measuring the content similarity between different modalities of data remains a challenge. Various methods have been proposed to deal with this problem. In this paper, we first review a number of representative methods for cross-modal retrieval and classify them into two main groups: 1) real-valued representation learning, and 2) binary representation learning. Real-valued representation learning methods aim to learn real-valued common representations for different modalities of data. To speed up cross-modal retrieval, a number of binary representation learning methods have been proposed to map different modalities of data into a common Hamming space. Then, we introduce several multimodal datasets in the community, and show the experimental results on two commonly used multimodal datasets. The comparison reveals the characteristics of different kinds of cross-modal retrieval methods, which is expected to benefit both practical applications and future research. Finally, we discuss open problems and future research directions. Comment: 20 pages, 11 figures, 9 tables

    Tracking Large-Scale Video Remix in Real-World Events

    Social information networks, such as YouTube, contain traces of both explicit online interactions (such as "liking" a video, leaving a comment, or subscribing to a video feed) and latent interactions (such as quoting or remixing parts of a video). We propose visual memes, or frequently re-posted short video segments, for tracking such latent video interactions at scale. Visual memes are extracted with high accuracy by scalable detection algorithms that we develop. We further augment visual memes with text, via a statistical model of latent topics. We model content interactions on YouTube with visual memes, defining several measures of influence and building predictive models for meme popularity. Experiments are carried out on over 2 million video shots from more than 40,000 videos on two prominent news events in 2009: the election in Iran and the swine flu epidemic. In these two events, a high percentage of videos contain remixed content, and it is apparent that traditional news media and citizen journalists have different roles in disseminating remixed content. We perform two quantitative evaluations for annotating visual memes and predicting their popularity. The joint statistical model of visual memes and words outperforms a concurrence model, and the average error is ~2% for predicting meme volume and ~17% for their lifespan. Comment: 11 pages, accepted for journal publication

    Hinge-Loss Markov Random Fields and Probabilistic Soft Logic

    A fundamental challenge in developing high-impact machine learning technologies is balancing the need to model rich, structured domains with the ability to scale to big data. Many important problem areas are both richly structured and large scale, from social and biological networks, to knowledge graphs and the Web, to images, video, and natural language. In this paper, we introduce two new formalisms for modeling structured data, and show that they can both capture rich structure and scale to big data. The first, hinge-loss Markov random fields (HL-MRFs), is a new kind of probabilistic graphical model that generalizes different approaches to convex inference. We unite three approaches from the randomized algorithms, probabilistic graphical models, and fuzzy logic communities, showing that all three lead to the same inference objective. We then define HL-MRFs by generalizing this unified objective. The second new formalism, probabilistic soft logic (PSL), is a probabilistic programming language that makes HL-MRFs easy to define using a syntax based on first-order logic. We introduce an algorithm for inferring most-probable variable assignments (MAP inference) that is much more scalable than general-purpose convex optimization methods, because it uses message passing to take advantage of sparse dependency structures. We then show how to learn the parameters of HL-MRFs. The learned HL-MRFs are as accurate as analogous discrete models, but much more scalable. Together, these algorithms enable HL-MRFs and PSL to model rich, structured data at scales not previously possible.

    Improved Search in Hamming Space using Deep Multi-Index Hashing

    Similarity-preserving hashing is a widely used method for nearest neighbour search in large-scale image retrieval tasks. There has been considerable research on generating efficient image representations via deep-network-based hashing methods. However, the issue of efficient searching in the deep representation space remains largely unsolved. To this end, we propose a simple yet efficient deep-network-based multi-index hashing method that simultaneously learns a powerful image representation and supports efficient searching. To achieve these two goals, we introduce the multi-index hashing (MIH) mechanism into the proposed deep architecture, which divides the binary codes into multiple substrings. Because non-uniformly distributed codes result in inefficient searching, we add two balance constraints, at the feature level and the instance level, respectively. Extensive evaluations on several benchmark image retrieval datasets show that the learned balanced binary codes bring dramatic speedups and achieve performance comparable to the existing baselines.
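
The substring mechanism mentioned in this abstract can be illustrated with a minimal sketch of classic multi-index hashing over fixed binary codes. This is not the paper's deep architecture: the codes are assumed to be given as bit strings, the names `split_code`, `hamming`, and `MultiIndex` are hypothetical, and the sketch only supports a search radius smaller than the number of substrings. In that case, a code within the radius must match the query exactly in at least one substring (pigeonhole principle), so exact-match lookups suffice to collect candidates.

```python
from collections import defaultdict


def split_code(code: str, m: int) -> list[str]:
    """Split a bit string into m contiguous substrings of near-equal length."""
    n = len(code)
    bounds = [i * n // m for i in range(m + 1)]
    return [code[bounds[i]:bounds[i + 1]] for i in range(m)]


def hamming(a: str, b: str) -> int:
    """Hamming distance between two equal-length bit strings."""
    return sum(x != y for x, y in zip(a, b))


class MultiIndex:
    """Hypothetical helper: one hash table per substring position."""

    def __init__(self, codes: list[str], m: int) -> None:
        self.codes = list(codes)
        self.m = m
        self.tables = [defaultdict(list) for _ in range(m)]
        for idx, code in enumerate(self.codes):
            for table, chunk in zip(self.tables, split_code(code, m)):
                table[chunk].append(idx)

    def search(self, query: str, radius: int) -> list[int]:
        """Return indices of all codes within `radius` of `query`."""
        if radius >= self.m:
            # Larger radii would need neighbouring-bucket probes; omitted here.
            raise ValueError("this sketch only supports radius < m")
        candidates: set[int] = set()
        for table, chunk in zip(self.tables, split_code(query, self.m)):
            candidates.update(table.get(chunk, ()))
        # Verify candidates against the full code to discard false positives.
        return sorted(
            i for i in candidates if hamming(self.codes[i], query) <= radius
        )


codes = ["00000000", "00000001", "11110000", "11111111"]
index = MultiIndex(codes, m=4)
neighbours = index.search("00000000", radius=1)
```

Buckets stay small only when codes are spread evenly across substring values, which is presumably why the abstract stresses balanced codes; skewed codes would inflate the candidate sets that have to be verified.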

    A Survey on Learning to Hash

    Nearest neighbor search is the problem of finding the data points in a database whose distances to the query point are the smallest. Learning to hash is one of the major solutions to this problem and has been widely studied recently. In this paper, we present a comprehensive survey of learning to hash algorithms, categorize them according to how they preserve similarities into pairwise similarity preserving, multiwise similarity preserving, implicit similarity preserving, and quantization, and discuss their relations. We separate quantization from pairwise similarity preserving because the objective function is very different, though quantization, as we show, can be derived from preserving the pairwise similarities. In addition, we present the evaluation protocols and a general performance analysis, and point out that the quantization algorithms are superior in terms of search accuracy, search time cost, and space cost. Finally, we introduce a few emerging topics. Comment: To appear in IEEE Transactions On Pattern Analysis and Machine Intelligence (TPAMI)

    The Sloan Digital Sky Survey and its Archive

    The next-generation astronomy archives will cover most of the universe at fine resolution in many wavelengths. One of the first of these projects, the Sloan Digital Sky Survey (SDSS), will create a 5-wavelength catalog over 10,000 square degrees of the sky. The 200 million objects in the multi-terabyte database will have mostly numerical attributes, defining a space of 100+ dimensions. Points in this space have highly correlated distributions. The archive will enable astronomers to explore the data interactively. Data access will be aided by multidimensional spatial indices. The data will be partitioned in many ways. Small tag objects consisting of the most popular attributes speed up frequent searches. Splitting the data among multiple servers enables parallel, scalable I/O. Hashing techniques allow efficient clustering and pairwise comparison algorithms. Randomly sampled subsets allow debugging otherwise large queries at the desktop. Central servers will operate a data pump that supports sweeping searches that touch most of the data. Comment: 10 pages, ADASS '99 conference

    Joint learning of interpretation and distillation

    The extra trust brought by model interpretation has made it an indispensable part of machine learning systems. But to explain a distilled model's prediction, one may either work with the student model itself or turn to its teacher model. This leads to a more fundamental question: should a distilled model give a similar prediction, for a similar reason, as its teacher model on the same input? This question becomes even more crucial when the two models have dramatically different structures, as in GBDT2NN. This paper conducts an empirical study on a new approach to explaining each prediction of GBDT2NN, and on how imitating the explanation can further improve the distillation process as an auxiliary learning task. Experiments on several benchmarks show that the proposed methods achieve better performance on both explanations and predictions.

    Q-STAR: A Perceptual Video Quality Model Considering Impact of Spatial, Temporal, and Amplitude Resolutions

    In this paper, we investigate the impact of spatial, temporal and amplitude resolution (STAR) on the perceptual quality of a compressed video. Subjective quality tests were carried out on a mobile device. Seven source sequences are included in the tests, and for each source sequence we have 27 test configurations generated by the JSVM encoder (3 QP levels, 3 spatial resolutions, and 3 temporal resolutions), resulting in a total of 189 processed video sequences (PVSs). Videos coded at different spatial resolutions are displayed at the full screen size of the mobile platform. Subjective data reveal that the impact of spatial resolution (SR), temporal resolution (TR) and quantization stepsize (QS) can each be captured by a function with a single content-dependent parameter. The joint impact of SR, TR and QS can be accurately modeled by the product of these three functions with only three parameters. We further find that the quality decay rates with SR and QS, respectively, are independent of TR, and likewise, the decay rate with TR is independent of SR and QS, respectively. However, there is a significant interaction between the effects of SR and QS. The overall quality model is further validated on five other datasets with very high accuracy. The complete model correlates well with the subjective ratings, with a Pearson Correlation Coefficient (PCC) of 0.991. Comment: 13 pages
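
The test-set size quoted in this abstract follows from a full factorial design, which the short sketch below enumerates as a sanity check. The level labels are hypothetical placeholders: the abstract gives only the number of levels per factor, not the actual QP values, frame sizes, or frame rates.

```python
from itertools import product

# Placeholder labels; the abstract states only that each factor has 3 levels.
qp_levels = ["QP_1", "QP_2", "QP_3"]
spatial_resolutions = ["SR_1", "SR_2", "SR_3"]
temporal_resolutions = ["TR_1", "TR_2", "TR_3"]

# Every combination of one QP level, one spatial and one temporal resolution.
configurations = list(
    product(qp_levels, spatial_resolutions, temporal_resolutions)
)

num_source_sequences = 7
num_processed_sequences = num_source_sequences * len(configurations)

print(len(configurations), num_processed_sequences)
```

Three levels for each of three factors give 3 x 3 x 3 = 27 configurations, and 7 source sequences times 27 configurations give the 189 processed video sequences reported.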

    Scalable Similarity Learning using Large Margin Neighborhood Embedding

    Classifying large-scale image data into object categories is an important problem that has received increasing research attention. Given the huge amount of data, non-parametric approaches such as nearest neighbor classifiers have shown promising results, especially when they are underpinned by a learned distance or similarity measurement. Although metric learning has been well studied in the past decades, most existing algorithms are impractical for large-scale data sets. In this paper, we present an image similarity learning method that scales well in both the number of images and the dimensionality of image descriptors. To this end, similarity comparison is restricted to each sample's local neighbors and a discriminative similarity measure is induced from large margin neighborhood embedding. We also exploit an ensemble of projections so that high-dimensional features can be processed in a set of lower-dimensional subspaces in parallel without much performance compromise. The similarity function is learned online using a stochastic gradient descent algorithm in which the triplet sampling strategy is customized for quick convergence of classification performance. The effectiveness of our proposed model is validated on several data sets with scales varying from tens of thousands to one million images. Recognition accuracies competitive with the state of the art are achieved with much higher efficiency and scalability.
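
To make the idea of online triplet-based similarity learning concrete, here is a minimal sketch of one stochastic gradient step on a triplet hinge loss. It is a generic stand-in rather than the method of the paper: the bilinear similarity x^T W y, the margin of 1, the learning rate, and the function names are all assumptions, and the paper's neighborhood embedding, projection ensemble, and customized triplet sampling are not modelled.

```python
def similarity(W, x, y):
    """Bilinear similarity x^T W y (an assumed form, not the paper's)."""
    return sum(
        x[i] * W[i][j] * y[j] for i in range(len(x)) for j in range(len(y))
    )


def triplet_sgd_step(W, anchor, positive, negative, lr=0.1, margin=1.0):
    """Apply one in-place SGD update to W and return the pre-update loss.

    The hinge loss is max(0, margin - s(anchor, positive) + s(anchor, negative)).
    When it is active, its gradient with respect to W is
    outer(anchor, negative - positive), so the update adds
    lr * outer(anchor, positive - negative).
    """
    loss = max(
        0.0,
        margin
        - similarity(W, anchor, positive)
        + similarity(W, anchor, negative),
    )
    if loss > 0.0:
        for i, a in enumerate(anchor):
            for j in range(len(positive)):
                W[i][j] += lr * a * (positive[j] - negative[j])
    return loss


# Toy triplet: the anchor should end up closer to `positive` than `negative`.
W = [[0.0, 0.0], [0.0, 0.0]]
anchor, positive, negative = [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]
first_loss = triplet_sgd_step(W, anchor, positive, negative)
second_loss = triplet_sgd_step(W, anchor, positive, negative)
```

Each step touches only one triplet, which is what makes this style of update usable online; which triplets are drawn then largely determines how quickly the loss falls, the aspect the abstract says is customized.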

    Image Provenance Analysis at Scale

    Prior art has shown it is possible to estimate, through image processing and computer vision techniques, the types and parameters of transformations that have been applied to the content of individual images to obtain new images. Given a large corpus of images and a query image, an interesting further step is to retrieve the set of original images whose content is present in the query image, as well as the detailed sequences of transformations that yield the query image given the original images. This problem has recently received the name of image provenance analysis. In these times of public media manipulation (e.g., fake news and meme sharing), obtaining the history of image transformations is relevant for fact checking and authorship verification, among many other applications. This article presents an end-to-end processing pipeline for image provenance analysis, which works at real-world scale. It employs a cutting-edge image filtering solution that is custom-tailored for the problem at hand, as well as novel techniques for obtaining the provenance graph that expresses how the images, as nodes, are ancestrally connected. A comprehensive set of experiments for each stage of the pipeline is provided, comparing the proposed solution with state-of-the-art results, employing previously published datasets. In addition, this work introduces a new dataset of real-world provenance cases from the social media site Reddit, along with baseline results. Comment: 13 pages, 6 figures