The Power of Asymmetry in Binary Hashing
When approximating binary similarity using the Hamming distance between short
binary hashes, we show that even if the similarity is symmetric, we can have
shorter and more accurate hashes by using two distinct code maps, i.e. by
approximating the similarity between x and x' as the Hamming distance
between f(x) and g(x'), for two distinct binary codes f, g, rather than as
the Hamming distance between f(x) and f(x').
Comment: Accepted to NIPS 2013, 9 pages, 5 figures
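The two-map idea above can be sketched in a few lines. This is a minimal illustration, not the paper's learned codes: the maps `f` and `g` below are arbitrary random-hyperplane sign codes, chosen only to show how an asymmetric Hamming distance is computed with two distinct code maps.

```python
# Sketch: approximate similarity via the Hamming distance between two
# *distinct* binary code maps f and g, versus a single shared map.
# The random-projection maps here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 16))      # toy database vectors

Wf = rng.normal(size=(16, 4))     # projection for the database-side map f
Wg = rng.normal(size=(16, 4))     # distinct projection for the query-side map g

def f(x):
    """Database-side binary code map."""
    return (x @ Wf > 0).astype(np.uint8)

def g(x):
    """Query-side binary code map, distinct from f."""
    return (x @ Wg > 0).astype(np.uint8)

def hamming(a, b):
    return int(np.count_nonzero(a != b))

x, xp = X[0], X[1]
d_sym = hamming(f(x), f(xp))      # symmetric: same map on both sides
d_asym = hamming(f(x), g(xp))     # asymmetric: two distinct maps
print(d_sym, d_asym)
```

Both distances live in the same 4-bit range; the paper's point is that jointly learning f and g gives a better similarity approximation per bit than any single shared map.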
Anti-sparse coding for approximate nearest neighbor search
This paper proposes a binarization scheme for vectors of high dimension based
on the recent concept of anti-sparse coding, and shows its excellent
performance for approximate nearest neighbor search. Unlike other binarization
schemes, this framework allows, up to a scaling factor, the explicit
reconstruction from the binary representation of the original vector. The paper
also shows that random projections which are used in Locality Sensitive Hashing
algorithms, are significantly outperformed by regular frames for both synthetic
and real data if the number of bits exceeds the vector dimensionality, i.e.,
when high precision is required.
Comment: Submitted to ICASSP'2012; RR-7771 (2011)
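The explicit-reconstruction property can be illustrated with a simpler stand-in. The paper's anti-sparse codes come from an l∞-penalized optimization over a regular frame; the sketch below only demonstrates the surrounding mechanism, using a random (not regular) frame and a plain pseudo-inverse decoder, both assumptions for illustration.

```python
# Sketch: binarize a vector by signs of overcomplete frame projections,
# then reconstruct it (up to scale) linearly from the bits alone.
# Random frame + pseudo-inverse decoder are simplifying assumptions;
# the paper uses regular frames and anti-sparse (l_infinity) coding.
import numpy as np

rng = np.random.default_rng(1)
d, m = 8, 32                       # m > d: more bits than dimensions
A = rng.normal(size=(m, d))        # overcomplete frame

x = rng.normal(size=d)
b = np.sign(A @ x)                 # binary representation of x

x_hat = np.linalg.pinv(A) @ b      # explicit reconstruction from the bits
# Reconstruction is only defined up to a scaling factor, so compare directions.
cos = (x @ x_hat) / (np.linalg.norm(x) * np.linalg.norm(x_hat))
print(round(cos, 3))
```

With m well above d, the direction of x survives sign quantization; this is the regime (bits exceeding dimensionality) where the abstract reports frames beating LSH-style random projections.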
Exploiting multimedia in creating and analysing multimedia Web archives
The data contained on the web and the social web are inherently multimedia and consist of a mixture of textual, visual and audio modalities. Community memories embodied on the web and social web contain a rich mixture of data from these modalities. In many ways, the web is the greatest resource ever created by humankind. However, due to the dynamic and distributed nature of the web, its content changes, appears and disappears on a daily basis. Web archiving provides a way of capturing snapshots of (parts of) the web for preservation and future analysis. This paper provides an overview of techniques we have developed within the context of the EU-funded ARCOMEM (ARchiving COmmunity MEMories) project to allow multimedia web content to be leveraged during the archival process and for post-archival analysis. Through a set of use cases, we explore several practical applications of multimedia analytics within the realm of web archiving, web archive analysis and multimedia data on the web in general.
SADIH: Semantic-Aware DIscrete Hashing
Due to its low storage cost and fast query speed, hashing has been recognized
to accomplish similarity search in large-scale multimedia retrieval
applications. Particularly supervised hashing has recently received
considerable research attention by leveraging the label information to preserve
the pairwise similarities of data points in the Hamming space. However, there
still remain two crucial bottlenecks: 1) the learning process of the full
pairwise similarity preservation is computationally unaffordable and unscalable
to deal with big data; 2) the available category information of data are not
well-explored to learn discriminative hash functions. To overcome these
challenges, we propose a unified Semantic-Aware DIscrete Hashing (SADIH)
framework, which aims to directly embed the transformed semantic information
into the asymmetric similarity approximation and discriminative hashing
function learning. Specifically, a semantic-aware latent embedding is
introduced to asymmetrically preserve the full pairwise similarities while
skillfully handling the cumbersome n × n pairwise similarity matrix.
Meanwhile, a semantic-aware autoencoder is developed to jointly preserve the
data structures in the discriminative latent semantic space and perform data
reconstruction. Moreover, an efficient alternating optimization algorithm is
proposed to solve the resulting discrete optimization problem. Extensive
experimental results on multiple large-scale datasets demonstrate that our
SADIH can clearly outperform the state-of-the-art baselines with the additional
benefit of lower computational costs.
Comment: Accepted by the Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)
Memory vectors for similarity search in high-dimensional spaces
We study an indexing architecture to store and search in a database of
high-dimensional vectors from the perspective of statistical signal processing
and decision theory. This architecture is composed of several memory units,
each of which summarizes a fraction of the database by a single representative
vector. The potential similarity of the query to one of the vectors stored in
the memory unit is gauged by a simple correlation with the memory unit's
representative vector. This representative optimizes the test of the following
hypothesis: the query is independent of any vector in the memory unit vs. the
query is a simple perturbation of one of the stored vectors.
Compared to exhaustive search, our approach finds the most similar database
vectors significantly faster without a noticeable reduction in search quality.
Interestingly, the reduction of complexity is provably better in
high-dimensional spaces. We empirically demonstrate its practical interest in a
large-scale image search scenario with off-the-shelf state-of-the-art
descriptors.
Comment: Accepted to IEEE Transactions on Big Data
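The memory-unit test can be sketched with the simplest choice of representative. The paper derives optimized representatives (e.g. pseudo-inverse based); the plain sum vector below is only a baseline standing in for them, and the data is synthetic.

```python
# Sketch: a memory unit summarizes n stored vectors by one representative m;
# a query is screened against the whole unit by a single dot product.
# The sum-vector representative is a simplifying assumption (the paper
# also derives an optimized, pinv-based representative).
import numpy as np

rng = np.random.default_rng(3)
d, n = 512, 10
unit = rng.normal(size=(n, d))
unit /= np.linalg.norm(unit, axis=1, keepdims=True)   # unit-norm database vectors

m = unit.sum(axis=0)                          # representative of the memory unit

member = unit[3] + 0.05 * rng.normal(size=d)  # simple perturbation of a stored vector
outsider = rng.normal(size=d)
outsider /= np.linalg.norm(outsider)          # query independent of the unit

print(member @ m, outsider @ m)
```

The perturbed member correlates strongly with m (score near 1) while the independent query's score concentrates around 0 with variance n/d, which is why the complexity savings improve in high dimension.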
Asymmetric Hamming Embedding
This paper proposes an asymmetric Hamming Embedding scheme for large-scale image search based on local descriptors. The comparison of two descriptors relies on a vector-to-binary code comparison, which limits the quantization error associated with the query compared with the original Hamming Embedding method. The approach is used in combination with an inverted file structure that offers high efficiency, comparable to that of a regular bag-of-features retrieval system. The comparison is performed on two popular datasets. Our method consistently improves the search quality over the symmetric version. The trade-off between memory usage and precision is evaluated, showing that the method is especially useful for short binary signatures.
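The vector-to-binary comparison can be sketched as follows. The database side is binarized, but the query keeps its real-valued projections, so the query contributes no quantization error. The soft per-bit penalty used below (weighting disagreements by |z|) is an illustrative assumption, not the paper's exact scoring.

```python
# Sketch of asymmetric Hamming comparison: binary codes on the database
# side, real-valued projections on the query side. The |z|-weighted
# mismatch penalty is an assumption standing in for the paper's scheme.
import numpy as np

rng = np.random.default_rng(4)
d, bits, n = 32, 16, 100
W = rng.normal(size=(d, bits))          # projection directions
X = rng.normal(size=(n, d))
codes = (X @ W > 0)                     # binary database codes

q = X[7] + 0.05 * rng.normal(size=d)    # query near database item 7
z = q @ W                               # real-valued query projections (not quantized)

# Symmetric: quantize the query too, count disagreeing bits.
sym = np.count_nonzero(codes != (z > 0), axis=1)
# Asymmetric: penalize a disagreeing bit by the query's confidence |z|.
asym = np.where(codes == (z > 0), 0.0, np.abs(z)).sum(axis=1)
print(int(np.argmin(asym)))
```

Bits where the query projection sits near the binarization threshold are exactly those most likely to flip under noise, and the asymmetric score discounts them, which is where the quality gain over the symmetric version comes from.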
A fuzzy asymmetric TOPSIS model for optimizing investment in online advertising campaigns
The high penetration of the Internet and e-commerce in Spain during recent years has increased companies' interest in this medium for advertising planning. In this context, Google offers a great advertising inventory and perfectly segmented content pages. This work is concerned with the optimization of online advertising investments based on pay-per-click campaigns. Our main goal is to rank and select different alternative keyword sets aimed at maximizing the awareness of and traffic to a company's website. The keyword selection problem for online advertising purposes is clearly a multiple-criteria decision-making problem, additionally characterized by the imprecise, ambiguous and uncertain nature of the available data. To address this problem, we propose a Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS)-based approach, which allows us to rank the alternative keyword sets, taking into account the fuzzy nature of the available data. TOPSIS is based on the concept that the chosen alternative should have the shortest distance from the positive ideal solution and the longest distance from the negative ideal solution. In this work, due to the characteristics of the studied problem, we propose the use of an asymmetric distance, allowing us to work with ideal solutions that differ from the maximum or the minimum. The suitability of the proposed model is illustrated with an empirical case of a stock exchange broker's advertising investment problem aimed at generating awareness about the brand and increasing the traffic to the corporate website.
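The TOPSIS ranking mechanics can be sketched in crisp (non-fuzzy) form. The paper's contribution is the fuzzy criterion values and the asymmetric distance; the sketch below keeps the classical Euclidean distance and invented toy data purely to show the closeness-coefficient computation.

```python
# Crisp TOPSIS skeleton for ranking keyword sets. Toy data and the
# Euclidean distance are simplifying assumptions; the paper replaces them
# with fuzzy values and an asymmetric distance to non-extreme ideals.
import numpy as np

# Rows: alternative keyword sets; columns: criteria
# (e.g. expected clicks, cost-per-click, ad quality) - invented numbers.
D = np.array([[120., 0.8, 6.],
              [200., 1.1, 7.],
              [ 90., 0.5, 5.]])
benefit = np.array([True, False, True])   # CPC is a cost criterion
w = np.array([0.5, 0.3, 0.2])             # criterion weights

R = D / np.linalg.norm(D, axis=0)         # vector normalization
V = w * R                                 # weighted normalized matrix
ideal = np.where(benefit, V.max(axis=0), V.min(axis=0))   # positive ideal
anti  = np.where(benefit, V.min(axis=0), V.max(axis=0))   # negative ideal
d_pos = np.linalg.norm(V - ideal, axis=1)
d_neg = np.linalg.norm(V - anti, axis=1)
closeness = d_neg / (d_pos + d_neg)       # higher = better alternative
print(np.argsort(-closeness))             # ranking of the keyword sets
```

Replacing `V.max`/`V.min` with expert-chosen target profiles, and the Euclidean norm with an asymmetric distance, yields the variant the abstract describes, where the ideal need not be the per-criterion maximum or minimum.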
PROFICIENT PRODUCT QUANTIZATION FOR LARGE-SCALE HIGH DIMENSIONAL DATA USING APPROXIMATE NEAREST NEIGHBOR SEARCH
K-nearest neighbor (KNN) classification and regression is broadly used in data mining because of its simplicity and accuracy. When a prediction is required for an unseen data instance, the KNN algorithm searches the training dataset for the k most similar instances. Finding the value of k is application dependent, so a local value is set that maximizes the accuracy for the problem. Assigning a query to the majority class of its k neighbors is called K-nearest neighbors classification. In this paper, the instance to be classified is referred to as the query object. The global KNN approach uses the entire dataset to search for the k nearest neighbors of the query, whereas the local KNN approach uses test objects randomly selected from the training data space. In order to improve the accuracy of finding the correct k neighbors in local KNN, among the various approximate nearest neighbor (ANN) approaches proposed in recent years, those based on vector quantization stand out, achieving state-of-the-art results. Product quantization (PQ) decomposes vectors into subspaces for independent processing, allowing fast lookup-based distance estimation. This thesis work aims to reduce the complexity of additive quantization (AQ) by changing the single most expensive step in the process, that of vector encoding. Both the remarkable search performance and the high cost of AQ stem from its generality, so by imposing some novel external constraints it is possible to achieve a better trade-off: reduced complexity while retaining the accuracy advantage over other ANN methods.
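Product quantization itself can be sketched compactly: split each vector into subvectors, learn a small codebook per subspace, store only centroid indices, and estimate query-to-database distances from per-subspace lookup tables. The mini k-means and all parameters below are illustrative assumptions, not the thesis's configuration.

```python
# Minimal product quantization (PQ) sketch: per-subspace codebooks,
# index-only storage, and asymmetric distance computation (ADC) via
# lookup tables. Codebook training is a bare-bones k-means (assumption).
import numpy as np

rng = np.random.default_rng(5)
d, m, k, n = 16, 4, 8, 500          # dim, subspaces, centroids per subspace, items
X = rng.normal(size=(n, d))
subdim = d // m

def kmeans(data, k, iters=10):
    """Tiny k-means, sufficient for this sketch."""
    C = data[rng.choice(len(data), k, replace=False)]
    for _ in range(iters):
        assign = np.argmin(((data[:, None] - C) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(assign == j):
                C[j] = data[assign == j].mean(axis=0)
    return C

codebooks = [kmeans(X[:, s * subdim:(s + 1) * subdim], k) for s in range(m)]

def encode(x):
    """Store only m small centroid indices per vector."""
    return np.array([
        np.argmin(((x[s * subdim:(s + 1) * subdim] - codebooks[s]) ** 2).sum(-1))
        for s in range(m)])

codes = np.array([encode(x) for x in X])

def adc(q, codes):
    """Asymmetric distances: real query vs. coded database, via m lookup tables."""
    tables = [((q[s * subdim:(s + 1) * subdim] - codebooks[s]) ** 2).sum(-1)
              for s in range(m)]
    return np.array([sum(tables[s][c[s]] for s in range(m)) for c in codes])

q = X[42]
vals = adc(q, codes)
print(int(np.argmin(vals)))
```

Encoding here is cheap because subspaces are independent; in additive quantization the codebooks are not independent, which is exactly why its encoding step is the expensive one the thesis targets.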
Consistent visual words mining with adaptive sampling
State-of-the-art large-scale object retrieval systems usually combine efficient Bag-of-Words indexing models with a spatial verification re-ranking stage to improve query performance. In this paper we propose to directly discover spatially verified visual words as a batch process. Contrary to previous related methods based on feature-set hashing or clustering, we suggest not trading recall for efficiency by sticking to an accurate two-stage matching strategy. The problem then rather becomes a sampling issue: how to effectively and efficiently select relevant query regions while minimizing the number of tentative probes? We therefore introduce an adaptive weighted sampling scheme, starting with some prior distribution and iteratively converging to unvisited regions. Interestingly, the proposed paradigm is generalizable to any input prior distribution, including specific visual concept detectors or efficient hashing-based methods. We show in the experiments that the proposed method makes it possible to discover highly interpretable visual words while providing excellent recall and image representativeness.
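The adaptive sampling loop can be sketched abstractly. The decay rule, the neighborhood width, and the discretization into regions below are all assumptions for illustration; the paper's scheme operates on image regions with learned priors.

```python
# Sketch of adaptive weighted sampling: draw probe regions from a prior,
# then down-weight each probed region (and its neighbors) so later draws
# converge toward unvisited regions. Decay factor and neighborhood width
# are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(6)
n_regions = 50
weights = np.ones(n_regions)          # uniform prior over candidate regions
visited = []

for _ in range(20):
    p = weights / weights.sum()       # current sampling distribution
    r = int(rng.choice(n_regions, p=p))
    visited.append(r)                 # probe region r (matching would go here)
    # Down-weight the probed region and its immediate neighbors.
    lo, hi = max(0, r - 1), min(n_regions, r + 2)
    weights[lo:hi] *= 0.1

print(sorted(set(visited)))
```

Any prior can seed `weights` (e.g. scores from a concept detector or a hashing-based pre-filter), which is the generality the abstract points out; the down-weighting then steers probes away from already-covered regions regardless of the prior.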