
    Hashing for Similarity Search: A Survey

    Similarity search (nearest neighbor search) is the problem of finding, in a large database, the data items whose distances to a query item are the smallest. Various methods have been developed to address this problem, and recently a lot of effort has been devoted to approximate search. In this paper, we present a survey of one of the main solutions, hashing, which has been widely studied since the pioneering work on locality sensitive hashing. We divide the hashing algorithms into two main categories: locality sensitive hashing, which designs hash functions without exploring the data distribution, and learning to hash, which learns hash functions according to the data distribution. We review them from various aspects, including hash function design, distance measure, and search scheme in the hash coding space.
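
    As a concrete illustration of the first category, the sketch below (ours, not code from the survey) implements random-hyperplane hashing for cosine similarity, a classic locality sensitive hashing family defined without looking at the data distribution; the function names and parameters are chosen for the example.

```python
# Minimal sketch: random-hyperplane LSH for cosine similarity, one member of
# the data-independent "locality sensitive hashing" family the survey covers.
import numpy as np

rng = np.random.default_rng(0)

def make_hash_family(dim, n_bits):
    """Draw n_bits random hyperplanes; each contributes one bit of the code."""
    return rng.standard_normal((n_bits, dim))

def hash_vector(planes, x):
    """Binary code: sign of the projection onto each random hyperplane."""
    return (planes @ x >= 0).astype(np.uint8)

# Nearby vectors (small angle) agree on most bits; unrelated ones do not.
planes = make_hash_family(dim=128, n_bits=16)
q = rng.standard_normal(128)
near = q + 0.05 * rng.standard_normal(128)
far = rng.standard_normal(128)
print((hash_vector(planes, q) == hash_vector(planes, near)).sum(),
      (hash_vector(planes, q) == hash_vector(planes, far)).sum())
```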

    Lattice-based locality sensitive hashing is optimal

    Locality sensitive hashing (LSH) was introduced by Indyk and Motwani (STOC ‘98) to give the first sublinear time algorithm for the c-approximate nearest neighbor (ANN) problem using only polynomial space. At a high level, an LSH family hashes “nearby” points to the same bucket and “far away” points to different buckets. The quality measure of an LSH family is its LSH exponent, which helps determine both query time and space usage. In a seminal work, Andoni and Indyk (FOCS ‘06) constructed an LSH family based on random ball partitionings of space that achieves an LSH exponent of 1/c² for the ℓ₂ norm, which was later shown to be optimal by Motwani, Naor and Panigrahy (SIDMA ‘07) and O’Donnell, Wu and Zhou (TOCT ‘14). Although optimal in the LSH exponent, the ball partitioning approach is computationally expensive. So, in the same work, Andoni and Indyk proposed a simpler and more practical hashing scheme based on Euclidean lattices and provided computational results using the 24-dimensional Leech lattice. However, no theoretical analysis of the scheme was given, thus leaving open the question of finding the exponent of lattice-based LSH. In this work, we resolve this question by showing the existence of lattices achieving the optimal LSH exponent of 1/c² using techniques from the geometry of numbers. At a more conceptual level, our results show that optimal LSH space partitions can have periodic structure. Understanding the extent to which additional structure can be imposed on these partitions, e.g. to yield low space and query complexity, remains an important open problem.
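
    For intuition, the sketch below (ours, not the paper's construction) shows the classic p-stable LSH for the ℓ₂ norm, which rounds a random one-dimensional projection onto a grid, i.e. decodes to the integer lattice; the lattice-based scheme analyzed in the paper replaces this grid with a richer lattice such as the 24-dimensional Leech lattice. Names and parameters are illustrative.

```python
# Illustrative sketch: p-stable LSH for the l2 norm.  h(x) = floor((a.x + b)/w)
# rounds a random projection onto a 1-D grid, the simplest lattice; close
# points usually land in the same bucket.
import numpy as np

rng = np.random.default_rng(1)

def make_l2_hash(dim, w):
    """One hash function h(x) = floor((a.x + b) / w) with Gaussian vector a."""
    a = rng.standard_normal(dim)
    b = rng.uniform(0.0, w)
    return lambda x: int(np.floor((a @ x + b) / w))

h = make_l2_hash(dim=64, w=4.0)
x = rng.standard_normal(64)
print(h(x), h(x + 0.01 * rng.standard_normal(64)))  # nearby point, likely same bucket
```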

    Spreading vectors for similarity search

    Discretizing multi-dimensional data distributions is a fundamental step of modern indexing methods. State-of-the-art techniques learn parameters of quantizers on training data for optimal performance, thus adapting quantizers to the data. In this work, we propose to reverse this paradigm and adapt the data to the quantizer: we train a neural net whose last layer forms a fixed parameter-free quantizer, such as pre-defined points on a hyper-sphere. As a proxy objective, we design and train a neural network that favors uniformity in the spherical latent space, while preserving the neighborhood structure after the mapping. We propose a new regularizer derived from the Kozachenko--Leonenko differential entropy estimator to enforce uniformity and combine it with a locality-aware triplet loss. Experiments show that our end-to-end approach outperforms most learned quantization methods, and is competitive with the state of the art on widely adopted benchmarks. Furthermore, we show that training without the quantization step results in almost no difference in accuracy, but yields a generic catalyzer that can be applied with any subsequent quantizer. Published at ICLR 2019.
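
    As a hedged illustration of the uniformity regularizer described in the abstract (our reading, not the authors' code; the exact form used in the paper may differ), the sketch below computes a Kozachenko–Leonenko-style penalty that rewards large nearest-neighbor distances on the unit hyper-sphere.

```python
# Hedged sketch of a Kozachenko--Leonenko-style uniformity regularizer:
# points on the unit sphere are penalized when their nearest neighbor is
# close, pushing the batch toward a uniform spread.
import torch
import torch.nn.functional as F

def koleo_regularizer(z, eps=1e-8):
    """z: (n, d) batch of latent vectors; returns a scalar uniformity penalty."""
    z = F.normalize(z, dim=1)                      # map onto the unit hyper-sphere
    diff = z.unsqueeze(1) - z.unsqueeze(0)         # (n, n, d) pairwise differences
    d2 = diff.pow(2).sum(-1)                       # squared pairwise distances
    mask = torch.eye(z.shape[0], dtype=torch.bool, device=z.device)
    d2 = d2.masked_fill(mask, float("inf"))        # ignore self-distances
    nn_dist = d2.min(dim=1).values.clamp_min(eps).sqrt()
    return -torch.log(nn_dist).mean()              # small when points are well spread

z = torch.randn(256, 32, requires_grad=True)
loss = koleo_regularizer(z)                        # would be combined with a triplet loss
loss.backward()
```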

    Binary Adaptive Embeddings from Order Statistics of Random Projections

    We use some of the largest order statistics of the random projections of a reference signal to construct a binary embedding that is adapted to signals correlated with that reference. The embedding is characterized from the analytical standpoint and shown to provide improved performance on tasks such as classification in a reduced-dimensionality space.
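
    The sketch below gives one plausible reading of the construction described in the abstract (our assumptions, not the authors' exact scheme): keep the rows of a random projection matrix on which the reference signal has the largest-magnitude responses, and binarize other signals using only those rows.

```python
# Hedged sketch, not the authors' exact scheme: select the random projections
# where a reference signal responds most strongly, then use their signs as an
# adapted binary embedding for signals correlated with that reference.
import numpy as np

rng = np.random.default_rng(2)

def adapt_embedding(reference, n_projections, n_bits):
    """Keep the n_bits rows of a random matrix with the largest |A @ reference|."""
    A = rng.standard_normal((n_projections, reference.size))
    top = np.argsort(np.abs(A @ reference))[-n_bits:]
    return A[top]                                  # reduced, signal-adapted projections

def embed(A_adapted, x):
    """Binary embedding: signs of the selected projections."""
    return (A_adapted @ x >= 0).astype(np.uint8)

ref = rng.standard_normal(256)
A_sel = adapt_embedding(ref, n_projections=1024, n_bits=64)
correlated = 0.9 * ref + 0.1 * rng.standard_normal(256)
print((embed(A_sel, ref) == embed(A_sel, correlated)).mean())  # high bit agreement
```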

    THREE TEMPORAL PERSPECTIVES ON DECENTRALIZED LOCATION-AWARE COMPUTING: PAST, PRESENT, FUTURE

    During the past four decades, miniaturization has made computing devices ubiquitous and pervasive. Today, the number of objects connected to the Internet is increasing at a rapid pace and this trend does not seem to be slowing down. These objects, which can be smartphones, vehicles, or any kind of sensors, generate large amounts of data that are almost always associated with a spatio-temporal context. The amount of this data is often so large that its processing requires the creation of a distributed system, which involves the cooperation of several computers. The ability to process these data is important for society. For example, the data collected during car journeys already make it possible to avoid traffic jams or to organize carpooling. Another example: in the near future, the maintenance interventions to be carried out on the road network will be planned with data collected using gyroscopes that detect potholes. The application domains are therefore numerous, as are the problems associated with them. The articles that make up this thesis deal with systems that share two key characteristics: a spatio-temporal context and a decentralized architecture. In addition, the systems described in these articles revolve around three temporal perspectives: the present, the past, and the future. Systems associated with the present perspective enable a very large number of connected objects to communicate in near real-time, according to a spatial context. Our contributions in this area enable this type of decentralized system to be scaled out on commodity hardware, i.e., to adapt as the volume of data that arrives in the system increases. Systems associated with the past perspective, often referred to as trajectory indexes, provide access to the large volumes of spatio-temporal data collected by connected objects. Our contributions in this area make it possible to handle particularly dense trajectory datasets, a problem that had not been addressed previously. Finally, systems associated with the future perspective rely on past trajectories to predict the trajectories that connected objects will follow. Our contributions predict the trajectories followed by connected objects with a previously unmet granularity. Although involving different domains, these contributions are structured around the common denominators of the underlying systems, which opens the possibility of dealing with these problems more generically in the near future.

    Quantifying Object Similarity: Applying Locality Sensitive Hashing for Comparing Material Culture

    We present a novel technique that compares and quantifies images, used here to compare similarities between material cultures. The method is based on locality sensitive hashing (LSH), a relatively fast and flexible algorithm for comparing image data and determining their level of similarity. The technique is applied to a dataset of sculpture faces from the Aegean, Anatolia, Cyprus, Egypt, Iran, Indus/Gandhara, the Levant, and Mesopotamia. Results indicate that the objects can be differentiated based on regional differences and show similarities to other locations that share specific material culture traits. Images from known locations enable a network of compared objects to be constructed, where inverse closeness centrality and link weights are used to indicate areas that have greater or lesser cultural similarity to other regions. Different periods are assessed, and the results demonstrate that objects from earlier than the 9th century BCE show greater similarity to other local and Egyptian items. Objects from between the 9th and 4th centuries BCE increasingly show inter-regional similarity, with the eastern Mediterranean, including the Aegean, Anatolia, Egypt, and Cyprus, having close similarity to multiple regions. After the 4th century BCE, greater sculptural similarity is found across a wide area, including the Aegean, Cyprus, Egypt, Mesopotamia, and Gandhara. In general, sculptures from more distant areas increase in similarity in later periods, that is, starting from the 9th century BCE. The results demonstrate that the technique can be applied to quantifying object similarity and extended to a broad range of archaeological objects, while also being a tool for rapid analysis that requires minimal data compared to some machine learning techniques. The code and data are provided as part of the outputs.
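
    As a toy illustration of the network analysis described in the abstract (ours, not the paper's pipeline), the sketch below turns hypothetical pairwise similarity scores into a weighted graph with networkx and computes closeness centrality using inverse similarity as edge length; the scores stand in for the LSH-based image comparisons.

```python
# Toy sketch: build a weighted similarity network between regions and read
# off closeness centrality.  The similarity values are hypothetical stand-ins
# for the LSH-based image comparisons described in the abstract.
import networkx as nx

similarities = {                      # hypothetical similarity scores in [0, 1]
    ("Aegean", "Cyprus"): 0.82,
    ("Aegean", "Egypt"): 0.74,
    ("Cyprus", "Egypt"): 0.79,
    ("Mesopotamia", "Egypt"): 0.61,
    ("Mesopotamia", "Aegean"): 0.55,
}

G = nx.Graph()
for (a, b), sim in similarities.items():
    # Use 1/similarity as edge length so that similar regions are "close".
    G.add_edge(a, b, weight=sim, length=1.0 / sim)

closeness = nx.closeness_centrality(G, distance="length")
for region, score in sorted(closeness.items(), key=lambda kv: -kv[1]):
    print(f"{region}: {score:.3f}")
```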

    Large-scale Content-based Visual Information Retrieval

    Rather than restricting search to the use of metadata, content-based information retrieval methods attempt to index, search and browse digital objects by means of signatures or features describing their actual content. Such methods have been intensively studied in the multimedia community to allow managing the massive amount of raw multimedia documents created every day (e.g. video is expected to account for 84% of U.S. internet traffic by 2018). Recent years have consequently witnessed a consistent growth of content-aware and multi-modal search engines deployed on massive multimedia data. Popular multimedia search applications such as Google Images, YouTube, Shazam, TinEye or MusicID clearly demonstrate that the first generation of large-scale audio-visual search technologies is now mature enough to be deployed on real-world big data. All these successful applications greatly benefited from 15 years of research on multimedia analysis and efficient content-based indexing techniques. Yet the maturity reached by the first generation of content-based search engines does not preclude an intensive research activity in the field. There are actually still many hard problems to be solved before we can retrieve any information in images or sounds as easily as we do in text documents. Content-based search methods have to reach a finer understanding of the contents as well as a higher semantic level. This requires modeling the raw signals by more and more complex and numerous features, so that the algorithms for analyzing, indexing and searching such features have to evolve accordingly. This thesis describes several of my works related to large-scale content-based information retrieval. The different contributions are presented in a bottom-up fashion reflecting a typical three-tier software architecture of an end-to-end multimedia information retrieval system. The lowest layer is only concerned with managing, indexing and searching large sets of high-dimensional feature vectors, whatever their origin or role in the upper levels (visual or audio features, global or part-based descriptions, low or high semantic level, etc.). The middle layer works at the document level and is in charge of analyzing, indexing and searching collections of documents. It typically extracts and embeds the low-level features, implements the querying mechanisms and post-processes the results returned by the lower layer. The upper layer works at the applicative level and is in charge of providing useful and interactive functionalities to the end-user. It typically implements the front-end of the search application, the crawler and the orchestration of the different indexing and search services.
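
    As a generic stand-in for the lowest layer described above (not the thesis's actual indexing structures), the sketch below shows the kind of interface such a layer exposes: add high-dimensional feature vectors, then retrieve the k nearest ones to a query, here via brute-force search.

```python
# Generic stand-in for the lowest layer: a brute-force L2 index over
# high-dimensional feature vectors with an add/search interface.  Real
# systems replace the linear scan with approximate indexing structures.
import numpy as np

class FlatIndex:
    """Brute-force nearest-neighbor index over feature vectors."""

    def __init__(self, dim):
        self.dim = dim
        self.vectors = np.empty((0, dim), dtype=np.float32)

    def add(self, x):
        self.vectors = np.vstack([self.vectors, np.asarray(x, dtype=np.float32)])

    def search(self, query, k=5):
        d = np.linalg.norm(self.vectors - query, axis=1)   # distance to every item
        idx = np.argsort(d)[:k]                            # k closest items
        return idx, d[idx]

rng = np.random.default_rng(3)
index = FlatIndex(dim=128)
index.add(rng.standard_normal((10_000, 128)))
ids, dists = index.search(rng.standard_normal(128), k=5)
print(ids, dists)
```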