
    Hashing for Similarity Search: A Survey

    Similarity search (nearest neighbor search) is the problem of finding, in a large database, the data items whose distances to a query item are the smallest. Various methods have been developed to address this problem, and recently a lot of effort has been devoted to approximate search. In this paper, we present a survey of one of the main solutions, hashing, which has been widely studied since the pioneering work on locality sensitive hashing. We divide the hashing algorithms into two main categories: locality sensitive hashing, which designs hash functions without exploring the data distribution, and learning to hash, which learns hash functions according to the data distribution. We review them from various aspects, including hash function design, distance measure, and search scheme in the hash coding space.
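
    As a concrete illustration of the first category, the sketch below (ours, not code from the survey) implements random-hyperplane hashing for cosine similarity, a classic locality sensitive hashing family defined without looking at the data distribution; the function names and parameters are chosen for the example.

```python
# Minimal sketch: random-hyperplane LSH for cosine similarity, one member of
# the data-independent "locality sensitive hashing" family the survey covers.
import numpy as np

rng = np.random.default_rng(0)

def make_hash_family(dim, n_bits):
    """Draw n_bits random hyperplanes; each contributes one bit of the code."""
    return rng.standard_normal((n_bits, dim))

def hash_vector(planes, x):
    """Binary code: sign of the projection onto each random hyperplane."""
    return (planes @ x >= 0).astype(np.uint8)

# Nearby vectors (small angle) agree on most bits; unrelated ones do not.
planes = make_hash_family(dim=128, n_bits=16)
q = rng.standard_normal(128)
near = q + 0.05 * rng.standard_normal(128)
far = rng.standard_normal(128)
print((hash_vector(planes, q) == hash_vector(planes, near)).sum(),
      (hash_vector(planes, q) == hash_vector(planes, far)).sum())
```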

    Lattice-based locality sensitive hashing is optimal

    Locality sensitive hashing (LSH) was introduced by Indyk and Motwani (STOC ‘98) to give the first sublinear time algorithm for the c-approximate nearest neighbor (ANN) problem using only polynomial space. At a high level, an LSH family hashes “nearby” points to the same bucket and “far away” points to different buckets. The quality measure of an LSH family is its LSH exponent, which helps determine both query time and space usage. In a seminal work, Andoni and Indyk (FOCS ‘06) constructed an LSH family based on random ball partitionings of space that achieves an LSH exponent of 1/c² for the ℓ₂ norm, which was later shown to be optimal by Motwani, Naor and Panigrahy (SIDMA ‘07) and O’Donnell, Wu and Zhou (TOCT ‘14). Although optimal in the LSH exponent, the ball partitioning approach is computationally expensive. So, in the same work, Andoni and Indyk proposed a simpler and more practical hashing scheme based on Euclidean lattices and provided computational results using the 24-dimensional Leech lattice. However, no theoretical analysis of the scheme was given, thus leaving open the question of finding the exponent of lattice-based LSH. In this work, we resolve this question by showing the existence of lattices achieving the optimal LSH exponent of 1/c² using techniques from the geometry of numbers. At a more conceptual level, our results show that optimal LSH space partitions can have periodic structure. Understanding the extent to which additional structure can be imposed on these partitions, e.g. to yield low space and query complexity, remains an important open problem.
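
    For intuition, the sketch below (ours, not the paper's construction) shows the classic p-stable LSH for the ℓ₂ norm, which rounds a random one-dimensional projection onto a grid, i.e. decodes to the integer lattice; the lattice-based scheme analyzed in the paper replaces this grid with a richer lattice such as the 24-dimensional Leech lattice. Names and parameters are illustrative.

```python
# Illustrative sketch: p-stable LSH for the l2 norm.  h(x) = floor((a.x + b)/w)
# rounds a random projection onto a 1-D grid, the simplest lattice; close
# points usually land in the same bucket.
import numpy as np

rng = np.random.default_rng(1)

def make_l2_hash(dim, w):
    """One hash function h(x) = floor((a.x + b) / w) with Gaussian vector a."""
    a = rng.standard_normal(dim)
    b = rng.uniform(0.0, w)
    return lambda x: int(np.floor((a @ x + b) / w))

h = make_l2_hash(dim=64, w=4.0)
x = rng.standard_normal(64)
print(h(x), h(x + 0.01 * rng.standard_normal(64)))  # nearby point, likely same bucket
```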

    Spreading vectors for similarity search

    Discretizing multi-dimensional data distributions is a fundamental step of modern indexing methods. State-of-the-art techniques learn parameters of quantizers on training data for optimal performance, thus adapting quantizers to the data. In this work, we propose to reverse this paradigm and adapt the data to the quantizer: we train a neural net whose last layer forms a fixed parameter-free quantizer, such as pre-defined points on a hyper-sphere. As a proxy objective, we design and train a neural network that favors uniformity in the spherical latent space, while preserving the neighborhood structure after the mapping. We propose a new regularizer derived from the Kozachenko--Leonenko differential entropy estimator to enforce uniformity and combine it with a locality-aware triplet loss. Experiments show that our end-to-end approach outperforms most learned quantization methods, and is competitive with the state of the art on widely adopted benchmarks. Furthermore, we show that training without the quantization step results in almost no difference in accuracy, but yields a generic catalyzer that can be applied with any subsequent quantizer. Published at ICLR 2019.
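
    As a hedged illustration of the uniformity regularizer described in the abstract (our reading, not the authors' code; the exact form used in the paper may differ), the sketch below computes a Kozachenko–Leonenko-style penalty that rewards large nearest-neighbor distances on the unit hyper-sphere.

```python
# Hedged sketch of a Kozachenko--Leonenko-style uniformity regularizer:
# points on the unit sphere are penalized when their nearest neighbor is
# close, pushing the batch toward a uniform spread.
import torch
import torch.nn.functional as F

def koleo_regularizer(z, eps=1e-8):
    """z: (n, d) batch of latent vectors; returns a scalar uniformity penalty."""
    z = F.normalize(z, dim=1)                      # map onto the unit hyper-sphere
    diff = z.unsqueeze(1) - z.unsqueeze(0)         # (n, n, d) pairwise differences
    d2 = diff.pow(2).sum(-1)                       # squared pairwise distances
    mask = torch.eye(z.shape[0], dtype=torch.bool, device=z.device)
    d2 = d2.masked_fill(mask, float("inf"))        # ignore self-distances
    nn_dist = d2.min(dim=1).values.clamp_min(eps).sqrt()
    return -torch.log(nn_dist).mean()              # small when points are well spread

z = torch.randn(256, 32, requires_grad=True)
loss = koleo_regularizer(z)                        # would be combined with a triplet loss
loss.backward()
```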

    Binary Adaptive Embeddings from Order Statistics of Random Projections

    We use some of the largest order statistics of the random projections of a reference signal to construct a binary embedding that is adapted to signals correlated with that reference. The embedding is characterized from the analytical standpoint and shown to provide improved performance on tasks such as classification in a reduced-dimensionality space.
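
    The sketch below gives one plausible reading of the construction described in the abstract (our assumptions, not the authors' exact scheme): keep the rows of a random projection matrix on which the reference signal has the largest-magnitude responses, and binarize other signals using only those rows.

```python
# Hedged sketch, not the authors' exact scheme: select the random projections
# where a reference signal responds most strongly, then use their signs as an
# adapted binary embedding for signals correlated with that reference.
import numpy as np

rng = np.random.default_rng(2)

def adapt_embedding(reference, n_projections, n_bits):
    """Keep the n_bits rows of a random matrix with the largest |A @ reference|."""
    A = rng.standard_normal((n_projections, reference.size))
    top = np.argsort(np.abs(A @ reference))[-n_bits:]
    return A[top]                                  # reduced, signal-adapted projections

def embed(A_adapted, x):
    """Binary embedding: signs of the selected projections."""
    return (A_adapted @ x >= 0).astype(np.uint8)

ref = rng.standard_normal(256)
A_sel = adapt_embedding(ref, n_projections=1024, n_bits=64)
correlated = 0.9 * ref + 0.1 * rng.standard_normal(256)
print((embed(A_sel, ref) == embed(A_sel, correlated)).mean())  # high bit agreement
```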

    THREE TEMPORAL PERSPECTIVES ON DECENTRALIZED LOCATION-AWARE COMPUTING: PAST, PRESENT, FUTURE

    During the past four decades, miniaturization has made computing devices ubiquitous and pervasive. Today, the number of objects connected to the Internet is increasing at a rapid pace and this trend does not seem to be slowing down. These objects, which can be smartphones, vehicles, or any kind of sensors, generate large amounts of data that are almost always associated with a spatio-temporal context. The amount of this data is often so large that its processing requires the creation of a distributed system, which involves the cooperation of several computers. The ability to process these data is important for society. For example, the data collected during car journeys already make it possible to avoid traffic jams or to organize carpooling. Another example: in the near future, the maintenance interventions to be carried out on the road network will be planned with data collected using gyroscopes that detect potholes. The application domains are therefore numerous, as are the problems associated with them. The articles that make up this thesis deal with systems that share two key characteristics: a spatio-temporal context and a decentralized architecture. In addition, the systems described in these articles revolve around three temporal perspectives: the present, the past, and the future. Systems associated with the present perspective enable a very large number of connected objects to communicate in near real-time, according to a spatial context. Our contributions in this area enable this type of decentralized system to be scaled out on commodity hardware, i.e., to adapt as the volume of data that arrives in the system increases. Systems associated with the past perspective, often referred to as trajectory indexes, provide access to the large volumes of spatio-temporal data collected by connected objects. Our contributions in this area make it possible to handle particularly dense trajectory datasets, a problem that had not been addressed previously. Finally, systems associated with the future perspective rely on past trajectories to predict the trajectories that connected objects will follow. Our contributions predict the trajectories followed by connected objects with a previously unmet granularity. Although involving different domains, these contributions are structured around the common denominators of the underlying systems, which opens the possibility of dealing with these problems more generically in the near future.

    Quantifying Object Similarity: Applying Locality Sensitive Hashing for Comparing Material Culture

    We present a novel technique that compares and quantifies images, used here to compare similarities between material cultures. The method is based on locality sensitive hashing (LSH), a relatively fast and flexible algorithm for comparing image data and determining their level of similarity. The technique is applied to a dataset of sculpture faces from the Aegean, Anatolia, Cyprus, Egypt, Iran, Indus/Gandhara, the Levant, and Mesopotamia. Results indicate that the objects can be differentiated based on regional differences and show similarities to other locations that share specific material culture traits. Images from known locations enable a network of compared objects to be constructed, where inverse closeness centrality and link weights are used to indicate areas that have greater or lesser cultural similarity to other regions. Different periods are assessed, and the results demonstrate that objects from earlier than the 9th century BCE show greater similarity to other local and Egyptian items. Objects from between the 9th and 4th centuries BCE increasingly show inter-regional similarity, with the eastern Mediterranean, including the Aegean, Anatolia, Egypt, and Cyprus, having close similarity to multiple regions. After the 4th century BCE, greater sculptural similarity is found across a wide area, including the Aegean, Cyprus, Egypt, Mesopotamia, and Gandhara. In general, sculptures from more distant areas increase in similarity in later periods, that is, starting from the 9th century BCE. The results demonstrate that the technique can be applied to quantifying object similarity and extended to a broad range of archaeological objects, while also being a tool for rapid analysis that requires minimal data compared to some machine learning techniques. The code and data are provided as part of the outputs.
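
    As a toy illustration of the network analysis described in the abstract (ours, not the paper's pipeline), the sketch below turns hypothetical pairwise similarity scores into a weighted graph with networkx and computes closeness centrality using inverse similarity as edge length; the scores stand in for the LSH-based image comparisons.

```python
# Toy sketch: build a weighted similarity network between regions and read
# off closeness centrality.  The similarity values are hypothetical stand-ins
# for the LSH-based image comparisons described in the abstract.
import networkx as nx

similarities = {                      # hypothetical similarity scores in [0, 1]
    ("Aegean", "Cyprus"): 0.82,
    ("Aegean", "Egypt"): 0.74,
    ("Cyprus", "Egypt"): 0.79,
    ("Mesopotamia", "Egypt"): 0.61,
    ("Mesopotamia", "Aegean"): 0.55,
}

G = nx.Graph()
for (a, b), sim in similarities.items():
    # Use 1/similarity as edge length so that similar regions are "close".
    G.add_edge(a, b, weight=sim, length=1.0 / sim)

closeness = nx.closeness_centrality(G, distance="length")
for region, score in sorted(closeness.items(), key=lambda kv: -kv[1]):
    print(f"{region}: {score:.3f}")
```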

    Large-scale Content-based Visual Information Retrieval

    Rather than restricting search to the use of metadata, content-based information retrieval methods attempt to index, search and browse digital objects by means of signatures or features describing their actual content. Such methods have been intensively studied in the multimedia community to allow managing the massive amount of raw multimedia documents created every day (e.g. video is expected to account for 84% of U.S. internet traffic by 2018). Recent years have consequently witnessed a consistent growth of content-aware and multi-modal search engines deployed on massive multimedia data. Popular multimedia search applications such as Google Images, YouTube, Shazam, TinEye or MusicID clearly demonstrate that the first generation of large-scale audio-visual search technologies is now mature enough to be deployed on real-world big data. All these successful applications greatly benefited from 15 years of research on multimedia analysis and efficient content-based indexing techniques. Yet the maturity reached by the first generation of content-based search engines does not preclude an intensive research activity in the field. There are actually still many hard problems to be solved before we can retrieve any information in images or sounds as easily as we do in text documents. Content-based search methods have to reach a finer understanding of the contents as well as a higher semantic level. This requires modeling the raw signals by more and more complex and numerous features, so that the algorithms for analyzing, indexing and searching such features have to evolve accordingly. This thesis describes several of my works related to large-scale content-based information retrieval. The different contributions are presented in a bottom-up fashion reflecting a typical three-tier software architecture of an end-to-end multimedia information retrieval system. The lowest layer is only concerned with managing, indexing and searching large sets of high-dimensional feature vectors, whatever their origin or role in the upper levels (visual or audio features, global or part-based descriptions, low or high semantic level, etc.). The middle layer works at the document level and is in charge of analyzing, indexing and searching collections of documents. It typically extracts and embeds the low-level features, implements the querying mechanisms and post-processes the results returned by the lower layer. The upper layer works at the applicative level and is in charge of providing useful and interactive functionalities to the end-user. It typically implements the front-end of the search application, the crawler and the orchestration of the different indexing and search services.
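
    As a generic stand-in for the lowest layer described above (not the thesis's actual indexing structures), the sketch below shows the kind of interface such a layer exposes: add high-dimensional feature vectors, then retrieve the k nearest ones to a query, here via brute-force search.

```python
# Generic stand-in for the lowest layer: a brute-force L2 index over
# high-dimensional feature vectors with an add/search interface.  Real
# systems replace the linear scan with approximate indexing structures.
import numpy as np

class FlatIndex:
    """Brute-force nearest-neighbor index over feature vectors."""

    def __init__(self, dim):
        self.dim = dim
        self.vectors = np.empty((0, dim), dtype=np.float32)

    def add(self, x):
        self.vectors = np.vstack([self.vectors, np.asarray(x, dtype=np.float32)])

    def search(self, query, k=5):
        d = np.linalg.norm(self.vectors - query, axis=1)   # distance to every item
        idx = np.argsort(d)[:k]                            # k closest items
        return idx, d[idx]

rng = np.random.default_rng(3)
index = FlatIndex(dim=128)
index.add(rng.standard_normal((10_000, 128)))
ids, dists = index.search(rng.standard_normal(128), k=5)
print(ids, dists)
```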