9 research outputs found

    Efficient distributed locality sensitive hashing

    An Efficient Probabilistic Algorithm to Detect Periodic Patterns in Spatio-Temporal Datasets

    Author Contributions: Conceptualization, C.G.-S.; methodology, C.G.-S.; software, C.G.-S.; validation, C.G.-S., P.G. and M.A.P.; formal analysis, C.G.-S.; investigation, C.G.-S., P.G. and M.A.P.; data curation, C.G.-S.; writing—original draft preparation, C.G.-S., P.G. and M.A.P.; writing—review and editing, M.A.P.; funding acquisition, C.G.-S. and M.A.P. All authors have read and agreed to the published version of the manuscript. Peer reviewed.

    Identifying external plagiarism with Locality-sensitive hashing

    The Heuristic Retrieval (HR) task aims to retrieve the set of documents from which external plagiarism detection then identifies plagiarized pieces of text. In this context, we present the Minmax Circular Sector Arcs algorithms, which treat the HR task as an approximate k-nearest neighbor search problem. The Minmax Circular Sector Arcs algorithms aim to retrieve the documents containing the largest amounts of plagiarized fragments while reducing the time needed to accomplish the HR task. Our theoretical framework is based on two aspects: (i) a triangular property that encodes a range of sketches into a unique value; and (ii) a Circular Sector Arc property that makes (i) more accurate. Both properties were proposed for handling high-dimensional spaces, hashing them down to a smaller number of hash values. Our two methods, Minmax Circular Sector Arcs Lower Bound and Minmax Circular Sector Arcs Full Bound, achieved recall slightly lower than Minmaxwise hashing in exchange for better speedup in document indexing and in query extraction and retrieval time on high-dimensional plagiarism-related datasets.
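
    The Minmax Circular Sector Arcs algorithms themselves are not reproduced here; as a point of reference, the following is a minimal sketch of the min-wise (MinHash) sketching that this family of methods builds on, where a document is reduced to a short signature whose positional agreement rate estimates set similarity. All names and parameters (NUM_HASHES, the shingle length k) are illustrative assumptions, not values from the paper.

    import random

    # A minimal MinHash sketch: documents are reduced to short signatures
    # whose positional agreement estimates Jaccard similarity. This
    # illustrates the sketching idea the HR task builds on; it is NOT the
    # paper's Minmax Circular Sector Arcs method.
    NUM_HASHES = 64              # sketch length (illustrative assumption)
    PRIME = (1 << 61) - 1        # large prime for universal hashing

    random.seed(42)
    HASH_PARAMS = [(random.randrange(1, PRIME), random.randrange(PRIME))
                   for _ in range(NUM_HASHES)]

    def shingles(text, k=5):
        # Character k-grams; k is an illustrative choice.
        return {text[i:i + k] for i in range(len(text) - k + 1)}

    def minhash(doc):
        # One minimum per hash function yields a fixed-length signature.
        grams = [hash(g) & 0xFFFFFFFF for g in shingles(doc)]
        return [min((a * g + b) % PRIME for g in grams)
                for a, b in HASH_PARAMS]

    def estimated_jaccard(sig_a, sig_b):
        # Fraction of agreeing positions approximates Jaccard similarity.
        return sum(x == y for x, y in zip(sig_a, sig_b)) / NUM_HASHES

    source = "the quick brown fox jumps over the lazy dog"
    suspect = "the quick brown fox leaps over the lazy dog"
    print(estimated_jaccard(minhash(source), minhash(suspect)))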

    Scalable Nearest Neighbor Search with Compact Codes

    An important characteristic of the recent decade is the dramatic growth in the use and generation of data. From collections of images, documents and videos, to genetic data and network traffic statistics, modern technologies and cheap storage have made it possible to accumulate huge datasets. But how can we effectively use all this data? The growing sizes of modern datasets make it crucial to develop new algorithms and tools capable of sifting through this data efficiently. A central computational primitive for analyzing large datasets is the Nearest Neighbor Search problem, in which the goal is to preprocess a set of objects so that later, given a query object, one can find the data object closest to the query. In most situations involving high-dimensional objects, exhaustive search, which compares the query with every item in the dataset, has a prohibitive cost in both runtime and memory. This thesis focuses on the design of algorithms and tools for fast and cost-efficient nearest neighbor search. The proposed techniques advocate the use of compressed and discrete codes for representing the neighborhood structure of data in a compact way. Transforming high-dimensional items, such as raw images, into similarity-preserving compact codes has both computational and storage advantages: compact codes can be stored efficiently using only a few bits per data item and, more importantly, they can be compared extremely fast using bit-wise or look-up table operators. Motivated by this view, the present work explores two main research directions: 1) finding mappings that better preserve the given notion of similarity while keeping the codes as compressed as possible, and 2) building efficient data structures that support non-exhaustive search among compact codes. Our large-scale experimental results, reported on various benchmarks including datasets of up to one billion items, show a boost in retrieval performance in comparison to the state of the art.
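
    A brief sketch of why such compact codes pay off at query time: with binary codes packed into machine words, comparing two items reduces to an XOR followed by a population count, so even an exhaustive scan is cheap. This illustrates the bit-wise comparison mentioned in the abstract under an assumed 64-bit code layout; it is not the thesis's specific encoder or index structure.

    # Hamming distance between packed binary codes via XOR + popcount.
    # Illustrative sketch only; assumes Python 3.10+ for int.bit_count().
    def hamming(code_a: int, code_b: int) -> int:
        return (code_a ^ code_b).bit_count()

    def nearest(query: int, database: list[int]) -> int:
        # Exhaustive scan; each comparison is a couple of word-level ops.
        return min(range(len(database)),
                   key=lambda i: hamming(query, database[i]))

    codes = [0b1011_0010, 0b1111_0000, 0b0000_1111]
    print(nearest(0b1011_0011, codes))  # -> 0 (distance 1 to codes[0])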