
    DB-LSH: Locality-Sensitive Hashing with Query-based Dynamic Bucketing

    Among many solutions to the high-dimensional approximate nearest neighbor (ANN) search problem, locality-sensitive hashing (LSH) is known for its sub-linear query time and robust theoretical guarantee on query accuracy. Traditional LSH methods can generate a small number of candidates quickly from hash tables but suffer from large index sizes and hash boundary problems. Recent studies that address these issues often incur extra overhead to identify eligible candidates or remove false positives, so the query time is no longer sub-linear. To address this dilemma, in this paper we propose a novel LSH scheme called DB-LSH, which supports efficient ANN search for large high-dimensional datasets. It organizes the projected spaces with multi-dimensional indexes rather than fixed-width hash buckets. Our approach significantly reduces the space cost by avoiding the need to maintain many hash tables for different bucket sizes. During the query phase of DB-LSH, a small number of high-quality candidates can be generated efficiently by dynamically constructing query-based hypercubic buckets with the required widths through index-based window queries. For a dataset of $n$ $d$-dimensional points with approximation ratio $c$, our rigorous theoretical analysis shows that DB-LSH achieves a smaller query cost $O(n^{\rho^*} d \log n)$, where $\rho^*$ is bounded by $1/c^{\alpha}$ while the bound is $1/c$ in existing work. Extensive experiments on real-world data demonstrate the superiority of DB-LSH over state-of-the-art methods in both efficiency and accuracy. Comment: Accepted by ICDE 202
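
    The query-phase idea above — index the projected points once and form a hypercubic bucket of the required width around each query via a window query — can be read roughly as in the sketch below. This is an illustrative interpretation of the abstract, not the authors' implementation; the projection count m, the window width w, and the use of a KD-tree as the multi-dimensional index are all assumptions.

```python
# Illustrative sketch of query-based dynamic bucketing (not the authors' code).
# Points are projected with random LSH projections into a low-dimensional space
# that is indexed once; at query time a hypercubic "bucket" of width w is formed
# around the projected query via a window (range) query, then candidates are
# verified in the original space. cKDTree stands in for the paper's
# multi-dimensional index; m and w are assumed parameters.
import numpy as np
from scipy.spatial import cKDTree

class DynamicBucketLSH:
    def __init__(self, data, m=8, seed=0):
        rng = np.random.default_rng(seed)
        self.data = data                                  # (n, d) dataset
        self.A = rng.standard_normal((data.shape[1], m))  # m random projections
        self.proj = data @ self.A                         # projected points (n, m)
        self.index = cKDTree(self.proj)                   # multi-dimensional index

    def query(self, q, w, k=10):
        q_proj = q @ self.A
        # Window query: all points whose projection falls in the hypercube
        # [q_proj - w/2, q_proj + w/2]^m (a Chebyshev ball of radius w/2).
        cand = self.index.query_ball_point(q_proj, r=w / 2, p=np.inf)
        if not cand:
            return []
        # Verify candidates by exact distance in the original space.
        dists = np.linalg.norm(self.data[cand] - q, axis=1)
        order = np.argsort(dists)[:k]
        return [(cand[i], dists[i]) for i in order]
```

    Re-issuing the same query with a progressively wider w plays the role of switching to a coarser hash bucket, without maintaining separate hash tables per bucket size.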

    Fast anomaly detection with locality-sensitive hashing and hyperparameter autotuning

    This paper presents LSHAD, an anomaly detection (AD) method based on locality-sensitive hashing (LSH), capable of dealing with large-scale datasets. The resulting algorithm is highly parallelizable, and its implementation in Apache Spark further increases its ability to handle very large datasets. Moreover, the algorithm incorporates an automatic hyperparameter tuning mechanism so that users do not have to perform costly manual tuning. Our LSHAD method is novel in that both hyperparameter automation and distributed operation are unusual in AD techniques. Our results for experiments with LSHAD across a variety of datasets point to state-of-the-art AD performance while handling much larger datasets than state-of-the-art alternatives. In addition, evaluation results for the tradeoff between AD performance and scalability show that our method offers significant advantages over competing methods. This research has been financially supported in part by the Spanish Ministerio de Economía y Competitividad (project PID-2019-109238GB-C22) and by the Xunta de Galicia (grants ED431C 2018/34 and ED431G 2019/01) through European Union ERDF funds. CITIC, as a research center accredited by the Galician University System, is funded by the Consellería de Cultura, Educación e Universidades of the Xunta de Galicia, supported 80% through ERDF funds (ERDF Operational Programme Galicia 2014–2020) and 20% by the Secretaría Xeral de Universidades (Grant ED431G 2019/01). This work was also supported by National Funds through the Portuguese FCT - Fundação para a Ciência e a Tecnologia (projects UIDB/00760/2020 and UIDP/00760/2020).
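
    As a rough illustration of how LSH can drive anomaly scoring in this style (not the paper's Spark implementation), points can be hashed into coarse buckets by quantized random projections and scored by how sparsely populated their buckets are; the hyperparameters below (number of tables, projections per table, bucket width) stand in for the quantities the paper's autotuning mechanism would select automatically.

```python
# Minimal sketch of LSH-based anomaly scoring in the spirit of LSHAD
# (an assumption for illustration, not the paper's implementation).
# Each table hashes points by quantized random projections; a point that
# lands in sparsely populated buckets receives a high anomaly score.
import numpy as np
from collections import Counter

def lsh_anomaly_scores(X, n_tables=10, n_projections=4, bucket_width=1.0, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    scores = np.zeros(n)
    for _ in range(n_tables):
        A = rng.standard_normal((d, n_projections))       # random projections
        b = rng.uniform(0, bucket_width, n_projections)    # random offsets
        keys = np.floor((X @ A + b) / bucket_width).astype(int)
        buckets = Counter(map(tuple, keys))                 # bucket occupancy
        # Rarer buckets -> higher anomaly score; averaged over tables.
        scores += np.array([-np.log(buckets[tuple(k)] / n) for k in keys])
    return scores / n_tables
```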

    Scholarly Big Data Quality Assessment: A Case Study of Document Linking and Conflation with S2ORC

    Recently, the Allen Institute for Artificial Intelligence released the Semantic Scholar Open Research Corpus (S2ORC), one of the largest open-access scholarly big datasets with more than 130 million scholarly paper records. S2ORC contains a significant portion of automatically generated metadata. The metadata quality could impact downstream tasks such as citation analysis, citation prediction, and link analysis. In this project, we assess the document linking quality and estimate the document conflation rate for the S2ORC dataset. Using semi-automatically curated ground truth corpora, we estimated that the overall document linking quality is high, with 92.6% of documents correctly linking to six major databases, but the linking quality varies depending on subject domains. The document conflation rate is around 2.6%, meaning that about 97.4% of documents are unique. We further quantitatively compared three near-duplicate detection methods using the ground truth created from S2ORC. The experiments indicated that locality-sensitive hashing was the best method in terms of effectiveness and scalability, achieving high performance (F1=0.960) and a much reduced runtime. Our code and data are available at https://github.com/lamps-lab/docconflation
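
    The abstract does not spell out the LSH configuration used; a common way to realize near-duplicate detection with locality-sensitive hashing is MinHash with LSH banding, sketched below using the datasketch library. The shingle size, similarity threshold, and permutation count are illustrative assumptions, not the settings of the S2ORC experiments.

```python
# Hedged sketch of near-duplicate detection with MinHash + LSH banding.
# Each document is shingled, MinHash-signed, and indexed; querying the index
# returns candidate near-duplicate pairs without all-pairs comparison.
from datasketch import MinHash, MinHashLSH

def shingles(text, k=5):
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}

def find_near_duplicates(docs, threshold=0.9, num_perm=128):
    """docs: dict mapping doc_id -> text. Returns a set of candidate pairs."""
    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    minhashes = {}
    for doc_id, text in docs.items():
        m = MinHash(num_perm=num_perm)
        for sh in shingles(text):
            m.update(sh.encode("utf8"))
        minhashes[doc_id] = m
        lsh.insert(doc_id, m)
    pairs = set()
    for doc_id, m in minhashes.items():
        for other in lsh.query(m):
            if other != doc_id:
                pairs.add(tuple(sorted((doc_id, other))))
    return pairs
```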

    Implementation of voyage data recording device using a digital forensics-based hash algorithm

    Identifying the causes of marine accidents is difficult because of problems in scene preservation, reenactment, and the procurement of witnesses. Thanks to new regulations, larger vessels are now required to carry voyage data recorders (VDRs) and automatic identification systems (AISs). However, the content of these devices, which is created, stored, and managed digitally, has security vulnerabilities such as the potential for data modification. Therefore, when managing digital records, it is important to guarantee their reliability. To this end, we suggest a digital forensics-based digital records migration method using a hash algorithm to guarantee the integrity and authenticity of digital records.
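
    A minimal sketch of the hash-based integrity check such a migration scheme relies on is shown below; SHA-256 and the JSON record layout are assumed choices for illustration and are not taken from the paper.

```python
# Illustrative sketch of hash-based integrity checking for migrated voyage
# data records. A digest is computed when a record is captured and re-checked
# after migration, so any modification of the stored record is detectable.
import hashlib
import json

def record_digest(record: dict) -> str:
    # Canonical serialization so the same record always hashes identically.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def verify_migration(original: dict, migrated: dict) -> bool:
    return record_digest(original) == record_digest(migrated)

# Example record fields are hypothetical.
rec = {"timestamp": "2023-05-01T12:00:00Z", "lat": 35.1, "lon": 129.0, "speed_kn": 12.4}
stored_digest = record_digest(rec)
assert verify_migration(rec, dict(rec))                            # unchanged record verifies
assert record_digest({**rec, "speed_kn": 0.0}) != stored_digest    # tampering is detected
```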

    Cloud-Scale Entity Resolution: Current State and Open Challenges

    Entity resolution (ER) is the process of identifying records in information systems that refer to the same real-world entity. Because data volumes have grown so large over the past two decades, parallel techniques are needed to meet ER's requirements for high performance and scalability. The development of parallel ER has reached a relatively mature stage and has found its way into several applications. In this work, we first comprehensively survey the state of the art in parallel ER approaches. From this overview, we extract classification criteria for parallel ER and classify and compare the approaches based on those criteria. Finally, we identify open research questions and challenges and discuss potential solutions and further research directions in this field.
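
    To make the core ER task concrete, the toy sketch below pairs records that share a cheap blocking key and then compares them with a string-similarity matcher; it is a generic serial illustration, not any of the surveyed parallel systems, though in a parallel setting each block could be processed as an independent task.

```python
# Tiny illustration of blocking + pairwise matching for entity resolution.
# The records, blocking key (zip code), and similarity threshold are all
# hypothetical choices for this example.
from itertools import combinations
from difflib import SequenceMatcher

records = [
    {"id": 1, "name": "Jon Smith",  "zip": "94110"},
    {"id": 2, "name": "John Smith", "zip": "94110"},
    {"id": 3, "name": "Ana Lopez",  "zip": "10001"},
]

# Blocking: only compare records that share a cheap key (here, the zip code).
blocks = {}
for r in records:
    blocks.setdefault(r["zip"], []).append(r)

matches = []
for block in blocks.values():
    for a, b in combinations(block, 2):
        if SequenceMatcher(None, a["name"], b["name"]).ratio() > 0.8:
            matches.append((a["id"], b["id"]))

print(matches)  # [(1, 2)] -- the two "Smith" records refer to the same person
```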