434 research outputs found

PASS-JOIN: A Partition-based Method for Similarity Joins

Authors: Deng Dong, Feng Jianhua, Li Guoliang, Wang Jiannan
Publication date: 01/01/2011

As an essential operation in data cleaning, the similarity join has attracted considerable attention from the database community. In this paper, we study string similarity joins with edit-distance constraints, which find similar string pairs from two large sets of strings whose edit distance is within a given threshold. Existing algorithms are efficient either for short strings or for long strings, and there is no algorithm that can efficiently and adaptively support both short strings and long strings. To address this problem, we propose a partition-based method called Pass-Join. Pass-Join partitions a string into a set of segments and creates inverted indices for the segments. Then for each string, Pass-Join selects some of its substrings and uses the selected substrings to find candidate pairs using the inverted indices. We devise efficient techniques to select the substrings and prove that our method can minimize the number of selected substrings. We develop novel pruning techniques to efficiently verify the candidate pairs. Experimental results show that our algorithms are efficient for both short strings and long strings, and outperform state-of-the-art methods on real datasets.

Comment: VLDB201
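The partition-and-probe idea in the abstract above can be illustrated with a minimal sketch. It relies on the pigeonhole argument that if two strings are within edit distance tau and one of them is cut into tau + 1 segments, at least one segment must appear unchanged as a substring of the other. Everything here is an assumption made for illustration: the function names (`segment_lengths`, `partition`, `similarity_self_join`) are hypothetical, the even-length partitioning is one plausible choice, and the probe step naively tries every substring of the right length rather than the minimized substring selection and pruning techniques the paper describes.

```python
from collections import defaultdict


def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute or match
        prev = curr
    return prev[-1]


def segment_lengths(length, tau):
    """Lengths of tau + 1 contiguous segments covering a string of this length."""
    k = tau + 1
    base, extra = divmod(length, k)
    # The last `extra` segments are one character longer.
    return [base + (1 if i >= k - extra else 0) for i in range(k)]


def partition(s, tau):
    """Cut s into tau + 1 segments (assumed even partitioning)."""
    segments, pos = [], 0
    for seg_len in segment_lengths(len(s), tau):
        segments.append(s[pos:pos + seg_len])
        pos += seg_len
    return segments


def similarity_self_join(strings, tau):
    """Return all index pairs (i, j), i < j, with edit distance <= tau."""
    # index[(string_length, slot)][segment] -> ids of strings owning that segment
    index = defaultdict(lambda: defaultdict(list))
    results = set()
    # Visit strings by ascending length so indexed strings are never longer.
    for rid in sorted(range(len(strings)), key=lambda i: len(strings[i])):
        r = strings[rid]
        candidates = set()
        for length in range(max(0, len(r) - tau), len(r) + 1):
            for slot, seg_len in enumerate(segment_lengths(length, tau)):
                bucket = index.get((length, slot))
                if not bucket:
                    continue
                # Naive probe: every substring of r with the segment's length.
                for start in range(len(r) - seg_len + 1):
                    candidates.update(bucket.get(r[start:start + seg_len], ()))
        # Verify candidates with the exact distance computation.
        for sid in candidates:
            if edit_distance(r, strings[sid]) <= tau:
                results.add((min(rid, sid), max(rid, sid)))
        for slot, seg in enumerate(partition(r, tau)):
            index[(len(r), slot)][seg].append(rid)
    return results
```

For example, `similarity_self_join(["vldb", "pvldb", "sigmod", "sigir", "icde"], 1)` reports only the pair of indices for "vldb" and "pvldb". The filter is lossless, so the result matches a brute-force comparison of all pairs; only the number of verified candidates changes.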

arXiv.org e-Print Archive

CiteSeerX

SUPPORTING ADVANCED INTERACTIVE SEARCH USING INVERTED INDEX

Author: ZHENG YUXIN
Publication date: 31/07/2015

Ph.D. DOCTOR OF PHILOSOPHY

ScholarBank@NUS

Efficient and effective KNN sequence search with approximate n-grams

Authors: Arasu A., Gravano L., Li C., Navarro G., Yang Z.
Publication venue: 'VLDB Endowment'

Crossref

Lossless seeds for searching short patterns with high error rates

Authors: Salson Mikaël, Touzet Hélène, Vroland Christophe
Publication venue: HAL CCSD
Publication date: 01/10/2014

We address the problem of approximate pattern matching using the Levenshtein distance. Given a text T and a pattern P, find all locations in T that differ by at most k errors from P. For that purpose, we propose a filtration algorithm that is based on a novel type of seeds, combining exact parts and parts with a fixed number of errors. Experimental tests show that the method is specifically well-suited for short patterns with a large number of errors.
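The problem statement in the abstract above (find all locations in T within k errors of P) can be made concrete with a small baseline. The sketch below is not the seed-based filtration the authors propose; it is the standard dynamic-programming scan (often attributed to Sellers) that filtration methods aim to speed up. The function name `approximate_occurrences` and the choice to report exclusive end positions are assumptions made for illustration.

```python
def approximate_occurrences(text, pattern, k):
    """Return end positions j (exclusive, j >= 1) such that some substring
    of text ending at j is within Levenshtein distance k of pattern."""
    m = len(pattern)
    # col[i] = smallest distance between pattern[:i] and any substring of
    # text ending at the current position. col[0] stays 0 because a match
    # may start anywhere in the text.
    col = list(range(m + 1))
    ends = []
    for j, tc in enumerate(text, start=1):
        prev_diag = col[0]            # value for (i - 1, j - 1)
        for i in range(1, m + 1):
            old = col[i]              # value for (i, j - 1)
            col[i] = min(old + 1,                               # extra text char
                         col[i - 1] + 1,                        # missing pattern char
                         prev_diag + (pattern[i - 1] != tc))    # substitute or match
            prev_diag = old
        if col[m] <= k:
            ends.append(j)
    return ends
```

With `text = "abcdefg"` and `pattern = "cde"`, the call with `k = 0` returns `[5]` (the exact occurrence), and the call with `k = 1` returns `[4, 5, 6]`, since "cd" and "cdef" are each one edit away from the pattern. The scan costs time proportional to the product of the text and pattern lengths, which is the cost a filtration step tries to avoid on most of the text.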

HAL - Lille 3

INRIA a CCSD electronic archive server

Ab initio detection of fuzzy amino acid tandem repeats in protein sequences

Authors: Pellegrini Marco, Renda M. Elena, VECCHIO A
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2012

Background: Tandem repetitions within protein amino acid sequences often correspond to regular secondary structures and form multi-repeat 3D assemblies of varied size and function. Developing internal repetitions is one of the evolutionary mechanisms that proteins employ to adapt their structure and function under evolutionary pressure. While there is keen interest in understanding such phenomena, detection of repeating structures based only on sequence analysis is considered an arduous task, since structure and function are often preserved even under considerable sequence divergence (fuzzy tandem repeats).

Results: In this paper we present PTRStalker, a new algorithm for ab-initio detection of fuzzy tandem repeats in protein amino acid sequences. In the reported results we show that by feeding PTRStalker with amino acid sequences from the UniProtKB/Swiss-Prot database we detect novel tandemly repeated structures not captured by other state-of-the-art tools. Experiments with membrane proteins indicate that PTRStalker can detect global symmetries in the primary structure which are then reflected in the tertiary structure.

Conclusions: PTRStalker is able to detect fuzzy tandem repeating structures in protein sequences, with performance beyond the current state of the art. Such a tool may be a valuable support to investigating protein structural properties when tertiary X-ray data is not available.

Springer - Publisher Connector

Archivio della Ricerca - Università di Pisa

PubMed Central

Efficiently indexing sparse wide tables in community systems

Author: HUI MEI
Publication date: 25/05/2010

Master's MASTER OF SCIENCE

ScholarBank@NUS

Rank-aware, Approximate Query Processing on the Semantic Web

Author: Wagner Andreas Josef
Publication venue: KIT-Bibliothek, Karlsruhe
Publication date: 01/01/2014

Search over the Semantic Web corpus frequently leads to queries having large result sets. So, in order to discover relevant data elements, users must rely on ranking techniques to sort results according to their relevance. At the same time, applications often deal with information needs that do not require complete and exact results. In this thesis, we face the problem of how to process queries over Web data in an approximate and rank-aware fashion.

KITopen