434 research outputs found

    PASS-JOIN: A Partition-based Method for Similarity Joins

    Full text link
    As an essential operation in data cleaning, the similarity join has attracted considerable attention from the database community. In this paper, we study string similarity joins with edit-distance constraints, which find similar string pairs from two large sets of strings whose edit distance is within a given threshold. Existing algorithms are efficient either for short strings or for long strings, and there is no algorithm that can efficiently and adaptively support both short strings and long strings. To address this problem, we propose a partition-based method called Pass-Join. Pass-Join partitions a string into a set of segments and creates inverted indices for the segments. Then for each string, Pass-Join selects some of its substrings and uses the selected substrings to find candidate pairs using the inverted indices. We devise efficient techniques to select the substrings and prove that our method can minimize the number of selected substrings. We develop novel pruning techniques to efficiently verify the candidate pairs. Experimental results show that our algorithms are efficient for both short strings and long strings, and outperform state-of-the-art methods on real datasets.Comment: VLDB201

    SUPPORTING ADVANCED INTERACTIVE SEARCH USING INVERTED INDEX

    Get PDF
    Ph.DDOCTOR OF PHILOSOPH

    Lossless seeds for searching short patterns with high error rates

    Get PDF
    International audienceWe address the problem of approximate pattern matching using the Levenshtein distance. Given a text T and a pattern P , find alllocations in T that differ by at most k errors from P . For that purpose, we propose a filtration algorithm that is based on a novel type of seeds,combining exact parts and parts with a fixed number of errors. Experimental tests show that the method is specifically well-suited for short patterns with a large number of error

    Ab initio detection of fuzzy amino acid tandem repeats in protein sequences

    Get PDF
    Background Tandem repetitions within protein amino acid sequences often correspond to regular secondary structures and form multi-repeat 3D assemblies of varied size and function. Developing internal repetitions is one of the evolutionary mechanisms that proteins employ to adapt their structure and function under evolutionary pressure. While there is keen interest in understanding such phenomena, detection of repeating structures based only on sequence analysis is considered an arduous task, since structure and function is often preserved even under considerable sequence divergence (fuzzy tandem repeats). Results In this paper we present PTRStalker, a new algorithm for ab-initio detection of fuzzy tandem repeats in protein amino acid sequences. In the reported results we show that by feeding PTRStalker with amino acid sequences from the UniProtKB/Swiss-Prot database we detect novel tandemly repeated structures not captured by other state-of-the-art tools. Experiments with membrane proteins indicate that PTRStalker can detect global symmetries in the primary structure which are then reflected in the tertiary structure. Conclusions PTRStalker is able to detect fuzzy tandem repeating structures in protein sequences, with performance beyond the current state-of-the art. Such a tool may be a valuable support to investigating protein structural properties when tertiary X-ray data is not available

    Efficiently indexing sparse wide tables in community systems

    Get PDF
    Master'sMASTER OF SCIENC

    Rank-aware, Approximate Query Processing on the Semantic Web

    Get PDF
    Search over the Semantic Web corpus frequently leads to queries having large result sets. So, in order to discover relevant data elements, users must rely on ranking techniques to sort results according to their relevance. At the same time, applications oftentimes deal with information needs, which do not require complete and exact results. In this thesis, we face the problem of how to process queries over Web data in an approximate and rank-aware fashion
    corecore