
    Average-Case Optimal Approximate Circular String Matching

    Approximate string matching is the problem of finding all factors of a text t of length n that are at a distance at most k from a pattern x of length m. Approximate circular string matching is the problem of finding all factors of t that are at a distance at most k from x or from any of its rotations. In this article, we present a new algorithm for approximate circular string matching under the edit distance model with optimal average-case search time O(n(k + log m)/m). Optimal average-case search time can also be achieved by the algorithms for multiple approximate string matching (Fredriksson and Navarro, 2004) using x and its rotations as the set of multiple patterns. Here we reduce the preprocessing time and space requirements compared to that approach.
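
To make the problem statement concrete, here is a minimal brute-force sketch in Python. It is not the algorithm from the article: it simply runs the textbook dynamic program for approximate matching under edit distance once per rotation of x, which costs roughly O(n m^2) time. The function names and the convention of reporting 1-based end positions are assumptions made for this illustration.

```python
def approx_end_positions(pattern, text, k):
    """End positions j (1..n) where some factor of text ending at j is
    within edit distance k of pattern (classic column-wise DP)."""
    m = len(pattern)
    prev = list(range(m + 1))  # column for the empty text prefix: D[i][0] = i
    ends = set()
    for j, c in enumerate(text, 1):
        cur = [0] * (m + 1)  # D[0][j] = 0, so a match may start anywhere
        for i in range(1, m + 1):
            cur[i] = min(
                prev[i - 1] + (pattern[i - 1] != c),  # match or substitution
                prev[i] + 1,                          # extra text symbol
                cur[i - 1] + 1,                       # missing pattern symbol
            )
        if cur[m] <= k:
            ends.add(j)
        prev = cur
    return ends


def approx_circular_end_positions(pattern, text, k):
    """Naive circular variant: union of the results over all rotations."""
    ends = set()
    for r in range(len(pattern)):
        rotation = pattern[r:] + pattern[:r]
        ends |= approx_end_positions(rotation, text, k)
    return sorted(ends)
```

For example, with k = 0 the pattern "abc" is found in "xxcabxx" only through its rotation "cab", which ends at position 5.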

    Linking Datasets on Organizations Using Half A Billion Open Collaborated Records

    Scholars studying organizations often work with multiple datasets lacking shared unique identifiers or covariates. In such situations, researchers may turn to approximate string matching methods to combine datasets. String matching, although useful, faces fundamental challenges. Even when two strings appear similar to humans, fuzzy matching often does not work because it fails to adapt to the informativeness of the character combinations presented. Worse, many entities have multiple names that are dissimilar (e.g., "Fannie Mae" and "Federal National Mortgage Association"), a case where string matching has little hope of succeeding. This paper introduces data from a prominent employment-related networking site (LinkedIn) as a tool to address these problems. We propose interconnected approaches to leveraging the massive amount of information from LinkedIn regarding organizational name-to-name links. The first approach builds a machine learning model for predicting matches from character strings, treating the trillions of user-contributed organizational name pairs as a training corpus: this approach constructs a string matching metric that explicitly maximizes match probabilities. A second approach identifies relationships between organization names using network representations of the LinkedIn data. A third approach combines the first and second. We document substantial improvements over fuzzy matching in applications, making all methods accessible in open-source software ("LinkOrgs")

    Faster Longest Common Extension Queries in Strings over General Alphabets

    Longest common extension queries (often called longest common prefix queries) constitute a fundamental building block in multiple string algorithms, for example computing runs and approximate pattern matching. We show that a sequence of $q$ LCE queries for a string of size $n$ over a general ordered alphabet can be realized in $O(q \log \log n + n \log^* n)$ time making only $O(q + n)$ symbol comparisons. Consequently, all runs in a string over a general ordered alphabet can be computed in $O(n \log \log n)$ time making $O(n)$ symbol comparisons. Our results improve upon a solution by Kosolobov (Information Processing Letters, 2016), who gave an algorithm with $O(n \log^{2/3} n)$ running time and conjectured that $O(n)$ time is possible. We make significant progress towards resolving this conjecture. Our techniques extend to the case of general unordered alphabets, when the time increases to $O(q \log n + n \log^* n)$. The main tools are difference covers and the disjoint-sets data structure. Comment: Accepted to CPM 201
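
For readers unfamiliar with the query itself, the following Python sketch shows what a single LCE query returns. It is the naive baseline, which can take linear time per query, and not the technique from the paper, which relies on difference covers and a disjoint-sets structure. The function name `lce` and the 0-based indexing are assumptions for this illustration. Note that it uses only equality tests between symbols, as in the general unordered alphabet setting.

```python
def lce(s, i, j):
    """Length of the longest common prefix of s[i:] and s[j:].

    Naive symbol-by-symbol comparison using only equality tests.
    """
    n = len(s)
    length = 0
    while i + length < n and j + length < n and s[i + length] == s[j + length]:
        length += 1
    return length
```

For instance, in "abracadabra" the suffixes starting at positions 0 and 7 share the prefix "abra", so the query returns 4.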

    The Complexity of the Approximate Multiple Pattern Matching Problem for Random Strings

    We describe a multiple string pattern matching algorithm which is well-suited for approximate search and dictionaries composed of words of different lengths. We prove that this algorithm has optimal complexity rate up to a multiplicative constant, for arbitrary dictionaries. This extends to arbitrary dictionaries the classical results of Yao [SIAM J. Comput. 8, 1979], and Chang and Marr [Proc. CPM94, 1994].

    Improved algorithms for string searching problems

    We present improved practically efficient algorithms for several string searching problems, where we search for a short string called the pattern in a longer string called the text. We are mainly interested in the online problem, where the text is not preprocessed, but we also present a light indexing approach to speed up exact searching of a single pattern. The new algorithms can be applied e.g. to many problems in bioinformatics and other content scanning and filtering problems. In addition to exact string matching, we develop algorithms for several other variations of the string matching problem. We study algorithms for approximate string matching, where a limited number of errors is allowed in the occurrences of the pattern, and parameterized string matching, where a substring of the text matches the pattern if the characters of the substring can be renamed in such a way that the renamed substring matches the pattern exactly. We also consider searching multiple patterns simultaneously and searching weighted patterns, where the weight of a character at a given position reflects the probability of that character occurring at that position. Many of the new algorithms use the backward matching principle, where the characters of the text that are aligned with the pattern are read backward, i.e. from right to left. Another common characteristic of the new algorithms is the use of q-grams, i.e. q consecutive characters are handled as a single character. Many of the new algorithms are bit parallel, i.e. they pack several variables to a single computer word and update all these variables with a single instruction. We show that the q-gram backward string matching algorithms that solve the exact, approximate, or multiple string matching problems are optimal on average. We also show that the q-gram backward string matching algorithm for the parameterized string matching problem is sublinear on average for a class of moderately repetitive patterns. 
All the presented algorithms are also shown to be fast in practice when compared to earlier algorithms. We also propose an alphabet sampling technique to speed up exact string matching. We choose a subset of the alphabet and select the corresponding subsequence of the text. String matching is then performed on this reduced subsequence and the found matches are verified in the original text. We show how to choose the sampled alphabet optimally and show that the technique speeds up string matching especially for moderate to long patterns.
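
The alphabet sampling idea described above can be sketched roughly as follows in Python. This is only an illustration of the filter-then-verify principle under simplifying assumptions: the sampled alphabet is passed in by the caller rather than chosen optimally, the sampled text is searched with Python's built-in `str.find`, and the function name `sampled_search` is hypothetical.

```python
def sampled_search(text, pattern, sample):
    """Exact matching via alphabet sampling: search in the subsequence of
    sampled symbols, then verify each candidate in the original text."""
    # Positions of sampled symbols, and the sampled text and pattern.
    text_pos = [i for i, c in enumerate(text) if c in sample]
    sampled_text = "".join(text[i] for i in text_pos)
    pat_pos = [i for i, c in enumerate(pattern) if c in sample]
    sampled_pat = "".join(pattern[i] for i in pat_pos)

    if not sampled_pat:
        # No sampled symbol in the pattern: fall back to a plain scan.
        return [i for i in range(len(text) - len(pattern) + 1)
                if text.startswith(pattern, i)]

    matches = []
    start = sampled_text.find(sampled_pat)
    while start != -1:
        # Align the first sampled pattern symbol with its text position.
        candidate = text_pos[start] - pat_pos[0]
        if candidate >= 0 and text.startswith(pattern, candidate):
            matches.append(candidate)  # verified in the original text
        start = sampled_text.find(sampled_pat, start + 1)
    return matches
```

With the text "abracadabra", the pattern "abra" and the sampled alphabet {"b", "r"}, the sampled text is "brbr" and both candidates verify, giving the occurrences at positions 0 and 7.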

    Matching records in multiple databases using a hybridization of several technologies.

    A major problem with integrating information from multiple databases is that the same data objects can exist in inconsistent data formats across databases and a variety of attribute variations, making it difficult to identify matching objects using exact string matching. In this research, a variety of models and methods have been developed and tested to alleviate this problem. A major motivation for this research is that the lack of efficient tools for patient record matching still exists for health care providers. This research is focused on the approximate matching of patient records with third party payer databases. This is a major need for all medical treatment facilities and hospitals that try to match patient treatment records with records of insurance companies, Medicare, Medicaid and the veteran's administration. Therefore, the main objectives of this research effort are to provide an approximate matching framework that can draw upon multiple input service databases, construct an identity, and match to third party payers with the highest possible accuracy in object identification and minimal user interactions. This research describes the object identification system framework that has been developed from a hybridization of several technologies, which compares the object's shared attributes in order to identify matching objects. Methodologies and techniques from other fields, such as information retrieval, text correction, and data mining, are integrated to develop a framework to address the patient record matching problem. This research defines the quality of a match in multiple databases by using quality metrics, such as Precision, Recall, and F-measure, which are commonly used in Information Retrieval. The performance of the resulting decision models is evaluated through extensive experiments and found to be very good.
The matching quality performance metrics, such as precision, recall, F-measure, and accuracy, are over 99%, the ROC index is over 99.50%, and mismatching rates are less than 0.18% for each model generated based on different data sets. This research also includes a discussion of the problems in patient record matching, an overview of relevant literature for the record matching problem, and an extensive experimental evaluation of the methodologies that are utilized, such as string similarity functions and machine learning. Finally, potential improvements and extensions to this work are also presented.
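
As a reminder of how the quality metrics named above are usually defined, here is a small Python sketch that computes precision, recall and the balanced F-measure from match counts. The function name and the example counts are hypothetical and are not taken from the study; the abstract does not say which F-measure weighting was used, so the common balanced form (F1) is assumed.

```python
def match_quality(true_pos, false_pos, false_neg):
    """Precision, recall and balanced F-measure for a record matcher.

    true_pos:  record pairs correctly declared a match
    false_pos: record pairs wrongly declared a match
    false_neg: true matches that the matcher missed
    """
    declared = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / declared if declared else 0.0
    recall = true_pos / actual if actual else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure
```

With 90 correct matches, 10 wrong matches and 30 missed matches, this gives a precision of 0.9 and a recall of 0.75.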

    OCRspell: An interactive spelling correction system for OCR errors in text

    In this thesis we describe a spelling correction system designed specifically for OCR (Optical Character Recognition) generated text that selects candidate words through the use of information gathered from multiple knowledge sources. This system for text correction is based on static and dynamic device mappings, approximate string matching, and n-gram analysis. Our statistically based, Bayesian system incorporates a learning feature that collects confusion information at the collection and document levels. An evaluation of the new system is presented as well.

    The Hybrid of Jaro-Winkler and Rabin-Karp Algorithm in Detecting Indonesian Text Similarity

    String matching is one family of similarity techniques and can be used to measure how similar two texts are. Rabin-Karp is a string-matching algorithm that is capable of searching for multiple patterns but is less suited to single-pattern search. The Jaro-Winkler distance is an approximate string matching measure that works best when comparing two short strings. This study aims to overcome the shortcomings of the Rabin-Karp algorithm in single-pattern search by combining the Jaro-Winkler and Rabin-Karp algorithms. The combined process starts with pre-processing and forming the K-Gram data, followed by calculating the hash value of each K-Gram with the Rabin-Karp algorithm. The Jaro-Winkler algorithm is then used to find equal hash values and to calculate the percentage of similarity. The test compared words, sentences, and journal abstracts that had been rearranged. For words, the average similarity percentage obtained with the combined algorithm increased. In contrast, the similarity percentages for sentences and journal abstracts decreased. The experimental results showed that combining the Jaro-Winkler algorithm with the Rabin-Karp algorithm can improve the accuracy of text similarity detection.
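
The K-Gram hashing step mentioned in the abstract can be illustrated with a standard Rabin-Karp rolling hash, sketched below in Python. This covers only the hashing stage; the abstract does not specify how the hashes are fed into Jaro-Winkler, so that part is left out. The function name and the `base` and `mod` values are assumptions for the example, not parameters from the study.

```python
def kgram_hashes(text, k, base=256, mod=1_000_003):
    """Rabin-Karp rolling hashes of every k-gram of text, left to right."""
    n = len(text)
    if k <= 0 or n < k:
        return []
    high = pow(base, k - 1, mod)  # weight of the leading character
    h = 0
    for c in text[:k]:            # hash of the first k-gram
        h = (h * base + ord(c)) % mod
    hashes = [h]
    for i in range(k, n):
        # Drop the outgoing character, shift, and add the incoming one.
        h = ((h - ord(text[i - k]) * high) * base + ord(text[i])) % mod
        hashes.append(h)
    return hashes
```

Equal k-grams always receive equal hashes, so in "abcabc" with k = 3 the first and last hashes coincide. Distinct k-grams can also collide, which is why Rabin-Karp style methods normally verify candidate matches on the underlying characters.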