    Exact string matching algorithms : survey, issues, and future research directions

    String matching has been an extensively studied research domain in the past two decades due to its various applications in the fields of text, image, signal, and speech processing. As a result, choosing an appropriate string matching algorithm for current applications and addressing challenges is difficult. Understanding different string matching approaches (such as exact string matching and approximate string matching algorithms), integrating several algorithms, and modifying algorithms to address related issues are also difficult. This paper presents a survey on single-pattern exact string matching algorithms. The main purpose of this survey is to propose a new classification, identify new directions, and highlight the possible challenges, current trends, and future work in the area of string matching algorithms, with a core focus on exact string matching algorithms. © 2013 IEEE

    Improving Database Quality through Eliminating Duplicate Records

    Redundant or duplicate data are the most troublesome problem in database management and applications. Approximate field matching is the key solution to the problem: it identifies semantically equivalent string values in syntactically different representations. This paper considers token-based solutions and proposes a general field matching framework to generalize the field matching problem in different domains. By introducing the concept of String Matching Points (SMP) in string comparison, string matching accuracy and efficiency are improved compared with other commonly applied field matching algorithms. The paper discusses the development of field matching algorithms from the general framework. The framework and corresponding algorithm are tested on a public data set of the NASA publication abstract database. The approach can be applied to similar problems in other databases.

    Matroid Online Bipartite Matching and Vertex Cover

    The Adwords and Online Bipartite Matching problems have enjoyed renewed attention over the past decade due to their connection to Internet advertising. Our community has contributed, among other things, new models (notably stochastic) and extensions to the classical formulations to address the issues that arise from practical needs. In this paper, we propose a new generalization based on matroids and show that many of the previous results extend to this more general setting. Because of the rich structures and expressive power of matroids, our new setting is potentially of interest both in theory and in practice. In the classical version of the problem, the offline side of a bipartite graph is known initially while vertices from the online side arrive one at a time along with their incident edges. The objective is to maintain a decent approximate matching from which no edge can be removed. Our generalization, called Matroid Online Bipartite Matching, additionally requires that the set of matched offline vertices be independent in a given matroid. In particular, the case of partition matroids corresponds to the natural scenario where each advertiser manages multiple ads with a fixed total budget. Our algorithms attain the same performance as the classical version of the problems considered, which are often provably the best possible. We present (1 − 1/e)-competitive algorithms for Matroid Online Bipartite Matching under the small bid assumption, as well as a (1 − 1/e)-competitive algorithm for Matroid Online Bipartite Matching in the random arrival model. A key technical ingredient of our results is a carefully designed primal-dual waterfilling procedure that accommodates matroid constraints. This is inspired by the extension of our recent charging scheme for Online Bipartite Vertex Cover.
    Comment: 19 pages, to appear in EC'1
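To make the problem setting concrete, the following is a minimal sketch of a plain greedy rule for online matching under a partition matroid. It is not the primal-dual waterfilling procedure from the abstract and carries none of its competitive guarantees. The names `greedy_online_matching`, `block_of`, and `capacity` are hypothetical, and treating each partition block as one advertiser with a cap on matched ads is an assumption made for illustration.

```python
def greedy_online_matching(arrivals, block_of, capacity):
    """Greedy online matching under a partition-matroid constraint.

    arrivals : list of (online_vertex, [offline neighbours]) in arrival order
    block_of : dict mapping each offline vertex to its partition block
    capacity : dict mapping each block to the max number of matched vertices

    Returns a dict offline_vertex -> online_vertex. Decisions are
    irrevocable: a matched edge is never removed later.
    """
    matched = {}  # offline vertex -> online vertex it is matched to
    used = {}     # block -> number of matched offline vertices in that block
    for online_vertex, neighbours in arrivals:
        for v in neighbours:
            block = block_of[v]
            # Independence check for a partition matroid: the matched set
            # may contain at most capacity[block] vertices from each block.
            if v not in matched and used.get(block, 0) < capacity[block]:
                matched[v] = online_vertex
                used[block] = used.get(block, 0) + 1
                break  # match the arriving vertex at most once
    return matched


if __name__ == "__main__":
    arrivals = [
        ("u1", ["a1", "a2"]),
        ("u2", ["a1", "a2"]),
        ("u3", ["a2", "b1"]),
    ]
    block_of = {"a1": "A", "a2": "A", "b1": "B"}
    capacity = {"A": 1, "B": 1}
    print(greedy_online_matching(arrivals, block_of, capacity))
```

In this example `u2` stays unmatched even though `a2` is free, because block `A` has already reached its capacity; this is the extra restriction the matroid constraint adds to the classical problem.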

    Representing Population Dynamics from Administrative and Consumer Registers

    This research attempts to derive representative metrics of household dynamics and migration by analysing changes between two annual composite registers of the UK population. Through appropriate data cleaning and linkage techniques, it is possible to match addresses and record changes in their size and composition over a two-year period. The paper also demonstrates that it is feasible to approximate migration trends by filtering and matching records of household units and individuals who are not recorded at the same address in both datasets.

    Lossless seeds for searching short patterns with high error rates

    We address the problem of approximate pattern matching using the Levenshtein distance. Given a text T and a pattern P, find all locations in T that differ by at most k errors from P. For that purpose, we propose a filtration algorithm that is based on a novel type of seeds, combining exact parts and parts with a fixed number of errors. Experimental tests show that the method is specifically well-suited for short patterns with a large number of errors.
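The problem statement above (find all locations in T within k errors of P) can be illustrated with the classical dynamic-programming baseline usually attributed to Sellers. This is a sketch of the problem definition only, not the seed-based filtration algorithm the abstract proposes. The function name `approx_find` and the choice to report 1-based end positions are assumptions made for illustration.

```python
def approx_find(text: str, pattern: str, k: int) -> list[int]:
    """Report every end position j (1-based) such that some substring of
    `text` ending at j is within Levenshtein distance k of `pattern`.

    Runs in O(len(text) * len(pattern)) time using one DP column.
    """
    m = len(pattern)
    # prev[i] = min edit distance between pattern[:i] and any substring
    # of the text ending at the previous text position.
    prev = list(range(m + 1))
    ends = []
    for j, c in enumerate(text, start=1):
        # curr[0] = 0 because a match may start at any text position.
        curr = [0] * (m + 1)
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == c else 1
            curr[i] = min(
                prev[i - 1] + cost,  # match or substitution
                prev[i] + 1,         # extra character in the text
                curr[i - 1] + 1,     # pattern character left unmatched
            )
        if curr[m] <= k:
            ends.append(j)
        prev = curr
    return ends


if __name__ == "__main__":
    # "cde" (ending at position 5) differs from "cxe" by one substitution.
    print(approx_find("abcdefg", "cxe", 1))
```

Filtration methods such as the one described in the abstract aim to avoid running this full computation over the whole text by first discarding regions that cannot contain a match.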

    Linking Datasets on Organizations Using Half A Billion Open Collaborated Records

    Scholars studying organizations often work with multiple datasets lacking shared unique identifiers or covariates. In such situations, researchers may turn to approximate string matching methods to combine datasets. String matching, although useful, faces fundamental challenges. Even when two strings appear similar to humans, fuzzy matching often does not work because it fails to adapt to the informativeness of the character combinations presented. Worse, many entities have multiple names that are dissimilar (e.g., "Fannie Mae" and "Federal National Mortgage Association"), a case where string matching has little hope of succeeding. This paper introduces data from a prominent employment-related networking site (LinkedIn) as a tool to address these problems. We propose interconnected approaches to leveraging the massive amount of information from LinkedIn regarding organizational name-to-name links. The first approach builds a machine learning model for predicting matches from character strings, treating the trillions of user-contributed organizational name pairs as a training corpus: this approach constructs a string matching metric that explicitly maximizes match probabilities. A second approach identifies relationships between organization names using network representations of the LinkedIn data. A third approach combines the first and second. We document substantial improvements over fuzzy matching in applications, making all methods accessible in open-source software ("LinkOrgs").