Exact string matching algorithms: survey, issues, and future research directions
String matching has been an extensively studied research domain in the past two decades due to its various applications in the fields of text, image, signal, and speech processing. As a result, choosing an appropriate string matching algorithm for current applications and addressing challenges is difficult. Understanding different string matching approaches (such as exact string matching and approximate string matching algorithms), integrating several algorithms, and modifying algorithms to address related issues are also difficult. This paper presents a survey on single-pattern exact string matching algorithms. The main purpose of this survey is to propose new classification, identify new directions and highlight the possible challenges, current trends, and future works in the area of string matching algorithms with a core focus on exact string matching algorithms. © 2013 IEEE
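As a point of reference for what single-pattern exact string matching means, here is a minimal sketch of one classical algorithm, Knuth-Morris-Pratt. It is chosen only for illustration; the abstract does not say which algorithms the survey covers or how it classifies them, and the function name `kmp_search` is hypothetical.

```python
def kmp_search(text: str, pattern: str) -> list[int]:
    """Return the start index of every occurrence of pattern in text."""
    if not pattern:
        return []

    # failure[i] is the length of the longest proper prefix of
    # pattern[:i + 1] that is also a suffix of it.
    failure = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = failure[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k

    # Scan the text once, reusing the failure table on a mismatch
    # instead of restarting the comparison from scratch.
    matches = []
    k = 0
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = failure[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = failure[k - 1]  # allow overlapping occurrences
    return matches


print(kmp_search("abracadabra", "abra"))
```

The scan never moves backwards in the text, so the running time is linear in the combined length of text and pattern.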
Improving Database Quality through Eliminating Duplicate Records
Redundant or duplicate data are the most troublesome problem in database management and applications. Approximate field matching is the key solution to resolve the problem by identifying semantically equivalent string values in syntactically different representations. This paper considers token-based solutions and proposes a general field matching framework to generalize the field matching problem in different domains. By introducing a concept of String Matching Points (SMP) in string comparison, string matching accuracy and efficiency are improved, compared with other commonly applied field matching algorithms. The paper discusses the development of field matching algorithms from the developed general framework. The framework and corresponding algorithm are tested on a public data set of the NASA publication abstract database. The approach can be applied to address similar problems in other databases.
Matroid Online Bipartite Matching and Vertex Cover
The Adwords and Online Bipartite Matching problems have enjoyed a renewed
attention over the past decade due to their connection to Internet advertising.
Our community has contributed, among other things, new models (notably
stochastic) and extensions to the classical formulations to address the issues
that arise from practical needs. In this paper, we propose a new generalization
based on matroids and show that many of the previous results extend to this
more general setting. Because of the rich structures and expressive power of
matroids, our new setting is potentially of interest both in theory and in
practice.
In the classical version of the problem, the offline side of a bipartite
graph is known initially while vertices from the online side arrive one at a
time along with their incident edges. The objective is to maintain a decent
approximate matching from which no edge can be removed. Our generalization,
called Matroid Online Bipartite Matching, additionally requires that the set of
matched offline vertices be independent in a given matroid. In particular, the
case of partition matroids corresponds to the natural scenario where each
advertiser manages multiple ads with a fixed total budget.
Our algorithms attain the same performance as the classical version of the
problems considered, which are often provably the best possible. We present
-competitive algorithms for Matroid Online Bipartite Matching under the
small bid assumption, as well as a -competitive algorithm for Matroid
Online Bipartite Matching in the random arrival model. A key technical
ingredient of our results is a carefully designed primal-dual waterfilling
procedure that accommodates matroid constraints. This is inspired by the
extension of our recent charging scheme for Online Bipartite Vertex Cover.
Comment: 19 pages, to appear in EC'1
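To make the problem setting concrete, the following sketch shows a plain greedy rule for online bipartite matching in which the matched offline vertices must stay independent in a partition matroid. This is not the primal-dual waterfilling procedure of the paper and carries none of its competitive guarantees; it only illustrates the constraint. The names `greedy_matroid_matching`, `group_of`, and `capacity` are hypothetical.

```python
from collections import defaultdict


def greedy_matroid_matching(arrivals, group_of, capacity):
    """Greedy online matching under a partition matroid constraint.

    arrivals: list of (online_vertex, offline_neighbors) in arrival order.
    group_of: maps each offline vertex to its part (e.g. an advertiser).
    capacity: maps each part to the maximum number of its offline
              vertices that may be matched at once.
    Returns a dict mapping matched offline vertices to online vertices.
    """
    matched = {}             # offline vertex -> online vertex
    used = defaultdict(int)  # part -> number of matched offline vertices

    for online, neighbors in arrivals:
        for offline in neighbors:
            part = group_of[offline]
            # The matched set stays independent only if the offline vertex
            # is free and its part still has spare capacity.
            if offline not in matched and used[part] < capacity[part]:
                matched[offline] = online
                used[part] += 1
                break  # decisions are irrevocable; move to the next arrival
    return matched


arrivals = [("u1", ["a1", "a2"]), ("u2", ["a2"]), ("u3", ["b1"])]
group_of = {"a1": "A", "a2": "A", "b1": "B"}
capacity = {"A": 1, "B": 1}
print(greedy_matroid_matching(arrivals, group_of, capacity))
```

In this toy instance `u2` stays unmatched even though `a2` is free, because part `A` has already used its single slot.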
Representing Population Dynamics from Administrative and Consumer Registers
This research attempts to derive representative metrics of household dynamics and migration by analysing changes between two annual composite registers of the UK population. Through appropriate data cleaning and linkage techniques, it is possible to match addresses and record changes in their size and composition over a two-year period. The paper also demonstrates that it is feasible to approximate migration trends by filtering and matching records of household units and individuals who are not recorded at the same address in both datasets.
Lossless seeds for searching short patterns with high error rates
We address the problem of approximate pattern matching using the Levenshtein distance. Given a text T and a pattern P, find all locations in T that differ by at most k errors from P. For that purpose, we propose a filtration algorithm that is based on a novel type of seed, combining exact parts and parts with a fixed number of errors. Experimental tests show that the method is specifically well-suited for short patterns with a large number of errors.
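The problem statement above (all locations in T within k errors of P) can be illustrated with the classical dynamic-programming baseline, often attributed to Sellers. This is not the seed-based filtration method the abstract proposes; it is only a sketch of the task that such a filter speeds up. The function name `approx_match_ends` is hypothetical, and reporting end positions rather than start positions is a choice made here for simplicity.

```python
def approx_match_ends(text: str, pattern: str, k: int) -> list[int]:
    """Return every index j such that some substring of text ending at j
    is within Levenshtein distance k of pattern."""
    m = len(pattern)
    # col[i] holds the smallest edit distance between pattern[:i] and any
    # substring of text ending at the current position. Before any text
    # is read, matching pattern[:i] costs i deletions.
    col = list(range(m + 1))
    ends = []
    for j, ch in enumerate(text):
        prev_diag = col[0]
        col[0] = 0  # a match may start anywhere in the text for free
        for i in range(1, m + 1):
            old = col[i]
            cost = 0 if pattern[i - 1] == ch else 1
            col[i] = min(prev_diag + cost,  # match or substitution
                         old + 1,           # extra text character
                         col[i - 1] + 1)    # missing pattern character
            prev_diag = old
        if col[m] <= k:
            ends.append(j)
    return ends


print(approx_match_ends("abxd", "abcd", 1))
```

This baseline takes time proportional to the product of the two lengths, which is the cost that filtration approaches aim to avoid on most of the text.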
Linking Datasets on Organizations Using Half A Billion Open Collaborated Records
Scholars studying organizations often work with multiple datasets lacking
shared unique identifiers or covariates. In such situations, researchers may
turn to approximate string matching methods to combine datasets. String
matching, although useful, faces fundamental challenges. Even when two strings
appear similar to humans, fuzzy matching often does not work because it fails
to adapt to the informativeness of the character combinations presented. Worse,
many entities have multiple names that are dissimilar (e.g., "Fannie Mae" and
"Federal National Mortgage Association"), a case where string matching has
little hope of succeeding. This paper introduces data from a prominent
employment-related networking site (LinkedIn) as a tool to address these
problems. We propose interconnected approaches to leveraging the massive amount
of information from LinkedIn regarding organizational name-to-name links. The
first approach builds a machine learning model for predicting matches from
character strings, treating the trillions of user-contributed organizational
name pairs as a training corpus: this approach constructs a string matching
metric that explicitly maximizes match probabilities. A second approach
identifies relationships between organization names using network
representations of the LinkedIn data. A third approach combines the first and
second. We document substantial improvements over fuzzy matching in
applications, making all methods accessible in open-source software
("LinkOrgs").