
    An Algorithm for Matching Heterogeneous Financial Databases: a Case Study for COMPUSTAT/CRSP and I/B/E/S Databases

    Rigorous and proper linking of financial databases is a necessary step to test trading strategies incorporating multimodal sources of information. This paper proposes a machine learning solution to match companies in heterogeneous financial databases. Our method, named Financial Attribute Selection Distance (FASD), has two stages, each corresponding to one of the two interrelated tasks commonly involved in heterogeneous database matching problems: schema matching and entity matching. FASD's schema matching procedure is based on the Kullback-Leibler divergence of string and numeric attributes. FASD's entity matching solution relies on learning a company distance flexible enough to deal with the numeric and string attribute links found by the schema matching algorithm and to incorporate different string matching approaches such as edit-based and token-based metrics. The parameters of the distance are optimized using the F-score as the cost function. FASD is able to match the joint Compustat/CRSP and Institutional Brokers' Estimate System (I/B/E/S) databases with an F-score over 0.94 using only about a hundred manually labeled company links.
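    The entity-matching idea can be illustrated with a minimal sketch: a learned convex combination of an edit-based and a token-based string similarity between company names, with the weight and decision threshold tuned on labeled links by maximising the F-score. All function names and the particular metrics below are illustrative assumptions, not the paper's actual FASD implementation.

    # Hypothetical sketch of FASD-style entity matching: a weighted company distance
    # combining an edit-based and a token-based string metric. Names are illustrative.

    def levenshtein(a: str, b: str) -> int:
        """Classic edit distance (insert/delete/substitute, unit costs)."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                  # deletion
                               cur[j - 1] + 1,               # insertion
                               prev[j - 1] + (ca != cb)))    # substitution
            prev = cur
        return prev[-1]

    def edit_sim(a: str, b: str) -> float:
        """Edit-based similarity normalised to [0, 1]."""
        m = max(len(a), len(b)) or 1
        return 1.0 - levenshtein(a, b) / m

    def token_sim(a: str, b: str) -> float:
        """Token-based (Jaccard) similarity on lower-cased whitespace tokens."""
        ta, tb = set(a.lower().split()), set(b.lower().split())
        return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

    def company_distance(a: str, b: str, w_edit: float = 0.5) -> float:
        """Convex combination of the two metrics; w_edit (and the match threshold)
        would be tuned on labeled company links by maximising the F-score."""
        sim = w_edit * edit_sim(a, b) + (1 - w_edit) * token_sim(a, b)
        return 1.0 - sim

    # A candidate pair is declared a match when the distance falls below the tuned threshold.
    print(company_distance("Intl Business Machines Corp", "International Business Machines"))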

    Improved bounds for testing Dyck languages

    In this paper we consider the problem of deciding membership in Dyck languages, a fundamental family of context-free languages comprised of well-balanced strings of parentheses. In this problem we are given a string of length $n$ over an alphabet of parentheses of $m$ types and must decide if it is well-balanced. We consider this problem in the property testing setting, where one would like to make the decision while querying as few characters of the input as possible. Property testing of strings for Dyck language membership for $m=1$, with a number of queries independent of the input size $n$, was provided in [Alon, Krivelevich, Newman and Szegedy, SICOMP 2001]. Property testing of strings for Dyck language membership for $m \ge 2$ was first investigated in [Parnas, Ron and Rubinfeld, RSA 2003]. They showed an upper bound and a lower bound for distinguishing strings belonging to the language from strings that are far (in terms of the Hamming distance) from the language, which are respectively (up to polylogarithmic factors) the $2/3$ power and the $1/11$ power of the input size $n$. Here we improve the power of $n$ in both bounds. For the upper bound, we introduce a recursion technique that, together with a refinement of the methods in the original work, provides a test for any power of $n$ larger than $2/5$. For the lower bound, we introduce a new problem called Truestring Equivalence, which is easily reducible to the $2$-type Dyck language property testing problem. For this new problem, we show a lower bound of $n$ to the power of $1/5$.
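    For reference, the underlying decision problem (as opposed to the sublinear-query property test studied in the paper) is the standard stack check over $m$ parenthesis types. The encoding of the alphabet below is an illustrative choice.

    # Exact, linear-time membership check for the m-type Dyck language; the paper's
    # contribution is a property tester that reads only a small fraction of the input.

    def is_dyck(word: list[tuple[str, int]]) -> bool:
        """word is a sequence of ('open', t) / ('close', t) symbols, t = parenthesis type."""
        stack = []
        for kind, t in word:
            if kind == 'open':
                stack.append(t)
            elif not stack or stack.pop() != t:   # closing with no match, or wrong type
                return False
        return not stack                          # every opened parenthesis must be closed

    # "([])" is well-balanced, "([)]" is not:
    print(is_dyck([('open', 0), ('open', 1), ('close', 1), ('close', 0)]))  # True
    print(is_dyck([('open', 0), ('open', 1), ('close', 0), ('close', 1)]))  # False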

    Automated census record linking: a machine learning approach

    Thanks to the availability of new historical census sources and advances in record linking technology, economic historians are becoming big data genealogists. Linking individuals over time and between databases has opened up new avenues for research into intergenerational mobility, assimilation, discrimination, and the returns to education. To take advantage of these new research opportunities, scholars need to be able to accurately and efficiently match historical records and produce an unbiased dataset of links for downstream analysis. I detail a standard and transparent census matching technique for constructing linked samples that can be replicated across a variety of cases. The procedure applies insights from machine learning classification and text comparison to the well-known problem of record linkage, but with a focus on the particular costs and benefits of working with historical data. I begin by extracting a subset of possible matches for each record, and then use training data to tune a matching algorithm that attempts to minimize both false positives and false negatives, taking into account the inherent noise in historical records. To make the procedure precise, I trace its application to an example from my own work, linking children from the 1915 Iowa State Census to their adult selves in the 1940 Federal Census. In addition, I provide guidance on a number of practical questions, including how large the training data needs to be relative to the sample. This research has been supported by the NSF-IGERT Multidisciplinary Program in Inequality & Social Policy at Harvard University (Grant No. 0333403).
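    The two-stage structure (candidate extraction, then a tuned matching rule) can be sketched as follows. The blocking rule, field names, similarity feature, and thresholds are all illustrative assumptions, not the paper's actual procedure, which uses a trained classifier over several comparison features.

    # Hypothetical sketch: (1) extract a small set of candidate matches per record by
    # blocking, (2) score candidates and accept a link only if the best score clears a
    # threshold tuned on labeled training links to balance false positives and negatives.

    from difflib import SequenceMatcher

    def block_candidates(record, census_1940):
        """Stage 1: keep only plausible candidates (same surname initial,
        birth year within +/- 2), so the expensive comparison stays tractable."""
        return [c for c in census_1940
                if c["surname"][:1] == record["surname"][:1]
                and abs(c["birth_year"] - record["birth_year"]) <= 2]

    def score(record, candidate):
        """Stage 2: a single name-similarity feature; a real pipeline would
        combine several features in a trained classifier."""
        name_a = record["first_name"] + " " + record["surname"]
        name_b = candidate["first_name"] + " " + candidate["surname"]
        return SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()

    def link(record, census_1940, threshold=0.9, margin=0.05):
        """Accept the best candidate only if it clears the tuned threshold and beats
        the runner-up by a margin, to avoid ambiguous links."""
        cands = block_candidates(record, census_1940)
        if not cands:
            return None
        ranked = sorted(cands, key=lambda c: score(record, c), reverse=True)
        best_score = score(record, ranked[0])
        runner_up = score(record, ranked[1]) if len(ranked) > 1 else 0.0
        if best_score >= threshold and best_score - runner_up > margin:
            return ranked[0]
        return None

    # Toy usage with made-up records:
    toy_1940 = [{"first_name": "John", "surname": "Smith", "birth_year": 1908},
                {"first_name": "Mary", "surname": "Sullivan", "birth_year": 1909}]
    child_1915 = {"first_name": "John", "surname": "Smith", "birth_year": 1909}
    print(link(child_1915, toy_1940))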

    Critical couplings and string tensions via lattice matching of RG decimations

    We calculate critical couplings and string tensions in SU(2) and SU(3) pure lattice gauge theory by a simple and inexpensive technique of two-lattice matching of RG block transformations. The transformations are potential-moving decimations generating plaquette actions with a large number of group characters and exhibit rapid approach to a unique renormalized trajectory. Fixing the critical coupling $\beta_c(N_\tau)$ at one value of the temporal lattice length $N_\tau$ by MC simulation, the critical couplings for any other value of $N_\tau$ are then obtained by lattice matching of the block decimations. We obtain $\beta_c(N_\tau)$ values over the range $N_\tau = 3$ to $32$ and find agreement with MC simulation results to within a few percent in all cases. A similar procedure allows the calculation of string tensions with similarly good agreement with MC data. Comment: 12 pages, LaTeX, 1 figure.
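    Schematically, and under the usual two-lattice matching conventions (the paper's specific decimation scheme, blocking factor $b$, and observables are not reproduced here; $W_C$ and $R$ below are illustrative notation), the matching condition reads

    \[
      \langle W_C \rangle^{(L)}_{\beta,\ n\ \text{decimations}}
      \;=\;
      \langle W_C \rangle^{(L/b)}_{\beta',\ n-1\ \text{decimations}}
      \quad \text{for a set of loops } C
      \qquad\Longrightarrow\qquad
      \beta' = R(\beta).
    \]

    Since the physical critical temperature satisfies $T_c = 1/\bigl(a(\beta_c)\,N_\tau\bigr)$, fixing $\beta_c(N_\tau)$ by MC simulation at a single $N_\tau$ and applying the matched flow $R$ then relates it to the critical coupling at other temporal extents, e.g. $\beta_c(N_\tau) \approx R\bigl(\beta_c(b\,N_\tau)\bigr)$ for scale factor $b$.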

    Near-Linear Time Insertion-Deletion Codes and $(1+\varepsilon)$-Approximating Edit Distance via Indexing

    We introduce fast-decodable indexing schemes for edit distance which can be used to speed up edit distance computations to near-linear time if one of the strings is indexed by an indexing string $I$. In particular, for every length $n$ and every $\varepsilon > 0$, one can in near-linear time construct a string $I \in \Sigma'^n$ with $|\Sigma'| = O_{\varepsilon}(1)$, such that, indexing any string $S \in \Sigma^n$, symbol-by-symbol, with $I$ results in a string $S' \in \Sigma''^n$, where $\Sigma'' = \Sigma \times \Sigma'$, for which edit distance computations are easy, i.e., one can compute a $(1+\varepsilon)$-approximation of the edit distance between $S'$ and any other string in $O(n\,\mathrm{poly}(\log n))$ time. Our indexing schemes can be used to improve the decoding complexity of state-of-the-art error correcting codes for insertions and deletions. In particular, they lead to near-linear time decoding algorithms for the insertion-deletion codes of [Haeupler, Shahrasbi; STOC `17] and faster decoding algorithms for list-decodable insertion-deletion codes of [Haeupler, Shahrasbi, Sudan; ICALP `18]. Interestingly, the latter codes are a crucial ingredient in the construction of fast-decodable indexing schemes.
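    The indexing operation itself is just symbol-by-symbol pairing over the product alphabet, as the toy sketch below shows; how the indexing string $I$ is constructed, and why it makes $(1+\varepsilon)$-approximate edit distance computable in near-linear time, is the paper's contribution and is not reproduced here. The toy indexing string used in the example is purely illustrative.

    # Pairing each symbol of S with the corresponding symbol of an indexing string I
    # gives S' over the product alphabet Sigma x Sigma'.

    def index_string(s: str, idx: str) -> list[tuple[str, str]]:
        """S'[i] = (S[i], I[i]); both strings must have the same length n."""
        assert len(s) == len(idx)
        return list(zip(s, idx))

    s_prime = index_string("banana", "012012")   # toy indexing string
    print(s_prime)   # [('b', '0'), ('a', '1'), ('n', '2'), ...]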

    Finding approximate palindromes in strings

    We introduce a novel definition of approximate palindromes in strings, and provide an algorithm to find all maximal approximate palindromes in a string with up to $k$ errors. Our definition is based on the usual edit operations of approximate pattern matching, and the algorithm we give, for a string of size $n$ on a fixed alphabet, runs in $O(k^2 n)$ time. We also discuss two implementation-related improvements to the algorithm, and demonstrate their efficacy in practice by means of both experiments and an average-case analysis.
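    For intuition only, a brute-force baseline is sketched below. It uses one common proxy notion, treating a substring as a $k$-approximate palindrome when its edit distance to its own reversal is at most $k$; this is not necessarily the paper's novel definition, it reports all qualifying substrings rather than only maximal ones, and its running time is far worse than the paper's $O(k^2 n)$ algorithm.

    # Naive scan over all substrings; purely an illustration of the notion being computed.

    def edit_distance(a: str, b: str) -> int:
        """Standard Levenshtein DP (insertions, deletions, substitutions)."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    def approximate_palindromes(s: str, k: int, min_len: int = 3):
        """Yield (start, end) of substrings within edit distance k of their reversal."""
        n = len(s)
        for i in range(n):
            for j in range(i + min_len, n + 1):
                sub = s[i:j]
                if edit_distance(sub, sub[::-1]) <= k:
                    yield i, j

    print(list(approximate_palindromes("abcxcba", 1)))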