A Comparison of Blocking Methods for Record Linkage
Record linkage seeks to merge databases and to remove duplicates when unique
identifiers are not available. Most approaches use blocking techniques to
reduce the computational complexity associated with record linkage. We review
traditional blocking techniques, which typically partition the records
according to a set of field attributes, and consider two variants of a method
known as locality sensitive hashing, sometimes referred to as "private
blocking." We compare these approaches in terms of their recall, reduction
ratio, and computational complexity. We evaluate these methods using different
synthetic datafiles and conclude with a discussion of privacy-related issues. Comment: 22 pages, 2 tables, 7 figures
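The traditional blocking and the two evaluation measures named in this abstract can be illustrated with a minimal Python sketch. It assumes the usual definitions: reduction ratio is one minus the fraction of all record pairs kept as candidates, and recall is the fraction of true matching pairs that survive blocking. The toy records, the field names (`last`, `zip`), and all function names are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from itertools import combinations


def block_records(records, key_func):
    """Partition record ids into blocks that share the same blocking key."""
    blocks = defaultdict(list)
    for rec_id, rec in records.items():
        blocks[key_func(rec)].append(rec_id)
    return blocks


def candidate_pairs(blocks):
    """Only records inside the same block are compared with each other."""
    pairs = set()
    for ids in blocks.values():
        for a, b in combinations(sorted(ids), 2):
            pairs.add((a, b))
    return pairs


def reduction_ratio(n_records, pairs):
    """1 - (candidate pairs / all possible pairs)."""
    total = n_records * (n_records - 1) // 2
    return 1 - len(pairs) / total


def recall(pairs, true_matches):
    """Fraction of true matching pairs kept by the blocking step."""
    return len(pairs & true_matches) / len(true_matches)


# Hypothetical toy data: records 1/2 and 3/4 refer to the same entity.
records = {
    1: {"last": "Smith", "zip": "27708"},
    2: {"last": "Smyth", "zip": "27708"},
    3: {"last": "Jones", "zip": "15213"},
    4: {"last": "Jones", "zip": "15213"},
    5: {"last": "Brown", "zip": "94305"},
}
true_matches = {(1, 2), (3, 4)}

zip_pairs = candidate_pairs(block_records(records, lambda r: r["zip"]))
name_pairs = candidate_pairs(block_records(records, lambda r: r["last"]))

print(reduction_ratio(len(records), zip_pairs), recall(zip_pairs, true_matches))
print(reduction_ratio(len(records), name_pairs), recall(name_pairs, true_matches))
```

On this toy data, blocking on the exact last name discards more pairs but loses the Smith/Smyth match, which is the recall versus reduction-ratio trade-off the abstract refers to.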
Dynamic sorted neighborhood indexing for real-time entity resolution
Real-time Entity Resolution (ER) is the process of matching query records in subsecond time with records in a database that represent the same real-world entity. Indexing techniques are generally used to efficiently extract a set of candidate records from the database that are similar to a query record, and that are to be compared with the query record in more detail. The sorted neighborhood indexing method, which sorts a database and compares records within a sliding window, has been successfully used for ER of large static databases. However, because it is based on static sorted arrays and is designed for batch ER that resolves all records in a database rather than resolving those relating to a single query record, this technique is not suitable for real-time ER on dynamic databases that are constantly updated. We propose a tree-based technique that facilitates dynamic indexing based on the sorted neighborhood method, which can be used for real-time ER, and investigate both static and adaptive window approaches. We propose an approach to reduce query matching times by precalculating the similarities between attribute values stored in neighboring tree nodes. We also propose a multitree solution in which several distinct index trees are built with different sorting keys, to reduce the effects of errors and variations in attribute values on matching quality. We experimentally evaluate our proposed techniques on large real datasets, as well as on synthetic data with different data quality characteristics. Our results show that as the index grows, neither record insertion times nor query times increase appreciably, and that using multiple trees gives noticeable improvements in matching quality with only a small increase in query time. Compared to earlier indexing techniques for real-time ER, our approach achieves significantly reduced indexing and query matching times while maintaining high matching accuracy.
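For orientation, the classic static sorted neighborhood method that this abstract describes as background (sort the records, then compare only records within a sliding window) can be sketched in Python as follows. This is the batch version on a sorted array, not the tree-based dynamic index the authors propose; the toy records, the sorting key, and the function name are hypothetical.

```python
def sorted_neighborhood_pairs(records, sort_key, window=3):
    """Sort record ids by a sorting key and pair each record with the
    next (window - 1) records in sorted order."""
    ordered = sorted(records, key=lambda rid: sort_key(records[rid]))
    pairs = set()
    for i, rid in enumerate(ordered):
        # Slice covers the neighbors that share a window with position i.
        for other in ordered[i + 1 : i + window]:
            pairs.add((min(rid, other), max(rid, other)))
    return pairs


# Hypothetical toy data keyed by record id.
records = {
    1: {"name": "smith"},
    2: {"name": "smyth"},
    3: {"name": "jones"},
    4: {"name": "jonas"},
    5: {"name": "brown"},
}

# Sorted order by name is: brown(5), jonas(4), jones(3), smith(1), smyth(2).
pairs_w2 = sorted_neighborhood_pairs(records, lambda r: r["name"], window=2)
pairs_w3 = sorted_neighborhood_pairs(records, lambda r: r["name"], window=3)
print(sorted(pairs_w2))
print(sorted(pairs_w3))
```

A larger window yields more candidate pairs and therefore more comparisons. Because the sorted array has to be rebuilt or shifted whenever a record is inserted, this static form fits batch ER better than the constantly updated databases the abstract targets.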
Allosteric control of cyclic di-GMP signaling
Cyclic di-guanosine monophosphate (c-di-GMP) is a bacterial second messenger that has been implicated in biofilm formation, antibiotic resistance, and persistence of pathogenic bacteria in their animal host. Although the enzymes responsible for the regulation of cellular levels of c-di-GMP, diguanylate cyclases (DGC) and phosphodiesterases, have been identified recently, little information is available on the molecular mechanisms involved in controlling the activity of these key enzymes or on the specific interactions of c-di-GMP with effector proteins. By using a combination of genetic, biochemical, and modeling techniques we demonstrate that an allosteric binding site for c-di-GMP (I-site) is responsible for non-competitive product inhibition of DGCs. The I-site was mapped in both multi- and single-domain DGC proteins and is fully contained within the GGDEF domain itself. In vivo selection experiments and kinetic analysis of the evolved I-site mutants led to the definition of an RXXD motif as the core c-di-GMP binding site. Based on these results and on the observation that the I-site is conserved in a majority of known and potential DGC proteins, we propose that product inhibition of DGCs is of fundamental importance for c-di-GMP signaling and cellular homeostasis. The definition of the I-site binding pocket provides an entry point into unraveling the molecular mechanisms of ligand-protein interactions involved in c-di-GMP signaling and makes DGCs a valuable target for drug design to develop new strategies against biofilm-related diseases.
Symmetry of two terminal, non-linear electric conduction
The well-established symmetry relations for linear transport phenomena cannot,
in general, be applied in the non-linear regime. Here we propose a set of
symmetry relations with respect to bias voltage and magnetic field for the
non-linear conductance of two-terminal electric conductors. We experimentally
confirm these relations using phase-coherent, semiconductor quantum dots. Comment: 4 pages, 4 figures
Cisplatin and taxol activate different signal pathways regulating cellular injury-induced expression of GADD153.
Signal transduction pathways activated by injury play a central role in coordinating the cellular responses that determine whether a cell survives or dies. GADD153 expression increases markedly in response to some types of cellular injury and the product of this gene causes cell cycle arrest. Using induction of GADD153 as a model, we have investigated the activation of the cellular injury response after treatment with taxol and cisplatin (cDDP). Activation of the GADD153 promoter coupled to the luciferase gene and transfected into human ovarian carcinoma 2008 cells correlated well with the increase in endogenous GADD153 mRNA after treatment with taxol but not after treatment with cDDP. Following treatment with cDDP, the increase in endogenous GADD153 mRNA was 10-fold greater than the increase in GADD153 promoter activity. Likewise, at equitoxic levels of exposure (IC80), cDDP produced a 5-fold greater increase in endogenous GADD153 mRNA than taxol. The tyrosine kinase inhibitor tyrphostin B46 had no significant effect on the ability of taxol to activate the GADD153 promoter, but inhibited activation of the GADD153 promoter by cDDP in a concentration-dependent manner. Tyrphostin B46 synergistically enhanced the cytotoxicity of cDDP; however, the same exposure had no significant effect on the cytotoxicity of taxol. We conclude that (1) taxol and cDDP activate GADD153 promoter activity through different mechanisms; (2) the signal transduction pathway mediating induction by cDDP involves a tyrosine kinase inhibitable by tyrphostin B46; and (3) inhibition of this signal transduction pathway by tyrphostin B46 synergistically enhances cDDP toxicity.
Generalized Bayesian Record Linkage and Regression with Exact Error Propagation
Record linkage (de-duplication or entity resolution) is the process of
merging noisy databases to remove duplicate entities. While record linkage
removes duplicate entities from such databases, the downstream task is any
inferential, predictive, or post-linkage task on the linked data. One goal of
the downstream task is obtaining a larger reference data set, allowing one to
perform more accurate statistical analyses. In addition, there is inherent
record linkage uncertainty passed to the downstream task. Motivated by the
above, we propose a generalized Bayesian record linkage method and consider
multiple regression analysis as the downstream task. Records are linked via a
random partition model, which allows for a wide class to be considered. In
addition, we jointly model the record linkage and downstream task, which allows
one to account for the record linkage uncertainty exactly. Moreover, one is
able to generate a feedback propagation mechanism of the information from the
proposed Bayesian record linkage model into the downstream task. This feedback
effect is essential to eliminate potential biases that can jeopardize the
resulting downstream task. We apply our methodology to multiple linear regression, and
illustrate empirically that the "feedback effect" is able to improve the
performance of record linkage. Comment: 18 pages, 5 figures