454 research outputs found

ERBlox: Combining Matching Dependencies with Machine Learning for Entity Resolution

Author: A Elmagarmid
G Baudat
G Navarro
G Salton
IP Fellegi
J Bleiholder
L Bertossi
O Benjelloun
P Christen
S Ceri
TM Cover
TN Herzog
V Rastogi
W Fan
Publication date: 24/08/2015

Entity resolution (ER), an important and common data cleaning problem, is about detecting duplicate data representations of the same external entities and merging them into single representations. Relatively recently, declarative rules called matching dependencies (MDs) have been proposed for specifying similarity conditions under which attribute values in database records are merged. In this work we show the process and the benefits of integrating three components of ER: (a) classifiers for duplicate/non-duplicate record pairs built using machine learning (ML) techniques; (b) MDs for supporting both the blocking phase of ML and the merge itself; and (c) the use of the declarative language LogiQL (an extended form of Datalog supported by the LogicBlox platform) for data processing and for the specification and enforcement of MDs.

Comment: To appear in Proc. SUM, 201
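
As a rough illustration of what a matching dependency expresses, the sketch below applies one MD-style rule in Python: if two records have similar names and equal phone numbers, their address values are merged into a single value. This is not LogiQL and not the ERBlox implementation; the record fields, the similarity threshold, the `similar` predicate, and the "keep the longer value" merge function are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """Hypothetical similarity predicate based on difflib's matching ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def merge_values(a: str, b: str) -> str:
    """Hypothetical matching function: keep the longer of the two values."""
    return a if len(a) >= len(b) else b


def apply_md(r1: dict, r2: dict) -> bool:
    """MD-style rule: similar names and equal phones imply merged addresses.

    Both records are updated in place. Returns True if the rule fired.
    """
    if similar(r1["name"], r2["name"]) and r1["phone"] == r2["phone"]:
        merged = merge_values(r1["address"], r2["address"])
        r1["address"] = merged
        r2["address"] = merged
        return True
    return False


# Toy records (invented for illustration).
r1 = {"name": "John Smith", "phone": "555-0101", "address": "10 Main St"}
r2 = {"name": "Jon Smith", "phone": "555-0101", "address": "10 Main Street, Ottawa"}
r3 = {"name": "Jane Doe", "phone": "555-0199", "address": "3 Oak Ave"}

fired_12 = apply_md(r1, r2)  # names similar, phones equal: addresses merged
fired_13 = apply_md(r1, r3)  # condition fails: records left unchanged
```

In a full ER pipeline such rules would be enforced repeatedly until no further merges apply; the single-pass call here only shows the shape of one rule.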

arXiv.org e-Print Archive

Crossref

CMA – a comprehensive Bioconductor package for supervised classification with high dimensional data

Author: Boulesteix A-L
Daumer M
Slawski M
Publication venue: BioMed Central
Publication date: 01/01/2008

For the last eight years, microarray-based class prediction has been a major topic in statistics, bioinformatics and biomedicine research. Traditional methods often yield unsatisfactory results or may even be inapplicable in the p > n setting, where the number of predictors far exceeds the number of observations, hence the term “ill-posed problem”. Careful model selection and evaluation that satisfy accepted good-practice standards are a very complex task for inexperienced users with limited statistical background and for statisticians without experience in this area. The multiplicity of available methods for class prediction based on high-dimensional data is an additional practical challenge for inexperienced researchers. In this article, we introduce a new Bioconductor package called CMA (standing for “Classification for MicroArrays”) for automatically performing variable selection, parameter tuning, classifier construction, and unbiased evaluation of the constructed classifiers using a large number of standard methods. Without much time and effort, users are provided with an overview of the unbiased accuracy of most top-performing classifiers. Furthermore, the standardized evaluation framework underlying CMA can also be beneficial in statistical research for comparison purposes, for instance if a new classifier has to be compared to existing approaches. CMA is a user-friendly, comprehensive package for classifier construction and evaluation that implements most standard approaches. It is freely available from the Bioconductor website at http://bioconductor.org/packages/2.3/bioc/html/CMA.html
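
The point about unbiased evaluation in the p > n setting can be illustrated with a small sketch. CMA itself is an R/Bioconductor package, and the code below does not use or reproduce its API; it is a plain-Python example, on synthetic data, of one commonly recommended practice: variable selection is redone inside each cross-validation fold using the training samples only, so the held-out samples never influence which variables are chosen. The nearest-centroid classifier, the mean-difference ranking, and all function names are assumptions chosen for brevity.

```python
import random


def select_features(X, y, k):
    """Rank features by absolute difference of class means; return top-k indices."""
    scores = []
    for j in range(len(X[0])):
        v0 = [row[j] for row, label in zip(X, y) if label == 0]
        v1 = [row[j] for row, label in zip(X, y) if label == 1]
        scores.append((abs(sum(v1) / len(v1) - sum(v0) / len(v0)), j))
    return [j for _, j in sorted(scores, reverse=True)[:k]]


def fit_centroids(X, y, features):
    """Compute one centroid per class, restricted to the selected features."""
    centroids = {}
    for label in (0, 1):
        rows = [row for row, l in zip(X, y) if l == label]
        centroids[label] = [sum(r[j] for r in rows) / len(rows) for j in features]
    return centroids


def predict(row, centroids, features):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    def dist(label):
        return sum((row[j] - c) ** 2 for j, c in zip(features, centroids[label]))
    return min(centroids, key=dist)


def cross_validate(X, y, n_folds=5, k=10):
    """k-fold CV where feature selection sees only the training part of each fold."""
    indices = list(range(len(X)))
    folds = [indices[i::n_folds] for i in range(n_folds)]
    correct = 0
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        X_train = [X[i] for i in train_idx]
        y_train = [y[i] for i in train_idx]
        features = select_features(X_train, y_train, k)  # training data only
        centroids = fit_centroids(X_train, y_train, features)
        correct += sum(predict(X[i], centroids, features) == y[i] for i in test_idx)
    return correct / len(X)


# Synthetic p > n data: 40 samples, 500 features, 5 of them informative.
rng = random.Random(0)
n_samples, n_features = 40, 500
y = [i % 2 for i in range(n_samples)]
X = [[rng.gauss(0.0, 1.0) + (1.5 if (label == 1 and j < 5) else 0.0)
      for j in range(n_features)] for label in y]

accuracy = cross_validate(X, y)
print(f"cross-validated accuracy: {accuracy:.2f}")
```

Selecting the variables once on the full data set and cross-validating afterwards would let the test samples leak into the selection step and typically yields an optimistic accuracy estimate.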

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Open Access LMU

CMA – a comprehensive Bioconductor package for supervised classification with high dimensional data

Author: Campisi Patrizio
Neri Alessandro
Papari Giuseppe
Petkov Nicolai
Publication venue: BioMed Central
Publication date: 01/01/2006

Crossref

Proceedings - University of Groningen

University of Groningen

ARTS repository - University of Groningen

PubMed Central

Open Access LMU

Archivio della Ricerca - Università di Roma 3

University of Groningen Digital Archive

Dissertations of the University of Groningen

ERBlox: Combining Matching Dependencies with Machine Learning for Entity Resolution

Author: Bahmani Zeinab
Bertossi Leopoldo
Vasiloglou Nikolaos
Publication date: 18/01/2017

Entity resolution (ER), an important and common data cleaning problem, is about detecting duplicate data representations of the same external entities and merging them into single representations. Relatively recently, declarative rules called "matching dependencies" (MDs) have been proposed for specifying similarity conditions under which attribute values in database records are merged. In this work we show the process and the benefits of integrating four components of ER: (a) building a classifier for duplicate/non-duplicate record pairs using machine learning (ML) techniques; (b) the use of MDs for supporting the blocking phase of ML; (c) record merging on the basis of the classifier results; and (d) the use of the declarative language "LogiQL" (an extended form of Datalog supported by the "LogicBlox" platform) for all activities related to data processing and for the specification and enforcement of MDs.

Comment: Final journal version, with some minor technical corrections. Extended version of arXiv:1508.0601
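
The blocking phase mentioned in (b) exists to avoid comparing every record with every other record: only records that share a block are passed on to the classifier as candidate pairs. The sketch below shows that idea with a simple key-based blocking in Python. It is an assumption-laden toy: the abstract states that ERBlox uses MDs to support blocking, whereas this example uses a plain, hypothetical blocking key (surname prefix plus year) and invented records.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    """Hypothetical key: first three letters of the surname plus the year."""
    return (record["surname"][:3].lower(), record["year"])


def candidate_pairs(records):
    """Group record ids by blocking key and pair up ids within each block."""
    blocks = defaultdict(list)
    for rid, rec in records.items():
        blocks[blocking_key(rec)].append(rid)
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Toy records (invented for illustration), keyed by record id.
records = {
    1: {"surname": "Johnson", "year": 2015},
    2: {"surname": "Johnsen", "year": 2015},
    3: {"surname": "Miller", "year": 2015},
    4: {"surname": "Johnson", "year": 2017},
}

pairs = candidate_pairs(records)
all_pairs = len(records) * (len(records) - 1) // 2
print(f"{len(pairs)} candidate pair(s) out of {all_pairs} possible")
```

Only the pairs returned by `candidate_pairs` would then be scored by a duplicate/non-duplicate classifier, and pairs classified as duplicates would be merged.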

arXiv.org e-Print Archive

Carleton University's Institutional Repository