454 research outputs found
ERBlox: Combining Matching Dependencies with Machine Learning for Entity Resolution
Entity resolution (ER), an important and common data cleaning problem, is
about detecting data duplicate representations for the same external entities,
and merging them into single representations. Relatively recently, declarative
rules called matching dependencies (MDs) have been proposed for specifying
similarity conditions under which attribute values in database records are
merged. In this work we show the process and the benefits of integrating three
components of ER: (a) Classifiers for duplicate/non-duplicate record pairs
built using machine learning (ML) techniques, (b) MDs for supporting both the
blocking phase of ML and the merge itself; and (c) The use of the declarative
language LogiQL -an extended form of Datalog supported by the LogicBlox
platform- for data processing, and the specification and enforcement of MDs.Comment: To appear in Proc. SUM, 201
CMA – a comprehensive Bioconductor package for supervised classification with high dimensional data
For the last eight years, microarray-based class prediction has been a major topic in statistics, bioinformatics and biomedicine research. Traditional methods often yield unsatisfactory results or may even be inapplicable in the p > n setting where the number of predictors by far exceeds the number of observations, hence the term “ill-posed-problem”. Careful model selection and evaluation satisfying accepted good-practice standards is a very complex task for inexperienced users with limited statistical background or for statisticians without experience in this area. The multiplicity of available methods for class prediction based on high-dimensional data
is an additional practical challenge for inexperienced researchers. In this article, we introduce a new Bioconductor package called CMA (standing for “Classification for MicroArrays”) for automatically performing variable selection, parameter tuning, classifier construction, and unbiased evaluation of the constructed classifiers using a large number of usual methods. Without much time and effort, users are provided with an overview of the unbiased accuracy of most top-performing classifiers. Furthermore, the standardized evaluation framework underlying CMA can also be beneficial in statistical research for comparison purposes, for instance if a new classifier has to be compared to existing approaches. CMA is a user-friendly comprehensive package for classifier construction and evaluation implementing most usual approaches. It is freely available from the Bioconductor website at http://bioconductor.org/packages/2.3/bioc/html/CMA.html
CMA – a comprehensive Bioconductor package for supervised classification with high dimensional data
For the last eight years, microarray-based class prediction has been a major topic in statistics, bioinformatics and biomedicine research. Traditional methods often yield unsatisfactory results or may even be inapplicable in the p > n setting where the number of predictors by far exceeds the number of observations, hence the term “ill-posed-problem”. Careful model selection and evaluation satisfying accepted good-practice standards is a very complex task for inexperienced users with limited statistical background or for statisticians without experience in this area. The multiplicity of available methods for class prediction based on high-dimensional data
is an additional practical challenge for inexperienced researchers. In this article, we introduce a new Bioconductor package called CMA (standing for “Classification for MicroArrays”) for automatically performing variable selection, parameter tuning, classifier construction, and unbiased evaluation of the constructed classifiers using a large number of usual methods. Without much time and effort, users are provided with an overview of the unbiased accuracy of most top-performing classifiers. Furthermore, the standardized evaluation framework underlying CMA can also be beneficial in statistical research for comparison purposes, for instance if a new classifier has to be compared to existing approaches. CMA is a user-friendly comprehensive package for classifier construction and evaluation implementing most usual approaches. It is freely available from the Bioconductor website at http://bioconductor.org/packages/2.3/bioc/html/CMA.html
ERBlox: Combining Matching Dependencies with Machine Learning for Entity Resolution
Entity resolution (ER), an important and common data cleaning problem, is
about detecting data duplicate representations for the same external entities,
and merging them into single representations. Relatively recently, declarative
rules called "matching dependencies" (MDs) have been proposed for specifying
similarity conditions under which attribute values in database records are
merged. In this work we show the process and the benefits of integrating four
components of ER: (a) Building a classifier for duplicate/non-duplicate record
pairs built using machine learning (ML) techniques; (b) Use of MDs for
supporting the blocking phase of ML; (c) Record merging on the basis of the
classifier results; and (d) The use of the declarative language "LogiQL" -an
extended form of Datalog supported by the "LogicBlox" platform- for all
activities related to data processing, and the specification and enforcement of
MDs.Comment: Final journal version, with some minor technical corrections.
Extended version of arXiv:1508.0601
- …