HoloDetect: Few-Shot Learning for Error Detection
We introduce a few-shot learning framework for error detection. We show that
data augmentation (a form of weak supervision) is key to training high-quality,
ML-based error detection models that require minimal human involvement. Our
framework consists of two parts: (1) an expressive model to learn rich
representations that capture the inherent syntactic and semantic heterogeneity
of errors; and (2) a data augmentation model that, given a small seed of clean
records, uses dataset-specific transformations to automatically generate
additional training data. Our key insight is to learn data augmentation
policies from the noisy input dataset in a weakly supervised manner. We show
that our framework detects errors with an average precision of ~94% and an
average recall of ~93% across a diverse array of datasets that exhibit
different types and amounts of errors. We compare our approach to a
comprehensive collection of error detection methods, ranging from traditional
rule-based methods to ensemble-based and active learning approaches. We show
that data augmentation yields an average improvement of 20 F1 points while
requiring access to 3x fewer labeled examples than other ML approaches.
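The augmentation idea in the abstract can be sketched with a minimal, hypothetical example: starting from a small clean seed, apply cell-level transformations to synthesize labeled error examples. The transformations and the `augment` helper below are illustrative stand-ins, not HoloDetect's learned policies, which are themselves learned from the noisy input dataset.

```python
import random

# Hand-written stand-ins for learned, dataset-specific augmentation policies.
def drop_char(value: str) -> str:
    """Delete one random character (a common typo-style error)."""
    if len(value) < 2:
        return value
    i = random.randrange(len(value))
    return value[:i] + value[i + 1:]

def swap_adjacent(value: str) -> str:
    """Transpose two adjacent characters."""
    if len(value) < 2:
        return value
    i = random.randrange(len(value) - 1)
    return value[:i] + value[i + 1] + value[i] + value[i + 2:]

TRANSFORMATIONS = [drop_char, swap_adjacent]

def augment(clean_seed, n_errors_per_record=2):
    """Generate (cell_value, is_error) training pairs from a small clean seed:
    clean cells are labeled 0, transformed cells are labeled 1."""
    examples = [(cell, 0) for record in clean_seed for cell in record]
    for record in clean_seed:
        for _ in range(n_errors_per_record):
            cell = random.choice(record)
            dirty = random.choice(TRANSFORMATIONS)(cell)
            if dirty != cell:
                examples.append((dirty, 1))
    return examples

random.seed(0)
pairs = augment([("Berlin", "Germany"), ("Paris", "France")])
```

An error-detection classifier can then be trained on `pairs` without any further human labeling beyond the clean seed.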
Data Science as a New Frontier for Design
The purpose of this paper is to contribute to the challenge of transferring
know-how, theories and methods from design research to the design processes in
information science and technologies. More specifically, we consider a
domain, namely data science, that is rapidly becoming a globally invested
research and development axis with strong imperatives for innovation, given
the data deluge we currently face. We argue that, in order to rise to the
data-related challenges society is facing, data-science initiatives should
renew traditional research methodologies, which are still largely based on
trial-and-error processes that depend on the talent and insights of a single
researcher (or a restricted group of researchers). It is our claim that design
theories and methods can provide, at least to some extent, the much-needed
framework. We will use a worldwide data-science challenge organized to study a
technical problem in physics, namely the detection of the Higgs boson, as a use
case to demonstrate some of the ways in which design theory and methods can
help in analyzing and shaping the innovation dynamics in such projects.
Comment: International Conference on Engineering Design, Jul 2015, Milan, Italy
A performance comparison of oversampling methods for data generation in imbalanced learning tasks
Dissertation presented as the partial requirement for obtaining a Master's degree in Statistics and Information Management, specialization in Marketing Research and CRM.

The class imbalance problem is one of the most fundamental challenges faced by the machine learning community. The imbalance refers to the number of instances in the class of interest being relatively low compared to the rest of the data. Sampling is a common technique for dealing with this problem, and a number of oversampling approaches have been applied in an attempt to balance the classes. This study provides an overview of the class imbalance issue and examines some common oversampling approaches for dealing with it. To illustrate their differences, an experiment is conducted on multiple simulated data sets, comparing the performance of these oversampling methods across different classifiers based on various evaluation criteria. In addition, the effect of different parameters, such as the number of features and the imbalance ratio, on classifier performance is also evaluated.
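Oversampling comparisons of this kind typically include SMOTE, which creates synthetic minority samples by interpolating between a minority point and one of its nearest minority-class neighbours. A minimal pure-Python sketch of that idea follows; it is an illustration of the technique, not the dissertation's experimental code.

```python
import math
import random

def smote(minority, n_synthetic, k=3, rng=None):
    """Minimal SMOTE sketch: for each synthetic sample, pick a minority point,
    pick one of its k nearest minority neighbours, and interpolate between
    them at a random position along the connecting segment."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_synthetic):
        x = rng.choice(minority)
        # k nearest minority neighbours of x, excluding x itself
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: math.dist(x, p),
        )[:k]
        nn = rng.choice(neighbours)
        gap = rng.random()
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nn)))
    return synthetic

minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2)]
new_points = smote(minority, n_synthetic=8)
```

Because every synthetic point lies on a segment between two real minority points, the generated samples stay inside the minority region rather than being duplicates, which is what distinguishes SMOTE from plain random oversampling.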
Early Stopping of a Neural Network via the Receiver Operating Curve
This thesis presents the area under the ROC (Receiver Operating Characteristic) curve, abbreviated AUC, as an alternative measure for evaluating the predictive performance of ANN (Artificial Neural Network) classifiers. Conventionally, neural networks are trained until total error converges to zero, which may give rise to over-fitting. To ensure that they do not over-fit the training data and then fail to generalize to new data, it proves effective to stop training as early as possible once the AUC is sufficiently large, by integrating ROC/AUC analysis into the training process. To reduce learning costs on imbalanced data sets with uneven class distributions, random sampling and k-means clustering are used to draw a smaller subset of representatives from the original training data set. Finally, a confidence interval for the AUC is estimated with a non-parametric approach.
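The stopping rule described above can be sketched as a rank-based AUC computation plus a criterion checked after each epoch. The `target` and `patience` values below are illustrative assumptions, not figures from the thesis.

```python
def auc(labels, scores):
    """Area under the ROC curve via the Mann-Whitney statistic: the
    probability that a random positive outscores a random negative
    (ties count as half a win)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def should_stop(auc_history, target=0.95, patience=3):
    """Stop once validation AUC reaches the target, or once it has failed
    to improve for `patience` consecutive epochs."""
    if auc_history[-1] >= target:
        return True
    best_epoch = max(range(len(auc_history)), key=auc_history.__getitem__)
    return len(auc_history) - 1 - best_epoch >= patience
```

Calling `should_stop` on the running list of validation AUCs after each epoch halts training early instead of driving total error to zero.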
Introducing artificial data generation in active learning for land use/land cover classification
Fonseca, J., Douzas, G., & Bacao, F. (2021). Increasing the effectiveness of active learning: Introducing artificial data generation in active learning for land use/land cover classification. Remote Sensing, 13(13), 1-20. [2619]. https://doi.org/10.3390/rs13132619

In remote sensing, Active Learning (AL) has become an important technique for collecting informative ground truth data "on demand" for supervised classification tasks. Despite its effectiveness, it still relies heavily on user interaction, which makes it both expensive and time-consuming to implement. Most of the current literature focuses on optimizing AL by modifying the selection criteria and the classifiers used. Although improvements in these areas will result in more effective data collection, the use of artificial data sources to reduce human-computer interaction remains unexplored. In this paper, we introduce a new component to the typical AL framework, the data generator: a source of artificial data that reduces the amount of user-labeled data required in AL. The proposed AL framework is implemented using Geometric SMOTE as the data generator. We compare the new AL framework to the original one using similar acquisition functions and classifiers over three AL-specific performance metrics on seven benchmark datasets. We show that this modification of the AL framework significantly reduces the cost and time requirements for a successful AL implementation in all of the datasets used in the experiment.
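The framework can be sketched as an uncertainty-sampling loop whose labeled set is padded with interpolated synthetic points each round. In this sketch, plain linear interpolation between same-class labeled pairs stands in for Geometric SMOTE, and all function names and parameters are hypothetical.

```python
import random

def active_learning_with_generator(pool, oracle, fit, uncertainty,
                                   n_rounds=3, batch=2, n_synthetic=4, seed=0):
    """AL loop with a data generator: each round, query the oracle for the
    most uncertain pool points, pad the labeled set with synthetic points
    interpolated between same-class labeled pairs, then refit."""
    rng = random.Random(seed)
    labeled = []  # (x, y) pairs labeled by the oracle
    model = None
    for _ in range(n_rounds):
        # 1. rank remaining pool points by uncertainty under the current model
        pool.sort(key=lambda x: -uncertainty(model, x))
        queried, pool = pool[:batch], pool[batch:]
        labeled += [(x, oracle(x)) for x in queried]
        # 2. generate synthetic labeled points to stretch the label budget
        synthetic = []
        for _ in range(n_synthetic):
            (xa, ya), (xb, yb) = rng.sample(labeled, 2)
            if ya == yb:
                g = rng.random()
                synthetic.append(
                    (tuple(a + g * (b - a) for a, b in zip(xa, xb)), ya))
        # 3. refit on real plus synthetic labels
        model = fit(labeled + synthetic)
    return model, labeled

# Toy 1D usage: a threshold "model" and distance-to-threshold uncertainty.
pool = [(-2.0,), (-1.0,), (-0.5,), (0.5,), (1.0,), (2.0,)]
oracle = lambda x: 1 if x[0] > 0 else 0
fit = lambda data: sum(x[0] for x, _ in data) / len(data)
uncertainty = lambda model, x: 0.0 if model is None else -abs(x[0] - model)
model, labeled = active_learning_with_generator(pool, oracle, fit, uncertainty)
```

The synthetic points consume no oracle queries, which is exactly how the data generator reduces the user-interaction cost of AL.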