4,341 research outputs found
Separation of pulsar signals from noise with supervised machine learning algorithms
We evaluate the performance of four different machine learning (ML)
algorithms: an Artificial Neural Network Multi-Layer Perceptron (ANN MLP ),
Adaboost, Gradient Boosting Classifier (GBC), XGBoost, for the separation of
pulsars from radio frequency interference (RFI) and other sources of noise,
using a dataset obtained from the post-processing of a pulsar search pi peline.
This dataset was previously used for cross-validation of the SPINN-based
machine learning engine, used for the reprocessing of HTRU-S survey data
arXiv:1406.3627. We have used Synthetic Minority Over-sampling Technique
(SMOTE) to deal with high class imbalance in the dataset. We report a variety
of quality scores from all four of these algorithms on both the non-SMOTE and
SMOTE datasets. For all the above ML methods, we report high accuracy and
G-mean in both the non-SMOTE and SMOTE cases. We study the feature importances
using Adaboost, GBC, and XGBoost and also from the minimum Redundancy Maximum
Relevance approach to report algorithm-agnostic feature ranking. From these
methods, we find that the signal to noise of the folded profile to be the best
feature. We find that all the ML algorithms report FPRs about an order of
magnitude lower than the corresponding FPRs obtained in arXiv:1406.3627, for
the same recall value.Comment: 14 pages, 2 figures. Accepted for publication in Astronomy and
Computin
Multilevel Weighted Support Vector Machine for Classification on Healthcare Data with Missing Values
This work is motivated by the needs of predictive analytics on healthcare
data as represented by Electronic Medical Records. Such data is invariably
problematic: noisy, with missing entries, with imbalance in classes of
interests, leading to serious bias in predictive modeling. Since standard data
mining methods often produce poor performance measures, we argue for
development of specialized techniques of data-preprocessing and classification.
In this paper, we propose a new method to simultaneously classify large
datasets and reduce the effects of missing values. It is based on a multilevel
framework of the cost-sensitive SVM and the expected maximization imputation
method for missing values, which relies on iterated regression analyses. We
compare classification results of multilevel SVM-based algorithms on public
benchmark datasets with imbalanced classes and missing values as well as real
data in health applications, and show that our multilevel SVM-based method
produces fast, and more accurate and robust classification results.Comment: arXiv admin note: substantial text overlap with arXiv:1503.0625
An Instance Selection Algorithm for Big Data in High imbalanced datasets based on LSH
Training of Machine Learning (ML) models in real contexts often deals with
big data sets and high-class imbalance samples where the class of interest is
unrepresented (minority class). Practical solutions using classical ML models
address the problem of large data sets using parallel/distributed
implementations of training algorithms, approximate model-based solutions, or
applying instance selection (IS) algorithms to eliminate redundant information.
However, the combined problem of big and high imbalanced datasets has been less
addressed. This work proposes three new methods for IS to be able to deal with
large and imbalanced data sets. The proposed methods use Locality Sensitive
Hashing (LSH) as a base clustering technique, and then three different sampling
methods are applied on top of the clusters (or buckets) generated by LSH. The
algorithms were developed in the Apache Spark framework, guaranteeing their
scalability. The experiments carried out in three different datasets suggest
that the proposed IS methods can improve the performance of a base ML model
between 5% and 19% in terms of the geometric mean.Comment: 23 pages, 15 figure
Meta-learning for dynamic tuning of active learning on stream classification
Supervised data stream learning depends on the incoming sample's true label to update a classifier's model. In real life, obtaining the ground truth for each instance is a challenging process; it is highly costly and time consuming. Active Learning has already bridged this gap by finding a reduced set of instances to support the creation of a reliable stream classifier. However, identifying a reduced number of informative instances to support a suitable classifier update and drift adaptation is very tricky. To better adapt to concept drifts using a reduced number of samples, we propose an online tuning of the Uncertainty Sampling threshold using a meta-learning approach. Our approach exploits statistical meta-features from adaptive windows to meta-recommend a suitable threshold to address the trade-off between the number of labelling queries and high accuracy. Experiments exposed that the proposed approach provides the best trade-off between accuracy and query reduction by dynamic tuning the uncertainty threshold using lightweight meta-features
A systematic review of data quality issues in knowledge discovery tasks
Hay un gran crecimiento en el volumen de datos porque las organizaciones capturan permanentemente la cantidad colectiva de datos para lograr un mejor proceso de toma de decisiones. El desafío mas fundamental es la exploración de los grandes volúmenes de datos y la extracción de conocimiento útil para futuras acciones por medio de tareas para el descubrimiento del conocimiento; sin embargo, muchos datos presentan mala calidad. Presentamos una revisión sistemática de los asuntos de calidad de datos en las áreas del descubrimiento de conocimiento y un estudio de caso aplicado a la enfermedad agrícola conocida como la roya del café.Large volume of data is growing because the organizations are continuously capturing the collective amount of data for better decision-making process. The most fundamental challenge is to explore the large volumes of data and extract useful knowledge for future actions through knowledge discovery tasks, nevertheless many data has poor quality. We presented a systematic review of the data quality issues in knowledge discovery tasks and a case study applied to agricultural disease named coffee rust
- …