OSTSC: Over Sampling for Time Series Classification in R
The OSTSC package is a powerful oversampling approach for classifying
univariate, multinomial time series data in R. This article provides a
brief overview of the oversampling methodology implemented by the package. A
tutorial of the OSTSC package is provided. We begin with three test cases
that let the user quickly validate the package's functionality. To
demonstrate the performance impact of OSTSC, we then provide two medium-sized
imbalanced time series datasets. Each example applies a TensorFlow
implementation of a Long Short-Term Memory (LSTM) classifier - a type of
Recurrent Neural Network (RNN) classifier - to imbalanced time series. The
classifier performance is compared with and without oversampling. Finally,
larger versions of these two datasets are evaluated to demonstrate the
scalability of the package. The examples demonstrate that the OSTSC package
improves the performance of RNN classifiers applied to highly imbalanced time
series data. In particular, OSTSC is observed to increase the AUC of LSTM from
0.543 to 0.784 on a high frequency trading dataset consisting of 30,000 time
series observations.
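The core idea of oversampling imbalanced time series can be sketched minimally. Note that plain pairwise interpolation is used here purely for illustration; OSTSC's actual algorithm is a structure-preserving procedure, and all function and variable names below are hypothetical:

```python
import random

def oversample_minority(sequences, labels, target_label, seed=0):
    """Naively oversample minority-class time series by interpolating
    between randomly chosen pairs of minority sequences until class
    counts are balanced.  (Illustrative only: OSTSC itself uses a
    structure-preserving oversampling procedure, not plain interpolation.)"""
    rng = random.Random(seed)
    minority = [s for s, y in zip(sequences, labels) if y == target_label]
    majority_count = sum(1 for y in labels if y != target_label)
    new_seqs, new_labels = list(sequences), list(labels)
    while sum(1 for y in new_labels if y == target_label) < majority_count:
        a, b = rng.choice(minority), rng.choice(minority)
        t = rng.random()  # interpolation weight in [0, 1)
        new_seqs.append([(1 - t) * x + t * y for x, y in zip(a, b)])
        new_labels.append(target_label)
    return new_seqs, new_labels

# Ten majority series and two minority series, each of length 4
seqs = [[0.0, 1.0, 0.0, 1.0]] * 10 + [[5.0, 6.0, 5.0, 6.0], [6.0, 7.0, 6.0, 7.0]]
labs = [0] * 10 + [1] * 2
bal_seqs, bal_labs = oversample_minority(seqs, labs, target_label=1)
print(bal_labs.count(0), bal_labs.count(1))  # → 10 10
```

The balanced set can then be fed to any sequence classifier (such as the LSTM used in the article) in place of the raw imbalanced data.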
Grabit: Gradient Tree-Boosted Tobit Models for Default Prediction
A frequent problem in binary classification is class imbalance between a
minority and a majority class such as defaults and non-defaults in default
prediction. In this article, we introduce a novel binary classification model,
the Grabit model, which is obtained by applying gradient tree boosting to the
Tobit model. We show how this model can leverage auxiliary data to obtain
increased predictive accuracy for imbalanced data. We apply the Grabit model to
predicting defaults on loans made to Swiss small and medium-sized enterprises
(SME) and obtain a large and significant improvement in predictive performance
compared to other state-of-the-art approaches.
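The Tobit building block behind Grabit can be sketched as a per-observation loss: a Gaussian latent score that is censored below and above thresholds, so censored (binary) labels contribute through the normal CDF and any uncensored auxiliary observations through the normal density. The thresholds, unit scale, and function names here are illustrative assumptions, not the paper's exact parametrisation:

```python
import math

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def tobit_nll(y, f, sigma=1.0, yl=0.0, yu=1.0):
    """Negative log-likelihood of one Tobit observation: latent score f,
    noise scale sigma, censoring thresholds yl (lower) and yu (upper).
    Gradient tree boosting on a loss of this kind is the idea behind
    the Grabit model; see the paper for the exact formulation."""
    if y <= yl:                       # left-censored (e.g. non-default)
        return -math.log(norm_cdf((yl - f) / sigma))
    if y >= yu:                       # right-censored (e.g. default)
        return -math.log(1.0 - norm_cdf((yu - f) / sigma))
    # uncensored auxiliary observation: ordinary Gaussian log-density
    return -math.log(norm_pdf((y - f) / sigma) / sigma)

# Pushing the latent score away from the censored side lowers the loss:
print(tobit_nll(0.0, f=-2.0) < tobit_nll(0.0, f=0.0))  # → True
```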
Cost-Sensitive Feature Selection by Optimizing F-Measures
Feature selection is beneficial for improving the performance of general
machine learning tasks by extracting an informative subset from the
high-dimensional features. Conventional feature selection methods usually
ignore the class imbalance problem, thus the selected features will be biased
towards the majority class. Considering that F-measure is a more reasonable
performance measure than accuracy for imbalanced data, this paper presents an
effective feature selection algorithm that explores the class imbalance issue
by optimizing F-measures. Since F-measure optimization can be decomposed into a
series of cost-sensitive classification problems, we investigate the
cost-sensitive feature selection by generating and assigning different costs to
each class with rigorous theory guidance. After solving a series of
cost-sensitive feature selection problems, features corresponding to the best
F-measure will be selected. In this way, the selected features will fully
represent the properties of all classes. Experimental results on popular
benchmarks and challenging real-world data sets demonstrate the significance of
cost-sensitive feature selection for the imbalanced data setting and validate
the effectiveness of the proposed method.
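The decomposition of F-measure optimization into cost-sensitive sub-problems can be illustrated with a toy single-threshold "classifier": sweep a grid of misclassification costs, solve each weighted problem, and keep the solution achieving the best F-measure. The cost grid and toy data below are illustrative, not the paper's method:

```python
def f_measure(tp, fp, fn, beta=1.0):
    """F_beta computed from confusion counts."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0

def cost_sensitive_threshold(scores, labels, cost_fn, cost_fp):
    """Pick the decision threshold minimising the weighted error
    cost_fn * (#false negatives) + cost_fp * (#false positives);
    this stands in for one cost-sensitive sub-problem."""
    best_t, best_cost = 0.0, float("inf")
    for t in sorted(set(scores)) + [max(scores) + 1.0]:
        fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < t)
        fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= t)
        cost = cost_fn * fn + cost_fp * fp
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t

scores = [0.1, 0.2, 0.4, 0.6, 0.8, 0.9]
labels = [0,   0,   1,   0,   1,   1  ]
best = None
for cost_fn in (1.0, 2.0, 4.0):          # grid over minority-error costs
    t = cost_sensitive_threshold(scores, labels, cost_fn, cost_fp=1.0)
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= t)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < t)
    f1 = f_measure(tp, fp, fn)
    if best is None or f1 > best[0]:
        best = (f1, cost_fn, t)
print(best)
```

In the paper the same outer loop selects feature subsets rather than thresholds, but the structure — solve cost-sensitive problems, keep the best F-measure — is the same.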
Automatically Detecting Self-Reported Birth Defect Outcomes on Twitter for Large-scale Epidemiological Research
In recent work, we identified and studied a small cohort of Twitter users
whose pregnancies with birth defect outcomes could be observed via their
publicly available tweets. Exploiting social media's large-scale potential to
complement the limited methods for studying birth defects, the leading cause of
infant mortality, depends on the further development of automatic methods. The
primary objective of this study was to take the first step towards scaling the
use of social media for observing pregnancies with birth defect outcomes,
namely, developing methods for automatically detecting tweets by users
reporting their birth defect outcomes. We annotated and pre-processed
approximately 23,000 tweets that mention birth defects in order to train and
evaluate supervised machine learning algorithms, including feature-engineered
and deep learning-based classifiers. We also experimented with various
under-sampling and over-sampling approaches to address the class imbalance. A
Support Vector Machine (SVM) classifier trained on the original, imbalanced
data set, with n-grams, word clusters, and structural features, achieved the
best baseline performance for the positive classes: an F1-score of 0.65 for the
"defect" class and 0.51 for the "possible defect" class. Our contributions
include (i) natural language processing (NLP) and supervised machine learning
methods for automatically detecting tweets by users reporting their birth
defect outcomes, (ii) a comparison of feature-engineered and deep
learning-based classifiers trained on imbalanced, under-sampled, and
over-sampled data, and (iii) an error analysis that could inform classification
improvements using our publicly available corpus. Future work will focus on
automating user-level analyses for cohort inclusion.
Instance Selection Improves Geometric Mean Accuracy: A Study on Imbalanced Data Classification
A natural way of handling imbalanced data is to attempt to equalise the class
frequencies and train the classifier of choice on balanced data. For two-class
imbalanced problems, the classification success is typically measured by the
geometric mean (GM) of the true positive and true negative rates. Here we prove
that GM can be improved upon by instance selection, and give the theoretical
conditions for such an improvement. We demonstrate that GM is non-monotonic
with respect to the number of retained instances, which discourages systematic
instance selection. We also show that balancing the distribution frequencies is
inferior to a direct maximisation of GM. To verify our theoretical findings, we
carried out an experimental study of 12 instance selection methods for
imbalanced data, using 66 standard benchmark data sets. The results reveal
possible room for new instance selection methods for imbalanced data. Comment: 11 pages, 7 figures.
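The geometric mean of the true positive and true negative rates used throughout this study is straightforward to compute; a minimal sketch for labels 0 and 1:

```python
import math

def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity (TPR) and specificity (TNR)
    for a two-class problem with labels 0 (negative) and 1 (positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    tpr = tp / pos if pos else 0.0
    tnr = tn / neg if neg else 0.0
    return math.sqrt(tpr * tnr)

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
print(round(g_mean(y_true, y_pred), 3))  # → 0.791
```

Because GM is the product of two rates, a classifier that sacrifices either class entirely scores zero, which is why it is preferred over plain accuracy for imbalanced problems.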
Separation of pulsar signals from noise with supervised machine learning algorithms
We evaluate the performance of four different machine learning (ML)
algorithms: an Artificial Neural Network Multi-Layer Perceptron (ANN MLP),
AdaBoost, Gradient Boosting Classifier (GBC), and XGBoost, for the separation
of pulsars from radio frequency interference (RFI) and other sources of noise,
using a dataset obtained from the post-processing of a pulsar search pipeline.
This dataset was previously used for cross-validation of the SPINN-based
machine learning engine, used for the reprocessing of HTRU-S survey data
arXiv:1406.3627. We have used Synthetic Minority Over-sampling Technique
(SMOTE) to deal with high class imbalance in the dataset. We report a variety
of quality scores from all four of these algorithms on both the non-SMOTE and
SMOTE datasets. For all the above ML methods, we report high accuracy and
G-mean in both the non-SMOTE and SMOTE cases. We study the feature importances
using Adaboost, GBC, and XGBoost and also from the minimum Redundancy Maximum
Relevance approach to report algorithm-agnostic feature ranking. From these
methods, we find the signal-to-noise ratio of the folded profile to be the
best feature. We find that all the ML algorithms report FPRs about an order of
magnitude lower than the corresponding FPRs obtained in arXiv:1406.3627, for
the same recall value. Comment: 14 pages, 2 figures. Accepted for publication
in Astronomy and Computing.
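The core interpolation step of SMOTE used above can be sketched in a few lines; the published algorithm adds bookkeeping around how many synthetic points each minority sample contributes, so this is a simplified illustration with hypothetical names:

```python
import random

def smote(minority, n_synthetic, k=3, seed=0):
    """Generate n_synthetic points by interpolating a randomly chosen
    minority point toward one of its k nearest minority neighbours -
    the core step of SMOTE (Chawla et al.)."""
    rng = random.Random(seed)

    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    out = []
    for _ in range(n_synthetic):
        x = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not x),
                            key=lambda p: dist2(x, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # how far along the line segment to place the point
        out.append([xi + gap * (ni - xi) for xi, ni in zip(x, nb)])
    return out

minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
synthetic = smote(minority, n_synthetic=6)
print(len(synthetic))  # → 6
```

Every synthetic point lies on a segment between two real minority points, so the oversampled class occupies the same region of feature space rather than duplicating exact examples.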
Increased Prediction Accuracy in the Game of Cricket using Machine Learning
Player selection is one of the most important tasks in any sport, and cricket
is no exception. The performance of the players depends on various factors,
such as the opposition team, the venue, and the player's current form. The
team management, the
coach and the captain select 11 players for each match from a squad of 15 to 20
players. They analyze different characteristics and the statistics of the
players to select the best playing 11 for each match. Each batsman contributes
by scoring maximum runs possible and each bowler contributes by taking maximum
wickets and conceding minimum runs. This paper attempts to predict the
performance of players as how many runs will each batsman score and how many
wickets will each bowler take for both the teams. Both the problems are
targeted as classification problems where the numbers of runs and wickets
are classified into different ranges. We used naïve Bayes, random forest,
multiclass SVM and decision tree classifiers to generate the prediction models
for both the problems. The Random Forest classifier was found to be the most
accurate for both problems.
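Turning a run or wicket prediction into a multi-class problem amounts to binning the target into ranges; a minimal sketch, where the range boundaries are hypothetical (the paper's actual bins are not given in the abstract):

```python
def runs_to_class(runs, bins=(10, 25, 50, 75, 100)):
    """Map a batsman's runs to an ordinal class index using the
    (hypothetical) range boundaries above: class 0 is [0, 10),
    class 1 is [10, 25), ..., class 5 is [100, inf)."""
    for i, upper in enumerate(bins):
        if runs < upper:
            return i
    return len(bins)

print([runs_to_class(r) for r in (4, 33, 120)])  # → [0, 2, 5]
```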
Additional Representations for Improving Synthetic Aperture Sonar Classification Using Convolutional Neural Networks
Object classification in synthetic aperture sonar (SAS) imagery is usually a
data starved and class imbalanced problem. There are few objects of interest
present among much benign seafloor. Despite these problems, current
classification techniques discard a large portion of the collected SAS
information. In particular, a beamformed SAS image, which we call a single-look
complex (SLC) image, contains complex pixels composed of real and imaginary
parts. For human consumption, the SLC is converted to a magnitude-phase
representation and the phase information is discarded. Even more problematic,
the magnitude information usually exhibits a large dynamic range (>80 dB) and
must be dynamic range compressed for human display. Often it is this dynamic
range compressed representation, originally designed for human consumption,
which is fed into a classifier. Consequently, the classification process is
completely devoid of phase information. In this work, we show improvements in
classification performance using the phase information from the SLC as well as
information from an alternate source: photographs. We perform statistical
testing to demonstrate the validity of our results. Comment: Accepted for the
Institute of Acoustics 4th International Conference on Synthetic Aperture
Sonar and Radar, Sept 201
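The three representations of an SLC pixel discussed above — magnitude, phase, and a dynamic-range-compressed magnitude — can be sketched for a single complex pixel; the peak reference and clipping scheme here are illustrative choices, not the paper's exact display pipeline:

```python
import cmath
import math

def slc_representations(pixel, dynamic_range_db=80.0, peak=1.0):
    """Split one single-look-complex (SLC) pixel into magnitude, phase,
    and a magnitude in dB relative to `peak`, clipped to the given
    dynamic range (the information a display-oriented representation
    keeps, versus the phase it throws away)."""
    mag = abs(pixel)
    phase = cmath.phase(pixel)  # discarded in magnitude-only displays
    db = 20.0 * math.log10(mag / peak) if mag > 0 else -float("inf")
    db_clipped = max(db, -dynamic_range_db)  # dynamic range compression
    return mag, phase, db_clipped

mag, phase, db = slc_representations(complex(3.0, 4.0), peak=10.0)
print(round(mag, 3), round(phase, 3), round(db, 3))
```

Feeding only `db_clipped` to a classifier corresponds to the conventional pipeline criticised above; the paper's point is that `phase` carries additional discriminative information.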
Characterizing the structural diversity of complex networks across domains
The structure of complex networks has been of interest in many scientific and
engineering disciplines over the decades. Many studies in the field have
focused on finding properties common to different kinds of networks, such as
heavy-tailed degree distributions, small-worldness, and modular structure, in
an attempt to establish a theory of structural universality in complex
networks. However, there is no comprehensive study of network structure across
a diverse set of domains in order to explain the structural diversity we
observe in the real-world networks. In this paper, we study 986 real-world
networks of diverse domains ranging from ecological food webs to online social
networks along with 575 networks generated from four popular network models.
Our study utilizes machine learning techniques such as random forests and
confusion-matrix analysis in order to show the relationships among network
domains in terms of network structure. Our results indicate that there are some
partitions of network categories in which networks are hard to distinguish
based purely on network structure. We have found that these partitions of
network categories tend to have similar underlying functions, constraints
and/or generative mechanisms, even though networks in the same partition have
different origins, e.g., biological processes, human engineering, etc. This
suggests that the origin of a network, whether biological, technological, or
social, is not necessarily a decisive factor in the formation of similar
network structure. Our findings
shed light on the possible direction along which we could uncover the hidden
principles for the structural diversity of complex networks. Comment: 23
pages, 11 figures, 2 tables; originally published as K. Ikehara, "The
Structure of Complex Networks across Domains." MS Thesis, University of
Colorado Boulder (2016).
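The confusion-matrix analysis used to find hard-to-distinguish domain partitions can be sketched minimally: off-diagonal mass between two domains indicates their networks look structurally similar to the classifier. The domain labels here are illustrative:

```python
from collections import defaultdict

def confusion_matrix(y_true, y_pred):
    """Counts of (true domain, predicted domain) pairs; large
    off-diagonal entries between two domains suggest they are hard
    to distinguish from network structure alone."""
    cm = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        cm[(t, p)] += 1
    return dict(cm)

y_true = ["social", "social", "biological", "biological", "technological"]
y_pred = ["social", "biological", "biological", "social", "technological"]
print(confusion_matrix(y_true, y_pred))
```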
Who Will Retweet This? Automatically Identifying and Engaging Strangers on Twitter to Spread Information
There has been much effort on studying how social media sites, such as
Twitter, help propagate information in different situations, including
spreading alerts and SOS messages in an emergency. However, existing work has
not addressed how to actively identify and engage the right strangers at the
right time on social media to help effectively propagate intended information
within a desired time frame. To address this problem, we have developed two
models: (i) a feature-based model that leverages people's exhibited social
behavior, including the content of their tweets and social interactions, to
characterize their willingness and readiness to propagate information on
Twitter via the act of retweeting; and (ii) a wait-time model based on a user's
previous retweeting wait times to predict her next retweeting time when asked.
Based on these two models, we build a recommender system that predicts the
likelihood of a stranger to retweet information when asked, within a specific
time window, and recommends the top-N qualified strangers to engage with. Our
experiments, including live studies in the real world, demonstrate the
effectiveness of our work.
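The recommender's final step — filter candidates whose predicted wait time fits the deadline, rank by predicted retweet likelihood, keep the top-N — can be sketched as follows; the field names and scores are hypothetical stand-ins for the outputs of the two models described above:

```python
def recommend_strangers(candidates, n, deadline):
    """Keep candidates whose predicted retweet wait time fits the
    deadline, rank the rest by predicted retweet likelihood, and
    return the top-n (field names here are illustrative)."""
    eligible = [c for c in candidates if c["wait_time"] <= deadline]
    ranked = sorted(eligible, key=lambda c: c["likelihood"], reverse=True)
    return [c["name"] for c in ranked[:n]]

candidates = [
    {"name": "a", "likelihood": 0.9, "wait_time": 30},
    {"name": "b", "likelihood": 0.7, "wait_time": 5},
    {"name": "c", "likelihood": 0.8, "wait_time": 60},
]
print(recommend_strangers(candidates, n=2, deadline=40))  # → ['a', 'b']
```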