20,795 research outputs found
CUSBoost: Cluster-based Under-sampling with Boosting for Imbalanced Classification
Class imbalance classification is a challenging research problem in data
mining and machine learning, as most of the real-life datasets are often
imbalanced in nature. Existing learning algorithms maximise the classification
accuracy by correctly classifying the majority class, but misclassify the
minority class. However, the minority class instances are representing the
concept with greater interest than the majority class instances in real-life
applications. Recently, several techniques based on sampling methods
(under-sampling of the majority class and over-sampling the minority class),
cost-sensitive learning methods, and ensemble learning have been used in the
literature for classifying imbalanced datasets. In this paper, we introduce a
new clustering-based under-sampling approach with boosting (AdaBoost)
algorithm, called CUSBoost, for effective imbalanced classification. The
proposed algorithm provides an alternative to RUSBoost (random under-sampling
with AdaBoost) and SMOTEBoost (synthetic minority over-sampling with AdaBoost)
algorithms. We evaluated the performance of CUSBoost algorithm with the
state-of-the-art methods based on ensemble learning like AdaBoost, RUSBoost,
SMOTEBoost on 13 imbalance binary and multi-class datasets with various
imbalance ratios. The experimental results show that the CUSBoost is a
promising and effective approach for dealing with highly imbalanced datasets.Comment: CSITSS-201
Massive Open Online Courses Temporal Profiling for Dropout Prediction
Massive Open Online Courses (MOOCs) are attracting the attention of people
all over the world. Regardless the platform, numbers of registrants for online
courses are impressive but in the same time, completion rates are
disappointing. Understanding the mechanisms of dropping out based on the
learner profile arises as a crucial task in MOOCs, since it will allow
intervening at the right moment in order to assist the learner in completing
the course. In this paper, the dropout behaviour of learners in a MOOC is
thoroughly studied by first extracting features that describe the behavior of
learners within the course and then by comparing three classifiers (Logistic
Regression, Random Forest and AdaBoost) in two tasks: predicting which users
will have dropped out by a certain week and predicting which users will drop
out on a specific week. The former has showed to be considerably easier, with
all three classifiers performing equally well. However, the accuracy for the
second task is lower, and Logistic Regression tends to perform slightly better
than the other two algorithms. We found that features that reflect an active
attitude of the user towards the MOOC, such as submitting their assignment,
posting on the Forum and filling their Profile, are strong indicators of
persistence.Comment: 8 pages, ICTAI1
Machine Learning and Integrative Analysis of Biomedical Big Data.
Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges as well as exacerbates the ones associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues
- …