413 research outputs found
An empirical evaluation of imbalanced data strategies from a practitioner's point of view
This research tested the following well-known strategies to deal with binary
imbalanced data on 82 different real-life data sets (sampled to imbalance rates
of 5%, 3%, 1%, and 0.1%): class weight, SMOTE, Underbagging, and a baseline
(just the base classifier). As base classifiers we used SVM with RBF kernel,
random forests, and gradient boosting machines and we measured the quality of
the resulting classifier using 6 different metrics (area under the curve (AUC),
accuracy, F-measure, G-mean, Matthews correlation coefficient (MCC), and balanced
accuracy). The best strategy strongly depends on the metric used to measure the
quality of the classifier. For AUC and accuracy, class weight and the baseline
perform better; for F-measure and MCC, SMOTE performs better; and for G-mean
and balanced accuracy, underbagging
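To illustrate why the choice of metric matters so much on imbalanced data, here is a minimal sketch that computes the threshold-based metrics named above from binary confusion-matrix counts. The function name and the example counts are hypothetical and do not come from the study; AUC is omitted because it needs classifier scores rather than counts.

```python
import math

def imbalance_metrics(tp, fp, tn, fn):
    """Compute threshold-based metrics from binary confusion-matrix counts."""
    recall = tp / (tp + fn) if tp + fn else 0.0        # sensitivity (true positive rate)
    specificity = tn / (tn + fp) if tn + fp else 0.0   # true negative rate
    precision = tp / (tp + fp) if tp + fp else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    # MCC denominator is zero when any row or column of the matrix is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "f_measure": f_measure,
        "g_mean": math.sqrt(recall * specificity),
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
        "balanced_accuracy": (recall + specificity) / 2,
    }

# Hypothetical data: 1000 samples with 1% positives, and a classifier
# that always predicts the majority (negative) class.
baseline = imbalance_metrics(tp=0, fp=0, tn=990, fn=10)
for name, value in baseline.items():
    print(f"{name}: {value:.3f}")
```

In this made-up case accuracy looks excellent while G-mean, F-measure and MCC are all zero, which is one way a strategy can rank well under one metric and poorly under another.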
Detection of Dispersed Radio Pulses: A machine learning approach to candidate identification and classification
Searching for extraterrestrial, transient signals in astronomical data sets
is an active area of current research. However, machine learning techniques are
lacking in the literature concerning single-pulse detection. This paper
presents a new, two-stage approach for identifying and classifying dispersed
pulse groups (DPGs) in single-pulse search output. The first stage identified
DPGs and extracted features to characterize them using a new peak
identification algorithm which tracks sloping tendencies around local maxima in
plots of signal-to-noise ratio vs. dispersion measure. The second stage used
supervised machine learning to classify DPGs. We created four benchmark data
sets: one unbalanced and three balanced versions using three different
imbalance treatments. We empirically evaluated 48 classifiers by training and
testing binary and multiclass versions of six machine learning algorithms on
each of the four benchmark versions. While each classifier had advantages and
disadvantages, all classifiers with imbalance treatments had higher recall
values than those with unbalanced data, regardless of the machine learning
algorithm used. Based on the benchmarking results, we selected a subset of
classifiers to classify the full, unlabelled data set of over 1.5 million DPGs
identified in 42,405 observations made by the Green Bank Telescope. Overall,
the classifiers using a multiclass ensemble tree learner in combination with
two oversampling imbalance treatments were the most efficient; they identified
additional known pulsars not in the benchmark data set and provided six
potential discoveries, with significantly fewer false positives than the other
classifiers. Comment: 13 pages, accepted for publication in MNRAS, ref. MN-15-1713-MJ.R
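As a rough illustration of the first stage's starting point, the sketch below finds local maxima in a signal-to-noise ratio vs. dispersion measure series. It is not the paper's peak identification algorithm, which additionally tracks sloping tendencies around the maxima; the function name, the `min_snr` threshold, and the sample values are all assumptions made for the example.

```python
def local_maxima(dm, snr, min_snr=5.0):
    """Return (dm, snr) pairs where snr is a strict local maximum >= min_snr."""
    peaks = []
    for i in range(1, len(snr) - 1):
        # A strict local maximum is higher than both of its neighbours.
        if snr[i] >= min_snr and snr[i - 1] < snr[i] > snr[i + 1]:
            peaks.append((dm[i], snr[i]))
    return peaks

# Hypothetical single-pulse search output: dispersion measure trials
# and the signal-to-noise ratio measured at each trial.
dm = [10, 20, 30, 40, 50, 60, 70]
snr = [5.1, 6.0, 9.5, 6.2, 5.3, 5.8, 5.2]
peaks = local_maxima(dm, snr)
print(peaks)
```

A real implementation would also need to group nearby events and handle noisy plateaus, which this sketch does not attempt.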
Predicting Happiness - Comparison of Supervised Machine Learning Techniques Performance on a Multiclass Classification Problem
In the modern world, especially in contemporary economies and politics, a population's subjective well-being is a frequent subject of public debate. As comparisons of happiness levels in different countries are published, different circumstances and their effect on the subjective well-being reported by people are also analysed. However, a significant amount of the research on subjective well-being and its determinants is still based on survey answers and employs conventional statistical methods that provide details regarding correlations and causality between different factors and subjective well-being. Applying Supervised Machine Learning techniques to the prediction of subjective well-being may provide new ways of understanding how individual factors contribute to its value and allow for addressing any issues which may potentially affect mental and physical health. The focus of this research is to use the survey data to make predictions regarding subjective well-being (a multiclass target) using Supervised Machine Learning models. In particular, the study aims to compare the performance of two techniques: Decision Trees and Neural Networks. The C4.5 algorithm used by the Decision Trees is considered the benchmark algorithm to which other supervised learning algorithms should be compared. At the same time, Neural Networks have previously been shown to have high predictive power, even on multiclass classification problems. Two experiments are conducted as part of this research: one using the original, highly imbalanced data, and the other using the dataset balanced with SMOTE. The experimental results show that for the first experiment there is no statistically significant difference (
Application of advanced machine learning techniques to early network traffic classification
The fast-paced evolution of the Internet is drawing a complex context which
imposes demanding requirements to assure end-to-end Quality of Service. The
development of advanced intelligent approaches in networking is envisioning
features that include autonomous resource allocation, fast reaction against
unexpected network events and so on. Internet Network Traffic Classification
constitutes a crucial source of information for Network Management, being decisive
in assisting the emerging network control paradigms. Monitoring traffic flowing
through network devices supports tasks such as network orchestration, traffic
prioritization, network arbitration, and cyberthreat detection, amongst others.
The traditional traffic classifiers became obsolete owing to the rapid Internet
evolution. Port-based classifiers suffer from significant accuracy losses due to port
masking, meanwhile Deep Packet Inspection approaches have severe user-privacy
limitations. The advent of Machine Learning has propelled the application of
advanced algorithms in diverse research areas, and some learning approaches have
proved to be an interesting alternative to the classic traffic classification approaches.
Addressing Network Traffic Classification from a Machine Learning perspective
implies numerous challenges demanding research efforts to achieve feasible
classifiers. In this dissertation, we endeavor to formulate and solve important
research questions in Machine-Learning-based Network Traffic Classification. As a
result of numerous experiments, the knowledge provided in this research constitutes
an engaging case study in which network traffic data from two different
environments are successfully collected, processed and modeled.
First, we approached the Feature Extraction and Selection processes, providing our
own contributions. A Feature Extractor was designed to create Machine-Learning-ready
datasets from real traffic data, and a Feature Selection Filter based on fast
correlation was proposed and tested on several classification datasets. Then, the
original Network Traffic Classification datasets are reduced using our Selection
Filter to provide efficient classification models. Many classification models based on
CART Decision Trees were analyzed exhibiting excellent outcomes in identifying
various Internet applications. The experiments presented in this research comprise
a comparison amongst ensemble learning schemes, an exploratory study on Class
Imbalance and solutions; and an analysis of IP-header predictors for early traffic
classification. This thesis is presented in the form of a compendium of JCR-indexed
scientific manuscripts and, furthermore, one conference paper is included.
In the present work we study a wide range of learning approaches employing the
most advanced methodology in Machine Learning. As a result, we identify the
strengths and weaknesses of these algorithms, providing our own solutions to
overcome the observed limitations. In short, this thesis proves that Machine
Learning offers interesting advanced techniques that open promising prospects in
Internet Network Traffic Classification. Departamento de Teoría de la Señal y Comunicaciones e Ingeniería Telemática. Doctorado en Tecnologías de la Información y las Telecomunicacione
Click Fraud Detection in Online and In-app Advertisements: A Learning Based Approach
Click Fraud is the fraudulent act of clicking on pay-per-click advertisements to increase a site’s revenue, to drain revenue from the advertiser, or to inflate the popularity of content on social media platforms. In-app advertisements on mobile platforms are among the most common targets for click fraud, which makes companies hesitant to advertise their products. Fraudulent clicks are supposed to be caught by ad providers as part of their service to advertisers, which is commonly done using machine learning methods. However: (1) there is a lack of research in current literature addressing and evaluating the different techniques of click fraud detection and prevention, (2) threat models composed of active learning systems (smart attackers) can mislead the training process of the fraud detection model by polluting the training data, (3) current deep learning models have significant computational overhead, (4) training data is often in an imbalanced state, and balancing it still results in noisy data that can train the classifier incorrectly, and (5) datasets with high dimensionality cause increased computational overhead and decreased classifier correctness -- while existing feature selection techniques address this issue, they have their own performance limitations. By extending the state-of-the-art techniques in the field of machine learning, this dissertation provides the following solutions: (i) To address (1) and (2), we propose a hybrid deep-learning-based model which consists of an artificial neural network, auto-encoder and semi-supervised generative adversarial network. (ii) As a solution for (3), we present Cascaded Forest and Extreme Gradient Boosting with less hyperparameter tuning. (iii) To overcome (4), we propose a row-wise data reduction method, KSMOTE, which filters out noisy data samples both in the raw data and the synthetically generated samples. 
(iv) For (5), we propose different column-reduction methods such as multi-time-scale Time Series analysis for fraud forecasting, using binary labeled imbalanced datasets and hybrid filter-wrapper feature selection approaches
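For context on the oversampling step that KSMOTE builds on, here is a minimal sketch of plain SMOTE-style interpolation. This is not the dissertation's KSMOTE method and it contains no noise filtering; the function name, parameters, and sample points are illustrative assumptions. Each synthetic sample is placed at a random position on the line segment between a minority sample and one of its k nearest minority neighbours.

```python
import math
import random

def smote_sketch(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points by interpolating between minority neighbours.

    Assumes `minority` holds at least two equal-length numeric tuples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base point.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Hypothetical minority-class samples in a two-feature space.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority, n_new=6, k=2)
print(new_points)
```

Because new points are interpolated from existing ones, any noisy minority sample can spawn further noisy synthetic samples, which is the kind of problem a filtering step is meant to address.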
Combine Sampling - Least Square Support Vector Machine for Multi-Class Imbalanced Data Classification
Classification analysis is the process of finding the best classifier model for
predicting the class of an object or data item whose class label is unknown. In real
life, especially in the medical field, multi-class classification problems with
imbalanced data sets are often encountered. Imbalanced data is a problem in
multi-class classification because the learning classifier tends to predict the class
with many samples (the majority class) over the minority class. As a result,
prediction accuracy is good for the class with many training samples (the majority
class) but poor for the class with few training samples (the minority class).
Therefore, this research applies the Combine Sampling (SMOTE + Tomek Links) LS-SVM
method to multi-class imbalanced classification using medical data. The data used
are thyroid, breast cancer, and cervical cancer data. The experiments use q-fold
cross-validation with q = 5 and q = 10. LS-SVM One Against One (OAO) is used for
multi-class classification. The RBF kernel parameter (σ) and C are optimized using
PSO-GSA. The results show that the best method for predicting the status of patients
with thyroid disease, breast cancer, and cervical cancer is Combine Sampling Least
Square Support Vector Machine PSO-GSA. Classification using q-fold cross-validation
with q = 5 and q = 10 yields the same performance in terms of total accuracy,
sensitivity, and G-mean
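The Tomek Links half of the combined sampling step can be sketched as follows. A Tomek link is a pair of samples from different classes that are each other's nearest neighbour; cleaning methods typically remove such pairs, or only their majority-class member, after oversampling. The function name and toy data below are hypothetical, and this brute-force search is only meant for small examples.

```python
import math

def tomek_links(points, labels):
    """Return index pairs (i, j), i < j, that form Tomek links.

    Assumes at least two points; ties in distance go to the lowest index.
    """
    def nearest(i):
        candidates = (j for j in range(len(points)) if j != i)
        return min(candidates, key=lambda j: math.dist(points[i], points[j]))

    nn = [nearest(i) for i in range(len(points))]
    # Keep mutual nearest neighbours whose class labels differ.
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and labels[i] != labels[j]]

# Hypothetical toy data: points 0 and 1 sit close together with different labels.
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.2, 1.0), (5.0, 5.0)]
labels = [0, 1, 0, 0, 1]
links = tomek_links(points, labels)
print(links)
```

Points 2 and 3 are also mutual nearest neighbours, but they share a label, so they do not form a link.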
A Feasibility Study of Azure Machine Learning for Sheet Metal Fabrication
The research demonstrated that sheet metal fabrication machines can utilize machine learning to gain a competitive advantage. Of the various possible applications of machine learning, it was decided to focus on predictive maintenance. The predictive service was implemented with Microsoft Azure Machine Learning. The aim was to demonstrate to the stakeholders at the case company the potential of machine learning. It was found that, although machine learning technologies are founded on sophisticated algorithms and mathematics, they can still be utilized and bring benefits with moderate effort. The significance of this study lies in demonstrating the potential of machine learning for improving operations management, especially for sheet metal fabrication machines.
Searching for Needles in the Cosmic Haystack
Searching for pulsar signals in radio astronomy data sets is a difficult task. The data sets are extremely large, approaching the petabyte scale, and are growing larger as instruments become more advanced. Big Data brings with it big challenges. Processing the data to identify candidate pulsar signals is computationally expensive and must utilize parallelism to be scalable. Labeling benchmarks for supervised classification is costly. To compound the problem, pulsar signals are very rare, e.g., only 0.05% of the instances in one data set represent pulsars. Furthermore, there are many different approaches to candidate classification with no consensus on a best practice. This dissertation is focused on identifying and classifying radio pulsar candidates from single pulse searches. First, to identify and classify Dispersed Pulse Groups (DPGs), we developed a supervised machine learning approach that consists of RAPID (a novel peak identification algorithm), feature extraction, and supervised machine learning classification. We tested six algorithms for classification with four imbalance treatments. Results showed that classifiers with imbalance treatments had higher recall values. Overall, classifiers using multiclass RandomForests combined with Synthetic Minority Oversampling TEchnique (SMOTE) were the most efficient; they identified additional known pulsars not in the benchmark, with fewer false positives than other classifiers. Second, we developed a parallel single pulse identification method, D-RAPID, and introduced a novel automated multiclass labeling (ALM) technique that we combined with feature selection to improve execution performance. D-RAPID improved execution performance over RAPID by a factor of 5. We also showed that the combination of ALM and feature selection sped up the execution performance of RandomForest by 54% on average with less than a 2% average reduction in classification performance.
Finally, we proposed CoDRIFt, a novel classification algorithm that is distributed for scalability and employs semi-supervised learning to leverage unlabeled data to inform classification. We evaluated and compared CoDRIFt to eleven other classifiers. The results showed that CoDRIFt excelled at classifying candidates in imbalanced benchmarks with a majority of non-pulsar signals (>95%). Furthermore, CoDRIFt models created with very limited sets of labeled data (as few as 22 labeled minority class instances) were able to achieve high recall (mean = 0.98). In comparison to the other algorithms trained on similar sets, CoDRIFt outperformed them all, with recall 2.9% higher than the next best classifier and a 35% average improvement over all eleven classifiers. CoDRIFt is customizable for other problem domains with very large, imbalanced data sets, such as fraud detection and cyber attack detection