413 research outputs found

    An empirical evaluation of imbalanced data strategies from a practitioner's point of view

    This research tested the following well-known strategies for dealing with binary imbalanced data on 82 different real-life data sets (sampled to imbalance rates of 5%, 3%, 1%, and 0.1%): class weight, SMOTE, underbagging, and a baseline (just the base classifier). As base classifiers we used SVM with RBF kernel, random forests, and gradient boosting machines, and we measured the quality of the resulting classifier using 6 different metrics (area under the curve (AUC), accuracy, F-measure, G-mean, Matthews correlation coefficient (MCC), and balanced accuracy). The best strategy strongly depends on the metric used to measure the quality of the classifier. For AUC and accuracy, class weight and the baseline perform better; for F-measure and MCC, SMOTE performs better; and for G-mean and balanced accuracy, underbagging
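
The dependence on the metric can be illustrated with a minimal sketch that computes the five threshold-based metrics named above from binary confusion-matrix counts. The function name and the two example confusion matrices are hypothetical and not taken from the study; AUC is omitted because it needs ranked scores rather than counts.

```python
import math

def imbalance_metrics(tp, fp, tn, fn):
    """Compute threshold-based metrics from binary confusion-matrix counts."""
    total = tp + fp + tn + fn
    recall = tp / (tp + fn) if (tp + fn) else 0.0        # sensitivity (TPR)
    specificity = tn / (tn + fp) if (tn + fp) else 0.0   # TNR
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    pr_sum = precision + recall
    f_measure = 2 * precision * recall / pr_sum if pr_sum else 0.0
    # MCC is conventionally reported as 0 when any marginal count is zero.
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "f_measure": f_measure,
        "g_mean": math.sqrt(recall * specificity),
        "mcc": mcc,
        "balanced_accuracy": (recall + specificity) / 2,
    }

# Hypothetical test set: 1000 instances, 1% positives.
always_negative = imbalance_metrics(tp=0, fp=0, tn=990, fn=10)
finds_minority = imbalance_metrics(tp=8, fp=40, tn=950, fn=2)

for name, result in [("always negative", always_negative),
                     ("finds minority", finds_minority)]:
    print(name, {k: round(v, 3) for k, v in result.items()})
```

In this made-up example the classifier that never predicts the minority class has the higher accuracy (0.99 versus 0.958) but a G-mean of 0, while the second classifier reaches a G-mean of about 0.88, which is one way the ranking of strategies can flip between metrics.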

    Detection of Dispersed Radio Pulses: A machine learning approach to candidate identification and classification

    Searching for extraterrestrial, transient signals in astronomical data sets is an active area of current research. However, machine learning techniques are lacking in the literature concerning single-pulse detection. This paper presents a new, two-stage approach for identifying and classifying dispersed pulse groups (DPGs) in single-pulse search output. The first stage identified DPGs and extracted features to characterize them using a new peak identification algorithm which tracks sloping tendencies around local maxima in plots of signal-to-noise ratio vs. dispersion measure. The second stage used supervised machine learning to classify DPGs. We created four benchmark data sets: one unbalanced and three balanced versions using three different imbalance treatments. We empirically evaluated 48 classifiers by training and testing binary and multiclass versions of six machine learning algorithms on each of the four benchmark versions. While each classifier had advantages and disadvantages, all classifiers with imbalance treatments had higher recall values than those with unbalanced data, regardless of the machine learning algorithm used. Based on the benchmarking results, we selected a subset of classifiers to classify the full, unlabelled data set of over 1.5 million DPGs identified in 42,405 observations made by the Green Bank Telescope. Overall, the classifiers using a multiclass ensemble tree learner in combination with two oversampling imbalance treatments were the most efficient; they identified additional known pulsars not in the benchmark data set and provided six potential discoveries, with significantly fewer false positives than the other classifiers. Comment: 13 pages, accepted for publication in MNRAS, ref. MN-15-1713-MJ.R

    Predicting Happiness - Comparison of Supervised Machine Learning Techniques Performance on a Multiclass Classification Problem

    In the modern world, especially in contemporary economies and politics, a population's subjective well-being is a frequent subject of public debate. As comparisons of happiness levels in different countries are published, different circumstances and their effect on the subjective well-being reported by people are also analysed. However, a significant amount of the research on subjective well-being and its determinants is still based on survey answers and employs conventional statistical methods that provide details regarding correlations and causality between different factors and subjective well-being. Applying Supervised Machine Learning techniques to the prediction of subjective well-being may provide new ways of understanding how individual factors contribute to its value and allow for addressing any issues that may affect mental and physical health. The focus of this research is to use the survey data and make predictions regarding subjective well-being (a multiclass target) using Supervised Machine Learning models. In particular, the study is aimed at comparing the performance of two techniques: Decision Tree and Neural Networks. The "C4.5 algorithm" used by the Decision Trees is considered the benchmark algorithm to which other supervised learning algorithms should be compared. At the same time, Neural Networks were previously proven to have high predictive power, even with multiclass categorisation problems. Two experiments are conducted as part of this research, one using the original, highly imbalanced data; the other using the dataset balanced with SMOTE. The experimental results gathered show that for the first experiment there is no statistically significant difference (

    Application of advanced machine learning techniques to early network traffic classification

    The fast-paced evolution of the Internet is creating a complex context that imposes demanding requirements for assuring end-to-end Quality of Service. The development of advanced intelligent approaches in networking envisions features that include autonomous resource allocation, fast reaction to unexpected network events, and so on. Internet Network Traffic Classification constitutes a crucial source of information for Network Management and is decisive in assisting the emerging network control paradigms. Monitoring traffic flowing through network devices supports tasks such as network orchestration, traffic prioritization, network arbitration, and cyberthreat detection, amongst others. The traditional traffic classifiers became obsolete owing to the rapid evolution of the Internet. Port-based classifiers suffer from significant accuracy losses due to port masking, while Deep Packet Inspection approaches have severe user-privacy limitations. The advent of Machine Learning has propelled the application of advanced algorithms in diverse research areas, and some learning approaches have proved to be an interesting alternative to the classic traffic classification approaches. Addressing Network Traffic Classification from a Machine Learning perspective implies numerous challenges that demand research efforts to achieve feasible classifiers. In this dissertation, we endeavor to formulate and solve important research questions in Machine-Learning-based Network Traffic Classification. As a result of numerous experiments, the knowledge provided in this research constitutes an engaging case study in which network traffic data from two different environments are successfully collected, processed and modeled. Firstly, we approached the Feature Extraction and Selection processes, providing our own contributions. 
A Feature Extractor was designed to create Machine-Learning-ready datasets from real traffic data, and a Feature Selection Filter based on fast correlation was proposed and tested on several classification datasets. Then, the original Network Traffic Classification datasets were reduced using our Selection Filter to provide efficient classification models. Many classification models based on CART Decision Trees were analyzed, exhibiting excellent outcomes in identifying various Internet applications. The experiments presented in this research comprise a comparison amongst ensemble learning schemes, an exploratory study on Class Imbalance and its solutions, and an analysis of IP-header predictors for early traffic classification. This thesis is presented in the form of a compendium of JCR-indexed scientific manuscripts and, furthermore, one conference paper is included. In the present work we study a wide range of learning approaches employing the most advanced methodology in Machine Learning. As a result, we identify the strengths and weaknesses of these algorithms, providing our own solutions to overcome the observed limitations. In short, this thesis shows that Machine Learning offers interesting advanced techniques that open promising prospects in Internet Network Traffic Classification. Departamento de Teoría de la Señal y Comunicaciones e Ingeniería Telemática. Doctorado en Tecnologías de la Información y las Telecomunicacione
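
The abstract does not specify how the correlation-based Feature Selection Filter works. As a rough illustration only, the sketch below ranks discrete features by symmetrical uncertainty, the correlation measure used in FCBF-style filters; treating the thesis filter as FCBF-like is an assumption, and the function names are hypothetical. Only the relevance-ranking step is shown, not any redundancy removal.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), a value in [0, 1]."""
    h_x, h_y = entropy(x), entropy(y)
    if h_x + h_y == 0:
        return 0.0  # both variables are constant
    h_xy = entropy(list(zip(x, y)))      # joint entropy H(X, Y)
    mutual_info = h_x + h_y - h_xy       # I(X; Y)
    return 2.0 * mutual_info / (h_x + h_y)

def rank_features(columns, labels):
    """Return (feature name, SU with the class label), most relevant first."""
    scores = {name: symmetrical_uncertainty(col, labels)
              for name, col in columns.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical toy data: "a" determines the label, "b" is independent of it.
labels = [0, 0, 1, 1]
columns = {"a": [0, 0, 1, 1], "b": [0, 1, 0, 1]}
print(rank_features(columns, labels))
```

A filter of this kind would keep features whose score exceeds a chosen threshold; continuous traffic features would need to be discretized first.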

    Click Fraud Detection in Online and In-app Advertisements: A Learning Based Approach

    Click Fraud is the fraudulent act of clicking on pay-per-click advertisements to increase a site’s revenue, to drain revenue from the advertiser, or to inflate the popularity of content on social media platforms. In-app advertisements on mobile platforms are among the most common targets for click fraud, which makes companies hesitant to advertise their products. Fraudulent clicks are supposed to be caught by ad providers as part of their service to advertisers, which is commonly done using machine learning methods. However: (1) there is a lack of research in current literature addressing and evaluating the different techniques of click fraud detection and prevention, (2) threat models composed of active learning systems (smart attackers) can mislead the training process of the fraud detection model by polluting the training data, (3) current deep learning models have significant computational overhead, (4) training data is often in an imbalanced state, and balancing it still results in noisy data that can train the classifier incorrectly, and (5) datasets with high dimensionality cause increased computational overhead and decreased classifier correctness -- while existing feature selection techniques address this issue, they have their own performance limitations. By extending the state-of-the-art techniques in the field of machine learning, this dissertation provides the following solutions: (i) To address (1) and (2), we propose a hybrid deep-learning-based model which consists of an artificial neural network, auto-encoder and semi-supervised generative adversarial network. (ii) As a solution for (3), we present Cascaded Forest and Extreme Gradient Boosting with less hyperparameter tuning. (iii) To overcome (4), we propose a row-wise data reduction method, KSMOTE, which filters out noisy data samples both in the raw data and the synthetically generated samples. 
(iv) For (5), we propose different column-reduction methods such as multi-time-scale Time Series analysis for fraud forecasting, using binary labeled imbalanced datasets and hybrid filter-wrapper feature selection approaches

    Combine Sampling - Least Square Support Vector Machine for Multi-Class Imbalanced Data Classification

    Classification analysis is the process of finding the best classifier model for predicting the class of an object or data point whose class label is unknown. In practice, especially in the medical field, multi-class classification problems with imbalanced data sets are common. Imbalance is a problem in multi-class classification because the learning classifier tends to predict the class with many instances (the majority class) rather than the minority class. As a result, prediction accuracy is good for the class with many training instances (the majority class) and poor for the class with few training instances (the minority class). This research therefore applies the Combine Sampling (SMOTE + Tomek Links) LS-SVM method to multi-class imbalanced classification using medical data. The data sets used are thyroid, breast cancer, and cervical cancer data. The experiments use q-fold cross-validation (q = 5 and q = 10). LS-SVM One Against One (OAO) is used for multi-class classification. The RBF kernel parameter (σ) and C are optimized using PSO-GSA. The results show that the best method for predicting the status of patients with thyroid disease, breast cancer, and cervical cancer is Combine Sampling Least Square Support Vector Machine with PSO-GSA. Classification using q-fold cross-validation with q = 5 and q = 10 yields the same performance in terms of total accuracy, sensitivity, and G-mean
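
The combined sampling step (SMOTE followed by Tomek-link cleaning) can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation: the function names, the choice of k, and the decision to drop both members of each Tomek link are assumptions, and the LS-SVM and PSO-GSA stages are not shown.

```python
import math
import random

def nearest(idx, points):
    """Index of the point closest to points[idx], excluding idx itself."""
    return min((j for j in range(len(points)) if j != idx),
               key=lambda j: math.dist(points[idx], points[j]))

def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours (needs >= 2 points)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        neighbours = sorted((j for j in range(len(minority)) if j != i),
                            key=lambda j: math.dist(minority[i], minority[j]))[:k]
        j = rng.choice(neighbours)
        gap = rng.random()  # position along the segment between the two points
        synthetic.append(tuple(a + gap * (b - a)
                               for a, b in zip(minority[i], minority[j])))
    return synthetic

def tomek_links(points, labels):
    """Pairs (i, j) of opposite-class points that are each other's nearest neighbour."""
    links = []
    for i in range(len(points)):
        j = nearest(i, points)
        if labels[i] != labels[j] and nearest(j, points) == i and i < j:
            links.append((i, j))
    return links

def smote_tomek(points, labels, minority_label, n_new):
    """Oversample the minority class, then drop both members of every Tomek link."""
    minority = [p for p, y in zip(points, labels) if y == minority_label]
    points = list(points) + smote(minority, n_new)
    labels = list(labels) + [minority_label] * n_new
    drop = {i for pair in tomek_links(points, labels) for i in pair}
    keep = [i for i in range(len(points)) if i not in drop]
    return [points[i] for i in keep], [labels[i] for i in keep]
```

For a multi-class data set, the oversampling would be repeated for each minority class before the cleaning pass; the resampled data would then be passed to the classifier inside each cross-validation fold's training split only.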

    A Feasibility Study of Azure Machine Learning for Sheet Metal Fabrication

    The research demonstrated that sheet metal fabrication machines can utilize machine learning to gain a competitive advantage. Among the various possible applications of machine learning, it was decided to focus on predictive maintenance. The predictive service was implemented with Microsoft Azure Machine Learning. The aim was to demonstrate the potential of machine learning to the stakeholders at the case company. It was found that, although machine learning technologies are founded on sophisticated algorithms and mathematics, they can still be utilized and bring benefits with moderate effort. The significance of this study lies in demonstrating the potential of machine learning for improving operations management, especially for sheet metal fabrication machines.

    Searching for Needles in the Cosmic Haystack

    Searching for pulsar signals in radio astronomy data sets is a difficult task. The data sets are extremely large, approaching the petabyte scale, and are growing larger as instruments become more advanced. Big Data brings with it big challenges. Processing the data to identify candidate pulsar signals is computationally expensive and must utilize parallelism to be scalable. Labeling benchmarks for supervised classification is costly. To compound the problem, pulsar signals are very rare, e.g., only 0.05% of the instances in one data set represent pulsars. Furthermore, there are many different approaches to candidate classification with no consensus on a best practice. This dissertation is focused on identifying and classifying radio pulsar candidates from single pulse searches. First, to identify and classify Dispersed Pulse Groups (DPGs), we developed a supervised machine learning approach that consists of RAPID (a novel peak identification algorithm), feature extraction, and supervised machine learning classification. We tested six algorithms for classification with four imbalance treatments. Results showed that classifiers with imbalance treatments had higher recall values. Overall, classifiers using multiclass RandomForests combined with Synthetic Minority Oversampling TEchnique (SMOTE) were the most efficient; they identified additional known pulsars not in the benchmark, with fewer false positives than other classifiers. Second, we developed a parallel single pulse identification method, D-RAPID, and introduced a novel automated multiclass labeling (ALM) technique that we combined with feature selection to improve execution performance. D-RAPID improved execution performance over RAPID by a factor of 5. We also showed that the combination of ALM and feature selection sped up the execution performance of RandomForest by 54% on average with less than a 2% average reduction in classification performance. 
Finally, we proposed CoDRIFt, a novel classification algorithm that is distributed for scalability and employs semi-supervised learning to leverage unlabeled data to inform classification. We evaluated and compared CoDRIFt to eleven other classifiers. The results showed that CoDRIFt excelled at classifying candidates in imbalanced benchmarks with a majority of non-pulsar signals (>95%). Furthermore, CoDRIFt models created with very limited sets of labeled data (as few as 22 labeled minority class instances) were able to achieve high recall (mean = 0.98). In comparison to the other algorithms trained on similar sets, CoDRIFt outperformed them all, with recall 2.9% higher than the next best classifier and a 35% average improvement over all eleven classifiers. CoDRIFt is customizable for other problem domains with very large, imbalanced data sets, such as fraud detection and cyber attack detection