
    A SUMMARY OF Classification and Regression Tree WITH APPLICATION

    Classification and regression trees (CART) are a non-parametric methodology first introduced by Breiman and colleagues in 1984. CART divides a population into meaningful subgroups, allowing groups of interest to be identified. As a classification method, CART constructs decision trees; depending on the information available about the dataset, either a classification tree or a regression tree can be built. The first part of this paper describes the fundamental principles of tree construction, the pruning procedure, and different splitting algorithms. The second part discusses when the CART method should or should not be used: its advantages and weaknesses are examined and tested in detail. Finally, CART is applied to an example with real data using the statistical software R, and some graphical and plotting tools are presented.
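The abstract above mentions an application in R; as a minimal illustrative sketch of the same idea in Python, the following fits a CART-style tree with cost-complexity pruning. The iris data and the `ccp_alpha` value are stand-ins, not the paper's actual dataset or tuning.

```python
# Illustrative CART sketch with scikit-learn (the paper itself uses R).
# The iris dataset stands in for the paper's real data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# ccp_alpha enables cost-complexity pruning, the pruning scheme used by CART
tree = DecisionTreeClassifier(criterion="gini", ccp_alpha=0.01, random_state=0)
tree.fit(X_train, y_train)
accuracy = tree.score(X_test, y_test)
```

Growing the full tree and then pruning back with a complexity penalty mirrors the construction/pruning split described in the first part of the paper.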

    Classification of the machine state in turning processes by using the acoustic emission

    Processing digital information stands as a crucial foundation of Industry 4.0, facilitating a spectrum of activities from monitoring processes to their understanding and optimization. The application of data processing techniques, including feature extraction and classification, coupled with the identification of the most suitable features for specific purposes, continues to pose a significant challenge in the manufacturing sector. This research investigates the suitability of classification methods for machine and tool state classification by employing acoustic emission (AE) sensors during the dry turning of Ti6Al4V. Features such as quantiles, Fourier coefficients, and mel-frequency cepstral coefficients are extracted from the AE signals to facilitate classification. From these features, the 20 best are selected for classification to reduce the dimension of the feature space and redundancy. Algorithms including decision tree, k-nearest-neighbors (KNN), and quadratic discriminant analysis (QDA) are tested for the classification of machine states. Of these, QDA exhibits the highest accuracy at 98.6%. Nonetheless, an examination of the confusion matrix reveals that certain classes, influenced by imbalanced training data, exhibit a lower prediction accuracy. In summary, the study affirms the potential of AE sensors for machine state recognition and tool condition monitoring. Although QDA emerges as the most accurate classifier, there remains an avenue for refinement, particularly in training data optimization and decision-making processes, to augment accuracy.
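A hedged sketch of the classification stage described above: select the 20 best features, then fit QDA. The selection criterion (ANOVA F-score) and the synthetic data are assumptions here; the paper extracts its features (quantiles, Fourier and MFCC coefficients) from real AE signals and does not specify this exact selector.

```python
# Sketch: 20-feature selection followed by QDA, on synthetic stand-in data.
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the extracted AE feature vectors
X, y = make_classification(n_samples=400, n_features=60,
                           n_informative=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Keep the 20 highest-scoring features, then fit QDA on the reduced space
clf = make_pipeline(SelectKBest(f_classif, k=20),
                    QuadraticDiscriminantAnalysis())
clf.fit(X_tr, y_tr)
accuracy = clf.score(X_te, y_te)
```

Putting selector and classifier in one pipeline keeps the feature ranking inside the training fold, which avoids leaking test information into the selection step.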

    Optimal Ensemble Learning Based on Distinctive Feature Selection by Univariate ANOVA-F Statistics for IDS

    Cyber-attacks are increasing day by day. The volume of data generated by the world's population is escalating immensely, and advances in technology are in turn creating more opportunities for vulnerabilities in individuals' personal data. Containing threats to data security has become a major challenge worldwide. These threats not only target user data but can also destroy whole network infrastructures at the local or global level, and the attacks may be hardware- or software-based. The central objective of this paper is to design an intrusion detection system using ensemble learning, specifically Decision Trees, with distinctive feature selection via the univariate ANOVA-F test. Decision Trees are among the most popular base learners in ensemble methods and outperform other classification algorithms in various respects. With suitable feature selection techniques, performance increases further and the detection outcome is less prone to misclassification. Analysis of Variance (ANOVA) with F-statistic computations is a reasonable criterion for choosing distinctive features in the given network traffic data. The technique is applied and tested on the NSL-KDD network dataset. Various performance measures, such as accuracy, precision, F-score, and cross-validation curves, are reported to justify the ability of the method.
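The core pipeline described in this abstract, univariate ANOVA-F feature selection feeding a decision tree, can be sketched as follows. The synthetic data, the choice of `k=10`, and the 5-fold evaluation are illustrative assumptions; the paper works on the NSL-KDD dataset with its own settings.

```python
# Sketch: ANOVA F-test feature selection + decision tree, cross-validated.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for NSL-KDD traffic records (41 features there)
X, y = make_classification(n_samples=500, n_features=41,
                           n_informative=10, random_state=0)

# f_classif computes the ANOVA F-statistic per feature; keep the top 10
pipe = make_pipeline(SelectKBest(f_classif, k=10),
                     DecisionTreeClassifier(random_state=0))
scores = cross_val_score(pipe, X, y, cv=5)
mean_accuracy = scores.mean()
```

Ranking features by their ANOVA F-statistic is exactly the "distinctive feature selection" criterion the abstract names: features whose means differ strongly between attack and normal traffic score highest.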

    Functional data analysis methods for predicting disease status.

    Introduction: Differential scanning calorimetry (DSC) is used to determine thermally-induced conformational changes of biomolecules within a blood plasma sample. Recent research has indicated that DSC curves (or thermograms) may have different characteristics based on disease status and, thus, may be useful as a monitoring and diagnostic tool for some diseases. Since thermograms are curves measured over a range of temperature values, they are often treated as functional data. In this dissertation we propose and apply functional data analysis (FDA) techniques to analyze DSC data from the Lupus Family Registry and Repository (LFRR). The aim is to develop FDA methods to create models for classifying lupus vs. control patients on the basis of the thermogram curves. Methods: In project 1 we examine how well standard functional regression is able to capture the differences in curves for cases and controls and compare this to a multivariate approach. In project 2 we develop a semiparametric model, the Generalized Functional Partially Linear Single-Index Model (GFPL). This model is useful when there is some curvature or non-linearity in the logit that cannot be modeled by the standard Functional Generalized Linear Model (FGLM). It also mitigates the curse of dimensionality, is more flexible, and yields interpretable results. In project 3, we propose a tree-based method for functional data, Local Basis Random Forests (LBRF). This non-parametric method allows us to focus on significant parts of the functional covariates and reduce the noise level. Results: The standard functional logistic regression model with FPCA scores as the predictors gives an 81.25% correct classification rate on the test data, comparable to results from the multivariate approach. The proposed GFPL gives prediction accuracies and standard errors that are better than those of the standard FGLM when nonlinearity is present. The LBRF for functional data yields high prediction accuracy (as high as 97% in simulations and 92% in the Lupus data), especially when the true signal is localized, and is able to capture where the true signal lies.
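The first model above, functional logistic regression on FPCA scores, can be approximated in a few lines: PCA applied to densely sampled curves plays the role of FPCA, and the resulting scores feed a logistic model. Everything here is a labeled stand-in, the simulated bump curves are not LFRR thermograms, and ordinary PCA on a fine grid is only an approximation of a proper FPCA basis expansion.

```python
# Sketch: FPCA-score logistic regression, approximated by PCA on a grid.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Simulated thermogram-like curves on 100 "temperature" points;
# cases (y=1) get a shifted peak. Purely illustrative, not LFRR data.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 100)
n = 200
y = rng.integers(0, 2, n)
curves = np.exp(-((t - 0.4 - 0.15 * y[:, None]) ** 2) / 0.01)
curves += 0.05 * rng.standard_normal((n, 100))

X_tr, X_te, y_tr, y_te = train_test_split(curves, y, random_state=0)
# PCA on the discretized curves approximates FPCA; the component scores
# become the scalar predictors of a logistic regression
clf = make_pipeline(PCA(n_components=5), LogisticRegression())
clf.fit(X_tr, y_tr)
accuracy = clf.score(X_te, y_te)
```

Reducing each whole curve to a handful of component scores is what makes logistic regression feasible here; fitting on all 100 grid values directly would face the curse of dimensionality the abstract mentions.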

    Deep churn prediction method for telecommunication industry

    Being able to predict the churn rate is key to success in the telecommunication industry, and it is also important for obtaining a high profit. Thus, the challenge is to predict the churn percentage of customers with higher accuracy without compromising profit. In this study, various types of learning strategies are investigated to address this challenge and build a churn prediction model. Ensemble learning techniques (AdaBoost, random forest (RF), extremely randomized trees (ERT), XGBoost (XGB), gradient boosting (GBM), bagging, and stacking), traditional classification techniques (logistic regression (LR), decision tree (DT), k-nearest neighbor (kNN), and artificial neural network (ANN)), and the deep learning convolutional neural network (CNN) technique were tested to select the best model for customer churn prediction. The proposed models were evaluated on two public datasets: one from the Southeast Asian telecom industry and one from the American telecom market. On both datasets, CNN and ANN returned better results than the other techniques. The accuracy obtained on the first dataset was 99% using CNN and 98% using ANN; on the second dataset it was 98% and 99%, respectively.
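A minimal sketch of the model-comparison setup described above, restricted to a few of the classical learners the abstract lists (the CNN/ANN models it favors are out of scope for a short snippet). The synthetic imbalanced table is an assumption standing in for the two telecom datasets.

```python
# Sketch: comparing several churn classifiers by cross-validated accuracy.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a churn table; churners are the minority class
X, y = make_classification(n_samples=600, n_features=20,
                           weights=[0.8, 0.2], random_state=0)

results = {}
for name, model in [
    ("LR", LogisticRegression(max_iter=1000)),
    ("RF", RandomForestClassifier(random_state=0)),
    ("GBM", GradientBoostingClassifier(random_state=0)),
]:
    results[name] = cross_val_score(model, X, y, cv=5).mean()
```

Note the class imbalance baked into the stand-in data: with roughly 20% churners, plain accuracy can look flattering, which is one reason churn studies often report precision/recall alongside it.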

    A Machine Learning-Based Anomaly Prediction Service for Software-Defined Networks

    Software-defined networking (SDN) has gained tremendous growth and can be exploited in different network scenarios, from data centers to wide-area 5G networks. It shifts control logic from the devices to a centralized entity (a programmable controller) for efficient traffic monitoring and flow management. A software-based controller enforces rules and policies on the requests sent by forwarding elements; however, it cannot detect anomalous patterns in the network traffic. As a result, the controller may install flow rules for anomalous traffic, reducing overall network performance. These anomalies may indicate threats to the network and degrade both its performance and its security. Machine learning (ML) approaches can identify such traffic flow patterns and predict the system's impending threats. In this work, we propose an ML-based service to predict traffic anomalies for software-defined networks. We first create a large network traffic dataset by modeling a programmable data center with a signature-based intrusion-detection system. Feature vectors are pre-processed and constructed for each flow request issued by a forwarding element. We then feed the feature vector of each request to a machine learning classifier trained to predict anomalies. Finally, we evaluate the proposed approach using the holdout cross-validation technique. The evaluation results show that the proposed approach is highly accurate: relative to the baseline approaches (random prediction and zero rule), its improvement in average accuracy, precision, recall, and F-measure is (54.14%, 65.30%, 81.63%, and 73.70%) and (4.61%, 11.13%, 9.45%, and 10.29%), respectively.
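The evaluation protocol above, holdout validation of a trained classifier against random-prediction and zero-rule baselines, can be sketched as follows. The random forest, the synthetic flow features, and the 70/30 split are assumptions for illustration; the paper does not fix those choices in this abstract.

```python
# Sketch: holdout evaluation vs. the two baselines named in the abstract.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for per-flow feature vectors; anomalies are class 1
X, y = make_classification(n_samples=800, n_features=15, n_informative=5,
                           weights=[0.7, 0.3], random_state=0)
# Holdout: a single stratified train/test split
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
f1_model = f1_score(y_te, clf.predict(X_te))

# Baselines: random prediction and zero rule (always the majority class)
rand = DummyClassifier(strategy="uniform", random_state=0).fit(X_tr, y_tr)
zero = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
f1_random = f1_score(y_te, rand.predict(X_te))
f1_zero_rule = f1_score(y_te, zero.predict(X_te))
```

The zero-rule baseline never predicts the anomaly class, so its F-measure on anomalies collapses, which is why such comparisons favor F-measure over raw accuracy on imbalanced traffic.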

    Classification of microarrays; synergistic effects between normalization, gene selection and machine learning

    Background: Machine learning is a powerful approach for describing and predicting classes in microarray data. Although several comparative studies have investigated the relative performance of various machine learning methods, these often do not account for the fact that performance (e.g. error rate) is the result of a series of analysis steps, of which the most important are data normalization, gene selection, and machine learning. Results: In this study, we used seven previously published cancer-related microarray data sets to compare the effects on classification performance of five normalization methods, three gene selection methods with 21 different numbers of selected genes, and eight machine learning methods. Performance in terms of error rate was rigorously estimated by repeatedly employing a double cross-validation approach. Since performance varies greatly between data sets, we devised an analysis method that first compares methods within individual data sets and then visualizes the comparisons across data sets. We discovered both well-performing individual methods and synergies between different methods. Conclusion: Support Vector Machines with a radial basis kernel, linear kernel, or polynomial kernel of degree 2 all performed consistently well across data sets. We show that there is a synergistic relationship between these methods and gene selection based on the t-test with a relatively high number of selected genes. We also find that these methods benefit significantly from using normalized data, although it is hard to draw general conclusions about the relative performance of different normalization procedures.
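The double (nested) cross-validation scheme this abstract relies on can be sketched as follows: an inner loop tunes the number of selected genes, and an outer loop estimates the error rate on data untouched by that tuning. The synthetic expression matrix and the candidate `k` values are assumptions; note that for two classes the ANOVA F-score used here ranks genes identically to the t-test the paper names.

```python
# Sketch: double cross-validation for gene selection + SVM classification.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Synthetic stand-in for a microarray matrix (samples x genes)
X, y = make_classification(n_samples=100, n_features=500,
                           n_informative=30, random_state=0)

pipe = Pipeline([("select", SelectKBest(f_classif)),
                 ("svm", SVC(kernel="linear"))])
# Inner loop: choose the number of genes by cross-validation
inner = GridSearchCV(pipe, {"select__k": [10, 50, 200]}, cv=3)
# Outer loop: estimate the error rate of the whole tuned procedure
scores = cross_val_score(inner, X, y, cv=5)
error_rate = 1.0 - scores.mean()
```

Keeping gene selection inside both loops matters: selecting genes once on the full data and then cross-validating only the classifier would make the error estimate optimistically biased, which is the pitfall double cross-validation guards against.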