Search CORE

668 research outputs found

Automated Empirical Selection of Rule Induction Methods Based on Recursive Iteration of Resampling Methods

Author: B. Efron
B. Efron
C. Schaffer
C. Schaffer
G.J. McLachlan
J.R. Quinlan
P. Clark
R.S. Michalski
R.S. Michalski
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2010
Field of study

Handling minority class problem in threats detection based on heterogeneous ensemble learning approach.

Author: Ahriz Hatem
Eke Hope
Petrovski Andrei
Publication venue: IGI Global
Publication date: 31/07/2020
Field of study

Multiclass problem, such as detecting multi-steps behaviour of Advanced Persistent Threats (APTs) have been a major global challenge, due to their capability to navigates around defenses and to evade detection for a prolonged period of time. Targeted APT attacks present an increasing concern for both cyber security and business continuity. Detecting the rare attack is a classification problem with data imbalance. This paper explores the applications of data resampling techniques, together with heterogeneous ensemble approach for dealing with data imbalance caused by unevenly distributed data elements among classes with our focus on capturing the rare attack. It has been shown that the suggested algorithms provide not only detection capability, but can also classify malicious data traffic corresponding to rare APT attacks

Open Access Institutional Repository at Robert Gordon University

Recommended from our members

Effective techniques for handling incomplete data using decision trees

Author: Twala Bhekisipho E.T.H.
Publication venue
Publication date: 01/01/2005
Field of study

Decision Trees (DTs) have been recognized as one of the most successful formalisms for knowledge representation and reasoning and are currently applied to a variety of data mining or knowledge discovery applications, particularly for classification problems. There are several efficient methods to learn a DT from data. However, these methods are often limited to the assumption that data are complete. In this thesis, some contributions to the field of machine learning and statistics that solve the problem of extracting DTs for learning and classification tasks from incomplete databases are presented. The methodology underlying the thesis blends together well-established statistical theories with the most advanced techniques for machine learning and automated reasoning with uncertainty. The first contribution is the extensive simulations which study the impact of missing data on predictive accuracy of existing DTs which can cope with missing values, when missing values are in both the training and test sets or when they are in either of the two sets. All simulations are performed under missing completely at random, missing at random and informatively missing mechanisms and for different missing data patterns and proportions. The proposal of a simple, novel, yet effective proposed procedure for training and testing using decision trees in the presence of missing data is the next contribution. Original and simple splitting criteria for attribute selection in tree building are put forward. The proposed technique is evaluated and validated in empirical tests over many real world application domains. In this work, the proposed algorithm maintains (sometimes exceeds) the outstanding accuracy of multiple imputation, especially on datasets containing mixed attributes and purely nominal attributes. Also, the proposed algorithm greatly improves in accuracy for IM data. Another major advantage of this method over multiple imputation is the important saving in computational resources due to it simplicity. The next contribution is the proposal of three versions of simple probabilistic techniques that could be used for classifying incomplete vectors using decision trees based on complete data. The proposed procedure is superficially similar to that of fractional cases but more effective. The experimental results demonstrate that these approaches can achieve comparative quality to sophisticated algorithms like multiple imputation and therefore are applicable to all kinds of datasets. Finally, novel uses of two proposed ensemble procedures for handling incomplete training and test data are proposed and discussed. The algorithms combine the two best approaches either with resampling (REMIMIA) or without resampling (EMIMIA) of the training data before growing the decision trees. Experiments are used to evaluate and validate the success of the proposed ensemble methods with respect to individual missing data techniques in the form of empirical tests. EMIMIA attains the highest overall level of prediction accuracy

Open Research Online (The Open University)

Empowering One-vs-One Decomposition with Ensemble Learning for Multi-Class Imbalanced Data

Author: García López Salvador
Herrera Triguero Francisco
Krawczyk Bartosz
Rosales-Pérez Alejandro
Zhang Zhongliang
Publication venue: 'Elsevier BV'
Publication date: 01/04/2016
Field of study

Zhongliang Zhang was supported by the National Science Foundation of China (NSFC Proj. 61273204) and CSC Scholarship Program (CSC NO. 201406080059). Bartosz Krawczyk was supported by the Polish National Science Center under the grant no. UMO-2015/19/B/ST6/01597. Salvador Garcia and Francisco Herrera were partially supported by the Spanish Ministry of Education and Science under Project TIN2014-57251-P and the Andalusian Research Plan P10-TIC-6858, P11-TIC-7765. Alejandro Rosales-Perez was supported by the CONACyT grant 329013.Multi-class imbalance classification problems occur in many real-world applications, which suffer from the quite different distribution of classes. Decomposition strategies are well-known techniques to address the classification problems involving multiple classes. Among them binary approaches using one-vs-one and one-vs-all has gained a significant attention from the research community. They allow to divide multi-class problems into several easier-to-solve two-class sub-problems. In this study we develop an exhaustive empirical analysis to explore the possibility of empowering the one-vs-one scheme for multi-class imbalance classification problems with applying binary ensemble learning approaches. We examine several state-of-the-art ensemble learning methods proposed for addressing the imbalance problems to solve the pairwise tasks derived from the multi-class data set. Then the aggregation strategy is employed to combine the binary ensemble outputs to reconstruct the original multi-class task. We present a detailed experimental study of the proposed approach, supported by the statistical analysis. The results indicate the high effectiveness of ensemble learning with one-vs-one scheme in dealing with the multi-class imbalance classification problems.National Natural Science Foundation of China (NSFC) 61273204CSC Scholarship Program (CSC) 201406080059Polish National Science Center UMO-2015/19/B/ST6/01597Spanish Government TIN2014-57251-PAndalusian Research Plan P10-TIC-6858 P11-TIC-7765Consejo Nacional de Ciencia y Tecnologia (CONACyT) 32901

Repositorio Institucional Universidad de Granada

Sequential and adaptive Bayesian computation for inference and optimization

Author: Akyildiz Omer Deniz
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 11/03/2019
Field of study

With the advent of cheap and ubiquitous measurement devices, today more data is measured, recorded, and archived in a relatively short span of time than all data recorded throughout history. Moreover, advances in computation have made it possible to model much more complicated phenomena and to use the vast amounts of data to calibrate the resulting high-dimensional models. In this thesis, we are interested in two fundamental problems which are repeatedly being faced in practice as the dimension of the models and datasets are growing steadily: the problem of inference in high-dimensional models and the problem of optimization for problems when the number of data points is very large. The inference problem gets diﬃcult when the model one wants to calibrate and estimate is deﬁned in a high-dimensional space. The behavior of computational algorithms in high-dimensional spaces is complicated and deﬁes intuition. Computational methods which work accurately for inferring low-dimensional models, for example, may fail to generalize the same performance to high-dimensional models. In recent years, due to the signiﬁcant interest in high-dimensional models, there has been a plethora of work in signal processing and machine learning to develop computational methods which are robust in high-dimensional spaces. In particular, the high-dimensional stochastic ﬁltering problem has attracted signiﬁcant attention as it arises in multiple ﬁelds which are of crucial importance such as geophysics, aerospace, control. In particular, a class of algorithms called particle ﬁlters has received attention and become a fruitful ﬁeld of research because of their accuracy and robustness in low-dimensional systems. In short, these methods keep a cloud of particles (samples in a state space), which describe the empirical probability distribution over the state variable of interest. The particle ﬁlters use a model of the phenomenon of interest to propagate and predict the future states and use an observation model to assimilate the observations to correct the state estimates. The most common particle ﬁlter, called the bootstrap particle ﬁlter (BPF), consists of an iterative sampling-weighting-resampling scheme. However, BPFs also largely fail at inferring high-dimensional dynamical systems due to a number of reasons. In this work, we propose a novel particle ﬁlter, named the nudged particle ﬁlter (NuPF), which speciﬁcally aims at improving the performance of particle ﬁlters in high-dimensional systems. The algorithm relies on the idea of nudging, which has been widely used in the geophysics literature to tackle high-dimensional inference problems. In particular, in addition to standard sampling-weighting-resampling steps of the particle ﬁlter, we deﬁne a general nudging step based on the gradient of the likelihoods, which generalize some of the nudging schemes proposed in the literature. This step is based on modifying the particles, generated in the sampling step, using the gradients of the likelihoods. In particular, the nudging step moves a fraction of the particles to the regions under which they have high-likelihoods. This scheme results in signiﬁcantly improved behavior in high-dimensional models. The resulting NuPF is able to track high-dimensional systems successfully. Unlike the proposed nudging schemes in the literature, the NuPF does not rely on Gaussianity assumptions and can be deﬁned for a general likelihood. We analytically prove that, because we only move a fraction of the particles and not all of them, the algorithm has a convergence rate that matches standard Monte Carlo algorithms. More precisely, the NuPF has the same asymptotic convergence guarantees as the bootstrap particle ﬁlter. As a byproduct, we also show that the nudging step improves the robustness of the particle ﬁlter against model misspeciﬁcation. In particular, model misspeciﬁcation occurs when the true data-generating system and the model posed by the user of the algorithm diﬀer signiﬁcantly. In this case, a majority of computational inference methods fail due to the discrepancy between the modeling assumptions and the observed data. We show that the nudging step increases the robustness of particle ﬁlters against model misspeciﬁcation. Specifically, we prove that the NuPF generates particle systems which have provably higher marginal likelihoods compared to the standard bootstrap particle ﬁlter. This theoretical result is attained by showing that the NuPF can be interpreted as a bootstrap particle ﬁlter for a modiﬁed state-space model. Finally, we demonstrate the empirical behavior of the NuPF with several examples. In particular, we show results on high-dimensional linear state-space models, a misspeciﬁed Lorenz 63 model, a high-dimensional Lorenz 96 model, and a misspeciﬁed object tracking model. In all examples, the NuPF infers the states successfully. The second problem, the so-called scability problem in optimization, occurs because of the large number of data points in modern datasets. With the increasing abundance of data, many problems in signal processing, statistical inference, and machine learning turn into a large-scale optimization problems. For example, in signal processing, one might be interested in estimating a sparse signal given a large number of corrupted observations. Similarly, maximum-likelihood inference problems in statistics result in large-scale optimization problems. Another signiﬁcant application domain is machine learning, where all important training methods are deﬁned as optimization problems. To tackle these problems, computational optimization methods developed over the past decades are ineﬃcient since they need to compute function evaluations or gradients over all the data for a single iteration. Because of this reason, a class of optimization methods, termed stochastic optimization methods, have emerged. The algorithms of this class are designed to tackle problems which are deﬁned over a big number of data points. In short, these methods utilize a subsample of the dataset in order to update the parameter estimate and do so iteratively until some convergence criterion is met. However, there is a major diﬃculty that has to be addressed: Although the convergence theory for these algorithms is understood, they can have unstable behavior in practice. In particular, the most commonly used stochastic optimization method, namely the stochastic gradient descent, can diverge easily if its step-size is poorly set. Over the years, practitioners have developed a number of rules of thumb to alleviate stability issues. We argue in this thesis that one way to develop robust stochastic optimization methods is to frame them as inference methods. In particular, we show that stochastic optimization schemes can be recast as inference methods and can be understood as inference algorithms. Framing the problem as an inference problem opens the way to compare these methods to the optimal inference algorithms and understand why they might be failing or producing unstable behavior. In this vein, we show that there is an intrinsic relationship between a class of stochastic optimization methods, called incremental proximal methods, and Kalman (and extended Kalman) ﬁlters. The ﬁltering approach to stochastic optimization results in an automatic calibration of the step-size, which removes the instability problems depending on the step-sizes. The probabilistic interpretation of stochastic optimization problems also paves the way to develop new optimization methods based on strategies which are popular in the inference literature. In particular, one can use a set of sampling methods in order to solve the inference problem and hence obtain the global minimum. In this manner, we propose a parallel sequential Monte Carlo optimizer (PSMCO), which is aiming at solving stochastic optimization problems. The PSMCO is designed as a zeroth order method which does not use gradients. It only uses subsets of the data points in order to move at each iteration. The PSMCO obtains an estimate of a global minimum at each iteration by utilizing a cheap kernel density estimator. We prove that the resulting estimator converges to a global minimum almost surely as the number of Monte Carlo samples tends to inﬁnity. We also empirically demonstrate that the algorithm is able to reconstruct multiple global minima and solve diﬃcult global optimization problems. By further exploiting the relationship between inference and optimization, we also propose a probabilistic and online matrix factorization method, termed the dictionary ﬁlter to solve large-scale matrix factorization problems. Matrix factorization methods have received signiﬁcant interest from the machine learning community due to their expressive representations of high-dimensional data and interpretability of their estimates. As the majority of the matrix factorization methods are deﬁned as optimization problems, they suﬀer from the same issues as stochastic optimization methods. In particular, when using stochastic gradient descent, one might need to try and err many times before deciding to use a step-size. To alleviate these problems, we introduce a matrix-variate probabilistic model for which inference results in a matrix factorization scheme. The scheme is online, in the sense that it only uses a single data point at a time to update the factors. The algorithm bears relationship with optimization schemes, namely with the incremental proximal method deﬁned over a matrix-variate cost function. By way of intuition we developed for the optimization-inference relationship, we devise a model which results in similar update rules for matrix factorization as for the incremental proximal method. However, the probabilistic updates are more stable and eﬃcient. Moreover, the algorithm does not have a step-size parameter to tune, as its role is played by the posterior covariance matrix. We demonstrate the utility of the algorithm on a missing data problem and a video processing problem. We show that the algorithm can be successfully used in machine learning problems and several promising extensions of the method can be constructed easily.Programa Oficial de Doctorado en Multimedia y ComunicacionesPresidente: Ricardo Cao Abad.- Secretario: Michael Peter Wiper.- Vocal: Nicholas Paul Whitele

Universidad Carlos III de Madrid e-Archivo

Machine Learning

Author
Publication venue: 'IntechOpen'
Publication date: 20/04/2021
Field of study

Machine Learning can be defined in various ways related to a scientific domain concerned with the design and development of theoretical and implementation tools that allow building systems with some Human Like intelligent behavior. Machine learning addresses more specifically the ability to improve automatically through experience

Directory of Open Access Books (DOAB)

A Framework for the Verification and Validation of Artificial Intelligence Machine Learning Systems

Author: Burns Swala B.
Publication venue: JagWorks@USA
Publication date: 01/05/2023
Field of study

An effective verification and validation (V&V) process framework for the white-box and black-box testing of artificial intelligence (AI) machine learning (ML) systems is not readily available. This research uses grounded theory to develop a framework that leads to the most effective and informative white-box and black-box methods for the V&V of AI ML systems. Verification of the system ensures that the system adheres to the requirements and specifications developed and given by the major stakeholders, while validation confirms that the system properly performs with representative users in the intended environment and does not perform in an unexpected manner. Beginning with definitions, descriptions, and examples of ML processes and systems, the research results identify a clear and general process to effectively test these systems. The developed framework ensures the most productive and accurate testing results. Formerly, and occasionally still, the system definition and requirements exist in scattered documents that make it difficult to integrate, trace, and test through V&V. Modern system engineers along with system developers and stakeholders collaborate to produce a full system model using model-based systems engineering (MBSE). MBSE employs a Unified Modeling Language (UML) or System Modeling Language (SysML) representation of the system and its requirements that readily passes from each stakeholder for system information and additional input. The comprehensive and detailed MBSE model allows for direct traceability to the system requirements. xxiv To thoroughly test a ML system, one performs either white-box or black-box testing or both. Black-box testing is a testing method in which the internal model structure, design, and implementation of the system under test is unknown to the test engineer. Testers and analysts are simply looking at performance of the system given input and output. White-box testing is a testing method in which the internal model structure, design, and implementation of the system under test is known to the test engineer. When possible, test engineers and analysts perform both black-box and white-box testing. However, sometimes testers lack authorization to access the internal structure of the system. The researcher captures this decision in the ML framework. No two ML systems are exactly alike and therefore, the testing of each system must be custom to some degree. Even though there is customization, an effective process exists. This research includes some specialized methods, based on grounded theory, to use in the testing of the internal structure and performance. Through the study and organization of proven methods, this research develops an effective ML V&V framework. Systems engineers and analysts are able to simply apply the framework for various white-box and black-box V&V testing circumstances

University of South Alabama Institutional Repository

Programming Languages and Systems

Author
Publication venue: 'Springer Science and Business Media LLC'
Publication date
Field of study

This open access book constitutes the proceedings of the 30th European Symposium on Programming, ESOP 2021, which was held during March 27 until April 1, 2021, as part of the European Joint Conferences on Theory and Practice of Software, ETAPS 2021. The conference was planned to take place in Luxembourg and changed to an online format due to the COVID-19 pandemic. The 24 papers included in this volume were carefully reviewed and selected from 79 submissions. They deal with fundamental issues in the specification, design, analysis, and implementation of programming languages and systems

OAPEN Library

Methods to Improve the Prediction Accuracy and Performance of Ensemble Models

Author: Adegoke V.
Adegoke V.
Publication venue: London South Bank University
Publication date: 01/01/2022
Field of study

The application of ensemble predictive models has been an important research area in predicting medical diagnostics, engineering diagnostics, and other related smart devices and related technologies. Most of the current predictive models are complex and not reliable despite numerous efforts in the past by the research community. The performance accuracy of the predictive models have not always been realised due to many factors such as complexity and class imbalance. Therefore there is a need to improve the predictive accuracy of current ensemble models and to enhance their applications and reliability and non-visual predictive tools. The research work presented in this thesis has adopted a pragmatic phased approach to propose and develop new ensemble models using multiple methods and validated the methods through rigorous testing and implementation in different phases. The first phase comprises of empirical investigations on standalone and ensemble algorithms that were carried out to ascertain their performance effects on complexity and simplicity of the classifiers. The second phase comprises of an improved ensemble model based on the integration of Extended Kalman Filter (EKF), Radial Basis Function Network (RBFN) and AdaBoost algorithms. The third phase comprises of an extended model based on early stop concepts, AdaBoost algorithm, and statistical performance of the training samples to minimize overfitting performance of the proposed model. The fourth phase comprises of an enhanced analytical multivariate logistic regression predictive model developed to minimize the complexity and improve prediction accuracy of logistic regression model. To facilitate the practical application of the proposed models; an ensemble non-invasive analytical tool is proposed and developed. The tool links the gap between theoretical concepts and practical application of theories to predict breast cancer survivability. The empirical findings suggested that: (1) increasing the complexity and topology of algorithms does not necessarily lead to a better algorithmic performance, (2) boosting by resampling performs slightly better than boosting by reweighting, (3) the prediction accuracy of the proposed ensemble EKF-RBFN-AdaBoost model performed better than several established ensemble models, (4) the proposed early stopped model converges faster and minimizes overfitting better compare with other models, (5) the proposed multivariate logistic regression concept minimizes the complexity models (6) the performance of the proposed analytical non-invasive tool performed comparatively better than many of the benchmark analytical tools used in predicting breast cancers and diabetics ailments. The research contributions to ensemble practice are: (1) the integration and development of EKF, RBFN and AdaBoost algorithms as an ensemble model, (2) the development and validation of ensemble model based on early stop concepts, AdaBoost, and statistical concepts of the training samples, (3) the development and validation of predictive logistic regression model based on breast cancer, and (4) the development and validation of a non-invasive breast cancer analytic tools based on the proposed and developed predictive models in this thesis. To validate prediction accuracy of ensemble models, in this thesis the proposed models were applied in modelling breast cancer survivability and diabetics’ diagnostic tasks. In comparison with other established models the simulation results of the models showed improved predictive accuracy. The research outlines the benefits of the proposed models, whilst proposes new directions for future work that could further extend and improve the proposed models discussed in this thesis

LSBU Research Open

Searching for Needles in the Cosmic Haystack

Author: Devine Thomas Ryan
Publication venue: The Research Repository @ WVU
Publication date: 01/01/2020
Field of study

Searching for pulsar signals in radio astronomy data sets is a difficult task. The data sets are extremely large, approaching the petabyte scale, and are growing larger as instruments become more advanced. Big Data brings with it big challenges. Processing the data to identify candidate pulsar signals is computationally expensive and must utilize parallelism to be scalable. Labeling benchmarks for supervised classification is costly. To compound the problem, pulsar signals are very rare, e.g., only 0.05% of the instances in one data set represent pulsars. Furthermore, there are many different approaches to candidate classification with no consensus on a best practice. This dissertation is focused on identifying and classifying radio pulsar candidates from single pulse searches. First, to identify and classify Dispersed Pulse Groups (DPGs), we developed a supervised machine learning approach that consists of RAPID (a novel peak identification algorithm), feature extraction, and supervised machine learning classification. We tested six algorithms for classification with four imbalance treatments. Results showed that classifiers with imbalance treatments had higher recall values. Overall, classifiers using multiclass RandomForests combined with Synthetic Majority Oversampling TEchnique (SMOTE) were the most efficient; they identified additional known pulsars not in the benchmark, with less false positives than other classifiers. Second, we developed a parallel single pulse identification method, D-RAPID, and introduced a novel automated multiclass labeling (ALM) technique that we combined with feature selection to improve execution performance. D-RAPID improved execution performance over RAPID by a factor of 5. We also showed that the combination of ALM and feature selection sped up the execution performance of RandomForest by 54% on average with less than a 2% average reduction in classification performance. Finally, we proposed CoDRIFt, a novel classification algorithm that is distributed for scalability and employs semi-supervised learning to leverage unlabeled data to inform classification. We evaluated and compared CoDRIFt to eleven other classifiers. The results showed that CoDRIFt excelled at classifying candidates in imbalanced benchmarks with a majority of non-pulsar signals (\u3e95%). Furthermore, CoDRIFt models created with very limited sets of labeled data (as few as 22 labeled minority class instances) were able to achieve high recall (mean = 0.98). In comparison to the other algorithms trained on similar sets, CoDRIFt outperformed them all, with recall 2.9% higher than the next best classifier and a 35% average improvement over all eleven classifiers. CoDRIFt is customizable for other problem domains with very large, imbalanced data sets, such as fraud detection and cyber attack detection

The Research Repository @ WVU (West Virginia University)