
    A review of ensemble learning and data augmentation models for class imbalanced problems: combination, implementation and evaluation

    Class imbalance (CI) in classification problems arises when the number of observations belonging to one class is lower than that of the other classes. Ensemble learning combines multiple models to obtain a robust model and has been prominently used with data augmentation methods to address class imbalance problems. In the last decade, a number of strategies have been added to enhance ensemble learning and data augmentation methods, along with new methods such as generative adversarial networks (GANs). A combination of these has been applied in many studies, and evaluating the different combinations would provide better understanding of, and guidance for, different application domains. In this paper, we present a computational study to evaluate data augmentation and ensemble learning methods used to address prominent benchmark CI problems. We present a general framework that evaluates 9 data augmentation and 9 ensemble learning methods for CI problems. Our objective is to identify the most effective combination for improving classification performance on imbalanced datasets. The results indicate that combinations of data augmentation methods with ensemble learning can significantly improve classification performance on imbalanced datasets. We find that traditional data augmentation methods such as the synthetic minority oversampling technique (SMOTE) and random oversampling (ROS) not only perform better on the selected CI problems but are also computationally less expensive than GANs. Our study is vital for the development of novel models for handling imbalanced datasets.
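
    A minimal sketch of one such combination, pairing SMOTE oversampling with a bagging ensemble via the imbalanced-learn and scikit-learn libraries; the synthetic dataset, class ratio, and hyper-parameters are illustrative assumptions rather than the paper's actual experimental setup.

# Sketch: SMOTE oversampling combined with a bagging ensemble on a synthetic
# imbalanced dataset. Dataset and hyper-parameters are illustrative assumptions.
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic two-class problem with a 5% minority class.
X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# Oversample only the training split, then fit the ensemble on the balanced data.
X_res, y_res = SMOTE(random_state=0).fit_resample(X_train, y_train)
model = BaggingClassifier(n_estimators=50, random_state=0).fit(X_res, y_res)

print(classification_report(y_test, model.predict(X_test)))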

    Advancement of Data-Driven Short-Term Flood Predictions on an Urbanized Watershed Using Preprocessing Techniques

    Supervised classification can be applied for short-term predictions of hydrological events in cases where the label of the event rather than its magnitude is crucial, as in the case of early flood warning systems. To be effective, these warning systems must be able to forecast floods accurately and to provide estimates early enough. Following the approach of transforming hydrological sensor data into a phase space using time-delay embedding, an attempt was made to improve the performance of the models and to increase the lead-time of reliable predictions. For this, the available set of attributes supplied by stream and rain gauges was extended with derivatives. In addition, imbalanced data techniques were applied at the data preprocessing step. The computational experiments were conducted on various data sets, lead-times, and years with different hydrological characteristics. The results show that derivatives of water level data in particular improve model performance, especially when added for only the one or two hours preceding the prediction time. In addition, the imbalanced data techniques allowed for an overall improvement in flood prediction at the cost of a slight increase in misclassification of low-flow events.
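
    The following sketch illustrates the kind of preprocessing described above: time-delay embedding of a water-level series plus first-difference (derivative) features for the most recent hours before the prediction time. The column names, window lengths, and example data are assumptions made for illustration, not the study's actual configuration.

# Sketch: time-delay embedding of a water-level series plus derivative
# (first-difference) features for the most recent time steps.
import pandas as pd

def embed_with_derivatives(series: pd.Series, n_lags: int = 6,
                           n_derivative_lags: int = 2) -> pd.DataFrame:
    """Return lagged values and recent first differences as model features."""
    columns = {}
    for lag in range(n_lags):
        columns[f"level_t-{lag}"] = series.shift(lag)
    # Derivatives only for the last one or two hours before prediction time,
    # mirroring the finding that recent derivatives help the most.
    diffs = series.diff()
    for lag in range(n_derivative_lags):
        columns[f"d_level_t-{lag}"] = diffs.shift(lag)
    return pd.DataFrame(columns).dropna()

# Example with hourly (assumed) water-level readings.
levels = pd.Series([1.2, 1.3, 1.5, 1.9, 2.4, 2.3, 2.1, 2.0],
                   index=pd.date_range("2024-01-01", periods=8, freq="h"))
print(embed_with_derivatives(levels, n_lags=3, n_derivative_lags=2))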

    Can Threshold-Based Sensor Alerts be Analysed to Detect Faults in a District Heating Network?

    Older IoT “smart sensors” create system alerts from threshold rules on reading values. These simple thresholds are not very flexible to changes in the network. Due to the large number of false positives generated, these alerts are often ignored by network operators. Current state-of-the-art analytical models typically create alerts using raw sensor readings as the primary input. However, as greater numbers of sensors are deployed, the growth in the number of readings that must be processed becomes problematic. The number of analytic models deployed to each of these systems is also increasing as analysis is broadened. This study investigates whether alerts created using threshold rules can be used to predict network faults. By using threshold-based alerts instead of raw continuous readings, the amount of data that the analytic models need to process is greatly reduced. The study was conducted on alert data from a European city’s District Heating network. The alerts were generated by “smart sensors” that used threshold rules. Analytic models were tested to find the most accurate prediction of a network fault. Work order (maintenance) records were used as the target variable, indicating that a fault had occurred at the same time and location as an active alert. The target variable was highly imbalanced (96:4), with the minority class corresponding to cases where a work order was required. The decision tree model developed used misclassification costs to achieve reasonable accuracy, with a trade-off between precision (0.63) and recall (0.56). The sparse nature of the alert data may be to blame for this result. The results show promise that this method could work well on datasets with better sensor coverage.
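
    A minimal sketch of the cost-sensitive decision tree idea on a 96:4 imbalanced target, with misclassification costs expressed as class weights; the synthetic features, cost ratio, and tree depth are assumptions for illustration only.

# Sketch: decision tree with misclassification costs expressed as class weights
# for a heavily imbalanced (96:4) fault-prediction target. Illustrative only.
from sklearn.datasets import make_classification
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=10000, n_features=15,
                           weights=[0.96, 0.04], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# Penalise missed faults (class 1) much more heavily than false alarms.
tree = DecisionTreeClassifier(class_weight={0: 1, 1: 10},
                              max_depth=6, random_state=0)
tree.fit(X_train, y_train)
pred = tree.predict(X_test)
print("precision:", round(precision_score(y_test, pred), 2),
      "recall:", round(recall_score(y_test, pred), 2))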

    Intrusion Detection: Embedded Software Machine Learning and Hardware Rules Based Co-Designs

    The security of innovative technologies in future-generation networks such as Cyber-Physical Systems (CPS) and Wi-Fi has become a critical universal issue for individuals, the economy, enterprises, organizations, and governments. The rate of cyber-attacks has increased dramatically, and the tactics used by attackers continue to evolve and have become increasingly ingenious. Intrusion detection is one of the solutions against these attacks. One approach to designing an intrusion detection system (IDS) is software-based machine learning. Such an approach can predict and detect threats before they result in major security incidents. Moreover, despite the considerable research into machine-learning-based designs, there is still a relatively small body of literature concerned with imbalanced class distributions from the intrusion detection perspective. In addition, an effective performance metric is needed that can compare multiple multi-class as well as binary-class systems with respect to class distribution. Furthermore, the detection techniques must be able to distinguish real attacks from random defects, defects ingrained in the design, misconfigurations of system devices, system faults, human errors, and software implementation errors. Moreover, a lightweight IDS that is small, real-time, flexible, and reconfigurable enough to be used as a permanent element of the system's security infrastructure is essential. The main goal of the current study is to design an effective and accurate intrusion detection framework with a minimum of features that are discriminative and representative. Three publicly available datasets representing different networking environments are adopted; they also reflect realistic imbalanced class distributions as well as updated attack patterns. The presented intrusion detection framework is composed of three main modules: feature selection and dimensionality reduction, handling of imbalanced class distributions, and classification. The feature selection mechanism utilizes search algorithms and correlation-based subset evaluation techniques, whereas the dimensionality reduction part utilizes principal component analysis and an auto-encoder as an instance of deep learning. Various classifiers (eight single-learning classifiers, four ensemble classifiers, and one stacked classifier) and five imbalanced-class handling approaches are evaluated to identify the most efficient and accurate one(s) for the proposed intrusion detection framework. A hardware-based approach to detecting malicious behaviors of sensors and actuators embedded in medical devices, where patient safety is critical and of utmost importance, is additionally proposed. The idea is based on a methodology that transforms a device's behavior rules into a state machine to build a Behavior Specification Rules Monitoring (BSRM) tool for four medical devices. Simulation and synthesis results demonstrate that the BSRM tool can effectively identify the expected normal behavior of the device and detect any deviation from it. The performance of the BSRM approach has also been compared with a machine-learning-based approach for the same problem. The FPGA module of the BSRM can be embedded in medical devices as an IDS and can be further integrated with the machine-learning-based approach.
The reconfigurable nature of the FPGA chip adds an extra advantage to the designed model, in which the behavior rules can be easily updated and tailored according to the requirements of the device, patient, treatment algorithm, and/or pervasive healthcare application.
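
    The sketch below outlines, under assumed data and parameter choices, the general shape of such a framework: univariate feature selection, PCA-based dimensionality reduction, class rebalancing, and a classifier chained in one pipeline. It is not the thesis's exact configuration; the auto-encoder and correlation-based search variants are omitted.

# Sketch: intrusion-detection style pipeline with feature selection,
# PCA dimensionality reduction, oversampling, and a classifier.
# The synthetic data and every parameter choice are illustrative assumptions.
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score

# Stand-in for an imbalanced network-traffic dataset.
X, y = make_classification(n_samples=5000, n_features=40, n_informative=10,
                           weights=[0.9, 0.1], random_state=0)

pipeline = Pipeline(steps=[
    ("select", SelectKBest(f_classif, k=20)),        # keep the 20 most relevant features
    ("reduce", PCA(n_components=10)),                # compress them to 10 components
    ("balance", RandomOverSampler(random_state=0)),  # rebalance the training folds only
    ("clf", RandomForestClassifier(n_estimators=100, random_state=0)),
])
print("mean F1 over 5 folds:", cross_val_score(pipeline, X, y,
                                               scoring="f1", cv=5).mean())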

    Predicting Account Receivables Outcomes with Machine-Learning

    Project Work presented as the partial requirement for obtaining a Master's degree in Information Management, specialization in Knowledge Management and Business Intelligence. The Account Receivables (AR) of a company are considered an important determinant of a company’s Cash Flow – the backbone of a company’s financial performance or health. It has been shown that by efficiently managing the money owed by customers for goods and services (AR), a company can avoid financial difficulties and even stabilize results in moments of extreme volatility. The aim of this project is to use machine-learning and data visualization techniques to predict invoice outcomes and provide useful information and an analytics-based solution to the collection management team. Specifically, this project demonstrates how supervised learning models can classify with high accuracy whether a newly created invoice will be paid earlier, on time, or later than the contracted due date. It is also studied how to predict the magnitude of delayed payments by classifying them into delay categories of interest to the business: up to 1 month late, from 1 to 3 months late, and delayed for more than 3 months. The developed models use real-life data from a multinational company in the manufacturing and automation industries and can predict payments with higher accuracy than the baseline achieved by the business.
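
    As a rough sketch of the classification setup described (the feature names, synthetic invoice data, and model choice are assumptions, not the project's real data or final model), payment delays can be bucketed into the reported business categories and fed to a supervised classifier:

# Sketch: bucket invoice payment delays into business categories and train a
# classifier on invoice features. Feature names and data are assumptions.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

def delay_category(days_late: float) -> str:
    """Map days late relative to the due date onto the reported buckets."""
    if days_late <= 0:
        return "on_time_or_early"
    if days_late <= 30:
        return "late_up_to_1_month"
    if days_late <= 90:
        return "late_1_to_3_months"
    return "late_over_3_months"

rng = np.random.default_rng(0)
invoices = pd.DataFrame({
    "amount": rng.lognormal(8, 1, 2000),
    "payment_terms_days": rng.choice([30, 60, 90], 2000),
    "customer_past_avg_delay": rng.normal(5, 10, 2000),
    "days_late": rng.normal(10, 25, 2000),
})
invoices["label"] = invoices["days_late"].apply(delay_category)

X = invoices[["amount", "payment_terms_days", "customer_past_avg_delay"]]
y = invoices["label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
print("holdout accuracy:", round(model.score(X_test, y_test), 2))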

    An improved long short term memory network for intrusion detection

    Over the years, intrusion detection systems have played a crucial role in network security by discovering attacks in network traffic and generating an alarm signal to be sent to the security team. Machine learning methods, e.g., Support Vector Machine and K Nearest Neighbour, have been used in building intrusion detection systems, but such systems still suffer from low accuracy and a high false alarm rate. Deep learning models (e.g., Long Short-Term Memory, LSTM) have been employed in designing intrusion detection systems to address this issue. However, LSTM needs a high number of iterations to achieve high performance. In this paper, a novel and improved version of the Long Short-Term Memory (ILSTM) algorithm is proposed. The ILSTM is based on the novel integration of the chaotic butterfly optimization algorithm (CBOA) and particle swarm optimization (PSO) to improve the accuracy of the LSTM algorithm. The ILSTM was then used to build an efficient intrusion detection system for binary and multi-class classification cases. The proposed algorithm has two phases: phase one involves training a conventional LSTM network to get initial weights, and phase two involves using the hybrid swarm algorithms, CBOA and PSO, to optimize the weights of the LSTM to improve the accuracy. The performance of ILSTM and the intrusion detection system was evaluated using two public datasets (the NSL-KDD dataset and LITNET-2020) under nine performance metrics. The results showed that the proposed ILSTM algorithm outperformed the original LSTM and other related deep-learning algorithms in terms of accuracy and precision. The ILSTM achieved an accuracy of 93.09% and a precision of 96.86%, while LSTM gave an accuracy of 82.74% and a precision of 76.49%. The ILSTM also performed better than LSTM on both datasets. In addition, statistical analysis showed that the improvement of ILSTM over LSTM is statistically significant. Further, the proposed ILSTM gave better results for the multi-class classification of intrusion types such as DoS, Probe, and U2R attacks.
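
    A minimal sketch of the two-phase idea: a conventionally trained LSTM whose weights are then refined by a plain particle swarm optimization loop. The chaotic butterfly component is omitted, and the synthetic data, network sizes, and PSO hyper-parameters are assumptions for illustration, not the paper's ILSTM.

# Sketch of the two-phase scheme: (1) train an LSTM conventionally, then
# (2) refine its weights with a simple particle swarm optimization (PSO) loop.
# Plain PSO only (no chaotic butterfly step); data and sizes are assumptions.
import numpy as np
import tensorflow as tf

def flatten(weights):
    return np.concatenate([w.ravel() for w in weights])

def unflatten(vector, template):
    parts, i = [], 0
    for w in template:
        parts.append(vector[i:i + w.size].reshape(w.shape))
        i += w.size
    return parts

# Synthetic stand-in for flow records: 1 timestep, 10 features, 2 classes.
rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 1, 10)).astype("float32")
y = (x[:, 0, 0] > 0).astype("int32")

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(1, 10)),
    tf.keras.layers.LSTM(16),
    tf.keras.layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Phase 1: conventional training to obtain initial weights.
model.fit(x, y, epochs=3, verbose=0)

# Phase 2: PSO around the trained weights, minimizing the loss.
template = model.get_weights()

def swarm_loss(vector):
    model.set_weights(unflatten(vector, template))
    return model.evaluate(x, y, verbose=0)[0]

n_particles, n_iter = 8, 10
base = flatten(template)
pos = base + 0.01 * rng.normal(size=(n_particles, base.size))
vel = np.zeros_like(pos)
pbest, pbest_loss = pos.copy(), np.array([swarm_loss(p) for p in pos])
gbest = pbest[pbest_loss.argmin()].copy()

for _ in range(n_iter):
    r1, r2 = rng.random(2)
    vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
    pos = pos + vel
    for k in range(n_particles):
        current = swarm_loss(pos[k])
        if current < pbest_loss[k]:
            pbest[k], pbest_loss[k] = pos[k].copy(), current
    gbest = pbest[pbest_loss.argmin()].copy()

model.set_weights(unflatten(gbest, template))  # adopt the best weights found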

    Analysis, Characterization, Prediction and Attribution of Extreme Atmospheric Events with Machine Learning: a Review

    Atmospheric Extreme Events (EEs) cause severe damage to human societies and ecosystems. The frequency and intensity of EEs and other associated events are increasing under ongoing climate change and global warming. The accurate prediction, characterization, and attribution of atmospheric EEs is therefore a key research field, in which many groups are currently working by applying different methodologies and computational tools. Machine Learning (ML) methods have arisen in recent years as powerful techniques to tackle many of the problems related to atmospheric EEs. This paper reviews the ML algorithms applied to the analysis, characterization, prediction, and attribution of the most important atmospheric EEs. A summary of the most-used ML techniques in this area and a comprehensive critical review of the literature related to ML in EEs are provided. A number of examples are discussed, and perspectives and outlooks on the field are drawn.

    Spatial Modeling of Maritime Risk Using Machine Learning

    Managing navigational safety is a key responsibility of coastal states. Predicting and measuring these risks is highly complex owing to their infrequent occurrence, multitude of causes, and large study areas. As a result, maritime risk models are generally limited in scale to small regions, generalized across diverse environments, or reliant on expert judgement; such approaches therefore have limited scalability and may incorrectly characterize the risk. In this article a novel method for spatial modeling of maritime risk through machine learning is proposed. This enables navigational safety to be characterized while leveraging the significant volumes of relevant data available. The method comprises two key components: aggregation of historical accident data, vessel traffic, and other explanatory features into a spatial grid; and the implementation of several classification algorithms that predict annual accident occurrence for various vessel types. This approach is applied to characterize the risk of collisions and groundings in the United Kingdom. The results vary between hazard types and vessel types but show remarkable capability at characterizing maritime risk, with accuracies and area-under-curve scores in excess of 90% in most implementations. Furthermore, the ensemble tree-based algorithms XGBoost and Random Forest consistently outperformed the other machine learning algorithms tested. The resulting potential-risk maps provide decision-makers with actionable intelligence for targeting risk mitigation measures in the regions with the greatest need.
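
    A minimal sketch of the two components described, aggregating point records onto a spatial grid and predicting per-cell accident occurrence with a random forest; the synthetic data, grid cell size, and features are illustrative assumptions rather than the article's actual inputs.

# Sketch: aggregate accident and traffic points onto a spatial grid, then
# predict per-cell accident occurrence with a random forest. Data, cell size,
# and feature choices are illustrative assumptions.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
points = pd.DataFrame({
    "lon": rng.uniform(-6.0, 2.0, 20000),        # rough UK longitude range
    "lat": rng.uniform(49.0, 59.0, 20000),
    "vessel_transits": rng.poisson(30, 20000),
    "mean_depth_m": rng.uniform(5, 200, 20000),
    "accident": rng.random(20000) < 0.03,        # rare accident label
})

# Aggregate point records into 0.5-degree grid cells.
cell = 0.5
points["cell_x"] = (points["lon"] // cell).astype(int)
points["cell_y"] = (points["lat"] // cell).astype(int)
grid = points.groupby(["cell_x", "cell_y"]).agg(
    vessel_transits=("vessel_transits", "sum"),
    mean_depth_m=("mean_depth_m", "mean"),
    accident=("accident", "max"),                # any accident in the cell
).reset_index()

X = grid[["vessel_transits", "mean_depth_m"]]
y = grid["accident"].astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
print("AUC:", round(roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]), 3))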