1,040 research outputs found

    Performance analysis of binary and multiclass models using azure machine learning

    Network data is expanding at an alarming rate, and the sophisticated attack tools used by hackers make the cyber threat landscape unpredictable. Traditional machine learning models for network intrusion detection emphasize improving the attack detection rate and reducing false alarms, but time efficiency is often overlooked. To address this limitation, a modern solution is presented using a Machine Learning-as-a-Service platform. The proposed work analyses the performance of eight two-class and three multiclass algorithms using UNSW NB-15, a modern intrusion detection dataset. 82,332 testing samples were used to evaluate the algorithms. The proposed two-class decision forest model exhibited 99.2% accuracy and took 6 seconds to learn 175,341 network instances. A multiclass classification task was also undertaken, in which the attack types generic, exploits, shellcode and worms were classified with recall of 99%, 94.49%, 91.79% and 90.9% respectively by the multiclass decision forest model, which also outperformed the others in training and execution time.
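
The per-class recall figures quoted in this abstract follow the usual definition: for each attack type, the share of its actual instances that the model labelled correctly. A minimal sketch of that calculation is given here; the function name `per_class_recall` and the toy labels are assumptions made for illustration and are not data or code from the paper.

```python
from collections import Counter


def per_class_recall(y_true, y_pred):
    """Return {label: recall}, where recall = correct predictions / actual instances."""
    actual = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / actual[label] for label in actual}


# Toy labels invented for illustration only (not results from the paper).
y_true = ["generic", "generic", "exploits", "exploits",
          "worms", "worms", "worms", "worms"]
y_pred = ["generic", "generic", "exploits", "generic",
          "worms", "worms", "worms", "exploits"]

recalls = per_class_recall(y_true, y_pred)
for label in sorted(recalls):
    print(f"{label}: {recalls[label]:.2f}")
```

Reporting recall per class rather than a single accuracy figure matters for rare attack types such as worms, where a model can score high overall accuracy while missing most of the minority class.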

    Towards Enhancement of Machine Learning Techniques Using CSE-CIC-IDS2018 Cybersecurity Dataset

    In machine learning, balanced datasets play a crucial role in limiting the bias observed in classification and prediction. The CSE-CIC IDS datasets published in 2017 and 2018 have both attracted considerable scholarly attention in intrusion detection research. Recent work using these datasets pays little attention to their imbalance. The study presented in this paper sets out to explore the degree to which imbalance has been treated and to provide a taxonomy of the machine learning approaches developed using these datasets. A survey of published works related to these datasets was conducted, combining qualitative and quantitative methods, to derive the taxonomy. The research presented here confirms that the impact of bias due to imbalanced datasets is rarely addressed. This finding supports further research and development of supervised machine learning techniques that reduce bias in classification or prediction due to these imbalanced datasets. This study's experiment trains models on CSE-CIC-IDS2018 using the train/test split function from the scikit-learn library. Of the many machine learning algorithms presented in the literature, three classification-based supervised ML techniques are used in our study: 1) KNN, 2) Random Forest (RF) and 3) Logistic Regression (LR). The experiment also determines how each of the dataset's 67 preprocessed features affects the ML model's performance. Feature drop selection is performed in two ways: independent drop and group drop. Experimental results give the threshold values for each classifier and performance metric values such as accuracy, precision, recall, and F1-score. Results are also generated from a comparison of the manual feature drop methods. A considerable drop is observed under group drop for most of the classifiers.
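
The difference between independent and group feature drop can be sketched on synthetic data. This is a hedged illustration, not the authors' experiment: it uses a small hand-written k-NN and a hand-rolled shuffle split in place of the scikit-learn classifiers and split function the abstract mentions, four invented features instead of the dataset's 67, and the helper names (`make_row`, `knn_predict`, `accuracy_without`) are all hypothetical.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable


def make_row(label):
    # Assumption: features 0 and 1 both carry the label signal; 2 and 3 are noise.
    return [label * 10 + rng.random(), label * 10 + rng.random(),
            rng.random(), rng.random()]


labels = [i % 2 for i in range(120)]
rows = [make_row(y) for y in labels]

# Shuffle, then hold out 25% for testing (stand-in for a library split helper).
idx = list(range(len(rows)))
rng.shuffle(idx)
cut = int(0.75 * len(idx))
train_idx, test_idx = idx[:cut], idx[cut:]


def knn_predict(train_X, train_y, x, k=3):
    """Majority label among the k nearest training rows (squared Euclidean distance)."""
    order = sorted(range(len(train_X)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)))
    votes = [train_y[i] for i in order[:k]]
    return max(set(votes), key=votes.count)


def accuracy_without(dropped):
    """Test accuracy of k-NN after removing the feature indices in `dropped`."""
    keep = [j for j in range(4) if j not in dropped]
    train_X = [[rows[i][j] for j in keep] for i in train_idx]
    train_y = [labels[i] for i in train_idx]
    hits = sum(
        knn_predict(train_X, train_y, [rows[i][j] for j in keep]) == labels[i]
        for i in test_idx
    )
    return hits / len(test_idx)


base = accuracy_without(set())                              # all features kept
independent = {j: accuracy_without({j}) for j in range(4)}  # drop one at a time
group = accuracy_without({0, 1})                            # drop both signal features

print("baseline:", base)
print("independent drop:", independent)
print("group drop {0, 1}:", group)
```

In this toy setup the two signal features are redundant, so dropping either one alone leaves accuracy unchanged, while dropping them together removes all signal. That is one reason a group drop can reveal a dependence that independent drops hide.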

    A comparative analysis of machine learning models for corporate default forecasting

    This study examines the potential benefits of utilizing machine learning models for default forecasting by comparing the discriminatory power of the random forest and XGBoost models with traditional statistical models. The results of the evaluation with out-of-time predictions show that the machine learning models exhibit a higher discriminatory power compared to the traditional models. The reduction in the sample size of the training dataset leads to a decrease in predictive power of the machine learning models, reducing the difference in performance between the two model types. While modifications in model dimensionality have a limited impact on the discriminatory power of the statistical models, the predictive power of machine learning models increases with the addition of further predictors. When employing a clustering approach, both traditional and machine learning models exhibit an improvement in discriminatory power in the small, medium, and large firm size clusters compared to the previous non-clustering specifications. Machine learning models exhibit a significantly higher ability to classify micro firms. The findings of this research indicate that the machine learning models exhibit superior discriminatory power compared to the traditional models across the different specifications. Machine learning models can be used to forecast the potential impact of corporate default of non-financial micro corporations on the Portuguese labour market by estimating the number of jobs at risk.


    Data Science for Internal Audit in Banking: Refinement of an Internal Audit Alarmistic System with Machine Learning

    Internship Report presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics, specialization in Data Science. This report presents the work developed during the academic internship required for obtaining the Master’s Degree in Data Science and Advanced Analytics. The internship took place in the area of Data & Analytics of the Department for Internal Audit of Caixa Geral de Depósitos (Portugal), from the 14th of September 2020 to the 13th of June 2021. The internship’s goal was the introduction of machine learning to the Department of Internal Audit, in particular the implementation of three machine learning pipelines to aid the institution's audit activities by systematically analyzing operations flagged by the existing alarm system. The alarm system triggers alerts when an event disobeys a predefined methodology. Each triggering event is reviewed and processed individually by the auditors and classified either as a confirmed error or as a false positive. Confirmed errors frequently lead to recommendations to rectify the operations, while false positives are closed without a recommendation. The alerts’ triggers are defined by sets of arguably general and manually implemented rules, resulting in high trigger frequencies and low precision. Trigger frequency, precision, and cost of miss rate differ for each alert. Based on the alerts’ trigger history data, three types of alerts were selected for improvement. The deployment of machine learning pipelines with classification models optimized the triggers' specificity while maintaining high sensitivity, which reduced the number of daily events that have to be reviewed by the auditors. This optimization maximizes the efficiency and productivity of the general alarm system and decreases the auditors’ workload.
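
The trade-off described in this abstract, raising specificity while keeping sensitivity high, can be sketched as a threshold search over classifier scores. This is an assumption-laden illustration rather than the report's pipeline: the scores, labels, the sensitivity floor of 1.0, and the function names `sensitivity_specificity` and `pick_threshold` are all invented for the example.

```python
def sensitivity_specificity(scores, labels, threshold):
    """labels: 1 = confirmed error, 0 = false positive.
    An alert stays in the auditors' queue when its score >= threshold."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y == 1)
    fn = sum(1 for s, y in pairs if s < threshold and y == 1)
    tn = sum(1 for s, y in pairs if s < threshold and y == 0)
    fp = sum(1 for s, y in pairs if s >= threshold and y == 0)
    return tp / (tp + fn), tn / (tn + fp)


def pick_threshold(scores, labels, min_sensitivity):
    """Return (threshold, sensitivity, specificity) with the highest specificity
    among thresholds whose sensitivity is at least min_sensitivity."""
    best = None
    for t in sorted(set(scores)):
        sens, spec = sensitivity_specificity(scores, labels, t)
        if sens >= min_sensitivity and (best is None or spec > best[2]):
            best = (t, sens, spec)
    return best


# Invented model scores for ten historical alerts (illustration only).
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90]
labels = [0, 0, 0, 0, 1, 0, 1, 0, 1, 1]

threshold, sens, spec = pick_threshold(scores, labels, min_sensitivity=1.0)
queued = sum(1 for s in scores if s >= threshold)
print(f"threshold={threshold}, sensitivity={sens:.2f}, specificity={spec:.2f}")
print(f"alerts left for review: {queued} of {len(scores)}")
```

With these toy numbers every confirmed error is still surfaced, while four of the six false positives are filtered out before they reach an auditor.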

    Recent Trends in Computational Intelligence

    Traditional models struggle to cope with complexity, noise, and changing environments, while Computational Intelligence (CI) offers solutions to complicated problems as well as reverse problems. The main feature of CI is adaptability, spanning the fields of machine learning and computational neuroscience. CI also comprises biologically inspired technologies, such as swarm intelligence as part of evolutionary computation, and encompasses wider areas such as image processing, data collection, and natural language processing. This book discusses the use of CI for solving a variety of applications optimally, demonstrating its wide reach and relevance. Combining optimization methods with data mining strategies makes a strong and reliable prediction tool for handling real-life applications.

    Machine Learning Based Network Vulnerability Analysis of Industrial Internet of Things

    It is critical to secure Industrial Internet of Things (IIoT) devices because of the potentially devastating consequences of an attack. Machine learning and big data analytics are two powerful levers for analyzing and securing Internet of Things (IoT) technology. By extension, these techniques can help improve the security of IIoT systems as well. In this paper, we first present common IIoT protocols and their associated vulnerabilities. Then, we run a cyber-vulnerability assessment and discuss the use of machine learning in countering these susceptibilities. Following that, a literature review of the available intrusion detection solutions using machine learning models is presented. Finally, we discuss our case study, which includes details of a real-world testbed that we have built to conduct cyber-attacks and to design an intrusion detection system (IDS). We deploy backdoor, command injection, and Structured Query Language (SQL) injection attacks against the system and demonstrate how a machine learning based anomaly detection system can perform well in detecting these attacks. We have evaluated the performance through representative metrics to give a fair view of the effectiveness of the methods.

    Predictive Modelling Approach to Data-Driven Computational Preventive Medicine

    This thesis contributes novel predictive modelling approaches to data-driven computational preventive medicine and offers an alternative framework to statistical analysis in preventive medicine research. In its early parts, the thesis proposes a synergy of machine learning methods for detecting patterns and developing inexpensive predictive models from healthcare data to classify the potential occurrence of adverse health events. In particular, the data-driven methodology is founded upon a heuristic-systematic assessment of several machine-learning methods, data preprocessing techniques, models’ training estimation and optimisation, and performance evaluation, yielding a novel computational data-driven framework, Octopus. Midway through this research, the thesis advances research in preventive medicine and data mining by proposing several new extensions in data preparation and preprocessing. It offers new recommendations for data quality assessment checks, a novel multimethod imputation (MMI) process for missing data mitigation, a novel imbalanced resampling approach, and minority pattern reconstruction (MPR) led by information theory. The thesis also extends the area of model performance evaluation with a novel classification performance ranking metric called XDistance. In particular, the experimental results show that building predictive models with the methods guided by the new framework (Octopus) yields domain experts' approval of the new reliable models’ performance. Also, performing the data quality checks and applying the MMI process led healthcare practitioners to prioritise predictive reliability over interpretability. The application of MPR and its hybrid resampling strategies led to better performances in line with experts' success criteria than the traditional imbalanced data resampling techniques. Finally, the XDistance performance ranking metric was found to be more effective in ranking several classifiers' performances while offering an indication of class bias, unlike existing performance metrics. The overall contributions of this thesis can be summarised as follows. First, several data mining techniques were thoroughly assessed to formulate the new Octopus framework to produce new reliable classifiers. In addition, we offer a further understanding of the impact of newly engineered features, the physical activity index (PAI) and biological effective dose (BED). Second, the newly developed methods within the new framework. Finally, the newly accepted developed predictive models help detect adverse health events, namely, visceral fat-associated diseases and advanced breast cancer radiotherapy toxicity side effects. These contributions could be used to guide future theories, experiments and healthcare interventions in preventive medicine and data mining.