3,771 research outputs found

    A New Genetic Programming Algorithm for Building Decision Tree

    Abstract: Genetic programming (GP) is a flexible and powerful evolutionary technique with special features that suit it to building tree-structured classifiers. However, an unsuitable step size in the editing operator destroys the continuity of the evolution. In this paper, we propose a multiage genetic programming (MGP) algorithm to build a classifier on a given training set. Individuals are divided into groups according to their age (tree size), and competition between individuals is limited to members of the same group. This prevents the structure-editing operators from destroying the continuity of the evolution. The experimental results show that the MGP algorithm is superior to traditional GP in building decision trees.
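The age-grouping idea in this abstract can be illustrated with a minimal sketch. It is not the authors' MGP implementation: the dictionary layout, the `bin_width` parameter, and the two-way tournament are assumptions made only for illustration. The one property taken from the abstract is that individuals are grouped by tree size and compete only within their own group.

```python
import random
from collections import defaultdict


def group_by_age(population, bin_width):
    """Bucket individuals by 'age', taken here to be tree size (node count)."""
    groups = defaultdict(list)
    for individual in population:
        # bin_width is a hypothetical parameter: sizes 0-9 share a group, 10-19 the next, ...
        groups[individual["size"] // bin_width].append(individual)
    return groups


def select_within_groups(population, bin_width, rng):
    """Run one small tournament per age group so trees only compete with similar-sized trees."""
    winners = []
    for _, members in sorted(group_by_age(population, bin_width).items()):
        contestants = rng.sample(members, k=min(2, len(members)))
        winners.append(max(contestants, key=lambda ind: ind["fitness"]))
    return winners


# Made-up population: each individual is summarised by its tree size and fitness.
population = [
    {"size": 3, "fitness": 0.61},
    {"size": 4, "fitness": 0.70},
    {"size": 12, "fitness": 0.82},
    {"size": 14, "fitness": 0.79},
    {"size": 27, "fitness": 0.90},
]
winners = select_within_groups(population, bin_width=10, rng=random.Random(0))
print([w["size"] for w in winners])
```

Because selection happens per group, a small tree is never eliminated merely because a much larger tree has higher fitness, which is the behaviour the abstract attributes to limiting competition to the same group.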

    Improving a Network Intrusion Detection System’s Efficiency Using Model-Based Data Augmentation

    A network intrusion detection system (NIDS) is an important element in mitigating cybersecurity risks: it detects anomalies in a network that may indicate a cyberattack on a corporate network environment. Intrusion detection can be framed as a classification problem in which the ultimate goal is to distinguish malicious traffic from a majority of benign traffic. Research on NIDS is often performed using outdated datasets that do not represent the current state of cyberspace. Datasets such as CICIDS2018 address this gap by being generated from attacks and an infrastructure that reflect an up-to-date scenario. A problem may arise when machine learning classification algorithms are trained on a dataset with class imbalance, as is the case for CICIDS2018, where the majority class is legitimate traffic. This problem can be tackled by modifying the dataset's probability distribution, augmenting the existing data to achieve class balance. Many different methods can be used to do so, ranging from naive approaches such as random oversampling (ROS) or undersampling, through machine learning methods such as SMOTE and decision trees, to sophisticated deep learning models such as GAN and CTGAN. An evaluation of the different data-augmentation methods for training a random forest classifier showed that ROS and SMOTE are competitive in detecting attacks, while CTGAN better recognized benign samples and provided a balance between security and functionality for the network, albeit at the expense of computational resources.
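As a rough illustration of the simplest method named above, random oversampling, the following sketch duplicates randomly chosen minority-class rows until every class matches the majority count. It is an assumption-laden toy, not the paper's pipeline: the function name `random_oversample`, the list-based data layout, and the sample rows are all hypothetical.

```python
import random
from collections import Counter


def random_oversample(rows, labels, rng):
    """Duplicate randomly chosen minority-class rows until all classes are equally frequent."""
    counts = Counter(labels)
    target = max(counts.values())  # size of the majority class
    out_rows, out_labels = list(rows), list(labels)
    for label, count in counts.items():
        pool = [row for row, lab in zip(rows, labels) if lab == label]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))  # sampling with replacement
            out_labels.append(label)
    return out_rows, out_labels


# Made-up flow features: six benign rows and two attack rows.
rows = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6], [9.1], [9.7]]
labels = ["benign"] * 6 + ["attack"] * 2
balanced_rows, balanced_labels = random_oversample(rows, labels, random.Random(42))
print(Counter(balanced_labels))
```

Oversampling should be applied to the training split only; duplicating rows before splitting would leak copies of the same sample into the test set.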

    The Application of Data Analytics and Machine Learning for Formation Classification and Bit Dull Grading Prediction

    Master's thesis in Petroleum Engineering. The oil and gas industry, especially its upstream sector, generates a massive amount of data. Proper data collection and processing are vital to reducing non-productive time and increasing the efficiency of drilling operations. A major part of each well program is drill bit selection. The bit is the most important tool: it does the slicing or crushing downhole and strongly affects overall drilling performance. However, drill bit selection is mostly based on lessons learned from previous runs and on bit grading after each run. These methods are highly subjective and usually depend on the engineer's experience. The abundance of field data, combined with data analytics and machine learning capabilities, is a perfect basis for creating reliable data-driven models. The main objective of this study is to create robust models that can classify the formation based on drilling parameters and estimate the bit dull grading based on drilling parameters and the formation. To achieve these goals, the disclosed Volve field dataset was meticulously processed and analyzed. The models were created for each of the well sections using Python, in particular the pandas and scikit-learn libraries. However, after the first simulation run, the models usually showed unsatisfactory accuracy. To improve model performance, code was written to find the best parameter for each machine learning technique. Even though the bit dull grading model has a valid algorithm, the input parameters are hard to find because of the lack of literature and patterns. The results obtained show that machine learning techniques can be successfully applied to everyday problems in the oil and gas industry. Moreover, the outcome should help in the well planning process, decrease the number of trips, and improve the overall efficiency of the drilling phase. 
The process could eliminate trial-and-error drill bit selection and ensure a more efficient and effective decision-making process.

    The Challenges in SDN/ML Based Network Security : A Survey

    Machine Learning is gaining popularity in the network security domain as many more network-enabled devices get connected, as malicious activities become stealthier, and as new technologies like Software Defined Networking (SDN) emerge. Sitting at the application layer and communicating with the control layer, machine learning based SDN security models exercise a huge influence on the routing/switching of the entire SDN. Compromising the models is consequently a very desirable goal. Previous surveys have been done on either adversarial machine learning or the general vulnerabilities of SDNs but not both. Through examination of the latest ML-based SDN security applications and a good look at ML/SDN specific vulnerabilities accompanied by common attack methods on ML, this paper serves as a unique survey, making a case for more secure development processes of ML-based SDN security applications. Comment: 8 pages. arXiv admin note: substantial text overlap with arXiv:1705.0056

    AN INVESTIGATION OF EVOLUTIONARY COMPUTING IN SYSTEMS IDENTIFICATION FOR PRELIMINARY DESIGN

    This research investigates the integration of evolutionary techniques for symbolic regression. In particular the genetic programming paradigm is used together with other evolutionary computational techniques to develop novel approaches to the improvement of areas of simple preliminary design software using empirical data sets. It is shown that within this problem domain, conventional genetic programming suffers from several limitations, which are overcome by the introduction of an improved genetic programming strategy based on node complexity values, and utilising a steady state algorithm with subpopulations. A further extension to the new technique is introduced which incorporates a genetic algorithm to aid the search within continuous problem spaces, increasing the robustness of the new method. The work presented here represents an advance in the field of genetic programming for symbolic regression with significant improvements over the conventional genetic programming approach. Such improvement is illustrated by extensive experimentation utilising both simple test functions and real-world design examples

    Optimal identification of unknown groundwater contaminant sources in conjunction with designed monitoring networks

    Human activities and improper management practices have resulted in widespread deterioration of groundwater quality worldwide. Groundwater contamination has seriously threatened its beneficial use in recent decades. Remediation processes are necessary for groundwater management. In the remediation of contaminated aquifer sites, identification of unknown groundwater contaminant sources has a crucial role. In other words, an effective groundwater remediation process needs an accurate identification of contaminant sources in terms of source locations, magnitudes and time of release. On the other hand, the efficiency and reliability of contaminant source identification depend on the availability, adequacy, and accuracy of hydrogeologic information and contaminant concentration measurement data. However, when groundwater contamination is detected, generally only limited and sparse measured contaminant concentration values are available. Contamination is usually detected after a long time, years or even decades after the contaminant source activities started, or even after they have ceased. Consequently, there is usually not enough information regarding the number of contaminant sources, the duration of the sources' activities and the contaminant magnitudes, or the hydrogeologic parameters of the contaminated aquifers. Simulations of groundwater flow and solute transport involve intrinsic uncertainties because of this sparse or insufficient hydrogeologic information about the porous medium. Therefore, for groundwater management, developing and applying an efficient procedure for identification of unknown contaminant sources is essential. Moreover, available observed contaminant concentration values are usually erroneous, and these erroneous data can cause instability in the solution results. 
Various combinations of source characteristics can result in similar effects at observation locations and cause non-uniqueness in the solution. Because of this instability and non-uniqueness in the solution (Datta, 2002), the source identification problem is known as an "ill-posed problem" (Yeh, 1986). The non-uniqueness and uncertainties involved make this ill-posed problem a difficult and complex task. The methodologies suggested so far to tackle it are not completely efficient. For instance, previous approaches are highly sensitive to the accuracy and adequacy of contaminant concentration measurements and hydrogeologic data. As a result, many of the previously suggested approaches are not applicable to real-world cases, and applying the relevant approaches to real-world contaminated aquifer sites is usually tedious and time-consuming. The suggested methodologies involve enormous computational time and cost because of repeated runs of the numerical simulation models within the optimisation algorithms. Therefore, to identify the unknown characteristics of contaminant sources, different surrogate models were developed. Three different algorithms were utilized for developing the surrogate models: Self-Organising Maps (SOM), Gaussian Process Regression (GPR), and Multivariate Adaptive Regression Splines (MARS). Performance of the developed procedures was assessed for potential applicability in two hypothetical contaminated aquifer sites, one experimental site, and one real-world site. In these contaminated aquifer sites, only limited contaminant concentration data were assumed to be available. In three cases, it was also assumed that the contaminant concentration data were collected a long time after the start of the first potential contaminant source activities. The performance evaluations of the developed surrogate models show that these models could accurately mimic the behaviour of simulation models of groundwater flow and solute transport. 
The surrogate model solutions showed acceptable errors in comparison to the more robust numerical model solutions. These surrogate models were also used for identification of unknown groundwater contaminant sources when utilized to solve the inverse problem. The SOM algorithm was chosen as the surrogate model type in this study for directly addressing the source identification problem as well, because of its classification capabilities. In source identification problems, the number of actual contaminant sources is uncertain and usually a larger set of potential contaminant sources is assumed. Therefore, screening the active sources with SOM-based Surrogate Models (SOM-based SMs) may simplify the source identification problem. The performance of the developed SOM-based SMs was assessed for different scenarios. Results indicate that the developed models could also accurately screen the active sources among all potential contaminant sources with sparse contaminant concentration data and uncertain hydrogeologic information. For comparison purposes, the MARS and GPR algorithms, which are precise prediction tools, were also utilized for developing MARS and GPR-based Surrogate Models (MARS and GPR-based SMs) for source identification. Performance of the developed surrogate models for source identification was evaluated in terms of Normalized Absolute Error of Estimation (NAEE). For example, the performance of the developed SOM, MARS and GPR-based SMs was assessed in an illustrative hypothetical contaminated aquifer site. The results for testing data in terms of NAEE were equal to 16.3, 4.9 and 6.6%, respectively. Performance of the developed SOM, MARS and GPR-based SMs was also evaluated in an experimental contaminated aquifer site. The results for testing data in terms of NAEE were equal to 15.8, 14.1 and 16.2%, respectively. 
These performance evaluation results indicate that the MARS-based SMs can be more accurate than the SOM and GPR-based SMs in source identification problems. The most important advantage of the developed methodologies is their direct application for source identification in an inverse mode without linking to an optimisation model. Surrogate Model-Based Optimisation (SMO) was also developed and utilized for source identification. In this SMO, MARS and a Genetic Algorithm (GA) were utilized as the surrogate model and the optimisation model, respectively. The performance of the MARS-based SMO was assessed in an illustrative hypothetical contaminated aquifer site and in a real-world contaminated aquifer site. The result of the developed MARS-based SMO for testing data in the illustrative hypothetical contaminated aquifer site in terms of Root Mean Square Error (RMSE) was equal to 0.92. The solution result of the developed MARS-based SM in the real contaminated study area for testing data in terms of RMSE was equal to 42.5. The performance evaluation results of the developed methodologies in different hypothetical and real contaminated study areas demonstrate the capabilities of the constructed SOM, GPR, and MARS-based SMs and the MARS-based SMO for source identification. Also, in order to increase the accuracy of source identification results, and based on the preliminary solution results of the developed SOM-based SMs, a sequential sampling method can be applied adaptively for updating the developed surrogate models. Information from a hypothetical contaminated aquifer site was used to assess the performance of this procedure. Performance evaluation results of the adaptively developed MARS and GPR-based SMs in terms of NAEE were equal to 1.9 and 2.1%, respectively. The results show 3 and 4.5% improvements in source identification results from applying the adaptively developed MARS and GPR-based SMs, respectively. 
Another difficulty with source identification problems has been the limitation and sparsity of observed contaminant concentration data. Previously suggested methodologies usually need long-term observation data at numerous locations, which can involve large costs. Therefore, developing an effective monitoring network design procedure was one of the main goals of this study. In designing the monitoring networks, two main objectives were considered: (1) maximizing the accuracy of source identification results, and (2) limiting the number of monitoring locations. It was supposed that using the results obtained from the designed monitoring networks to develop surrogate models would significantly improve the source identification results. In this study, different algorithms were utilized to identify potentially important and effective monitoring locations that could improve source identification results. These algorithms are Random Forests (RF), Tree Net (TN) and CART. The performance of these algorithms was evaluated in different scenarios. Results indicate the potential applicability of these algorithms in recognising the most important components of prediction models. As a result, these algorithms could be applied to designing monitoring networks that improve the efficiency and accuracy of source identification. Concentration measurement information from a designed monitoring network and from a set of arbitrary monitoring sites was utilized to develop MARS-based surrogate models for source identification. The solution results for these two scenarios, designed monitoring and arbitrary measurements, were compared for a hypothetical study area for evaluation purposes. Performance evaluation results of the surrogate model developed using information from the designed monitoring network showed an improvement in source identification error in terms of RMSE for testing data of 0.7. 
The information obtained from the designed monitoring network was used to develop a MARS-based SM for source identification of testing data in a real contaminated aquifer site. Source identification results of this MARS-based SM with testing data for the real contaminated aquifer site showed an improvement of 35.3 in terms of RMSE compared to the solution results of the MARS-based SM developed using information obtained from arbitrary monitoring locations. Performance evaluation results for the developed monitoring network procedure demonstrate the potential applicability of this procedure for source identification.
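The RMSE comparisons reported in this abstract follow the standard root-mean-square-error definition, sketched below. The source fluxes are invented numbers used only to show how an "improvement in RMSE" between two monitoring scenarios would be computed; they are not values from the study, and the variable names are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences of numbers."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))


# Invented source fluxes (illustrative only, not data from the thesis).
true_flux = [10.0, 20.0, 30.0]
estimate_designed_network = [11.0, 19.0, 30.0]   # model trained on designed monitoring data
estimate_arbitrary_sites = [13.0, 16.0, 30.0]    # model trained on arbitrary monitoring data

improvement = rmse(true_flux, estimate_arbitrary_sites) - rmse(true_flux, estimate_designed_network)
print(f"RMSE improvement from the designed network: {improvement:.3f}")
```

A positive difference means the model built from the designed network has the lower error, which is the sense in which the abstract reports improvements of 0.7 and 35.3.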