Classification hardness for supervised learners on 20 years of intrusion detection data
This article consolidates the analysis of an established intrusion detection dataset (NSL-KDD) and newer ones (ISCXIDS2012, CICIDS2017, CICIDS2018) through the use of supervised machine learning (ML) algorithms. The uniform analysis procedure makes the obtained results directly comparable and provides a stronger foundation for conclusions about the efficacy of supervised learners on the main classification task in network security. This research is motivated in part by the lack of adoption of these modern datasets. The work starts with a broad scope, applying algorithms from different families to both the established and the new datasets, to expand the existing foundation and reveal the most opportune avenues for further inquiry. After obtaining baseline results, the classification task was made harder by reducing the available training data both horizontally (fewer samples) and vertically (fewer features). This data reduction serves as a stress test to verify whether the very high baseline results hold up under increasingly harsh constraints. Ultimately, this work contains the most comprehensive set of results to date on intrusion detection through supervised machine learning. Researchers working on algorithmic improvements can compare their results to this collection, knowing that all results reported here were gathered through a uniform framework. This work's main contributions are the outstanding classification results on the current state-of-the-art datasets for intrusion detection and the conclusion that these methods remain remarkably resilient in classification performance even when the amount of training data is aggressively reduced.
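The horizontal/vertical data-reduction stress test described above can be sketched as follows. This is a minimal illustration, not the paper's framework: the synthetic two-class data, the nearest-centroid classifier, and the reduction fractions are all stand-ins chosen so the example runs with the standard library alone.

```python
import random

random.seed(0)

# Toy stand-in for an intrusion-detection dataset: two classes ("benign",
# "attack") separated in feature space. Shapes and names are illustrative only.
def make_dataset(n=200, n_features=6):
    X, y = [], []
    for _ in range(n):
        label = random.choice(["benign", "attack"])
        shift = 0.0 if label == "benign" else 2.0
        X.append([random.gauss(shift, 1.0) for _ in range(n_features)])
        y.append(label)
    return X, y

def reduce(X, y, row_frac=1.0, col_frac=1.0):
    """Horizontal (rows/samples) and vertical (columns/features) reduction."""
    rows = random.sample(range(len(X)), max(1, int(len(X) * row_frac)))
    cols = random.sample(range(len(X[0])), max(1, int(len(X[0]) * col_frac)))
    return [[X[i][j] for j in cols] for i in rows], [y[i] for i in rows], cols

def nearest_centroid_fit(X, y):
    cents = {}
    for label in set(y):
        pts = [x for x, t in zip(X, y) if t == label]
        cents[label] = [sum(col) / len(pts) for col in zip(*pts)]
    return cents

def nearest_centroid_predict(cents, x):
    return min(cents, key=lambda c: sum((a - b) ** 2 for a, b in zip(x, cents[c])))

X_train, y_train = make_dataset()
X_test, y_test = make_dataset(100)

# Baseline first, then increasingly harsh reductions of the training data.
for row_frac, col_frac in [(1.0, 1.0), (0.25, 1.0), (1.0, 0.5), (0.25, 0.5)]:
    Xr, yr, cols = reduce(X_train, y_train, row_frac, col_frac)
    cents = nearest_centroid_fit(Xr, yr)
    acc = sum(
        nearest_centroid_predict(cents, [x[j] for j in cols]) == t
        for x, t in zip(X_test, y_test)
    ) / len(y_test)
    print(f"rows={row_frac:.2f} cols={col_frac:.2f} acc={acc:.2f}")
```

The pattern mirrors the article's protocol: train once at full size for a baseline, then repeat the identical evaluation while shrinking the training set along each axis.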
Learning from Structured Data with High Dimensional Structured Input and Output Domain
Structured data accumulates rapidly in many applications, e.g. bioinformatics, cheminformatics, social network analysis, natural language processing and text mining. Designing and analyzing algorithms for handling these large collections of structured data has received significant interest in the data mining and machine learning communities, in both the input and the output domain. However, it is nontrivial to adapt traditional machine learning algorithms, e.g. SVM or linear regression, to structured data. For one thing, applying the standard algorithms to structured data ignores the structural information in the input and output domains. For another, the major challenge in learning from high-dimensional structured data is that the input/output domain can contain tens of thousands of features and labels, or more. With a high-dimensional structured input space and/or structured output space, learning a low-dimensional and consistent structured predictive function is important for both the robustness and the interpretability of the model. In this dissertation, we present several machine learning models that learn from data with structured input features and structured output tasks. For learning from data with structured input features, I have developed structured sparse boosting for graph classification and structured joint sparse PCA for anomaly detection and localization. Besides learning from structured input, I also investigated the interplay between structured input and output in the context of multi-task learning. In particular, I designed a multi-task learning algorithm that performs structured feature selection and task relationship inference. We demonstrate the applications of these structured models on subgraph-based graph classification, networked data stream anomaly detection/localization, multiple cancer type prediction, neuron activity prediction and social behavior prediction. Finally, drawing on my internship at IBM T.J. Watson Research, I demonstrate how to leverage structural information from mobile data (e.g. call detail records and GPS data) to derive important places from people's daily lives for transit optimization and urban planning.
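To make the idea of structured (joint) feature selection across tasks concrete, here is a hedged sketch in the spirit of an l2,1-norm penalty: a feature is kept only if its combined weight across all task-specific models is large. The weight matrix and threshold below are made up for illustration; the dissertation's actual algorithm is not reimplemented here.

```python
import math

# Toy per-task weight vectors (rows = tasks, columns = features).
# Values are illustrative: features 0 and 3 are jointly important.
W = [
    [0.9, 0.0, 0.1, 0.8],   # task 1
    [1.1, 0.1, 0.0, 0.7],   # task 2
    [0.8, 0.0, 0.2, 0.9],   # task 3
]

def joint_feature_scores(W):
    """l2 norm of each feature's weights across tasks (one score per column)."""
    return [math.sqrt(sum(w[j] ** 2 for w in W)) for j in range(len(W[0]))]

def select_features(W, tau=0.5):
    """Keep a feature only when all tasks jointly support it above tau."""
    scores = joint_feature_scores(W)
    return [j for j, s in enumerate(scores) if s >= tau]

print(select_features(W))  # → [0, 3]
```

The column-wise norm is what couples the tasks: a feature used by only one task contributes a small score and is dropped, which is the structured-sparsity effect the abstract refers to.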
Oil and Gas flow Anomaly Detection on offshore naturally flowing wells using Deep Neural Networks
Dissertation presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics, specialization in Data Science.
The Oil and Gas industry, as never before, faces multiple challenges. It is criticized for being
dirty and polluting, and hence there is growing demand for green alternatives. Nevertheless, the world still has
to rely heavily on hydrocarbons, since they remain the most traditional and stable source of energy, as opposed
to the extensively promoted hydro, solar or wind power. Major operators are challenged to produce
oil more efficiently, to counteract the newly arising energy sources, with a smaller climate footprint and
more scrutinized expenditure, all while facing high skepticism regarding the industry's future. It has to become
greener, and hence to act in ways not required previously.
While most of the tools used by the hydrocarbon E&P industry are expensive and have been in use for
many years, it is paramount for the industry's survival and prosperity to apply predictive maintenance
technologies that can foresee potential failures, making production safer, lowering downtime,
increasing productivity and diminishing maintenance costs. Many efforts have been made to
define the most accurate and effective predictive methods; however, data scarcity limits the speed
and capacity for further experimentation. Since it would be highly beneficial for the industry to invest
in Artificial Intelligence, this research explores, in depth, the subject of Anomaly Detection,
using open public data from Petrobras that was developed by experts.
For this research, deep recurrent neural networks with LSTM and GRU backbones were implemented
for multi-class classification of undesirable events on naturally flowing wells. Further, several
hyperparameter optimization tools were explored, focusing mainly on Genetic Algorithms as among
the most advanced methods for such tasks.
The research concluded that the best performing model uses 2 stacked GRU layers with the
hyperparameter vector [1, 47, 40, 14], which stands for a timestep of 1, 47 hidden units,
40 epochs and a batch size of 14, producing an F1 score of 0.97.
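The genetic-algorithm search over such a hyperparameter vector can be sketched as below. Everything here is illustrative: the search bounds, population size, and above all the `mock_fitness` function, which merely stands in for "train the 2-stacked-GRU model and return its validation F1" (actual model training is far too expensive for an inline example).

```python
import random

random.seed(42)

# Search ranges for [timestep, hidden units, epochs, batch size].
# These bounds are assumptions for the sketch, not the dissertation's.
BOUNDS = [(1, 10), (8, 64), (5, 50), (8, 128)]

def random_individual():
    return [random.randint(lo, hi) for lo, hi in BOUNDS]

def mock_fitness(ind):
    # Stand-in for the real objective (validation F1 after training).
    # Here: closeness to an arbitrary target vector, for demonstration only.
    target = [1, 47, 40, 14]
    return -sum(abs(a - b) for a, b in zip(ind, target))

def mutate(ind, rate=0.3):
    return [
        random.randint(lo, hi) if random.random() < rate else g
        for g, (lo, hi) in zip(ind, BOUNDS)
    ]

def crossover(a, b):
    cut = random.randint(1, len(a) - 1)
    return a[:cut] + b[cut:]

def evolve(pop_size=20, generations=40):
    pop = [random_individual() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=mock_fitness, reverse=True)   # elitist selection
        parents = pop[: pop_size // 2]
        children = [
            mutate(crossover(*random.sample(parents, 2)))
            for _ in range(pop_size - len(parents))
        ]
        pop = parents + children
    return max(pop, key=mock_fitness)

best = evolve()
print(best)
```

In the real setting each fitness evaluation is a full training run, which is why the abstract's emphasis on efficient hyperparameter search matters: the GA must find good vectors with as few evaluations as possible.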
As the world faces many issues, one of which is the detrimental effect of heavy industries on the
environment and, as a result, adverse global climate change, this project is an attempt to contribute to
the field of applying Artificial Intelligence in the Oil and Gas industry, with the intention of making it
more efficient, transparent and sustainable.
Machine Learning-based Approaches for Advanced Monitoring of Smart Glasses
With today's growing demands on productivity, product quality and effectiveness, the importance of Machine Learning-based functionalities and services has dramatically increased. Such a paradigm shift can be mainly associated with the increasing availability of Internet of Things (IoT) sensors and devices, the growth of data collected in IoT scenarios and the increasing popularity and availability of machine learning approaches. One of the most appealing applications of ML-based solutions is certainly Predictive Maintenance (PdM), which aims at improving maintenance management by exploiting estimates of the health status of a piece of equipment. One of the main formalizations of the PdM problem is the prediction of the Remaining Useful Life (RUL), defined as the time/process iterations remaining for a device component to perform its task before it loses functionality. This work investigates a possible application of predictive maintenance techniques to monitoring the battery of Smart Glasses. The work starts with a description of the considered devices, the modalities of data collection and the Exploratory Data Analysis carried out to better understand the task. The first experimental part consists of the application of an unsupervised anomaly detection technique, useful for initially dealing with the partial and unlabeled data. The last part of the work contains the results of applying both classical machine learning and deep learning approaches to the estimation of the RUL of the devices' battery. A section on the interpretation of the machine learning models is included for both the anomaly detection and the RUL estimation approaches.
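The supervised RUL formulation above amounts to relabeling run-to-failure logs so that each timestep carries "cycles remaining until failure" as its target. A minimal sketch, with made-up device names and cycle counts (the thesis's actual data pipeline is not reproduced here):

```python
# Hypothetical run-to-failure logs: total charge cycles each battery survived.
runs = {
    "glasses_A": 120,
    "glasses_B": 95,
}

def rul_targets(total_cycles):
    """RUL at cycle t = cycles remaining before the battery fails."""
    return [total_cycles - t for t in range(total_cycles)]

labels = {dev: rul_targets(n) for dev, n in runs.items()}
print(labels["glasses_A"][:5])  # → [120, 119, 118, 117, 116]
```

Once the targets exist, RUL estimation becomes ordinary regression: any classical ML or deep model can be fit to (sensor features at cycle t, RUL at cycle t) pairs, which is exactly the setup the last part of the work evaluates.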
Forecasting Player Behavioral Data and Simulating in-Game Events
Understanding player behavior is fundamental in game data science. Video
games evolve as players interact with the game, so being able to foresee player
experience would help ensure successful game development. In particular,
game developers need to evaluate beforehand the impact of in-game events.
Simulation optimization of these events is crucial to increase player
engagement and maximize monetization. We present an experimental analysis of
several methods to forecast game-related variables, with two main aims: to
obtain accurate predictions of in-app purchases and playtime in an operational
production environment, and to perform simulations of in-game events in order
to maximize sales and playtime. Our ultimate purpose is to take a step towards
the data-driven development of games. The results suggest that, even though the
performance of traditional approaches such as ARIMA is still better, the
outcomes of state-of-the-art techniques like deep learning are promising. Deep
learning comes up as a well-suited general model that could be used to forecast
a variety of time series with different dynamic behaviors.
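As a toy illustration of the kind of baseline the ARIMA comparison rests on, here is a least-squares AR(1) forecaster next to a naive last-value forecast. The series, the train/test split, and the use of plain AR(1) in place of full ARIMA are all simplifications for the sketch.

```python
# Synthetic daily series (e.g. playtime); values are illustrative only.
series = [10, 12, 11, 13, 14, 13, 15, 16, 15, 17, 18, 17, 19, 20]

def fit_ar1(y):
    """OLS fit of y[t] = a + b * y[t-1]."""
    x, t = y[:-1], y[1:]
    n = len(x)
    mx, mt = sum(x) / n, sum(t) / n
    b = sum((xi - mx) * (ti - mt) for xi, ti in zip(x, t)) / \
        sum((xi - mx) ** 2 for xi in x)
    a = mt - b * mx
    return a, b

def forecast(y, steps, a, b):
    """Iterate the fitted recurrence forward from the last observation."""
    out, last = [], y[-1]
    for _ in range(steps):
        last = a + b * last
        out.append(last)
    return out

train, test = series[:10], series[10:]
a, b = fit_ar1(train)
ar1_preds = forecast(train, len(test), a, b)
naive_preds = [train[-1]] * len(test)

mae = lambda p, t: sum(abs(pi - ti) for pi, ti in zip(p, t)) / len(t)
print(f"AR(1) MAE={mae(ar1_preds, test):.2f}  naive MAE={mae(naive_preds, test):.2f}")
```

The abstract's finding, that such classical autoregressive baselines still edge out deep models on these game variables, is precisely why any deep learning result should be reported alongside a comparison like this one.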
TSE-IDS: A Two-Stage Classifier Ensemble for Intelligent Anomaly-based Intrusion Detection System
Intrusion detection systems (IDS) play a pivotal role in computer security by discovering and repelling malicious activities in computer networks. Anomaly-based IDS, in particular, rely on classification models trained on historical data to discover such malicious activities. In this paper, an improved IDS based on hybrid feature selection and two-level classifier ensembles is proposed. A hybrid feature selection technique comprising three methods, i.e. particle swarm optimization, the ant colony algorithm, and a genetic algorithm, is utilized to reduce the feature size of the training datasets (NSL-KDD and UNSW-NB15 are considered in this paper). Features are selected based on the classification performance of a reduced error pruning tree (REPT) classifier. Then, a two-level classifier ensemble based on two meta-learners, i.e. rotation forest and bagging, is proposed. On the NSL-KDD dataset, the proposed classifier shows 85.8% accuracy, 86.8% sensitivity, and 88.0% detection rate, which remarkably outperforms other classification techniques recently proposed in the literature. Results on the UNSW-NB15 dataset also improve on those achieved by several state-of-the-art techniques. Finally, to verify the results, a two-step statistical significance test is conducted. This has rarely been considered in IDS research thus far and therefore adds value to the experimental results achieved by the proposed classifier.
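The two-stage structure, feature selection scored by a base classifier followed by an ensemble, can be sketched as below. This is a heavily simplified stand-in: decision stumps replace REPT and rotation forest (neither is reimplemented), the data are synthetic, and the wrapper criterion is a bare accuracy threshold rather than the paper's metaheuristic search.

```python
import random

random.seed(1)

# Synthetic 2-class data: features 0 and 2 are informative, feature 1 is noise.
def make_data(n=300):
    X, y = [], []
    for _ in range(n):
        label = random.randint(0, 1)
        X.append([
            random.gauss(label * 2.0, 1.0),  # informative
            random.gauss(0.0, 1.0),          # noise
            random.gauss(label * 1.5, 1.0),  # informative
        ])
        y.append(label)
    return X, y

def stump_fit(X, y, j):
    """Decision stump on feature j, thresholded at the feature mean."""
    thr = sum(x[j] for x in X) / len(X)
    return j, thr

def stump_predict(model, x):
    j, thr = model
    return 1 if x[j] > thr else 0

def accuracy(model, X, y):
    return sum(stump_predict(model, x) == t for x, t in zip(X, y)) / len(y)

X, y = make_data()

# Stage 1 (wrapper feature selection): keep features whose single-feature
# classifier comfortably beats chance.
selected = [j for j in range(3) if accuracy(stump_fit(X, y, j), X, y) > 0.6]

# Stage 2 (bagging): train stumps on bootstrap samples, majority-vote.
models = []
for _ in range(15):
    idx = [random.randrange(len(X)) for _ in range(len(X))]
    Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
    models.append(stump_fit(Xb, yb, random.choice(selected)))

def ensemble_predict(x):
    votes = sum(stump_predict(m, x) for m in models)
    return 1 if votes > len(models) / 2 else 0

acc = sum(ensemble_predict(x) == t for x, t in zip(X, y)) / len(y)
print(f"selected features: {selected}, ensemble accuracy: {acc:.2f}")
```

The key design point survives the simplification: stage 1 filters features using the same kind of model that stage 2 aggregates, so the ensemble only ever votes over features the base learner can actually exploit.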