
    Behavioral complexity analysis of networked systems to identify malware attacks

    2020 Fall. Includes bibliographical references. Internet of Things (IoT) environments are often composed of a diverse set of devices that span a broad range of functionality, making them a challenge to secure. This diversity of function leads to a commensurate diversity in network traffic: some devices have simple network footprints and others have complex ones. This network complexity in a device's traffic provides a differentiator that the network can use to distinguish which devices are most effectively managed autonomously and which are not. This study proposes an informed autonomous learning method that quantifies the complexity of a device based on historic traffic and applies this complexity metric to build a probabilistic model of the device's normal behavior using a Gaussian Mixture Model (GMM). This method results in an anomaly detection classifier with inlier probability thresholds customized to the complexity of each device, without requiring labeled data. The model's efficacy is then evaluated using seven common types of real malware traffic across four device datasets of network traffic: one residential, two from labs, and one consisting of commercial automation devices. The results of the analysis of over 100 devices and 800 experiments show that the model leads to highly accurate representations of the devices and a strong correlation between the measured complexity of a device and the accuracy with which its network behavior can be modeled.
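The idea of a GMM-based anomaly classifier with a per-device inlier threshold can be illustrated with a minimal sketch. This is not the study's implementation: the abstract does not describe the complexity metric or the traffic features, so the one-dimensional feature, the two-component mixture, the function names, and the `outlier_fraction` knob (standing in for a complexity-dependent threshold) are all assumptions made for illustration.

```python
import math
import random

def gaussian_pdf(x, mean, var):
    """Density of a 1-D normal distribution."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_gmm_1d(xs, n_iter=50):
    """Fit a two-component 1-D GMM with EM; returns [(weight, mean, var), ...]."""
    # Deterministic initialisation: one component at each end of the data.
    comps = [(0.5, min(xs), 1.0), (0.5, max(xs), 1.0)]
    for _ in range(n_iter):
        # E-step: responsibility of each component for each sample.
        resp = []
        for x in xs:
            p = [w * gaussian_pdf(x, m, v) for w, m, v in comps]
            total = sum(p) or 1e-300
            resp.append([pk / total for pk in p])
        # M-step: re-estimate weight, mean and variance of each component.
        updated = []
        for k in range(len(comps)):
            nk = sum(r[k] for r in resp)
            mean = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - mean) ** 2 for r, x in zip(resp, xs)) / nk
            updated.append((nk / len(xs), mean, max(var, 1e-6)))
        comps = updated
    return comps

def log_likelihood(x, comps):
    """Log density of x under the mixture (floored to avoid log(0))."""
    density = sum(w * gaussian_pdf(x, m, v) for w, m, v in comps)
    return math.log(max(density, 1e-300))

def fit_threshold(xs, comps, outlier_fraction):
    """Log-likelihood below which roughly outlier_fraction of training data falls.

    outlier_fraction is a hypothetical per-device knob, not the study's metric.
    """
    scores = sorted(log_likelihood(x, comps) for x in xs)
    return scores[int(outlier_fraction * len(scores))]

# Unlabeled synthetic "normal" traffic feature with two modes (assumed data).
rng = random.Random(0)
train = [rng.gauss(10, 1) for _ in range(200)] + [rng.gauss(20, 1) for _ in range(200)]
model = fit_gmm_1d(train)
threshold = fit_threshold(train, model, outlier_fraction=0.01)

def is_anomaly(x):
    """Flag x when its likelihood under the device model is below the threshold."""
    return log_likelihood(x, model) < threshold
```

In this sketch, a device with a simple footprint could be given a small `outlier_fraction` (a tight model) and a complex one a larger value; how the study actually maps complexity to thresholds is not stated in the abstract.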

    Data Mining

    The availability of big data due to computerization and automation has generated an urgent need for new techniques to analyze and convert big data into useful information and knowledge. Data mining is a promising and leading-edge technology for mining large volumes of data, looking for hidden information, and aiding knowledge discovery. It can be used for characterization, classification, discrimination, anomaly detection, association, clustering, trend or evolution prediction, and much more in fields such as science, medicine, economics, engineering, computing, and even business analytics. This book presents basic concepts, ideas, and research in data mining.

    An Anomaly-based Intrusion Detection System in Presence of Benign Outliers with Visualization Capabilities

    Abnormal network traffic analysis through Intrusion Detection Systems (IDSs) and visualization techniques has become an important research topic in protecting computer networks from intruders. It remains challenging to design an accurate and robust IDS with visualization capabilities to discover security threats, due to the high volume of network traffic. This research work introduces and describes a novel anomaly-based intrusion detection system in the presence of long-range independence data called benign outliers, using a neural projection architecture based on a modified Self-Organizing Map (SOM) not only to detect attacks and anomalies accurately, but also to provide visualized information and insights to end users. The proposed approach enables better analysis by merging the large amount of network traffic into an easy-to-understand 2D format with simple user interaction. To show the performance of the proposed visualization-based IDS and validate it, it has been trained and tested on synthetic and real benchmarking datasets (NSL-KDD, UNSW-NB15, AAGM and VPN-nonVPN) that are widely applied in this domain. The results of the experimental study confirm the advantages and effectiveness of the proposed approach.

    Advanced random forest approaches for outlier detection

    Outlier Detection (OD) is a Pattern Recognition task which consists of finding those patterns in a set of data which are likely to have been generated by a different mechanism than the one underlying the rest of the data. The importance of OD is visible in everyday life. Indeed, fast and accurate detection of outliers is crucial: for example, in the electrocardiogram of a patient, an abnormality in the heart rhythm can cause severe health problems. Due to the high number of fields in which OD is needed, several approaches have been designed. Among them, Random Forest-based techniques have raised great interest in the research community: a Random Forest (RF) is an ensemble of Decision Trees where each tree is diverse and independent. They are characterized by a high degree of flexibility, robustness, and high generalization capabilities. Even though they were originally designed for classification and regression, in recent years, due to their success, there has been increased development of RF-based approaches for other learning tasks, including OD. The forerunner of several RF methods for OD is Isolation Forest (iForest), a technique whose main principle is isolation, i.e. the separation of each object from the rest of the data. Since outliers are different from the rest of the data and thus easier to separate, we can easily identify them as those objects isolated after only a few splits in the tree. iForests have been employed in a great variety of application fields, showing excellent performance. This thesis is set in the above context: even if some extensions of basic RF-based approaches for OD have been proposed, their potential has not been fully exploited and there is considerable room for improvement. In this thesis, we introduce some advanced RF-based techniques for OD, investigating both methodological issues and alternative uses of these flexible approaches. In detail, we moved along four research directions. The starting point of the first one is the absence of RF methods for OD able to work with non-vectorial data: here we propose ProxIForest, an approach which works with all types of data for which a distance measure can be defined, thus including non-vectorial data as well. Indeed, for the latter, many powerful distances have been proposed. The second direction focuses on how to measure the outlierness degree of an object in an RF, i.e. the anomaly score, since most extensions of iForest concern only the tree building procedure. In detail, we propose two novel classes of methods: the first class exploits the information contained within a tree. The second one focuses on the ensemble aspect of RFs: the aggregation of the anomaly scores extracted from each tree is crucial to correctly identify outliers. As to the third research direction, we took a different perspective, exploiting the fact that each tree in a forest is a space partitioner encoding relations, i.e. distances, between objects. Whereas this aspect has been widely researched in the clustering field, it has never been investigated for OD: we extract from an iForest a distance measure and input it to an outlier detector. As the last research direction, we designed a new variant of iForest to characterize multiple sclerosis given a brain connectivity network: we cast the problem as an OD task, by making an analogy between disconnected brain regions, the hallmark of the disease, and outliers. All proposals have been thoroughly empirically validated on either classical or ad hoc datasets: we performed several analyses, including comparisons to state-of-the-art approaches and statistical tests. This thesis proves the suitability of RF-based approaches for OD from different perspectives: not only can they be successfully used for the task, but we can also use them to extract distances or features. Further, by contributing to this field, this thesis proves that there are still many aspects requiring further investigation.
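The isolation principle described in this abstract (outliers are separated from the rest of the data after fewer random splits) can be sketched in a few lines. This is a simplified illustration of the basic iForest idea, not the thesis's ProxIForest or any of its proposed scoring methods; the function names, the synthetic data, and the omission of the usual path-length normalisation are assumptions made for brevity.

```python
import random

def path_length(point, data, rng, depth=0, max_depth=12):
    """Count the random axis-parallel splits needed to isolate `point` in `data`."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    dim = rng.randrange(len(point))
    lo = min(row[dim] for row in data)
    hi = max(row[dim] for row in data)
    if lo == hi:
        return depth  # no split possible on this dimension; stop (simplification)
    split = rng.uniform(lo, hi)
    # Keep only the side of the split on which the point falls.
    if point[dim] < split:
        side = [row for row in data if row[dim] < split]
    else:
        side = [row for row in data if row[dim] >= split]
    return path_length(point, side, rng, depth + 1, max_depth)

def mean_path_length(point, data, n_trees=200, sample_size=64, seed=0):
    """Average isolation depth over random subsamples; lower means more outlying."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_trees):
        sample = rng.sample(data, min(sample_size, len(data)))
        total += path_length(point, sample, rng)
    return total / n_trees

# Synthetic example (assumed data): one dense cluster plus one distant point.
rng = random.Random(1)
cluster = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
outlier = (8.0, 8.0)
inlier = (0.0, 0.0)
data = cluster + [outlier]

outlier_depth = mean_path_length(outlier, data)
inlier_depth = mean_path_length(inlier, data)
```

Here `outlier_depth` should come out smaller than `inlier_depth`, which is the signal an isolation-based detector turns into an anomaly score. A full iForest builds each tree once and normalises path lengths by the expected depth for the subsample size; both steps are left out of this sketch.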

    Interconnected Services for Time-Series Data Management in Smart Manufacturing Scenarios

    xvii, 218 p. The rise of Smart Manufacturing, together with the strategic initiatives carried out worldwide, has promoted its adoption among manufacturers, who are increasingly interested in boosting data-driven applications for different purposes, such as product quality control and predictive maintenance of equipment. However, the adoption of these approaches faces diverse technological challenges with regard to the data-related technologies supporting the manufacturing data life-cycle. The main contributions of this dissertation focus on two specific challenges related to the early stages of the manufacturing data life-cycle: optimized storage of the massive amounts of data captured during the production processes and efficient pre-processing of that data. The first contribution consists of the design and development of a system that facilitates the pre-processing of the captured time-series data through an automated approach that helps in selecting the most adequate pre-processing techniques to apply to each data type. The second contribution is the design and development of a three-level hierarchical architecture for time-series data storage in cloud environments that helps to manage and reduce the required data storage resources (and consequently their associated costs). Moreover, with regard to the later stages, a third contribution is proposed that leverages advanced data analytics to build an alarm prediction system, which enables predictive maintenance of equipment by anticipating the activation of different types of alarms that can be produced in a real Smart Manufacturing scenario.

    On the Principles of Evaluation for Natural Language Generation

    Natural language processing is concerned with the ability of computers to understand natural language texts, which is, arguably, one of the major bottlenecks in the course of chasing the holy grail of general Artificial Intelligence. Given the unprecedented success of deep learning technology, the natural language processing community has been almost entirely in favor of practical applications, with state-of-the-art systems emerging and competing for human-parity performance at an ever-increasing pace. For that reason, fair and adequate evaluation and comparison, responsible for ensuring trustworthy, reproducible and unbiased results, have long fascinated the scientific community, not only in natural language but also in other fields. A popular example is the ISO-9126 evaluation standard for software products, which outlines a wide range of evaluation concerns, such as cost, reliability, scalability, security, and so forth. The European project EAGLES-1996, being the acclaimed extension to ISO-9126, depicted the fundamental principles specifically for evaluating natural language technologies, which underpin succeeding methodologies in the evaluation of natural language. Natural language processing encompasses an enormous range of applications, each with its own evaluation concerns, criteria and measures. This thesis cannot hope to be comprehensive but specifically addresses evaluation in natural language generation (NLG), which touches on, arguably, one of the most human-like natural language applications. In this context, research on quantifying day-to-day progress with evaluation metrics lays the foundation of the fast-growing NLG community. However, previous works have failed to address high-quality metrics in multiple scenarios, such as evaluating long texts and when human references are not available, and, more prominently, these studies are limited in scope, given the lack of a holistic view sketched for principled NLG evaluation. In this thesis, we aim for a holistic view of NLG evaluation from three complementary perspectives, driven by the evaluation principles in EAGLES-1996: (i) high-quality evaluation metrics, (ii) rigorous comparison of NLG systems for properly tracking progress, and (iii) understanding evaluation metrics. To this end, we identify the current state of challenges derived from the inherent characteristics of these perspectives, and then present novel metrics, rigorous comparison approaches, and explainability techniques for metrics to address the identified issues. We hope that our work on evaluation metrics, system comparison and explainability for metrics inspires more research towards principled NLG evaluation, and contributes to fair and adequate evaluation and comparison in natural language processing.

    Supporting Safety Analysis of Deep Neural Networks with Automated Debugging and Repair


    Connected Attribute Filtering Based on Contour Smoothness
