4 research outputs found

    Anomaly Detection System for Data Quality Assurance in IoT infrastructures based on Machine Learning

    The inclusion of IoT in digital platforms is very common nowadays due to its ease of deployment, low power consumption and low cost. It is also common to use heterogeneous IoT devices of ad-hoc or commercial development, using private or third-party network infrastructures. This scenario makes it difficult to detect invalid packets from malfunctioning devices anywhere along the path from sensors to application servers. These invalid packets generate low-quality or erroneous data, which negatively influences the services that use it. For this reason, we need to create procedures and mechanisms to ensure the quality of the data obtained from IoT infrastructures, regardless of the type of infrastructure and the control we have over it, so that the systems that use this data can be reliable. In this work we propose the development of an Anomaly Detection System for IoT infrastructures based on Machine Learning using unsupervised learning. We validate the proposal by implementing it on the IoT infrastructure of the University of Alicante, which has a multiple sensing system and uses third-party services, over a campus of one million square meters. The contribution of this work is an anomaly detection system capable of revealing incidents in IoT infrastructures, without knowing details about the infrastructures or devices, through the analysis of data in real time. This proposal makes it possible to discard from the IoT data flow all packets suspected of being anomalous, ensuring high-quality information for the tools that consume IoT data. This project has been funded by the UAIND22-01B project "Adaptive control of urban supply systems" of the University of Alicante.

    Advanced random forest approaches for outlier detection

    Outlier Detection (OD) is a Pattern Recognition task which consists of finding those patterns in a set of data which are likely to have been generated by a different mechanism than the one underlying the rest of the data. The importance of OD is visible in everyday life. Indeed, fast and accurate detection of outliers is crucial: for example, in the electrocardiogram of a patient, an abnormality in the heart rhythm can cause severe health problems. Due to the high number of fields in which OD is needed, several approaches have been designed. Among them, Random Forest-based techniques have raised great interest in the research community: a Random Forest (RF) is an ensemble of Decision Trees where each tree is diverse and independent. RFs are characterized by a high degree of flexibility, robustness, and high generalization capabilities. Even though they were originally designed for classification and regression, in recent years, due to their success, there has been an increased development of RF-based approaches for other learning tasks, including OD. The forerunner of several RF methods for OD is Isolation Forest (iForest), a technique whose main principle is isolation, i.e. the separation of each object from the rest of the data. Since outliers are different from the rest of the data and thus easier to separate, we can easily identify them as those objects isolated after a few splits in the tree. iForests have been employed in a great variety of application fields, showing excellent performance. This thesis fits into the above scenario: even though some extensions of basic RF-based approaches for OD have been proposed, their potential has not been fully exploited and there is ample room for improvement. In this thesis, we introduce some advanced RF-based techniques for OD, investigating both methodological issues and alternative uses of these flexible approaches. In detail, we moved along four research directions.
The starting point of the first one is the absence of RF methods for OD able to work with non-vectorial data: here we propose ProxIForest, an approach which works with all types of data for which a distance measure can be defined, thus including non-vectorial data as well. Indeed, for the latter, many powerful distances have been proposed. The second direction focuses on how to measure the outlierness degree of an object in an RF, i.e. the anomaly score, since most extensions of iForest concern only the tree building procedure. In detail, we propose two novel classes of methods: the first class exploits the information contained within a tree. The second one focuses on the ensemble aspect of RFs: the aggregation of the anomaly scores extracted from each tree is crucial to correctly identify outliers. For the third research direction, we took a different perspective, exploiting the fact that each tree in a forest is a space partitioner encoding relations, i.e. distances, between objects. Whereas this aspect has been widely researched in the clustering field, it has never been investigated for OD: we extract from an iForest a distance measure and input it to an outlier detector. As the last research direction, we designed a new variant of iForest to characterize multiple sclerosis given a brain connectivity network: we cast the problem as an OD task, by making an analogy between disconnected brain regions, the hallmark of the disease, and outliers. All proposals have been thoroughly empirically validated on either classical or ad hoc datasets: we performed several analyses, including comparisons to state-of-the-art approaches and statistical tests. This thesis proves the suitability of RF-based approaches for OD from different perspectives: not only can they be successfully used for the task, but we can also use them to extract distances or features.
Further, by contributing to this field, this thesis proves that there are still many aspects requiring further investigation.
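
The isolation principle described in this abstract (outliers are separated from the rest of the data after only a few random splits) can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names `path_length` and `mean_path_length`, the subsample size, the number of trees and the toy data are all assumptions chosen for illustration. The tree is also grown lazily along the path of the query point rather than built up front.

```python
import random

def path_length(x, points, rng, depth=0, max_depth=50):
    """Count the random axis-parallel splits needed to isolate x within `points`."""
    if len(points) <= 1 or depth >= max_depth:
        return depth
    dim = rng.randrange(len(x))            # pick a random feature
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:                           # simplification: stop if this feature is constant
        return depth
    split = rng.uniform(lo, hi)            # pick a random split value in the feature's range
    # Keep only the points that fall on the same side of the split as x.
    same_side = [p for p in points if (p[dim] < split) == (x[dim] < split)]
    return path_length(x, same_side, rng, depth + 1, max_depth)

def mean_path_length(x, data, n_trees=200, sample_size=64, seed=0):
    """Average the path length of x over several trees grown on random subsamples."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_trees):
        sample = rng.sample(data, min(sample_size, len(data)))
        total += path_length(x, sample, rng)
    return total / n_trees

# Toy data (assumed): a 2-D Gaussian cloud plus one far-away point.
data_rng = random.Random(42)
inliers = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(256)]
outlier = (8.0, 8.0)
data = inliers + [outlier]

outlier_depth = mean_path_length(outlier, data)
inlier_depth = mean_path_length((0.0, 0.0), data)
print(f"outlier: {outlier_depth:.2f} splits, central point: {inlier_depth:.2f} splits")
```

In this sketch the far-away point should need fewer splits on average than a point in the middle of the cloud, which is the signal an isolation-based detector relies on.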

    A Novel Anomaly Score for Isolation Forests

    Isolation Forests represent a recent variant of Random Forests, specifically designed for one-class classification problems. In the original version, this method builds a set of extremely randomized trees to describe the set of points, subsequently measuring the “anomaly” of a testing point by looking at how deep it reaches in each tree. Even though a few extensions have been recently proposed – mainly aimed at improving the training stage – in most cases the anomaly score is still kept in its original formulation, which does not completely exploit all the information contained in the trained forest. This paper focuses on improving this aspect, and proposes a new approach for the computation of the anomaly score, which exploits the different information linked to the different nodes of the trees of the forest. We investigate three different variants of the novel anomaly score, evaluating them on twelve UCI benchmark datasets, with encouraging results.
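
For reference, the "original formulation" of the anomaly score mentioned in this abstract is presumably the one from the original Isolation Forest paper; that is an assumption, since the abstract does not restate it. In that formulation, with $h(x)$ the depth reached by point $x$ in a tree, $E[h(x)]$ its average over the trees of the forest, $n$ the subsample size used to grow each tree, and $H(i)$ the $i$-th harmonic number:

```latex
s(x, n) = 2^{-\frac{E[h(x)]}{c(n)}},
\qquad
c(n) = 2H(n-1) - \frac{2(n-1)}{n},
\qquad
H(i) \approx \ln(i) + 0.5772156649
```

Scores close to 1 indicate likely anomalies, while scores well below 0.5 indicate normal points. The score depends only on the depth reached in each tree, which matches the abstract's remark that it does not use the rest of the information stored in the forest.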