8 research outputs found

    AutoML: state of the art with a focus on anomaly detection, challenges, and research directions

    The last decade has witnessed an explosion of machine learning research, with many algorithms proposed and successfully adopted in different application domains. However, the performance of many machine learning algorithms is highly sensitive to several factors (e.g., hyper-parameter tuning and data cleaning), and significant human effort is required to achieve good results. Building well-performing machine learning algorithms thus requires domain knowledge and highly specialized data scientists. Automated machine learning (autoML) aims to make machine learning algorithms easier to use and more accessible to researchers with varying levels of expertise. Research effort to date has mainly been devoted to autoML for supervised learning, and only a few proposals address unsupervised learning. In this paper, we present an overview of the autoML field, with a particular emphasis on the automated methods and strategies that have been proposed for unsupervised anomaly detection.

    AutoAD: an Automated Framework for Unsupervised Anomaly Detection

    Over the last decade, we have witnessed the proliferation of machine learning algorithms capable of solving different tasks for the most diverse applications. Often, for an algorithm to be effective, significant human effort is required, in particular for hyper-parameter tuning and data cleaning. Recently, there have been increasing efforts to alleviate this burden and make machine learning algorithms easier to use for researchers with varying levels of expertise. Nevertheless, the question of whether an efficient and fully generalizable automated machine learning (autoML) framework is possible remains unanswered. In this paper, we present autoAD, the first autoML framework for unsupervised anomaly detection. By leveraging a pool of different anomaly detection algorithms, each with its own hyper-parameter search space, our framework automatically selects the best-performing approach while determining an optimal configuration of its hyper-parameters on a given dataset. Our extensive experimental evaluation, conducted on a rich collection of datasets, shows the substantial gains that can be achieved with autoAD compared to state-of-the-art methods for unsupervised anomaly detection.
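    The joint choice of algorithm and hyper-parameter configuration described in this abstract can be illustrated with a minimal sketch. It is not autoAD's actual optimization strategy, which the abstract does not describe: plain exhaustive enumeration is used here as a stand-in. The detector names, hyper-parameter names, `SEARCH_SPACE`, `iter_candidates`, `select_best`, and the caller-supplied `evaluate` function are all hypothetical.

```python
from itertools import product
from typing import Any, Callable, Dict, Iterator, List, Optional, Tuple

Config = Dict[str, Any]
Space = Dict[str, Dict[str, List[Any]]]

# Hypothetical pool: detector name -> hyper-parameter name -> candidate values.
SEARCH_SPACE: Space = {
    "detector_a": {"n_neighbors": [5, 10, 20]},
    "detector_b": {"n_trees": [50, 100], "max_depth": [4, 8]},
}


def iter_candidates(space: Space) -> Iterator[Tuple[str, Config]]:
    """Yield every (algorithm, configuration) pair in the joint search space."""
    for algo, grid in space.items():
        names = list(grid)
        for values in product(*(grid[name] for name in names)):
            yield algo, dict(zip(names, values))


def select_best(
    space: Space, evaluate: Callable[[str, Config], float]
) -> Tuple[float, str, Config]:
    """Return (score, algorithm, configuration) with the highest score.

    `evaluate` is an assumption: a caller-supplied function that scores one
    candidate on the dataset at hand (higher is better).
    """
    best: Optional[Tuple[float, str, Config]] = None
    for algo, config in iter_candidates(space):
        score = evaluate(algo, config)
        if best is None or score > best[0]:
            best = (score, algo, config)
    if best is None:
        raise ValueError("empty search space")
    return best
```

    Exhaustive enumeration is only practical for small, discrete search spaces. Larger spaces would call for a smarter search strategy behind the same `evaluate` interface.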

    UMAP: Urban Mobility Analysis Platform to Harvest Car Sharing Data

    Car sharing is nowadays a popular means of transport in smart cities. In particular, the free-floating paradigm lets users look for available cars, book one, and then start and stop the rental at will within the city area. This is done through a smartphone app, which in turn contacts a web-based backend to exchange information. In this paper we present UMAP, a platform that harvests data freely available on the web to extract driving habits in cities. We design UMAP to fetch data from car sharing platforms in real time and process it to extract more advanced information about driving patterns and users' habits, while augmenting the data with mapping and direction information fetched from other web platforms. This information is stored in a data lake where historical series are built, and is later analyzed using analytics modules that are easy to design and customize. We demonstrate the flexibility of UMAP by presenting a case study for the city of Turin. We collect car sharing usage data over 50 days and characterize both the temporal and spatial properties of rentals, as well as users' habits in using the service, which we contrast with public transportation alternatives. The results provide insights into driving styles and needs that are useful for smart city planners, and demonstrate the feasibility of our approach.

    STREAMRHF: Tree-Based Unsupervised Anomaly Detection for Data Streams

    We present STREAMRHF, an unsupervised anomaly detection algorithm for data streams. Our algorithm builds on some of the ideas of Random Histogram Forest (RHF), a state-of-the-art algorithm for batch unsupervised anomaly detection. STREAMRHF constructs a forest of decision trees in which feature splits are determined according to the kurtosis score of every feature. It irrevocably assigns an anomaly score to each data point as soon as it arrives, by means of an incremental computation of its random trees and of the kurtosis scores of the features. This enables both efficient online scoring and concept drift detection. Our approach is tree-based, which brings several appealing properties, such as explainability of the results. We conduct an extensive experimental evaluation on multiple datasets from different real-world applications. Our evaluation shows that our streaming algorithm achieves average precision comparable to RHF, while outperforming state-of-the-art streaming approaches for unsupervised anomaly detection and keeping computational complexity limited.
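    The incremental computation of per-feature kurtosis mentioned in this abstract can be sketched with the standard one-pass update formulas for running central moments. This is a generic technique and not necessarily the update rule STREAMRHF uses. The class name `RunningKurtosis` and the choice to return 0.0 for a constant feature are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class RunningKurtosis:
    """Kurtosis of one feature, updated one value at a time in O(1) memory."""

    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of (x - mean)^2
    m3: float = 0.0  # running sum of (x - mean)^3
    m4: float = 0.0  # running sum of (x - mean)^4

    def update(self, x: float) -> None:
        """Fold a newly arrived value into the running moments."""
        n1 = self.n
        self.n += 1
        delta = x - self.mean
        delta_n = delta / self.n
        delta_n2 = delta_n * delta_n
        term1 = delta * delta_n * n1
        self.mean += delta_n
        # Order matters: m4 needs the old m2 and m3, and m3 needs the old m2.
        self.m4 += (
            term1 * delta_n2 * (self.n * self.n - 3 * self.n + 3)
            + 6 * delta_n2 * self.m2
            - 4 * delta_n * self.m3
        )
        self.m3 += term1 * delta_n * (self.n - 2) - 3 * delta_n * self.m2
        self.m2 += term1

    def kurtosis(self) -> float:
        """Non-excess kurtosis: n * m4 / m2^2."""
        if self.m2 == 0.0:
            return 0.0  # assumption: treat a constant or empty feature as 0
        return self.n * self.m4 / (self.m2 * self.m2)


if __name__ == "__main__":
    tracker = RunningKurtosis()
    for value in [1.0, 2.0, 3.0, 4.0, 5.0]:
        tracker.update(value)
    print(tracker.kurtosis())
```

    Keeping one such tracker per feature lets the kurtosis scores be refreshed as each point arrives, without storing or rescanning the stream.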

    Détection d'anomalies non supervisée : méthodes et applications (Unsupervised anomaly detection: methods and applications)

    An anomaly (also known as an outlier) is an instance that deviates significantly from the rest of the input data; Hawkins defines it as 'an observation, which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism'. Anomaly detection (also known as outlier or novelty detection) is thus the field of machine learning and data mining concerned with identifying those instances whose features appear to be inconsistent with the remainder of the dataset. In many applications, correctly distinguishing the set of anomalous data points (outliers) from the set of normal ones (inliers) proves to be very important. A first application is data cleaning, i.e., identifying noisy and erroneous measurements in a dataset before applying learning algorithms. However, with the explosive growth of the volume of data collectable from various sources (e.g., card transactions, internet connections, temperature measurements), anomaly detection becomes a crucial stand-alone task for the continuous monitoring of systems. In this context, anomaly detection can be used to detect ongoing intrusion attacks, faulty sensor networks, or cancerous masses. The thesis first proposes a batch tree-based approach for unsupervised anomaly detection, called 'Random Histogram Forest (RHF)'. The algorithm addresses the curse of dimensionality by using the fourth central moment (aka kurtosis) in the model construction, while boasting linear running time. A stream-based anomaly detection engine called 'ODS', which leverages DenStream, an unsupervised clustering technique, is presented next. Finally, an automated anomaly detection engine, which alleviates the human effort required when dealing with several algorithms and hyper-parameters, is presented as the last contribution.
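    For reference, the kurtosis mentioned in this abstract is usually defined as the fourth central moment normalized by the squared variance. The notation below (a feature $X$ with mean $\mu$ and standard deviation $\sigma$) is an assumption, since the abstract gives no formula:

```latex
\[
  \operatorname{Kurt}[X]
    = \frac{\mathbb{E}\!\left[(X-\mu)^{4}\right]}
           {\left(\mathbb{E}\!\left[(X-\mu)^{2}\right]\right)^{2}}
    = \frac{\mu_{4}}{\sigma^{4}}
\]
```

    Under this definition a Gaussian feature has kurtosis 3, and heavier tails give larger values.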
