33 research outputs found

    Detecting Outliers in Data with Correlated Measures

    Advances in sensor technology have enabled the collection of large-scale datasets. Such datasets can be extremely noisy and often contain a significant number of outliers that result from sensor malfunction or human operation faults. To use such data in real-world applications, it is critical to detect outliers so that models built from these datasets are not skewed by them. In this paper, we propose a new outlier detection method that exploits the correlations in the data (e.g., taxi trip distance vs. trip time). Unlike existing outlier detection methods, we build a robust regression model that explicitly models the outliers and detects them simultaneously with the model fitting. We validate our approach on real-world datasets against methods specifically designed for each dataset as well as state-of-the-art outlier detectors. Our method achieves better performance, demonstrating its robustness and generality. Finally, we report interesting case studies on some outliers that result from atypical events. Comment: 10 page

    Fast online computation of the Qn estimator with applications to the detection of outliers in data streams

    We present FQN (Fast Qn), a novel algorithm for online computation of the Qn scale estimator. The algorithm works in the sliding window model, computing the Qn scale estimator in the current window. We thoroughly compare our algorithm with the state-of-the-art competing algorithm by Nunkesser et al. and show that FQN (i) is faster, requiring only O(s) time in the worst case, where s is the length of the window; (ii) has a computational complexity that does not depend on the input distribution; and (iii) requires less space. To the best of our knowledge, our algorithm is the first that allows online computation of the Qn scale estimator in worst-case time linear in the size of the window. As an example of a possible application, besides its use as a robust measure of statistical dispersion, we show how to use the Qn estimator for fast detection of outliers in data streams. Extensive experimental results on both synthetic and real datasets confirm the validity of our approach.
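To make the quantity concrete, the following is a minimal sketch of the Qn scale estimator computed naively on a single window, plus a simple outlier rule built on it. It is not the FQN algorithm from the abstract: it sorts all pairwise differences, which costs quadratic time per window. The function names, the median-based centre, and the threshold of 3 are illustrative assumptions; the constant 2.2219 is the usual consistency factor for Gaussian data, and finite-sample corrections are omitted.

```python
from itertools import combinations
from statistics import median

def qn_scale(xs):
    """Naive Qn: a scaled order statistic of the pairwise absolute differences."""
    n = len(xs)
    if n < 2:
        raise ValueError("Qn needs at least two observations")
    h = n // 2 + 1
    k = h * (h - 1) // 2  # index of the order statistic, k = C(h, 2)
    diffs = sorted(abs(a - b) for a, b in combinations(xs, 2))
    # 2.2219 makes Qn consistent for the standard deviation under normality.
    return 2.2219 * diffs[k - 1]

def flag_outliers(window, threshold=3.0):
    """Hypothetical rule: flag points far from the median in units of Qn."""
    center = median(window)
    scale = qn_scale(window)
    if scale == 0:
        return []  # degenerate window: more than half the values coincide
    return [x for x in window if abs(x - center) > threshold * scale]

if __name__ == "__main__":
    window = [10.0, 10.2, 9.9, 10.1, 9.8, 10.3, 25.0]
    print(qn_scale(window), flag_outliers(window))
```

In a streaming setting this function would be re-evaluated as the window slides, which is exactly the cost that an online algorithm aims to avoid.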

    Detection of Critical Events in Renewable Energy Production Time Series

    The introduction of more renewable energy sources into the energy system increases the variability and weather dependence of electricity generation. Power system simulations are used to assess the adequacy and reliability of the electricity grid over decades, but often become computationally intractable for such long simulation periods with high technical detail. To alleviate this computational burden, we investigate the use of outlier detection algorithms to find periods of extreme renewable energy generation, which enables detailed modelling of the performance of power systems under these circumstances. Specifically, we apply the Maximum Divergent Intervals (MDI) algorithm to power generation time series that have been derived from ERA5 historical climate reanalysis covering the period from 1950 through 2019. By applying the MDI algorithm to these time series, we identified intervals of extremely low and high energy production. To determine the outlierness of an interval, different divergence measures can be used. Whereas the cross-entropy measure results in shorter and strongly peaking outliers, the unbiased Kullback-Leibler divergence tends to detect longer and more persistent intervals. These intervals are regarded as potential risks for the electricity grid by domain experts, showcasing the capability of the MDI algorithm to detect critical events in these time series. For the historical period analysed, we found no trend in outlier intensity, or shift and lengthening of the outliers, that could be attributed to climate change. By applying MDI to climate model output, power system modellers can investigate the adequacy and possible changes of risk for the current and future electricity grid under a wider range of scenarios.
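The idea of scoring an interval by how much its distribution diverges from the rest of the series can be sketched as follows. This is a simplified brute-force illustration, not the MDI implementation used in the paper: it assumes a univariate series, fits a Gaussian to the values inside and outside each candidate interval, and uses the plain (not the unbiased) Kullback-Leibler divergence. The function names, the `eps` variance floor, and the toy series are assumptions made for the example.

```python
import math
from statistics import fmean, pvariance

def gaussian_kl(m1, v1, m2, v2):
    """KL( N(m1, v1) || N(m2, v2) ) for univariate Gaussians with variances v1, v2."""
    return 0.5 * (math.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def most_divergent_interval(series, min_len, max_len, eps=1e-9):
    """Return (score, start, end) of the half-open interval [start, end)
    whose Gaussian fit diverges most from the fit of the remaining points."""
    best = None
    n = len(series)
    for start in range(n - min_len + 1):
        for length in range(min_len, min(max_len, n - start) + 1):
            inside = series[start:start + length]
            outside = series[:start] + series[start + length:]
            if len(outside) < 2:
                continue  # not enough data left to estimate a variance
            score = gaussian_kl(
                fmean(inside), pvariance(inside) + eps,   # eps avoids log(0)
                fmean(outside), pvariance(outside) + eps,
            )
            if best is None or score > best[0]:
                best = (score, start, start + length)
    return best

if __name__ == "__main__":
    # Toy series (made up): a flat signal with a short high-production burst.
    series = [1.0, 1.1, 0.9, 1.0, 1.1, 0.9, 5.0, 5.2, 4.9, 1.0, 0.9, 1.1, 1.0]
    print(most_divergent_interval(series, min_len=3, max_len=4))
```

Swapping `gaussian_kl` for another divergence changes which intervals score highest, which is the kind of effect the abstract reports when comparing cross-entropy with the unbiased Kullback-Leibler divergence.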

    Web-based Geographical Visualization of Container Itineraries

    Around 90% of the world's cargo is transported in maritime containers, but only around 2% are physically inspected. This opens the possibility for illicit activities. A viable solution is to control containerized cargo through information-based risk analysis. Container route-based analysis has been considered a key factor in identifying potentially suspicious consignments. An essential part of itinerary analysis is the geographical visualization of the itinerary. In this paper, we present initial work on the realization of a web-based system for interactive geographical visualization of container itineraries. JRC.G.4-Maritime affair

    Graph based Anomaly Detection and Description: A Survey

    Detecting anomalies in data is a vital task, with numerous high-impact applications in areas such as security, finance, health care, and law enforcement. While numerous techniques have been developed in past years for spotting outliers and anomalies in unstructured collections of multi-dimensional points, graph data are becoming ubiquitous, and techniques for structured graph data have recently come into focus. As objects in graphs have long-range correlations, a suite of novel techniques has been developed for anomaly detection in graph data. This survey aims to provide a general, comprehensive, and structured overview of the state-of-the-art methods for anomaly detection in data represented as graphs. As a key contribution, we give a general framework for the algorithms categorized under various settings: unsupervised vs. (semi-)supervised approaches, static vs. dynamic graphs, and attributed vs. plain graphs. We highlight the effectiveness, scalability, generality, and robustness aspects of the methods. Moreover, we stress the importance of anomaly attribution and highlight the major techniques that facilitate digging out the root cause, or the ‘why’, of the detected anomalies for further analysis and sense-making. Finally, we present several real-world applications of graph-based anomaly detection in diverse domains, including financial, auction, computer traffic, and social networks. We conclude our survey with a discussion of open theoretical and practical challenges in the field.

    An adaptive, fault-tolerant system for road network traffic prediction using machine learning

    This thesis addresses the design and development of an integrated system for real-time traffic forecasting based on machine learning methods. Although traffic prediction has been the driving motivation, many of the proposed ideas and scientific contributions are generic enough to be applied to any other problem that can, ideally, be defined as the flow of information in a graph-like structure. Such application is of special interest in environments susceptible to changes in the underlying data generation process. Moreover, the modular architecture of the proposed solution allows small changes to its components, so that it can be adapted to a broader range of problems. On the other hand, certain parts of this thesis are strongly tied to traffic flow theory. The focus is on a macroscopic perspective of traffic flow in which the individual road traffic flows are correlated with the underlying traffic demand. The short-term forecasts include the road network characterization in terms of the corresponding traffic measurements (traffic flow, density and/or speed), the traffic state (whether a road is congested or not, and how severely), and anomalous road conditions (incidents or other non-recurrent events). The main traffic data used in this thesis come from detectors installed along the road networks. Nevertheless, other kinds of traffic data sources could be equally suitable with appropriate preprocessing. This thesis has been developed in the context of Aimsun Live, a simulation-based traffic solution for real-time traffic prediction developed by Aimsun. The methods proposed here are planned to be linked to it in a mutually beneficial relationship in which they cooperate and assist each other.
For example, when an incident or non-recurrent event is detected with the methods proposed in this thesis, the simulation-based forecasting module can simulate different strategies to measure their impact. Part of this thesis has also been developed in the context of the EU research project "SETA" (H2020-ICT-2015). The main motivation that has guided this thesis is to address weak points and limitations previously identified in Aimsun Live that have not been studied extensively in the literature. These include:
• Autonomy, both in the preparation and real-time stages.
• Adaptation, to gradual or abrupt changes in traffic demand or supply.
• Informativeness, about anomalous road conditions.
• Forecasting accuracy, improved with respect to the previous methodology at Aimsun and a typical forecasting baseline.
• Robustness, to deal with faulty or missing data in real time.
• Interpretability, adopting modelling choices that favour more transparent reasoning and understanding of the underlying data-driven decisions.
• Scalability, using a modular architecture with emphasis on parallelizable exploitation of large amounts of data.
The result of this thesis is an integrated system, Adarules, for real-time forecasting that makes the best of the available historical data while also leveraging the theoretically unbounded size of data in a continuously streaming scenario. This is achieved through online learning and change detection, along with the automatic discovery and maintenance of patterns in the network graph. In addition to the Adarules system, another result is a probabilistic model that characterizes a set of interpretable latent variables related to the traffic state, based on the traffic data provided by the sensors along with optional prior knowledge provided by the traffic expert, following a Bayesian approach.
On top of this traffic state model, a probabilistic spatiotemporal model is built that learns the dynamics of traffic-state transitions in the network and whose objectives include automatic incident detection.
