46 research outputs found

    Anomaly Detection in Time Series: Theoretical and Practical Improvements for Disease Outbreak Detection

    The automatic collection and increasing availability of health data provide a new opportunity for techniques to monitor this information. By monitoring pre-diagnostic data sources, such as over-the-counter cough medicine sales or emergency room chief complaints of cough, there exists the potential to detect disease outbreaks earlier than traditional laboratory disease confirmation allows. This research is particularly important for a modern, highly connected society, where the onset of a disease outbreak can be swift and deadly, whether caused by a naturally occurring global pandemic such as swine flu or by a targeted act of bioterrorism. In this dissertation, we first describe the problem and the current state of research in disease outbreak detection, then provide four main additions to the field. First, we formalize a framework for analyzing health series data and detecting anomalies: using forecasting methods to predict the next day's value, subtracting the forecast to create residuals, and finally applying detection algorithms to the residuals. The formalized framework makes explicit the link between the accuracy of the forecasting method and the performance of the detector, and can be used to quantify and analyze the performance of a variety of heuristic methods. Second, we describe improvements to the forecasting of health data series. The use of weather as a predictor, cross-series covariates, and ensemble forecasting each improves the forecasting of health data. Third, we describe improvements to detection. These include the use of multivariate statistics for anomaly detection and additional day-of-week preprocessing to aid detection. Most significantly, we also provide a new method, based on the CuScore, for optimizing detection when the impact of the disease outbreak is known. This method can provide a detector that is optimal for rapid detection, or for maximizing the probability of detection within a certain timeframe.
    Finally, we describe a method for improved comparison of detection methods. We provide tools to evaluate how well a simulated data set captures the characteristics of the authentic series, together with time-lag heatmaps, a new way of visualizing daily detection rates and of comparing two methods more informatively.
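The forecast/residual/detection pipeline described in this abstract can be sketched in a few lines. This is an illustrative toy, not the dissertation's actual methods: a trailing moving-average forecast stands in for the forecasting step, a one-sided z-score test stands in for the detector, and the window and threshold values are arbitrary.

```python
import numpy as np

def forecast_residual_detect(series, window=7, threshold=3.0):
    """Toy version of the three-step framework: (1) forecast the next
    day's value with a trailing moving average, (2) subtract the
    forecast to form a residual, (3) run a z-score detector on the
    residual.  Window and threshold are arbitrary choices."""
    series = np.asarray(series, dtype=float)
    alarms = []
    for t in range(window, len(series)):
        history = series[t - window:t]
        forecast = history.mean()            # step 1: one-day-ahead forecast
        residual = series[t] - forecast      # step 2: residual
        scale = history.std(ddof=1)
        if scale == 0:
            scale = 1.0
        if residual / scale > threshold:     # step 3: detect on the residual
            alarms.append(t)
    return alarms

# A flat baseline with a spike injected on day 20 raises a single alarm:
counts = [10.0] * 30
counts[20] = 60.0
print(forecast_residual_detect(counts))      # → [20]
```

The point of the framework is visible even in this sketch: any improvement to the forecast (step 1) directly sharpens the residuals that the detector (step 3) sees.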

    Modeling emergency department visit patterns for infectious disease complaints: results and application to disease surveillance

    BACKGROUND: Concern over bioterrorism has led to recognition that traditional public health surveillance for specific conditions is unlikely to provide timely indication of some disease outbreaks, either naturally occurring or induced by a bioweapon. In non-traditional surveillance, the use of health care resources is monitored in "near real" time for the first signs of an outbreak, such as increases in emergency department (ED) visits for respiratory, gastrointestinal, or neurological chief complaints (CC). METHODS: We collected ED CCs from 2/1/94 – 5/31/02 as a training set. A first-order model was developed for each of seven CC categories by accounting for long-term, day-of-week, and seasonal effects. We assessed predictive performance on subsequent data from 6/1/02 – 5/31/03, compared CC counts to predictions and confidence limits, and identified anomalies (simulated and real). RESULTS: Each CC category exhibited significant day-of-week differences. For most categories, counts peaked on Monday. There were seasonal cycles in both respiratory and undifferentiated infection complaints, and the season-to-season variability in peak date was summarized using a hierarchical model. For example, the average peak date for respiratory complaints was January 22, with a season-to-season standard deviation of 12 days. This season-to-season variation makes respiratory CCs challenging to predict, so we focused our effort and discussion on prediction performance for this difficult category. Total ED visits increased over the study period by 4%, but respiratory complaints decreased by roughly 20%, illustrating that long-term averages in the data set need not reflect future behavior in data subsets. CONCLUSION: We found that ED CCs provided timely indicators for outbreaks.
Our approach led to successful identification of a respiratory outbreak one to two weeks in advance of reports from the state-wide sentinel flu surveillance and of a reported increase in positive laboratory test results.
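A first-order model with long-term, day-of-week, and seasonal effects, as described above, can be sketched as an ordinary least-squares fit with a linear trend, six day-of-week indicators, and one annual harmonic. The exact model form and covariates in the paper may differ; the synthetic data and all names below are illustrative.

```python
import numpy as np

def fit_cc_model(counts):
    """Sketch of a first-order model for daily chief-complaint counts:
    linear long-term trend, day-of-week indicators (Monday as the
    baseline), and one annual sine/cosine harmonic, fit by OLS."""
    n = len(counts)
    t = np.arange(n)
    dow = t % 7                              # assume day 0 is a Monday
    X = np.column_stack(
        [np.ones(n),                         # intercept (Monday baseline)
         t / 365.25,                         # long-term trend
         np.sin(2 * np.pi * t / 365.25),     # annual seasonal cycle
         np.cos(2 * np.pi * t / 365.25)]
        + [(dow == d).astype(float) for d in range(1, 7)])
    beta, *_ = np.linalg.lstsq(X, np.asarray(counts, float), rcond=None)
    return X, beta

# Two years of synthetic counts with a Monday peak and a winter maximum:
rng = np.random.default_rng(0)
t = np.arange(730)
y = (50 + 5 * (t % 7 == 0)
     + 10 * np.cos(2 * np.pi * t / 365.25)
     + rng.normal(0, 1, 730))
X, beta = fit_cc_model(y)
rmse = float(np.sqrt(np.mean((y - X @ beta) ** 2)))
print(round(rmse, 2))   # residual scatter close to the noise level (sd = 1)
```

Because the design matrix can represent the synthetic signal exactly, the residual RMSE lands near the injected noise level, which is the behavior one wants before using the residuals for anomaly detection.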

    Modelling computer network traffic using wavelets and time series analysis

    Modelling of network traffic is a notoriously difficult problem. This is primarily due to the ever-increasing complexity of network traffic and the different ways in which a network may be excited by user activity. The ongoing development of new network applications, protocols, and usage profiles further necessitates models which are able to adapt to the specific networks in which they are deployed. These considerations have in large part driven the evolution of statistical profiles of network traffic from simple Poisson processes to non-Gaussian models that incorporate traffic burstiness, non-stationarity, self-similarity, long-range dependence (LRD) and multi-fractality. The need for ever more sophisticated network traffic models has since led to the specification of a myriad of traffic models, many of which are listed in [91, 14]. In networks composed of IoT devices, much of the traffic is generated by devices which function autonomously and in a more deterministic fashion. This dissertation therefore undertakes the activity of building time series models for IoT network traffic. In the work that follows, a broad review of the historical development of network traffic modelling is presented, tracing a path that leads to the use of time series analysis for the said task. An introduction to time series analysis is provided in order to facilitate the theoretical discussion regarding the feasibility and suitability of time series analysis techniques for modelling network traffic. The theory is then followed by a summary of the techniques and methodology that might be followed to detect, remove and/or model the typical characteristics associated with network traffic, such as linear trends, cyclic trends, periodicity, fractality, and long-range dependence. A set of experiments is conducted in order to determine the effect of fractality on the estimation of the AR and MA components of a time series model.
A comparison of various Hurst estimation techniques is also performed on synthetically generated data. The wavelet-based Abry-Veitch Hurst estimator is found to perform consistently well with respect to its competitors, and the subsequent removal of fractality via fractional differencing is found to provide a substantial improvement in the estimation of time series model parameters.
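For illustration, here is one of the simpler Hurst estimators that such comparisons typically include: the aggregated-variance method. It is not the wavelet-based Abry-Veitch estimator the dissertation favors; the block sizes and the white-noise sanity check below are arbitrary choices.

```python
import numpy as np

def hurst_aggvar(x, block_sizes=(4, 8, 16, 32, 64)):
    """Aggregated-variance Hurst estimator: for an H-self-similar
    series the variance of block means scales as m**(2H - 2), so H is
    recovered from the slope of a log-log regression over block sizes."""
    x = np.asarray(x, float)
    logm, logv = [], []
    for m in block_sizes:
        k = len(x) // m
        means = x[:k * m].reshape(k, m).mean(axis=1)   # non-overlapping block means
        logm.append(np.log(m))
        logv.append(np.log(means.var()))
    slope = np.polyfit(logm, logv, 1)[0]               # slope = 2H - 2
    return 1 + slope / 2

# White noise has no long-range dependence, so H should be near 0.5:
rng = np.random.default_rng(1)
h = hurst_aggvar(rng.normal(size=100_000))
print(round(h, 2))
```

An estimate of H also suggests the fractional-differencing order mentioned in the text, since for a stationary LRD series d = H - 0.5.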

    Outlier Identification in Spatio-Temporal Processes

    This dissertation answers some of the statistical challenges arising in spatio-temporal data from Internet traffic, electricity grids, and climate models. It begins with methodological contributions to the problem of anomaly detection in communication networks. Using electricity consumption patterns for the University of Michigan campus, the well-known spatial prediction method kriging has been adapted for the identification of false data injections into the system. Events like Distributed Denial of Service (DDoS) attacks, botnet/malware attacks, and port scanning call for methods which can identify unusual activity in Internet traffic patterns. Storing information on the entire network, though feasible, cannot be done at the time scale at which data arrives. In this work, hashing techniques which can produce summary statistics for the network have been used. The hashed data so obtained indeed preserves the heavy-tailed nature of traffic payloads, thereby providing a platform for the application of extreme value theory (EVT) to identify heavy hitters in volumetric attacks. These EVT-based methods require the estimation of the tail index of a heavy-tailed distribution. The traditional estimator (Hill (1975)) of the tail index tends to be biased in the presence of outliers. To circumvent this issue, a trimmed version of the classic Hill estimator has been proposed and studied from a theoretical perspective. For the Pareto domain of attraction, the optimality and asymptotic normality of the estimator have been established. Additionally, a data-driven strategy to detect the number of extreme outliers in heavy-tailed data has also been presented. The dissertation concludes with the statistical formulation of m-year return levels of extreme climatic events (heat/cold waves). The Generalized Pareto distribution (GPD) serves as a good fit for modeling peaks over threshold of a distribution.
    Allowing the parameters of the GPD to vary as a function of covariates such as time of year, El Niño, and location in the US, extremes of the areal impact of heat waves have been well modeled and inferred.
    PhD, Statistics, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/145789/1/shrijita_1.pd
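The trimming idea can be illustrated with a simplified variant of the Hill estimator that simply discards the top m order statistics before averaging log-excesses (the dissertation's trimmed estimator uses an optimally weighted form; the sample size, tail index, and injected outliers below are arbitrary):

```python
import numpy as np

def hill_alpha(x, k, m=0):
    """Hill-type estimate of the tail index alpha (P(X > t) ~ t**-alpha),
    trimming the m largest observations to guard against outliers.
    m=0 gives the classic Hill (1975) estimator; this unweighted
    trimming is a simplification of the dissertation's estimator."""
    xs = np.sort(np.asarray(x, float))[::-1]       # descending order statistics
    gamma = np.mean(np.log(xs[m:k] / xs[k]))       # mean log-excess over X_(k+1)
    return 1.0 / gamma                             # alpha = 1 / gamma

# Exact Pareto(alpha=2) sample with 5 gross outliers injected:
rng = np.random.default_rng(2)
x = rng.uniform(size=10_000) ** -0.5               # inverse-CDF Pareto draw
x[:5] = 1e6                                        # simulated extreme "heavy hitters"
alpha_raw = hill_alpha(x, k=500)                   # biased low by the outliers
alpha_trim = hill_alpha(x, k=500, m=5)             # trimming restores roughly 2
print(round(alpha_raw, 2), round(alpha_trim, 2))
```

The few injected points dominate the untrimmed average of log-excesses and drag the estimated alpha well below its true value; dropping them before averaging removes essentially all of that bias.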

    A DATA ANALYTICAL FRAMEWORK FOR IMPROVING REAL-TIME, DECISION SUPPORT SYSTEMS IN HEALTHCARE

    In this dissertation we develop a framework that combines data mining, statistics, and operations research methods for improving real-time decision support systems in healthcare. Our approach consists of three main concepts: data gathering and preprocessing, modeling, and deployment. We introduce the notion of offline and semi-offline modeling to differentiate between models that are based on known baseline behavior and those based on a baseline with missing information. We apply and illustrate the framework in two important healthcare contexts: biosurveillance and kidney allocation. In the biosurveillance context, we address the problem of early detection of disease outbreaks. We discuss integer programming-based univariate monitoring approaches and statistical and operations research-based multivariate monitoring approaches. We assess method performance on authentic biosurveillance data. In the kidney allocation context, we present a two-phase model that combines an integer programming-based learning phase and a data-analytics-based real-time phase. We examine and evaluate our method on the current Organ Procurement and Transplantation Network (OPTN) waiting list. In both contexts, we show that our framework produces significant improvements over existing methods.

    Syndromic surveillance: reports from a national conference, 2004

    Overview, Policy, and Systems -- Federal Role in Early Detection Preparedness Systems -- BioSense: Implementation of a National Early Event Detection and Situational Awareness System -- Guidelines for Constructing a Statewide Hospital Syndromic Surveillance Network
    Data Sources -- Implementation of Laboratory Order Data in BioSense Early Event Detection and Situation Awareness System -- Use of Medicaid Prescription Data for Syndromic Surveillance – New York -- Poison Control Center-Based Syndromic Surveillance for Foodborne Illness -- Monitoring Over-The-Counter Medication Sales for Early Detection of Disease Outbreaks – New York City -- Experimental Surveillance Using Data on Sales of Over-the-Counter Medications – Japan, November 2003–April 2004
    Analytic Methods -- Public Health Monitoring Tools for Multiple Data Streams -- Use of Multiple Data Streams to Conduct Bayesian Biologic Surveillance -- Space-Time Clusters with Flexible Shapes -- INFERNO: A System for Early Outbreak Detection and Signature Forecasting -- High-Fidelity Injection Detectability Experiments: a Tool for Evaluating Syndromic Surveillance Systems -- Linked Analysis for Definition of Nurse Advice Line Syndrome Groups, and Comparison to Encounters
    Simulation and Other Evaluation Approaches -- Simulation for Assessing Statistical Methods of Biologic Terrorism Surveillance -- An Evaluation Model for Syndromic Surveillance: Assessing the Performance of a Temporal Algorithm -- Evaluation of Syndromic Surveillance Based on National Health Service Direct Derived Data – England and Wales -- Initial Evaluation of the Early Aberration Reporting System – Florida
    Practice and Experience -- Deciphering Data Anomalies in BioSense -- Syndromic Surveillance on the Epidemiologist's Desktop: Making Sense of Much Data -- Connecting Health Departments and Providers: Syndromic Surveillance's Last Mile -- Comparison of Syndromic Surveillance and a Sentinel Provider System in Detecting an Influenza Outbreak – Denver, Colorado, 2003 -- Ambulatory-Care Diagnoses as Potential Indicators of Outbreaks of Gastrointestinal Illness – Minnesota -- Emergency Department Visits for Concern Regarding Anthrax – New Jersey, 2001 -- Hospital Admissions Syndromic Surveillance – Connecticut, October 2001–June 2004 -- Three Years of Emergency Department Gastrointestinal Syndromic Surveillance in New York City: What Have We Found?
    "August 26, 2005." Papers from the National Syndromic Surveillance Conference, sponsored by the Centers for Disease Control and Prevention, the Tufts Health Care Institute, and the Alfred P. Sloan Foundation, held Nov. 3-4, 2004 in Boston, MA.
    "Public health surveillance continues to broaden in scope and intensity. Public health professionals responsible for conducting such surveillance must keep pace with evolving methodologies, models, business rules, policies, roles, and procedures. The third annual Syndromic Surveillance Conference was held in Boston, Massachusetts, during November 3-4, 2004. The conference was attended by 440 persons representing the public health, academic, and private-sector communities from 10 countries and provided a forum for scientific discourse and interaction regarding multiple aspects of public health surveillance." - p. 3
    Also available via the World Wide Web

    Predictive Maintenance of Wind Generators based on AI Techniques

    As global warming slowly becomes a dangerous reality, governments and private institutions are introducing policies to mitigate it. Those policies have led to the development and deployment of Renewable Energy Sources (RESs), which introduces new challenges, among them the minimization of downtime and of the Levelised Cost of Energy (LCOE) by optimizing the maintenance strategy, where early detection of incipient faults is of significant importance. Hence, this is the focus of this thesis. While there are several maintenance approaches, predictive maintenance can utilize SCADA readings from large-scale power plants to detect early signs of failures, which can be characterized by abnormal patterns in the measurements. There exist several approaches to detect these patterns, such as model-based or hybrid techniques, but these require detailed knowledge of the analyzed system. As SCADA systems collect large amounts of data, machine learning techniques can be used to detect the underlying failure patterns and notify customers of the abnormal behaviour. In this work, a novel framework based on machine learning techniques for fault prediction of wind farm generators is developed for an actual customer. The proposed fault prognosis methodology addresses data limitations such as class imbalance and missing data, performs statistical tests on the time series to check for stationarity, selects the features with the most predictive power, and applies machine learning models to predict a fault with a one-hour horizon. The proposed techniques are tested and validated using historical data for a wind farm in Summerside, Prince Edward Island (PEI), Canada, and the models are evaluated with appropriate evaluation metrics. The results demonstrate the ability of the proposed methodology to predict wind generator failures, and the viability of the proposed methodology for optimizing preventive maintenance strategies.
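The data-preparation steps mentioned above (missing readings, class imbalance, a one-hour prediction horizon) can be sketched as follows. The array layout, the 10-minute sampling assumption, and the naive oversampling strategy are all illustrative guesses, not the thesis's actual implementation:

```python
import numpy as np

def make_dataset(scada, fault_times, horizon=6):
    """Sketch of fault-prognosis preprocessing: forward-fill missing
    SCADA readings, label a sample 1 if a fault occurs within the next
    `horizon` samples (one hour at an assumed 10-minute resolution),
    and oversample the rare fault class to address imbalance."""
    X = np.asarray(scada, float).copy()
    for j in range(X.shape[1]):                # missing data: forward fill
        col = X[:, j]
        for i in range(1, len(col)):
            if np.isnan(col[i]):
                col[i] = col[i - 1]
    fault = np.zeros(len(X), dtype=int)
    fault[list(fault_times)] = 1
    # label = 1 if any fault occurs in the next `horizon` steps
    y = np.array([fault[i + 1:i + 1 + horizon].max(initial=0)
                  for i in range(len(X))])
    pos = np.where(y == 1)[0]                  # oversample the minority class
    if len(pos):
        reps = max(1, (len(y) - len(pos)) // len(pos))
        idx = np.concatenate([np.arange(len(y))] + [pos] * (reps - 1))
        X, y = X[idx], y[idx]
    return X, y

# 100 samples of 3 SCADA channels, one missing value, one fault at t=50:
rng = np.random.default_rng(3)
scada = rng.normal(size=(100, 3))
scada[10, 0] = np.nan
X, y = make_dataset(scada, fault_times=[50])
print(round(float(y.mean()), 2))   # classes are roughly balanced after oversampling
```

Any classifier trained on (X, y) then predicts "fault within one hour" rather than "fault now", which is what makes the output actionable for maintenance scheduling.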

    Spectral Temporal Information for Missing Data Reconstruction (STIMDR) of Landsat Reflectance Time Series

    The number of Landsat time-series applications has grown substantially because of the mission's approximately 50-year history and relatively high spatial resolution for observing long-term changes in the Earth's surface. However, missing observations (i.e., gaps) caused by clouds and cloud shadows, orbit and sensing geometry, and sensor issues have broadly limited the development of Landsat time-series applications. Due to the large area and the temporal and spatial irregularity of time-series gaps, it is difficult to find an efficient and highly precise method to fill them. The Missing Observation Prediction based on Spectral-Temporal Metrics (MOPSTM) method was previously proposed and delivers good performance in filling large-area gaps in single-date Landsat images. However, it can be less practical for a time series longer than one year because it lacks a mechanism for excluding dissimilar data in the time series (e.g., different phenology or changes in land cover). To solve this problem, this study proposes a new gap-filling method, Spectral Temporal Information for Missing Data Reconstruction (STIMDR), and examines its performance on Landsat reflectance time series. Two groups of experiments, including 2000 × 2000 pixel Landsat single-date images and Landsat time series acquired from four sites (Kenya, Finland, Germany, and China), were performed to test the new method. We simulated artificial gaps to evaluate predicted pixel values against real observations. Quantitative and qualitative evaluations of gap-filled images through comparisons with other state-of-the-art methods confirmed the more robust and accurate performance of the proposed method. In addition, the proposed method was also able to fill gaps contaminated by extreme cloud cover over an extended period (e.g., winter in high-latitude areas).
A downstream task of random forest supervised classification on both the gap-filled simulated datasets and the original valid datasets verified that STIMDR-generated products are relevant to the user community for land cover applications.
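The artificial-gap evaluation protocol described above can be sketched generically. Here plain linear temporal interpolation stands in for STIMDR, and the series, gap fraction, and composite interval are arbitrary; the point is the evaluation loop (mask known values, fill, score against the withheld truth), not the filling method itself:

```python
import numpy as np

def evaluate_gap_filling(series, gap_frac=0.3, seed=0):
    """Mask a random fraction of valid observations in a per-pixel
    reflectance time series as artificial gaps, fill them (here with
    linear temporal interpolation as a simple stand-in), and score the
    predictions against the withheld true values with RMSE."""
    rng = np.random.default_rng(seed)
    series = np.asarray(series, float)
    mask = rng.random(series.shape) < gap_frac       # simulated gaps
    t = np.arange(len(series))
    filled = series.copy()
    filled[mask] = np.interp(t[mask], t[~mask], series[~mask])
    return float(np.sqrt(np.mean((filled[mask] - series[mask]) ** 2)))

# A smooth seasonal reflectance-like signal, one year of 8-day composites:
t = np.arange(46)
ndvi = 0.5 + 0.3 * np.sin(2 * np.pi * t / 46)
rmse = evaluate_gap_filling(ndvi)
print(round(rmse, 3))   # small for a smooth signal; larger gaps or abrupt
                        # land-cover change would degrade it
```

Because the true values at the masked positions are known, this setup yields an honest accuracy number for any candidate filler, which is exactly how the simulated-gap experiments in the study compare methods.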

    mCity: using monitoring data from a smart city to characterize and improve urban mobility

    The sustainable growth of cities created the need for better-informed decisions based on information and communication technologies to sense the city and quantify its pulse. An important part of this concept of "smart cities" is the characterization of traffic flows. In this work, we aim at characterizing urban mobility in two different cities, Porto and Aveiro. The structure and contents of the corresponding datasets are very different, enabling two case studies with distinct use cases related to traffic analysis and forecasting. For the Porto use case, we had access to road-mounted traffic sensors and bus tracking data. The first source was analyzed for patterns (e.g., weekday behavior). Historical traffic counter data was used to forecast future flows, using both statistical and deep learning methods. We found no clear relationship between bus speed and traffic intensity; however, when the speed was high, the intensity was low, and when the intensity was high, the speed was low. There are daily and weekly patterns in the traffic flow data that enable forecasting. When traffic anomalies do happen, the methods for short-term forecasting perform better than those for long-term forecasting. In the Aveiro use case, the dataset includes bus traces, which were used to characterize driving behavior based on speed and acceleration. These data were mapped onto the city to find problematic areas. Side-by-side visualizations help with the comparison of traffic behavior in selected time periods. We observed that some roads often present the same problems, independently of the day or time of day. In other parts of the city, the problems appear more often in specific periods. The datasets for Aveiro and Porto were sampled at different frequencies (every second and every minute, respectively).
    We confirmed, with simulations, that the analysis made for Aveiro would not have been possible with the granularity of Porto's data set (as some information would be lost). The computational pipeline to run the supporting analyses is fully implemented, as well as the required integrations to programmatically obtain the data from the existing data sinks. A traffic forecasting pipeline was also developed for Porto. For the driving behavior analysis, a web dashboard is deployed, enabling the relevant departments to study potentially problematic areas in the city of Aveiro.
    Mestrado em Engenharia Informática (Master's in Informatics Engineering)
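The claim that daily and weekly patterns enable forecasting can be illustrated with a seasonal-naive baseline on synthetic hourly counts. The work itself used statistical and deep learning models, not this baseline; the signal amplitudes and noise level below are arbitrary:

```python
import numpy as np

def seasonal_naive_forecast(counts, season=24 * 7):
    """Seasonal-naive baseline: predict each hour's traffic count as the
    value observed at the same hour one week (168 hours) earlier."""
    counts = np.asarray(counts, float)
    return counts[:-season]                # aligned with counts[season:]

# Eight weeks of synthetic hourly counts with daily and weekly cycles:
rng = np.random.default_rng(4)
t = np.arange(24 * 7 * 8)
y = (100 + 40 * np.sin(2 * np.pi * t / 24)         # daily pattern
     + 20 * np.sin(2 * np.pi * t / (24 * 7))       # weekly pattern
     + rng.normal(0, 5, t.size))
pred = seasonal_naive_forecast(y)
mae_seasonal = np.mean(np.abs(y[24 * 7:] - pred))
mae_naive = np.mean(np.abs(y[1:] - y[:-1]))        # last-value baseline
print(mae_seasonal < mae_naive)                    # → True
```

Whenever the same-hour-last-week predictor beats the last-value predictor, as here, the series has exploitable daily/weekly seasonality; more capable statistical or deep learning models then compete against this seasonal baseline rather than against the naive one.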