
    Rank-Based Tempo-Spatial Clustering: A Framework for Rapid Outbreak Detection Using Single or Multiple Data Streams

    In recent decades, algorithms for disease outbreak detection have become a central interest of public health practitioners, who need to identify and localize an outbreak as early as possible so that a public health response can begin before a pandemic develops. Today's increased threat of biological warfare and terrorism provides an even stronger impetus to develop outbreak-detection methods based on symptoms as well as definitive laboratory diagnoses. In this dissertation I explore the problem of rapid disease outbreak detection using both spatial and temporal information. I develop a framework of non-parameterized algorithms that search for patterns of disease outbreak in spatial sub-regions of the monitored region within a given period. Compared with existing spatial or tempo-spatial algorithms, the algorithms in this framework provide a methodology for fast searching of either univariate or multivariate data sets. The framework first measures how likely each study area is to contain an outbreak, given the baseline data and the currently observed data. It then applies a greedy search that looks for clusters with high posterior probabilities, using the risk measurement of each unit area as a heuristic. I also evaluate the performance of the proposed algorithms. From the perspective of predictive modeling, I adopt a Gamma-Poisson (GP) model to compute the probability of an outbreak in each cluster when analyzing univariate data, and I build a multinomial generalized Dirichlet (MGD) model to identify outbreak clusters in multivariate data, including the OTC data streams collected by the National Retail Data Monitor (NRDM) and the ED data streams collected by the RODS system. Key contributions of this dissertation are: 1) a rank-based tempo-spatial clustering algorithm, RSC, which combines greedy search with a Bayesian GP model for disease outbreak detection, achieving comparable detection timeliness and cluster positive predictive value (PPV) with improved running time; and 2) a multivariate extension of RSC (MRSC), which applies the MGD model. The evaluation demonstrates that the MGD model can effectively suppress false alarms caused by elevated signals that are not disease-relevant and occur in all the monitored data streams.
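
    The rank-then-search structure described above can be illustrated compactly. Below is a minimal sketch, not the dissertation's exact RSC algorithm: a Gamma-Poisson posterior score for each unit area, followed by a greedy growth step that keeps adding the best-scoring neighbor while the cluster's mean risk improves. The names `unit_risk`, `greedy_cluster`, `neighbors`, and the mean-risk stopping rule are illustrative assumptions.

```python
# Minimal sketch: Gamma-Poisson unit scoring plus greedy cluster growth.
from scipy.stats import gamma

def unit_risk(count, baseline_rate, alpha0=1.0, beta0=1.0):
    """P(rate > baseline_rate) under the Gamma posterior.

    A Gamma(alpha0, beta0) prior on the Poisson rate, updated with one
    period's observed count, gives Gamma(alpha0 + count, beta0 + 1).
    """
    posterior = gamma(alpha0 + count, scale=1.0 / (beta0 + 1.0))
    return posterior.sf(baseline_rate)

def greedy_cluster(seed, neighbors, scores):
    """Grow a cluster from `seed`, adding the best-scoring adjacent unit
    while the cluster's mean risk keeps increasing."""
    cluster, best = {seed}, scores[seed]
    while True:
        frontier = {n for u in cluster for n in neighbors[u]} - cluster
        if not frontier:
            break
        cand = max(frontier, key=scores.get)
        mean = sum(scores[u] for u in cluster | {cand}) / (len(cluster) + 1)
        if mean <= best:
            break
        cluster.add(cand)
        best = mean
    return cluster
```

    Ranking units by `unit_risk` and seeding the growth at the highest-risk unit reflects the rank-based ordering that gives such a framework its speed advantage over exhaustive scans of all sub-regions.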

    Anomaly Detection in Time Series: Theoretical and Practical Improvements for Disease Outbreak Detection

    The automatic collection and increasing availability of health data provide a new opportunity for techniques that monitor this information. By monitoring pre-diagnostic data sources, such as over-the-counter cough medicine sales or emergency room chief complaints of cough, it is possible to detect disease outbreaks earlier than traditional laboratory confirmation allows. This research is particularly important for a modern, highly connected society, where the onset of a disease outbreak can be swift and deadly, whether caused by a naturally occurring global pandemic such as swine flu or by a targeted act of bioterrorism. In this dissertation, we first describe the problem and the current state of research in disease outbreak detection, and then provide four main additions to the field. First, we formalize a framework for analyzing health series data and detecting anomalies: forecasting methods predict the next day's value, subtracting the forecast produces residuals, and detection algorithms are run on the residuals. The formalized framework makes explicit the link between the accuracy of the forecast method and the performance of the detector, and can be used to quantify and analyze the performance of a variety of heuristic methods. Second, we describe improvements to the forecasting of health data series: using weather as a predictor, cross-series covariates, and ensemble forecasting each improve forecasts of health data. Third, we describe improvements to detection, including the use of multivariate statistics for anomaly detection and additional day-of-week preprocessing. Most significantly, we provide a new method, based on the CuScore, for optimizing detection when the impact of the disease outbreak is known; this method yields a detector optimized for rapid detection or for probability of detection within a given timeframe. Finally, we describe a method for improved comparison of detection methods. We provide tools to evaluate how well a simulated data set captures the characteristics of the authentic series, and we introduce time-lag heatmaps, a new way of visualizing daily detection rates and of comparing two methods more informatively.
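
    The three-stage framework (forecast, subtract, detect) is easy to state concretely. The sketch below assumes a daily count series with weekly seasonality; the seasonal-naive forecaster and the one-sided CUSUM with drift k and threshold h are illustrative stand-ins, not the dissertation's tuned methods.

```python
# Minimal sketch of the forecast/residual/detect pipeline on a daily series.
import numpy as np

def seasonal_naive_forecast(series, season=7):
    """Predict each day with the value observed one season earlier."""
    series = np.asarray(series, dtype=float)
    forecast = np.full_like(series, np.nan)
    forecast[season:] = series[:-season]
    return forecast

def cusum_alarms(residuals, k=0.5, h=4.0):
    """One-sided CUSUM on standardized residuals: alarm when the running
    excess over the drift allowance k crosses the threshold h."""
    r = np.asarray(residuals, dtype=float)
    valid = r[~np.isnan(r)]
    z = (r - valid.mean()) / (valid.std() + 1e-9)
    s, alarms = 0.0, []
    for t, zt in enumerate(z):
        if np.isnan(zt):
            continue  # warm-up days with no forecast yet
        s = max(0.0, s + zt - k)
        if s > h:
            alarms.append(t)
            s = 0.0  # reset after signaling
    return alarms

counts = np.random.poisson(100, size=200)            # stand-in health series
residuals = counts - seasonal_naive_forecast(counts)
print(cusum_alarms(residuals))
```

    Any forecaster and any residual-based detector can be swapped into these two slots, which is exactly the modularity the framework is meant to expose.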

    Machine Learning Solutions for Transportation Networks

    This thesis brings together a collection of novel models and methods that result from a new look at practical problems in transportation through the prism of newly available sensor data. There are four main contributions. First, we design a generative probabilistic graphical model to describe multivariate continuous densities such as observed traffic patterns. The model implements a multivariate normal distribution with covariance constrained in a natural way, using a number of parameters that is only linear (as opposed to quadratic) in the dimensionality of the data, which means that learning the model requires less data. The primary use of such a model is to support inference, for instance of data missing due to sensor malfunctions. Second, we build a model of traffic flow inspired by macroscopic flow models. Unlike traditional models of this kind, ours handles measurement uncertainty and the unobservability of certain important quantities, and incorporates on-the-fly observations more easily. Because the model does not admit efficient exact inference, we develop a particle filter. The model delivers better medium- and long-term predictions than general-purpose time series models. Moreover, having a predictive distribution over traffic state enables the application of powerful decision-making machinery to the traffic domain. Third, we design two new optimization algorithms for the common task of vehicle routing, using the traffic flow model as their probabilistic underpinning. Their benefits include suitability for highly volatile environments and easy incorporation of optimization criteria other than the classical minimal expected travel time. Finally, we present a new method for detecting accidents and other adverse events. Data collected from highways allows us to bring supervised learning approaches to incident detection, and we show that a support vector machine learner can outperform manually calibrated solutions. A major hurdle to the performance of supervised learners is data quality: the data contain systematic biases that vary from site to site. We build a dynamic Bayesian network framework that learns and rectifies these biases, leading to improved supervised detector performance with little need for manually tagged data. The realignment method applies generally to virtually all forms of labeled sequential data.
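
    The stated primary use of the first contribution, inferring readings lost to sensor malfunctions, comes down to conditioning a multivariate normal on the observed coordinates. The sketch below shows that conditioning step with an unconstrained dense covariance for clarity; the thesis's model instead constrains the covariance so the parameter count grows only linearly with dimension.

```python
# Minimal sketch: impute missing sensor readings by Gaussian conditioning.
import numpy as np

def condition_gaussian(mu, sigma, obs_idx, obs_vals):
    """Mean and covariance of the missing block given observed values."""
    n = len(mu)
    mis_idx = [i for i in range(n) if i not in set(obs_idx)]
    mu_o, mu_m = mu[obs_idx], mu[mis_idx]
    s_oo = sigma[np.ix_(obs_idx, obs_idx)]
    s_mo = sigma[np.ix_(mis_idx, obs_idx)]
    s_mm = sigma[np.ix_(mis_idx, mis_idx)]
    gain = s_mo @ np.linalg.inv(s_oo)      # regression of missing on observed
    cond_mu = mu_m + gain @ (np.asarray(obs_vals) - mu_o)
    cond_sigma = s_mm - gain @ s_mo.T
    return cond_mu, cond_sigma
```

    The conditional mean serves as the imputed value, and the conditional covariance quantifies the remaining uncertainty, which is what makes the model useful for downstream decision-making.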

    Syndromic surveillance: reports from a national conference, 2004

    Overview, Policy, and Systems -- Federal Role in Early Detection Preparedness Systems -- BioSense: Implementation of a National Early Event Detection and Situational Awareness System -- Guidelines for Constructing a Statewide Hospital Syndromic Surveillance Network
    Data Sources -- Implementation of Laboratory Order Data in BioSense Early Event Detection and Situation Awareness System -- Use of Medicaid Prescription Data for Syndromic Surveillance – New York -- Poison Control Center–Based Syndromic Surveillance for Foodborne Illness -- Monitoring Over-the-Counter Medication Sales for Early Detection of Disease Outbreaks – New York City -- Experimental Surveillance Using Data on Sales of Over-the-Counter Medications – Japan, November 2003–April 2004
    Analytic Methods -- Public Health Monitoring Tools for Multiple Data Streams -- Use of Multiple Data Streams to Conduct Bayesian Biologic Surveillance -- Space-Time Clusters with Flexible Shapes -- INFERNO: A System for Early Outbreak Detection and Signature Forecasting -- High-Fidelity Injection Detectability Experiments: A Tool for Evaluating Syndromic Surveillance Systems -- Linked Analysis for Definition of Nurse Advice Line Syndrome Groups, and Comparison to Encounters
    Simulation and Other Evaluation Approaches -- Simulation for Assessing Statistical Methods of Biologic Terrorism Surveillance -- An Evaluation Model for Syndromic Surveillance: Assessing the Performance of a Temporal Algorithm -- Evaluation of Syndromic Surveillance Based on National Health Service Direct Derived Data – England and Wales -- Initial Evaluation of the Early Aberration Reporting System – Florida
    Practice and Experience -- Deciphering Data Anomalies in BioSense -- Syndromic Surveillance on the Epidemiologist's Desktop: Making Sense of Much Data -- Connecting Health Departments and Providers: Syndromic Surveillance's Last Mile -- Comparison of Syndromic Surveillance and a Sentinel Provider System in Detecting an Influenza Outbreak – Denver, Colorado, 2003 -- Ambulatory-Care Diagnoses as Potential Indicators of Outbreaks of Gastrointestinal Illness – Minnesota -- Emergency Department Visits for Concern Regarding Anthrax – New Jersey, 2001 -- Hospital Admissions Syndromic Surveillance – Connecticut, October 2001–June 2004 -- Three Years of Emergency Department Gastrointestinal Syndromic Surveillance in New York City: What Have We Found?
    Dated August 26, 2005. Papers from the National Syndromic Surveillance Conference, sponsored by the Centers for Disease Control and Prevention, the Tufts Health Care Institute, and the Alfred P. Sloan Foundation, held November 3-4, 2004, in Boston, MA. "Public health surveillance continues to broaden in scope and intensity. Public health professionals responsible for conducting such surveillance must keep pace with evolving methodologies, models, business rules, policies, roles, and procedures. The third annual Syndromic Surveillance Conference was held in Boston, Massachusetts, during November 3-4, 2004. The conference was attended by 440 persons representing the public health, academic, and private-sector communities from 10 countries and provided a forum for scientific discourse and interaction regarding multiple aspects of public health surveillance." - p. 3. Also available via the World Wide Web.

    Subspace Representations and Learning for Visual Recognition

    Pervasive and affordable sensor and storage technology enables the acquisition of an ever-rising amount of visual data. The ability to extract semantic information by interpreting, indexing, and searching visual data is impacting domains such as surveillance, robotics, intelligence, human-computer interaction, navigation, healthcare, and several others. This further stimulates the investigation of automated extraction techniques that are more efficient and more robust against the many sources of noise affecting the already complex visual data that carries the semantic information of interest. We address the problem by designing novel visual data representations based on learning data subspace decompositions that are invariant to noise while remaining informative for the task at hand. We use this guiding principle to tackle several visual recognition problems, including detection and recognition of human interactions from surveillance video, face recognition in unconstrained environments, and domain generalization for object recognition.

    By interpreting visual data with a simple additive noise model, we consider the subspaces spanned by the model portion (model subspace) and the noise portion (variation subspace). We observe that decomposing the variation subspace against the model subspace gives rise to the so-called parity subspace; decomposing the model subspace against the variation subspace instead gives rise to what we name the invariant subspace. We extend the use of kernel techniques to the parity subspace, which enables modeling the highly non-linear temporal trajectories describing human behavior and performing detection and recognition of human interactions. In addition, we introduce supervised low-rank matrix decomposition techniques for learning the invariant subspace for two other tasks: we learn invariant representations for face recognition from grossly corrupted images, and we learn object recognition classifiers that are invariant to the so-called domain bias.

    Extensive experiments on the benchmark datasets publicly available for each of the three tasks show that learning representations based on subspace decompositions invariant to the sources of noise leads to results comparable to or better than the state of the art.
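
    The additive-noise view can be made concrete with ordinary linear algebra: estimate a model subspace from data, then split any sample into its component inside that subspace and the orthogonal residual carrying the variation. The sketch below uses a plain SVD for illustration only and omits the kernel extension and supervised low-rank machinery described above.

```python
# Minimal sketch: subspace estimation and model/variation decomposition.
import numpy as np

def model_subspace(X, rank):
    """Top-`rank` left singular vectors of a data matrix X (one sample
    per column), after centering."""
    Xc = X - X.mean(axis=1, keepdims=True)
    U, _, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :rank]

def decompose(x, U):
    """Split x into its projection onto span(U) and the orthogonal
    residual; the residual plays the role of the variation component."""
    model_part = U @ (U.T @ x)
    return model_part, x - model_part
```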

    Change-point Problem and Regression: An Annotated Bibliography

    The problems of identifying changes at unknown times and of estimating the location of changes in stochastic processes are referred to as the change-point problem or, in the Eastern literature, as "disorder". The change-point problem, first introduced in the quality-control context, has since developed into a fundamental problem in the areas of statistical control theory, stationarity of a stochastic process, estimation of the current position of a time series, testing and estimation of change in the patterns of a regression model, and, most recently, the comparison and matching of DNA sequences in microarray data analysis. Numerous methodological approaches have been applied to change-point models: maximum-likelihood estimation, Bayesian estimation, isotonic regression, piecewise regression, quasi-likelihood, and non-parametric regression, among others. Grid-searching approaches have also been used (see the sketch after this abstract). Statistical analysis of change-point problems depends on the method of data collection: if data collection continues until some random time, the appropriate statistical procedure is called sequential; if, however, a large finite set of data is collected with the purpose of determining whether at least one change point occurred, the analysis may be called non-sequential. Not surprisingly, both have a rich literature, with much of the earlier work focusing on sequential methods inspired by applications in quality control for industrial processes. In the regression literature, the change-point model is also referred to as two- or multiple-phase regression, switching regression, segmented regression, two-stage least squares (Shaban, 1980), or broken-line regression. The change-point problem has been the subject of intensive research for the past half-century; the subject has evolved considerably and found applications in many different areas. Since it is hardly possible to summarize all of the research carried out over the past 50 years, we have confined ourselves to articles on change-point problems that pertain to regression. The important branch of sequential procedures in change-point problems has been left out entirely; we refer readers to the seminal review papers by Lai (1995, 2001). The so-called structural change models, which occupy a considerable portion of change-point research, particularly among econometricians, have not been fully considered; we refer the reader to Perron (2005) for an updated review of this area. Articles on change-point in time series are considered only if the methodologies they present pertain to regression analysis.
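
    The grid-searching approach mentioned above is straightforward for a single change point in two-phase (segmented) regression: fit separate least-squares lines on each side of every candidate split and keep the split with the smallest total residual sum of squares. This is a minimal sketch; the function names and the `min_seg` minimum-segment-length guard are illustrative.

```python
# Minimal sketch: grid search for one change point in two-phase regression.
import numpy as np

def fit_rss(x, y):
    """Residual sum of squares of the simple linear fit y ~ a + b*x."""
    A = np.column_stack([np.ones_like(x, dtype=float), x])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.sum((y - A @ coef) ** 2))

def grid_search_changepoint(x, y, min_seg=3):
    """Index k splitting (x, y) into two regimes with minimal combined RSS."""
    best_k, best_rss = None, np.inf
    for k in range(min_seg, len(x) - min_seg):
        rss = fit_rss(x[:k], y[:k]) + fit_rss(x[k:], y[k:])
        if rss < best_rss:
            best_k, best_rss = k, rss
    return best_k, best_rss
```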

    Climate Change and Marine Geological Dynamics

    The tendency for climate to change has been one of the most surprising outcomes of the study of Earth's history. Marine geoscience can reveal valuable information about past environments, climates, and biota just before, during, and after each climate perturbation. In particular, certain intervals of the geological record are windows onto key episodes in the climate history of the Earth–life system. The detailed analysis of such time intervals is challenging and rewarding for environmental reconstruction and climate modelling, because these intervals document and improve our understanding of a warmer-than-present world and provide opportunities to test and refine the predictive ability of climate models. Marine geological dynamics such as sea-level changes, hydrographic parameters, water quality, sedimentary cyclicity, and (paleo)climate are strongly linked through a direct exchange between the oceanographic and atmospheric systems. The increasing attention paid to this broad topic is also motivated by the interplay of these processes across a variety of settings (coastal to open marine) and timescales (early Cenozoic to modern). To realize the full predictive value of these warm (fresh)/cold (salty) intervals in Earth's history, it is important to have reliable tools (e.g., integrated geochemical, paleontological, and/or paleoceanographic proxies) and to apply multiple, independent, and novel techniques (e.g., TEX86, UK'37, Mg/Ca, Na/Ca, Δ47, and μCT) that provide reliable hydroclimate reconstructions at both local and global scales.