13 research outputs found

    Bayesian temporal and spatio-temporal Markov switching models for the detection of influenza outbreaks

    Influenza affects millions of people every year and causes hundreds of thousands of deaths, along with substantial direct and indirect costs. Influenza epidemics have a particular behavior which shapes the statistical methods for their detection. Seasonal epidemics happen virtually every year in the temperate parts of the globe during the cold months and extend throughout whole regions, countries and even continents. Besides the seasonal epidemics, some nonseasonal epidemics can be observed at unexpected times, usually caused by strains which jump the barrier between animals and humans, as happened with the well-known Swine Flu epidemic, which caused great alarm in 2009. Several statistical methods have been proposed for the detection of outbreaks of diseases and, in particular, for influenza outbreaks. A reduced version of the review presented in this thesis was published in REVSTAT-Statistical Journal by Amorós et al. in 2015. A useful tool for building statistical methods for the detection of influenza outbreaks is the Markov switching model, where latent variables are paired with the observations, indicating the epidemic or endemic phase. Two different models are applied to the data according to the value of the latent variable. The latent variables are temporally linked through a Markov chain. The observations are also conditionally dependent on their temporal or spatio-temporal neighbors. Models using this tool can offer a probability of being in an epidemic as an outcome instead of just a ‘yes’ or ‘no’. The Bayesian paradigm offers an interesting framework where the outcomes can be interpreted as probability distributions. Also, inference can be done over complex hierarchical models, as Markov switching models usually are. This research offers two extensions of the model proposed by Martinez-Beneito et al. in 2008, published in Statistics in Medicine.
The first proposal is a framework of Poisson Markov switching models over the counts. This proposal was published in Statistical Methods in Medical Research by Conesa et al. in 2015. In this proposal, the counts are modeled through a Poisson distribution, and the mean of these counts is related to the rates through the population. Then, the rates are modeled through a Normal distribution. The mean and variance of the rates depend on whether each week is in the epidemic or nonepidemic phase. The latent variables which determine the epidemic phase are modeled through a hidden Markov chain. The mean and the variance in the epidemic phase are considered to be larger than those in the endemic phase. Different degrees of temporal dependency of the mean of the data can be defined. A first option is to consider the rates conditionally independent. A second option is to consider that every observation is conditionally dependent on the previous observation through an autoregressive process of order 1. Higher orders of dependency can be defined, but we limited our framework of models to an autoregressive process of order 2 to avoid unnecessary complexity, as no big changes in the outcome were observed using higher orders of autocorrelation. The application of this framework of methods to several databases showed that this proposal outperforms other methodologies present in the literature. It also highlights several difficulties in the process of evaluating statistical methods for the detection of influenza outbreaks. The second proposal of this research is a spatio-temporal Markov switching model over the differenced rates, which are considered to follow a normal distribution, with mean and variance parameters dependent on the epidemic state. The latent variables are modeled in the same way as in the temporal proposal, but with one conditionally independent hidden Markov chain for each of the locations.
The variance of the endemic phase is also considered to be lower than that of the epidemic phase. Three components are defined for the mean of the differenced rates. First, a common term for all the regions at each time is set in both the endemic and epidemic mean. These terms are defined as two random effects, with mean zero and a higher variance for the epidemic phase. The variances of these random effects are linked to those of the likelihood to avoid problems of identifiability. Second, an autoregressive term for each location is defined for the epidemic mean, as it is expected that from the beginning of the epidemic until the peak we observe similar positive jumps and from the peak to the end of the epidemic we observe similar negative jumps. Third, an intrinsic CAR structure is defined for the epidemic mean, considering that the epidemic can spread to neighboring regions, which will have similar epidemic increases of the rates. This proposal has been applied to the United States Google Flu Trends data from 2007 to 2013 for the 48 spatially connected states plus Washington D.C. The comparison of the model with several simplifications and variations has highlighted the necessity of several of the assumptions made during the modeling process.
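The core idea of the abstract above, a probability of being in the epidemic phase rather than a yes/no alarm, can be sketched with a two-state forward filter. This is only an illustration under strong assumptions: the transition probabilities and the Normal parameters are fixed by hand (the thesis estimates them in a Bayesian hierarchical model), the observations are treated as conditionally independent given the state, and the function names and default values are hypothetical.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a Normal(mean, sd) distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def epidemic_probabilities(rates, p_stay_endemic, p_stay_epidemic,
                           endemic=(0.0, 1.0), epidemic=(0.0, 3.0)):
    """Forward filter: P(epidemic in week t | observations up to week t).

    `endemic` and `epidemic` are assumed (mean, sd) pairs; the epidemic
    phase is given the larger spread, as in the abstract.
    """
    # Row i, column j: probability of moving from state i to state j
    # (state 0 = endemic, state 1 = epidemic).
    trans = [[p_stay_endemic, 1.0 - p_stay_endemic],
             [1.0 - p_stay_epidemic, p_stay_epidemic]]
    params = [endemic, epidemic]
    belief = [0.5, 0.5]  # assumed flat prior over the initial state
    result = []
    for x in rates:
        # Predict: push last week's belief through the Markov chain.
        predicted = [sum(belief[i] * trans[i][j] for i in range(2))
                     for j in range(2)]
        # Update: weight each state by how well it explains this week's value.
        weighted = [predicted[j] * normal_pdf(x, *params[j]) for j in range(2)]
        total = sum(weighted)
        belief = [w / total for w in weighted]
        result.append(belief[1])
    return result


# Hypothetical differenced rates: two quiet weeks, then two large jumps.
probs = epidemic_probabilities([0.1, -0.2, 5.0, 4.0], 0.9, 0.9)
```

With these made-up inputs, the quiet weeks receive a low epidemic probability and the weeks with large jumps a high one; a full treatment would add the autoregressive and spatial terms and estimate all parameters from data.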

    Twitter Mining for Syndromic Surveillance

    Enormous amounts of personalised data are generated daily on social media platforms. Twitter, in particular, generates vast textual streams in real time, accompanied by personal information. This big social media data offers a potential avenue for inferring public and social patterns. This PhD thesis investigates the use of Twitter data to deliver signals for syndromic surveillance in order to assess its ability to augment existing syndromic surveillance efforts and give a better understanding of symptomatic people who do not seek healthcare advice directly. We focus on a specific syndrome: asthma/difficulty breathing. We seek to develop means of extracting reliable signals from the Twitter stream, to be used for syndromic surveillance purposes. We begin by outlining our data collection and preprocessing methods. However, we observe that even with keyword-based data collection, many of the collected tweets are not relevant because they represent chatter, or talk of awareness instead of an individual suffering a particular condition. In light of this, we set out to identify relevant tweets to collect a strong and reliable signal. We first develop novel features based on the emoji content of Tweets and apply semi-supervised learning techniques to filter Tweets. Next, we investigate the effectiveness of deep learning at this task. We propose a novel classification algorithm based on neural language models, and compare it to existing successful and popular deep learning algorithms. Following this, we go on to propose an attentive bi-directional Recurrent Neural Network architecture for filtering Tweets which also offers additional syndromic surveillance utility by identifying keywords among syndromic Tweets. In doing so, we are not only able to detect alarms, but also gain some clues as to what the alarm involves.
Lastly, we look towards optimizing the Twitter syndromic surveillance pipeline by selecting the best possible keywords to be supplied to the Twitter API. We develop algorithms to intelligently and automatically select keywords such that the quality, in terms of relevance, and the quantity of Tweets collected are maximised.

    Public Health Monitoring of Behavioural Risk Factors and Mobility in Canada: An IoT-based Big Data Approach

    Background: Despite the presence of robust global public health surveillance mechanisms, the COVID-19 pandemic devastated the world and exposed the weaknesses of public healthcare systems. Public health surveillance has improved in recent years as technology evolved to enable the mining of diverse data sources, for example, electronic medical records and social media, to monitor diseases and risk factors. However, the current public health surveillance system depends on traditional data sources (e.g., Canadian Community Health Survey (CCHS), Canadian Health Measures Survey (CHMS)) and modern data sources (e.g., health insurance registry, physician billing claims database). While improvement has been observed over the past few years, there is still room to improve the current systems with NextGen data sources (e.g., social media data, Internet of Things data), improved analytical mechanisms, reporting, and dissemination of the results to drive improved policymaking at the national and provincial levels. In that context, data generated from modern technologies like the Internet of Things (IoT) have demonstrated the potential to bridge the gap and be relevant for public health surveillance. This study explores IoT technologies as potential data sources for public health surveillance and assesses their feasibility with a proof of concept. The objectives of this thesis are to use data from IoT technologies, in this case, a smart thermostat with remote sensors that collect real-time data without additional burden on the users, to measure some of the critical population-level health indicators for Canada and its provinces. Methods: This exploratory research thesis utilizes an innovative data source (ecobee) and cloud-based analytical infrastructure (Microsoft Azure).
The research started with a pilot study to assess the feasibility and validity of ecobee smart thermostat-generated movement sensor data to calculate population-level indicators for physical activity, sedentary behaviour, and sleep parameters for Canada. In the pilot study, eight participants gathered step counts using a commercially available Fitbit wearable as well as sensor activation data from ecobee smart thermostats. In the second part of the study, a perspective article analyzes the feasibility and utility of IoT data for public health surveillance. In the third part of this study, data from ecobee smart thermostats from the “Donate your Data” program was used to compare the behavioural changes during the COVID-19 pandemic in four provinces of Canada. In the fourth part of the study, data from the “Donate your Data” program was used in conjunction with Google residential mobility data to assess the impact of the work-from-home policy on micro and macro mobility across four provinces of Canada. The study's final part discusses how IoT data can be utilized to improve policy-level decisions and their impact on daily living, with a focus on situations similar to the COVID-19 pandemic. Results: The Spearman correlation coefficient between the Fitbit step counts and the number of sensors activated was 0.8 (range 0.78-0.90; n=3292), statistically significant at the P < .001 level. The pilot study shows that ecobee sensor data have the potential to generate population-level health indicators. The Physical Activity, Sleep, and Sedentary Behaviour (PASS) indicators generated from IoT data for Canada were consistent with the PASS indicators developed by the Public Health Agency of Canada. Following the pilot study, the perspective paper analyzed the possible use of the IoT data along nine critical dimensions: simplicity, flexibility, data quality, acceptability, sensitivity, positive predictive value, representativeness, timeliness, and stability.
This paper also described the potential advantages, disadvantages and use cases of IoT data for individual and population-level health indicators. The results from the pilot study and the viewpoint paper show that IoT can become a future data source to complement traditional public health surveillance systems. The third part of the study shows a significant change in behaviour in Canada after the onset of the COVID-19 pandemic and the work-from-home, stay-at-home and other policy changes. Sleep habits (average bedtime, wake-up time, sleep duration) and average in-house and out-of-the-house durations were calculated for the four major provinces of Canada (Ontario, Quebec, Alberta, and British Columbia). Compared to pre-pandemic times, the average sleep duration and time spent inside the house increased significantly, bedtime and wake-up time were delayed, and the average time spent out of the house decreased significantly during the COVID-19 pandemic. The result of the fourth study shows that in-house mobility (micro-mobility) increased after the pandemic-related policy changes (e.g., stay-at-home orders, work-from-home policy, emergency declaration). The results were consistent with findings from the residential mobility data published by Google. The Pearson correlation coefficient between these datasets was 0.7 (range 0.68-0.8), statistically significant at the P < .001 level. The time-series analysis of the ecobee and Google residential mobility data highlights substantial similarities. The potential strength of IoT data for anomaly detection is also demonstrated in that chapter. Discussion and Conclusion: This research's findings demonstrate that IoT data, in this case from smart thermostats with remote motion sensors, are a viable option to measure population-level health indicators.
The impact of the population-level behavioural changes due to the COVID-19 pandemic might be sustained even after policy relaxation and may significantly affect physical and mental health. These innovative datasets can strengthen the existing public health surveillance mechanism by providing timely and diverse data to public health officials. These additional data sources can provide surveillance systems with near-real-time health indicators and potentially measure the short- and long-term impact of policy changes.
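The validation step in the abstract above relies on a Spearman correlation between Fitbit step counts and thermostat sensor activations. A minimal sketch of that statistic follows: Spearman's coefficient is the Pearson correlation of the ranks, with tied values sharing their average rank. The helper names and the sample numbers are hypothetical and are not taken from the study.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the whole run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up hourly values for illustration only.
steps = [120, 450, 300, 980, 40, 610]
activations = [3, 9, 7, 15, 1, 8]
rho = spearman(steps, activations)
```

Because it works on ranks, the coefficient only measures whether the two series rise and fall together, not whether the relationship is linear, which suits a comparison between step counts and sensor activations that are on different scales.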

    Machine Learning Methods with Noisy, Incomplete or Small Datasets

    In many machine learning applications, available datasets are incomplete, noisy or affected by artifacts. In supervised scenarios, the label information may be of low quality, which can include unbalanced training sets, noisy labels and other problems. Moreover, in practice, it is very common that the available data samples are not enough to derive useful supervised or unsupervised classifiers. All these issues are commonly referred to as the low-quality data problem. This book collects novel contributions on machine learning methods for low-quality datasets, to contribute to the dissemination of new ideas to solve this challenging problem, and to provide clear examples of application in real scenarios.

    Modelling dengue epidemics with autoregressive switching Markov models (AR-HMM)

    In this work, autoregressive switching-Markov models (AR-HMM) are applied to the dengue fever (DF) epidemics in Havana (Cuba). This technique makes it possible to model time series that are controlled by some unobserved process and finite time lags. A first experiment with real dengue data is performed in order to characterize the different stages of the epidemics. The aim of this work is to present a method which can give valuable information about how an efficient control strategy can be carried out.

    Analyzing Granger causality in climate data with time series classification methods

    Attribution studies in climate science aim to scientifically ascertain the influence of climatic variations on natural or anthropogenic factors. Many of those studies adopt the concept of Granger causality to infer statistical cause-effect relationships, while utilizing traditional autoregressive models. In this article, we investigate the potential of state-of-the-art time series classification techniques to enhance causal inference in climate science. We conduct a comparative experimental study of different types of algorithms on a large test suite that comprises a unique collection of datasets from the area of climate-vegetation dynamics. The results indicate that specialized time series classification methods are able to improve existing inference procedures. Substantial differences are observed among the methods that were tested.
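For context, the traditional autoregressive form of Granger causality that the abstract above takes as its baseline can be sketched as follows. This is not the time series classification approach the article proposes: it is the classical test with a single lag, comparing a regression of y on its own past against one that also includes the past of x. The function names and the synthetic data are assumptions made for illustration.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def rss(design, target):
    """Residual sum of squares of a least-squares fit (normal equations)."""
    k = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * t for row, t in zip(design, target)) for i in range(k)]
    beta = solve(xtx, xty)
    return sum((t - sum(b * v for b, v in zip(beta, row))) ** 2
               for row, t in zip(design, target))


def granger_f(x, y):
    """F statistic for 'x Granger-causes y' using one lag of each series."""
    target = y[1:]
    restricted = [[1.0, y[t - 1]] for t in range(1, len(y))]
    full = [[1.0, y[t - 1], x[t - 1]] for t in range(1, len(y))]
    rss_restricted, rss_full = rss(restricted, target), rss(full, target)
    dof = len(target) - 3  # observations minus parameters of the full model
    return (rss_restricted - rss_full) / (rss_full / dof)


# Synthetic example: y follows x with a one-step delay, plus noise.
random.seed(0)
x = [random.gauss(0.0, 1.0) for _ in range(200)]
y = [random.gauss(0.0, 0.1)]
y += [0.8 * x[t - 1] + random.gauss(0.0, 0.1) for t in range(1, 200)]
f_x_to_y = granger_f(x, y)
f_y_to_x = granger_f(y, x)
```

In this synthetic setup the statistic for "x causes y" is far larger than the one for the reverse direction, because only x's past improves the prediction of y. Real attribution studies use more lags and nonlinear or classification-based predictors, which is the gap the article addresses.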