Anomaly detection and explanation in big data
Spring 2021. Data quality tests are used to validate the data stored in databases and data warehouses, and to detect violations of syntactic and semantic constraints. Domain experts grapple with capturing all the important constraints and checking that they are satisfied. Constraints are often identified in an ad hoc manner based on knowledge of the application domain and the needs of the stakeholders. Constraints can exist over single or multiple attributes, as well as over records involving time series and sequences, and constraints involving multiple attributes can capture both linear and non-linear relationships among them. We propose ADQuaTe, a data quality test framework that automatically (1) discovers different types of constraints from the data, (2) marks records that violate the constraints as suspicious, and (3) explains the violations. Domain knowledge is required to determine whether the suspicious records are actually faulty. The framework can incorporate feedback from domain experts to improve the accuracy of constraint discovery and anomaly detection. We instantiate ADQuaTe in two ways to detect anomalies in non-sequence and sequence data. The first instantiation (ADQuaTe2) uses an unsupervised autoencoder approach for constraint discovery in non-sequence data; it analyzes records in isolation to discover constraints among the attributes. We evaluate the effectiveness of ADQuaTe2 using real-world non-sequence datasets from the human health and plant diagnosis domains, and demonstrate that ADQuaTe2 can discover constraints that were previously unspecified in existing data quality tests and can report both previously detected and new faults in the data.
We also use non-sequence datasets from the UCI repository to evaluate the improvement in the accuracy of ADQuaTe2 after incorporating ground truth knowledge and retraining the autoencoder model. The second instantiation (IDEAL) uses an unsupervised LSTM-autoencoder for constraint discovery in sequence data. IDEAL analyzes the correlations and dependencies among data records to discover constraints. We evaluate the effectiveness of IDEAL using datasets from Yahoo servers, NASA Shuttle, and the Colorado State University Energy Institute, and demonstrate that IDEAL can detect previously known anomalies in these datasets. Using mutation analysis, we show that IDEAL can detect different types of injected faults. We also demonstrate that the accuracy of the approach improves after incorporating ground truth knowledge about the injected faults and retraining the LSTM-autoencoder model. The novelty of this research lies in the development of a domain-independent framework that effectively and efficiently discovers different types of constraints from the data, detects and explains anomalous data, and minimizes false alarms through an interactive learning process.
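As a minimal sketch of the reconstruction-error idea behind autoencoder-based constraint discovery (not the ADQuaTe2 implementation; the data, network size, learning rate, and threshold below are invented for illustration), a tiny linear autoencoder can learn an inter-attribute constraint and flag records that violate it:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "clean" records obeying a linear constraint: x2 ≈ 2*x0 + x1.
X = rng.normal(size=(500, 2))
X = np.column_stack([X, 2 * X[:, 0] + X[:, 1] + 0.01 * rng.normal(size=500)])

# Tiny linear autoencoder (3 -> 2 -> 3) trained by plain gradient descent.
W_enc = rng.normal(scale=0.1, size=(3, 2))
W_dec = rng.normal(scale=0.1, size=(2, 3))
lr = 0.02
for _ in range(3000):
    Z = X @ W_enc                     # encode to the 2-D bottleneck
    err = Z @ W_dec - X               # reconstruction error
    W_dec -= lr * (Z.T @ err) / len(X)
    W_enc -= lr * (X.T @ (err @ W_dec.T)) / len(X)

def recon_error(x):
    """Squared reconstruction error of one record; high values are suspicious."""
    return float(np.sum((x @ W_enc @ W_dec - x) ** 2))

normal = np.array([1.0, 1.0, 3.0])    # satisfies x2 = 2*x0 + x1
anomaly = np.array([1.0, 1.0, 9.0])   # violates the learned constraint
print(recon_error(anomaly) > recon_error(normal))
```

A threshold on the reconstruction error (e.g., a high percentile of the training errors) would then mark records as suspicious for expert review.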
Biomedical Information Extraction Pipelines for Public Health in the Age of Deep Learning
Unstructured texts containing biomedical information from sources such as electronic health records, scientific literature, discussion forums, and social media offer an opportunity to extract information for a wide range of applications in biomedical informatics. Building scalable and efficient pipelines for natural language processing and extraction of biomedical information plays an important role in the implementation and adoption of applications in areas such as public health. Advancements in machine learning and deep learning techniques have enabled rapid development of such pipelines. This dissertation presents entity extraction pipelines for two public health applications: virus phylogeography and pharmacovigilance. For virus phylogeography, geographical locations are extracted from biomedical scientific texts for metadata enrichment in the GenBank database containing 2.9 million virus nucleotide sequences. For pharmacovigilance, tools are developed to extract adverse drug reactions from social media posts to open avenues for post-market drug surveillance from non-traditional sources. Across these pipelines, high variance is observed in extraction performance among the entities of interest while using state-of-the-art neural network architectures. To explain the variation, linguistic measures are proposed to serve as indicators for entity extraction performance and to provide deeper insight into the domain complexity and the challenges associated with entity extraction. For both the phylogeography and pharmacovigilance pipelines presented in this work, the annotated datasets and applications are open source and freely available to the public to foster further research in public health. Doctoral dissertation, Biomedical Informatics.
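As a toy illustration of one stage of such a pipeline (not the neural models used in this work; the gazetteer entries and the input sentence are hypothetical), a dictionary-based baseline for extracting geographical locations from text might look like:

```python
import re

# Toy gazetteer (hypothetical entries) mapping surface forms to normalized locations.
GAZETTEER = {
    "yunnan": "Yunnan, China",
    "china": "China",
    "sao paulo": "São Paulo, Brazil",
}

def extract_locations(text):
    """Return (surface form, normalized location) pairs found in the text."""
    found = []
    lowered = text.lower()
    for surface, normalized in GAZETTEER.items():
        # Word-boundary match so "china" does not fire inside "chinaware".
        if re.search(r"\b" + re.escape(surface) + r"\b", lowered):
            found.append((surface, normalized))
    return found

hits = extract_locations("The H5N1 strain was isolated in Yunnan, China in 2006.")
print(hits)  # → [('yunnan', 'Yunnan, China'), ('china', 'China')]
```

Neural sequence taggers replace this brittle lookup with learned context, which is where the performance variance discussed above arises.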
Improving Patient Safety, Patient Flow and Physician Well-Being in Emergency Departments
Over 151 million people visit US Emergency Departments (EDs) annually. The diverse nature and overwhelming volume of patient visits make the ED one of the most complicated settings in healthcare to study. ED overcrowding is a recognized worldwide public health problem, and its negative impacts include patient safety concerns, increased patient length of stay, medical errors, patients left without being seen, ambulance diversions, and increased health system expenditure. ED crowding has also been identified as a leading contributor to patient morbidity and mortality. Furthermore, this chaotic working environment affects the well-being of all ED staff through increased frustration, workload, stress, and higher rates of burnout, all of which directly impact patient safety.
This research takes a step-by-step approach to address these issues. First, we forecast daily and hourly patient arrivals to an ED, including their Emergency Severity Index (ESI) levels, using time series forecasting models and machine learning models. Next, we developed an agent-based discrete event simulation model in which both patients and physicians are modeled as unique agents to capture activities representative of an ED. Using this model, we develop various physician shift schedules, including restriction policies and overlapping policies, to improve patient safety and patient flow in the ED. We compare the new policies to current practice using the number of handoffs (patients transferred from one physician to another) as the patient safety metric, and patient time in the ED and throughput as the patient flow metrics. Additionally, using this model, we compare the patient assignment algorithm currently used by the partner ED to a novel approach in which physicians determine patient assignment based on their workload, the time remaining in their shift, and other factors.
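A stripped-down sketch of the handoff metric (not the full agent-based simulation; the shift length, arrival rate, and treatment-time distribution are assumed values) can be expressed as a simple stochastic simulation:

```python
import random

random.seed(7)

SHIFT_LEN = 8 * 60                   # physician shift length in minutes (assumed)
SIM_END = 24 * 60                    # simulate one day
MEAN_IAT, MEAN_TREAT = 15.0, 90.0    # hypothetical inter-arrival / treatment means

def simulate():
    """Count handoffs: patients whose treatment outlasts the attending's shift."""
    treated = handoffs = 0
    t = 0.0
    while True:
        t += random.expovariate(1.0 / MEAN_IAT)      # next patient arrival
        if t >= SIM_END:
            break
        treat = random.expovariate(1.0 / MEAN_TREAT) # treatment duration
        shift_end = (int(t // SHIFT_LEN) + 1) * SHIFT_LEN
        treated += 1
        if t + treat > shift_end:                    # care crosses a shift change
            handoffs += 1
    return treated, handoffs

treated, handoffs = simulate()
print(treated, handoffs)
```

Overlapping or restricted shift schedules change `shift_end` relative to each patient's arrival, which is how the policies above trade handoffs against staffing cost.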
Further, to identify the optimal physician staffing required for the ED for any given hour of the day, we develop a Mixed Integer Linear Programming (MILP) model where the objective is to minimize the combined cost of physician staffing in the ED, patient waiting time, and handoffs. To develop operations schedules, we surveyed over 70 ED physicians and incorporated their feedback into the MILP model. After developing multiple weekly schedules, these were tested in the validated simulation model to evaluate their efficacy in improving patient safety and patient flow while accounting for the ED staffing budget.
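One plausible shape for such a MILP (hypothetical symbols: \(x_h\) physicians on duty in hour \(h\), \(\lambda_h\) forecast arrivals, \(\mu\) patients handled per physician-hour, \(w_h\) uncovered demand as a waiting proxy, \(s_h\) a handoff proxy at shift changes; the actual model in this work is richer) is:

```latex
\begin{aligned}
\min_{x,\,w,\,s} \quad & \sum_{h \in H} c_{\text{phys}}\, x_h \;+\; \alpha \sum_{h \in H} w_h \;+\; \beta \sum_{h \in H} s_h \\
\text{s.t.} \quad & \mu\, x_h + w_h \;\ge\; \lambda_h, && \forall h \in H \quad \text{(demand coverage)} \\
& s_h \;\ge\; x_h - x_{h+1}, && \forall h \in H \quad \text{(shift-end handoff proxy)} \\
& x_h \in \mathbb{Z}_{\ge 0}, \quad w_h,\, s_h \ge 0.
\end{aligned}
```

The weights \(\alpha\) and \(\beta\) encode the trade-off between staffing budget, waiting, and handoffs; the survey feedback would enter as additional constraints on feasible shift patterns.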
Finally, to understand stress and burnout among attending and resident physicians working in the ED, we collected over 100 hours of physiological responses from 12 ED physicians, along with subjective metrics on stress and burnout during their shifts. We compared the physiological signals and subjective metrics to characterize the differences between attending and resident physicians. Further, we developed machine learning models to detect the early onset of stress to assist physicians in decision-making.
Twitter Mining for Syndromic Surveillance
Enormous amounts of personalised data are generated daily from social media platforms. Twitter, in particular, generates vast textual streams in real time, accompanied by personal information. This big social media data offers a potential avenue for inferring public and social patterns. This PhD thesis investigates the use of Twitter data to deliver signals for syndromic surveillance, in order to assess its ability to augment existing syndromic surveillance efforts and to give a better understanding of symptomatic people who do not seek healthcare advice directly. We focus on a specific syndrome: asthma/difficulty breathing. We seek to develop means of extracting reliable signals from the Twitter stream for syndromic surveillance purposes. We begin by outlining our data collection and preprocessing methods. We observe that even with keyword-based data collection, many of the collected tweets are not relevant because they represent chatter or talk of awareness rather than an individual suffering from a particular condition. In light of this, we set out to identify relevant tweets in order to collect a strong and reliable signal. We first develop novel features based on the emoji content of Tweets and apply semi-supervised learning techniques to filter Tweets. Next, we investigate the effectiveness of deep learning at this task. We propose a novel classification algorithm based on neural language models and compare it to existing successful and popular deep learning algorithms. Following this, we propose an attentive bi-directional Recurrent Neural Network architecture for filtering Tweets, which offers additional syndromic surveillance utility by identifying keywords among syndromic Tweets. In doing so, we are not only able to detect alarms, but also gain some insight into what an alarm involves.
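A minimal sketch of emoji-based features (assuming a crude Unicode-category test rather than the full emoji feature set developed in the thesis; the example tweet is invented):

```python
import unicodedata

def emoji_features(tweet):
    """Crude emoji features for a tweet: count and presence.

    This only checks the Unicode 'So' (Symbol, other) category, which covers
    most emoji; a real system would use the full Unicode emoji data files.
    """
    emoji = [ch for ch in tweet if unicodedata.category(ch) == "So"]
    return {"n_emoji": len(emoji), "has_emoji": bool(emoji)}

print(emoji_features("Can't breathe today 😷💨 #asthma"))
# → {'n_emoji': 2, 'has_emoji': True}
```

Features like these can then feed a semi-supervised classifier that separates genuine symptom reports from awareness chatter.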
Lastly, we look towards optimizing the Twitter syndromic surveillance pipeline by selecting the best possible keywords to supply to the Twitter API. We developed algorithms to intelligently and automatically select keywords such that both the relevance and the quantity of the Tweets collected are maximised.
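The keyword selection idea can be sketched as a greedy procedure (the scoring function, penalty weight, and per-keyword statistics below are invented for illustration; the thesis's actual objective differs):

```python
def select_keywords(candidates, budget):
    """Greedily pick keywords that add newly covered relevant tweets,
    penalized by the volume of irrelevant tweets they also pull in."""
    chosen, covered = [], set()
    for _ in range(budget):
        def score(kw):
            relevant_ids, irrelevant_count = candidates[kw]
            return len(relevant_ids - covered) - 0.5 * irrelevant_count
        remaining = [k for k in candidates if k not in chosen]
        if not remaining:
            break
        best = max(remaining, key=score)
        if score(best) <= 0:          # stop when no keyword helps on balance
            break
        chosen.append(best)
        covered |= candidates[best][0]
    return chosen

# Hypothetical stats: keyword -> (ids of relevant tweets matched, irrelevant count)
stats = {
    "asthma": ({1, 2, 3, 4}, 2),
    "inhaler": ({3, 4, 5}, 1),
    "wheezing": ({6}, 0),
    "air": ({2}, 40),
}
print(select_keywords(stats, budget=3))
# → ['asthma', 'wheezing', 'inhaler']  ("air" is rejected: mostly noise)
```

The diminishing-returns structure (each pick only counts tweets not already covered) is what makes greedy selection a natural fit here.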
A Dynamic Neural Network Architecture with Immunology-Inspired Optimization for Weather Data Forecasting
Recurrent neural networks are dynamical systems that provide memory capabilities to recall past behaviour, which is necessary in the prediction of time series. In this paper, a novel neural network architecture inspired by the immune algorithm is presented and used in the forecasting of naturally occurring signals, including big weather data signals. Big data analysis is a major research frontier that attracts extensive attention from academia, industry, and government, particularly in the context of handling the complex dynamics of changing weather conditions. Recently, the extensive deployment of IoT devices, sensors, and ambient intelligence systems has led to an exponential growth of data in the climate domain. In this study, we concentrate on the analysis of big weather data using the Dynamic Self-Organized Neural Network Inspired by the Immune Algorithm. The learning strategy of the network focuses on the local properties of the signal using a self-organised hidden layer inspired by the immune algorithm, while the recurrent links of the network aim at recalling previously observed signal patterns. The proposed network exhibits improved performance compared to the feedforward multilayer neural network and state-of-the-art recurrent networks, e.g., the Elman and Jordan networks. Three non-linear and non-stationary weather signals are used in our experiments. The signals are first transformed to stationarity, followed by 5-step-ahead prediction. Improvements in the prediction results are observed with respect to the root mean square (RMS) error and the signal-to-noise ratio (SNR), albeit at the expense of additional computational complexity due to the presence of recurrent links.
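The preprocessing and evaluation steps can be sketched as follows (a naive persistence forecaster stands in for the immune-inspired network; the synthetic signal and the 5-step horizon are assumptions for illustration):

```python
import numpy as np

t = np.arange(400)
# Synthetic non-stationary signal: linear trend plus a daily cycle.
signal = 0.05 * t + np.sin(2 * np.pi * t / 24)

# First-order differencing removes the linear trend (one common
# transform to stationarity before training a forecaster).
diff = np.diff(signal)

# Score a 5-step-ahead prediction; here a naive "repeat the last
# observed value" baseline plays the role of the trained network.
horizon = 5
pred = np.repeat(diff[:-horizon][-1], horizon)
true = diff[-horizon:]

rmse = np.sqrt(np.mean((true - pred) ** 2))
snr_db = 10 * np.log10(np.mean(true ** 2) / np.mean((true - pred) ** 2))
print(rmse, snr_db)
```

RMS error and SNR computed this way give the two evaluation axes reported in the paper: absolute error magnitude and error relative to signal power.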
Applying Artificial Intelligence to wearable sensor data to diagnose and predict cardiovascular disease: a review
Cardiovascular disease (CVD) is the world’s leading cause of mortality. There is significant interest in using Artificial Intelligence (AI) to analyse data from novel sensors such as wearables to provide an earlier and more accurate prediction and diagnosis of heart disease. Digital health technologies that fuse AI and sensing devices may help disease prevention and reduce the substantial morbidity and mortality caused by CVD worldwide. In this review, we identify and describe recent developments in the application of digital health for CVD, focusing on AI approaches for CVD detection, diagnosis, and prediction through AI models driven by data collected from wearables. We summarise the literature on the use of wearables and AI in cardiovascular disease diagnosis, followed by a detailed description of the dominant AI approaches applied for modelling and prediction using data acquired from sensors such as wearables. We discuss the AI algorithms and models and their clinical applications, and find that AI and machine-learning-based approaches are superior to traditional or conventional statistical methods for predicting cardiovascular events. However, further studies evaluating the applicability of such algorithms in the real world are needed. In addition, improvements in wearable device data accuracy and better management of their application are required. Lastly, we discuss the challenges that the introduction of such technologies into routine healthcare may face.
Machine learning to model health with multimodal mobile sensor data
The widespread adoption of smartphones and wearables has led to the accumulation of rich datasets, which could aid the understanding of behavior and health in unprecedented detail. At the same time, machine learning and specifically deep learning have reached impressive performance in a variety of prediction tasks, but their use on time-series data appears challenging. Existing models struggle to learn from this unique type of data due to noise, sparsity, long-tailed distributions of behaviors, lack of labels, and multimodality.
This dissertation addresses these challenges by developing new models that leverage multi-task learning for accurate forecasting, multimodal fusion for improved population subtyping, and self-supervision for learning generalized representations. We apply our proposed methods to challenging real-world tasks of predicting mental health and cardio-respiratory fitness through sensor data.
First, we study the relationship of passive data as collected from smartphones (movement and background audio) to momentary mood levels. Our new training pipeline, which combines different sensor data into a low-dimensional embedding and clusters longitudinal user trajectories as outcome, outperforms traditional approaches based solely on psychology questionnaires. Second, motivated by mood instability as a predictor of poor mental health, we propose encoder-decoder models for time-series forecasting which exploit the bi-modality of mood with multi-task learning.
Next, motivated by the success of general-purpose models in vision and language tasks, we propose a self-supervised neural network ready-to-use as a feature extractor for wearable data. To this end, we set the heart rate responses as the supervisory signal for activity data, leveraging their underlying physiological relationship and show that the resulting task-agnostic embeddings can generalize in predicting structurally different downstream outcomes through transfer learning (e.g. BMI, age, energy expenditure), outperforming unsupervised autoencoders and biomarkers. Finally, acknowledging fitness as a strong predictor of overall health, which, however, can only be measured with expensive instruments (e.g., a VO2max test), we develop models that enable accurate prediction of fine-grained fitness levels with wearables in the present, and more importantly, its direction and magnitude almost a decade later.
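The heart-rate-as-supervisory-signal idea can be sketched with linear models on synthetic data (a deliberately simplified stand-in for the self-supervised neural network described above; the latent-factor data, dimensions, and "fitness" outcome are invented):

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic activity features and heart rate linked through one latent factor.
n, d = 1000, 8
latent = rng.normal(size=(n, 1))
activity = latent @ rng.normal(size=(1, d)) + 0.1 * rng.normal(size=(n, d))
heart_rate = 60 + 20 * latent[:, 0] + rng.normal(scale=2.0, size=n)

# Pretext task: fit an "encoder" to predict heart rate from activity.
# No downstream labels are used here; heart rate supervises for free.
w, *_ = np.linalg.lstsq(activity, heart_rate, rcond=None)
embedding = activity @ w              # 1-D task-agnostic representation (toy)

# Downstream transfer: reuse the frozen embedding for a different outcome,
# a hypothetical fitness score also driven by the same latent physiology.
fitness = 40 + 5 * latent[:, 0] + rng.normal(scale=1.0, size=n)
A = np.column_stack([embedding, np.ones(n)])
coef, *_ = np.linalg.lstsq(A, fitness, rcond=None)
r2 = 1 - np.sum((A @ coef - fitness) ** 2) / np.sum((fitness - fitness.mean()) ** 2)
print(round(r2, 2))
```

Because heart rate and the downstream outcome share underlying physiology, the pretext-trained representation transfers without any downstream-specific training of the encoder, which is the essence of the approach.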
All proposed methods are evaluated on large longitudinal datasets with tens of thousands of participants in the wild. The models developed and the insights drawn in this dissertation provide evidence for a better understanding of high-dimensional behavioral and physiological data, with implications for large-scale health and lifestyle monitoring. This work was supported by the Department of Computer Science and Technology at the University of Cambridge, the EPSRC through Grant DTP (EP/N509620/1), and the Embiricos Trust Scholarship of Jesus College, Cambridge.
Process Mining Workshops
This open access book constitutes revised selected papers from the International Workshops held at the Third International Conference on Process Mining, ICPM 2021, which took place in Eindhoven, The Netherlands, during October 31–November 4, 2021. The conference focuses on the area of process mining research and practice, including theory, algorithmic challenges, and applications. The co-located workshops provided a forum for novel research ideas. The 28 papers included in this volume were carefully reviewed and selected from 65 submissions. They stem from the following workshops:
- 2nd International Workshop on Event Data and Behavioral Analytics (EDBA)
- 2nd International Workshop on Leveraging Machine Learning in Process Mining (ML4PM)
- 2nd International Workshop on Streaming Analytics for Process Mining (SA4PM)
- 6th International Workshop on Process Querying, Manipulation, and Intelligence (PQMI)
- 4th International Workshop on Process-Oriented Data Science for Healthcare (PODS4H)
- 2nd International Workshop on Trust, Privacy, and Security in Process Analytics (TPSA)
One survey paper on the results of the XES 2.0 Workshop is also included.