79 research outputs found

    Anomaly detection and explanation in big data

    2021 Spring. Includes bibliographical references.
    Data quality tests are used to validate the data stored in databases and data warehouses, and to detect violations of syntactic and semantic constraints. Domain experts grapple with capturing all the important constraints and checking that they are satisfied. The constraints are often identified in an ad hoc manner based on knowledge of the application domain and the needs of the stakeholders. Constraints can exist over single or multiple attributes as well as over records involving time series and sequences. Constraints involving multiple attributes can capture both linear and non-linear relationships among the attributes. We propose ADQuaTe as a data quality test framework that automatically (1) discovers different types of constraints from the data, (2) marks records that violate the constraints as suspicious, and (3) explains the violations. Domain knowledge is required to determine whether or not the suspicious records are actually faulty. The framework can incorporate feedback from domain experts to improve the accuracy of constraint discovery and anomaly detection. We instantiate ADQuaTe in two ways to detect anomalies in non-sequence and sequence data. The first instantiation (ADQuaTe2) uses an unsupervised autoencoder approach for constraint discovery in non-sequence data. ADQuaTe2 is based on analyzing records in isolation to discover constraints among the attributes. We evaluate the effectiveness of ADQuaTe2 using real-world non-sequence datasets from the human health and plant diagnosis domains. We demonstrate that ADQuaTe2 can discover new constraints that were previously unspecified in existing data quality tests, and can report both previously detected and new faults in the data.
We also use non-sequence datasets from the UCI repository to evaluate the improvement in the accuracy of ADQuaTe2 after incorporating ground truth knowledge and retraining the autoencoder model. The second instantiation (IDEAL) uses an unsupervised LSTM-autoencoder for constraint discovery in sequence data. IDEAL analyzes the correlations and dependencies among data records to discover constraints. We evaluate the effectiveness of IDEAL using datasets from Yahoo servers, NASA Shuttle, and the Colorado State University Energy Institute. We demonstrate that IDEAL can detect previously known anomalies from these datasets. Using mutation analysis, we show that IDEAL can detect different types of injected faults. We also demonstrate that the accuracy of the approach improves after incorporating ground truth knowledge about the injected faults and retraining the LSTM-autoencoder model. The novelty of this research lies in the development of a domain-independent framework that effectively and efficiently discovers different types of constraints from the data, detects and explains anomalous data, and minimizes false alarms through an interactive learning process.
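The flag-by-reconstruction-error idea behind ADQuaTe2 can be sketched with a linear stand-in for the autoencoder: records are encoded and decoded through a low-rank projection, and records that reconstruct poorly are marked suspicious. Everything below (the synthetic data, the rank-2 PCA projection, the 3-sigma threshold) is illustrative, not the framework's actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Inlier records obey a hidden linear constraint: x2 ≈ x0 + x1 (plus small noise).
inliers = rng.normal(size=(200, 2))
x2 = inliers[:, 0] + inliers[:, 1] + 0.05 * rng.normal(size=200)
inliers = np.column_stack([inliers, x2])
outlier = np.array([[0.0, 0.0, 5.0]])        # record that violates the constraint
X = np.vstack([inliers, outlier])

# Rank-2 linear reconstruction (PCA) fitted on the inliers: a linear stand-in
# for the autoencoder's encode/decode step.
mu = inliers.mean(axis=0)
_, _, Vt = np.linalg.svd(inliers - mu, full_matrices=False)
W = Vt[:2]                                   # 2 x 3 encoder/decoder weights
Xhat = (X - mu) @ W.T @ W + mu               # encode, then decode
err = np.linalg.norm(X - Xhat, axis=1)       # reconstruction error per record

# Records whose error is far above the bulk are marked suspicious;
# a domain expert would then decide whether they are actually faulty.
threshold = err[:-1].mean() + 3 * err[:-1].std()
suspicious = np.where(err > threshold)[0]
print(suspicious)
```

The injected record (index 200) lies far off the constraint surface, so its reconstruction error dominates and it gets flagged.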

    Biomedical Information Extraction Pipelines for Public Health in the Age of Deep Learning

    Unstructured texts containing biomedical information from sources such as electronic health records, scientific literature, discussion forums, and social media offer an opportunity to extract information for a wide range of applications in biomedical informatics. Building scalable and efficient pipelines for natural language processing and extraction of biomedical information plays an important role in the implementation and adoption of applications in areas such as public health. Advancements in machine learning and deep learning techniques have enabled rapid development of such pipelines. This dissertation presents entity extraction pipelines for two public health applications: virus phylogeography and pharmacovigilance. For virus phylogeography, geographical locations are extracted from biomedical scientific texts for metadata enrichment in the GenBank database, which contains 2.9 million virus nucleotide sequences. For pharmacovigilance, tools are developed to extract adverse drug reactions from social media posts to open avenues for post-market drug surveillance from non-traditional sources. Across these pipelines, high variance is observed in extraction performance among the entities of interest while using state-of-the-art neural network architectures. To explain the variation, linguistic measures are proposed to serve as indicators for entity extraction performance and to provide deeper insight into the domain complexity and the challenges associated with entity extraction. For both the phylogeography and pharmacovigilance pipelines presented in this work, the annotated datasets and applications are open source and freely available to the public to foster further research in public health.
    Dissertation/Thesis. Doctoral Dissertation, Biomedical Informatics, 201
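As a toy illustration of the location-extraction step only (the dissertation's pipelines use neural NER architectures, not a lookup), a gazetteer match shows the basic extract-entities-from-text pattern; the gazetteer entries and example sentence below are made up for the sketch.

```python
import re

# Hypothetical gazetteer; real pipelines resolve far richer place vocabularies.
GAZETTEER = {"brazil", "china", "new york", "arizona"}

def extract_locations(text):
    """Return gazetteer locations mentioned in the text, longest names first."""
    lowered = text.lower()
    found = []
    for name in sorted(GAZETTEER, key=len, reverse=True):
        # word-boundary match so "china" does not fire inside "machinary" etc.
        if re.search(r"\b" + re.escape(name) + r"\b", lowered):
            found.append(name)
    return found

print(extract_locations("The H1N1 strain was sampled in New York and Brazil."))
# → ['new york', 'brazil']
```

A neural model replaces the lookup with learned context-sensitive tagging, which is what makes performance vary so much across entity types.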

    Improving Patient Safety, Patient Flow and Physician Well-Being in Emergency Departments

    Over 151 million people visit US Emergency Departments (EDs) annually. The diverse nature and overwhelming volume of patient visits make the ED one of the most complicated settings in healthcare to study. ED overcrowding is a recognized worldwide public health problem, and its negative impacts include patient safety concerns, increased patient length of stay, medical errors, patients leaving without being seen, ambulance diversions, and increased health system expenditure. Additionally, ED crowding has been identified as a leading contributor to patient morbidity and mortality. Furthermore, this chaotic working environment affects the well-being of all ED staff through increased frustration, workload, stress, and higher rates of burnout, which has a direct impact on patient safety. This research takes a step-by-step approach to address these issues by first forecasting the daily and hourly patient arrivals to an ED, including their Emergency Severity Index (ESI) levels, using time series forecasting models and machine learning models. Next, we develop an agent-based discrete event simulation model in which both patients and physicians are modeled as unique agents to capture activities representative of an ED. Using this model, we develop various physician shift schedules, including restriction policies and overlapping policies, to improve patient safety and patient flow in the ED. Using the number of handoffs, i.e., the number of patients transferred from one physician to another, as the patient safety metric, and patient time in the ED and throughput as the patient flow metrics, we compare the new policies to current practices. Additionally, using this model, we also compare the current patient assignment algorithm used by the partner ED to a novel approach in which physicians determine patient assignment considering their workload, time remaining in their shift, etc.
Further, to identify the optimal physician staffing required for the ED at any given hour of the day, we develop a Mixed Integer Linear Programming (MILP) model whose objective is to minimize the combined cost of physician staffing in the ED, patient waiting time, and handoffs. To develop operational schedules, we surveyed over 70 ED physicians and incorporated their feedback into the MILP model. After developing multiple weekly schedules, we tested them in the validated simulation model to evaluate their efficacy in improving patient safety and patient flow while accounting for the ED staffing budget. Finally, in the last phase, to understand the stress and burnout experienced by attending and resident physicians working ED shifts, we collected over 100 hours of physiological responses from 12 ED physicians, along with subjective metrics on stress and burnout during their shifts. We compared the physiological signals and subjective metrics to characterize the differences between attending and resident physicians. Further, we developed machine learning models to detect the early onset of stress to assist physicians in decision-making.
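The handoff metric itself is easy to compute from patient stays and shift boundaries. The sketch below uses hypothetical 8-hour shifts and counts a handoff whenever a shift-change time falls strictly inside a patient's stay; this is a simplification of the agent-based model, where assignment logic also matters.

```python
from bisect import bisect_left, bisect_right

def count_handoffs(patient_stays, shift_changes):
    """Count physician-to-physician transfers: one handoff per shift-change
    time that falls strictly inside a patient's ED stay.
    shift_changes must be sorted."""
    total = 0
    for start, end in patient_stays:
        # shift changes t with start < t < end
        total += bisect_left(shift_changes, end) - bisect_right(shift_changes, start)
    return total

# Hypothetical 8-hour shifts changing at hours 8, 16, and 24.
shifts = [8, 16, 24]
stays = [(6, 10), (9, 12), (15, 18), (7, 17)]
print(count_handoffs(stays, shifts))  # → 4
```

Schedules with overlapping shifts reduce this count because an incoming physician can pick up new arrivals instead of inheriting mid-treatment patients.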

    Twitter Mining for Syndromic Surveillance

    Enormous amounts of personalised data are generated daily from social media platforms today. Twitter in particular generates vast textual streams in real time, accompanied by personal information. This big social media data offers a potential avenue for inferring public and social patterns. This PhD thesis investigates the use of Twitter data to deliver signals for syndromic surveillance, in order to assess its ability to augment existing syndromic surveillance efforts and give a better understanding of symptomatic people who do not seek healthcare advice directly. We focus on a specific syndrome - asthma/difficulty breathing. We seek to develop means of extracting reliable signals from the Twitter stream for syndromic surveillance purposes. We begin by outlining our data collection and preprocessing methods. However, we observe that even with keyword-based data collection, many of the collected tweets are not relevant because they represent chatter or talk of awareness rather than an individual suffering from a particular condition. In light of this, we set out to identify relevant tweets in order to obtain a strong and reliable signal. We first develop novel features based on the emoji content of tweets and apply semi-supervised learning techniques to filter tweets. Next, we investigate the effectiveness of deep learning at this task. We propose a novel classification algorithm based on neural language models and compare it to existing successful and popular deep learning algorithms. Following this, we propose an attentive bi-directional Recurrent Neural Network architecture for filtering tweets, which also offers additional syndromic surveillance utility by identifying keywords among syndromic tweets. In doing so, we are not only able to detect alarms, but also gain some insight into what the alarm involves.
Lastly, we look towards optimizing the Twitter syndromic surveillance pipeline by selecting the best possible keywords to supply to the Twitter API. We develop algorithms to intelligently and automatically select keywords such that both the relevance and the quantity of the tweets collected are maximised.
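One simple way to frame this keyword-selection step is as a coverage problem. The sketch below is a generic greedy set-cover heuristic, not the thesis's algorithm: it repeatedly picks the keyword that adds the most not-yet-covered relevant tweets, using a made-up candidate list and corpus.

```python
def select_keywords(tweets, candidate_keywords, budget):
    """Greedily choose up to `budget` keywords, each round taking the keyword
    that matches the most tweets not already covered by earlier picks."""
    covered, chosen = set(), []
    for _ in range(budget):
        best, best_gain = None, 0
        for kw in candidate_keywords:
            gain = sum(1 for i, t in enumerate(tweets)
                       if kw in t and i not in covered)
            if gain > best_gain:
                best, best_gain = kw, gain
        if best is None:          # no keyword adds any new tweet
            break
        chosen.append(best)
        covered |= {i for i, t in enumerate(tweets) if best in t}
    return chosen

tweets = ["asthma attack today", "can't breathe wheezing",
          "wheezing and asthma", "inhaler helped my asthma"]
print(select_keywords(tweets, ["asthma", "wheezing", "inhaler"], budget=2))
# → ['asthma', 'wheezing']
```

In practice the gain would be estimated from labelled relevance rather than raw substring matches, but the trade-off between relevance and volume has the same shape.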

    A Dynamic Neural Network Architecture with Immunology-Inspired Optimization for Weather Data Forecasting

    Recurrent neural networks are dynamical systems that provide memory capabilities to recall past behaviour, which is necessary in the prediction of time series. In this paper, a novel neural network architecture inspired by the immune algorithm is presented and used in the forecasting of naturally occurring signals, including weather big data signals. Big Data Analysis is a major research frontier, which attracts extensive attention from academia, industry, and government, particularly in the context of handling issues related to complex dynamics due to changing weather conditions. Recently, extensive deployment of IoT, sensors, and ambient intelligence systems has led to an exponential growth of data in the climate domain. In this study, we concentrate on the analysis of big weather data using the Dynamic Self-Organized Neural Network Inspired by the Immune Algorithm. The learning strategy of the network focuses on the local properties of the signal using a self-organised hidden layer inspired by the immune algorithm, while the recurrent links of the network aim at recalling previously observed signal patterns. The proposed network exhibits improved performance when compared to the feedforward multilayer neural network and state-of-the-art recurrent networks, e.g., the Elman and the Jordan networks. Three non-linear and non-stationary weather signals are used in our experiments. First, the signals are transformed into stationary form, followed by 5-step-ahead prediction. Improvements in the prediction results are observed with respect to the root mean square (RMS) of the error and the signal-to-noise ratio (SNR), albeit at the expense of additional computational complexity due to the presence of recurrent links.
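The stationarize-then-predict pipeline can be sketched with first-order differencing and a naive persistence predictor, evaluated by the RMS error and SNR. The toy series, the 5-step horizon, and the predictor below are illustrative stand-ins, not the paper's immune-inspired network.

```python
import math

def difference(signal):
    """First-order differencing: a common way to make a trend-dominated
    signal approximately stationary before prediction."""
    return [b - a for a, b in zip(signal, signal[1:])]

def rms(errors):
    return math.sqrt(sum(e * e for e in errors) / len(errors))

# Toy weather-like series: linear trend plus a seasonal ripple.
series = [0.1 * t + math.sin(t / 3.0) for t in range(100)]
diffed = difference(series)

# Naive 5-step-ahead predictor on the differenced signal: persist the last value.
horizon = 5
preds = [diffed[t] for t in range(len(diffed) - horizon)]
truth = [diffed[t + horizon] for t in range(len(diffed) - horizon)]
err = [p - a for p, a in zip(preds, truth)]

# Evaluate with RMS error and signal-to-noise ratio (in dB).
signal_power = sum(x * x for x in truth) / len(truth)
snr_db = 10 * math.log10(signal_power / (rms(err) ** 2))
print(round(rms(err), 3), round(snr_db, 1))
```

A recurrent model replaces the persistence step, using its feedback links to recall past patterns of the differenced signal; that is where the extra computational cost comes from.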

    Applying Artificial Intelligence to wearable sensor data to diagnose and predict cardiovascular disease: a review

    Cardiovascular disease (CVD) is the world’s leading cause of mortality. There is significant interest in using Artificial Intelligence (AI) to analyse data from novel sensors such as wearables to provide an earlier and more accurate prediction and diagnosis of heart disease. Digital health technologies that fuse AI and sensing devices may help disease prevention and reduce the substantial morbidity and mortality caused by CVD worldwide. In this review, we identify and describe recent developments in the application of digital health for CVD, focusing on AI approaches for CVD detection, diagnosis, and prediction through AI models driven by data collected from wearables. We summarise the literature on the use of wearables and AI in cardiovascular disease diagnosis, followed by a detailed description of the dominant AI approaches applied for modelling and prediction using data acquired from sensors such as wearables. We discuss the AI algorithms and models and their clinical applications, and find that AI- and machine-learning-based approaches are superior to traditional or conventional statistical methods for predicting cardiovascular events. However, further studies evaluating the applicability of such algorithms in the real world are needed. In addition, improvements in wearable device data accuracy and better management of their application are required. Lastly, we discuss the challenges that the introduction of such technologies into routine healthcare may face.

    Process Mining Workshops

    This open access book constitutes revised selected papers from the International Workshops held at the Third International Conference on Process Mining, ICPM 2021, which took place in Eindhoven, The Netherlands, during October 31–November 4, 2021. The conference focuses on the area of process mining research and practice, including theory, algorithmic challenges, and applications. The co-located workshops provided a forum for novel research ideas. The 28 papers included in this volume were carefully reviewed and selected from 65 submissions. They stem from the following workshops:
    - 2nd International Workshop on Event Data and Behavioral Analytics (EDBA)
    - 2nd International Workshop on Leveraging Machine Learning in Process Mining (ML4PM)
    - 2nd International Workshop on Streaming Analytics for Process Mining (SA4PM)
    - 6th International Workshop on Process Querying, Manipulation, and Intelligence (PQMI)
    - 4th International Workshop on Process-Oriented Data Science for Healthcare (PODS4H)
    - 2nd International Workshop on Trust, Privacy, and Security in Process Analytics (TPSA)
    One survey paper on the results of the XES 2.0 Workshop is also included.