9,496 research outputs found

    Biomedical ontology alignment: An approach based on representation learning

    Get PDF
    While representation learning techniques have shown great promise in application to a number of different NLP tasks, they have had little impact on the problem of ontology matching. Unlike past work that has focused on feature engineering, we present a novel representation learning approach that is tailored to the ontology matching task. Our approach is based on embedding ontological terms in a high-dimensional Euclidean space. This embedding is derived on the basis of a novel phrase retrofitting strategy through which semantic similarity information becomes inscribed onto fields of pre-trained word vectors. The resulting framework also incorporates a novel outlier detection mechanism based on a denoising autoencoder that is shown to improve performance. An ontology matching system derived using the proposed framework achieved an F-score of 94% on an alignment scenario involving the Adult Mouse Anatomical Dictionary and the Foundational Model of Anatomy ontology (FMA) as targets. This compares favorably with the best performing systems on the Ontology Alignment Evaluation Initiative anatomy challenge. We performed additional experiments on aligning FMA to NCI Thesaurus and to SNOMED CT based on a reference alignment extracted from the UMLS Metathesaurus. Our system obtained overall F-scores of 93.2% and 89.2% for these experiments, thus achieving state-of-the-art results

    Anomaly detection and explanation in big data

    Get PDF
    2021 Spring.Includes bibliographical references.Data quality tests are used to validate the data stored in databases and data warehouses, and to detect violations of syntactic and semantic constraints. Domain experts grapple with the issues related to the capturing of all the important constraints and checking that they are satisfied. The constraints are often identified in an ad hoc manner based on the knowledge of the application domain and the needs of the stakeholders. Constraints can exist over single or multiple attributes as well as records involving time series and sequences. The constraints involving multiple attributes can involve both linear and non-linear relationships among the attributes. We propose ADQuaTe as a data quality test framework that automatically (1) discovers different types of constraints from the data, (2) marks records that violate the constraints as suspicious, and (3) explains the violations. Domain knowledge is required to determine whether or not the suspicious records are actually faulty. The framework can incorporate feedback from domain experts to improve the accuracy of constraint discovery and anomaly detection. We instantiate ADQuaTe in two ways to detect anomalies in non-sequence and sequence data. The first instantiation (ADQuaTe2) uses an unsupervised approach called autoencoder for constraint discovery in non-sequence data. ADQuaTe2 is based on analyzing records in isolation to discover constraints among the attributes. We evaluate the effectiveness of ADQuaTe2 using real-world non-sequence datasets from the human health and plant diagnosis domains. We demonstrate that ADQuaTe2 can discover new constraints that were previously unspecified in existing data quality tests, and can report both previously detected and new faults in the data. We also use non-sequence datasets from the UCI repository to evaluate the improvement in the accuracy of ADQuaTe2 after incorporating ground truth knowledge and retraining the autoencoder model. The second instantiation (IDEAL) uses an unsupervised LSTM-autoencoder for constraint discovery in sequence data. IDEAL analyzes the correlations and dependencies among data records to discover constraints. We evaluate the effectiveness of IDEAL using datasets from Yahoo servers, NASA Shuttle, and Colorado State University Energy Institute. We demonstrate that IDEAL can detect previously known anomalies from these datasets. Using mutation analysis, we show that IDEAL can detect different types of injected faults. We also demonstrate that the accuracy of the approach improves after incorporating ground truth knowledge about the injected faults and retraining the LSTM-Autoencoder model. The novelty of this research lies in the development of a domain-independent framework that effectively and efficiently discovers different types of constraints from the data, detects and explains anomalous data, and minimizes false alarms through an interactive learning process

    Big data analytics for preventive medicine

    Get PDF
    © 2019, Springer-Verlag London Ltd., part of Springer Nature. Medical data is one of the most rewarding and yet most complicated data to analyze. How can healthcare providers use modern data analytics tools and technologies to analyze and create value from complex data? Data analytics, with its promise to efficiently discover valuable pattern by analyzing large amount of unstructured, heterogeneous, non-standard and incomplete healthcare data. It does not only forecast but also helps in decision making and is increasingly noticed as breakthrough in ongoing advancement with the goal is to improve the quality of patient care and reduces the healthcare cost. The aim of this study is to provide a comprehensive and structured overview of extensive research on the advancement of data analytics methods for disease prevention. This review first introduces disease prevention and its challenges followed by traditional prevention methodologies. We summarize state-of-the-art data analytics algorithms used for classification of disease, clustering (unusually high incidence of a particular disease), anomalies detection (detection of disease) and association as well as their respective advantages, drawbacks and guidelines for selection of specific model followed by discussion on recent development and successful application of disease prevention methods. The article concludes with open research challenges and recommendations

    Investigating the attainment of optimum data quality for EHR Big Data: proposing a new methodological approach

    Get PDF
    The value derivable from the use of data is continuously increasing since some years. Both commercial and non-commercial organisations have realised the immense benefits that might be derived if all data at their disposal could be analysed and form the basis of decision taking. The technological tools required to produce, capture, store, transmit and analyse huge amounts of data form the background to the development of the phenomenon of Big Data. With Big Data, the aim is to be able to generate value from huge amounts of data, often in non-structured format and produced extremely frequently. However, the potential value derivable depends on general level of governance of data, more precisely on the quality of the data. The field of data quality is well researched for traditional data uses but is still in its infancy for the Big Data context. This dissertation focused on investigating effective methods to enhance data quality for Big Data. The principal deliverable of this research is in the form of a methodological approach which can be used to optimize the level of data quality in the Big Data context. Since data quality is contextual, (that is a non-generalizable field), this research study focuses on applying the methodological approach in one use case, in terms of the Electronic Health Records (EHR). The first main contribution to knowledge of this study systematically investigates which data quality dimensions (DQDs) are most important for EHR Big Data. The two most important dimensions ascertained by the research methods applied in this study are accuracy and completeness. These are two well-known dimensions, and this study confirms that they are also very important for EHR Big Data. The second important contribution to knowledge is an investigation into whether Artificial Intelligence with a special focus upon machine learning could be used in improving the detection of dirty data, focusing on the two data quality dimensions of accuracy and completeness. Regression and clustering algorithms proved to be more adequate for accuracy and completeness related issues respectively, based on the experiments carried out. However, the limits of implementing and using machine learning algorithms for detecting data quality issues for Big Data were also revealed and discussed in this research study. It can safely be deduced from the knowledge derived from this part of the research study that use of machine learning for enhancing data quality issues detection is a promising area but not yet a panacea which automates this entire process. The third important contribution is a proposed guideline to undertake data repairs most efficiently for Big Data; this involved surveying and comparing existing data cleansing algorithms against a prototype developed for data reparation. Weaknesses of existing algorithms are highlighted and are considered as areas of practice which efficient data reparation algorithms must focus upon. Those three important contributions form the nucleus for a new data quality methodological approach which could be used to optimize Big Data quality, as applied in the context of EHR. Some of the activities and techniques discussed through the proposed methodological approach can be transposed to other industries and use cases to a large extent. The proposed data quality methodological approach can be used by practitioners of Big Data Quality who follow a data-driven strategy. As opposed to existing Big Data quality frameworks, the proposed data quality methodological approach has the advantage of being more precise and specific. It gives clear and proven methods to undertake the main identified stages of a Big Data quality lifecycle and therefore can be applied by practitioners in the area. This research study provides some promising results and deliverables. It also paves the way for further research in the area. Technical and technological changes in Big Data is rapidly evolving and future research should be focusing on new representations of Big Data, the real-time streaming aspect, and replicating same research methods used in this current research study but on new technologies to validate current results

    EDMON - Electronic Disease Surveillance and Monitoring Network: A Personalized Health Model-based Digital Infectious Disease Detection Mechanism using Self-Recorded Data from People with Type 1 Diabetes

    Get PDF
    Through time, we as a society have been tested with infectious disease outbreaks of different magnitude, which often pose major public health challenges. To mitigate the challenges, research endeavors have been focused on early detection mechanisms through identifying potential data sources, mode of data collection and transmission, case and outbreak detection methods. Driven by the ubiquitous nature of smartphones and wearables, the current endeavor is targeted towards individualizing the surveillance effort through a personalized health model, where the case detection is realized by exploiting self-collected physiological data from wearables and smartphones. This dissertation aims to demonstrate the concept of a personalized health model as a case detector for outbreak detection by utilizing self-recorded data from people with type 1 diabetes. The results have shown that infection onset triggers substantial deviations, i.e. prolonged hyperglycemia regardless of higher insulin injections and fewer carbohydrate consumptions. Per the findings, key parameters such as blood glucose level, insulin, carbohydrate, and insulin-to-carbohydrate ratio are found to carry high discriminative power. A personalized health model devised based on a one-class classifier and unsupervised method using selected parameters achieved promising detection performance. Experimental results show the superior performance of the one-class classifier and, models such as one-class support vector machine, k-nearest neighbor and, k-means achieved better performance. Further, the result also revealed the effect of input parameters, data granularity, and sample sizes on model performances. The presented results have practical significance for understanding the effect of infection episodes amongst people with type 1 diabetes, and the potential of a personalized health model in outbreak detection settings. The added benefit of the personalized health model concept introduced in this dissertation lies in its usefulness beyond the surveillance purpose, i.e. to devise decision support tools and learning platforms for the patient to manage infection-induced crises

    Data Mining in Internet of Things Systems: A Literature Review

    Get PDF
    The Internet of Things (IoT) and cloud technologies have been the main focus of recent research, allowing for the accumulation of a vast amount of data generated from this diverse environment. These data include without any doubt priceless knowledge if could correctly discovered and correlated in an efficient manner. Data mining algorithms can be applied to the Internet of Things (IoT) to extract hidden information from the massive amounts of data that are generated by IoT and are thought to have high business value. In this paper, the most important data mining approaches covering classification, clustering, association analysis, time series analysis, and outlier analysis from the knowledge will be covered. Additionally, a survey of recent work in in this direction is included. Another significant challenges in the field are collecting, storing, and managing the large number of devices along with their associated features. In this paper, a deep look on the data mining for the IoT platforms will be given concentrating on real applications found in the literatur
    corecore