1,146 research outputs found

    A taxonomy framework for unsupervised outlier detection techniques for multi-type data sets

    The term "outlier" can generally be defined as an observation that is significantly different from the other values in a data set. Outliers may be instances of error or may indicate events. The task of outlier detection aims at identifying such outliers in order to improve the analysis of data and to discover interesting and useful knowledge about unusual events within numerous application domains. In this paper, we report on contemporary unsupervised outlier detection techniques for multiple types of data sets and provide a comprehensive taxonomy framework and two decision trees to select the most suitable technique based on the data set. Furthermore, we highlight the advantages, disadvantages and performance issues of each class of outlier detection techniques under this taxonomy framework.

    Time Series Outlier Detection Based on Sliding Window Prediction

    To detect outliers in hydrological time series data, and thereby improve data quality and the quality of decisions related to the design, operation, and management of water resources, this research develops a time series outlier detection method for hydrologic data that identifies values deviating from historical patterns. The method first builds a forecasting model on the historical data and then uses it to predict future values. An anomaly is assumed to occur if the observed value falls outside a given prediction confidence interval (PCI), which is calculated from the predicted value and a confidence coefficient. The PCI is used as the threshold mainly because it accounts for the uncertainty of the data series parameters in the forecasting model, which addresses the problem of selecting a suitable threshold. The method performs fast, incremental evaluation of data as it becomes available, scales to large quantities of data, and requires no preclassification of anomalies. Experiments with different real-world hydrologic time series showed that the proposed method is fast, correctly identifies abnormal data, and can be used for hydrologic time series analysis.
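    The sliding-window idea described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the forecasting model, so the window mean stands in as a hypothetical forecaster, and the function name `detect_outliers`, the parameters `window` and `z`, and the interval formula (a normal-style interval around the window mean) are all assumptions made for illustration.

```python
from statistics import mean, stdev


def detect_outliers(series, window=5, z=3.0):
    """Return indices of values that fall outside a prediction interval.

    Assumption: the forecast for position t is the mean of the previous
    `window` values; the paper's actual forecasting model is not specified.
    """
    outliers = []
    for t in range(window, len(series)):
        history = series[t - window:t]
        predicted = mean(history)   # stand-in forecaster (assumption)
        spread = stdev(history)     # sample standard deviation of the window
        # Half-width of the interval around the prediction; `z` plays the
        # role of the confidence coefficient mentioned in the abstract.
        half_width = z * spread * (1 + 1 / window) ** 0.5
        if abs(series[t] - predicted) > half_width:
            outliers.append(t)
    return outliers


if __name__ == "__main__":
    flow = [10.0, 10.2, 9.9, 10.1, 10.0, 10.1, 25.0, 10.0, 9.8, 10.2]
    print(detect_outliers(flow))
```

    Because each value is judged only against the preceding window, the check can run incrementally as new observations arrive. One limitation of this simple version is that a flagged value still enters later windows and widens their intervals; a fuller implementation might replace it with the predicted value before continuing.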

    Improved Algorithms for Discovery of New Genes in Bacterial Genomes

    In this dissertation, we describe a new approach to gene finding that can use proteomics information in addition to DNA and RNA to identify new genes in prokaryote genomes. Proteomics processing pipelines require the identification of small pieces of proteins called peptides. Peptide identification is a very error-prone process, and we have developed a new algorithm for validating peptide identifications using a distance-based outlier detection method. We demonstrate that our method identifies more peptides than other popular methods using standard mixtures of known proteins. In addition, our algorithm provides a much more accurate estimate of the false discovery rate than other methods. Once peptides have been identified and validated, we use a second algorithm, proteogenomic mapping (PGM), to map these peptides to the genome and find the genetic signals that allow us to identify potential novel protein-coding genes, called expressed Protein Sequence Tags (ePSTs). We then collect and combine evidence for the ePSTs we generated, and evaluate the likelihood that each ePST represents a true new protein-coding gene using supervised machine learning techniques. Finally, we have developed new approaches to Bayesian learning that allow us to model the knowledge domain from sparse biological datasets. We have developed two new bootstrap approaches that use resampling to build networks with the most robust features, those that recur in many networks. These bootstrap methods yield improved prediction accuracy. We have also developed an unsupervised Bayesian network structure learning method that can be used when training data is not available or when labels may not be reliable.

    Data Mining in Internet of Things Systems: A Literature Review

    The Internet of Things (IoT) and cloud technologies have been the main focus of recent research, allowing for the accumulation of a vast amount of data generated from this diverse environment. These data undoubtedly contain valuable knowledge, provided it can be correctly discovered and correlated in an efficient manner. Data mining algorithms can be applied to the IoT to extract hidden information from the massive amounts of data that IoT systems generate, data thought to have high business value. This paper covers the most important data mining approaches for extracting that knowledge: classification, clustering, association analysis, time series analysis, and outlier analysis. Additionally, a survey of recent work in this direction is included. Other significant challenges in the field include collecting, storing, and managing the large number of devices along with their associated features. The paper takes a close look at data mining for IoT platforms, concentrating on real applications found in the literature.