
    A recommender system for process discovery

    Over the last decade, several algorithms for process discovery and process conformance have been proposed. Still, it is well accepted that there is no dominant algorithm in either of these two disciplines, and it is therefore often difficult to apply them successfully. Most of these algorithms require close-to-expert knowledge in order to be applied satisfactorily. In this paper, we present a recommender system that uses portfolio-based algorithm selection strategies to address two problems: finding the best discovery algorithm for the data at hand, and bridging the gap between general users and process mining algorithms. Experiments performed with the developed tool demonstrate the usefulness of the approach for a variety of instances.
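    The portfolio idea can be illustrated with a small sketch: featurize an event log, then recommend whichever discovery algorithm performed best on the most similar previously evaluated logs. The feature set, the candidate algorithm names, and the nearest-neighbour lookup below are illustrative assumptions, not the paper's actual recommender.

        # A minimal sketch of portfolio-based algorithm selection, not the paper's tool:
        # the log features and candidate algorithm names are illustrative assumptions.
        from dataclasses import dataclass
        import numpy as np

        @dataclass
        class LogFeatures:
            """Simple numeric features summarizing an event log (assumed, not from the paper)."""
            n_traces: int
            n_distinct_activities: int
            avg_trace_length: float

            def as_vector(self) -> np.ndarray:
                return np.array([self.n_traces, self.n_distinct_activities, self.avg_trace_length], float)

        class PortfolioRecommender:
            """Recommend a discovery algorithm by nearest-neighbour lookup over past evaluations."""
            def __init__(self):
                self.reference_vectors = []   # feature vectors of previously evaluated logs
                self.best_algorithms = []     # best-performing algorithm for each reference log

            def add_reference(self, features: LogFeatures, best_algorithm: str) -> None:
                self.reference_vectors.append(features.as_vector())
                self.best_algorithms.append(best_algorithm)

            def recommend(self, features: LogFeatures) -> str:
                X = np.vstack(self.reference_vectors)
                distances = np.linalg.norm(X - features.as_vector(), axis=1)
                return self.best_algorithms[int(np.argmin(distances))]

        # Usage: record past logs where the best algorithm is known, then query a new log.
        rec = PortfolioRecommender()
        rec.add_reference(LogFeatures(100, 12, 8.5), "inductive_miner")
        rec.add_reference(LogFeatures(5000, 60, 40.1), "heuristics_miner")
        print(rec.recommend(LogFeatures(120, 15, 9.0)))   # -> "inductive_miner"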

    Inferring causal relations from multivariate time series : a fast method for large-scale gene expression data

    Various multivariate time series analysis techniques have been developed with the aim of inferring causal relations between time series. These techniques have previously proved their effectiveness on economic and neurophysiological data, which normally consist of hundreds of samples. However, in their application to gene regulatory inference, the small sample size of gene expression time series poses an obstacle. In this paper, we describe some of the most commonly used multivariate inference techniques and show the challenges they face in gene expression analysis. In response, we propose a directed partial correlation (DPC) algorithm as an efficient and effective solution for inferring causal/regulatory relations from small-sample gene expression data. Comparative evaluations of the existing techniques and the proposed method are presented. To draw reliable conclusions, comprehensive benchmarking on datasets with various setups is essential, so three experiments are designed to assess these methods in a coherent manner. Detailed analysis of the experimental results not only reveals the good accuracy of the proposed DPC method in large-scale prediction but also gives much insight into all methods under evaluation.
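    As a rough illustration of the directed-partial-correlation idea (not the paper's exact DPC algorithm), the sketch below residualizes a source gene's past and a target gene's future on all other genes and correlates the residuals; the lag-1 setup and the toy data are assumptions.

        # A hedged sketch of a lag-1 directed partial correlation for short time series;
        # the exact DPC formulation in the paper may differ.
        import numpy as np

        def directed_partial_correlation(data: np.ndarray, source: int, target: int) -> float:
            """Partial correlation between `source` at time t and `target` at time t+1,
            controlling for all other variables at time t. data: (time_points, genes)."""
            past, future = data[:-1, :], data[1:, :]
            controls = [g for g in range(data.shape[1]) if g != source]
            Z = np.column_stack([past[:, controls], np.ones(past.shape[0])])  # confounders + intercept

            def residual(y: np.ndarray) -> np.ndarray:
                # Remove the part of y explained by the control variables.
                beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
                return y - Z @ beta

            rx = residual(past[:, source])
            ry = residual(future[:, target])
            return float(np.corrcoef(rx, ry)[0, 1])

        # Usage on toy data: gene 0 drives gene 1 with a one-step lag.
        rng = np.random.default_rng(0)
        x = rng.normal(size=(50, 3))
        x[1:, 1] += 0.8 * x[:-1, 0]
        print(directed_partial_correlation(x, source=0, target=1))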

    Essential guidelines for computational method benchmarking

    In computational biology and other sciences, researchers are frequently faced with a choice between several computational methods for performing data analyses. Benchmarking studies aim to rigorously compare the performance of different methods using well-characterized benchmark datasets, to determine the strengths of each method or to provide recommendations regarding suitable choices of methods for an analysis. However, benchmarking studies must be carefully designed and implemented to provide accurate, unbiased, and informative results. Here, we summarize key practical guidelines and recommendations for performing high-quality benchmarking analyses, based on our experiences in computational biology.
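    A minimal harness of the kind such guidelines describe is sketched below: every method runs on every benchmark dataset, and accuracy and wall-clock time are recorded together. The toy methods, dataset, and metric are illustrative assumptions rather than anything prescribed by the guidelines themselves.

        # A toy benchmarking harness: run each method on each dataset, record score and time.
        import time
        import numpy as np

        def benchmark(methods: dict, datasets: dict, metric) -> list[dict]:
            """Run every method on every dataset and collect accuracy and wall-clock time."""
            results = []
            for data_name, (X, y_true) in datasets.items():
                for method_name, method in methods.items():
                    start = time.perf_counter()
                    y_pred = method(X)
                    results.append({
                        "dataset": data_name,
                        "method": method_name,
                        "score": metric(y_true, y_pred),
                        "seconds": time.perf_counter() - start,
                    })
            return results

        # Toy example: two "methods" predicting a binary label from a single noisy feature.
        rng = np.random.default_rng(1)
        X = rng.normal(size=200)
        y = (X + rng.normal(scale=0.5, size=200)) > 0
        datasets = {"toy": (X, y)}
        methods = {
            "threshold_at_zero": lambda x: x > 0,
            "always_positive": lambda x: np.ones_like(x, dtype=bool),
        }
        accuracy = lambda yt, yp: float(np.mean(yt == yp))
        for row in benchmark(methods, datasets, accuracy):
            print(row)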

    Compressed Sensing - A New Mode of Measurement

    After introducing the concept of compressed sensing as a complementary measurement mode to the classical Shannon-Nyquist approach, I discuss some of the drivers, potential challenges, and obstacles to its implementation. I end with a speculative attempt to embed compressed sensing as an enabling methodology within the emergence of data-driven discovery. As a consequence, I predict the growth of non-nomological sciences, where heuristic correlations will find applications but will often bypass the conventional pure-basic and use-inspired-basic research stages due to the lack of verifiable hypotheses.
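    To make the contrast with Shannon-Nyquist sampling concrete, the sketch below recovers a sparse signal from far fewer random measurements than its length using l1-regularized least squares (ISTA); the signal size, sparsity level, and regularization parameter are illustrative choices, not values from the article.

        # Compressed-sensing toy example: recover a k-sparse signal of length n from m << n
        # random Gaussian measurements via iterative soft-thresholding (ISTA).
        import numpy as np

        def ista(A, y, lam=0.05, n_iter=500):
            """Iterative soft-thresholding for min_x 0.5*||Ax - y||^2 + lam*||x||_1."""
            L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
            x = np.zeros(A.shape[1])
            for _ in range(n_iter):
                grad = A.T @ (A @ x - y)
                z = x - grad / L
                x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)   # soft threshold
            return x

        # Sparse signal of length 200 with 5 non-zeros, observed through 60 random projections.
        rng = np.random.default_rng(0)
        n, m, k = 200, 60, 5
        x_true = np.zeros(n)
        x_true[rng.choice(n, k, replace=False)] = rng.normal(size=k)
        A = rng.normal(size=(m, n)) / np.sqrt(m)   # random Gaussian measurement matrix
        y = A @ x_true
        x_hat = ista(A, y)
        print("relative recovery error:", np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true))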

    Towards Better Understanding of Artifacts in Variant Calling from High-Coverage Samples

    Motivation: Whole-genome high-coverage sequencing has been widely used for personal and cancer genomics as well as in various research areas. However, in the absence of an unbiased whole-genome truth set, the global error rate of variant calls and the leading causal artifacts remain unclear, despite great efforts in the evaluation of variant calling methods. Results: We made ten SNP and INDEL call sets with two read mappers and five variant callers, both on a haploid human genome and on a diploid genome at a similar coverage. By investigating false heterozygous calls in the haploid genome, we identified erroneous realignment in low-complexity regions and the incompleteness of the reference genome with respect to the sample as the two major sources of errors, which press for continued improvements in these two areas. We estimated that the error rate of raw genotype calls is as high as 1 in 10-15kb, but the error rate of post-filtered calls is reduced to 1 in 100-200kb without a significant compromise in sensitivity. Availability: BWA-MEM alignment: http://bit.ly/1g8XqRt; Scripts: https://github.com/lh3/varcmp; Additional data: https://figshare.com/articles/Towards_better_understanding_of_artifacts_in_variating_calling_from_high_coverage_samples/981073
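    The haploid-genome trick can be sketched in a few lines: on a haploid sample every heterozygous call is by definition an artifact, so counting such calls against the number of callable bases gives a rough false-call rate. The simplified VCF parsing and the placeholder values below are assumptions, not the authors' evaluation pipeline.

        # Hedged sketch: estimate a false-call rate from heterozygous calls on a haploid sample.
        # Assumes a single-sample VCF with GT as the first FORMAT field; real pipelines should
        # use a proper VCF parser and restrict to confidently callable regions.
        def haploid_false_het_rate(vcf_path: str, callable_bases: int) -> float:
            """Return errors per callable base, estimated from heterozygous genotype calls."""
            false_het = 0
            with open(vcf_path) as vcf:
                for line in vcf:
                    if line.startswith("#"):
                        continue
                    fields = line.rstrip("\n").split("\t")
                    genotype = fields[9].split(":")[0]          # e.g. "0/1", "1|1", "0/0"
                    alleles = genotype.replace("|", "/").split("/")
                    if len(set(alleles)) > 1:                   # heterozygous => artifact on haploid
                        false_het += 1
            return false_het / callable_bases

        # Usage (the path and callable-genome size are placeholders):
        # rate = haploid_false_het_rate("haploid_sample.vcf", callable_bases=2_800_000_000)
        # print(f"about one false het per {1 / rate:,.0f} bp")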

    Uncovering Hidden Diversity in Plants

    One of the greatest challenges to human civilization in the 21st century will be to provide global food security to a growing population while reducing the environmental footprint of agriculture. Despite increasing demand, the fundamental issue of limited genetic diversity in domesticated crops opens windows of opportunity for emerging pandemics and limits the ability of modern crops to respond to a changing global environment. The wild relatives of crop plants, with large reservoirs of untapped genetic diversity, offer great potential to improve the resilience of elite cultivars. Utilizing this diversity requires advanced technologies to comprehensively identify genetic diversity and to understand the genetic architecture of beneficial traits. The primary focus of this dissertation is developing computational tools to facilitate variant discovery and trait mapping for plant genomics.

    In Chapter 1, I benchmarked the performance of variant discovery algorithms on simulated and diverse plant datasets. The comparison of sequence aligners found that BWA-MEM consistently aligned the most plant reads with high accuracy, whereas Bowtie2 had a slightly higher overall accuracy. Variant callers, such as GATK HaplotypeCaller and SAMtools mpileup, were shown to differ significantly in their ability to minimize false negatives and maximize the discovery of true positives. A cross-reference experiment using the Solanum lycopersicum and Solanum pennellii reference genomes revealed significant limitations of using a single reference genome for variant discovery. Next, I demonstrated that a machine-learning-based variant filtering strategy outperformed the traditional hard-cutoff filtering strategy, resulting in significantly more true-positive and fewer false-positive variants. Finally, I developed a two-step imputation method that achieved up to 60% higher accuracy than direct LD-based imputation methods.

    In Chapter 2, I focused on developing a trait mapping algorithm tailored for plants, considering the high levels of diversity found in plant datasets. This novel trait mapping framework, HapFM, can incorporate biological priors into the mapping model to identify causal haplotypes for traits of interest. Compared to conventional GWAS analyses, the haplotype-based approach significantly reduces the number of variables while aggregating small-effect SNPs to increase mapping power. HapFM accounts for LD between haplotype segments to infer causal haplotypes directly, and it can systematically incorporate biological priors into the probability function during the mapping process, resulting in greater mapping resolution. Overall, HapFM achieves a balance between power, interpretability, and verifiability.

    In Chapter 3, I developed a computational algorithm to select a pan-genome cohort that maximizes the haplotype representativeness of the cohort. Increasing evidence suggests that a single reference genome is often inadequate for plant diversity studies because of the extensive sequence and structural rearrangements found in many plant genomes. HapPS was developed to utilize local haplotype information to select the reference cohort. HapPS consists of three steps: genome-wide block partition, representative haplotype identification, and a genetic algorithm for reference cohort selection. The comparison of HapPS with global-distance-based selection showed that HapPS results in significantly higher block coverage in highly diverse genic regions. GO-term enrichment analysis of the highly diverse genic regions identified by HapPS showed enrichment for genes involved in defense pathways and abiotic stress, which may point to genomic regions involved in local adaptation. In summary, HapPS provides a systematic and objective solution to pan-genome cohort selection.
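    The cohort-selection objective behind HapPS can be illustrated with a toy sketch: choose k accessions whose per-block haplotypes cover as many distinct haplotypes as possible. HapPS uses a genetic algorithm for this search; the greedy set-cover heuristic and the random haplotype matrix below are illustrative simplifications, not the dissertation's implementation.

        # Toy sketch of the pan-genome cohort-selection objective (greedy stand-in for the GA).
        import numpy as np

        def greedy_cohort(haplotypes: np.ndarray, k: int) -> list[int]:
            """haplotypes[i, b] = haplotype ID of accession i in block b. Greedily add the
            accession that covers the most (block, haplotype) pairs not yet represented."""
            n_acc, n_blocks = haplotypes.shape
            covered = set()          # (block, haplotype_id) pairs already represented
            cohort = []
            for _ in range(k):
                gains = []
                for i in range(n_acc):
                    new = {(b, haplotypes[i, b]) for b in range(n_blocks)} - covered
                    gains.append(len(new) if i not in cohort else -1)
                best = int(np.argmax(gains))
                cohort.append(best)
                covered |= {(b, haplotypes[best, b]) for b in range(n_blocks)}
            return cohort

        # Toy data: 8 accessions, 5 haplotype blocks, up to 4 haplotypes per block.
        rng = np.random.default_rng(2)
        H = rng.integers(0, 4, size=(8, 5))
        cohort = greedy_cohort(H, k=3)
        coverage = len({(b, H[i, b]) for i in cohort for b in range(H.shape[1])})
        total = sum(len(set(H[:, b])) for b in range(H.shape[1]))
        print(cohort, f"{coverage} of {total} block haplotypes covered")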

    An Integrated Principal Component Analysis And Weighted Apriori-T Algorithm For Imbalanced Data Root Cause Analysis

    Root Cause Analysis (RCA) is often used in manufacturing analysis to prevent the recurrence of undesired events. Association rule mining (ARM) was introduced in RCA to extract frequently occurring patterns, interesting correlations, associations, or causal structures among items in a database. However, frequent pattern mining (FPM) using Apriori-like algorithms and the support-confidence framework inherently suffers from the rare item problem. This greatly reduces the performance of RCA, especially in the manufacturing domain, where imbalanced data is the norm in a production plant. In addition, the exponential growth of data causes high computational costs in Apriori-like algorithms. Hence, this research proposes a two-stage FPM algorithm integrating Principal Component Analysis (PCA) and Weighted Apriori-T (PCA-WAT) to address these problems. PCA is used to generate item weights from the maximally distributed covariance in order to normalise the effect of rare items: significant rare items receive a higher weight, while less significant high-occurrence items receive a lower weight. In turn, Apriori-T with an indexing enumeration tree is used for low-cost FPM. A semiconductor manufacturing case study with Work In Progress data and true alarm data is used to validate the proposed algorithm, which is benchmarked against the Apriori and Apriori-T algorithms. A comparison of weighted support was performed to evaluate the capability of PCA in normalising item support values, and the experimental results show that PCA is able to normalise item support values and reduce the influence of imbalanced data in FPM. Both quality and performance measures are used for evaluation. The quality measures compare the frequent itemsets and interesting rules generated across different support and confidence thresholds, ranging from 5% to 20% and from 10% to 90%, respectively. Rule validation involved a business analyst from the related field; the domain expert verified that the generated rules are able to explain the contributing factors in failure analysis. However, significant rare rules are not easily discovered because the normalised weighted support values are generally lower than the original support values. The performance measures compare execution time in seconds (s) and Random Access Memory (RAM) usage in megabytes (MB). The experimental results show that the implementation of Apriori-T lowered the computational cost by at least 90% in computation time and 35.33% in RAM compared to Apriori. The primary contribution of this study is a two-stage FPM algorithm that performs RCA in the manufacturing domain in the presence of imbalanced data. In conclusion, the proposed algorithm overcomes the rare item issue through covariance-based support value normalisation and the high computational cost issue through an indexing enumeration tree structure. Future work should focus on rule interpretation, to generate rules that are more understandable by novices in data mining, and on selecting suitable support and confidence thresholds after the normalisation process to better discover significant rare itemsets.
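    The two ingredients of PCA-WAT can be sketched as follows: item weights derived from the PCA loadings of the item-occurrence matrix, and itemsets scored by weighted rather than raw support. The exact weighting scheme of the thesis is not reproduced here; the loading-magnitude weights and the toy transaction data below are illustrative assumptions.

        # Illustrative variant of PCA-derived item weights and weighted support (not the
        # thesis's exact scheme): weights come from loading magnitudes on top components.
        import numpy as np
        from itertools import combinations

        def pca_item_weights(transactions: np.ndarray, n_components: int = 2) -> np.ndarray:
            """transactions: binary matrix (n_transactions, n_items). Weight each item by the
            magnitude of its loadings on the top principal components, normalized to mean 1."""
            X = transactions - transactions.mean(axis=0)
            _, _, Vt = np.linalg.svd(X, full_matrices=False)
            loadings = np.abs(Vt[:n_components]).sum(axis=0)
            return loadings / loadings.mean()

        def weighted_support(itemset: tuple, transactions: np.ndarray, weights: np.ndarray) -> float:
            """Raw support scaled by the mean weight of the items in the itemset."""
            mask = transactions[:, list(itemset)].all(axis=1)
            return mask.mean() * weights[list(itemset)].mean()

        # Toy data: item 3 is rare but always co-occurs with item 0 (the "root cause" pattern).
        rng = np.random.default_rng(3)
        T = (rng.random((200, 4)) < [0.6, 0.5, 0.55, 0.05]).astype(int)
        T[T[:, 3] == 1, 0] = 1
        w = pca_item_weights(T)
        for pair in combinations(range(4), 2):
            print(pair, round(weighted_support(pair, T, w), 4))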