20,617 research outputs found

    A systematic comparison of data- and knowledge-driven approaches to disease subtype discovery

    Get PDF
    Typical clustering analysis for large-scale genomics data combines two unsupervised learning techniques: dimensionality reduction and clustering (DR-CL) methods. It has been demonstrated that transforming gene expression to pathway-level information can improve the robustness and interpretability of disease grouping results. This approach, referred to as the biological knowledge-driven clustering (BK-CL) approach, is often neglected due to a lack of tools enabling systematic comparisons with more established DR-based methods. Moreover, classic clustering metrics based on group separability tend to favor the DR-CL paradigm, which may increase the risk of identifying less actionable disease subtypes that have ambiguous biological and clinical explanations. Hence, there is a need to develop metrics that assess biological and clinical relevance. To facilitate the systematic analysis of BK-CL methods, we propose a computational protocol for quantitative analysis of clustering results derived from both DR-CL and BK-CL methods. Moreover, we propose a new BK-CL method that combines prior knowledge of disease-relevant genes, network diffusion algorithms and gene set enrichment analysis to generate robust pathway-level information. Benchmarking studies were conducted to compare the grouping results from different DR-CL and BK-CL approaches with respect to standard clustering evaluation metrics, concordance with known subtypes, association with clinical outcomes and disease modules in co-expression networks of genes. No single approach dominated every metric, showing the importance of multi-objective evaluation in clustering analysis. However, we demonstrated that, on gene expression data sets derived from TCGA samples, the BK-CL approach can find groupings that provide significant prognostic value in both breast and prostate cancers.
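
    As an illustration of the BK-CL idea sketched above, the following is a minimal, hypothetical Python sketch: it assumes an expression matrix X (samples x genes), a gene adjacency matrix A, a list of seed indices for disease-relevant genes, and a dict mapping pathway names to gene column indices. Random walk with restart stands in for the network diffusion step and a simple diffusion-weighted z-score stands in for gene set enrichment; none of the names or parameter values come from the paper.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    def rwr(A, seeds, alpha=0.3, n_iter=100):
        # Random walk with restart over the gene network: returns one weight per gene.
        W = A / np.maximum(A.sum(axis=0, keepdims=True), 1e-12)   # column-normalize
        p0 = np.zeros(A.shape[0])
        p0[seeds] = 1.0 / len(seeds)
        p = p0.copy()
        for _ in range(n_iter):
            p = (1 - alpha) * W @ p + alpha * p0
        return p

    def pathway_scores(X, pathways, gene_weights):
        # Diffusion-weighted mean of z-scored expression per pathway (stand-in for GSEA).
        Z = StandardScaler().fit_transform(X)            # z-score each gene across samples
        cols = []
        for genes in pathways.values():
            w = gene_weights[genes] / max(gene_weights[genes].sum(), 1e-12)
            cols.append(Z[:, genes] @ w)
        return np.column_stack(cols)                     # samples x pathways

    def bk_cl(X, A, seeds, pathways, k=4):
        # Cluster samples on pathway-level features rather than raw genes.
        weights = rwr(A, seeds)
        P = pathway_scores(X, pathways, weights)
        return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(P)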

    Advances in Big Data Analytics: Algorithmic Stability and Data Cleansing

    Get PDF
    Analysis of what has come to be called “big data” presents a number of challenges as data continues to grow in size, complexity and heterogeneity. To help address these challenges, we study a pair of foundational issues in algorithmic stability (robustness and tuning), with application to clustering in high-throughput computational biology, and an issue in data cleansing (outlier detection), with application to pre-processing of streaming meteorological measurements. These issues highlight major ongoing research aspects of modern big data analytics. First, a new metric, robustness, is proposed in the setting of biological data clustering to measure an algorithm’s tendency to maintain output coherence over a range of parameter settings. It is well known that different algorithms tend to produce different clusters, and that the choice of algorithm is often driven by factors such as data size and type, similarity measure(s) employed, and the sort of clusters desired. Even within the context of a single algorithm, clusters often vary drastically depending on parameter settings. Empirical comparisons performed over a variety of algorithms and settings show highly differential performance on transcriptomic data and demonstrate that many popular methods actually perform poorly. Second, tuning strategies are studied for maximizing biological fidelity when using the well-known paraclique algorithm. Three initialization strategies are compared, using ontological enrichment as a proxy for cluster quality. Although extant paraclique codes begin by simply employing the first maximum clique found, results indicate that by generating all maximum cliques and then choosing one of highest average edge weight, one can produce a small but statistically significant expected improvement in overall cluster quality. Third, a novel outlier detection method is described that helps cleanse data by combining Pearson correlation coefficients, K-means clustering, and Singular Spectrum Analysis in a coherent framework that detects instrument failures and extreme weather events in Atmospheric Radiation Measurement sensor data. The framework is tested and found to produce more accurate results than do traditional approaches that rely on a hand-annotated database.
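
    As a rough illustration of the robustness notion described above (not the dissertation's exact metric), one can cluster the same data over a grid of parameter settings and score output coherence as the mean adjusted Rand index between clusterings at adjacent settings; the DBSCAN sweep below is an arbitrary illustrative choice.

    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn.metrics import adjusted_rand_score

    def robustness(X, eps_grid, min_samples=5):
        # Mean agreement (ARI) between clusterings produced at neighboring parameter values.
        labelings = [DBSCAN(eps=e, min_samples=min_samples).fit_predict(X) for e in eps_grid]
        pairwise = [adjusted_rand_score(a, b) for a, b in zip(labelings, labelings[1:])]
        return float(np.mean(pairwise))

    # Example: X is a (samples x genes) expression matrix; sweep eps and report coherence.
    # robustness(X, eps_grid=np.linspace(0.5, 5.0, 20))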

    Considerations about multistep community detection

    Full text link
    The problem of community detection in networks, and its implications, has attracted huge attention owing to its important applications in both the natural and social sciences. A number of algorithms have been developed to solve this problem, addressing either speed optimization or the quality of the partitions calculated. In this paper we propose a multi-step procedure bridging the fastest, but less accurate, algorithms (coarse clustering) with the slowest, most effective ones (refinement). By adopting a heuristic ranking of the nodes, and classifying a fraction of them as `critical', a refinement step can be restricted to this subset of the network, thus saving computational time. Preliminary numerical results are discussed, showing improvement of the final partition.
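
    A minimal sketch of this multi-step structure, assuming networkx, might look like the following: a fast label-propagation pass produces the coarse partition, nodes are ranked by the fraction of their neighbours that lie in other communities, the top fraction is marked `critical', and only those nodes are considered for modularity-improving moves. The ranking heuristic and refinement rule here are illustrative, not the ones proposed in the paper.

    import networkx as nx
    from networkx.algorithms.community import label_propagation_communities, modularity

    def refine_critical(G, critical_fraction=0.1):
        # Step 1: coarse clustering with a fast (but less accurate) algorithm.
        communities = [set(c) for c in label_propagation_communities(G)]
        node2comm = {n: i for i, c in enumerate(communities) for n in c}

        # Step 2: heuristic ranking -- fraction of neighbours outside the node's community.
        def boundary_score(n):
            nbrs = list(G.neighbors(n))
            if not nbrs:
                return 0.0
            return sum(node2comm[m] != node2comm[n] for m in nbrs) / len(nbrs)

        ranked = sorted(G.nodes, key=boundary_score, reverse=True)
        critical = ranked[: max(1, int(critical_fraction * G.number_of_nodes()))]

        # Step 3: refinement restricted to the critical nodes (greedy modularity moves).
        for n in critical:
            current = node2comm[n]
            best_q = modularity(G, [c for c in communities if c])
            best_comm = current
            for target in {node2comm[m] for m in G.neighbors(n)} - {current}:
                communities[current].discard(n)
                communities[target].add(n)
                q = modularity(G, [c for c in communities if c])
                if q > best_q:
                    best_q, best_comm = q, target
                communities[target].discard(n)
                communities[current].add(n)
            if best_comm != current:
                communities[current].discard(n)
                communities[best_comm].add(n)
                node2comm[n] = best_comm
        return [c for c in communities if c]

    # Example: refine_critical(nx.karate_club_graph(), critical_fraction=0.2)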

    Deep generative modeling for single-cell transcriptomics.

    Get PDF
    Single-cell transcriptome measurements can reveal unexplored biological diversity, but they suffer from technical noise and bias that must be modeled to account for the resulting uncertainty in downstream analyses. Here we introduce single-cell variational inference (scVI), a ready-to-use scalable framework for the probabilistic representation and analysis of gene expression in single cells (https://github.com/YosefLab/scVI). scVI uses stochastic optimization and deep neural networks to aggregate information across similar cells and genes and to approximate the distributions that underlie observed expression values, while accounting for batch effects and limited sensitivity. We used scVI for a range of fundamental analysis tasks including batch correction, visualization, clustering, and differential expression, and achieved high accuracy for each task.
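
    A minimal usage sketch follows, assuming the current scvi-tools packaging of scVI; the API shown here may differ from the repository version linked above, and the toy data, column names and parameter values are placeholders rather than a recommended workflow.

    import numpy as np
    import anndata
    import scanpy as sc
    import scvi

    # Toy stand-in for a real dataset: raw counts for 500 cells x 200 genes, two batches.
    counts = np.random.default_rng(0).poisson(1.0, size=(500, 200)).astype(np.float32)
    adata = anndata.AnnData(counts)
    adata.obs["batch"] = np.repeat(["b0", "b1"], 250)

    # Register the data and fit the generative model, conditioning on batch.
    scvi.model.SCVI.setup_anndata(adata, batch_key="batch")
    model = scvi.model.SCVI(adata, n_latent=10)
    model.train(max_epochs=50)

    # Batch-corrected latent representation for visualization and clustering.
    adata.obsm["X_scVI"] = model.get_latent_representation()
    sc.pp.neighbors(adata, use_rep="X_scVI")
    sc.tl.leiden(adata, key_added="cluster")

    # Differential expression between the inferred clusters.
    de = model.differential_expression(groupby="cluster")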

    Sensitivity analysis of oscillator models in the space of phase-response curves: Oscillators as open systems

    Full text link
    Oscillator models are central to the study of system properties such as entrainment or synchronization. Due to their nonlinear nature, few system-theoretic tools exist to analyze those models. The paper develops a sensitivity analysis for phase-response curves, a fundamental one-dimensional phase reduction of oscillator models. The proposed theoretical and numerical analysis tools are illustrated on several system-theoretic questions and models arising in the biology of cellular rhythms.
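
    The sensitivity analysis developed in the paper is not reproduced here, but as background, a phase-response curve can be computed numerically by the direct perturbation method; the sketch below uses a van der Pol oscillator as a stand-in model, and all numerical choices (perturbation size, phase grid, relaxation time) are illustrative.

    import numpy as np
    from scipy.integrate import solve_ivp

    def vdp(t, s, mu=1.0):
        # Van der Pol oscillator: a simple planar model with a stable limit cycle.
        x, y = s
        return [y, mu * (1 - x**2) * y - x]

    # Relax onto the limit cycle; use upward zero-crossings of x to estimate the period.
    crossing = lambda t, s: s[0]
    crossing.direction = 1
    relax = solve_ivp(vdp, (0, 100), [0.5, 0.0], events=crossing, rtol=1e-9, atol=1e-9)
    t_ev = relax.t_events[0]
    T = t_ev[-1] - t_ev[-2]                        # period estimate
    s_ref = relax.y_events[0][-1]                  # reference point taken as phase zero

    # Sample the limit cycle at n equally spaced phases.
    n = 100
    phase_times = np.linspace(0, T, n, endpoint=False)
    cyc = solve_ivp(vdp, (0, T), s_ref, t_eval=phase_times, rtol=1e-9, atol=1e-9).y.T

    # Direct method: perturb x at each phase, integrate for several periods so the
    # trajectory relaxes back, then read the asymptotic phase off the nearest cycle point.
    eps, m = 0.05, 8
    prc = np.zeros(n)
    for k in range(n):
        s0 = cyc[k] + np.array([eps, 0.0])
        end = solve_ivp(vdp, (0, m * T), s0, rtol=1e-9, atol=1e-9).y[:, -1]
        j = int(np.argmin(np.linalg.norm(cyc - end, axis=1)))
        shift = ((j - k + n / 2) % n) - n / 2      # wrap the index shift to [-n/2, n/2)
        prc[k] = 2 * np.pi * shift / (n * eps)     # phase shift per unit perturbation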