    A Clustering Comparison Measure Using Density Profiles and its Application to the Discovery of Alternate Clusterings

    Data clustering is a fundamental and very popular method of data analysis. Its subjective nature, however, means that different clustering algorithms or different parameter settings can produce widely varying and sometimes conflicting results. This has led to the use of clustering comparison measures to quantify the degree of similarity between alternative clusterings. Existing measures, though, can be limited in their ability to assess similarity and sometimes generate unintuitive results. They also cannot be applied to compare clusterings that contain different data points, an activity that is important for scenarios such as data stream analysis. In this paper, we introduce a new clustering similarity measure, known as ADCO, which aims to address some limitations of existing measures by allowing greater flexibility of comparison via the use of density profiles to characterize a clustering. In particular, it adopts a ‘data mining style’ philosophy to clustering comparison, whereby two clusterings are considered to be more similar if they are likely to give rise to similar types of prediction models. Furthermore, we show that this new measure can be applied as a highly effective objective function within a new algorithm, known as MAXIMUS, for generating alternate clusterings.

    Spatially-Aware Comparison and Consensus for Clusterings

    This paper proposes a new distance metric between clusterings that incorporates information about the spatial distribution of points and clusters. Our approach builds on the idea of a Hilbert space-based representation of clusters as a combination of the representations of their constituent points. We use this representation and the underlying metric to design a spatially-aware consensus clustering procedure. This consensus procedure is implemented via a novel reduction to Euclidean clustering, and is both simple and efficient. All of our results apply to both soft and hard clusterings. We accompany these algorithms with a detailed experimental evaluation that demonstrates the efficiency and quality of our techniques. Comment: 12 pages, 9 figures, Proceedings of the 2011 SIAM International Conference on Data Mining

    Sample Size Evaluation and Comparison of K-Means Clusterings of RNA-Seq Gene Expression Data

    The process by which DNA is transformed into gene products, such as RNA and proteins, is called gene expression. Gene expression profiling quantifies the expression of genes (amount of RNA) in a particular tissue at a particular time. Two commonly used high-throughput techniques for gene expression analysis are DNA microarrays and RNA-Seq, with RNA-Seq being the newer technique based on high-throughput sequencing. Statistical analysis is needed to deal with complex datasets, and one commonly used statistical tool is clustering. Clustering comparison is an existing area dedicated to comparing multiple clusterings from one or more clustering algorithms. However, there has been limited application of cluster comparisons to clusterings of RNA-Seq gene expression data. In particular, cluster comparisons are useful for testing the differences between clusterings obtained with a single algorithm when different samples are used for clustering. Here we use a metric for cluster comparisons that is a variation of existing metrics. The metric is simply the minimal number of genes that need to be moved from one cluster to another in one given clustering to produce another given clustering. As the metric only has genes (or elements) as units, it is easy to interpret for RNA-Seq analysis. Moreover, three different algorithmic techniques (brute force, branch-and-bound, and maximal bipartite matching) for computing the proposed metric exactly are compared in terms of time to compute, with bipartite matching being significantly more time efficient. This metric is then applied to the important issue of understanding the effect of increasing the number of RNA-Seq samples on clusterings. Three datasets were used where a large number of samples were available: mouse embryonic stem cell tissue data, Drosophila melanogaster data from multiple tissues and micro-climates, and a mouse multi-tissue dataset. 
For each, a reference clustering was computed from all of the samples and then compared to clusterings created from smaller subsets of the samples. All clusterings were created using a standard heuristic K-means clustering algorithm, while systematically varying the number of clusters and using both Euclidean distance and Manhattan distance. The clustering comparisons suggest that, for the three large datasets tested, adding more RNA-Seq samples beyond some small number has a limited impact on K-means clusterings under both Euclidean distance and Manhattan distance (Manhattan distance gives a higher variation). That is, the clusterings compiled based on a limited number of samples were all either quite similar to the reference clustering or did not improve as additional samples were added. These findings were the same for different numbers of clusters. The methods developed could also be applied to other clustering comparison problems.
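
The metric described in this abstract can be illustrated with a minimal sketch. It assumes the metric coincides with the classical transfer (partition) distance: the number of elements minus the largest total overlap achievable by matching clusters of one clustering one-to-one with clusters of the other. The function name `transfer_distance` and the label-list input format are hypothetical, and the brute-force search over matchings stands in for the faster bipartite-matching approach the abstract reports; it is not the authors' implementation.

```python
from collections import Counter
from itertools import permutations

# Sentinel for padding when the two clusterings have different numbers of clusters.
_DUMMY = object()


def transfer_distance(labels_a, labels_b):
    """Minimal number of elements to move in clustering A to obtain clustering B.

    labels_a[i] and labels_b[i] are the cluster labels of element i.
    Cluster names are irrelevant; only the grouping matters.
    Brute force over all cluster matchings, so only practical for few clusters.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same elements")

    clusters_a = sorted(set(labels_a))
    clusters_b = sorted(set(labels_b))

    # overlap[(a, b)] = number of elements in cluster a of A and cluster b of B.
    overlap = Counter(zip(labels_a, labels_b))

    # Pad the smaller side with empty dummy clusters so matchings are one-to-one.
    k = max(len(clusters_a), len(clusters_b))
    clusters_a += [_DUMMY] * (k - len(clusters_a))
    clusters_b += [_DUMMY] * (k - len(clusters_b))

    # Elements that stay in place under the best matching do not need to move.
    best_kept = 0
    for perm in permutations(clusters_b):
        kept = sum(overlap[(a, b)] for a, b in zip(clusters_a, perm))
        best_kept = max(best_kept, kept)

    return len(labels_a) - best_kept
```

For example, `transfer_distance([0, 0, 0, 1, 1, 2], [1, 1, 2, 2, 2, 0])` evaluates to 1, since moving the third element into the second cluster reproduces the target grouping. For larger numbers of clusters, the maximization over matchings is an assignment problem, which is where a bipartite-matching solver would replace the permutation loop.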

    Doctor of Philosophy

    With the tremendous growth of data produced in recent years, it is impossible to identify patterns or test hypotheses without reducing data size. Data mining is an area of science that extracts useful information from data by discovering the patterns and structures present in it. In this dissertation, we largely focus on clustering, which is often the first step in any exploratory data mining task: items that are similar to each other are grouped together, making downstream data analysis robust. Different clustering techniques have different strengths, and the resulting groupings provide different perspectives on the data. Because clustering is unsupervised, i.e., there are no domain experts who can label the data, validation of results is very difficult. While there are measures that compute "goodness" scores for clustering solutions as a whole, there are few methods that validate the assignment of individual data items to their clusters. To address these challenges, we focus on developing a framework that can generate, compare, combine, and evaluate different solutions to make more robust and significant statements about the data. In the first part of this dissertation, we present fast and efficient techniques to generate and combine different clustering solutions. We build on some recent ideas on efficient representations of clusters of partitions to develop a well-founded, spatially aware metric for comparing clusterings. With the ability to compare clusterings, we describe a heuristic to combine different solutions to produce a single high-quality clustering. We also introduce a Markov chain Monte Carlo approach to sample different clusterings from the entire landscape to provide users with a variety of choices. In the second part of this dissertation, we build certificates for individual data items and study their influence on effective data reduction. 
We present a geometric approach that defines regions of influence for data items and clusters, and we use it to develop adaptive sampling techniques to speed up machine learning algorithms. This dissertation is therefore a systematic approach to studying the landscape of clusterings in an attempt to provide a better understanding of the data.

    Time-series clustering of gene expression in irradiated and bystander fibroblasts: an application of FBPA clustering

    Background: The radiation bystander effect is an important component of the overall biological response of tissues and organisms to ionizing radiation, but the signaling mechanisms between irradiated and non-irradiated bystander cells are not fully understood. In this study, we measured a time-series of gene expression after α-particle irradiation and applied the Feature Based Partitioning around medoids Algorithm (FBPA), a new clustering method suitable for sparse time series, to identify signaling modules that act in concert in the response to direct irradiation and bystander signaling. We compared our results with those of an alternate clustering method, Short Time series Expression Miner (STEM). Results: While computational evaluations of both clustering results were similar, FBPA provided more biological insight. After irradiation, gene clusters were enriched for signal transduction, cell cycle/cell death and inflammation/immunity processes; but only FBPA separated clusters by function. In bystanders, gene clusters were enriched for cell communication/motility, signal transduction and inflammation processes; but biological functions did not separate as clearly with either clustering method as they did in irradiated samples. Network analysis confirmed p53 and NF-κB transcription factor-regulated gene clusters in irradiated and bystander cells and suggested novel regulators, such as KDM5B/JARID1B (lysine (K)-specific demethylase 5B) and HDACs (histone deacetylases), which could epigenetically coordinate gene expression after irradiation. Conclusions: In this study, we have shown that a new time series clustering method, FBPA, can provide new leads to the mechanisms regulating the dynamic cellular response to radiation. The findings implicate epigenetic control of gene expression in addition to transcription factor networks.

    Characterizing genetic intra-tumor heterogeneity across 2,658 human cancer genomes

    Intra-tumor heterogeneity (ITH) is a mechanism of therapeutic resistance and therefore an important clinical challenge. However, the extent, origin, and drivers of ITH across cancer types are poorly understood. To address this, we extensively characterize ITH across whole-genome sequences of 2,658 cancer samples spanning 38 cancer types. Nearly all informative samples (95.1%) contain evidence of distinct subclonal expansions with frequent branching relationships between subclones. We observe positive selection of subclonal driver mutations across most cancer types and identify cancer type-specific subclonal patterns of driver gene mutations, fusions, structural variants, and copy number alterations as well as dynamic changes in mutational processes between subclonal expansions. Our results underline the importance of ITH and its drivers in tumor evolution and provide a pan-cancer resource of comprehensively annotated subclonal events from whole-genome sequencing data.

    Algorithms to Explore the Structure and Evolution of Biological Networks

    High-throughput experimental protocols have revealed thousands of relationships amongst genes and proteins under various conditions. These putative associations are being aggressively mined to decipher the structural and functional architecture of the cell. One useful tool for exploring this data has been computational network analysis. In this thesis, we propose a collection of novel algorithms to explore the structure and evolution of large, noisy, and sparsely annotated biological networks. We first introduce two information-theoretic algorithms to extract interesting patterns and modules embedded in large graphs. The first, graph summarization, uses the minimum description length principle to find compressible parts of the graph. The second, VI-Cut, uses the variation of information to non-parametrically find groups of topologically cohesive and similarly annotated nodes in the network. We show that both algorithms find structure in biological data that is consistent with known biological processes, protein complexes, genetic diseases, and operational taxonomic units. We also propose several algorithms to systematically generate an ensemble of near-optimal network clusterings and show how these multiple views can be used together to identify clustering dynamics that any single solution approach would miss. To facilitate the study of ancient networks, we introduce a framework called "network archaeology" for reconstructing the node-by-node and edge-by-edge arrival history of a network. Starting with a present-day network, we apply a probabilistic growth model backwards in time to find high-likelihood previous states of the graph. This allows us to explore how interactions and modules may have evolved over time. In experiments with real-world social and biological networks, we find that our algorithms can recover significant features of ancestral networks that have long since disappeared. 
Our work is motivated by the need to understand large and complex biological systems that are being revealed to us by imperfect data. As data continues to pour in, we believe that computational network analysis will continue to be an essential tool towards this end.
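
The variation of information mentioned in this abstract is a standard distance between two clusterings of the same elements, defined as the sum of the two conditional entropies H(A|B) + H(B|A). The sketch below computes only that quantity from two label lists; the function name `variation_of_information` and the input format are assumptions made for illustration, and the code does not attempt to reproduce the VI-Cut algorithm itself.

```python
from collections import Counter
from math import log


def variation_of_information(labels_a, labels_b):
    """Variation of information (in nats) between two clusterings.

    labels_a[i] and labels_b[i] are the cluster labels of element i.
    Returns 0 when the clusterings are identical up to renaming clusters.
    """
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("both clusterings must label the same non-empty set")

    size_a = Counter(labels_a)                # cluster sizes in A
    size_b = Counter(labels_b)                # cluster sizes in B
    joint = Counter(zip(labels_a, labels_b))  # sizes of pairwise intersections

    vi = 0.0
    for (a, b), n_ab in joint.items():
        p_ab = n_ab / n
        # n_ab / size_a[a] = p(b | a) and n_ab / size_b[b] = p(a | b), so this
        # accumulates H(B|A) + H(A|B) one intersection at a time.
        vi -= p_ab * (log(n_ab / size_a[a]) + log(n_ab / size_b[b]))
    return vi
```

Two clusterings that agree up to relabeling give a value of zero, while two statistically independent clusterings give the sum of their entropies; for instance, `[0, 0, 1, 1]` against `[0, 1, 0, 1]` yields 2 ln 2.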