43 research outputs found

Clusternomics: Integrative context-dependent clustering for heterogeneous datasets.

Author: Gabasova Evelina
Reid John
Wernisch Lorenz
Publication venue: PLoS Comput Biol
Publication date: 01/10/2017

Integrative clustering is used to identify groups of samples by jointly analysing multiple datasets describing the same set of biological samples, such as gene expression, copy number and methylation. Most existing algorithms for integrative clustering assume that there is a shared consistent set of clusters across all datasets, and that most of the data samples follow this structure. However, in practice, the structure across heterogeneous datasets can be more varied, with clusters being joined in some datasets and separated in others. In this paper, we present a probabilistic clustering method to identify groups across datasets that do not share the same cluster structure. The proposed algorithm, Clusternomics, identifies groups of samples that share their global behaviour across heterogeneous datasets. The algorithm models clusters on the level of individual datasets, while also extracting global structure that arises from the local cluster assignments. Clusters on both the local and the global level are modelled using a hierarchical Dirichlet mixture model. We evaluated the model both on simulated and on real-world datasets. The simulated data exemplifies datasets with varying degrees of common structure. In such a setting, Clusternomics outperforms existing algorithms for integrative and consensus clustering. In a real-world application, we used the algorithm for cancer subtyping, identifying subtypes of cancer from heterogeneous datasets. We applied the algorithm to the TCGA breast cancer dataset, integrating gene expression, miRNA expression, DNA methylation and proteomics. The algorithm extracted clinically meaningful clusters with significantly different survival probabilities. We also evaluated the algorithm on lung and kidney cancer TCGA datasets with high dimensionality, again showing clinically significant results and scalability of the algorithm.
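
The idea of global structure arising from local cluster assignments can be illustrated with a small bookkeeping sketch. This is not the hierarchical Dirichlet mixture model from the paper, only a deterministic illustration: it assumes the local (per-dataset) labels are already known and treats each observed combination of local labels as one global cluster. The function name `global_clusters` and the toy labels are hypothetical.

```python
from collections import defaultdict


def global_clusters(local_assignments):
    """Group samples by their tuple of per-dataset (local) cluster labels.

    local_assignments: dict mapping dataset name -> list of labels,
    one label per sample, with samples in the same order in every list.
    Returns a dict mapping each observed label combination to the
    indices of the samples that show it.
    """
    names = sorted(local_assignments)
    n_samples = len(local_assignments[names[0]])
    groups = defaultdict(list)
    for i in range(n_samples):
        # The key is the sample's local label in every dataset.
        key = tuple(local_assignments[name][i] for name in names)
        groups[key].append(i)
    return dict(groups)


# Toy labels (hypothetical): samples 2, 3 and 4 form one cluster in the
# expression data but are split into two clusters in the methylation data.
example = {
    "expression": [0, 0, 1, 1, 1],
    "methylation": [0, 0, 0, 1, 1],
}
combos = global_clusters(example)
```

In this toy case three of the four possible label combinations are occupied, so two local clusters per dataset give three global groups.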

Directory of Open Access Journals

Apollo (Cambridge)

FigShare

Analysing programming languages using dependency networks

Author: Evelina Gabasova (568388)

Proponents of different programming languages often argue about the benefits of using their language of choice. In this work, we propose a more systematic approach using network analysis techniques. We examine how the choice of programming language affects the network structure of code.

FigShare

Survival curves for global clusters in the breast cancer dataset from TCGA, using the model with 3 context-specific clusters and up to 18 global clusters.

Author: Evelina Gabasova (4514947)
John Reid (4173163)
Lorenz Wernisch (9596)
Publication date: 07/04/2017

The differences between the survival curves are significant with p = 0.0382 using the log-rank test.

Hal - Université Grenoble Alpes

FigShare

Consistency between global clustering results for different number of local context-specific clusters, as measured by the ARI.

Author: Evelina Gabasova (4514947)
John Reid (4173163)
Lorenz Wernisch (9596)

The compared models were trained with 18 global clusters and 3 to 5 context-specific clusters.

FigShare

Average number of occupied clusters across different numbers of global clusters.

Author: Evelina Gabasova (4514947)
John Reid (4173163)
Lorenz Wernisch (9596)

The number of clusters is the posterior average, across the MCMC iterations, of the number of global clusters that have any samples assigned to them. The figure shows both the total number of occupied clusters and the number of clusters that have more than 5 samples assigned to them.
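
The two quantities described here can be computed from stored cluster assignments roughly as follows. This is a minimal sketch that assumes each MCMC iteration is available as a plain list of global cluster labels, one per sample; the function name `occupied_counts` and the toy draws are hypothetical and not taken from the Clusternomics code.

```python
from collections import Counter


def occupied_counts(iterations, min_size=5):
    """Average number of occupied clusters across MCMC iterations.

    iterations: list of iterations, each a list of global cluster
    labels (one per sample).
    Returns (mean number of clusters with at least one sample,
             mean number of clusters with more than min_size samples).
    """
    total_occupied = 0
    total_large = 0
    for assignment in iterations:
        sizes = Counter(assignment)  # cluster label -> number of samples
        total_occupied += len(sizes)
        total_large += sum(1 for size in sizes.values() if size > min_size)
    n = len(iterations)
    return total_occupied / n, total_large / n


# Two toy iterations (hypothetical values):
#   iteration 1: cluster sizes 6 and 2    -> 2 occupied, 1 large
#   iteration 2: cluster sizes 6, 6 and 1 -> 3 occupied, 2 large
draws = [
    [0] * 6 + [1] * 2,
    [0] * 6 + [1] * 6 + [2],
]
mean_occupied, mean_large = occupied_counts(draws)
```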

FigShare

Example of the simulated data for p = 0 and p = 0.5, which show different degrees of dependence.

Author: Evelina Gabasova (4514947)
John Reid (4173163)
Lorenz Wernisch (9596)

The x axis corresponds to the data in the first dataset (context); the y axis represents the data in the second dataset (context). The two subfigures show the two extreme situations: (a) For p = 0, we get two global clusters, and cluster membership in one dataset fully determines cluster membership in the other. (b) For p = 0.5, we get four global clusters, where cluster membership in one dataset is fully independent of cluster membership in the second dataset.
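
A label-only sketch of the two extremes is given below. The caption states only the endpoints (two global clusters at p = 0, four at p = 0.5); the mechanism used here, where a sample's cluster in the second context is flipped with probability p, is an assumption chosen to reproduce those endpoints. It generates cluster labels only, not the plotted data values, and `simulate_labels` is a hypothetical name.

```python
import random


def simulate_labels(n, p, seed=0):
    """Draw (first-context, second-context) cluster labels for n samples.

    Assumption: each sample picks cluster 0 or 1 uniformly in the first
    context; in the second context it keeps that label, except that the
    label is flipped with probability p.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        first = rng.randrange(2)
        flipped = rng.random() < p
        second = 1 - first if flipped else first
        pairs.append((first, second))
    return pairs


# p = 0: labels always agree, so only (0, 0) and (1, 1) occur.
dependent = set(simulate_labels(200, 0.0))
# p = 0.5: the second label is a fair coin, so all four pairs occur.
independent = set(simulate_labels(200, 0.5))
```

Each distinct label pair corresponds to one global cluster, so the size of each set is the number of global clusters in that setting.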

FigShare

ARI comparing global clustering of simulated datasets for varying values of p (see Fig 2).

Author: Evelina Gabasova (4514947)
John Reid (4173163)
Lorenz Wernisch (9596)

Each point corresponds to one algorithm applied to one dataset; the plot also shows the loess curve for each method. Higher values correspond to better agreement between the estimated cluster assignments and the true cluster membership.
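
For reference, the adjusted Rand index (ARI) used in these comparisons can be computed from two label vectors with the standard pair-counting formula. The sketch below is a plain-Python version for illustration; it is not the code used to produce the figure, and the function name is our own.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two clusterings of the same samples."""
    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two samples: trivially identical

    # Contingency table cells and its row/column marginals.
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)

    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())

    expected = sum_rows * sum_cols / total_pairs
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both clusterings are one cluster
    return (sum_cells - expected) / (max_index - expected)


# Same partition with renamed labels: perfect agreement.
same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# Every true cluster is split evenly: worse than chance, so negative.
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The index is 1 for identical partitions regardless of how the labels are named, and close to 0 when agreement is at the level expected by chance.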

FigShare

Consistency between local clustering results for different number of global clusters with 3 context-specific clusters, as measured by the ARI.

Author: Evelina Gabasova (4514947)
John Reid (4173163)
Lorenz Wernisch (9596)

The ARI values show several local optima. (a) Gene expression context. (b) DNA methylation context. (c) miRNA context. (d) RPPA context.

FigShare

Consistency between global clustering results for different number of global clusters with 3 context-specific clusters, as measured by the ARI.

Author: Evelina Gabasova (4514947)
John Reid (4173163)
Lorenz Wernisch (9596)

FigShare
