Multiple kernel learning for integrative consensus clustering of omic datasets.
MOTIVATION: Diverse applications, particularly in tumour subtyping, have demonstrated the importance of integrative clustering techniques for combining information from multiple data sources. Cluster Of Clusters Analysis (COCA) is one such approach that has been widely applied in the context of tumour subtyping. However, the properties of COCA have never been systematically explored, and its robustness to the inclusion of noisy datasets is unclear. RESULTS: We rigorously benchmark COCA, and present Kernel Learning Integrative Clustering (KLIC) as an alternative strategy. KLIC frames the challenge of combining clustering structures as a multiple kernel learning problem, in which different datasets each provide a weighted contribution to the final clustering. This allows the contribution of noisy datasets to be down-weighted relative to more informative datasets. We compare the performance of KLIC and COCA in a variety of situations through simulation studies. We also present the output of KLIC and COCA in real data applications to cancer subtyping and transcriptional module discovery. AVAILABILITY AND IMPLEMENTATION: R packages klic and coca are available on the Comprehensive R Archive Network. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
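To make the idea concrete, here is a minimal Python sketch of kernel-based integrative clustering: one kernel per dataset, combined as a weighted sum so that a noisy dataset contributes little. This illustrates the underlying principle only, not the KLIC algorithm itself, which learns the weights via multiple kernel learning; the fixed weights and simulated data below are purely hypothetical.

```python
# Minimal sketch: weighted combination of per-dataset kernels, then
# clustering on the combined kernel. Not the KLIC algorithm itself.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.cluster import SpectralClustering
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
n = 90
labels = np.repeat([0, 1, 2], n // 3)

# Dataset 1 is informative (well-separated clusters); dataset 2 is pure noise.
X1 = rng.normal(labels[:, None] * 3.0, 1.0, size=(n, 5))
X2 = rng.normal(0.0, 1.0, size=(n, 5))

K1, K2 = rbf_kernel(X1), rbf_kernel(X2)

# Weighted combination: down-weight the noisy dataset.
w = np.array([0.9, 0.1])          # hypothetical learned weights
K = w[0] * K1 + w[1] * K2

pred = SpectralClustering(n_clusters=3, affinity="precomputed",
                          random_state=0).fit_predict(K)
print("ARI:", adjusted_rand_score(labels, pred))
```

With the informative dataset up-weighted, clustering on the combined kernel should recover the true groups even though half of the input is noise.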
A graph theoretical approach to data fusion.
The rapid development of high-throughput experimental techniques has resulted in a growing diversity of genomic datasets being produced and requiring analysis. It is therefore increasingly recognized that we can gain a deeper understanding of the underlying biology by combining the insights obtained from multiple, diverse datasets. We thus propose a novel scalable computational approach to unsupervised data fusion. Our technique exploits network representations of the data to identify similarities among the datasets. We may work within the Bayesian formalism, using Bayesian nonparametric approaches to model each dataset, or (for fast, approximate, massive-scale data fusion) we can naturally switch to more heuristic modeling techniques. An advantage of the proposed approach is that each dataset can initially be modeled independently (in parallel), before applying a fast post-processing step to perform data integration. This allows us to incorporate new experimental data in an online fashion, without having to rerun all of the analysis. We first demonstrate the applicability of our tool on artificial data, and then on examples from the literature, which include yeast cell cycle, breast cancer and sporadic inclusion body myositis datasets.
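The "model in parallel, fuse afterwards" workflow can be sketched as follows, assuming the simplest possible fusion step: build a nearest-neighbour graph per dataset, average the adjacency matrices, and cluster the fused network. The paper's actual method is more sophisticated; this only illustrates the workflow, and all data and parameters are made up.

```python
# Sketch: one similarity graph per dataset (parallelisable), then a
# cheap fusion-and-cluster post-processing step.
import numpy as np
from sklearn.neighbors import kneighbors_graph
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(1)
n = 60
labels = np.repeat([0, 1], n // 2)
X1 = rng.normal(labels[:, None] * 2.5, 1.0, size=(n, 10))  # e.g. expression
X2 = rng.normal(labels[:, None] * 2.5, 1.0, size=(n, 10))  # e.g. methylation

# Step 1 (independent, parallelisable): one graph per dataset.
A1 = kneighbors_graph(X1, n_neighbors=10).toarray()
A2 = kneighbors_graph(X2, n_neighbors=10).toarray()

# Step 2 (fast post-processing): fuse the networks and cluster.
A = (A1 + A2) / 2
A = np.maximum(A, A.T)           # symmetrise the fused adjacency
pred = SpectralClustering(n_clusters=2, affinity="precomputed",
                          random_state=0).fit_predict(A)
```

Because each graph is built independently, incorporating a new dataset only requires computing its graph and re-running the cheap fusion step, which is what makes online updates possible.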
Tailored Bayes: a risk modeling framework under unequal misclassification costs.
Risk prediction models are a crucial tool in healthcare. Risk prediction models with a binary outcome (i.e., binary classification models) are often constructed using methodology that assumes the costs of different classification errors are equal. In many healthcare applications, this assumption is not valid, and the differences between misclassification costs can be quite large. For instance, in a diagnostic setting, the cost of misdiagnosing a person with a life-threatening disease as healthy may be larger than the cost of misdiagnosing a healthy person as a patient. In this article, we present Tailored Bayes (TB), a novel Bayesian inference framework which "tailors" model fitting to optimize predictive performance with respect to unbalanced misclassification costs. We use simulation studies to showcase when TB is expected to outperform standard Bayesian methods in the context of logistic regression. We then apply TB to three real-world applications: cardiac surgery, a breast cancer prognostication task, and a breast cancer tumour classification task, and demonstrate the improvement in predictive performance over standard methods.
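The standard alternative that TB is compared against leaves model fitting unchanged and adjusts only the decision threshold: for false-positive cost c_fp and false-negative cost c_fn, expected cost is minimised by predicting the positive class whenever the predicted probability exceeds c_fp / (c_fp + c_fn), a classical decision-theoretic result. A minimal sketch of that baseline, with hypothetical costs and simulated data:

```python
# Cost-weighted decision threshold for a binary classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 500
X = rng.normal(size=(n, 3))
y = (X @ np.array([1.5, -1.0, 0.5]) + rng.normal(size=n) > 0).astype(int)

p = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]

c_fp, c_fn = 1.0, 10.0            # hypothetical: missing disease is 10x worse
threshold = c_fp / (c_fp + c_fn)  # = 1/11, far below the default 0.5

for t in (0.5, threshold):
    pred = (p > t).astype(int)
    cost = (c_fp * ((pred == 1) & (y == 0)).sum()
            + c_fn * ((pred == 0) & (y == 1)).sum())
    print(f"threshold {t:.3f}: total cost {cost:.0f}")
```

The cost-weighted threshold typically yields a lower total cost than the default 0.5; TB goes further by propagating the cost asymmetry into the model fit itself.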
Cellular population dynamics control the robustness of the stem cell niche.
Within populations of cells, fate decisions are controlled by an indeterminate combination of cell-intrinsic and cell-extrinsic factors. In the case of stem cells, the stem cell niche is believed to maintain 'stemness' through communication and interactions between the stem cells and one or more other cell types that contribute to the niche conditions. To investigate the robustness of cell fate decisions in the stem cell hierarchy and the role that the niche plays, we introduce simple mathematical models of stem and progenitor cells, their progeny and their interplay in the niche. These models capture the fundamental processes of proliferation and differentiation, and allow us to consider alternative possibilities regarding how niche-mediated signalling feedback regulates the niche dynamics. Generalised stability analysis of these stem cell niche systems enables us to describe the stability properties of each model. We find that although the number of feasible states depends on the model, their probabilities of stability in general do not: stem cell-niche models are stable across a wide range of parameters. We demonstrate that niche-mediated feedback increases the number of stable steady states, and show how distinct cell states have distinct branching characteristics. The ecological feedback and interactions mediated by the stem cell niche thus lend surprisingly high levels of robustness to the stem and progenitor cell population dynamics. Furthermore, cell-cell interactions are sufficient for populations of stem cells and their progeny to achieve stability and maintain homeostasis. We show that the robustness of the niche, and hence of the stem cell pool in the niche, depends only weakly, if at all, on the complexity of the niche make-up: simple as well as complicated niche systems are capable of supporting robust and stable stem cell dynamics.
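A minimal sketch of the kind of model analysed here: a two-compartment system in which stem cells S divide at rate r and self-renew with a probability p(P) that is down-regulated by the progenitor pool P (niche-mediated negative feedback). The equations and parameter values below are illustrative, not taken from the paper.

```python
# Stem/progenitor ODE model with niche-mediated negative feedback.
import numpy as np
from scipy.integrate import solve_ivp

r, d, p0, k = 1.0, 0.5, 0.8, 0.01   # division, loss, max self-renewal, feedback

def rhs(t, y):
    S, P = y
    p = p0 / (1 + k * P)              # more progenitors -> less self-renewal
    dS = (2 * p - 1) * r * S          # net stem-cell growth
    dP = 2 * (1 - p) * r * S - d * P  # differentiation inflow minus loss
    return [dS, dP]

sol = solve_ivp(rhs, (0, 100), [10.0, 0.0])
S_end, P_end = sol.y[:, -1]
# Analytic steady state for these parameters: p(P*) = 1/2 gives
# P* = (2*p0 - 1)/k = 60, and dP = 0 then gives S* = d*P*/r = 30.
print(f"steady state: S = {S_end:.1f}, P = {P_end:.1f}")
```

The feedback drives the self-renewal probability to 1/2 at equilibrium, which is what stabilises the stem cell pool across a wide range of parameter choices.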
Noise-augmented directional clustering of genetic association data identifies distinct mechanisms underlying obesity.
Clustering genetic variants based on their associations with different traits can provide insight into their underlying biological mechanisms. Existing clustering approaches typically group variants based on the similarity of their association estimates for various traits. We present a new procedure for clustering variants based on their proportional associations with different traits, which is more reflective of the underlying mechanisms to which they relate. The method is based on a mixture model approach for directional clustering, and includes a noise cluster that provides robustness to outliers. The procedure performs well across a range of simulation scenarios. In an applied setting, clustering genetic variants associated with body mass index generates groups reflective of distinct biological pathways. Mendelian randomization analyses support that the clusters vary in their effect on coronary heart disease, including one cluster that represents elevated body mass index with a favourable metabolic profile and reduced coronary heart disease risk. Analysis of the biological pathways underlying this cluster identifies inflammation as potentially explaining differences in the effects of increased body mass index on coronary heart disease.
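The key idea, clustering on proportional rather than raw associations, amounts to clustering the directions of the variant-trait association vectors: two variants whose associations differ only in magnitude should land in the same group. A rough sketch, normalising each vector to unit length before clustering; the paper uses a mixture model with an explicit noise component, whereas plain k-means on the unit sphere below is only a stand-in, and all association values are simulated.

```python
# Directional clustering sketch: normalise association vectors, then cluster.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)

# Hypothetical per-variant associations with 4 traits, from two mechanisms
# that share a direction but not a magnitude.
dir1 = np.array([1.0, 0.5, 0.0, -0.5])
dir2 = np.array([0.0, 1.0, -1.0, 0.5])
beta = np.vstack([s * d + rng.normal(0, 0.05, 4)
                  for d in (dir1, dir2) for s in rng.uniform(0.2, 2.0, 25)])

unit = beta / np.linalg.norm(beta, axis=1, keepdims=True)  # directions only
pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(unit)
```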
MR-Clust: clustering of genetic variants in Mendelian randomization with similar causal estimates.
MOTIVATION: Mendelian randomization is an epidemiological technique that uses genetic variants as instrumental variables to estimate the causal effect of a risk factor on an outcome. We consider a scenario in which causal estimates based on each variant in turn differ more strongly than expected by chance alone, but the variants can be divided into distinct clusters, such that all variants in a cluster have similar causal estimates. This scenario is likely to occur when there are several distinct causal mechanisms by which a risk factor influences an outcome with different magnitudes of causal effect. We have developed an algorithm, MR-Clust, that finds such clusters of variants, and so can identify variants that reflect distinct causal mechanisms. Two features of our clustering algorithm are that it accounts for differential uncertainty in the causal estimates, and it includes 'null' and 'junk' clusters, to provide protection against the detection of spurious clusters. RESULTS: Our algorithm correctly detected the number of clusters in a simulation analysis, outperforming methods that either do not account for uncertainty or do not include null and junk clusters. In an applied example considering the effect of blood pressure on coronary artery disease risk, the method detected four clusters of genetic variants. A post hoc hypothesis-generating search suggested that variants in the cluster with a negative effect of blood pressure on coronary artery disease risk were more strongly related to trunk fat percentage and other adiposity measures than variants not in this cluster. AVAILABILITY AND IMPLEMENTATION: MR-Clust can be downloaded from https://github.com/cnfoley/mrclust. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
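The structure of such a mixture can be sketched as follows: substantive clusters are normal components whose variance includes each variant's squared standard error (accounting for differential uncertainty), a 'null' component is centred at zero, and a flat 'junk' component absorbs variants that fit nowhere. All parameter values below are fixed by hand for illustration; MR-Clust estimates them from the data.

```python
# Cluster responsibilities in an MR-Clust-style mixture (fixed parameters).
import numpy as np
from scipy.stats import norm

b = np.array([0.10, 0.12, 0.48, 0.52, 0.01, 1.90])   # per-variant causal estimates
s = np.array([0.03, 0.05, 0.04, 0.06, 0.03, 0.40])   # their standard errors

mus = np.array([0.11, 0.50])     # hypothetical substantive cluster means
like = np.column_stack(
    [norm.pdf(b, mu, np.sqrt(s**2 + 0.01**2)) for mu in mus]  # clusters
    + [norm.pdf(b, 0.0, s),                                   # null cluster
       np.full_like(b, 0.1)]                                  # flat junk density
)
resp = like / like.sum(axis=1, keepdims=True)
print(np.round(resp, 2))   # rows: variants; columns: cluster 1, cluster 2, null, junk
```

In this toy setting the first four variants split between the two substantive clusters, the near-zero estimate is claimed by the null component, and the wildly imprecise outlier falls into the junk component.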
GPseudoClust: deconvolution of shared pseudo-profiles at single-cell resolution.
MOTIVATION: Many methods have been developed to cluster genes on the basis of their changes in mRNA expression over time, using bulk RNA-seq or microarray data. However, single-cell data may present a particular challenge for these algorithms, since the temporal ordering of cells is not directly observed. One way to address this is to first use pseudotime methods to order the cells, and then apply clustering techniques for time course data. However, pseudotime estimates are subject to high levels of uncertainty, and failing to account for this uncertainty is liable to lead to erroneous and/or over-confident gene clusters. RESULTS: The proposed method, GPseudoClust, is a novel approach that jointly infers pseudotemporal ordering and gene clusters, and quantifies the uncertainty in both. GPseudoClust combines a recent method for pseudotime inference with nonparametric Bayesian clustering methods, efficient Markov chain Monte Carlo sampling and novel subsampling strategies that aid computation. We consider a broad array of simulated and experimental datasets to demonstrate the effectiveness of GPseudoClust in a range of settings. AVAILABILITY AND IMPLEMENTATION: An implementation is available on GitHub: https://github.com/magStra/nonparametricSummaryPSM and https://github.com/magStra/GPseudoClust. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
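Clustering uncertainty of this kind is commonly summarised by a posterior similarity matrix (PSM): the posterior probability, estimated across MCMC samples, that two genes share a cluster. A minimal sketch of that computation, with made-up allocation samples standing in for real sampler output:

```python
# Posterior similarity matrix from MCMC cluster allocations.
import numpy as np

rng = np.random.default_rng(4)
n_genes, n_samples = 6, 200

# Hypothetical MCMC output: a base two-cluster allocation, with each
# gene's label flipped in 10% of samples to mimic posterior uncertainty.
base = np.array([0, 0, 0, 1, 1, 1])
flip = rng.random((n_samples, n_genes)) < 0.1
alloc = np.where(flip, 1 - base, base)

# PSM[i, j] = fraction of samples in which genes i and j co-cluster.
psm = (alloc[:, :, None] == alloc[:, None, :]).mean(axis=0)
print(np.round(psm, 2))
```

Within-group pairs show high co-clustering probabilities and cross-group pairs low ones, so the PSM conveys both the clustering and the confidence in it.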
MDI-GPU: accelerating integrative modelling for genomic-scale data using GP-GPU computing.
The integration of multi-dimensional datasets remains a key challenge in systems biology and genomic medicine. Modern high-throughput technologies generate a broad array of different data types, providing distinct, but often complementary, information. However, the large amount of data adds burden to any inference task. Flexible Bayesian methods may reduce the necessity for strong modelling assumptions, but can also increase the computational burden. We present an improved implementation of a Bayesian correlated clustering algorithm that permits integrated clustering to be routinely performed across multiple datasets, each with tens of thousands of items. By exploiting GPU-based computation, we are able to improve runtime performance of the algorithm by almost four orders of magnitude. This permits analysis across genomic-scale datasets, greatly expanding the range of applications over those originally possible. MDI is available here: http://www2.warwick.ac.uk/fac/sci/systemsbiology/research/software/
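The expensive inner loop in samplers of this kind, evaluating every item's likelihood under every cluster, is embarrassingly parallel, which is why GPU computation pays off. A minimal sketch of that pattern, vectorised with NumPy as a stand-in for the GPU kernels; the Gaussian likelihood below is illustrative, not MDI's actual model.

```python
# Batched item-by-cluster likelihood evaluation: the parallelisable core.
import numpy as np

rng = np.random.default_rng(5)
n_items, n_clusters, n_features = 20_000, 10, 4

X = rng.normal(size=(n_items, n_features))
means = rng.normal(size=(n_clusters, n_features))

# Log-likelihood of every item under every cluster in one batched
# operation (n_items x n_clusters) instead of a double loop; on a GPU
# each entry maps naturally to an independent thread.
diff = X[:, None, :] - means[None, :, :]
loglik = -0.5 * (diff ** 2).sum(axis=-1)
alloc = loglik.argmax(axis=1)   # a Gibbs sampler would sample here instead
```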
Model selection in systems biology depends on experimental design.
Experimental design attempts to maximise the information available for modelling tasks. An optimal experiment allows the inferred models or parameters to be chosen with the highest expected degree of confidence. If the true system is faithfully reproduced by one of the models, the merit of this approach is clear: we simply wish to identify it and the true parameters with the most certainty. However, in the more realistic situation where all models are incorrect or incomplete, the interpretation of model selection outcomes and the role of experimental design need to be examined more carefully. Using a novel experimental design and model selection framework for stochastic state-space models, we perform high-throughput in silico analyses on families of gene regulatory cascade models, to show that the selected model can depend on the experiment performed. We observe that experimental design thus makes confidence a criterion for model choice, but that this does not necessarily correlate with a model's predictive power or correctness. Finally, in the special case of linear ordinary differential equation (ODE) models, we explore how wrong a model has to be before it influences the conclusions of a model selection analysis.
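The phenomenon can be illustrated in miniature with two deliberately misspecified models: data are generated from y = exp(-x) plus noise, and BIC selects between a linear and a quadratic approximation. Which model is chosen depends on where the design places the sample points; all numbers below are illustrative.

```python
# Model selection (BIC) outcome depends on the experimental design.
import numpy as np

rng = np.random.default_rng(6)

def bic(x, y, degree):
    coeffs = np.polyfit(x, y, degree)
    rss = ((y - np.polyval(coeffs, x)) ** 2).sum()
    n, k = len(x), degree + 1
    return n * np.log(rss / n) + k * np.log(n)

for lo, hi in [(0.0, 0.2), (0.0, 4.0)]:        # two experimental designs
    x = np.linspace(lo, hi, 50)
    y = np.exp(-x) + rng.normal(0, 0.1, x.size)
    winner = min((1, 2), key=lambda d: bic(x, y, d))
    print(f"design x in [{lo}, {hi}]: BIC selects degree {winner}")
```

With these settings the narrow design typically favours the simpler linear model (the curvature is invisible at that scale), while the wide design favours the quadratic; neither model is correct, yet each design confidently selects one.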