
    Multiple kernel learning for integrative consensus clustering of omic datasets.

    MOTIVATION: Diverse applications, particularly in tumour subtyping, have demonstrated the importance of integrative clustering techniques for combining information from multiple data sources. Cluster Of Clusters Analysis (COCA) is one such approach that has been widely applied in the context of tumour subtyping. However, the properties of COCA have never been systematically explored, and its robustness to the inclusion of noisy datasets is unclear. RESULTS: We rigorously benchmark COCA, and present Kernel Learning Integrative Clustering (KLIC) as an alternative strategy. KLIC frames the challenge of combining clustering structures as a multiple kernel learning problem, in which different datasets each provide a weighted contribution to the final clustering. This allows the contribution of noisy datasets to be down-weighted relative to more informative datasets. We compare the performances of KLIC and COCA in a variety of situations through simulation studies. We also present the output of KLIC and COCA in real data applications to cancer subtyping and transcriptional module discovery. AVAILABILITY AND IMPLEMENTATION: R packages klic and coca are available on the Comprehensive R Archive Network. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
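    As an illustration of the underlying idea, consider the sketch below. It is a hypothetical Python approximation, not the API of the klic R package: each dataset contributes a similarity kernel over the same samples, a weighted sum of the kernels is clustered once, and the weights (fixed here; KLIC learns them by multiple kernel learning) determine how strongly a noisy dataset influences the final clustering.

    ```python
    # Illustrative sketch of kernel-based integrative clustering; not the
    # klic package's implementation. The weights w are assumed, not learned.
    import numpy as np
    from sklearn.metrics.pairwise import rbf_kernel
    from sklearn.cluster import SpectralClustering

    rng = np.random.default_rng(0)
    X1 = rng.normal(size=(100, 20))   # dataset 1: same 100 samples
    X2 = rng.normal(size=(100, 30))   # dataset 2: e.g. a noisier source

    K1, K2 = rbf_kernel(X1), rbf_kernel(X2)   # one kernel per dataset
    w = np.array([0.7, 0.3])                  # hypothetical learned weights
    K = w[0] * K1 + w[1] * K2                 # combined kernel

    labels = SpectralClustering(
        n_clusters=3, affinity="precomputed", random_state=0
    ).fit_predict(K)
    ```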

    A graph theoretical approach to data fusion.

    The rapid development of high-throughput experimental techniques has resulted in a growing diversity of genomic datasets being produced and requiring analysis. It is therefore increasingly recognized that we can gain a deeper understanding of the underlying biology by combining the insights obtained from multiple, diverse datasets. We propose a novel, scalable computational approach to unsupervised data fusion. Our technique exploits network representations of the data to identify similarities among the datasets. We may work within the Bayesian formalism, using Bayesian nonparametric approaches to model each dataset, or (for fast, approximate, and massive-scale data fusion) switch naturally to more heuristic modeling techniques. An advantage of the proposed approach is that each dataset can initially be modeled independently (in parallel), before a fast post-processing step is applied to perform data integration. This allows us to incorporate new experimental data in an online fashion, without having to rerun all of the analysis. We first demonstrate the applicability of our tool on artificial data, and then on examples from the literature, including yeast cell cycle, breast cancer and sporadic inclusion body myositis datasets.
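    The "model independently, fuse afterwards" pattern can be sketched as follows. This is a hypothetical Python illustration of the general idea, not the authors' implementation: each dataset is clustered on its own, each clustering is converted into a co-clustering network, and the networks are averaged and re-clustered in a single cheap post-processing step.

    ```python
    # Hypothetical sketch of network-based data fusion; not the paper's code.
    import numpy as np
    from sklearn.cluster import KMeans, SpectralClustering

    def coclustering_matrix(labels):
        """Network representation: 1 where two samples share a cluster."""
        labels = np.asarray(labels)
        return (labels[:, None] == labels[None, :]).astype(float)

    rng = np.random.default_rng(1)
    datasets = [rng.normal(size=(80, d)) for d in (10, 25, 40)]

    # Step 1: independent per-dataset models (parallelisable).
    nets = [coclustering_matrix(
                KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X))
            for X in datasets]

    # Step 2: fast post-processing fuses the network representations.
    fused = np.mean(nets, axis=0)
    labels = SpectralClustering(n_clusters=4, affinity="precomputed",
                                random_state=0).fit_predict(fused)
    ```

    Adding a new dataset then only requires fitting one more per-dataset model and recomputing the cheap fusion step, which is what makes online updates possible.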

    Tailored Bayes: a risk modeling framework under unequal misclassification costs.

    Risk prediction models are a crucial tool in healthcare. Risk prediction models with a binary outcome (i.e., binary classification models) are often constructed using methodology which assumes that the costs of different classification errors are equal. In many healthcare applications, this assumption is not valid, and the differences between misclassification costs can be quite large. For instance, in a diagnostic setting, the cost of misdiagnosing a person with a life-threatening disease as healthy may be larger than the cost of misdiagnosing a healthy person as a patient. In this article, we present Tailored Bayes (TB), a novel Bayesian inference framework which "tailors" model fitting to optimize predictive performance with respect to unbalanced misclassification costs. We use simulation studies to showcase when TB is expected to outperform standard Bayesian methods in the context of logistic regression. We then apply TB to three real-world applications: a cardiac surgery task, a breast cancer prognostication task, and a breast cancer tumor classification task, and demonstrate the improvement in predictive performance over standard methods.
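    To make the unequal-cost setting concrete, the sketch below shows the standard decision-theoretic baseline rather than the TB framework itself: with false-negative cost c_fn and false-positive cost c_fp, the Bayes-optimal rule classifies a case as positive when the predicted probability exceeds c_fp / (c_fp + c_fn), rather than the default 0.5. TB goes further by tailoring the model fitting itself to the costs; all numbers below are assumptions chosen for illustration.

    ```python
    # Cost-sensitive decision rule on top of a standard classifier; a baseline
    # illustration of unequal misclassification costs, not the TB method.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(2)
    X = rng.normal(size=(500, 5))
    y = (X[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)

    model = LogisticRegression().fit(X, y)
    p = model.predict_proba(X)[:, 1]

    c_fn, c_fp = 10.0, 1.0              # missing disease assumed 10x worse
    threshold = c_fp / (c_fp + c_fn)    # 1/11 instead of the default 0.5
    y_pred = (p > threshold).astype(int)
    ```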

    Cellular population dynamics control the robustness of the stem cell niche.

    Within populations of cells, fate decisions are controlled by an indeterminate combination of cell-intrinsic and cell-extrinsic factors. In the case of stem cells, the stem cell niche is believed to maintain 'stemness' through communication and interactions between the stem cells and one or more other cell types that contribute to the niche conditions. To investigate the robustness of cell fate decisions in the stem cell hierarchy and the role that the niche plays, we introduce simple mathematical models of stem and progenitor cells, their progeny and their interplay in the niche. These models capture the fundamental processes of proliferation and differentiation, and allow us to consider alternative possibilities for how niche-mediated signalling feedback regulates the niche dynamics. Generalised stability analysis of these stem cell niche systems enables us to describe the stability properties of each model. We find that although the number of feasible states depends on the model, their probabilities of stability in general do not: stem cell niche models are stable across a wide range of parameters. We demonstrate that niche-mediated feedback increases the number of stable steady states, and show how distinct cell states have distinct branching characteristics. The ecological feedback and interactions mediated by the stem cell niche thus lend surprisingly high levels of robustness to the stem and progenitor cell population dynamics. Furthermore, cell-cell interactions are sufficient for populations of stem cells and their progeny to achieve stability and maintain homeostasis. We show that the robustness of the niche, and hence of the stem cell pool within it, depends only weakly, if at all, on the complexity of the niche make-up: simple as well as complicated niche systems are capable of supporting robust and stable stem cell dynamics.
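    A minimal sketch of the modelling approach (with made-up equations and rates, not the paper's models) is given below: stem cells S self-renew with a probability p that is down-regulated by feedback from progenitors P, and stability is read off the eigenvalues of the Jacobian at the steady state, in the spirit of a generalised stability analysis.

    ```python
    # Toy stem cell / progenitor model with niche-mediated feedback;
    # all rates and functional forms are illustrative assumptions.
    import numpy as np

    r, d, p0, k = 1.0, 0.5, 0.8, 0.1    # division, loss, self-renewal, feedback

    def rhs(t, x):
        S, P = x
        p = p0 / (1.0 + k * P)                # feedback lowers self-renewal
        dS = r * S * (2.0 * p - 1.0)          # net stem cell growth
        dP = 2.0 * r * (1.0 - p) * S - d * P  # differentiation in, loss out
        return np.array([dS, dP])

    # Steady state: feedback must balance self-renewal at p = 1/2.
    P_star = (2.0 * p0 - 1.0) / k
    S_star = d * P_star / r

    # Numerical Jacobian at the steady state; negative real parts => stable.
    eps = 1e-6
    f0 = rhs(0, [S_star, P_star])
    J = np.column_stack([(rhs(0, [S_star + eps, P_star]) - f0) / eps,
                         (rhs(0, [S_star, P_star + eps]) - f0) / eps])
    print(np.linalg.eigvals(J))
    ```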

    Noise-augmented directional clustering of genetic association data identifies distinct mechanisms underlying obesity.

    Funder: NIHR Cambridge Biomedical Research Centre
    Clustering genetic variants based on their associations with different traits can provide insight into their underlying biological mechanisms. Existing clustering approaches typically group variants based on the similarity of their association estimates for various traits. We present a new procedure for clustering variants based on their proportional associations with different traits, which is more reflective of the underlying mechanisms to which they relate. The method is based on a mixture model approach for directional clustering and includes a noise cluster that provides robustness to outliers. The procedure performs well across a range of simulation scenarios. In an applied setting, clustering genetic variants associated with body mass index generates groups reflective of distinct biological pathways. Mendelian randomization analyses support that the clusters vary in their effect on coronary heart disease, including one cluster that represents elevated body mass index with a favourable metabolic profile and reduced coronary heart disease risk. Analysis of the biological pathways underlying this cluster identifies inflammation as potentially explaining differences in the effects of increased body mass index on coronary heart disease.
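    The directional idea can be illustrated with a simple sketch (a spherical k-means approximation with a similarity cut-off, not the authors' mixture model): association estimates are scaled to unit vectors, so variants whose effects are proportional across traits point in the same direction and cluster together, while variants close to no cluster centre fall into the noise class.

    ```python
    # Illustrative directional clustering with a noise class; the mixture-model
    # formulation in the paper is replaced by spherical k-means for brevity.
    import numpy as np

    rng = np.random.default_rng(3)
    beta = rng.normal(size=(200, 6))    # variant-by-trait association estimates
    U = beta / np.linalg.norm(beta, axis=1, keepdims=True)

    K, noise_cut = 3, 0.6               # assumed cluster count and threshold
    centres = U[rng.choice(len(U), K, replace=False)]
    for _ in range(50):
        sims = U @ centres.T                    # cosine similarity to centres
        lab = sims.argmax(axis=1)
        lab[sims.max(axis=1) < noise_cut] = -1  # noise cluster: near no centre
        for j in range(K):
            members = U[lab == j]
            if len(members):
                c = members.mean(axis=0)
                centres[j] = c / np.linalg.norm(c)
    ```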

    MDI-GPU: accelerating integrative modelling for genomic-scale data using GP-GPU computing.

    The integration of multi-dimensional datasets remains a key challenge in systems biology and genomic medicine. Modern high-throughput technologies generate a broad array of different data types, providing distinct, but often complementary, information. However, the large amount of data adds burden to any inference task. Flexible Bayesian methods may reduce the necessity for strong modelling assumptions, but can also increase the computational burden. We present an improved implementation of a Bayesian correlated clustering algorithm that permits integrated clustering to be routinely performed across multiple datasets, each with tens of thousands of items. By exploiting GPU-based computation, we are able to improve the runtime performance of the algorithm by almost four orders of magnitude. This permits analysis across genomic-scale datasets, greatly expanding the range of applications beyond those originally possible. MDI is available here: http://www2.warwick.ac.uk/fac/sci/systemsbiology/research/software/
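    The kind of computation that benefits from GPU acceleration here can be sketched as follows. This is a hypothetical illustration, not the MDI code: the per-item cluster-allocation probabilities in a sampling sweep are independent across items, so they vectorise naturally, and cupy can stand in for numpy on a CUDA GPU.

    ```python
    # Vectorised cluster-allocation step, the kind of inner loop MDI-GPU
    # accelerates; all model quantities below are placeholder assumptions.
    import numpy as np
    try:
        import cupy as xp               # GPU arrays if CUDA is available
    except ImportError:
        xp = np                         # CPU fallback keeps the sketch runnable

    n_items, n_clusters, n_genes = 20_000, 50, 100
    X = xp.asarray(np.random.default_rng(4).normal(size=(n_items, n_genes)))
    mu = xp.zeros((n_clusters, n_genes))     # placeholder cluster means
    log_w = xp.zeros(n_clusters)             # placeholder log mixture weights

    # Log-likelihood of every item under every cluster in one shot.
    sq = ((X * X).sum(axis=1)[:, None]
          - 2.0 * (X @ mu.T)
          + (mu * mu).sum(axis=1)[None, :])
    logp = log_w - 0.5 * sq
    logp -= logp.max(axis=1, keepdims=True)  # stabilise before exponentiating
    resp = xp.exp(logp)
    resp /= resp.sum(axis=1, keepdims=True)  # allocation probabilities per item
    ```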

    Model selection in systems biology depends on experimental design.

    Experimental design attempts to maximise the information available for modelling tasks. An optimal experiment allows the inferred models or parameters to be chosen with the highest expected degree of confidence. If the true system is faithfully reproduced by one of the models, the merit of this approach is clear: we simply wish to identify it, and the true parameters, with the most certainty. However, in the more realistic situation where all models are incorrect or incomplete, the interpretation of model selection outcomes and the role of experimental design need to be examined more carefully. Using a novel experimental design and model selection framework for stochastic state-space models, we perform high-throughput in-silico analyses on families of gene regulatory cascade models to show that the selected model can depend on the experiment performed. We observe that experimental design thus makes confidence a criterion for model choice, but that this does not necessarily correlate with a model's predictive power or correctness. Finally, in the special case of linear ordinary differential equation (ODE) models, we explore how wrong a model has to be before it influences the conclusions of a model selection analysis.
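    A toy illustration of the central point (far simpler than the paper's stochastic state-space framework) is sketched below: the same pair of candidate models, scored on data from two different experimental designs, can come out ranked differently, even though neither candidate matches the true system. The designs, models, and noise level are all assumptions chosen for the demonstration.

    ```python
    # Model selection outcome depending on experimental design; a toy sketch.
    import numpy as np

    def bic(y, yhat, n_params):
        n, rss = len(y), ((y - yhat) ** 2).sum()
        return n * np.log(rss / n) + n_params * np.log(n)

    rng = np.random.default_rng(5)

    def true_system(t):
        return np.sin(t)                # neither candidate model matches this

    for design, t in [("early sampling", np.linspace(0, 1, 30)),
                      ("late sampling", np.linspace(0, 6, 30))]:
        y = true_system(t) + 0.05 * rng.normal(size=t.size)
        scores = {}
        for name, deg in [("linear model", 1), ("cubic model", 3)]:
            coef = np.polyfit(t, y, deg)
            scores[name] = bic(y, np.polyval(coef, t), deg + 1)
        print(design, "->", min(scores, key=scores.get))
    ```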