Bayesian correlated clustering to integrate multiple datasets
Motivation: The integration of multiple datasets remains a key challenge in systems biology and genomic medicine. Modern high-throughput technologies generate a broad array of different data types, providing distinct – but often complementary – information. We present a Bayesian method for the unsupervised integrative modelling of multiple datasets, which we refer to as MDI (Multiple Dataset Integration). MDI can integrate information from a wide range of different datasets and data types simultaneously (including the ability to model time series data explicitly using Gaussian processes). Each dataset is modelled using a Dirichlet-multinomial allocation (DMA) mixture model, with dependencies between these models captured via parameters that describe the agreement among the datasets.
Results: Using a set of 6 artificially constructed time series datasets, we show that MDI is able to integrate a significant number of datasets simultaneously, and that it successfully captures the underlying structural similarity between the datasets. We also analyse a variety of real S. cerevisiae datasets. In the 2-dataset case, we show that MDI’s performance is comparable to the present state of the art. We then move beyond the capabilities of current approaches and integrate gene expression, ChIP-chip and protein-protein interaction data, to identify a set of protein complexes for which genes are co-regulated during the cell cycle. Comparisons to other unsupervised data integration techniques – as well as to non-integrative approaches – demonstrate that MDI is very competitive, while also providing information that would be difficult or impossible to extract using other methods.
Discovering transcriptional modules by Bayesian data integration
Motivation: We present a method for directly inferring transcriptional modules (TMs) by integrating gene expression and transcription factor binding (ChIP-chip) data. Our model extends a hierarchical Dirichlet process mixture model to allow data fusion on a gene-by-gene basis. This encodes the intuition that co-expression and co-regulation are not necessarily equivalent and hence we do not expect all genes to group similarly in both datasets. In particular, it allows us to identify the subset of genes that share the same structure of transcriptional modules in both datasets.
Results: We find that by working on a gene-by-gene basis, our model is able to extract clusters with greater functional coherence than existing methods. By combining gene expression and transcription factor binding (ChIP-chip) data in this way, we are better able to determine the groups of genes that are most likely to represent underlying TMs.
Detection of regulator genes and eQTLs in gene networks
Genetic differences between individuals associated with quantitative phenotypic
traits, including disease states, are usually found in non-coding genomic
regions. These genetic variants are often also associated with differences in
expression levels of nearby genes (they are "expression quantitative trait
loci" or eQTLs for short) and presumably play a gene regulatory role, affecting
the status of molecular networks of interacting genes, proteins and
metabolites. Computational systems biology approaches to reconstruct causal
gene networks from large-scale omics data have therefore become essential to
understand the structure of networks controlled by eQTLs together with other
regulatory genes, and to generate detailed hypotheses about the molecular
mechanisms that lead from genotype to phenotype. Here we review the main
analytical methods and software to identify eQTLs and their associated genes,
to reconstruct co-expression networks and modules, to reconstruct causal
Bayesian gene and module networks, and to validate predicted networks in
silico.
Comment: minor revision with typos corrected; review article; 24 pages, 2 figures
Machine Learning and Integrative Analysis of Biomedical Big Data.
Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges and exacerbates those associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues.
Mixture model with multiple allocations for clustering spatially correlated observations in the analysis of ChIP-Seq data
Model-based clustering is a technique widely used to group a collection of
units into mutually exclusive groups. There are, however, situations in which
an observation could in principle belong to more than one cluster. In the
context of Next-Generation Sequencing (NGS) experiments, for example, the
signal observed in the data might be produced by two (or more) different
biological processes operating together and a gene could participate in both
(or all) of them. We propose a novel approach to cluster NGS discrete data,
coming from a ChIP-Seq experiment, with a mixture model, allowing each unit to
belong potentially to more than one group: these multiple allocation clusters
can be flexibly defined via a function combining the features of the original
groups without introducing new parameters. The formulation naturally gives rise
to a `zero-inflation group' in which values close to zero can be allocated,
acting as a correction for the abundance of zeros that manifest in this type of
data. We take into account the spatial dependency between observations, which
is described through a latent Conditional Auto-Regressive process that can
reflect different dependency patterns. We assess the performance of our model
within a simulation environment and then apply it to real ChIP-Seq data.
Comment: 25 pages; 3 tables, 6 figures
Inferring clonal evolution of tumors from single nucleotide somatic mutations
High-throughput sequencing allows the detection and quantification of
frequencies of somatic single nucleotide variants (SNV) in heterogeneous tumor
cell populations. In some cases, the evolutionary history and population
frequency of the subclonal lineages of tumor cells present in the sample can be
reconstructed from these SNV frequency measurements. However, automated methods
to do this reconstruction are not available and the conditions under which
reconstruction is possible have not been described.
We describe the conditions under which the evolutionary history can be
uniquely reconstructed from SNV frequencies from single or multiple samples
from the tumor population and we introduce a new statistical model, PhyloSub,
that infers the phylogeny and genotype of the major subclonal lineages
represented in the population of cancer cells. It uses a Bayesian nonparametric
prior over trees that groups SNVs into major subclonal lineages and
automatically estimates the number of lineages and their ancestry. We sample
from the joint posterior distribution over trees to identify evolutionary
histories and cell population frequencies that have the highest probability of
generating the observed SNV frequency data. When multiple phylogenies are
consistent with a given set of SNV frequencies, PhyloSub represents the
uncertainty in the tumor phylogeny using a partial order plot. Experiments on a
simulated dataset and two real datasets comprising tumor samples from acute
myeloid leukemia and chronic lymphocytic leukemia patients demonstrate that
PhyloSub can infer both linear (or chain) and branching lineages and its
inferences are in good agreement with ground truth, where it is available.