Sparse multitask regression for identifying common mechanism of response to therapeutic targets
Motivation: Molecular association of phenotypic responses is an important step in hypothesis generation and for initiating design of new experiments. Current practices for associating gene expression data with multidimensional phenotypic data are typically (i) performed one-to-one, i.e. each gene is examined independently with a phenotypic index and (ii) tested with one stress condition at a time, i.e. different perturbations are analyzed separately. As a result, the complex coordination among the genes responsible for a phenotypic profile is potentially lost. More importantly, univariate analysis can potentially hide new insights into common mechanism of response
A general framework for penalized mixed-effects multitask learning with applications on DNA methylation surrogate biomarkers creation
Recent evidence highlights the usefulness of DNA methylation (DNAm)
biomarkers as surrogates for exposure to risk factors for noncommunicable
diseases in epidemiological studies and randomized trials. DNAm variability
has been demonstrated to be tightly related to lifestyle behavior and exposure
to environmental risk factors, ultimately providing an unbiased proxy of
an individual's state of health. At present, the creation of DNAm surrogates
relies on univariate penalized regression models, with elastic-net regularizer
being the gold standard when accomplishing the task. Nonetheless, more advanced
modeling procedures are required in the presence of multivariate outcomes
with a structured dependence pattern among the study samples. In this
work we propose a general framework for mixed-effects multitask learning
in the presence of high-dimensional predictors to develop a multivariate DNAm
biomarker from a multicenter study. A penalized estimation scheme, based
on an expectation-maximization algorithm, is devised in which any penalty
criterion for fixed-effects models can be conveniently incorporated into the fitting
process. We apply the proposed methodology to create novel DNAm
surrogate biomarkers for multiple correlated risk factors for cardiovascular
diseases and comorbidities. We show that the proposed approach, modeling
multiple outcomes together, outperforms state-of-the-art alternatives both in
predictive power and biomolecular interpretation of the results
Interactive Exploration of Multitask Dependency Networks
Scientists increasingly depend on machine learning algorithms to discover patterns in complex data. Two examples addressed in this dissertation are identifying how information sharing among regions of the brain develops due to learning, and learning dependency networks of blood proteins associated with cancer. Dependency networks, or graphical models, are learned from the observed data in order to make comparisons between the sub-populations of the dataset. Rarely is there sufficient data to infer robust individual networks for each sub-population. The multiple networks must therefore be considered simultaneously, which explodes the hypothesis space of the learning problem. Exploring this complex solution space requires input from the domain scientist to refine the objective function. This dissertation introduces a framework to incorporate domain knowledge in transfer learning to facilitate the exploration of solutions. The framework is a generalization of existing algorithms for multiple network structure identification. Solutions produced with human input narrow down the variance of solutions to those that answer questions of interest to domain scientists. Patterns, such as differences between networks, are learned with higher confidence using transfer learning than through the standard method of bootstrapping. Transfer learning may be the ideal method for making comparisons among dependency networks, whether looking for similarities or differences. Domain knowledge input and visualization of solutions are combined in an interactive tool that enables domain scientists to explore the space of solutions efficiently
A Fair Experimental Comparison of Neural Network Architectures for Latent Representations of Multi-Omics for Drug Response Prediction
Recent years have seen a surge of novel neural network architectures for the
integration of multi-omics data for prediction. Most of the architectures
include either encoders alone or encoders and decoders, i.e., autoencoders of
various sorts, to transform multi-omics data into latent representations. One
important parameter is the depth of integration: the point at which the latent
representations are computed or merged, which can be either early,
intermediate, or late. The literature on integration methods is growing
steadily; however, close to nothing is known about the relative performance of
these methods under fair experimental conditions and under consideration of
different use cases. We developed a comparison framework that trains and
optimizes multi-omics integration methods under equal conditions. We
incorporated early integration and four recently published deep learning
methods: MOLI, Super.FELT, OmiEmbed, and MOMA. Further, we devised a novel
method, Omics Stacking, that combines the advantages of intermediate and late
integration. Experiments were conducted on a public drug response data set with
multiple omics data (somatic point mutations, somatic copy number profiles and
gene expression profiles) that was obtained from cell lines, patient-derived
xenografts, and patient samples. Our experiments confirmed that early
integration has the lowest predictive performance. Overall, architectures that
integrate triplet loss achieved the best results. Statistical differences can
rarely be observed overall; however, in terms of the average ranks of methods,
Super.FELT consistently performs best in a cross-validation setting and
Omics Stacking best in an external test set setting. The source code of all
experiments is available at
https://github.com/kramerlab/Multi-Omics_analysis
Joint learning from multiple information sources for biological problems
Thanks to technological advancements, more and more biological data have been generated in recent years. Data availability offers unprecedented opportunities to look at the same problem from multiple aspects. It also unveils a more global view of the problem that takes into account the intricate interplay between the involved molecules/entities. Nevertheless, biological datasets are biased, limited in quantity, and contain many false-positive samples. Such challenges often drastically degrade the performance of a predictive model on unseen data and, thus, limit its applicability in real biological studies.
Human learning is a multi-stage process in which we usually start with simple things. Through the knowledge accumulated over time, our cognitive ability extends to more complex concepts. Children learn to speak simple words before being able to formulate sentences. Similarly, being able to speak correct sentences supports our learning to speak correct and meaningful paragraphs, etc. Generally, knowledge acquired from related learning tasks helps boost our learning capability in the current task. Motivated by this phenomenon, in this thesis, we study supervised machine learning models for bioinformatics problems that can improve their performance by exploiting multiple related knowledge sources. More specifically, we are concerned with ways to enrich the supervised models' knowledge base with publicly available related data to enhance the computational models' prediction performance.
Our work shares commonality with existing works in multimodal learning, multi-task learning, and transfer learning. Nevertheless, there are certain differences in some cases. Besides the proposed architectures, we present large-scale experiment setups with consensus evaluation metrics along with the creation and release of large datasets to showcase our approaches' superiority. Moreover, we add case studies with detailed analyses in which we make no simplifying assumptions to demonstrate the systems' utility in realistic application scenarios. Finally, we develop and make available an easy-to-use website for non-expert users to query the model's generated prediction results to facilitate field experts' assessments and adaptation. We believe that our work serves as one of the first steps in bridging the gap between "Computer Science" and "Biology" that will open a new era of fruitful collaboration between computer scientists and biological field experts
Data- and knowledge-based modeling of gene regulatory networks: an update
Gene regulatory network inference is a systems biology approach which predicts interactions between genes with the help of high-throughput data. In this review, we present current and updated network inference methods focusing on novel techniques for data acquisition, network inference assessment, network inference for interacting species and the integration of prior knowledge. Following the advance of next-generation sequencing of cDNAs derived from RNA samples (RNA-Seq), we discuss in detail its application to network inference. Furthermore, we present progress for large-scale or even full-genomic network inference as well as for small-scale condensed network inference and review advances in the evaluation of network inference methods by crowdsourcing. Finally, we reflect on the current availability of data and prior knowledge sources and give an outlook for the inference of gene regulatory networks that reflect interacting species, in particular pathogen-host interactions