207 research outputs found

    Sparse multitask regression for identifying common mechanism of response to therapeutic targets

    Get PDF
    Motivation: Molecular association of phenotypic responses is an important step in hypothesis generation and for initiating design of new experiments. Current practices for associating gene expression data with multidimensional phenotypic data are typically (i) performed one-to-one, i.e. each gene is examined independently with a phenotypic index and (ii) tested with one stress condition at a time, i.e. different perturbations are analyzed separately. As a result, the complex coordination among the genes responsible for a phenotypic profile is potentially lost. More importantly, univariate analysis can potentially hide new insights into common mechanism of response

    A general framework for penalized mixed-effects multitask learning with applications on DNA methylation surrogate biomarkers creation

    Get PDF
    Recent evidence highlights the usefulness of DNA methylation (DNAm) biomarkers as surrogates for exposure to risk factors for noncommunicable diseases in epidemiological studies and randomized trials. DNAm variability has been demonstrated to be tightly related to lifestyle behavior and exposure to environmental risk factors, ultimately providing an unbiased proxy of an individual state of health. At present, the creation of DNAm surrogates relies on univariate penalized regression models, with elastic-net regularizer being the gold standard when accomplishing the task. Nonetheless, more advanced modeling procedures are required in the presence of multivariate outcomes with a structured dependence pattern among the study samples. In this work we propose a general framework for mixed-effects multitask learning in presence of high-dimensional predictors to develop a multivariate DNAm biomarker from a multicenter study. A penalized estimation scheme, based on an expectation-maximization algorithm, is devised in which any penalty criteria for fixed-effects models can be conveniently incorporated in the fitting process. We apply the proposed methodology to create novel DNAm surrogate biomarkers for multiple correlated risk factors for cardiovascular diseases and comorbidities. We show that the proposed approach, modeling multiple outcomes together, outperforms state-of-the-art alternatives both in predictive power and biomolecular interpretation of the results

    Interactive Exploration of Multitask Dependency Networks

    Get PDF
    Scientists increasingly depend on machine learning algorithms to discover patterns in complex data. Two examples addressed in this dissertation are identifying how information sharing among regions of the brain develops due to learning; and, learning dependency networks of blood proteins associated with cancer. Dependency networks, or graphical models, are learned from the observed data in order to make comparisons between the sub-populations of the dataset. Rarely is there sufficient data to infer robust individual networks for each sub-population. The multiple networks must be considered simultaneously; exploding the hypothesis space of the learning problem. Exploring this complex solution space requires input from the domain scientist to refine the objective function. This dissertation introduces a framework to incorporate domain knowledge in transfer learning to facilitate the exploration of solutions. The framework is a generalization of existing algorithms for multiple network structure identification. Solutions produced with human input narrow down the variance of solutions to those that answer questions of interest to domain scientists. Patterns, such as identifying differences between networks, are learned with higher confidence using transfer learning than through the standard method of bootstrapping. Transfer learning may be the ideal method for making comparisons among dependency networks, whether looking for similarities or differences. Domain knowledge input and visualization of solutions are combined in an interactive tool that enables domain scientists to explore the space of solutions efficiently

    A Fair Experimental Comparison of Neural Network Architectures for Latent Representations of Multi-Omics for Drug Response Prediction

    Full text link
    Recent years have seen a surge of novel neural network architectures for the integration of multi-omics data for prediction. Most of the architectures include either encoders alone or encoders and decoders, i.e., autoencoders of various sorts, to transform multi-omics data into latent representations. One important parameter is the depth of integration: the point at which the latent representations are computed or merged, which can be either early, intermediate, or late. The literature on integration methods is growing steadily, however, close to nothing is known about the relative performance of these methods under fair experimental conditions and under consideration of different use cases. We developed a comparison framework that trains and optimizes multi-omics integration methods under equal conditions. We incorporated early integration and four recently published deep learning methods: MOLI, Super.FELT, OmiEmbed, and MOMA. Further, we devised a novel method, Omics Stacking, that combines the advantages of intermediate and late integration. Experiments were conducted on a public drug response data set with multiple omics data (somatic point mutations, somatic copy number profiles and gene expression profiles) that was obtained from cell lines, patient-derived xenografts, and patient samples. Our experiments confirmed that early integration has the lowest predictive performance. Overall, architectures that integrate triplet loss achieved the best results. Statistical differences can, overall, rarely be observed, however, in terms of the average ranks of methods, Super.FELT is consistently performing best in a cross-validation setting and Omics Stacking best in an external test set setting. The source code of all experiments is available under \url{https://github.com/kramerlab/Multi-Omics_analysis

    Joint learning from multiple information sources for biological problems

    Get PDF
    Thanks to technological advancements, more and more biological data havebeen generated in recent years. Data availability offers unprecedented opportunities to look at the same problem from multiple aspects. It also unveils a more global view of the problem that takes into account the intricated inter-play between the involved molecules/entities. Nevertheless, biological datasets are biased, limited in quantity, and contain many false-positive samples. Such challenges often drastically downgrade the performance of a predictive model on unseen data and, thus, limit its applicability in real biological studies. Human learning is a multi-stage process in which we usually start with simple things. Through the accumulated knowledge over time, our cognition ability extends to more complex concepts. Children learn to speak simple words before being able to formulate sentences. Similarly, being able to speak correct sentences supports our learning to speak correct and meaningful paragraphs, etc. Generally, knowledge acquired from related learning tasks would help boost our learning capability in the current task. Motivated by such a phenomenon, in this thesis, we study supervised machine learning models for bioinformatics problems that can improve their performance through exploiting multiple related knowledge sources. More specifically, we concern with ways to enrich the supervised models’ knowledge base with publicly available related data to enhance the computational models’ prediction performance. Our work shares commonality with existing works in multimodal learning, multi-task learning, and transfer learning. Nevertheless, there are certain differences in some cases. Besides the proposed architectures, we present large-scale experiment setups with consensus evaluation metrics along with the creation and release of large datasets to showcase our approaches’ superiority. Moreover, we add case studies with detailed analyses in which we place no simplified assumptions to demonstrate the systems’ utilities in realistic application scenarios. Finally, we develop and make available an easy-to-use website for non-expert users to query the model’s generated prediction results to facilitate field experts’ assessments and adaptation. We believe that our work serves as one of the first steps in bridging the gap between “Computer Science” and “Biology” that will open a new era of fruitful collaboration between computer scientists and biological field experts

    Data- and knowledge-based modeling of gene regulatory networks: an update

    Get PDF
    Gene regulatory network inference is a systems biology approach which predicts interactions between genes with the help of high-throughput data. In this review, we present current and updated network inference methods focusing on novel techniques for data acquisition, network inference assessment, network inference for interacting species and the integration of prior knowledge. After the advance of Next-Generation-Sequencing of cDNAs derived from RNA samples (RNA-Seq) we discuss in detail its application to network inference. Furthermore, we present progress for large-scale or even full-genomic network inference as well as for small-scale condensed network inference and review advances in the evaluation of network inference methods by crowdsourcing. Finally, we reflect the current availability of data and prior knowledge sources and give an outlook for the inference of gene regulatory networks that reflect interacting species, in particular pathogen-host interactions
    • 

    corecore