56 research outputs found

    A message passing framework with multiple data integration for miRNA-disease association prediction

    Get PDF
    Micro RNA or miRNA is a highly conserved class of non-coding RNA that plays an important role in many diseases. Identifying miRNA-disease associations can pave the way for better clinical diagnosis and finding potential drug targets. We propose a biologically-motivated data-driven approach for the miRNA-disease association prediction, which overcomes the data scarcity problem by exploiting information from multiple data sources. The key idea is to enrich the existing miRNA/disease-protein-coding gene (PCG) associations via a message passing framework, followed by the use of disease ontology information for further feature filtering. The enriched and filtered PCG associations are then used to construct the inter-connected miRNA-PCG-disease network to train a structural deep network embedding (SDNE) model. Finally, the pre-trained embeddings and the biologically relevant features from the miRNA family and disease semantic similarity are concatenated to form the pair input representations to a Random Forest classifier whose task is to predict the miRNA-disease association probabilities. We present large-scale comparative experiments, ablation, and case studies to showcase our approach’s superiority. Besides, we make the model prediction results for 1618 miRNAs and 3679 diseases, along with all related information, publicly available at http://software.mpm.leibniz-ai-lab.de/ to foster assessments and future adoption

    Large margin methods for partner specific prediction of interfaces in protein complexes

    Get PDF
    2014 Spring.The study of protein interfaces and binding sites is a very important domain of research in bioinformatics. Information about the interfaces between proteins can be used not only in understanding protein function but can also be directly employed in drug design and protein engineering. However, the experimental determination of protein interfaces is cumbersome, expensive and not possible in some cases with today's technology. As a consequence, the computational prediction of protein interfaces from sequence and structure has emerged as a very active research area. A number of machine learning based techniques have been proposed for the solution to this problem. However, the prediction accuracy of most such schemes is very low. In this dissertation we present large-margin classification approaches that have been designed to directly model different aspects of protein complex formation as well as the characteristics of available data. Most existing machine learning techniques for this task are partner-independent in nature, i.e., they ignore the fact that the binding propensity of a protein to bind to another protein is dependent upon characteristics of residues in both proteins. We have developed a pairwise support vector machine classifier called PAIRpred to predict protein interfaces in a partner-specific fashion. Due to its more detailed model of the problem, PAIRpred offers state of the art accuracy in predicting both binding sites at the protein level as well as inter-protein residue contacts at the complex level. PAIRpred uses sequence and structure conservation, local structural similarity and surface geometry, residue solvent exposure and template based features derived from the unbound structures of proteins forming a protein complex. We have investigated the impact of explicitly modeling the inter-dependencies between residues that are imposed by the overall structure of a protein during the formation of a protein complex through transductive and semi-supervised learning models. We also present a novel multiple instance learning scheme called MI-1 that explicitly models imprecision in sequence-level annotations of binding sites in proteins that bind calmodulin to achieve state of the art prediction accuracy for this task

    Joint learning from multiple information sources for biological problems

    Get PDF
    Thanks to technological advancements, more and more biological data havebeen generated in recent years. Data availability offers unprecedented opportunities to look at the same problem from multiple aspects. It also unveils a more global view of the problem that takes into account the intricated inter-play between the involved molecules/entities. Nevertheless, biological datasets are biased, limited in quantity, and contain many false-positive samples. Such challenges often drastically downgrade the performance of a predictive model on unseen data and, thus, limit its applicability in real biological studies. Human learning is a multi-stage process in which we usually start with simple things. Through the accumulated knowledge over time, our cognition ability extends to more complex concepts. Children learn to speak simple words before being able to formulate sentences. Similarly, being able to speak correct sentences supports our learning to speak correct and meaningful paragraphs, etc. Generally, knowledge acquired from related learning tasks would help boost our learning capability in the current task. Motivated by such a phenomenon, in this thesis, we study supervised machine learning models for bioinformatics problems that can improve their performance through exploiting multiple related knowledge sources. More specifically, we concern with ways to enrich the supervised models’ knowledge base with publicly available related data to enhance the computational models’ prediction performance. Our work shares commonality with existing works in multimodal learning, multi-task learning, and transfer learning. Nevertheless, there are certain differences in some cases. Besides the proposed architectures, we present large-scale experiment setups with consensus evaluation metrics along with the creation and release of large datasets to showcase our approaches’ superiority. Moreover, we add case studies with detailed analyses in which we place no simplified assumptions to demonstrate the systems’ utilities in realistic application scenarios. Finally, we develop and make available an easy-to-use website for non-expert users to query the model’s generated prediction results to facilitate field experts’ assessments and adaptation. We believe that our work serves as one of the first steps in bridging the gap between “Computer Science” and “Biology” that will open a new era of fruitful collaboration between computer scientists and biological field experts

    An empirical study of ensemble-based semi-supervised learning approaches for imbalanced splice site datasets

    Full text link

    Combining learning and constraints for genome-wide protein annotation

    Get PDF
    BackgroundThe advent of high-throughput experimental techniques paved the way to genome-wide computational analysis and predictive annotation studies. When considering the joint annotation of a large set of related entities, like all proteins of a certain genome, many candidate annotations could be inconsistent, or very unlikely, given the existing knowledge. A sound predictive framework capable of accounting for this type of constraints in making predictions could substantially contribute to the quality of machine-generated annotations at a genomic scale.ResultsWe present Ocelot, a predictive pipeline which simultaneously addresses functional and interaction annotation of all proteins of a given genome. The system combines sequence-based predictors for functional and protein-protein interaction (PPI) prediction with a consistency layer enforcing (soft) constraints as fuzzy logic rules. The enforced rules represent the available prior knowledge about the classification task, including taxonomic constraints over each GO hierarchy (e.g. a protein labeled with a GO term should also be labeled with all ancestor terms) as well as rules combining interaction and function prediction. An extensive experimental evaluation on the Yeast genome shows that the integration of prior knowledge via rules substantially improves the quality of the predictions. The system largely outperforms GoFDR, the only high-ranking system at the last CAFA challenge with a readily available implementation, when GoFDR is given access to intra-genome information only (as Ocelot), and has comparable or better results (depending on the hierarchy and performance measure) when GoFDR is allowed to use information from other genomes. Our system also compares favorably to recent methods based on deep learning

    Positive Definite Kernels in Machine Learning

    Full text link
    This survey is an introduction to positive definite kernels and the set of methods they have inspired in the machine learning literature, namely kernel methods. We first discuss some properties of positive definite kernels as well as reproducing kernel Hibert spaces, the natural extension of the set of functions {k(x,⋅),x∈X}\{k(x,\cdot),x\in\mathcal{X}\} associated with a kernel kk defined on a space X\mathcal{X}. We discuss at length the construction of kernel functions that take advantage of well-known statistical models. We provide an overview of numerous data-analysis methods which take advantage of reproducing kernel Hilbert spaces and discuss the idea of combining several kernels to improve the performance on certain tasks. We also provide a short cookbook of different kernels which are particularly useful for certain data-types such as images, graphs or speech segments.Comment: draft. corrected a typo in figure
    • 

    corecore