    Context-specific subcellular localization prediction: Leveraging protein interaction networks and scientific texts

    Zhu L. Context-specific subcellular localization prediction: Leveraging protein interaction networks and scientific texts. Bielefeld: Universität Bielefeld; 2018. One essential task in proteomics analysis is to explore the functions of proteins in conducting and regulating activities at the subcellular level. Compartmentalization of cells allows proteins to perform their activities efficiently. A protein functions correctly only if it occurs at the right place, at the right time, and interacts with the right molecules. Therefore, knowledge of protein subcellular localization (SCL) can provide valuable insights for understanding protein functions and related cellular mechanisms. Thus, the systematic study of the subcellular distribution of human proteins is an essential task for fully characterizing the human proteome. Context-specific analysis is an important and challenging task in systems biology research. Proteins may perform different functions in different subcellular compartments (SCCs). Hence, the dynamic and context-specific alterations of the subcellular spatial distribution of proteins are essential in identifying cellular function. While this important feature is well known in molecular and cell biology, most large-scale protein annotation studies to date have ignored it. Tissue is one particularly crucial biological context for human biology. Proteins show their tissue specificity at the subcellular level by localizing to different SCCs in different tissues. For example, glutamine synthetase localizes to mitochondria in liver cells but to the cytoplasm in brain cells. Knowledge of tissue-specific SCLs can enrich human protein annotation, and thus will increase our understanding of human biology. Conventional wet-lab experiments are used to determine the SCL of proteins.
Due to the expense and low throughput of wet-lab experimental approaches, various algorithms and tools have been developed for predicting protein SCLs by integrating biological background knowledge into machine learning methods. Most of the existing approaches are designed for general genome-wide large-scale analysis. Thus, they cannot be used for context-specific analysis of protein SCL. The focus of this work is to develop new methods to perform tissue-specific SCL prediction. (1) First, we developed Bayesian collective Markov Random Fields (BCMRFs) to address the general multi-SCL problem. BCMRFs integrate both protein-protein interaction network (PPIN) features and protein sequence features, consider the spatial adjacency of SCCs, and employ transductive learning on imbalanced SCL data sets. Our experimental results show that BCMRFs achieve higher performance than the state-of-the-art PPI-based method in SCL prediction. (2) We then integrated BCMRFs into a novel end-to-end computational approach to perform tissue-specific SCL prediction on tissue-specific PPINs. In total, 1314 proteins whose SCLs were previously shown to be cell-line dependent were successfully localized based on nine tissue-specific PPINs. Furthermore, 549 new tissue-specific localized candidate proteins were predicted and confirmed by scientific literature. Due to the high performance of BCMRFs on known tissue-specific proteins, these are excellent candidates for further wet-lab experimental validation. (3) In addition to the proteomics data, the existing scientific literature contains an abundance of tissue-specific SCL data. To collect these data, we developed a scoring-based text mining system and extracted tissue-specific SCL associations from the abstracts of a large number of biomedical papers. The obtained data are accessible from the web-based database TS-SCL DB.
(4) We concluded the study with an application case study of the tissue-specific subcellular distribution of the human argonaute-2 (AGO2) protein. We demonstrated how to perform tissue-specific SCL prediction on AGO2-related PPINs. Most of the resulting tissue-specific SCLs are confirmed by literature results available in TS-SCL DB.

    Computational and Experimental Approaches to Reveal the Effects of Single Nucleotide Polymorphisms with Respect to Disease Diagnostics

    DNA mutations are the cause of many human diseases, and they are the reason for natural differences among individuals, because they affect the structure, function, interactions, and other properties of DNA and expressed proteins. The ability to predict whether a given mutation is disease-causing or harmless is of great importance for the early detection of patients with a high risk of developing a particular disease and would pave the way for personalized medicine and diagnostics. Here we review existing methods and techniques to study and predict the effects of DNA mutations from three different perspectives: in silico, in vitro and in vivo. It is emphasized that the problem is complicated and that successful detection of a pathogenic mutation frequently requires a combination of several methods and a knowledge of the biological phenomena associated with the corresponding macromolecules.

    Statistical Relational Learning for Proteomics: Function, Interactions and Evolution

    In recent years, the field of Statistical Relational Learning (SRL) [1, 2] has produced new, powerful learning methods that are explicitly designed to solve complex problems, such as collective classification, multi-task learning and structured output prediction, and that natively handle relational data, noise, and partial information. Statistical-relational methods rely on a form of First-Order Logic as a general, expressive formal language to encode both the data instances and the relations or constraints between them. The latter encode background knowledge on the problem domain, and are used to restrict or bias the model search space according to the instructions of domain experts. The new tools developed within SRL allow us to revisit old computational biology problems in a less ad hoc fashion, and to tackle novel, more complex ones. Motivated by these developments, in this thesis we describe and discuss the application of SRL to three important biological problems, highlighting the advantages, discussing the trade-offs, and pointing out the open problems. In particular, in Chapter 3 we show how to jointly improve the outputs of multiple correlated predictors of protein features by means of a very general probabilistic-logical consistency layer. The logical layer, based on grounding-specific Markov Logic networks [3], enforces a set of weighted first-order rules encoding biologically motivated constraints between the predictions. The refiner then improves the raw predictions so that they least violate the constraints. Contrary to canonical methods for the prediction of protein features, which typically take predicted correlated features as inputs to improve the output post facto, our method can jointly refine all predictions together, with potential gains in overall consistency.
In order to showcase our method, we integrate three stand-alone predictors of correlated features, namely subcellular localization (Loctree [4]), disulfide bonding state (Disulfind [5]), and metal bonding state (MetalDetector [6]), in a way that takes into account their respective strengths and weaknesses. The experimental results show that the refiner can improve the performance of the underlying predictors by removing rule violations. In addition, the proposed method is fully general, and could in principle be applied to an array of heterogeneous predictions without requiring any change to the underlying software. In Chapter 4 we consider the multi-level protein–protein interaction (PPI) prediction problem. In general, PPIs can be seen as a hierarchical process occurring at three related levels: proteins bind by means of specific domains, which in turn form interfaces through patches of residues. Detailed knowledge about which domains and residues are involved in a given interaction has extensive applications to biology, including better understanding of the binding process and more efficient drug/enzyme design. We cast the prediction problem in terms of multi-task learning, with one task per level (proteins, domains and residues), and propose a machine learning method that collectively infers the binding state of all object pairs, at all levels, concurrently. Our method is based on Semantic Based Regularization (SBR) [7], a flexible and theoretically sound SRL framework that employs First-Order Logic constraints to tie the learning tasks together. In contrast to most current PPI prediction methods, which neither identify which regions of a protein actually instantiate an interaction nor leverage the hierarchy of predictions, our method resolves the prediction problem up to residue level, enforcing consistent predictions between the hierarchy levels, and fruitfully exploits the hierarchical nature of the problem.
We present numerical results showing that our method substantially outperforms the baseline in several experimental settings, indicating that our multi-level formulation can indeed lead to better predictions. Finally, in Chapter 5 we consider the problem of predicting drug-resistant protein mutations through a combination of Inductive Logic Programming [8, 9] and Statistical Relational Learning. In particular, we focus on viral proteins: viruses are typically characterized by high mutation rates, which allow them to quickly develop drug-resistant mutations. Mining relevant rules from mutation data can be extremely useful to understand the virus adaptation mechanism and to design drugs that effectively counter potentially resistant mutants. We propose a simple approach for mutant prediction where the input consists of mutation data with drug-resistance information, either as sets of mutations conferring resistance to a certain drug, or as sets of mutants with information on their susceptibility to the drug. The algorithm learns a set of relational rules characterizing drug resistance, and uses them to generate a set of potentially resistant mutants. Learning a weighted combination of rules allows us to assign each generated mutant a resistance score as predicted by the statistical relational model and to select only the highest-scoring ones. Promising results were obtained in generating resistant mutations for both nucleoside and non-nucleoside HIV reverse transcriptase inhibitors. The approach can be generalized quite easily to learning mutants characterized by more complex rules correlating multiple mutations.

    Profiling patterns of interhelical associations in membrane proteins.

    A novel set of methods has been developed to characterize polytopic membrane proteins at the topological, organellar and functional level, in order to reduce the existing functional gap in the membrane proteome. Firstly, a novel clustering tool, named PROCLASS, was implemented to facilitate the manual curation of large sets of proteins, in readiness for feature extraction. TMLOOP and TMLOOP writer were implemented to refine current topological models by predicting membrane dipping loops. TMLOOP applies weighted predictive rules in a collective motif method, to overcome the inherent limitations of single motif methods. The approach achieved 92.4% sensitivity and 100% specificity, and 1,392 topological models described in the Swiss-Prot database were refined. The subcellular location (TMLOCATE) and molecular function (TMFUN) prediction methods rely on the TMDEPTH feature extraction method along with data mining techniques. TMDEPTH uses refined topological models and amino acid sequences to calculate pairs of residues located at a similar depth in the membrane. Evaluation of TMLOCATE showed a normalized accuracy of 75% in discriminating between proteins belonging to the main organelles. At a sequence similarity threshold of 40%, TMFUN predicted main functional classes with a sensitivity of 64.1-71.4%, and 70% of the olfactory GPCRs were correctly predicted. At a sequence similarity threshold of 90%, main functional classes were predicted with a sensitivity of 75.6-92.8%, and class A GPCRs were sub-classified with a sensitivity of 84.5%-92.9%. These results reflect a direct association between the spatial arrangement of residues in the transmembrane regions and the capacity for polytopic membrane proteins to carry out their functions.
The developed methods have for the first time categorically shown that the transmembrane regions hold essential information associated with a wide range of functional properties such as filtering and gating processes, subcellular location and molecular function.

    MI-NODES multiscale models of metabolic reactions, brain connectome, ecological, epidemic, world trade, and legal-social networks

    Complex systems and networks appear in almost all areas of reality. We find them everywhere, from protein residue networks to Protein Interaction Networks (PINs). Chemical reactions form Metabolic Reaction Networks (MRNs) in living beings or atmospheric reaction networks in planets and moons. Networks of neurons appear in the worm C. elegans, in the human brain connectome, or in Artificial Neural Networks (ANNs). Infection-spreading networks describe contagious outbreaks in humans and, in malware epidemiology, the spread of viral software in internet or wireless networks. Social-legal networks with different rules range from swarm intelligence to hunter-gatherer societies or citation networks of the U.S. Supreme Court. In all these cases, the same question arises: can we predict the links based on structural information? We propose to solve the problem using Quantitative Structure-Property Relationship (QSPR) techniques commonly used in chemoinformatics. To do so, we need software able to transform all types of networks/graphs, such as drug structure, drug-target interactions, protein structure, protein interactions, metabolic reactions, brain connectome, or social networks, into numerical parameters. Consequently, we need to process multitarget, multiscale, and multiplexing information in alignment-free mode. We then have to seek the QSPR model with Machine Learning techniques. MI-NODES is this type of software. Here we review the evolution of the software from chemoinformatics to bioinformatics and systems biology. This is an effort to develop a universal tool to study structure-property relationships in complex systems.

    Automatic Segmentation of Cells of Different Types in Fluorescence Microscopy Images

    Recognition of different cell compartments, types of cells, and their interactions is a critical aspect of quantitative cell biology. It provides valuable insight for understanding cellular and subcellular interactions and mechanisms of biological processes, such as cancer cell dissemination, organ development and wound healing. Quantitative analysis of cell images is also the mainstay of numerous clinical diagnostic and grading procedures, for example in cancer, immunological, infectious, heart and lung disease. Automated quantification of cellular biological samples requires segmenting different cellular and sub-cellular structures in microscopy images. However, automating this task has proven to be non-trivial, and requires solving multi-class image segmentation tasks that are challenging owing to the high similarity of objects from different classes and irregularly shaped structures. This thesis focuses on the development and application of probabilistic graphical models to multi-class cell segmentation. Graphical models can improve the segmentation accuracy by their ability to exploit prior knowledge and model inter-class dependencies. Directed acyclic graphs, such as trees, have been widely used to model top-down statistical dependencies as a prior for improved image segmentation. However, trees can capture only a few inter-class constraints. To overcome this limitation, polytree graphical models are proposed in this thesis that capture label proximity relations more naturally compared to tree-based approaches. Polytrees can effectively impose the prior knowledge on the inclusion of different classes by capturing both same-level and across-level dependencies. A novel recursive mechanism based on two-pass message passing is developed to efficiently calculate closed-form posteriors of graph nodes on polytrees.
Furthermore, since an accurate and sufficiently large ground truth is not always available for training segmentation algorithms, a weakly supervised framework is developed to employ polytrees for multi-class segmentation; it reduces the need for training by modeling the prior knowledge during segmentation. After a hierarchical graph is generated for the superpixels in the image, the labels of nodes are inferred through a novel efficient message-passing algorithm and the model parameters are optimized with Expectation Maximization (EM). Results of evaluation on the segmentation of simulated data and multiple publicly available fluorescence microscopy datasets indicate that the proposed method outperforms the state of the art. The proposed method has also been assessed in predicting the possible segmentation error and has been shown to outperform trees. This can pave the way to calculate uncertainty measures on the resulting segmentation and guide subsequent segmentation refinement, which can be useful in the development of an interactive segmentation framework.

    Identifying protein complexes and disease genes from biomolecular networks

    With advances in high-throughput measurement techniques, large-scale biological data, such as protein-protein interaction (PPI) data, gene expression data, gene-disease association data, cellular pathway data, and so on, have been and will continue to be produced. Those data contain insightful information for understanding the mechanisms of biological systems and have proved useful for developing new methods in disease diagnosis, disease treatment and drug design. This study focuses on two main research topics: (1) identifying protein complexes and (2) identifying disease genes from biomolecular networks. Firstly, protein complexes are groups of proteins that interact with each other at the same time and place within living cells. They are molecular entities that carry out cellular processes. The identification of protein complexes plays a primary role in understanding the organization of proteins and the mechanisms of biological systems. Many previous algorithms are designed based on the assumption that protein complexes are densely connected sub-graphs in PPI networks. In this research, a dense sub-graph detection algorithm is first developed following this assumption by using clique seeds and graph entropy. Although the proposed algorithm generates a large number of reasonable predictions and its f-score is better than that of many previous algorithms, it still cannot identify many known protein complexes. We then analyze characteristics of known yeast protein complexes and find that not all of the complexes exhibit dense structures in PPI networks. Many of them have a star-like structure, which is a very special case of the core-attachment structure and cannot be identified by many previous core-attachment-structure-based algorithms. To increase the prediction accuracy of protein complex identification, a multiple-topological-structure-based algorithm is proposed to identify protein complexes from PPI networks.
Four single-topological-structure-based algorithms are first employed to detect raw predictions with clique, dense, core-attachment and star-like structures, respectively. A merging and trimming step is then adopted to generate final predictions based on topological information or GO annotations of predictions. A comprehensive review of the identification of protein complexes, from static PPI networks to dynamic PPI networks, is also given in this study. Secondly, genetic diseases often involve the dysfunction of multiple genes. Various types of evidence have shown that similar disease genes tend to lie close to one another in various biomolecular networks. The identification of disease genes via multiple data integration is indispensable for understanding the genetic mechanisms of many genetic diseases. However, the number of known disease genes related to similar genetic diseases is often small. It is not easy to capture the intricate gene-disease associations from such a small number of known samples. Moreover, different kinds of biological data are heterogeneous and no widely accepted criterion is available to standardize them to the same scale. In this study, a flexible and reliable multiple data integration algorithm is first proposed to identify disease genes based on the theory of Markov random fields (MRF) and the method of Bayesian analysis. A novel global-characteristic-based parameter estimation method and an improved Gibbs sampling strategy are introduced, such that the proposed algorithm has the capability to tune parameters of different data sources automatically. However, the Markov property of the proposed algorithm means it only considers information from direct neighbors to formulate the relationship among genes, ignoring the contribution of indirect neighbors in biomolecular networks.
To overcome this drawback, a kernel-based MRF algorithm is further proposed to take advantage of the global characteristics of biological data via graph kernels. The kernel-based MRF algorithm generates better predictions than many previous disease gene identification algorithms in terms of the area under the receiver operating characteristic curve (AUC score). However, it is very time-consuming, since the Gibbs sampling process of the algorithm has to maintain a long Markov chain for every single gene. Finally, to reduce the computational time of the MRF-based algorithm, a fast, high-performance logistic-regression-based algorithm is developed for identifying disease genes from biomolecular networks. Numerical experiments show that the proposed algorithm outperforms many existing methods in terms of the AUC score and running time. To summarize, this study has developed several computational algorithms for identifying protein complexes and disease genes from biomolecular networks. These proposed algorithms are better than many other existing algorithms in the literature.
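
The abstract above compares disease-gene predictors by their AUC score. As a minimal illustrative sketch (not the authors' code), the AUC of a ranking can be computed as the fraction of (disease gene, non-disease gene) pairs in which the disease gene receives the higher score, with ties counted as one half. The function name `auc_score` and the toy labels and scores are assumptions made for this example only.

```python
def auc_score(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: iterable of 0/1 (1 = known disease gene; hypothetical data)
    scores: iterable of predicted scores, same length as labels
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half a win
    return wins / (len(pos) * len(neg))


# Toy example: four candidate genes, two of them known disease genes.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.3, 0.1]
print(auc_score(labels, scores))
```

The pairwise form is quadratic in the number of genes; it is chosen here for clarity, and a rank-based formulation would be preferable for genome-scale candidate lists.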

    Soft Computing Techniques for the Protein Folding Problem on High Performance Computing Architectures

    The protein-folding problem has been extensively studied during the last fifty years. Understanding the dynamics of the global shape of a protein and its influence on biological function can help us to discover new and more effective drugs to deal with diseases of pharmacological relevance. Different computational approaches have been developed by different researchers in order to predict the three-dimensional arrangement of the atoms of proteins from their sequences. However, the computational complexity of this problem makes it mandatory to search for new models, novel algorithmic strategies and hardware platforms that provide solutions in a reasonable time frame. In this review we present past and recent trends in protein folding simulations from both perspectives: hardware and software. Of particular interest to us are both the use of inexact solutions to this computationally hard problem and the hardware platforms that have been used for running this kind of Soft Computing technique. This work is jointly supported by the Fundación Séneca (Agencia Regional de Ciencia y Tecnología, Región de Murcia) under grants 15290/PI/2010 and 18946/JLI/13, by the Spanish MEC and European Commission FEDER under grant with reference TEC2012-37945-C02-02 and TIN2012-31345, by the Nils Coordinated Mobility under grant 012-ABEL-CM-2014A, in part financed by the European Regional Development Fund (ERDF). We also thank NVIDIA for hardware donation within UCAM GPU educational and research centers.

    Origins and control of single-cell transcript heterogeneity


    Deep Learning in Single-Cell Analysis

    Single-cell technologies are revolutionizing the entire field of biology. The large volumes of data generated by single-cell technologies are high-dimensional, sparse, heterogeneous, and have complicated dependency structures, making analyses using conventional machine learning approaches challenging and impractical. In tackling these challenges, deep learning often demonstrates superior performance compared to traditional machine learning methods. In this work, we give a comprehensive survey on deep learning in single-cell analysis. We first introduce background on single-cell technologies and their development, as well as fundamental concepts of deep learning including the most popular deep architectures. We present an overview of the single-cell analytic pipeline pursued in research applications while noting divergences due to data sources or specific applications. We then review seven popular tasks spanning different stages of the single-cell analysis pipeline, including multimodal integration, imputation, clustering, spatial domain identification, cell-type deconvolution, cell segmentation, and cell-type annotation. Under each task, we describe the most recent developments in classical and deep learning methods and discuss their advantages and disadvantages. Deep learning tools and benchmark datasets are also summarized for each task. Finally, we discuss the future directions and the most recent challenges. This survey will serve as a reference for biologists and computer scientists, encouraging collaborations. Comment: 77 pages, 11 figures, 15 tables, deep learning, single-cell analysis