1,233 research outputs found

    Protein contact map prediction using multi-stage hybrid intelligence inference systems

    Get PDF
    AbstractProteins are one of the most important molecules in organisms. Protein function can be inferred from its 3D structure. The gap between the number of discovered protein sequences and the number of structures determined by the experimental methods is increasing. Accurate prediction of protein contact map is an important step toward the reconstruction of the protein’s 3D structure. In spite of continuous progress in developing contact map predictors, highly accurate prediction is still unresolved problem. In this paper, we introduce a new predictor, JUSTcon, which consists of multiple parallel stages that are based on adaptive neuro-fuzzy inference System (ANFIS) and K nearest neighbors (KNNs) classifier. A smart filtering operation is performed on the final outputs to ensure normal connectivity behaviors of amino acids pairs. The window size of the filter is selected by a simple expert system. The dataset was divided into testing dataset of 50 proteins and training dataset of 450 proteins. The system produced an average accuracy of 45.2% for the sequence separation of six amino acids. In addition, JUSTcon outperformed SVMcon and PROFcon predictors in the cases of large separation distances. JUSTcon produced an average accuracy of 15% for the sequence separation of 24 amino acids after applying it on CASP9 targets

    Combining local- and large-scale models to predict the distributions of invasive plant species

    Get PDF
    Habitat-distribution models are increasingly used to predict the potential distributions of invasive species and to inform monitoring. However, these models assume that species are in equilibrium with the environment, which is clearly not true for most invasive species. Although this assumption is frequently acknowledged, solutions have not been adequately addressed. There are several potential methods for improving habitat-distribution models. Models that require only presence data may be more effective for invasive species, but this assumption has rarely been tested. In addition, combining modeling types to form ‘ensemble’ models may improve the accuracy of predictions. However, even with these improvements, models developed for recently invaded areas are greatly influenced by the current distributions of species and thus reflect near- rather than long-term potential for invasion. Larger scale models from species’ native and invaded ranges may better reflect long-term invasion potential, but they lack finer scale resolution. We compared logistic regression (which uses presence/absence data) and two presence-only methods for modeling the potential distributions of three invasive plant species on the Olympic Peninsula in Washington State, USA. We then combined the three methods to create ensemble models. We also developed climate-envelope models for the same species based on larger scale distributions and combined models from multiple scales to create an index of near- and long-term invasion risk to inform monitoring in Olympic National Park (ONP). Neither presence-only nor ensemble models were more accurate than logistic regression for any of the species. Larger scale models predicted much greater areas at risk of invasion. Our index of near- and long-term invasion risk indicates that \u3c4% of ONP is at high near-term risk of invasion while 67-99% of the Park is at moderate or high long-term risk of invasion. We demonstrate how modeling results can be used to guide the design of monitoring protocols and monitoring results can in turn be used to refine models. We propose that by using models from multiple scales to predict invasion risk and by explicitly linking model development to monitoring, it may be possible to overcome some of the limitations of habitat-distribution models

    New evolutionary approaches to protein structure prediction

    Get PDF
    Programa de doctorado en Biotecnología y Tecnología QuímicaThe problem of Protein Structure Prediction (PSP) is one of the principal topics in Bioinformatics. Multiple approaches have been developed in order to predict the protein structure of a protein. Determining the three dimensional structure of proteins is necessary to understand the functions of molecular protein level. An useful, and commonly used, representation for protein 3D structure is the protein contact map, which represents binary proximities (contact or non-contact) between each pair of amino acids of a protein. This thesis work, includes a compilation of the soft computing techniques for the protein structure prediction problem (secondary and tertiary structures). A novel evolutionary secondary structure predictor is also widely described in this work. Results obtained confirm the validity of our proposal. Furthermore, we also propose a multi-objective evolutionary approach for contact map prediction based on physico-chemical properties of amino acids. The evolutionary algorithm produces a set of decision rules that identifies contacts between amino acids. The rules obtained by the algorithm impose a set of conditions based on amino acid properties in order to predict contacts. Results obtained by our approach on four different protein data sets are also presented. Finally, a statistical study was performed to extract valid conclusions from the set of prediction rules generated by our algorithm.Universidad Pablo de Olavide. Centro de Estudios de Postgrad

    An Efficient Nearest Neighbor Method for Protein Contact Prediction

    Get PDF
    A variety of approaches for protein inter-residue contact pre diction have been developed in recent years. However, this problem is far from being solved yet. In this article, we present an efficient nearest neigh bor (NN) approach, called PKK-PCP, and an application for the protein inter-residue contact prediction. The great strength of using this approach is its adaptability to that problem. Furthermore, our method improves considerably the efficiency with regard to other NN approaches. Our NN-based method combines parallel execution with k-d tree as search algorithm. The input data used by our algorithm is based on structural features and physico-chemical properties of amino acids besides of evo lutionary information. Results obtained show better efficiency rates, in terms of time and memory consumption, than other similar approaches.Ministerio de Educación y Ciencia TIN2011-28956-C02-0

    Exploring the potential of 3D Zernike descriptors and SVM for protein\u2013protein interface prediction

    Get PDF
    Abstract Background The correct determination of protein–protein interaction interfaces is important for understanding disease mechanisms and for rational drug design. To date, several computational methods for the prediction of protein interfaces have been developed, but the interface prediction problem is still not fully understood. Experimental evidence suggests that the location of binding sites is imprinted in the protein structure, but there are major differences among the interfaces of the various protein types: the characterising properties can vary a lot depending on the interaction type and function. The selection of an optimal set of features characterising the protein interface and the development of an effective method to represent and capture the complex protein recognition patterns are of paramount importance for this task. Results In this work we investigate the potential of a novel local surface descriptor based on 3D Zernike moments for the interface prediction task. Descriptors invariant to roto-translations are extracted from circular patches of the protein surface enriched with physico-chemical properties from the HQI8 amino acid index set, and are used as samples for a binary classification problem. Support Vector Machines are used as a classifier to distinguish interface local surface patches from non-interface ones. The proposed method was validated on 16 classes of proteins extracted from the Protein–Protein Docking Benchmark 5.0 and compared to other state-of-the-art protein interface predictors (SPPIDER, PrISE and NPS-HomPPI). Conclusions The 3D Zernike descriptors are able to capture the similarity among patterns of physico-chemical and biochemical properties mapped on the protein surface arising from the various spatial arrangements of the underlying residues, and their usage can be easily extended to other sets of amino acid properties. The results suggest that the choice of a proper set of features characterising the protein interface is crucial for the interface prediction task, and that optimality strongly depends on the class of proteins whose interface we want to characterise. We postulate that different protein classes should be treated separately and that it is necessary to identify an optimal set of features for each protein class

    Structured States of Disordered Proteins from Genomic Sequences

    Get PDF
    Protein flexibility ranges from simple hinge movements to functional disorder. Around half of all human proteins contain apparently disordered regions with little 3D or functional information, and many of these proteins are associated with disease. Building on the evolutionary couplings approach previously successful in predicting 3D states of ordered proteins and RNA, we developed a method to predict the potential for ordered states for all apparently disordered proteins with sufficiently rich evolutionary information. The approach is highly accurate (79%) for residue interactions as tested in more than 60 known disordered regions captured in a bound or specific condition. Assessing the potential for structure of more than 1,000 apparently disordered regions of human proteins reveals a continuum of structural order with at least 50% with clear propensity for three-or two-dimensional states. Co-evolutionary constraints reveal hitherto unseen structures of functional importance in apparently disordered proteins. Keywords: Evolutionary couplings disorder; conformational flexibility; statistical physics; maximum entropy; EVfold; bioinformatics; computational biology; structure predictionNational Institutes of Health (U.S.) (Grant R01GM081871

    Cooperative "folding transition" in the sequence space facilitates function-driven evolution of protein families

    Full text link
    In the protein sequence space, natural proteins form clusters of families which are characterized by their unique native folds whereas the great majority of random polypeptides are neither clustered nor foldable to unique structures. Since a given polypeptide can be either foldable or unfoldable, a kind of "folding transition" is expected at the boundary of a protein family in the sequence space. By Monte Carlo simulations of a statistical mechanical model of protein sequence alignment that coherently incorporates both short-range and long-range interactions as well as variable-length insertions to reproduce the statistics of the multiple sequence alignment of a given protein family, we demonstrate the existence of such transition between natural-like sequences and random sequences in the sequence subspaces for 15 domain families of various folds. The transition was found to be highly cooperative and two-state-like. Furthermore, enforcing or suppressing consensus residues on a few of the well-conserved sites enhanced or diminished, respectively, the natural-like pattern formation over the entire sequence. In most families, the key sites included ligand binding sites. These results suggest some selective pressure on the key residues, such as ligand binding activity, may cooperatively facilitate the emergence of a protein family during evolution. From a more practical aspect, the present results highlight an essential role of long-range effects in precisely defining protein families, which are absent in conventional sequence models.Comment: 13 pages, 7 figures, 2 tables (a new subsection added

    Probabilistic grammatical model of protein language and its application to helix-helix contact site classification

    Get PDF
    BACKGROUND: Hidden Markov Models power many state‐of‐the‐art tools in the field of protein bioinformatics. While excelling in their tasks, these methods of protein analysis do not convey directly information on medium‐ and long‐range residue‐residue interactions. This requires an expressive power of at least context‐free grammars. However, application of more powerful grammar formalisms to protein analysis has been surprisingly limited. RESULTS: In this work, we present a probabilistic grammatical framework for problem‐specific protein languages and apply it to classification of transmembrane helix‐helix pairs configurations. The core of the model consists of a probabilistic context‐free grammar, automatically inferred by a genetic algorithm from only a generic set of expert‐based rules and positive training samples. The model was applied to produce sequence based descriptors of four classes of transmembrane helix‐helix contact site configurations. The highest performance of the classifiers reached AUCROC of 0.70. The analysis of grammar parse trees revealed the ability of representing structural features of helix‐helix contact sites. CONCLUSIONS: We demonstrated that our probabilistic context‐free framework for analysis of protein sequences outperforms the state of the art in the task of helix‐helix contact site classification. However, this is achieved without necessarily requiring modeling long range dependencies between interacting residues. A significant feature of our approach is that grammar rules and parse trees are human‐readable. Thus they could provide biologically meaningful information for molecular biologists

    An Overview of the Use of Neural Networks for Data Mining Tasks

    Get PDF
    In the recent years the area of data mining has experienced a considerable demand for technologies that extract knowledge from large and complex data sources. There is a substantial commercial interest as well as research investigations in the area that aim to develop new and improved approaches for extracting information, relationships, and patterns from datasets. Artificial Neural Networks (NN) are popular biologically inspired intelligent methodologies, whose classification, prediction and pattern recognition capabilities have been utilised successfully in many areas, including science, engineering, medicine, business, banking, telecommunication, and many other fields. This paper highlights from a data mining perspective the implementation of NN, using supervised and unsupervised learning, for pattern recognition, classification, prediction and cluster analysis, and focuses the discussion on their usage in bioinformatics and financial data analysis tasks
    corecore