754 research outputs found

    Prediction of protein distance maps by assembling fragments according to physicochemical similarities

    The prediction of protein structures is a current issue of great significance in structural bioinformatics. More specifically, the prediction of the tertiary structure of a protein consists of determining its three-dimensional conformation based solely on its amino acid sequence. This study proposes a method in which protein fragments are assembled according to their physicochemical similarities, using information extracted from known protein structures. Many approaches cited in the literature use the physicochemical properties of amino acids, generally hydrophobicity, polarity and charge, to predict structure. Our method, implemented with parallel multithreading, uses a set of 30 physicochemical amino acid properties selected from the AAindex database. Several protein tertiary structure prediction methods produce a contact map; our method instead produces a distance map, which provides more information about the structure of a protein than a contact map. The results of experiments with several non-homologous protein sets demonstrate the generality of this method and its prediction quality using the amino acid properties considered.
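To illustrate why a distance map carries more information than a contact map, here is a minimal sketch in Python. It is not the authors' method: the coordinates are toy values, the function names are hypothetical, and the 8 Å cutoff is an assumed (though commonly used) contact threshold. The contact map is obtained by thresholding the distance map, so the real-valued distances cannot be recovered from it.

```python
import numpy as np

def distance_map(coords):
    """Return the symmetric matrix of pairwise Euclidean distances."""
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def contact_map(dist, cutoff=8.0):
    """Binarize a distance map: True where two residues lie within `cutoff`.

    The 8.0 default is an assumption, not a value taken from the abstract.
    """
    return dist <= cutoff

# Toy C-alpha coordinates (angstroms) for a four-residue chain; illustrative only.
coords = [[0.0, 0.0, 0.0], [3.8, 0.0, 0.0], [7.6, 0.0, 0.0], [11.4, 0.0, 0.0]]
dist = distance_map(coords)
contacts = contact_map(dist)
```

Residues 0 and 1 (3.8 Å apart) and residues 0 and 2 (7.6 Å apart) both appear simply as "in contact", while the distance map keeps the two values distinct.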

    Highly Accurate Fragment Library for Protein Fold Recognition

    Proteins play a crucial role in living organisms as they perform many vital tasks in every living cell. Knowledge of protein folding has a deep impact on understanding the heterogeneity and molecular functions of proteins. Such information leads to crucial advances in drug design and disease understanding. Fold recognition is a key step in the protein structure discovery process, especially when traditional computational methods fail to yield convincing structural homologies. In this work, we present a new protein fold recognition approach using machine learning and data mining methodologies. First, we identify a protein structural fragment library (Frag-K) composed of a set of backbone fragments ranging from 4 to 20 residues as the structural “keywords” that can effectively distinguish between major protein folds. We apply randomized spectral clustering and random forest algorithms to construct representative and sensitive protein fragment libraries from a large set of high-quality, non-homologous protein structures available in the PDB, and we analyze the impact of clustering cut-offs on the performance of the fragment libraries. The Frag-K fragments are then employed as structural features to classify protein structures into the major protein folds defined by SCOP (Structural Classification of Proteins). Our results show that a structural dictionary with ~400 4- to 20-residue Frag-K fragments is capable of classifying major SCOP folds with high accuracy. Second, based on Frag-K, we design a novel deep learning architecture, DeepFrag-k, which identifies fold-discriminative features to improve the accuracy of protein fold recognition. DeepFrag-k is composed of two stages: the first stage employs a multimodal Deep Belief Network (DBN) to predict the potential structural fragments given a sequence, represented as a fragment vector, and the second stage uses a deep convolutional neural network (CNN) to classify the fragment vectors into the corresponding folds. Our results show that DeepFrag-k yields 92.98% accuracy in predicting the top-100 most popular fragments, which can be used to generate discriminative fragment feature vectors to improve protein fold recognition.

    Structural Cheminformatics for Kinase-Centric Drug Design

    Drug development is a long, expensive, and iterative process with a high failure rate, while patients wait impatiently for treatment. Kinases have been among the main drug targets studied over the last decades to combat cancer, the second leading cause of death worldwide. These efforts resulted in a plethora of structural, chemical, and pharmacological kinase data, which are collected in the KLIFS database. In this thesis, we apply ideas from structural cheminformatics to the rich KLIFS dataset, aiming to provide computational tools that speed up the complex drug discovery process. We focus on methods for target prediction and fragment-based drug design that study characteristics of kinase binding sites (also called pockets). First, we introduce the concept of computational target prediction, which is vital in the early stages of drug discovery. This approach identifies biological entities such as proteins that may (i) modulate a disease of interest (targets or on-targets) or (ii) cause unwanted side effects due to their similarity to on-targets (off-targets). We focus on the research field of binding site comparison, which lacked a freely available and efficient tool to determine similarities between the highly conserved kinase pockets. We fill this gap with the novel method KiSSim, which encodes and compares spatial and physicochemical pocket properties for all kinases (kinome) that are structurally resolved. We study kinase similarities in the form of kinome-wide phylogenetic trees and detect expected and unexpected off-targets. To allow multiple perspectives on kinase similarity, we propose an automated and production-ready pipeline; user-defined kinases can be inspected complementarily based on their pocket sequence and structure (KiSSim), pocket-ligand interactions, and ligand profiles. Second, we introduce the concept of fragment-based drug design, which is useful to identify and optimize active and promising molecules (hits and leads). 
This approach identifies low-molecular-weight molecules (fragments) that bind weakly to a target and are then grown into larger high-affinity drug-like molecules. With the novel method KinFragLib, we provide a fragment dataset for kinases (fragment library) by viewing kinase inhibitors as combinations of fragments. Kinases have a highly conserved pocket with well-defined regions (subpockets); based on the subpockets that they occupy, we fragment kinase inhibitors in experimentally resolved protein-ligand complexes. The resulting dataset is used to generate novel kinase-focused molecules that are recombinations of the previously fragmented kinase inhibitors while considering their subpockets. The KinFragLib and KiSSim methods are published as freely available Python tools. Third, we advocate for open and reproducible research that applies FAIR principles (data and software shall be findable, accessible, interoperable, and reusable) and software best practices. In this context, we present the TeachOpenCADD platform, which contains pipelines for computer-aided drug design. We use open source software and data to demonstrate ligand-based applications from cheminformatics and structure-based applications from structural bioinformatics. To emphasize the importance of FAIR data, we dedicate several topics to accessing life science databases such as ChEMBL, PubChem, PDB, and KLIFS. These pipelines are not only useful for novices in the field to gain domain-specific skills but can also serve as a starting point to study research questions. Furthermore, we show an example of how to build a stand-alone tool that formalizes recurring project-overarching tasks: OpenCADD-KLIFS offers a clean and user-friendly Python API to interact with the KLIFS database and fetch different kinase data types. This tool has been used in this thesis and beyond to support kinase-focused projects. 
We believe that the FAIR-based methods, tools, and pipelines presented in this thesis (i) are valuable additions to the toolbox for kinase research, (ii) provide relevant material for scientists who seek to learn, teach, or answer questions in the realm of computer-aided drug design, and (iii) contribute to making drug discovery more efficient, reproducible, and reusable.

    Graph theory-based sequence descriptors as remote homology predictors

    Indexed in: Scopus. Alignment-free (AF) methodologies have increased in popularity in the last decades as alternative tools to alignment-based (AB) algorithms for performing comparative sequence analyses. They have been especially useful to detect remote homologs within the twilight zone of highly diverse gene/protein families and superfamilies. The most popular alignment-free methodologies, as well as their applications to classification problems, have been described in previous reviews. Although a new set of graph theory-derived sequence/structural descriptors has been gaining relevance in the detection of remote homology, these descriptors have been omitted as AF predictors when the topic is addressed. Here, we first go over the most popular AF approaches used for detecting homology signals within the twilight zone and then present the state-of-the-art tools encoding graph theory-derived sequence/structure descriptors and their success in identifying remote homologs. We also highlight the tendency to integrate AF features/measures with AB ones, either in the same prediction model or by assembling the predictions from different algorithms using voting/weighting strategies, to improve the detection of remote signals. Lastly, we briefly discuss the efforts made to scale up AB and AF features/measures for the comparison of multiple genomes and proteomes. Alongside the experience gained in remote homology detection with both the most popular AF tools and lesser-known ones, we report our own experience using the graphical–numerical methodologies MARCH-INSIDE, TI2BioP, and ProtDCal. We also present a new Python-based tool (SeqDivA) with a friendly graphical user interface (GUI) for delimiting the twilight zone by using several similar criteria. https://www.mdpi.com/2218-273X/10/1/2

    Protein-segment universe exhibiting transitions at intermediate segment length in conformational subspaces

    Background: Many studies have examined rules governing two aspects of protein structures: short segments and proteins' structural domains. Nevertheless, the organization and nature of the conformational space of segments with intermediate length between short segments and domains remain unclear. Conformational spaces of intermediate length segments probably differ from those of short segments. We investigated the identification and characterization of the boundary(s) between peptide-like (short segment) and protein-like (long segment) distributions. We generated ensembles embedded in globular proteins comprising segments 10–50 residues long. We explored the relationships between the conformational distribution of segments and their lengths, and also protein structural classes, using principal component analysis based on the intra-segment Cα-Cα atomic distances. Results: Our statistical analyses of segment conformations and length revealed critical dual transitions in their conformational distribution with segments derived from all four structural classes. Dual transitions were identified with the intermediate phase between the short segments and domains. Consequently, protein segment universes were categorized: (i) short segments (10–22 residues) showed a distribution with a high frequency of secondary structure clusters; (ii) medium segments (23–26 residues) showed a distribution corresponding to an intermediate state of transitions; (iii) long segments (27–50 residues) showed a distribution converging on one huge cluster containing compact conformations with a smaller radius of gyration. This distribution reflects the protein structures' organization and protein domains' origin. Three major conformational components (radius of gyration, structural symmetry with respect to the N-terminal and C-terminal halves, and single-turn/two-turn structure) well define most of the segment universes. Furthermore, we identified several conformational components that were unique to each structural class. Those characteristics suggest that protein segment conformation is described by compositions of the three common structural variables with large contributions and specific structural variables with small contributions. Conclusion: The present results of the analyses of four protein structural classes show the universal role of three major components as segment conformational descriptors. The obtained perspectives of distribution changes related to the segment lengths using the three key components suggest both the adequacy and the possibility of further progress on the prediction strategies used in the recent de novo structure-prediction methods.
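The analysis described in this abstract, principal component analysis on intra-segment Cα-Cα distances, can be sketched as follows. This is only an illustration under stated assumptions: the segments are random placeholder coordinates rather than real protein structures, the function names are hypothetical, and PCA is computed via a plain SVD rather than whatever procedure the authors used.

```python
import numpy as np

def segment_features(coords):
    """Flatten the upper triangle of a segment's pairwise distance matrix."""
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    i, j = np.triu_indices(len(coords), k=1)
    return dist[i, j]

def pca(features, n_components=3):
    """Project rows of `features` onto their leading principal components."""
    centered = features - features.mean(axis=0)
    # SVD of the centered data: the rows of vt are the principal axes.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    scores = centered @ vt[:n_components].T
    explained = (s ** 2) / (s ** 2).sum()
    return scores, explained[:n_components]

rng = np.random.default_rng(0)
# 200 random 10-"residue" segments; placeholder data, not real structures.
segments = rng.normal(size=(200, 10, 3))
X = np.array([segment_features(seg) for seg in segments])
scores, explained = pca(X)
```

Each 10-residue segment yields 45 pairwise distances, and `scores` places every segment in the space of the first three components. With real segments, those components would be the ones to inspect for interpretations such as radius of gyration.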

    Entropy-scaling search of massive biological data

    Many datasets exhibit a well-defined structure that can be exploited to design faster search tools, but it is not always clear when such acceleration is possible. Here, we introduce a framework for similarity search based on characterizing a dataset's entropy and fractal dimension. We prove that searching scales in time with metric entropy (number of covering hyperspheres) if the fractal dimension of the dataset is low, and scales in space with the sum of metric entropy and information-theoretic entropy (randomness of the data). Using these ideas, we present accelerated versions of standard tools, with no loss in specificity and little loss in sensitivity, for use in three domains: high-throughput drug screening (Ammolite, 150x speedup), metagenomics (MICA, 3.5x speedup of DIAMOND [3,700x BLASTX]), and protein structure search (esFragBag, 10x speedup of FragBag). Our framework can be used to achieve "compressive omics," and the general theory can be readily applied to data science problems outside of biology. Comment: Including supplement: 41 pages, 6 figures, 4 tables, 1 bo
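The idea of searching over covering hyperspheres can be sketched as a coarse-then-fine range search. The following is a minimal illustration, not the code of Ammolite, MICA, or esFragBag: the greedy cover, the function names, the Euclidean metric, and the radii are all assumptions chosen for the example. The coarse step compares the query only against cluster centers, and the triangle inequality lets it discard whole clusters without missing any true hit.

```python
import numpy as np

def build_cover(points, r_c):
    """Greedily pick centers so that every point lies within r_c of a center."""
    centers, members = [], []
    for idx, p in enumerate(points):
        for c, group in zip(centers, members):
            if np.linalg.norm(p - points[c]) <= r_c:
                group.append(idx)
                break
        else:
            # No existing center is close enough: this point starts a new cluster.
            centers.append(idx)
            members.append([idx])
    return centers, members

def search(points, centers, members, query, r, r_c):
    """Return indices of all points within r of `query` (coarse, then fine)."""
    hits = []
    for c, group in zip(centers, members):
        # Triangle inequality: a cluster whose center is farther than r + r_c
        # from the query cannot contain any point within r of the query.
        if np.linalg.norm(query - points[c]) <= r + r_c:
            hits.extend(i for i in group if np.linalg.norm(query - points[i]) <= r)
    return sorted(hits)

rng = np.random.default_rng(1)
points = rng.normal(size=(500, 2))  # placeholder data
centers, members = build_cover(points, r_c=0.5)
query = np.array([0.1, -0.2])
found = search(points, centers, members, query, r=0.3, r_c=0.5)
# Brute-force reference result for comparison.
brute = [i for i in range(len(points)) if np.linalg.norm(query - points[i]) <= 0.3]
```

In this sketch the coarse step costs one distance computation per center, so the work grows with the number of covering hyperspheres rather than with the number of points, which is the quantity the abstract calls metric entropy.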

    Genome-wide Protein-chemical Interaction Prediction

    The analysis of protein-chemical reactions on a large scale is critical to understanding the complex interrelated mechanisms that govern biological life at the cellular level. Chemical proteomics is a new research area aimed at genome-wide screening of such chemical-protein interactions. Traditional approaches to such screening involve in vivo or in vitro experimentation, which, while becoming faster with the application of high-throughput screening technologies, remains costly and time-consuming compared to in silico methods. Early in silico methods are dependent on knowing 3D protein structures (docking) or on knowing binding information for many chemicals (ligand-based approaches). Typical machine learning approaches follow a global classification approach where a single predictive model is trained for an entire data set, but such an approach is unlikely to generalize well to the protein-chemical interaction space considering its diversity and heterogeneous distribution. In response to the global approach, work on local models has recently emerged to improve generalization across the interaction space by training a series of independent models, each localized to predict a single interaction. This work examines current approaches to genome-wide protein-chemical interaction prediction and explores new computational methods based on modifications to the boosting framework for ensemble learning. The methods are described and compared to several competing classification methods. Genome-wide chemical-protein interaction data sets are acquired from publicly available resources, and a series of experimental studies are performed in order to compare the performance of each method under a variety of conditions.

    Protein microenvironments for topology analysis

    Previously held under moratorium from 1st December 2016 until 1st December 2021. Amino acid residues are often the focus of research on protein structures. However, in a folded protein, each residue finds itself in an environment that is defined by the properties of its surrounding residues. The term microenvironment is used herein to refer to these local ensembles. Not only do they have chemical properties but also topological properties which quantify concepts such as density, boundaries between domains and junction complexity. These quantifications are used to project a protein’s backbone structure into a series of scores. The hypothesis was that these sequences of scores can be used to discover protein domains and motifs and that they can be used to align and compare groups of 3D protein structures. This research sought to implement a system that could efficiently compute microenvironments such that they can be applied routinely to large datasets. The computation of the microenvironments was the most challenging aspect in terms of performance, and the optimisations required are described. Methods of scoring microenvironments were developed to enable the extraction of domain and motif data without 3D alignment. The problem of allosteric site detection was addressed with a classifier that gave high rates of allosteric site detection. Overall, this work describes the development of a system that scales well with increasing dataset sizes. It builds on existing techniques to automatically detect the boundaries of domains, and demonstrates the ability to process large datasets by application to allosteric site detection, a problem that has not previously been adequately solved.