24 research outputs found

    Assessing functional novelty of PSI structures via structure-function analysis of large and diverse superfamilies

    One aim of the structural genomics initiatives has been to improve our understanding of protein function by providing representative structures for many structurally uncharacterised protein families. As suggested by the recent assessment of the Protein Structure Initiative (the Structural Genomics initiative funded by the NIH), doubts have arisen as to whether Structural Genomics as initially planned was really beneficial to our understanding of biological issues, and in particular of protein function.
A few protein domain superfamilies have been shown to account for unexpectedly large numbers of proteins encoded in fully sequenced genomes. These large superfamilies are generally very diverse, spanning a wide range of functions, both in terms of molecular activities and biological processes. Some of these superfamilies, such as the Rossmann-fold P-loop nucleotide hydrolases or the TIM-barrel glycosidases, have been the subject of extensive structural studies, which in turn have shed light on how the evolution of sequence and structure properties produces functional diversity amongst homologues. Recently, the Structure-Function Linkage Database (SFLD) has been set up to help the study of structure-function correlations in such superfamilies. Since the evolutionary success of these large superfamilies suggests biological importance, several Structural Genomics Centers have focused on providing full structural coverage for representatives of all sequence families in these superfamilies.
In this work we evaluate structure/function diversity in a set of these large superfamilies and attempt to assess the quality and quantity of biological information gained from Structural Genomics.

    Benchmarking network propagation methods for disease gene identification

    In-silico identification of potential target genes for disease is an essential aspect of drug target discovery. Recent studies suggest that successful targets can be found by leveraging genetic, genomic and protein interaction information. Here, we systematically tested the ability of 12 varied algorithms, based on network propagation, to identify genes that have been targeted by any drug, on gene-disease data from 22 common non-cancerous diseases in OpenTargets. We considered two biological networks and six performance metrics, and compared two types of input gene-disease association scores. The impact of the design factors on performance was quantified through additive explanatory models. Standard cross-validation led to over-optimistic performance estimates due to the presence of protein complexes. In order to obtain realistic estimates, we introduced two novel protein complex-aware cross-validation schemes. When seeding biological networks with known drug targets, machine learning and diffusion-based methods found around 2-4 true targets within the top 20 suggestions. Seeding the networks with genes associated with disease by genetics decreased performance below 1 true hit on average. The use of a larger network, although noisier, improved overall performance. We conclude that diffusion-based prioritisers and machine learning applied to diffusion-based features are suited for drug discovery in practice and improve over simpler neighbour-voting methods. We also demonstrate the large impact of choosing an adequate validation strategy and of the definition of seed disease genes.
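
As a rough illustration of the two ideas in this abstract, the sketch below propagates seed scores over a network and assigns genes to cross-validation folds so that members of one protein complex stay together. It is a minimal sketch under stated assumptions, not the benchmarked implementation: random walk with restart stands in for the 12 algorithms, the fold assignment is only one plausible complex-aware scheme (the two schemes of the paper are not described in the abstract), and all function names, parameters and gene labels are hypothetical.

```python
from collections import defaultdict


def random_walk_with_restart(edges, seeds, restart=0.3, iterations=100):
    """Propagate seed scores over an undirected network (assumed method)."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    nodes = sorted(neighbours)
    seed_set = set(seeds) & set(nodes)
    if not seed_set:
        raise ValueError("no seed gene is present in the network")
    # Restart distribution: uniform over the seed genes.
    start = {n: (1.0 / len(seed_set) if n in seed_set else 0.0) for n in nodes}
    scores = dict(start)
    for _ in range(iterations):
        updated = {}
        for n in nodes:
            # Each neighbour shares its score equally among its own neighbours.
            incoming = sum(scores[m] / len(neighbours[m]) for m in neighbours[n])
            updated[n] = (1 - restart) * incoming + restart * start[n]
        scores = updated
    return scores


def complex_aware_folds(genes, complexes, n_folds=3):
    """Assign genes to folds so that genes sharing a complex share a fold."""
    parent = {g: g for g in genes}

    def find(g):
        # Union-find lookup with path halving.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    # Merge overlapping complexes into a single group.
    for members in complexes:
        members = [m for m in members if m in parent]
        for m in members[1:]:
            parent[find(m)] = find(members[0])

    groups = defaultdict(list)
    for g in genes:
        groups[find(g)].append(g)

    folds = [[] for _ in range(n_folds)]
    # Largest groups first, each placed into the currently smallest fold.
    for group in sorted(groups.values(), key=len, reverse=True):
        min(folds, key=len).extend(group)
    return folds


# Toy usage with hypothetical gene labels.
toy_edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")]
toy_scores = random_walk_with_restart(toy_edges, seeds=["A"])
toy_folds = complex_aware_folds(
    ["A", "B", "C", "D", "E", "F"], complexes=[["A", "B"], ["B", "C"]], n_folds=2
)
```

Keeping a whole complex inside one fold prevents a held-out target from being recovered trivially through its complex partners in the training folds, which is the source of the over-optimistic estimates mentioned above.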

    New functional families (FunFams) in CATH to improve the mapping of conserved functional sites to 3D structures.

    CATH version 3.5 (Class, Architecture, Topology, Homology, available at http://www.cathdb.info/) contains 173 536 domains, 2626 homologous superfamilies and 1313 fold groups. When focusing on structural genomics (SG) structures, we observe that the number of new folds for CATH v3.5 is slightly lower than for previous releases, which suggests that we may now know the majority of folds that are easily accessible to structure determination. We have improved the accuracy of our functional family (FunFams) sub-classification method, and the CATH sequence domain search facility has been extended to provide FunFam annotations for each domain. The CATH website has been redesigned. We have improved the display of functional data and of conserved sequence features associated with FunFams within each CATH superfamily.

    Integrated Pathway Clusters with Coherent Biological Themes for Target Prioritisation

    Prioritising candidate genes for further experimental characterisation is an essential, yet challenging task in biomedical research. One way of achieving this goal is to identify specific biological themes that are enriched within the gene set of interest to obtain insights into the biological phenomena under study. Biological pathway data have been particularly useful in identifying functional associations of genes and/or gene sets. However, biological pathway information as compiled in varied repositories often differs in scope and content, preventing a more effective and comprehensive characterisation of gene sets. Here we describe a new approach to constructing biologically coherent gene sets from pathway data in major public repositories and employing them for functional analysis of large gene sets. We first revealed significant overlaps in gene content between different pathways and then defined a clustering method based on the shared gene content and the similarity of gene overlap patterns. We established the biological relevance of the constructed pathway clusters using independent quantitative measures and we finally demonstrated the effectiveness of the constructed pathway clusters in comparative functional enrichment analysis of gene sets associated with diverse human diseases gathered from the literature. The pathway clusters and gene mappings have been integrated into the TargetMine data warehouse and are likely to provide a concise, manageable and biologically relevant means of functional analysis of gene sets and to facilitate candidate gene prioritisation.
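
The idea of grouping pathways by shared gene content can be sketched as follows. This is an illustrative sketch only: the abstract does not state the overlap index or the clustering procedure, so the Jaccard index, the greedy average-linkage merging, the `min_similarity` cut-off, and the toy pathway and gene names are all assumptions.

```python
from itertools import combinations


def jaccard(a, b):
    """Gene overlap between two pathways (Jaccard index, an assumed measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_pathways(pathways, min_similarity=0.3):
    """Greedy average-linkage agglomerative clustering on gene overlap."""
    gene_sets = {name: set(genes) for name, genes in pathways.items()}
    clusters = [[name] for name in sorted(gene_sets)]

    def linkage(c1, c2):
        # Mean pairwise overlap between the pathways of two clusters.
        pairs = [(p, q) for p in c1 for q in c2]
        return sum(jaccard(gene_sets[p], gene_sets[q]) for p, q in pairs) / len(pairs)

    while len(clusters) > 1:
        # Find the most similar pair of clusters.
        (i, j), best = max(
            (
                ((i, j), linkage(clusters[i], clusters[j]))
                for i, j in combinations(range(len(clusters)), 2)
            ),
            key=lambda item: item[1],
        )
        if best < min_similarity:
            break  # remaining clusters are too dissimilar to merge
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return [sorted(c) for c in clusters]


# Toy usage with hypothetical pathways and gene identifiers.
toy_pathways = {
    "lipid_A": ["G1", "G2", "G3", "G4"],
    "lipid_B": ["G2", "G3", "G4", "G5"],
    "amino_A": ["G6", "G7", "G8"],
    "amino_B": ["G7", "G8", "G9"],
}
toy_clusters = cluster_pathways(toy_pathways)
```

On this toy input the two lipid pathways and the two amino acid pathways end up in separate clusters, since pathways from different groups share no genes.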

    Hierarchical clusters of functionally related pathways based on gene overlap profiles.

    A) Dendrogram generated by using a matrix of gene overlap indices for all pairwise pathway comparisons. Specific pathway clusters no27 (red) and no15 (blue) are highlighted. B) Cluster no27 “Metabolism of lipids and lipoproteins” included eight pathways, all associated with lipid metabolism. C) Cluster no15 “Glycolysis/Gluconeogenesis | Lysine degradation | Valine, leucine and isoleucine degradation” consisted of ten pathways, most of which were associated with amino acid metabolism.

    Functional similarity (FS) scores within a pathway cluster (Intra-cluster) were much higher than scores across pathway clusters (Inter-cluster).

    The online user interface allows the users to query and visualise integrated pathway clusters and perform GSFE analysis with the supplied list of genes.
