321 research outputs found

    Assessing functional novelty of PSI structures via structure-function analysis of large and diverse superfamilies

    The structural genomics initiatives have had as one of their aims to improve our understanding of protein function by providing representative structures for many structurally uncharacterised protein families. As suggested by the recent assessment of the Protein Structure Initiative (Structural Genomics Initiative, funded by the NIH), doubts have arisen as to whether Structural Genomics as initially planned was really beneficial to our understanding of biological issues, and in particular of protein function.
A few protein domain superfamilies have been shown to account for unexpectedly large numbers of proteins encoded in fully sequenced genomes. These large superfamilies are generally very diverse, spanning a wide range of functions, both in terms of molecular activities and biological processes. Some of these superfamilies, such as the Rossmann-fold P-loop nucleotide hydrolases or the TIM-barrel glycosidases, have been the subject of extensive structural studies, which in turn have shed light on how evolution of the sequence and structure properties produces functional diversity amongst homologues. Recently, the Structure-Function Linkage Database (SFLD) has been set up with the aim of helping the study of structure-function correlations in such superfamilies. Since the evolutionary success of these large superfamilies suggests biological importance, several Structural Genomics Centers have focused on providing full structural coverage for representatives of all sequence families in these superfamilies.
In this work we evaluate structure/function diversity in a set of these large superfamilies and attempt to assess the quality and quantity of biological information gained from Structural Genomics.

    Protein function prediction using domain families

    Here we assessed the use of domain families for predicting the functions of whole proteins. These 'functional families' (FunFams) were derived using a protocol that combines sequence clustering with supervised cluster evaluation, relying on available high-quality Gene Ontology (GO) annotation data in the latter step. In essence, the protocol groups domain sequences belonging to the same superfamily into families based on the GO annotations of their parent proteins. An initial test based on enzyme sequences confirmed that the FunFams resemble enzyme (domain) families much better than do families produced by sequence clustering alone. For the CAFA 2011 experiment, we further associated the FunFams with GO terms probabilistically. All target proteins were first submitted to domain superfamily assignment, followed by FunFam assignment and, eventually, function assignment. The latter included an integration step for multi-domain target proteins. The CAFA results put our domain-based approach among the top ten of 31 competing groups and 56 prediction methods, confirming that it outperforms simple pairwise whole-protein sequence comparisons.

    Sequence and Structural Differences between Enzyme and Nonenzyme Homologs

    To improve our understanding of the evolution of novel functions, we performed a sequence, structural, and functional analysis of homologous enzymes and nonenzymes of known three-dimensional structure. In most examples identified, the nonenzyme is derived from an ancestral catalytic precursor (as opposed to the reverse evolutionary scenario, nonenzyme to enzyme), and the active site pocket has been disrupted in some way, owing to the substitution of critical catalytic residues and/or steric interactions that impede substrate binding and catalysis. Pairwise sequence identity is typically insignificant, and almost one-half of the enzyme and nonenzyme pairs do not share any similarity in function. Heterooligomeric enzymes comprising homologous subunits in which one chain is catalytically inactive, and enzyme polypeptides that contain internal catalytic and noncatalytic duplications of an ancient enzyme domain, are also discussed.

    Exploiting protein family and protein network data to identify novel drug targets for bladder cancer

    Bladder cancer remains one of the most common forms of cancer and yet there are limited small molecule targeted therapies. Here, we present a computational platform to identify new potential targets for bladder cancer therapy. Our method initially exploited a set of known driver genes for bladder cancer combined with predicted bladder cancer genes from mutationally enriched protein domain families. We enriched this initial set of genes using protein network data to identify a comprehensive set of 323 putative bladder cancer targets. Pathway and cancer hallmarks analyses highlighted putative mechanisms in agreement with those previously reported for this cancer and revealed protein network modules highly enriched in potential drivers likely to be good targets for targeted therapies. Twenty-one of our potential drug targets are targeted by FDA-approved drugs for other diseases; some of them are known drivers or are already being targeted for bladder cancer (FGFR3, ERBB3, HDAC3, EGFR). A further four potential drug targets were identified by inheriting drug mappings across our in-house CATH domain functional families (FunFams). Our FunFam data also allowed us to identify drug targets in families that are less prone to side effects, i.e., where structurally similar protein domain relatives are less dispersed across the human protein network. We provide information on our novel potential cancer driver genes, together with information on pathways, network modules and hallmarks associated with the predicted and known bladder cancer drivers, and we highlight those drivers we predict to be likely drug targets.

    CATHEDRAL: A Fast and Effective Algorithm to Predict Folds and Domain Boundaries from Multidomain Protein Structures

    We present CATHEDRAL, an iterative protocol for determining the location of previously observed protein folds in novel multidomain protein structures. CATHEDRAL builds on the features of a fast secondary-structure–based method (using graph theory) to locate known folds within a multidomain context and a residue-based, double-dynamic programming algorithm, which is used to align members of the target fold groups against the query protein structure to identify the closest relative and assign domain boundaries. To increase the fidelity of the assignments, a support vector machine is used to provide an optimal scoring scheme. Once a domain is verified, it is excised, and the search protocol is repeated in an iterative fashion until all recognisable domains have been identified. We have performed an initial benchmark of CATHEDRAL against other publicly available structure comparison methods using a consensus dataset of domains derived from the CATH and SCOP domain classifications. CATHEDRAL shows superior performance in fold recognition and alignment accuracy when compared with many equivalent methods. If a novel multidomain structure contains a known fold, CATHEDRAL will locate it in 90% of cases, with <1% false positives. For nearly 80% of assigned domains in a manually validated test set, the boundaries were correctly delineated within a tolerance of ten residues. For the remaining cases, previously classified domains were very remotely related to the query chain so that embellishments to the core of the fold caused significant differences in domain sizes and manual refinement of the boundaries was necessary. To put this performance in context, a well-established sequence method based on hidden Markov models was only able to detect 65% of domains, with 33% of the subsequent boundaries assigned within ten residues. 
Since, on average, 50% of newly determined protein structures contain more than one domain unit, and typically 90% or more of these domains are already classified in CATH, CATHEDRAL will considerably facilitate the automation of protein structure classification.

    Microarray analysis after RNA amplification can detect pronounced differences in gene expression using limma

    BACKGROUND: RNA amplification is necessary for profiling gene expression from small tissue samples. Previous studies have shown that the T7-based amplification techniques are reproducible but may distort the true abundance of targets. However, the consequences of such distortions on the ability to detect biological variation in expression have not been explored sufficiently to define the true extent of usability and limitations of such amplification techniques. RESULTS: We show that expression ratios are occasionally distorted by amplification using the Affymetrix small sample protocol version 2 due to a disproportional shift in intensity across biological samples. This occurs when a shift in one sample cannot be reflected in the other sample because the intensity would lie outside the dynamic range of the scanner. Interestingly, such distortions most commonly result in smaller ratios, with the consequence of reducing the statistical significance of the ratios. This becomes more critical for less pronounced ratios where the evidence for differential expression is not strong. Indeed, statistical analysis by limma suggests that up to 87% of the genes with the largest and therefore most significant ratios (p < 10e(-20)) in the unamplified group have a p-value below 10e(-20) in the amplified group. On the other hand, only 69% of the more moderate ratios (10e(-20) < p < 10e(-10)) in the unamplified group have a p-value below 10e(-10) in the amplified group. Our analysis also suggests that, overall, limma shows better overlap of genes found to be significant in the amplified and unamplified groups than the Z-score statistics. CONCLUSION: We conclude that microarray analysis of amplified samples performs best at detecting differences in gene expression when these are large and when limma statistics are used.

    Srinivasan (1962-2021) in Bioinformatics and beyond


    Protein diversification through post-translational modifications, alternative splicing, and gene duplication

    Proteins provide the basis for cellular function. Having multiple versions of the same protein within a single organism provides a way of regulating its activity or developing novel functions. Post-translational modifications of proteins, by means of adding/removing chemical groups to amino acids, allow for a well-regulated and controlled way of generating functionally distinct protein species. Alternative splicing is another mechanism by which organisms may generate new isoforms. Additionally, gene duplication events throughout evolution generate multiple paralogs of the same genes, resulting in multiple versions of the same protein within an organism. In this review, we discuss recent advancements in the study of these three methods of protein diversification and provide illustrative examples of how they affect protein structure and function.

    Comprehensive genome analysis of 203 genomes provides structural genomics with new insights into protein family space

    We present an analysis of 203 completed genomes in the Gene3D resource (including 17 eukaryotes), which demonstrates that the number of protein families is continually expanding over time and that singleton sequences appear to be an intrinsic part of the genomes. A significant proportion of the proteomes can be assigned to fewer than 6000 well-characterized domain families, with the remaining domain-like regions belonging to a much larger number of small uncharacterized families that are largely species specific. Our comprehensive domain annotation of 203 genomes enables us to provide more accurate estimates of the number of multi-domain proteins found in the three kingdoms of life than previous calculations. We find that 67% of eukaryotic sequences are multi-domain compared with 56% of sequences in prokaryotes. By measuring the domain coverage of genome sequences, we show that the structural genomics initiatives should aim to provide structures for fewer than a thousand structurally uncharacterized Pfam families to achieve reasonable structural annotation of the genomes. However, in large families, additional structures should be determined as these would reveal more about the evolution of the family and enable a greater understanding of how function evolves.