Automatic annotation of experimentally derived, evolutionarily conserved post-translational modifications onto multiple genomes
New generation sequencing technologies have resulted in significant increases in the number of complete genomes. Functional characterization of these genomes, such as by high-throughput proteomics, is an important but challenging task due to the difficulty of scaling up existing experimental techniques. By use of comparative genomics techniques, experimental results can be transferred from one genome to another, while at the same time minimizing errors by requiring discovery in multiple genomes. In this study, protein phosphorylation, an essential component of many cellular processes, is studied using data from large-scale proteomics analyses of the phosphoproteome. Phosphorylation sites from Homo sapiens, Mus musculus and Drosophila melanogaster phosphopeptide data sets were mapped onto conserved domains in the manually curated portion of NCBI’s Conserved Domain Database (CDD). In this subset, 25 phosphorylation sites are found to be evolutionarily conserved between the three species studied. Transfer of the phosphorylation annotation of these conserved sites onto sequences sharing the same conserved domains yields 3253 phosphosite annotations for proteins from coelomata, the taxonomic division that spans H. sapiens, M. musculus and D. melanogaster. The method scales automatically, so as the amount of experimental phosphoproteomics data increases, more conserved phosphorylation sites may be revealed.
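The mapping-and-transfer idea in this abstract can be illustrated with a minimal sketch. It assumes that each phosphosite has already been mapped to a (domain, alignment column) pair, that a site counts as conserved only if it is observed in every species, and that annotations are transferred to any protein carrying the same domain column. The domain names, protein names, positions and function names are hypothetical and are not taken from the study; the authors' actual pipeline is likely to apply further checks (for example on residue identity) that are not shown here.

```python
# Hypothetical illustration only: all identifiers and positions are invented.

def conserved_sites(sites_by_species):
    """Return the (domain, column) pairs observed in every species."""
    site_sets = [set(sites) for sites in sites_by_species.values()]
    return set.intersection(*site_sets) if site_sets else set()


def transfer_annotations(conserved, domain_hits):
    """Project conserved domain columns onto proteins that share the domain.

    domain_hits maps protein -> {domain: {alignment column: residue position}}.
    Returns a list of (protein, domain, residue position) annotations.
    """
    annotations = []
    for protein, domains in sorted(domain_hits.items()):
        for domain, column in sorted(conserved):
            position = domains.get(domain, {}).get(column)
            if position is not None:
                annotations.append((protein, domain, position))
    return annotations


# Phosphosites per species, already mapped to domain alignment columns.
sites_by_species = {
    "H. sapiens": {("domA", 42), ("domB", 7)},
    "M. musculus": {("domA", 42), ("domC", 11)},
    "D. melanogaster": {("domA", 42), ("domB", 7)},
}

# Proteins from other genomes and the domain columns they contain.
domain_hits = {
    "protA": {"domA": {42: 118}},
    "protB": {"domB": {7: 30}},
    "protC": {"domA": {42: 57}},
}

conserved = conserved_sites(sites_by_species)
annotations = transfer_annotations(conserved, domain_hits)
print(conserved)
print(annotations)
```

Requiring a site in all species is the strictest possible rule; it trades coverage for a lower chance of propagating a spurious experimental call.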
wKinMut: An integrated tool for the analysis and interpretation of mutations in human protein kinases
BACKGROUND: Protein kinases are involved in relevant physiological functions, and a large number of mutations in this superfamily have been reported in the literature to affect protein function and stability. Unfortunately, exploring the phenotypic consequences of each individual mutation remains a considerable challenge. RESULTS: The wKinMut web-server offers direct prediction of the potential pathogenicity of the mutations from a number of methods, including our recently developed prediction method based on the combination of information from a range of diverse sources, including physicochemical properties and functional annotations from FireDB and Swissprot, kinase-specific characteristics such as membership of specific kinase groups, annotation with disease-associated GO terms or the occurrence of the mutation in PFAM domains, and the relevance of the residues in determining kinase subfamily specificity from S3Det. This predictor yields interesting results that compare favourably with other methods in the field when applied to protein kinases. Together with the predictions, wKinMut offers a number of integrated services for the analysis of mutations. These include: the classification of the kinase, information about associations of the kinase with other proteins extracted from iHop, the mapping of the mutations onto PDB structures, pathogenicity records from a number of databases and the classification of mutations in large-scale cancer studies. Importantly, wKinMut is connected with the SNP2L system that extracts mentions of mutations directly from the literature, and therefore increases the possibilities of finding interesting functional information associated with the studied mutations.
CONCLUSIONS: wKinMut facilitates the exploration of the information available about individual mutations by integrating prediction approaches with the automatic extraction of information from the literature (text mining) and several state-of-the-art databases. wKinMut has been used during the last year for the analysis of the consequences of mutations in the context of a number of cancer genome projects, including the recent analysis of Chronic Lymphocytic Leukemia cases and is publicly available at http://wkinmut.bioinfo.cnio.es
The computational analysis of post-translational modifications
Post-translational modifications (PTMs) of proteins present a means to increase the proteome size and diversity of an organism through the inclusion of structural elements not encoded at the sequence level alone. Their erroneous inclusion or exclusion has been linked to a variety of diseases and disorders; thus, their characterisation has the potential to reveal viable drug targets. The proliferation of newer high-throughput methods, such as mass spectrometry, to identify such modifications has led to a rapid increase in the number of databases and tools to display and analyse such vast amounts of data effectively. This study covers the development of one such tool, PTM Browser, and the construction of the database that underlies it. This new database was initially seeded with annotations from the Swiss-Prot and Phospho.ELM resources. It was then expanded to include a large repertoire of previously unannotated proteins for a selection of topical species (e.g. Danio rerio and Tetraodon nigroviridis). Orthologue assignments have also been added to the database, so that queries can be performed on the conservation of modifications between homologous proteins. The PTM Browser tool allows a full exploration of this new database of PTMs, with a special focus on allowing users to identify modifications that are shared between, or specific to, particular species. This tool is freely available for non-commercial use at http://www.ptmbrowser.org. An analysis of the conservation of modifications between members of the p53 tumour suppressor family is presented using this new tool. The tool has also been used to analyse the conservation of modifications between super-kingdoms and eukaryote species.
Mechanisms of change in protein architecture
Proteins are the basic building blocks and functional units in all living organisms.
Moreover, differences between species can frequently be explained with
differences in their protein complements. Importantly, proteins are often
composed of segments, i.e. domains that have a certain level of evolutionary,
structural and/or functional independence. The majority of proteins in nature
contain two or more domains, and an individual domain can often occur in
combinations with different domain partners.
In the first part of my thesis, I traced the history of animal gene families
and the proteins these genes encode. By this means, I was able to infer events
where changes in protein domain architectures took place. This showed that
both insertions and deletions of single copy domains preferentially occur at
protein termini, but also that changes are more likely to occur after gene
duplication than organism speciation. Finally, domains that were most
frequently gained were the ones that are related to an increase in organismal
complexity, thus underlining the important role of domain shuffling in animal
evolution.
In the second part of my thesis, I focused on a set of high confidence
domain gain events and investigated the evidence for molecular mechanisms
that caused these domain gains. In agreement with observations from the first
part - that changes preferentially occur at the termini - I have found that the
strongest contribution to gains of novel domains in proteins comes from gene
fusion through the joining of exons from adjacent genes into a novel gene unit.
Two other mechanisms that have been suggested to play a major role in the
evolution of animal proteins, retroposition and middle insertions through
intronic recombination, have a smaller role in comparison to gene fusions. Since
the majority of these domain gains are again observed after gene duplication,
this suggests a powerful mechanism for neofunctionalization after gene
duplication.
Finally, in the last part of my thesis, I address a mechanism that increases
the number and variety of proteins in an organism – alternative splicing. In
particular, I investigate the functional consequences of tissue-specific alternative
splicing events. I found that tissue-specific splicing tends to affect exons that
encode protein regions without defined secondary or tertiary structure.
Importantly, it is known that these disordered regions frequently play a role in
protein interactions. In agreement with this, I observed significant enrichment of
tissue-specifically encoded protein segments in disordered binding peptides and
posttranslationally modified sites. A possible result of the finely regulated
alternative splicing of these segments is a tissue-specific rewiring of protein
networks. In conclusion, both alternative splicing and domain shuffling can
increase proteome diversity. However, a protein with a new function can often
directly or indirectly shape the functions of other proteins in its environment.
Deep sequencing of pre-translational mRNPs reveals hidden flux through evolutionarily conserved AS-NMD pathways
Deep sequencing of mRNAs (RNA-Seq) is now the preferred method for transcriptome-wide quantification of gene expression. Yet many mRNA isoforms, such as those eliminated by nonsense-mediated decay (NMD), are inherently unstable. Thus a significant drawback of steady-state RNA-Seq is that it provides marginal information on the flux through alternative splicing pathways. Measurement of such flux necessitates capture of newly made species prior to mRNA decay. One means to capture nascent mRNAs is affinity purifying either the exon junction complex (EJC) or activated spliceosomes. Late-stage spliceosomes deposit the EJC upstream of exon-exon junctions, where it remains associated until the first round of translation. As most mRNA decay pathways are translation-dependent, these EJC- or spliceosome-associated, pre-translational mRNAs should provide an accurate record of the initial population of alternate mRNA isoforms.
Previous work has analyzed the protein composition and structure of pre-translational mRNPs in detail. While in the Moore lab, my project has focused on exploring the diversity of mRNA isoforms contained within these complexes. As expected, known NMD isoforms are more highly represented in pre-translational mRNPs than in RNA-Seq libraries. To investigate whether pre-translational mRNPs contain novel mRNA isoforms, we created a bioinformatics pipeline that identified thousands of previously unannotated splicing events. Though many can be attributed to “splicing noise”, others are evolutionarily conserved events that produce new AS-NMD isoforms likely involved in maintenance of protein homeostasis. Several of these occur in genes whose overexpression has been linked to poor cancer prognosis.
Gene expression data analysis using novel methods: Predicting time delayed correlations and evolutionarily conserved functional modules
Microarray technology enables the study of gene expression on a large scale. One of the main challenges has been to devise methods to cluster genes that share similar expression profiles. In gene expression time courses, a particular gene may encode a transcription factor and thus control several genes downstream; in this case, the gene expression profiles may be staggered, indicating a time-delayed response in transcription of the later genes. The standard clustering algorithms consider gene expression profiles in a global way, thus often ignoring such local time-delayed correlations. We have developed novel methods to capture time-delayed correlations between expression profiles: (1) a method using dynamic programming and (2) CLARITY, an algorithm that uses a local shape-based similarity measure to predict time-delayed correlations and local correlations. We used CLARITY on a dataset describing the change in gene expression during the mitotic cell cycle in Saccharomyces cerevisiae. The obtained clusters were significantly enriched with genes that share similar functions, reflecting the fact that genes with a similar function are often co-regulated and thus co-expressed. Time-shifted as well as local correlations could also be predicted using CLARITY.
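To make the notion of a time-delayed correlation concrete, the following minimal sketch scans a range of lags and reports the one at which the Pearson correlation between two expression profiles is highest. This is a generic lagged-correlation scan, not the dynamic programming method or the CLARITY algorithm described in the abstract; the function names, the toy profiles and the minimum overlap of three time points are all assumptions made for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def best_lag(regulator, target, max_lag):
    """Return (lag, r) for the delay of `target` that best matches `regulator`."""
    best = (0, pearson(regulator, target))
    for lag in range(1, max_lag + 1):
        if len(regulator) - lag < 3:  # assumed minimum overlap of 3 time points
            break
        # Compare early regulator values with later target values.
        r = pearson(regulator[:-lag], target[lag:])
        if r > best[1]:
            best = (lag, r)
    return best


# Toy profiles: the target repeats the regulator's pulse two time points later.
regulator = [0.0, 1.0, 3.0, 1.0, 0.0, 0.0, 0.0, 0.0]
target = [0.0, 0.0, 0.0, 1.0, 3.0, 1.0, 0.0, 0.0]

lag, r = best_lag(regulator, target, max_lag=4)
print(lag, round(r, 3))
```

With no shift the two toy profiles are negatively correlated, which is the situation in which a global similarity measure would fail to group a regulator with its delayed target.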
In datasets where the expression profiles of independent experiments are compared, the standard clustering algorithms often cluster according to all conditions, considering all genes. This increases the background noise and can cause genes that change their expression only under particular conditions to be missed. We have employed a genetic-algorithm-based module predictor that is capable of identifying groups of genes that change their expression only in a subset of conditions. With the aim of supplementing the Ustilago maydis genome annotation, we have used the module prediction algorithm on various independent datasets from Ustilago maydis. The predicted modules were cross-referenced in various Saccharomyces cerevisiae datasets to check their evolutionary conservation between these two organisms. The key contributions of this thesis are novel methods that explore biological information from DNA microarray data.
Identification, organisation and visualisation of complete proteomes in UniProt throughout all taxonomic ranks: archaea, bacteria, eukaryote and virus
Users of uniprot.org want to be able to query, retrieve and download proteome sets for an organism of their choice. They expect the data to be easily accessed, complete and up to date based on current available knowledge. UniProt release 2012_01 (25th Jan 2012) contains the proteomes of 2,923 organisms, of which 50% are bacteria, 38% viruses, 8% eukaryota and 4% archaea. Note that the term 'organism' is used in a broad sense to include subspecies, strains and isolates. Each completely sequenced organism is processed as an independent organism, hence the availability of 38 strain-specific Escherichia coli proteomes that are accessible for download.
There is a project within UniProt dedicated to the mammoth task of maintaining the “Proteomes database”. This active resource is essential for UniProt to continually provide high quality proteome sets to the users. Accurate identification and incorporation of new, publicly available proteomes, as well as the maintenance of existing proteomes, permits sustained growth of the proteomes project. This is a huge, complicated and vital task accomplished by the activities of both curators and programmers.
This thesis explains the data input and output of the proteomes database: the flow of genome project data from the nucleotide database into the proteomes database, and then, for each genome, how a proteome is identified, augmented and made visible to uniprot.org users. Along the way, many issues arose concerning data gathering, data integrity and data visualisation. All were resolved and the outcome is a well-documented, actively maintained database that strives to provide optimal proteome information to its users.