4,730 research outputs found

    Automatic discovery of cross-family sequence features associated with protein function

    BACKGROUND: Methods for predicting protein function directly from amino acid sequences are useful tools in the study of uncharacterised protein families and in comparative genomics. Until now, this problem has been approached using machine learning techniques that attempt to predict membership, or otherwise, to predefined functional categories or subcellular locations. A potential drawback of this approach is that the human-designated functional classes may not accurately reflect the underlying biology, and consequently important sequence-to-function relationships may be missed. RESULTS: We show that a self-supervised data mining approach is able to find relationships between sequence features and functional annotations. No preconceived ideas about functional categories are required, and the training data is simply a set of protein sequences and their UniProt/Swiss-Prot annotations. The main technical aspect of the approach is the co-evolution of amino acid-based regular expressions and keyword-based logical expressions with genetic programming. Our experiments on a strictly non-redundant set of eukaryotic proteins reveal that the strongest and most easily detected sequence-to-function relationships are concerned with targeting to various cellular compartments, which is an area already well studied both experimentally and computationally. Of more interest are a number of broad functional roles which can also be correlated with sequence features. These include inhibition, biosynthesis, transcription and defence against bacteria. Despite substantial overlaps between these functions and their corresponding cellular compartments, we find clear differences in the sequence motifs used to predict some of these functions. For example, the presence of polyglutamine repeats appears to be linked more strongly to the "transcription" function than to the general "nuclear" function/location. 
CONCLUSION: We have developed a novel and useful approach for knowledge discovery in annotated sequence data. The technique is able to identify functionally important sequence features and does not require expert knowledge. By viewing protein function from a sequence perspective, the approach is also suitable for discovering unexpected links between biological processes, such as the recently discovered role of ubiquitination in transcription.
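The abstract above describes amino acid-based regular expressions and mentions polyglutamine repeats as one predictive feature. A minimal sketch of such a pattern follows; the threshold of five consecutive glutamines and the names `POLYQ` and `has_polyq` are illustrative assumptions, not expressions evolved by the authors' genetic programming system.

```python
import re

# Assumption: a "polyglutamine repeat" is taken to be a run of at least
# five consecutive glutamine residues (one-letter code Q). The threshold
# is hypothetical and not taken from the paper.
POLYQ = re.compile(r"Q{5,}")

def has_polyq(sequence: str) -> bool:
    """Return True if the protein sequence contains a polyglutamine run."""
    return POLYQ.search(sequence.upper()) is not None
```

A function like this could serve as one binary sequence feature to be correlated with annotation keywords such as "transcription".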

    Membrane proteins in the outer membrane of plastids and mitochondria

    Channels of the plastid and mitochondrial outer membranes facilitate the turnover of molecules and ions across these membranes. Although channels have been studied, many questions pertaining to the whole diversity of plastid and mitochondrial channels in Arabidopsis thaliana and Pisum sativum remain unanswered. In this thesis I studied the OEP16, OEP37 and VDAC families in two model plants, Arabidopsis and pea. The Arabidopsis OEP16 family comprises four channels of α-helical structure, similar to the pea OEP16 protein. These channels are suggested to transport amino acids and compounds with primary amino groups. Immunoblot analysis, GFP/RFP protein fusion expression, as well as proteomic analysis showed that AtOEP16.1, AtOEP16.2 and AtOEP16.4 are located in the outer envelope membrane of plastids, while AtOEP16.3 is in mitochondria. The gene expression and immunoblot analyses revealed that AtOEP16.1 and AtOEP16.3 proteins are highly abundant and ubiquitous; expression of AtOEP16.1 is regulated by light and cold. AtOEP16.2 is highly expressed in pollen, seeds and seedlings. AtOEP16.4 is a housekeeping protein expressed at low levels. Single knockout mutants of AtOEP16.1, AtOEP16.2 and AtOEP16.4, and double mutants of the AtOEP16 gene family did not show any remarkable phenotype. However, macroarray analysis of the Atoep16.1-p T-DNA mutant revealed 10 down-regulated and 6 up-regulated genes. In contrast to the α-helical OEP16 proteins, the OEP37 and VDAC proteins are of β-barrel structure. The PsOEP37 and AtOEP37 channel proteins form a selective barrier in the outer envelope of chloroplasts. Electrophysiological studies in lipid bilayer membranes showed that the PsOEP37 channel is permeable for cations. Specific expression profiles showed that AtOEP37 and PsOEP37 are highly expressed in the entire plant. The isolated PsVDAC gene encodes a protein that is located in mitochondria. In the Arabidopsis gene database, five Arabidopsis genes that code for VDAC-like proteins were announced. 
Expression of one gene was not detected, whereas four of these genes were expressed in leaves, roots, flower buds and pollen.

    The role of Iron Regulated 2 and Iron Regulated Transporter 1 in nickel hyperaccumulation traits in Senecio coronatus

    Metal hyperaccumulating plants accumulate exceptionally high concentrations of metal ions in their above ground tissues and are defined as containing 1000 μg/g dry mass Co, Cu, Cr, Pb, Zn or Ni. This is remarkable because plants typically only require small amounts of these metals for survival, such as 0.004 μg/g Ni and 15-20 μg/g Zn. Scientific investigation has sought to understand the mechanisms underpinning hyperaccumulation in order to apply them in the phyto-technological processes of phytoremediation (removal of metal pollutants from the environment) and phytomining. However, little is known about the molecular mechanisms underlying Ni hyperaccumulation despite the fact that Ni hyperaccumulators account for almost three quarters of all known hyperaccumulating species. A comparative RNA-Seq experiment carried out on Ni accumulating and non-accumulating populations of the South African Ni hyperaccumulator Senecio coronatus (Asteraceae) identified a number of putative transport proteins that are constitutively upregulated in the hyperaccumulator plants. This MSc project focused on two of these, iron regulated 2 (ScIREG2) and iron regulated transporter 1 (ScIRT1), and aimed to validate the RNA-Seq derived nucleotide sequences, test for Ni transport activity and determine their sub-cellular localisation. Full-length ScIREG2 and ScIRT1 protein coding sequences were obtained using RT-PCR and conformed to the predicted sequences derived from the RNA-Seq data. Heterologous expression of ScIRT1 in yeast consistently conferred an increased Ni resistance phenotype to yeast across a variety of experimental conditions, suggesting that this protein is capable of transporting Ni, and may function as a Ni export protein in yeast. In contrast, the results obtained from heterologous expression of ScIREG2 were variable and thus inconclusive. 
An attempt was made to determine the subcellular localization of ScIRT1 using transient expression of an ScIRT1-YFP fusion protein in onion cells. While inconclusive, a YFP signal was detected in these cells, and appeared to localise to the plasma membrane. The work conducted serves as a pilot study to optimize the experimental systems necessary to identify Ni transporters from S. coronatus. These experimental systems can now be applied to characterise the remaining transport proteins identified in the RNA-Seq analysis

    Prediction of protein submitochondria locations by hybridizing pseudo-amino acid composition with various physicochemical features of segmented sequence

    BACKGROUND: Knowing the submitochondria localization of a mitochondria protein is an important step to understand its function. We develop a method which is based on an extended version of pseudo-amino acid composition to predict the protein localization within mitochondria. This work goes one step further than predicting protein subcellular location. We also try to predict the membrane protein type for mitochondrial inner membrane proteins. RESULTS: By using leave-one-out cross validation, the prediction accuracy is 85.5% for inner membrane, 94.5% for matrix and 51.2% for outer membrane. The overall prediction accuracy for submitochondria location prediction is 85.2%. For proteins predicted to localize at inner membrane, the accuracy is 94.6% for membrane protein type prediction. CONCLUSION: Our method is an effective method for predicting protein submitochondria location. But even with our method or the methods at subcellular level, the prediction of protein submitochondria location is still a challenging problem. The online service SubMito is now available at
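The leave-one-out cross validation mentioned in the abstract above can be sketched as follows. This is a simplified illustration only: it uses plain amino acid composition and a 1-nearest-neighbour rule, which are stand-ins chosen here for brevity, not the extended pseudo-amino acid composition or the classifier used by the authors. All function names are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(seq):
    """Fraction of each of the 20 standard amino acids in the sequence."""
    counts = Counter(seq)
    return [counts[a] / len(seq) for a in AMINO_ACIDS]

def nearest_label(query, examples):
    """Label of the example whose vector is closest (squared Euclidean)."""
    def dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(examples, key=lambda e: dist(query, e[0]))[1]

def leave_one_out_accuracy(sequences, labels):
    """Hold out each protein in turn, predict it from all the others."""
    vectors = [composition(s) for s in sequences]
    correct = 0
    for i, (vec, label) in enumerate(zip(vectors, labels)):
        rest = [(v, l) for j, (v, l) in enumerate(zip(vectors, labels)) if j != i]
        if nearest_label(vec, rest) == label:
            correct += 1
    return correct / len(vectors)
```

Each protein is scored by a model that never saw it, which is why per-class accuracies such as those quoted above can differ sharply when one class (here, outer membrane) has few examples.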

    The effect of organelle discovery upon sub-cellular protein localisation.

    Prediction of protein sub-cellular localisation by employing quantitative mass spectrometry experiments is an expanding field. Several methods have led to the assignment of proteins to specific subcellular localisations by partial separation of organelles across a fractionation scheme coupled with computational analysis. Methods developed to analyse organelle data have largely employed supervised machine learning algorithms to map unannotated abundance profiles to known protein–organelle associations. Such approaches are likely to make association errors if organelle-related groupings present in experimental output are not included in data used to create a protein–organelle classifier. Currently, there is no automated way to detect organelle-specific clusters within such datasets. In order to address the above issues we adapted a phenotype discovery algorithm, originally created to filter image-based output for RNAi screens, to identify putative subcellular groupings in organelle proteomics experiments. We were able to mine datasets to a deeper level and extract interesting phenotype clusters for more comprehensive evaluation in an unbiased fashion upon application of this approach. Organelle-related protein clusters were identified beyond those sufficiently annotated for use as training data. Furthermore, we propose avenues for the incorporation of observations made into general practice for the classification of protein–organelle membership from quantitative MS experiments. BIOLOGICAL SIGNIFICANCE: Protein sub-cellular localisation plays an important role in molecular interactions, signalling and transport mechanisms. The prediction of protein localisation by quantitative mass-spectrometry (MS) proteomics is a growing field and an important endeavour in improving protein annotation. Several such approaches use gradient-based separation of cellular organelle content to measure relative protein abundance across distinct gradient fractions. 
The distribution profiles are commonly mapped in silico to known protein–organelle associations via supervised machine learning algorithms, to create classifiers that associate unannotated proteins to specific organelles. These strategies are prone to error, however, if organelle-related groupings present in experimental output are not represented, for example owing to the lack of existing annotation, when creating the protein–organelle mapping. Here, the application of a phenotype discovery approach to LOPIT gradient-based MS data identifies candidate organelle phenotypes for further evaluation in an unbiased fashion. Software implementation and usage guidelines are provided for application to wider protein–organelle association experiments. In the wider context, semi-supervised organelle discovery is discussed as a paradigm with which to generate new protein annotations from MS-based organelle proteomics experiments. This article is part of a Special Issue entitled: New Horizons and Applications for Proteomics [EuPA 2012]

    Non-classical protein secretion in bacteria

    BACKGROUND: We present an overview of bacterial non-classical secretion and a prediction method for identification of proteins following signal peptide independent secretion pathways. We have compiled a list of proteins found extracellularly despite the absence of a signal peptide. Some of these proteins also have known roles in the cytoplasm, which means they could be so-called "moonlighting" proteins having more than one function. RESULTS: A thorough literature search was conducted to compile a list of currently known bacterial non-classically secreted proteins. Pattern finding methods were applied to the sequences in order to identify putative signal sequences or motifs responsible for their secretion. We have found no signal or motif characteristic of any majority of the proteins in the compiled list of non-classically secreted proteins, and conclude that these proteins, indeed, seem to be secreted in a novel fashion. However, we also show that the apparently non-classically secreted proteins are still distinguished from cellular proteins by properties such as amino acid composition, secondary structure and disordered regions. Specifically, prediction of disorder reveals that bacterial secretory proteins are more structurally disordered than their cytoplasmic counterparts. Finally, artificial neural networks were used to construct protein feature based methods for identification of non-classically secreted proteins in both Gram-positive and Gram-negative bacteria. CONCLUSION: We present a publicly available prediction method capable of discriminating between this group of proteins and other proteins, thus allowing for the identification of novel non-classically secreted proteins. We suggest candidates for non-classically secreted proteins in Escherichia coli and Bacillus subtilis. The prediction method is available online.

    NestedMICA as an ab initio protein motif discovery tool.

    BACKGROUND: Discovering overrepresented patterns in amino acid sequences is an important step in protein functional element identification. We adapted and extended NestedMICA, an ab initio motif finder originally developed for finding transcription factor binding site motifs, to find short protein signals, and compared its performance with another popular protein motif finder, MEME. NestedMICA, an open source protein motif discovery tool written in Java, is driven by a Monte Carlo technique called Nested Sampling. It uses multi-class sequence background models to represent different "uninteresting" parts of sequences that do not contain motifs of interest. In order to assess NestedMICA as a protein motif finder, we have tested it on synthetic datasets produced by spiking instances of known motifs into a randomly selected set of protein sequences. NestedMICA was also tested using a biologically-authentic test set, where we evaluated its performance with respect to varying sequence length. RESULTS: Generally NestedMICA recovered most of the short (3-9 amino acid long) test protein motifs spiked into a test set of sequences at different frequencies. We showed that it can also be used to find multiple motifs at the same time. In all the assessment experiments we carried out, its overall motif discovery performance was better than that of MEME. CONCLUSION: NestedMICA proved itself to be a robust and sensitive ab initio protein motif finder, even for relatively short motifs that exist in only a small fraction of sequences. AVAILABILITY: NestedMICA is available under the Lesser GPL open-source license from: http://www.sanger.ac.uk/Software/analysis/nmica/ RIGHTS: This article is licensed under the BioMed Central licence at http://www.biomedcentral.com/about/license which is similar to the 'Creative Commons Attribution Licence'. 
In brief, you may: copy, distribute, and display the work; make derivative works; or make commercial use of the work, under the following conditions: the original author must be given credit; for any reuse or distribution, it must be made clear to others what the license terms of this work are.
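The NestedMICA abstract above describes synthetic test sets built by spiking instances of a known motif into a fraction of sequences. One way such a dataset could be constructed is sketched below; the function name `spike_motif`, the overwrite-in-place strategy and the fixed seed are assumptions made for illustration and are not taken from the NestedMICA evaluation code.

```python
import random

def spike_motif(sequences, motif, fraction, seed=0):
    """Overwrite a random window with `motif` in a fraction of sequences.

    Returns the new sequence list and the sorted indices chosen for
    spiking. Sequence lengths are preserved. A chosen sequence that is
    shorter than the motif is left unchanged.
    """
    rng = random.Random(seed)  # fixed seed so the dataset is reproducible
    n_spiked = round(fraction * len(sequences))
    chosen = set(rng.sample(range(len(sequences)), n_spiked))
    spiked = []
    for i, seq in enumerate(sequences):
        if i in chosen and len(seq) >= len(motif):
            pos = rng.randrange(len(seq) - len(motif) + 1)
            seq = seq[:pos] + motif + seq[pos + len(motif):]
        spiked.append(seq)
    return spiked, sorted(chosen)
```

Because the spiked indices are returned, a motif finder's output can be scored against a known ground truth at each spiking frequency.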

    Machine learning methods for omics data integration

    High-throughput technologies produce genome-scale transcriptomic and metabolomic (omics) datasets that allow for system-level studies of complex biological processes. The limitation lies in the small number of samples relative to the much larger number of features represented in these datasets. Machine learning methods can help integrate these large-scale omics datasets and identify key features from each dataset. A novel class-dependent feature selection method integrates the F statistic, maximum relevance binary particle swarm optimization (MRBPSO), and a class-dependent multi-category classification (CDMC) system. A set of highly differentially expressed genes is pre-selected using the F statistic as a filter for each dataset. MRBPSO and CDMC function as a wrapper to select desirable feature subsets for each class and classify the samples using those chosen class-dependent feature subsets. The results indicate that the class-dependent approaches can effectively identify unique biomarkers for each cancer type and improve classification accuracy compared to class-independent feature selection methods. The integration of transcriptomics and metabolomics data is based on a classification framework. Compared to principal component analysis and non-negative matrix factorization based integration approaches, our proposed method achieves 20-30% higher prediction accuracies on Arabidopsis tissue development data. Metabolite-predictive genes and gene-predictive metabolites are selected from transcriptomic and metabolomic data respectively. The constructed gene-metabolite correlation network can infer the functions of unknown genes and metabolites. Tissue-specific genes and metabolites are identified by the class-dependent feature selection method. Evidence from subcellular locations, gene ontology, and biochemical pathways supports the involvement of these entities in different developmental stages and tissues in Arabidopsis.
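The F-statistic filter step mentioned above can be illustrated with the standard one-way ANOVA F statistic, computed per feature across sample classes. This is a generic sketch of that statistic only; it does not reproduce the MRBPSO or CDMC stages, and the function name `f_statistic` is hypothetical.

```python
def f_statistic(groups):
    """One-way ANOVA F statistic for a single feature.

    `groups` is a list of lists, one list of measurements per class.
    Assumes at least two classes and non-zero within-class variance.
    """
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Variation of class means around the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of samples around their own class mean
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

Features would then be ranked by this value and only the top-scoring ones passed to the wrapper stage, which keeps the search space manageable when features far outnumber samples.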

    Prediction of eukaryotic protein subcellular multi-localisation with a combined KNN-SVM ensemble classifier

    Proteins may exist in or shift among two or more different subcellular locations, and this phenomenon is closely related to biological function. It is challenging to deal with multiple locations during eukaryotic protein subcellular localisation prediction with routine methods; therefore, a reliable and automatic ensemble classifier for protein subcellular localisation is needed. We propose a new ensemble classifier that combines the KNN (K-nearest neighbour) and SVM (support vector machine) algorithms to predict the subcellular localisation of eukaryotic proteins from their GO (gene ontology) annotations. This method was developed by fusing basic individual classifiers through a voting system. The overall prediction accuracies obtained via the jackknife test and resubstitution test were 70.5% and 77.6%, respectively, for eukaryotic proteins, which are significantly higher than those of other methods presented in previous studies and indicate that our strategy better predicts eukaryotic protein subcellular localisation.
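The voting step that fuses individual classifiers, as described above, can be sketched with a simple majority vote. The abstract does not specify the voting rule, so plain plurality voting with ties broken by classifier order is an assumption here, and `majority_vote` is a hypothetical name.

```python
from collections import Counter

def majority_vote(predictions):
    """Fuse the labels predicted by several base classifiers.

    `predictions` holds one label per base classifier (for example,
    several KNN and SVM models). The most frequent label wins; ties
    go to the label that appears earliest in the list.
    """
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label
```

For instance, three base classifiers voting "nucleus", "cytoplasm" and "nucleus" would yield "nucleus" as the ensemble prediction.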