30 research outputs found

    Going from where to why—interpretable prediction of protein subcellular localization

    Motivation: Protein subcellular localization is pivotal in understanding a protein's function. Computational prediction of subcellular localization has become a viable alternative to experimental approaches. While current machine learning-based methods yield good prediction accuracy, most of them suffer from two key problems: a lack of interpretability and difficulty dealing with multiple locations.

    Protein (Multi-)Location Prediction: Using Location Inter-Dependencies in a Probabilistic Framework

    Knowing the location of a protein within the cell is important for understanding its function, role in biological processes, and potential use as a drug target. Much progress has been made in developing computational methods that predict single locations for proteins, assuming that proteins localize to a single location. However, it has been shown that proteins localize to multiple locations. While a few recent systems have attempted to predict multiple locations of proteins, they typically treat locations as independent or capture inter-dependencies by treating each location combination present in the training set as an individual location class. We present a new method and a preliminary system we have developed that directly incorporates inter-dependencies among locations into the multiple-location-prediction process, using a collection of Bayesian network classifiers. We evaluate our system on a dataset of single- and multi-localized proteins. Our results, obtained by incorporating inter-dependencies, are significantly higher than those obtained by classifiers that do not use inter-dependencies. The performance of our system on multi-localized proteins is comparable to a top performing system (YLoc+), without restricting predictions to be based only on location combinations present in the training set. Comment: Peer-reviewed and presented as part of the 13th Workshop on Algorithms in Bioinformatics (WABI2013)

    YLoc—an interpretable web server for predicting subcellular localization

    Predicting subcellular localization has become a valuable alternative to time-consuming experimental methods. Major drawbacks of many of these predictors are their lack of interpretability and their failure to provide an estimate of the confidence of an individual prediction. We present YLoc, an interpretable web server for predicting subcellular localization. YLoc uses natural language to explain why a prediction was made and which biological property of the protein was mainly responsible for it. In addition, YLoc estimates the reliability of its own predictions. YLoc can thus assist in understanding protein localization and in location engineering of proteins. The YLoc web server is available online at www.multiloc.org/YLoc

    Evidence for the localization of the Arabidopsis cytokinin receptors AHK3 and AHK4 in the endoplasmic reticulum

    Cytokinins are hormones that are involved in various processes of plant growth and development. The model of cytokinin signalling starts with hormone perception through membrane-localized histidine kinase receptors. Although the biochemical properties and functions of these receptors have been extensively studied, there is no solid proof of their subcellular localization. Here, cell biological and biochemical evidence for the localization of functional fluorophore-tagged fusions of Arabidopsis histidine kinase 3 (AHK3) and 4 (AHK4), members of the cytokinin receptor family, in the endoplasmic reticulum (ER) is provided. Furthermore, membrane-bound AHK3 interacts with AHK4 in vivo. The ER localization and putative function of cytokinin receptors from the ER have major impacts on the concept of cytokinin perception and signalling, and hormonal cross-talk in plants.

    Minimalist Ensemble Algorithms for Genome-Wide Protein Localization Prediction

    Background: Computational prediction of protein subcellular localization can greatly help to elucidate a protein's functions. Despite the existence of dozens of protein localization prediction algorithms, prediction accuracy and coverage are still low. Several ensemble algorithms have been proposed to improve prediction performance, and these usually include 10 or more individual localization algorithms. However, their performance is still limited by the running complexity of, and redundancy among, the individual prediction algorithms. Results: This paper proposed a novel method for the rational design of minimalist ensemble algorithms for practical genome-wide protein subcellular localization prediction. The algorithm is based on combining a feature-selection-based filter and a logistic regression classifier. Using a novel concept of contribution scores, we analyzed issues of algorithm redundancy, consensus mistakes, and algorithm complementarity in designing ensemble algorithms. We applied the proposed minimalist logistic regression (LR) ensemble algorithm to two genome-wide datasets of Yeast and Human and compared its performance with current ensemble algorithms. Experimental results showed that the minimalist ensemble algorithm can achieve high prediction accuracy with only one-third to one-half of the individual predictors used by current ensemble algorithms, which greatly reduces computational complexity and running time. It was found that high-performance ensemble algorithms are usually composed of predictors that together cover most of the available features. Compared to the best individual predictor, our ensemble algorithm improved the prediction accuracy from an AUC score of 0.558 to 0.707 for the Yeast dataset and from 0.628 to 0.646 for the Human dataset. Compared with popular weighted-voting-based ensemble algorithms, our classifier-based ensemble algorithms achieved much better performance without suffering from the inclusion of too many individual predictors. Conclusions: We proposed a method for the rational design of minimalist ensemble algorithms using feature selection and classifiers. The proposed minimalist ensemble algorithm based on logistic regression can achieve equal or better prediction performance while using only half or one-third of the individual predictors compared to other ensemble algorithms. The results also suggested that meta-predictors that take advantage of a variety of features by combining individual predictors tend to achieve the best performance. The LR ensemble server and related benchmark datasets are available at http://mleg.cse.sc.edu/LRensemble/cgi-bin/predict.cgi

    TESTLoc: protein subcellular localization prediction from EST data

    Background: The eukaryotic cell has an intricate architecture with compartments and substructures dedicated to particular biological processes. Knowing the subcellular location of proteins not only indicates how bio-processes are organized in different cellular compartments, but also contributes to unravelling the function of individual proteins. Computational localization prediction is possible based on sequence information alone, and has been successfully applied to proteins from virtually all subcellular compartments and all domains of life. However, we realized that current prediction tools do not perform well on partial protein sequences such as those inferred from Expressed Sequence Tag (EST) data, limiting the exploitation of the large and taxonomically most comprehensive body of sequence information from eukaryotes. Results: We developed a new predictor, TESTLoc, suited for subcellular localization prediction of proteins based on their partial sequence conceptually translated from ESTs (EST-peptides). A Support Vector Machine (SVM) is used as the computational method, and EST-peptides are represented by different features such as amino acid composition and physicochemical properties. When TESTLoc was applied to the most challenging test case (plant data), it yielded high accuracy (~85%). Conclusions: TESTLoc is a localization prediction tool tailored for EST data. It provides a variety of models for the users to choose from, and is available for download at http://megasun.bch.umontreal.ca/~shenyq/TESTLoc/TESTLoc.html
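
As a rough illustration of the "amino acid composition" feature mentioned in the TESTLoc abstract above, the following minimal sketch turns a peptide string into a 20-dimensional vector of residue fractions. The function name, the alphabet ordering, and the choice to ignore non-standard residues are assumptions made for this example; they are not taken from TESTLoc itself.

```python
from collections import Counter
from typing import List

# The 20 standard amino acids as one-letter codes (ordering is an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(peptide: str) -> List[float]:
    """Return the fraction of each standard amino acid in a peptide.

    Hypothetical helper: characters outside the 20-letter alphabet
    (for example X or *) are ignored rather than counted.
    """
    seq = peptide.upper()
    counts = Counter(ch for ch in seq if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        # No recognizable residues: return an all-zero vector.
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]
```

A vector like this could then be passed to any standard classifier; partial EST-derived peptides yield the same fixed-length representation regardless of their length.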

    DDAG K-TIPCAC : an ensemble method for protein subcellular localization

    Protein subcellular location prediction is one of the most difficult multiclass prediction problems in modern computational biology. Many methods have been proposed in the literature to solve this problem, but all the existing approaches are affected by some limitations. In this contribution we propose a novel method for protein subcellular location prediction that performs multiclass classification by combining kernel classifiers through DDAG. Each base classifier, called K-TIPCAC, projects the points on a Fisher subspace estimated on the training data by means of a novel technique. Experimental results clearly indicated that DDAG K-TIPCAC performs as well as, if not better than, state-of-the-art ensemble methods for protein subcellular location
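
The abstract above does not spell out how pairwise classifiers are combined, so the following is only a minimal sketch, assuming DDAG refers to the usual decision directed acyclic graph scheme: the first and last remaining candidate classes are compared and the loser is eliminated, until one class is left. The nearest-center classifiers are toy stand-ins for K-TIPCAC, and all names here are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# A pairwise classifier for (a, b) returns True if x looks like class a, False for b.
Pairwise = Dict[Tuple[str, str], Callable[[float], bool]]


def make_nearest_center_classifiers(centers: Dict[str, float]) -> Pairwise:
    """Toy stand-in for trained base classifiers: prefer the closer 1-D center."""
    names = list(centers)
    table: Pairwise = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            # Bind a and b as defaults so each lambda keeps its own pair.
            table[(a, b)] = lambda x, a=a, b=b: (
                abs(x - centers[a]) <= abs(x - centers[b])
            )
    return table


def ddag_predict(x: float, classes: List[str], pairwise: Pairwise) -> str:
    """Evaluate a decision DAG: each test removes one candidate class."""
    remaining = list(classes)
    while len(remaining) > 1:
        a, b = remaining[0], remaining[-1]
        if pairwise[(a, b)](x):
            remaining.pop()      # x is "not b": drop the last candidate
        else:
            remaining.pop(0)     # x is "not a": drop the first candidate
    return remaining[0]
```

With k classes, this scheme evaluates only k - 1 of the k(k - 1)/2 pairwise classifiers per prediction, which is the main practical appeal of a DAG over full pairwise voting.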

    Integration of molecular biology tools for identifying promoters and genes abundantly expressed in flowers of Oncidium Gower Ramsey

    Background: Orchids comprise one of the largest families of flowering plants and generate commercially important flowers. However, model plants such as Arabidopsis thaliana do not contain all plant genes, and agronomically and horticulturally important genera and species must be individually studied. Results: Several molecular biology tools were used to isolate flower-specific gene promoters from Oncidium 'Gower Ramsey' (Onc. GR). A cDNA library of reproductive tissues was used to construct a microarray in order to compare gene expression in flowers and leaves. Five genes were highly expressed in flower tissues, and the subcellular locations of the corresponding proteins were identified using lip transient transformation with fluorescent protein-fusion constructs. BAC clones of the 5 genes, together with 7 previously published flower- and reproductive growth-specific genes in Onc. GR, were identified for cloning of their promoter regions. Interestingly, 3 of the 5 novel flower-abundant genes were putative trypsin inhibitor (TI) genes (OnTI1, OnTI2 and OnTI3), which were tandemly duplicated in the same BAC clone. Their promoters were identified using transient GUS reporter gene transformation and stable A. thaliana transformation analyses. Conclusions: By combining cDNA microarray, BAC library, and bombardment assay techniques, we successfully identified flower-directed orchid genes and promoters.

    Imbalanced Multi-Modal Multi-Label Learning for Subcellular Localization Prediction of Human Proteins with Both Single and Multiple Sites

    It is well known that an important step toward understanding the functions of a protein is to determine its subcellular location. Although numerous prediction algorithms have been developed, most of them have focused on proteins with only one location. In recent years, researchers have begun to pay attention to subcellular localization prediction for proteins with multiple sites. However, almost all the existing approaches fail to take into account the correlations among locations that arise from proteins with multiple sites, which may be important information for improving prediction accuracy on such proteins. In this paper, a new algorithm that can effectively exploit the correlations among the locations is proposed, using a Gaussian process model. In addition, the algorithm can realize an optimal linear combination of various feature extraction techniques and can be robust to imbalanced data sets. Experimental results on a human protein data set show that the proposed algorithm is valid and can achieve better performance than the existing approaches

    Compartmentation of Redox Metabolism in Malaria Parasites

    Malaria, caused by the apicomplexan parasite Plasmodium, still represents a major threat to human health and welfare and leads to about one million human deaths annually. Plasmodium is a rapidly multiplying unicellular organism undergoing a complex developmental cycle in man and mosquito – a lifestyle that requires rapid adaptation to various environments. In order to deal with high fluxes of reactive oxygen species and maintain redox regulatory processes and pathogenicity, Plasmodium depends upon an adequate redox balance. By systematically studying the subcellular localization of the major antioxidant and redox regulatory proteins, we obtained the first complete map of redox compartmentation in Plasmodium falciparum. We demonstrate the targeting of two plasmodial peroxiredoxins and a putative glyoxalase system to the apicoplast, a non-photosynthetic plastid. We furthermore obtained a complete picture of the compartmentation of thioredoxin- and glutaredoxin-like proteins. Notably, for the two major antioxidant redox-enzymes – glutathione reductase and thioredoxin reductase – Plasmodium makes use of alternative-translation-initiation (ATI) to achieve differential targeting. Dual localization of proteins effected by ATI is likely to occur also in other Apicomplexa and might open new avenues for therapeutic intervention