201 research outputs found

    AMS 3.0: prediction of post-translational modifications

    Background: We present a recent update of the AMS algorithm for identifying post-translational modification (PTM) sites in proteins from sequence information alone, using artificial neural networks (ANNs). The query protein sequence is dissected into overlapping short sequence segments. Ten different physicochemical features describe each amino acid, so a nine-residue segment is represented as a point in a 90-dimensional space. A database of sequence segments with experimentally confirmed post-translational modification sites is used to train a set of ANNs. Results: The classification efficiency for each type of modification and the predictive power of the method are estimated using recall (sensitivity), precision, the area under the receiver operating characteristic (ROC) curve, and leave-one-out cross-validation (LOOCV). Significant differences in performance are observed between differently optimized neural networks, yet the AMS 3.0 tool integrates these heterogeneous classification schemes into a single consensus scheme and is able to boost precision and recall, independent of PTM type, in comparison with currently available state-of-the-art methods. Conclusions: The standalone version of AMS 3.0 offers an efficient way to identify post-translational modifications for whole proteomes. The training datasets, precompiled binaries for the AMS 3.0 tool, and the source code are available at http://code.google.com/p/automotifserver under the Apache 2.0 license.
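
The segment encoding described in the AMS 3.0 abstract (overlapping nine-residue windows, ten features per residue, 90 values per window) can be illustrated with a minimal sketch. The function name `encode_segments` and the placeholder feature table are hypothetical and are not taken from the AMS source code; the real tool uses ten physicochemical properties, which are not listed in the abstract.

```python
from typing import Dict, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Placeholder table: ten dummy values per amino acid (assumption for
# illustration only; real physicochemical features would go here).
PLACEHOLDER_FEATURES: Dict[str, List[float]] = {
    aa: [float(index)] * 10 for index, aa in enumerate(AMINO_ACIDS)
}


def encode_segments(
    sequence: str,
    features: Dict[str, List[float]],
    window: int = 9,
) -> List[List[float]]:
    """Return one flat feature vector per overlapping window of residues."""
    vectors = []
    for start in range(len(sequence) - window + 1):
        segment = sequence[start:start + window]
        vector: List[float] = []
        for residue in segment:
            # Concatenate the per-residue features in sequence order.
            vector.extend(features[residue])
        vectors.append(vector)
    return vectors


# A 12-residue toy sequence yields 12 - 9 + 1 = 4 windows of 9 * 10 = 90 values.
vectors = encode_segments("ACDEFGHIKLMN", PLACEHOLDER_FEATURES)
print(len(vectors), len(vectors[0]))
```

Each resulting vector is one point in the 90-dimensional space mentioned in the abstract and could serve as input to a classifier.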

    Prediction of protein structural features by use of artificial neural networks

    SiteSeek: Post-translational modification analysis using adaptive locality-effective kernel methods and new profiles

    Background: Post-translational modifications have a substantial influence on the structure and function of proteins. Phosphorylation is one of the most common modifications occurring in intracellular proteins. Accurate prediction of protein phosphorylation sites is of great importance for understanding diverse cellular signalling processes in both humans and animals. In this study, we propose a new machine-learning-based protein phosphorylation site predictor, SiteSeek. SiteSeek is trained using a novel compact evolutionary and hydrophobicity profile to detect possible protein phosphorylation sites in a target sequence. The proposed method proves to be more accurate and exhibits much more stable predictive performance than existing phosphorylation site predictors. Results: The performance of the proposed model was compared with nine existing machine learning models and four widely known phosphorylation site predictors on the newly proposed PS-Benchmark_1 dataset, contrasting their accuracy, sensitivity, specificity, and correlation coefficient. SiteSeek showed better predictive performance, with 86.6% accuracy, 83.8% sensitivity, 92.5% specificity, and a correlation coefficient of 0.77 on the four main kinase families (CDK, CK2, PKA, and PKC). Conclusion: The methods used in SiteSeek were shown to be useful for identifying protein phosphorylation sites, as SiteSeek performed much better than widely known predictors on the newly built PS-Benchmark_1 dataset.

    Predicting Flavonoid UGT Regioselectivity with Graphical Residue Models and Machine Learning.

    Machine learning is applied to a challenging and biologically significant protein classification problem: the prediction of flavonoid UGT acceptor regioselectivity from primary protein sequence. Novel indices characterizing graphical models of protein residues are introduced. The indices are compared with existing amino acid indices and found to cluster residues appropriately. A variety of models employing the indices are then investigated by examining their performance with nearest neighbor, support vector machine, and Bayesian neural network classifiers. Improvements over nearest neighbor classifications relying on standard alignment similarity scores are reported.

    Mapping the proteome with data-driven methods: A cycle of measurement, modeling, hypothesis generation, and engineering

    The living cell exhibits emergence of complex behavior and its modeling requires a systemic, integrative approach if we are to thoroughly understand and harness it. The work in this thesis has had the more narrow aim of quantitatively characterizing and mapping the proteome using data-driven methods, as proteins perform most functional and structural roles within the cell. Covered are the different parts of the cycle from improving quantification methods, to deriving protein features relying on their primary structure, predicting the protein content solely from sequence data, and, finally, to developing theoretical protein engineering tools, leading back to experiment. High-throughput mass spectrometry platforms provide detailed snapshots of a cell's protein content, which can be mined towards understanding how the phenotype arises from genotype and the interplay between the various properties of the constituent proteins. However, these large and dense data present an increased analysis challenge and current methods capture only a small fraction of signal. The first part of my work has involved tackling these issues with the implementation of a GPU-accelerated and distributed signal decomposition pipeline, making factorization of large proteomics scans feasible and efficient. The pipeline yields individual analyte signals spanning the majority of acquired signal, enabling high precision quantification and further analytical tasks. Having such detailed snapshots of the proteome enables a multitude of undertakings. One application has been to use a deep neural network model to learn the amino acid sequence determinants of temperature adaptation, in the form of reusable deep model features. More generally, systemic quantities may be predicted from the information encoded in sequence by evolutionary pressure. Two studies taking inspiration from natural language processing have sought to learn the grammars behind the languages of expression, in one case predicting mRNA levels from DNA sequence, and in the other protein abundance from amino acid sequence. These two models helped build a quantitative understanding of the central dogma and, furthermore, in combination yielded an improved predictor of protein amount. Finally, a mathematical framework relying on the embedded space of a deep model has been constructed to assist guided mutation of proteins towards optimizing their abundance.

    Multidimensional Feature Engineering for Post-Translational Modification Prediction Problems

    Protein sequence data has been produced at an astounding speed. This creates an opportunity to characterize these proteins for the treatment of illness. A crucial characterization of proteins is their post-translational modifications (PTMs). There are 20 amino acids coded by DNA; after translation, nearly every protein is modified at the amino acid level. We focus on three specific PTMs. The first is the bond formed between two cysteine amino acids, which introduces a loop into the straight chain of a protein. Second, we predict which cysteines can generally be modified (oxidized). Finally, we predict which lysine amino acids are modified by the active form of vitamin B6 (PLP, pyridoxal-5-phosphate). Our work aims to predict PTMs from protein sequence data. When available, we integrate other data sources to improve prediction. Data mining finds patterns in data and uses these patterns to assign a confidence score to unknown PTMs. There are many steps to data mining; however, our focus is on the feature engineering step, i.e. transforming raw data into an intelligible form for a prediction algorithm. Our primary innovations are as follows. First, we created the Local Similarity Matrix (LSM), a description of the evolutionary relatedness of a cysteine and its neighboring amino acids. This feature is taken two at a time and template-matched to other cysteine pairs. If the pairs are similar, we assign a high probability that they share the same bonding state. LSM is a three-step algorithm: 1) a matrix of amino acid probabilities is created for each cysteine and its neighbors from an alignment; 2) each of the corresponding amino acids is multiplied by the square of the BLOSUM62 matrix diagonal; 3) the matrix is z-score normalized by row. Next, we developed the Residue Adjacency Matrix (RAM) for sequential and 3-D space (integration of protein coordinate data). This matrix describes a cysteine's neighbors, but at much greater distances than most algorithms. It is particularly effective at finding conserved residues that are further away while still remaining a compact description; more data than necessary incurs the curse of dimensionality. RAM runs in O(n) time, making it very useful for large datasets. Finally, we produced the Windowed Alignment Scoring algorithm (WAS). This is a vector of protein window alignment bit scores, where the alignments are one-to-all. We then apply dimensionality reduction for gains in speed and performance. WAS uses the BLAST algorithm to align sequences within a window surrounding potential PTMs, in this case PLP attached to lysine. We tried many alignment algorithms and used the approximation that BLAST provides to reduce computational time from months to days. The performance of the different alignment algorithms did not vary significantly. The applications of this work are many. It has been shown that cysteine bonding configurations play a critical role in the folding of proteins. Solving the protein folding problem will help us find a solution to Alzheimer's disease, which is due to misfolding of the amyloid-beta protein. Cysteine oxidation has been shown to play a role in oxidative stress, a situation in which free radicals become too abundant in the body. Oxidative stress leads to chronic illnesses such as diabetes, cancer, heart disease, and Parkinson's. Lysine in concert with PLP catalyzes the aminotransferase reaction. Research suggests that anti-cancer drugs could selectively inhibit this reaction. Others have targeted this reaction for the treatment of epilepsy and addictions.
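
The row-wise z-score normalization named as the third LSM step can be sketched as follows. This is an illustration under stated assumptions, not the thesis implementation: the function name `zscore_rows` is hypothetical, the population standard deviation is assumed, and constant rows are mapped to zeros, a choice the abstract does not specify.

```python
import statistics
from typing import List


def zscore_rows(matrix: List[List[float]]) -> List[List[float]]:
    """Normalize each row to zero mean and unit (population) variance."""
    normalized = []
    for row in matrix:
        mean = statistics.mean(row)
        spread = statistics.pstdev(row)
        if spread == 0:
            # Assumption: a constant row carries no signal, so emit zeros.
            normalized.append([0.0] * len(row))
        else:
            normalized.append([(value - mean) / spread for value in row])
    return normalized


# Toy matrix: one varying row and one constant row.
normalized = zscore_rows([[1.0, 2.0, 3.0], [4.0, 4.0, 4.0]])
print(normalized)
```

Normalizing by row puts each position's scores on a common scale, so rows with different raw magnitudes can be compared directly.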

    Predicting and analyzing HIV-1 adaptation to broadly neutralizing antibodies and the host immune system using machine learning

    Thanks to its extraordinarily high mutation and replication rate, the human immunodeficiency virus type 1 (HIV-1) is able to rapidly adapt to the selection pressure imposed by the host immune system or antiretroviral drug exposure. With neither a cure nor a vaccine at hand, viral control is a major pillar in combating the HIV-1 pandemic. Without drug exposure, interindividual differences in viral control are partly influenced by host genetic factors such as the human leukocyte antigen (HLA) system, and by viral genetic factors such as the predominant coreceptor usage of the virus. Thus, close monitoring of the viral population within patients and adjustments to treatment regimens, as well as continuous development of new drug components, are indispensable measures to counteract the emergence of viral escape variants. To this end, a fast and accurate determination of viral adaptation is essential for successful treatment. This thesis is based upon four studies that aim to develop and apply statistical learning methods to (i) predict adaptation of the virus to broadly neutralizing antibodies (bNAbs), a promising new treatment option, (ii) advance antibody-mediated immunotherapy for clinical usage, and (iii) predict viral adaptation to the HLA system to further understand the switch in HIV-1 coreceptor usage. In total, this thesis comprises several statistical learning approaches to predict HIV-1 adaptation, thereby enabling better control of HIV-1 infections.