616 research outputs found

    Computational Prediction of O-linked Glycosylation Sites That Preferentially Map on Intrinsically Disordered Regions of Extracellular Proteins

    O-glycosylation is an important post-translational modification of mammalian proteins. To elucidate the O-glycosylation mechanism, we applied a support vector machine (SVM) to predict whether a given Ser or Thr residue is glycosylated. O-glycosylated sites were often found clustered along the sequence, whereas other sites were located sporadically. We therefore developed two types of SVMs to predict clustered and isolated sites separately. Amino acid composition was effective for predicting the clustered type, whereas the site-specific algorithm was effective for the isolated type. The highest prediction accuracy was 74% for the clustered type and 79% for the isolated type. The frequency of amino acids around the O-glycosylation sites differed between the two types: Pro, Val and Ala occurred with high probability at specific positions relative to a glycosylation site, especially for the isolated type. Independent component analyses of the amino acid sequences around O-glycosylation sites identified the position-specific occurrences of these amino acids as independent components. The O-glycosylation sites were preferentially located within intrinsically disordered regions of extracellular proteins; in particular, more than 90% of the clustered O-GalNAc glycosylation sites were observed in intrinsically disordered regions. This feature could be the key to understanding the non-conservation of O-glycosylation and its role in functional diversity and structural stability.
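
    As a rough illustration of the amino acid composition features mentioned in the abstract above, the sketch below builds a 20-dimensional composition vector from a window around each Ser/Thr residue. The window half-width, the function names and the toy sequence are assumptions made for this example; the abstract does not specify the authors' actual feature encoding or SVM settings.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order for the feature vector.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def candidate_sites(sequence):
    """Return 0-based positions of Ser/Thr residues (potential O-glycosylation sites)."""
    return [i for i, aa in enumerate(sequence) if aa in "ST"]


def window_composition(sequence, site, half_width=10):
    """Fraction of each amino acid in the window around `site`, excluding the site itself.

    `half_width` is a hypothetical parameter; windows are truncated at the sequence ends.
    """
    start = max(0, site - half_width)
    end = min(len(sequence), site + half_width + 1)
    window = sequence[start:site] + sequence[site + 1:end]
    counts = Counter(window)
    total = len(window) or 1  # avoid division by zero for a one-residue sequence
    return [counts.get(aa, 0) / total for aa in AMINO_ACIDS]


# Toy sequence, not taken from any real protein.
toy = "MAPTSPTTAPGSTLVK"
features = {i: window_composition(toy, i, half_width=3) for i in candidate_sites(toy)}
```

    Each vector in `features` could then be passed to any standard classifier. A site-specific encoding, which the abstract reports as more effective for isolated sites, would instead keep one feature per window position rather than pooling counts.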

    Predicting Flavonoid UGT Regioselectivity

    Machine learning was applied to a challenging and biologically significant protein classification problem: the prediction of flavonoid UGT acceptor regioselectivity from primary sequence. Novel indices characterizing graphical models of residues were proposed and found to be widely distributed among existing amino acid indices and to cluster residues appropriately. UGT subsequences biochemically linked to regioselectivity were modeled as sets of index sequences. Several learning techniques incorporating these UGT models were compared with classifications based on standard sequence alignment scores. These techniques included an application of time series distance functions to protein classification. Time series distances defined on the index sequences were used in nearest neighbor and support vector machine classifiers. Additionally, Bayesian neural network classifiers were applied to the index sequences. The experiments identified improvements over the nearest neighbor and support vector machine classifications relying on standard alignment similarity scores, as well as strong correlations between specific subsequences and regioselectivities.
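
    To make the idea of time series distances on index sequences concrete, here is a minimal sketch that maps residues to a numeric index and classifies a query by its nearest neighbor under dynamic time warping (DTW). Two things are assumptions of this example rather than details from the abstract: the Kyte-Doolittle hydropathy scale stands in for the authors' novel graphical indices, and DTW stands in for whichever time series distances they used. The subsequences and labels are toy values.

```python
# Kyte-Doolittle hydropathy values, used here only as a stand-in amino acid index.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def index_sequence(subsequence):
    """Convert a residue string into a numeric series using the chosen index."""
    return [KYTE_DOOLITTLE[aa] for aa in subsequence]


def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping with absolute-difference cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def nearest_neighbor_label(query, labelled_examples):
    """Return the label of the training subsequence closest to `query` under DTW."""
    q = index_sequence(query)
    _, label = min(
        labelled_examples,
        key=lambda item: dtw_distance(q, index_sequence(item[0])),
    )
    return label


# Toy training set: (subsequence, hypothetical regioselectivity label).
training = [("LLVVII", "3-O"), ("DDEEKK", "7-O")]
predicted = nearest_neighbor_label("LVVI", training)
```

    DTW tolerates insertions and stretches, so subsequences of different lengths can be compared without a prior alignment.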

    Systems Biology of Protein Secretion in Human Cells: Multi-omics Analysis and Modeling of the Protein Secretion Process in Human Cells and its Application.

    Since the emergence of modern biotechnology, the production of recombinant pharmaceutical proteins has been an expanding field with high demand from industry. Pharmaceutical proteins have constituted the majority of top-selling drugs in the pharma industry during recent years. Many of these proteins require post-translational modifications and are therefore produced using mammalian cells such as Chinese Hamster Ovary cells. Despite frequent improvements in developing efficient cell factories for producing recombinant proteins, the natural complexity of the protein secretion process still poses serious challenges for the production of some proteins at the desired quantity and acceptable quality. These challenges have been intensified by the growing demands of the pharma industry to produce novel products with greater structural complexity, as well as increasing expectations from regulatory authorities in the form of new quality control criteria to guarantee product safety.
    This thesis focuses on different aspects of the protein secretion process, including its engineering for cell factory development and its analysis in diseases associated with its deregulation. A major part of this thesis involved the use of HEK293 cells as a human model cell line for investigating the protein secretion process by generating different types of omics data and developing a computational model of the human protein secretion pathway. We compared the transcriptomic profiles of cell lines producing erythropoietin (EPO; as a model secretory protein) at different rates to identify key genes that potentially contributed to higher rates of protein secretion. Moreover, by performing a transcriptomic comparison of cells producing green fluorescent protein (GFP; as a model non-secretory protein) with EPO producers, we captured differences that specifically relate to secretory protein production.
    We sought to further investigate the factors contributing to increased recombinant protein production by analyzing additional omics layers, such as proteomics and metabolomics, in cells that exhibited different rates of EPO production. Moreover, we developed a toolbox (HumanSec) to extend the reference human genome-scale metabolic model (Human1) to encompass protein-specific reactions for each secretory protein detected in our proteomics dataset. By generating cell-line-specific protein secretion models and constraining the models using metabolomics data, we could predict the top host cell proteins (HCPs) that compete with EPO for metabolic and energetic resources. Finally, based on the detected patterns of changes in our multi-omics investigations combined with a protein secretion sensitivity analysis using the metabolic model, we identified a list of genes and pathways that potentially play a key role in recombinant protein production and could serve as promising candidates for targeted cell factory design.
    In another part of the thesis, we studied the link between the expression profiles of genes involved in the protein secretory pathway (PSP) and various hallmarks of cancer. By implementing a dual approach involving differential expression analysis and eight different machine learning algorithms, we investigated the expression changes in secretory pathway components across different cancer types to identify PSP genes whose expression was associated with tumor characteristics. We demonstrated that machine learning and differential expression analysis are complementary and that, combined, they can highlight key PSP components relevant to features of tumor pathophysiology that may constitute potential therapeutic targets.

    Data mining techniques for protein sequence analysis

    This thesis concerns two areas of bioinformatics related by their role in protein structure and function: protein structure prediction and post-translational modification of proteins. The dihedral angles Ψ and Φ are predicted using support vector regression. For the prediction of Ψ dihedral angles, the addition of structural information and the normalisation of Ψ and Φ dihedral angles are examined. An application of the dihedral angles is also investigated. The relationship between dihedral angles and three-bond J couplings determined from NMR experiments is described by the Karplus equation. We investigate the determination of the correct solution of the Karplus equation using predicted Φ dihedral angles. Glycosylation is an important post-translational modification of proteins involved in many different facets of biology. The work here investigates the prediction of N-linked and O-linked glycosylation sites using the random forest machine learning algorithm and pairwise patterns in the data. This methodology produces more accurate results than state-of-the-art prediction methods. The black-box nature of random forest is addressed by using the TREPAN algorithm to generate a decision tree with comprehensible rules that represents the decision-making process of the random forest. The prediction of our program GPP does not distinguish between glycans at a given glycosylation site. We use farthest-first clustering, with the idea of classifying each glycosylation site by the sugar linking the glycan to the protein. This thesis demonstrates the prediction of protein backbone torsion angles and improves the current state of the art for the prediction of glycosylation sites. It also investigates potential applications and the interpretation of these methods.
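
    The Karplus relation referred to above has the general form J(Φ) = A cos²(θ) + B cos(θ) + C with θ = Φ + phase, so a single measured coupling can correspond to up to four Φ values. The sketch below enumerates those candidates and keeps the one closest to a predicted Φ. The coefficients and the -60° phase are an assumption of this example: they are a commonly cited parameterization for the HN-Hα coupling (often attributed to Vuister and Bax) and are not stated in the abstract. The function names are likewise hypothetical.

```python
import math

# Assumed Karplus coefficients in Hz for 3J(HN, H-alpha); not taken from the thesis.
A, B, C = 6.51, -1.76, 1.60
PHASE_DEG = -60.0  # theta = phi + PHASE_DEG for this coupling


def karplus_j(phi_deg):
    """Coupling constant (Hz) predicted by the Karplus equation for a given phi."""
    theta = math.radians(phi_deg + PHASE_DEG)
    return A * math.cos(theta) ** 2 + B * math.cos(theta) + C


def wrap_deg(angle):
    """Map an angle in degrees onto the interval [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0


def phi_solutions(j_hz):
    """All phi values (degrees) consistent with a measured coupling, up to four."""
    disc = B * B - 4.0 * A * (C - j_hz)
    if disc < 0:
        return []  # coupling lies below the minimum of the Karplus curve
    solutions = set()
    for sign in (1.0, -1.0):
        x = (-B + sign * math.sqrt(disc)) / (2.0 * A)  # x = cos(theta)
        if -1.0 <= x <= 1.0:
            theta = math.degrees(math.acos(x))
            for t in (theta, -theta):
                solutions.add(round(wrap_deg(t - PHASE_DEG), 6))
    return sorted(solutions)


def pick_solution(j_hz, predicted_phi_deg):
    """Choose the Karplus solution with the smallest circular distance to the prediction."""
    candidates = phi_solutions(j_hz)
    if not candidates:
        return None
    return min(candidates, key=lambda phi: abs(wrap_deg(phi - predicted_phi_deg)))


# Example: a coupling generated from phi = -60 degrees, disambiguated by a rough prediction.
measured = karplus_j(-60.0)
resolved = pick_solution(measured, predicted_phi_deg=-70.0)
```

    The predicted angle only needs to be accurate enough to land nearer the true solution than the alternatives, which is why even a coarse regression estimate can resolve the ambiguity.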

    Predicting Flavonoid UGT Regioselectivity with Graphical Residue Models and Machine Learning.

    Machine learning is applied to a challenging and biologically significant protein classification problem: the prediction of flavonoid UGT acceptor regioselectivity from primary protein sequence. Novel indices characterizing graphical models of protein residues are introduced. The indices are compared with existing amino acid indices and found to cluster residues appropriately. A variety of models employing the indices are then investigated by examining their performance when analyzed using nearest neighbor, support vector machine, and Bayesian neural network classifiers. Improvements over nearest neighbor classifications relying on standard alignment similarity scores are reported

    Glycoproteins and Glycosylation Site Assignments in Cereal seed Proteomes

    SiteSeek: Post-translational modification analysis using adaptive locality-effective kernel methods and new profiles

    Background: Post-translational modifications have a substantial influence on the structure and function of proteins. Phosphorylation is one of the most common modifications that occur in intracellular proteins. Accurate prediction of protein phosphorylation sites is of great importance for understanding diverse cellular signalling processes in both humans and animals. In this study, we propose a new machine-learning-based protein phosphorylation site predictor, SiteSeek. SiteSeek is trained using a novel compact evolutionary and hydrophobicity profile to detect possible protein phosphorylation sites for a target sequence. The newly proposed method proves to be more accurate and exhibits much more stable predictive performance than existing phosphorylation site predictors.
    Results: The performance of the proposed model was compared to nine different existing machine learning models and four widely known phosphorylation site predictors on the newly proposed PS-Benchmark_1 dataset to contrast their accuracy, sensitivity, specificity and correlation coefficient. SiteSeek showed better predictive performance, with 86.6% accuracy, 83.8% sensitivity, 92.5% specificity and a correlation coefficient of 0.77 on the four main kinase families (CDK, CK2, PKA, and PKC).
    Conclusion: The newly proposed methods used in SiteSeek were shown to be useful for the identification of protein phosphorylation sites, as SiteSeek performed much better than widely known predictors on the newly built PS-Benchmark_1 dataset.
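
    For readers unfamiliar with the four measures reported above, the following sketch computes them from confusion-matrix counts. It assumes the "correlation coefficient" is the Matthews correlation coefficient, which is the usual choice for binary site predictors but is not confirmed by the abstract. The counts in the example are invented for illustration and are not SiteSeek results.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, specificity and Matthews correlation coefficient."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)  # fraction of true sites that were detected
    specificity = tn / (tn + fp)  # fraction of non-sites correctly rejected
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }


# Hypothetical counts, chosen only to exercise the formulas.
metrics = classification_metrics(tp=40, fp=5, tn=45, fn=10)
```

    Unlike accuracy, the correlation coefficient stays informative when positive and negative sites are heavily imbalanced, which is common in modification-site datasets.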

    Characterization of the Isoforms of the Multiple Sclerosis Risk Protein, IL-22 Binding Protein (IL-22BP)

    262 p. The human IL22RA2 gene co-produces three protein isoforms in dendritic cells (IL-22 binding protein isoform-1 [IL-22BPi1], -2 [IL-22BPi2], and -3 [IL-22BPi3]). Two of these, namely IL-22BPi2 and IL-22BPi3, are capable of neutralizing the biological activity of IL-22. The function of IL-22BPi1, which differs from IL-22BPi2 through an in-frame, 32-amino-acid insertion provided by an alternatively spliced exon, remains unknown. This thesis focuses on the biochemical characterization of the three isoforms (in silico and in vitro), as well as their potential pharmacological targeting. Additionally, it addresses the functionality of the non-synonymous single nucleotide polymorphism (SNP) rs28385692, which is located in IL22RA2 and has been found to be associated with multiple sclerosis (MS).
    The overall aim of the thesis is to understand the function and fate of IL-22BP through the biochemical characterization of its isoforms. The specific objectives are: to study the expression of IL22RA2 transcripts and relate it to the secretion of the isoforms through their biochemical characterization, to identify the key factors involved in their folding and secretion, and to investigate their secretion and function within the cell; to evaluate whether the identified chaperones could be used as therapeutic targets with differential effects on the secretion of the IL-22BP isoforms; and to determine whether, and to what extent, the MS risk variant rs28385692 has functional effects on the IL-22BP isoforms.