52 research outputs found

    Improving PTM Site Prediction by Coupling of Multi-Granularity Structure and Multi-Scale Sequence Representation

    Protein post-translational modification (PTM) site prediction is a fundamental task in bioinformatics. Several computational methods have been developed to predict PTM sites. However, existing methods ignore structure information and rely solely on protein sequences. Furthermore, a more fine-grained structure representation learning method is urgently needed, as PTM is a biological event that occurs at the atom granularity. In this paper, we propose a PTM site prediction method by Coupling of Multi-Granularity structure and Multi-Scale sequence representation, PTM-CMGMS for brevity. Specifically, multi-granularity structure-aware representation learning is designed to learn neighborhood structure representations at the amino acid, atom, and whole protein granularity from AlphaFold predicted structures, followed by contrastive learning to optimize the structure representations. Additionally, multi-scale sequence representation learning is used to extract context sequence information, and a motif generated by aligning all context sequences of PTM sites assists the prediction. Extensive experiments on three datasets show that PTM-CMGMS outperforms the state-of-the-art methods.

    iLearnPlus: a comprehensive and automated machine-learning platform for nucleic acid and protein sequence analysis, prediction and visualization

    Sequence-based analysis and prediction are fundamental bioinformatic tasks that facilitate understanding of the sequence(-structure)-function paradigm for DNAs, RNAs and proteins. Rapid accumulation of sequences requires equally pervasive development of new predictive models, which depends on the availability of effective tools that support these efforts. We introduce iLearnPlus, the first machine-learning platform with graphical- and web-based interfaces for the construction of machine-learning pipelines for analysis and predictions using nucleic acid and protein sequences. iLearnPlus provides a comprehensive set of algorithms and automates sequence-based feature extraction and analysis, construction and deployment of models, assessment of predictive performance, statistical analysis, and data visualization; all without programming. iLearnPlus includes a wide range of feature sets which encode information from the input sequences and over twenty machine-learning algorithms that cover several deep-learning approaches, outnumbering the current solutions by a wide margin. Our solution caters to experienced bioinformaticians, given the broad range of options, and biologists with no programming background, given the point-and-click interface and easy-to-follow design process. We showcase iLearnPlus with two case studies concerning prediction of long noncoding RNAs (lncRNAs) from RNA transcripts and prediction of crotonylation sites in protein chains. iLearnPlus is an open-source platform available at https://github.com/Superzchen/iLearnPlus/ with the webserver at http://ilearnplus.erc.monash.edu/. Zhen Chen, Pei Zhao, Chen Li, Fuyi Li, Dongxu Xiang, Yong-Zi Chen, Tatsuya Akutsu, Roger J. Daly, Geoffrey I. Webb, Quanzhi Zhao, Lukasz Kurgan, and Jiangning Son

    Development of language modelling techniques for protein sequence analysis

    Master's dissertation in Bioinformatics. The ability to predict protein functions directly from amino-acid sequences alone remains a major biological challenge. Understanding protein properties and functions is extremely important and can have a wide range of biotechnological and medical applications. Technological advances have led to an exponential growth of biological data, challenging conventional analysis strategies. High-level representations from the field of deep learning can provide new alternatives to address these problems. In particular, NLP methods such as word embeddings (WE) have shown notable success when applied to protein sequence analysis. Here, a module that eases the implementation of word embedding models for protein representation and classification is presented. This module was integrated into the ProPythia framework, making it straightforward to combine WE representations with the training and testing of ML and DL models. The module was validated using two protein classification problems, namely identification of plant ubiquitylation sites and lysine crotonylation site prediction, and was further used to explore enzyme functional annotation. Several WE were tested and fed to different ML and DL networks. Overall, WE achieved good results, being competitive even with state-of-the-art models, which reinforces the idea that language-based methods can be applied with success to a wide range of protein classification problems. This work presents a freely available tool to perform word embedding techniques for protein classification. The case studies presented reinforce the usability and importance of NLP and ML in protein classification problems.

    Stacking-ac4C: an ensemble model using mixed features for identifying n4-acetylcytidine in mRNA

    N4-acetylcytidine (ac4C) is a modification of cytidine at the nitrogen-4 position that plays a significant role in the translation of mRNA. However, the precise mechanism by which ac4C affects translated mRNA remains unclear. Since identifying ac4C sites using conventional experimental methods is both labor-intensive and time-consuming, there is an urgent need for a method that can promptly recognize ac4C sites. In this paper, we propose a comprehensive ensemble learning model, the Stacking-based heterogeneous integrated ac4C model, engineered explicitly to identify ac4C sites. The model integrates three distinct feature extraction methodologies: Kmer, electron-ion interaction pseudo-potential values (PseEIIP), and pseudo-K-tuple nucleotide composition (PseKNC). The model also incorporates the Cluster Centroids algorithm to improve its performance on imbalanced data and alleviate underfitting issues. Our independent testing experiments indicate that the proposed model improves the MCC by 15.61% and the ROC by 5.97% compared to existing models. To test our model's adaptability, we also used a balanced dataset assembled by the authors of iRNA-ac4C. On this balanced dataset, our model showed an increase in Sn of 4.1%, an increase in Acc of nearly 1%, and a ROC improvement of 0.35%. The code for our model is freely accessible at https://github.com/louliliang/ST-ac4C.git, allowing users to quickly build their model without dealing with complicated mathematical equations.
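
    The Kmer encoding named in the abstract above can be illustrated with a minimal sketch. This is not the authors' code: the function name `kmer_frequencies`, the default `k=3`, the RNA alphabet `ACGU`, and the choice to skip windows containing other characters are all assumptions made for illustration.

```python
from itertools import product


def kmer_frequencies(seq, k=3, alphabet="ACGU"):
    """Return normalized k-mer frequencies for a nucleotide sequence.

    Hypothetical helper: the feature order follows itertools.product over
    `alphabet`, and windows with characters outside `alphabet` are skipped.
    """
    # Treat DNA-style input as RNA so both notations map to the same features.
    seq = seq.upper().replace("T", "U")
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    # Slide a window of width k over the sequence and count each k-mer.
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:
            counts[window] += 1
            total += 1
    # Normalize by the number of counted windows (all zeros if none counted).
    return [counts[m] / total if total else 0.0 for m in kmers]
```

    For an alphabet of four letters the vector has 4^k entries, so `k=3` gives 64 features per sequence. Such a vector would then be concatenated with other encodings before being passed to a classifier.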

    Prevention and prediction of production instability of CHO-K1 cell lines by the examination of epigenetic mechanisms

    The CHO-K1 cell line is the most common expression system for therapeutic proteins in the pharmaceutical industry. For economic reasons, the cell lines and the vector design are subject to constant change to increase product quality and quantity. During cultivation, production cell lines are susceptible to decreasing productivity over time. Often the loss of production can be associated with a reduction of copy number and the silencing of transgenes. During cell line development, the most promising cell lines are cultivated in large batch culture. Consequently, the loss of a stable production cell line can be very cost-intensive. For this reason, I developed different strategies to avoid reduced productivity. Instability of production cell lines can be predicted by the degree of CpG methylation of the driving promoter. Considering that DNA methylation is at the end of an epigenetic cascade and associated with the maintenance of the repressive state, I investigated the upstream signals of histone modifications on the assumption that they would offer a higher predictive power for production instability. To this end, I performed chromatin immunoprecipitation of the histone modifications H3K9me3 and H3K27me3 as repressive signals and H3ac as well as H3K4me3 as active marks. The accumulation of these signals was measured close to the hCMV-MIE at the beginning of the cultivation and then compared with the loss of productivity over two months. I found that the degree of H3 acetylation (H3ac) correlated best with production stability. Furthermore, I was able to identify an H3ac threshold to exclude most of the unstable producers. In the second project, I aimed to improve the vector design by considering epigenetic mechanisms. On the one hand, I designed a target-oriented histone acetyltransferase to enforce an open and active chromatin status at the transgene. On the other hand, I point-mutated methylation-susceptible CpGs within the hCMV-MIE to impede the maintenance of inactive heterochromatin formation. Remarkably, the C to G mutation located 179 bp upstream of the transcription start site resulted in very stable antibody-producing cell lines. In addition, the examination of cell pools expressing eGFP showed that G-179 promoter variants were less prone to general methylation and gene amplification, which illustrates the dominating effect of one single CpG in epigenetic mechanisms. The last project was performed to localize stable integration sites within the CHO-K1 genome. I could show that transfection leads predominantly to integration into inactive regions. Furthermore, I identified promising integration sites with a high potential to induce stable expression. However, those results are preliminary and must be viewed with caution. Further examination needs to be done to confirm them. Considering the results of all three projects, I propose that the interplay of metabolic burden and selection pressure at an early time point of cultivation plays an important role in cell line development. Small alterations of selection pressure can lead to a decisive change of cell properties. Stable cells are less susceptible to such changes than weak producers. An increase of selection pressure leads to a compensatory effect by gene amplification in the unstable cell lines. The resulting adjustment of productivity masks the truly stable cells, which precludes the selection of the right cell lines. For this reason, the selection pressure, the copy number, and the growth rate should be considered to minimize repressive effects.

    PWWP2A


    Functional Analysis Of Sin3 Isoforms In Drosophila

    The multisubunit SIN3 complex is a global transcriptional regulator. In Drosophila, a single Sin3A gene encodes different isoforms of SIN3, of which SIN3 187 and SIN3 220 are the major isoforms. Previous studies have demonstrated functional non-redundancy of SIN3 isoforms. The role of SIN3 isoforms in regulating distinct biological processes, however, is not well characterized. In addition, how the components of the SIN3 complex modulate the gene regulatory activity of the complex is not well understood. In this study, I identified the biological processes regulated by the SIN3 isoforms. Additionally, I explored how Caf1-55 impacts the gene regulatory activity of the SIN3 220 complex. For the purpose of the study, I developed a highly reproducible ChIP protocol using micrococcal nuclease (MNase)-mediated chromatin preparation from Drosophila cultured cells. This protocol can be used to perform ChIP to map both histones and non-histone chromatin binding proteins locally and globally across the genome. Next, we identified the biological processes regulated by the SIN3 isoforms. We established a Drosophila S2 cell culture model system in which cells predominantly express either SIN3 187 or SIN3 220. To identify genomic targets of SIN3 isoforms, we performed chromatin immunoprecipitation followed by deep sequencing. Our data demonstrate that upon overexpression of SIN3 187, the level of SIN3 220 decreased and the large majority of genomic sites bound by SIN3 220 were instead bound by SIN3 187. We used RNA-seq to identify genes regulated by the expression of one isoform or the other. In S2 cells, which predominantly express SIN3 220, we found that SIN3 220 directly regulates genes involved in metabolism and cell proliferation. We also determined that SIN3 187 regulates a unique set of genes and likely modulates expression of many genes also regulated by SIN3 220. Interestingly, biological pathways enriched for genes specifically regulated by SIN3 187 strongly suggest that this isoform plays an important role during the transition from the embryonic to the larval stage of development. Finally, I investigated the function of Caf1-55 in the SIN3 220 complex. Our data demonstrate that Caf1-55 localizes to SIN3 220 gene targets and is partly required for recruiting SIN3 220 to chromatin. In addition, we show that the C-terminal domain of SIN3 220 physically interacts with Caf1-55. We found that the interaction between SIN3 and Caf1-55 is significantly reduced upon mutating the histone H4 binding pocket of Caf1-55. Surprisingly, the reduced interaction between the histone H4 binding mutant of Caf1-55 and SIN3 220 is not sufficient to cause a change in the expression of SIN3 220 regulated genes. Together, these data provide evidence of a novel role of Caf1-55 in impacting recruitment of a component of a chromatin modifying complex to genomic loci. In summary, our research reveals important insights into how the SIN3 isoform specific complexes might function during the course of fly development.

    SumSec: accurate prediction of Sumoylation sites using predicted secondary structure

    Post-translational modification (PTM) is defined as the modification of amino acids along the protein sequence after the translation process. These modifications significantly impact the functioning of proteins. Therefore, a comprehensive understanding of the underlying mechanism of PTMs is critical in studying the biological roles of proteins. Among a wide range of PTMs, sumoylation is one of the most important modifications due to its known cellular functions, which include transcriptional regulation, protein stability, and protein subcellular localization. Despite its importance, determining sumoylation sites via experimental methods is time-consuming and costly. This has led to a great demand for fast computational methods able to accurately determine sumoylation sites in proteins. In this study, we present a new machine learning-based method for predicting sumoylation sites called SumSec. We employed the predicted secondary structure of amino acids to extract two types of structural features from neighboring amino acids along the protein sequence, which have never been used for this task. As a result, our proposed method enhances sumoylation site prediction, outperforming previously proposed methods in the literature. SumSec demonstrated high sensitivity (0.91), accuracy (0.94) and MCC (0.88). The prediction accuracy achieved in this study is 21% better than those reported in previous studies. The script and extracted features are publicly available at: https://github.com/YosvanyLopez/SumSec
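
    The three metrics reported in the abstract above (sensitivity, accuracy and MCC) have standard definitions over a binary confusion matrix. The sketch below shows those definitions only. It is not taken from the SumSec repository, and the function names and the zero-denominator convention (return 0.0) are assumptions.

```python
import math


def sensitivity(tp, fn):
    """Fraction of true sites that are predicted as sites: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient for a binary confusion matrix.

    Returns 0.0 when any marginal count is zero (an assumed convention).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

    MCC ranges from -1 to 1 and uses all four cells of the confusion matrix, which is why site-prediction papers often report it alongside accuracy when positive and negative sites are imbalanced.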

    Defining the Role of Lysine Acetylation in Regulating the Fidelity of DNA Synthesis

    Indiana University-Purdue University Indianapolis (IUPUI). Accurate DNA replication is vital for maintaining genomic stability. Consequently, the machinery required to drive this process is designed to ensure the meticulous maintenance of information. However, random misincorporation of errors reduces the fidelity of the DNA and leads to premature aging and age-related disorders such as cancer and neurodegenerative diseases. Some of the incorporated errors are the result of the error-prone DNA polymerase alpha (Pol α), which initiates synthesis on both the leading and lagging strand. Lagging strand synthesis acquires an increased number of Pol α tracks because of the number of Okazaki fragments synthesized per round of the cell cycle (~50 million in mammalian cells). The accumulation of these errors invariably reduces the fidelity of the genome. Previous work has shown that these Pol α tracks can be removed by two redundant pathways referred to as the short and long flap pathways. The long flap pathway utilizes a complex network of proteins to remove more of the misincorporated nucleotides than the short flap pathway, which mediates the removal of shorter flaps. Lysine acetylation has been reported to modulate the function of the nucleases implicated in flap processing. The cleavage activity of the long flap pathway nuclease, Dna2, is stimulated by lysine acetylation, while conversely lysine acetylation of the short flap pathway nuclease, FEN1, inhibits its activity. The major protein players implicated during Okazaki fragment processing (OFP) are known; however, the choice of the processing pathway and its regulation by lysine acetylation of its main players is as yet unknown. This dissertation identifies three main findings: i) the Saccharomyces cerevisiae helicase, petite integration frequency (Pif1), is lysine acetylated by Esa1 and deacetylated by Rpd3, regulating its viability and biochemical properties including helicase, binding and ATPase activity; ii) the single-stranded DNA binding protein, human replication protein A (RPA), is modified by p300, and this modification stimulates its primary binding function; and iii) lysine acetylated human RPA directs OFP towards the long flap pathway even for a subset of short flaps.

    Quantitative approaches to probe the acetylproteome

    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Biological Engineering, 2013. Cataloged from PDF version of thesis. Includes bibliographical references (p. 173-175). Lysine acetylation is a prevalent post-translational modification whose multi-varied biological roles have recently emerged. While having all the necessary components of a signaling network, lysine acetylation studies have been limited to a small subset of proteins and pathways. Using a quantitative unbiased mass spectrometry approach, we explored the role of growth factor stimulation on lysine acetylation. Although the growth factors bind receptor tyrosine kinases, growth factor stimulation resulted in rapid and dynamic changes in lysine acetylation. Furthermore, we demonstrated that short-term HDAC inhibition alters phosphotyrosine-signaling networks. To better understand this behavior, a suite of biochemical and computational methods was developed. Bromodomains were engineered to explore binding preferences using degenerate peptide arrays as well as to develop acetyllysine affinity reagents as an alternative to anti-acetyllysine antibodies. Additionally, bioorthogonal proteomics was employed to identify acetyltransferase substrates. Taken together, the knowledge generated and the methods developed provide a toolkit for the analysis of lysine acetylation networks in the context of many biological processes as well as diseases. By Bryan David Bryson. Ph.D.