    A support vector machine-based method for predicting the propensity of a protein to be soluble or to form inclusion body on overexpression in Escherichia coli

    Motivation: Inclusion body formation has been a major deterrent for overexpression studies, since a large number of proteins form insoluble inclusion bodies when overexpressed in Escherichia coli. The formation of inclusion bodies is known to be an outcome of improper protein folding; thus the composition and arrangement of amino acids in a protein would be a major factor in determining its aggregation propensity. There is a significant need for a prediction algorithm that would enable the rational identification of both mutants and the ideal protein candidates for mutations that would confer higher solubility on overexpression, replacing the trial-and-error procedures presently used. Results: Six physicochemical properties, together with residue and dipeptide compositions, were used to develop a support vector machine-based classifier that predicts overexpression status in E. coli. The prediction accuracy is ~72%, suggesting that the classifier performs reasonably well in predicting the propensity of a protein to be soluble or to form inclusion bodies. The algorithm also correctly predicted the change in solubility for most of the point mutations reported in the literature. It can be a useful tool for screening protein libraries to identify soluble variants of proteins.
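
The residue and dipeptide compositions mentioned in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names are hypothetical, the six physicochemical properties are left out because the abstract does not name them, and the classifier itself is omitted. The sketch only shows how a sequence could be turned into a fixed-length vector of 20 residue fractions plus 400 dipeptide fractions that a classifier such as an SVM might consume.

```python
from collections import Counter
from itertools import product

# The 20 standard amino acids, one-letter codes (fixed order for the features).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def residue_composition(seq):
    """Return the fraction of each standard residue in seq (20 values)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    counts = Counter(seq)
    return [counts[aa] / len(seq) for aa in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Return the fraction of each ordered residue pair in seq (400 values)."""
    seq = seq.upper()
    if len(seq) < 2:
        raise ValueError("sequence must contain at least two residues")
    # Overlapping pairs: positions (0,1), (1,2), ...
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = len(seq) - 1
    return [pairs[a + b] / total for a, b in product(AMINO_ACIDS, repeat=2)]


def composition_features(seq):
    """Concatenate residue and dipeptide compositions (hypothetical helper)."""
    return residue_composition(seq) + dipeptide_composition(seq)
```

Non-standard letters in a sequence are simply not counted in any feature here; a real pipeline would need an explicit policy for them.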

    Learning to predict expression efficacy of vectors in recombinant protein production

    Background: Recombinant protein production is a useful biotechnology for producing large quantities of highly soluble proteins. Currently, the most widely used production system fuses a target protein into different vectors in Escherichia coli (E. coli). However, the production efficacy of different vectors varies for different target proteins, and trial and error is still the common practice for finding out the efficacy of a vector for a given target protein. Previous studies are limited in that they assumed that proteins would be over-expressed and focused only on the solubility of expressed proteins. In fact, many pairings of vectors and proteins result in no expression. Results: In this study, we applied machine learning to train models that predict whether a vector-protein pairing will express in E. coli. For expressed cases, the models further predict whether the expressed proteins will be soluble. We collected a set of real cases from the clients of our recombinant protein production core facility, where six different vectors were designed and studied. This set of cases is used in both training and evaluation of our models. We evaluate three different models based on support vector machines (SVM) and their ensembles. Unlike many previous works, these models consider the sequence of the target protein as well as the sequence of the whole fusion vector as features. We show that a model that classifies a case into one of the three classes (no expression, inclusion body and soluble) outperforms a model that considers the nested structure of the three classes, while a model that can take advantage of the hierarchical structure of the three classes performs slightly worse than, but comparably to, the best model. Compared to previous works, our best method still achieves the highest prediction accuracy. Lastly, we briefly present two methods for using the trained model in the design of recombinant protein production systems to improve the chance of high soluble protein production. Conclusion: In this paper, we show that a machine learning approach to predicting the efficacy of a vector for a target protein in a recombinant protein production system is promising and may complement traditional knowledge-driven study of the efficacy. We will release our program to share with other labs in the public domain when this paper is published.

    Recombinant expression of insoluble enzymes in Escherichia coli: a systematic review of experimental design and its manufacturing implications.

    Recombinant enzyme expression in Escherichia coli is one of the most popular methods to produce bulk concentrations of protein product. However, this method is often limited by the inadvertent formation of inclusion bodies. Our analysis systematically reviews literature from 2010 to 2021 and details the methods and strategies researchers have utilized for expression of difficult-to-express (DtE), industrially relevant recombinant enzymes in E. coli expression strains. Our review identifies an absence of a coherent strategy, with disparate practices being used to promote solubility. We discuss the potential to approach recombinant expression systematically, with the aid of modern bioinformatics, modelling, and 'omics'-based systems-level analysis techniques, to provide a structured, holistic approach. Our analysis also identifies potential gaps in the methods used to report metadata in publications and the impact on the reproducibility and growth of the research in this field.

    A novel design of multi-epitope based vaccine against Escherichia coli

    Background: Multivalent vaccines have an advantage over conventional vaccines because of their multifaceted action against antigens, raising hope of more sustained action against allergens. Escherichia coli (E. coli) is a bacterium commonly found in the gut of humans and warm-blooded animals. An increasing number of outbreaks are associated with the consumption of fruits and vegetables (including sprouts, spinach, lettuce, coleslaw, and salad), where contamination may be due to contact with faeces from domestic or wild animals at some stage during cultivation or handling. Given the reported increase in resistance to the antibiotics used for E. coli control, an effective vaccine is an alternative of proven interest. Hence there is a pressing need for a rational, strategic, and efficient vaccine candidate against E. coli, designed with the most current bioinformatics tools. Method: In this study, immunoinformatics tools mined from diverse molecular databases were used to design a novel putative epitope-based oral vaccine against E. coli. The prospective vaccine proteins were carefully screened and validated to obtain a high-throughput three-dimensional protein structure. The final prospective vaccine candidate protein was evaluated for non-allergenicity, antigenicity, solubility, appropriate molecular weight, and isoelectric point. Conclusion: The resultant construct could serve as a promising anti-E. coli vaccine candidate. Immunoinformatics is a new field in pharmacotherapeutics; this technology should continue to offer an alternative to the long-standing traditional approach to vaccine development.

    Inclusion Body Formation by Mutants of the Tenth Human Fibronectin Type III Domain

    Inclusion bodies (IBs) are intracellular, insoluble protein aggregates, commonly observed when a protein of interest is expressed at high concentrations in a bacterial cell-based expression system. The molecular determinants of IB formation are poorly understood, and are of both fundamental and biotechnological significance. The stability, folding, and structure of the tenth human fibronectin type III domain (10Fn3) have been studied previously, making it an attractive model system to investigate IB formation. A library of 10Fn3 mutants was provided by Bristol-Myers Squibb; 31 of these mutants were expressed in Escherichia coli and analyzed. The percentage of the expressed protein found within IBs was quantified at different expression time points using densitometric analysis of soluble and inclusion body (insoluble) cell lysate fractions separated by centrifugation and subjected to polyacrylamide gel electrophoresis. Although most of these mutants differ from each other in only 3 amino acid positions, all found within a single flexible loop of the protein, the extent of IB formation varies greatly. This data set was used to test the performance of a variety of amino acid sequence-based protein aggregation prediction methods. Several of these methods produced predictions that correlate moderately well with the IB formation data (R² > 0.6), suggesting that while the intrinsic aggregation propensity of sequence segments strongly influences IB formation, other factors are also relevant. We hypothesized that improved predictions might be made possible by the consideration of additional structural context, i.e. aggregation-prone sequence segment exposure. Thermodynamic stabilities determined using differential scanning calorimetry correlate poorly with IB formation; all of the mutants are sufficiently stable that no significant fraction of protein is likely to be denatured at equilibrium. To describe the variable structure of the flexible loop in which the mutant sequences differ, ensembles of homology models were constructed. IB formation was found to correlate with the ensemble average energy scores of the homology models. The ensemble average scores may capture subtle shifts in the energetic bias toward native structure that restricts the exposure of aggregation-prone sequence segments. A linear combination of sequence-based aggregation predictions and ensemble average homology model scores correlates much better with IB formation (R² > 0.8) than either parameter does individually.
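
The idea of combining two predictors linearly and judging the fit by the coefficient of determination can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in the thesis: the abstract does not say how the combination was fitted, so ordinary least squares with an intercept is assumed here, and all function names are hypothetical. Here `x1` would stand for a sequence-based aggregation score and `x2` for an ensemble average homology model score.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear; system is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_linear_combination(x1, x2, y):
    """Least-squares fit of y ~ w0 + w1*x1 + w2*x2 via the normal equations."""
    rows = [[1.0, a, b] for a, b in zip(x1, x2)]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(3)]
    return solve_linear_system(xtx, xty)


def predict(weights, x1, x2):
    """Apply fitted weights [w0, w1, w2] to paired predictor values."""
    w0, w1, w2 = weights
    return [w0 + w1 * a + w2 * b for a, b in zip(x1, x2)]


def r_squared(y, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_pred))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot
```

Fitting each predictor alone and then both together, and comparing the resulting R² values, mirrors the comparison the abstract describes. With 31 mutants and three fitted parameters, an R² computed on the training data would tend to be optimistic.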

    Prediction of amyloid fibril-forming segments based on a support vector machine

    Background: Amyloid fibrillar aggregates of proteins or polypeptides are known to be associated with many human diseases. Recent studies suggest that short protein regions trigger this aggregation. Thus, identifying these short peptides is critical for understanding diseases and finding potential therapeutic targets. Results: We propose a method named Pafig (Prediction of amyloid fibril-forming segments), based on support vector machines, to identify the hexapeptides associated with amyloid fibrillar aggregates. The features of Pafig were obtained by a two-round selection from AAindex. In a 10-fold cross-validation test on the Hexpepset dataset, Pafig performed well, with an overall accuracy of 81% and a Matthews correlation coefficient of 0.63. Pafig was used to predict the potential fibril-forming hexapeptides among all 64,000,000 possible hexapeptides. As a result, approximately 5.08% of hexapeptides showed a high aggregation propensity. In the predicted fibril-forming hexapeptides, the amino acids alanine, phenylalanine, isoleucine, leucine and valine occurred at higher frequencies, and the amino acids aspartic acid, glutamic acid, histidine, lysine, arginine and proline appeared at lower frequencies. Conclusion: The performance of Pafig indicates that it is a powerful tool for identifying the hexapeptides associated with fibrillar aggregates and will be useful for large-scale analysis of proteomic data.
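
The figure of 64,000,000 quoted in the abstract above is the number of length-6 sequences over the 20 standard amino acids (20^6), and 5.08% of that is roughly 3.25 million peptides. The short sketch below checks that arithmetic and shows how a protein sequence could be cut into overlapping six-residue windows for scoring. The function names are hypothetical, and the sliding-window step is an assumption about how such a predictor might be applied to whole proteins, not a description of Pafig itself.

```python
# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def peptide_space_size(alphabet=AMINO_ACIDS, length=6):
    """Number of distinct peptides of the given length over the alphabet."""
    return len(alphabet) ** length


def sliding_windows(seq, length=6):
    """Return all overlapping windows of the given length, in order.

    A sequence shorter than the window yields an empty list.
    """
    return [seq[i:i + length] for i in range(len(seq) - length + 1)]


if __name__ == "__main__":
    total = peptide_space_size()
    # 5.08% expressed with integer arithmetic to avoid rounding noise.
    flagged = total * 508 // 10_000
    print(total, flagged)
```

A protein of n residues produces n - 5 windows, so scoring a whole proteome is a far smaller job than enumerating the full hexapeptide space.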

    Correlation Between Protein Primary Structure and Soluble Expression Level of HSA dAb in Escherichia coli

    It is widely accepted that features such as pI, length, molecular mass and amino acid (AA) sequence have a significant influence on protein solubility. Here, we focused mainly on AA composition and explored the AAs that most affected the soluble expression level of human serum albumin (HSA) domain antibody (dAb). The soluble expression and sequence of 65 dAb variants were analysed using clustering and linear modelling. Certain AAs significantly affected the soluble expression level of dAb, with the specific AA combinations being (S, R, N, D, Q), (G, R, C, N, S) and (R, S, G); these combinations respectively affected the dAb expression level in the broth supernatant, the level in the pellet lysate and total soluble dAb. Among the 20 AAs, R displayed a negative influence on the soluble expression level, whereas G and S showed positive effects. A linear model was built to predict the soluble expression level from the sequence; this model had a prediction accuracy of 80 %. In summary, increasing the content of polar AAs, especially G and S, and decreasing the content of R helped to improve the soluble expression level of HSA dAb.