111 research outputs found

    Multidimensional Feature Engineering for Post-Translational Modification Prediction Problems

    Protein sequence data has been produced at an astounding speed. This creates an opportunity to characterize these proteins for the treatment of illness. A crucial characterization of proteins is their post-translational modifications (PTMs). DNA codes for 20 amino acids; after coding (translation), nearly every protein is modified at the amino acid level. We focus on three specific PTMs. The first is the bond formed between two cysteine amino acids, which introduces a loop into the straight chain of a protein. Second, we predict which cysteines can generally be modified (oxidized). Finally, we predict which lysine amino acids are modified by the active form of vitamin B6 (PLP, pyridoxal-5-phosphate). Our work aims to predict PTMs from protein sequencing data. When available, we integrate other data sources to improve prediction. Data mining finds patterns in data and uses these patterns to give a confidence score to unknown PTMs. There are many steps to data mining; however, our focus is on the feature engineering step, i.e. the transformation of raw data into an intelligible form for a prediction algorithm. Our primary innovations are as follows. First, we created the Local Similarity Matrix (LSM), a description of the evolutionary relatedness of a cysteine and its neighboring amino acids. This feature is taken two at a time and template matched to other cysteine pairs. If they are similar, then we assign a high probability that the pair shares the same bonding state. LSM is a three-step algorithm: 1) a matrix of amino acid probabilities is created for each cysteine and its neighbors from an alignment; 2) we multiply each amino acid's probability by the square of the corresponding BLOSUM62 matrix diagonal entry; 3) we z-score normalize the matrix by row. Next, we introduced the Residue Adjacency Matrix (RAM) for sequential and 3-D space (integration of protein coordinate data). 
This matrix describes a cysteine's neighbors, but at much greater distances than most algorithms. It is particularly effective at finding conserved residues that are further away while still remaining a compact description, which matters because more data than necessary incurs the curse of dimensionality. RAM runs in O(n) time, making it very useful for large datasets. Finally, we produced the Windowed Alignment Scoring algorithm (WAS). This is a vector of protein window alignment bit scores. The alignments are one to all. We then apply dimensionality reduction for gains in speed and performance. WAS uses the BLAST algorithm to align sequences within a window surrounding potential PTMs, in this case PLP attached to lysine. For WAS, we tried many alignment algorithms and used the approximation that BLAST provides to reduce computational time from months to days. The performance of the different alignment algorithms did not vary significantly. The applications of this work are many. It has been shown that cysteine bonding configurations play a critical role in the folding of proteins. Solving the protein folding problem will help us to find a solution to Alzheimer's disease, which is due to a misfolding of the amyloid-beta protein. Cysteine oxidation has been shown to play a role in oxidative stress, a condition in which free radicals become too abundant in the body. Oxidative stress leads to chronic illnesses such as diabetes, cancer, heart disease and Parkinson's. Lysine in concert with PLP catalyzes the aminotransferase reaction. Research suggests that anti-cancer drugs could selectively inhibit this reaction. Others have targeted this reaction for the treatment of epilepsy and addictions
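
The three LSM steps named in the abstract can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the author's implementation: the function name `local_similarity_matrix`, the toy alignment windows, the layout of one row per window position, and the reading of step 2 as multiplication by the squared BLOSUM62 diagonal entry are all assumptions made for the example.

```python
from statistics import mean, pstdev

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"
# BLOSUM62 diagonal (self-substitution scores), in the order of AMINO_ACIDS.
BLOSUM62_DIAG = dict(zip(
    AMINO_ACIDS,
    [4, 5, 6, 6, 9, 5, 5, 6, 8, 4, 4, 5, 5, 6, 7, 4, 5, 11, 7, 4],
))


def local_similarity_matrix(aligned_windows):
    """Return one row per window position and one column per amino acid.

    `aligned_windows` is a hypothetical input: equal-length strings taken
    from an alignment, each centred on the cysteine of interest.
    """
    width = len(aligned_windows[0])
    matrix = []
    for pos in range(width):
        column = [window[pos] for window in aligned_windows]
        # Step 1: amino acid probabilities at this position of the alignment.
        probs = [column.count(aa) / len(column) for aa in AMINO_ACIDS]
        # Step 2 (assumed reading): weight by the squared BLOSUM62 diagonal.
        weighted = [p * BLOSUM62_DIAG[aa] ** 2
                    for p, aa in zip(probs, AMINO_ACIDS)]
        # Step 3: z-score normalize the row; constant rows become zeros.
        mu, sigma = mean(weighted), pstdev(weighted)
        matrix.append([(x - mu) / sigma if sigma > 0 else 0.0
                       for x in weighted])
    return matrix


windows = ["ACD", "GCD", "ACE"]  # toy alignment, invented for illustration
lsm = local_similarity_matrix(windows)
```

Each row of `lsm` then has zero mean, so rows from different positions are on a comparable scale before any template matching between cysteine pairs.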

    Novel concepts for identifying protein-protein interactions and unusual protein modifications


    Defining the molecular, genetic and transcriptomic mechanisms underlying the variation in glycation gap between individuals

    A thesis submitted in partial fulfilment of the requirements of the University of Wolverhampton for the degree of Doctor of Philosophy. The discrepancy between HbA1c and fructosamine estimations in the assessment of glycaemia has frequently been observed and is referred to as the glycation gap (G-gap). This could be explained by the higher activity of the deglycating enzyme fructosamine-3-kinase (FN3K) in the negative G-gap group (patients with lower than predicted HbA1c for their mean glycaemia) as compared to the positive G-gap group. The G-gap is linked with differences in complications in patients with diabetes, potentially because of dissimilarities in deglycation. The difference in deglycation rate in turn leads to altered production of advanced glycation end products (AGEs). These AGEs are both receptor dependent and receptor independent. It was hypothesised that variations in the level of FN3K might result from known Single Nucleotide Polymorphisms (SNPs) in the FN3K gene (rs1056534, rs3848403 and rs1046896), a SNP in the ferroportin1/SLC40A1 gene (rs11568350, linked with FN3K activity), differentially expressed genes (DEGs), differentially expressed transcripts or alternatively spliced transcript variants. Previous studies reported accelerated telomere length shortening in patients with diabetes. In this study, 184 patients with diabetes were included as dichotomised groups with either a strongly negative or positive G-gap. This study was conducted to analyse the differences in genotype frequency of specific SNPs via real-time qPCR, determine the concentration of soluble receptors for AGE (sRAGE) via ELISA, find associations between sRAGE concentration and SNP genotype, and evaluate the relative average telomere length ratio via real-time qPCR. 
This study also aimed to investigate the underlying mechanisms of the G-gap via a transcriptome study, identifying the DEGs and differentially expressed transcripts and, consequently, the pathways, biological processes and diseases in which DEGs were enriched. The relative length of the telomere was normalised to the expression of a single copy gene (S). A chi-squared test was used for estimating the expected genotype frequencies in diabetic patients with negative and positive G-gap. Genotype frequencies of the FN3K SNPs (rs1056534, rs3848403 and rs1046896) and the SLC40A1/ferroportin1 SNP (rs11568350) did not differ significantly between the studied groups. With respect to genotypes, the rs1046896 genotype (CT) and rs11568350 genotype (AC) were only found in the heterozygous state in all the investigated cohorts. No association between sRAGE concentration and FN3K SNPs (rs3848403 and rs1056534) was observed, and the sRAGE concentration also did not differ between the groups. Similarly, the relative average telomere length did not differ between the two groups. Plasma sRAGE levels were not different in the cohort studied even though the Wolverhampton Diabetes Research Group (WDRG) previously reported that AGE is higher in positive G-gap. The latter is more likely a consequence of lower FN3K activities. In this study, it was found that SNPs in the FN3K/ferroportin1 gene are not responsible for the discrepancy in average glycaemia. The transcriptomic study via RNA-Seq mapped a total of 64451 gene transcripts to the human transcriptome. The DEGs and differentially expressed transcripts numbered 103 and 342 respectively (p 1.5). Of the 103 DEGs, 61 were downregulated and 42 were upregulated in positive G-gap individuals, while 14 genes produced alternatively spliced transcript variants. 
Four pathways (Viral carcinogenesis, Ribosome, Phagosome and Dorso-ventral axis) were identified in the bioinformatics analysis of samples in which DEGs were enriched. These DEGs were also found to be associated with raised blood pressure and glycated haemoglobin (conditions that coexist with diabetes). Future analysis based on these results will be necessary to elucidate the significant drivers of gene expression leading to the G-gap in these patients
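
As a rough illustration of the chi-squared comparison of genotype frequencies between the two G-gap groups mentioned in this abstract, the sketch below computes a Pearson chi-squared statistic for a 2x3 table (two groups, three genotypes). The function name `chi_squared_2x3` and the genotype counts are invented for the example and are not data from the thesis; the abstract does not state how the test was set up, so this is only one plausible arrangement.

```python
import math


def chi_squared_2x3(table):
    """Pearson chi-squared statistic and p-value for a 2x3 contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of group and genotype.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    # With (2-1)*(3-1) = 2 degrees of freedom, the chi-squared survival
    # function has the closed form exp(-x/2).
    p_value = math.exp(-stat / 2)
    return stat, p_value


# Invented genotype counts (three genotypes per group), NOT thesis data.
negative_gap = [20, 30, 10]
positive_gap = [18, 29, 13]
stat, p_value = chi_squared_2x3([negative_gap, positive_gap])
```

A large `p_value` here would correspond to the kind of non-significant difference in genotype frequency that the abstract reports.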

    Separation and characterization of subpopulations of biopharmaceuticals

    The chemical changes that cause heterogeneity in protein therapeutics are well described and monitored during development and manufacturing. Many of these subpopulations have been investigated regarding their impact on therapeutic protein function. Adjustments in processes, formulation and protein engineering are applied to reduce the number and quantity of protein variants. Yet the root cause of the formation of protein subpopulations cannot be controlled, considering the manifold influences that lead to chemical changes. Therefore, it is important to know the impact of each variant on the stability of the protein drug. However, little is known about the impact of each subpopulation on protein stability, self-interaction, interaction with other variants or aggregation propensity. This thesis aims to separate, characterize and identify potentially aggregation-prone protein variants. If such variants are identified, further investigations are conducted to determine the influence of the variants on the stability of the therapeutic agent. With this work we aim to develop a toolkit of methods to gain knowledge about aggregation-prone variants. A prerequisite for all methods is the maintenance of the intact protein variant in order to avoid any changes that might alter the protein behavior. Our tools include the separation and collection of individual subpopulations in their native state and analytical methods suitable for the limited quantities of sample. In the future, these applications and the knowledge gained from them can be used to remove or stabilize critical variants as a step towards protein therapeutics with increased stability and safety

    Deciphering transcriptional regulation in cancer cells and development of a new method to identify key transcriptional regulators and their target genes

    Cancer cells accumulate genetic changes during carcinogenesis. The scale of these changes ranges from point mutations to large chromosomal aberrations. It is widely accepted that these changes dysregulate essential genetic programs that would normally prevent uncontrolled cellular division and growth. Transcription factors (TFs) are key proteins of gene regulation and are frequently associated with genetic pathologies, e.g. MYCN in neuroblastomas (NBs). Research on gene regulation (in general or condition-specific) is thus a central aspect of cancer research, and it is also the focus of my work. In a carcinogenesis model of NBs without MYCN-amplification, mutations of chromosome 11q (11q-CNA) are suspected to critically influence tumor development. We were able to refine this model by means of gene expression analysis on 11q-CNA in NBs with different clinical outcomes. Gene expression profiles of NBs with unfavorable progression differed significantly between tumors with and without 11q-CNA, whereas 11q-CNA in NBs with favorable outcome is apparently compensated by a yet unknown mechanism. The TF-encoding gene CAMTA1 is located on the chromosomal region 1p, which is frequently deleted in NBs. In vitro experiments with ectopic induction of CAMTA1 yielded CAMTA1-regulated genes with different gene expression profiles that were functionally associated by enrichment analyses with cell cycle regulation and neuronal differentiation. The suggested role of CAMTA1 as a tumor suppressor gene was confirmed by additional in vivo experiments. Furthermore, we studied the effect of MYC and MYCN in NBs without MYCN-amplification and found that these TFs also strongly regulate a large number of common target genes according to their own gene expression in these tumors. Promoter analyses and chromatin immunoprecipitation additionally supported the regulation of the determined target genes by MYC/MYCN. 
The genome-wide application of promoter and enrichment analyses to gene expression data from mouse models enabled us to predict target TFs of Rage signaling. E2f1 and E2f4 were validated experimentally as components of the Rage-dependent gene regulatory network. Finally, we used our experience from gene expression analysis to develop a novel machine learning method to precisely predict TF target gene relationships in humans. In our approach, we combined results from a genome-wide correlation meta-analysis on 4064 microarray gene expression profiles and promoter analyses on TF binding sites with known regulatory interactions between TFs and target genes. Our method outperformed other comparable methods in humans, as we addressed shortcomings of other algorithms specifically for higher eukaryotes, in particular the frequently (erroneously) assumed correlation between the mRNA expression of TFs and their target genes. We made our method freely available as a software package with multiple applications, such as the identification of key TFs in a multiplicity of cellular systems (e.g. cancer cells)

    Dicarbonyl stress and dysfunction of the glyoxalase system in periodontal diseases

    Periodontal ligament inflammation, or periodontitis, is a common disease characterised by gradual destruction of the connective tissue fibres that attach a tooth to the alveolar bone within which it sits. Diabetes and inflammation enhance periodontal bone loss through enhanced resorption and diminished bone formation. Periodontal ligament fibroblast attachment to collagen-I and function was impaired by methylglyoxal (MG) modification in vitro. The glyoxalase system is an anti-glycation defence in all cells that metabolises MG and thereby suppresses MG-mediated protein damage. Overexpression of Glo1 decreased the intracellular levels of MG. The aim of this investigation was to improve the understanding of protein damage in the periodontal ligament (PDL) in diabetes, focusing on protein damage by MG in human periodontal ligament fibroblasts (hPDLFs) in hyperglycaemia, and to evaluate the effects of high and low glucose concentrations on MG metabolism in hPDLFs with or without Glo1 inducers. The effect of high glucose concentration on the formation and metabolism of MG was studied in hPDLFs in vitro. The ability of two small molecule Glo1 inducers, individually and in synergistic combination, to counter dicarbonyl stress in hPDLFs in vitro was studied. Interactions between hPDLFs and the extracellular matrix protein collagen-I were investigated, and impairments in hPDLF adhesion to MG-modified collagen-I coated plates were assessed. Proteins susceptible to MG modification and inactivation in the cytosol of hPDLFs were identified by high resolution mass spectrometry proteomics. The effect of clinical periodontitis on plasma protein glycation, oxidation and nitration was also investigated in a pilot clinical investigation. 
When hPDLFs were incubated with high glucose concentration in vitro, there was a 45% decrease in Glo1 activity and a 42% increase in D-lactate flux, a surrogate indicator of the flux of MG formation, which contributed to an increased cellular concentration of MG and an increase in the MG-H1 residue content of cell protein compared to the low glucose control. This indicated that dicarbonyl stress was induced in hPDLFs by high glucose concentration in vitro, a model for hyperglycaemia in vivo. The decrease in Glo1 activity and the increases in cellular MG concentration and MG-H1 residue content of cell protein were corrected by the addition of Glo1 inducers. The binding of hPDLFs to collagen-I was decreased by 30% in high glucose concentration and was corrected by the addition of Glo1 inducers. Proteomics analysis of cytosolic extracts of hPDLFs indicated that high glucose incubations produced changes in MG-modified proteins and also up-regulated and down-regulated unmodified proteins in hPDLFs. The pilot investigation of clinical periodontitis suggested a systemic effect of this local inflammation, which was associated with changes in plasma protein glycation, oxidation and nitration. This study reveals that dicarbonyl stress is a potential contributory pathogenic mechanism in hPDLFs in periodontitis and that countering it may provide new treatment options to prevent and treat decline in periodontal health, particularly in diabetes. Small molecule inducers of Glo1 expression may in future contribute to improving periodontal health, particularly in diabetes

    Strategies for Untargeted Biomarker Discovery in Biological Fluids

    The health status of an organism modulates the dynamic and complex interplay of biochemical species that make up the body and fluids of the organism. As such, these biological fluids are routinely used for diagnostic testing, yet they are often not used to their full potential. For instance, amniotic fluid (AF), the fluid that surrounds the fetus during gestation, is collected primarily for genetic testing from women with identified risk factors. The AF proteome and/or metabolome are seldom considered and represent a largely untapped wealth of relevant clinical information. Extensive, multi-analyte data can be collected from biological samples with modern analytical instrumentation. However, sophisticated data preprocessing and analysis (i.e. chemometrics) are required to reveal the relationships between the biochemical signals and the health status. This thesis seeks to demonstrate that untargeted biomarker discovery strategies can be efficiently applied to the task of finding novel biomarkers and can complement the traditional hypothesis-driven approaches. In the work underlying this thesis, a chemometric data analysis strategy was developed to search for biomarkers in capillary electrophoresis (CE) separation data. The absorbance data at 195 +/- 4 nm from amniotic fluid samples (n=107) collected at 15 weeks gestation was normalized, time aligned with Correlation Optimized Warping and reduced to a smaller number of variables by Haar transformation. The reduced data was then classified into normal or abnormal health classes using a Bayes classifier algorithm. The chemometric data analysis was first employed to find biomarkers of gestational diabetes mellitus (GDM) and revealed that human serum albumin (HSA) could predict the early onset of disease. The same approach was successfully used to identify cases of large-for-gestational age (LGA) with the same AF CE-UV data. 
It was also employed for the classification of embryos with high and low reproductive potential using in vitro fertilization (IVF) culture media analyzed by CE-UV. Overall, a chemometric method was developed to perform untargeted biomarker discovery in biological samples and provide new means to detect GDM pregnancies, LGA neonates and viable embryos in IVF. The method was successful at identifying biomarkers of interest and showed high flexibility and transferability to other biological fluids
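
The variable-reduction step in this abstract (Haar transformation of the aligned absorbance trace) can be illustrated with a short Python sketch. It assumes the reduction keeps only the approximation coefficients after a fixed number of Haar levels; the abstract does not say which coefficients or how many levels were retained, and the function names `haar_step` and `haar_reduce` and the toy signal are hypothetical.

```python
import math


def haar_step(signal):
    """One level of the orthonormal Haar transform: (approximation, detail)."""
    pairs = list(zip(signal[0::2], signal[1::2]))
    approx = [(a + b) / math.sqrt(2) for a, b in pairs]
    detail = [(a - b) / math.sqrt(2) for a, b in pairs]
    return approx, detail


def haar_reduce(signal, levels):
    """Halve the signal `levels` times, keeping approximation coefficients only."""
    reduced = list(signal)
    for _level in range(levels):
        if len(reduced) % 2:
            raise ValueError("signal length must be divisible by 2**levels")
        reduced, _detail = haar_step(reduced)
    return reduced


# Toy absorbance trace, invented for illustration (8 points -> 2 variables).
trace = [1.0, 1.0, 3.0, 3.0, 5.0, 5.0, 7.0, 7.0]
reduced = haar_reduce(trace, levels=2)
```

The reduced vector would then be the input to whatever classifier is used downstream; the Bayes classifier itself is not sketched here because the abstract gives no detail about its form.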