162 research outputs found

    Multidimensional Feature Engineering for Post-Translational Modification Prediction Problems

    Protein sequence data has been produced at an astounding speed. This creates an opportunity to characterize these proteins for the treatment of illness. A crucial characterization of proteins is their post-translational modifications (PTMs). There are 20 amino acids coded by DNA; after coding (translation), nearly every protein is modified at the amino acid level. We focus on three specific PTMs. First is the bond formed between two cysteine amino acids, which introduces a loop into the straight chain of a protein. Second, we predict which cysteines can generally be modified (oxidized). Finally, we predict which lysine amino acids are modified by the active form of vitamin B6 (PLP/pyridoxal-5-phosphate). Our work aims to predict these PTMs from protein sequencing data. When available, we integrate other data sources to improve prediction. Data mining finds patterns in data and uses these patterns to give a confidence score to unknown PTMs. There are many steps to data mining; however, our focus is on the feature engineering step, i.e. the transformation of raw data into an intelligible form for a prediction algorithm. Our primary innovations are as follows. First, we created the Local Similarity Matrix (LSM), a description of the evolutionary relatedness of a cysteine and its neighboring amino acids. This feature is taken two at a time and template-matched to other cysteine pairs; if they are similar, we assign a high probability that they share the same bonding state. LSM is a three-step algorithm: 1) a matrix of amino acid probabilities is created for each cysteine and its neighbors from an alignment; 2) we multiply each amino acid entry by the square of the corresponding BLOSUM62 matrix diagonal value; 3) we z-score normalize the matrix by row. Next, we introduced the Residue Adjacency Matrix (RAM) for sequential and 3-D space (integrating protein coordinate data). This matrix describes a cysteine's neighbors at much greater distances than most algorithms, and it is particularly effective at finding conserved residues that are further away while still remaining a compact description; more data than necessary incurs the curse of dimensionality. RAM runs in O(n) time, making it very useful for large datasets. Finally, we produced the Windowed Alignment Scoring algorithm (WAS), a vector of protein window alignment bit scores. The alignments are one-to-all, and we then apply dimensionality reduction for gains in speed and performance. WAS uses the BLAST algorithm to align sequences within a window surrounding potential PTMs, in this case PLP attached to lysine. For WAS, we tried many alignment algorithms and used the approximation that BLAST provides to reduce computational time from months to days; the performance of different alignment algorithms did not vary significantly. The applications of this work are many. Cysteine bonding configurations have been shown to play a critical role in the folding of proteins, and solving the protein folding problem will help us find a solution to Alzheimer's disease, which is due to misfolding of the amyloid-beta protein. Cysteine oxidation has been shown to play a role in oxidative stress, a situation in which free radicals become too abundant in the body; oxidative stress leads to chronic illnesses such as diabetes, cancer, heart disease and Parkinson's. Lysine in concert with PLP catalyzes the aminotransferase reaction. Research suggests that anti-cancer drugs could selectively inhibit this reaction, and others have targeted this reaction for the treatment of epilepsy and addictions.
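As a rough illustration of the three LSM steps described above, the following Python sketch builds a position-by-amino-acid probability matrix from an alignment window, weights each entry by the squared BLOSUM62 diagonal score, and z-score normalizes each row; the window handling, the exact weighting, and the correlation-based template matching are illustrative assumptions, not the thesis's exact procedure.

```python
# Hypothetical sketch of the Local Similarity Matrix (LSM) idea described above.
import numpy as np

AA = "ARNDCQEGHILKMFPSTWYV"
# Diagonal (self-substitution scores) of the standard BLOSUM62 matrix.
BLOSUM62_DIAG = dict(zip(AA, [4, 5, 6, 6, 9, 5, 5, 6, 8, 4, 4, 5, 5, 6, 7, 4, 5, 11, 7, 4]))

def local_similarity_matrix(alignment_window):
    """alignment_window: equal-length strings, one per aligned sequence,
    centred on a cysteine and covering its neighbouring positions."""
    n_pos = len(alignment_window[0])
    mat = np.zeros((n_pos, len(AA)))
    # Step 1: per-position amino-acid probabilities from the alignment.
    for seq in alignment_window:
        for i, aa in enumerate(seq):
            if aa in BLOSUM62_DIAG:
                mat[i, AA.index(aa)] += 1
    mat /= max(len(alignment_window), 1)
    # Step 2: weight each amino-acid column by the square of its BLOSUM62 diagonal score.
    mat *= np.array([BLOSUM62_DIAG[aa] ** 2 for aa in AA])
    # Step 3: z-score normalise each row (position).
    return (mat - mat.mean(axis=1, keepdims=True)) / (mat.std(axis=1, keepdims=True) + 1e-9)

def lsm_similarity(lsm_a, lsm_b):
    # Two cysteines are "template matched", here via a simple correlation of their LSMs.
    return float(np.corrcoef(lsm_a.ravel(), lsm_b.ravel())[0, 1])
```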

    Skin Sensitisation (Q)SARs/Expert Systems: from Past, Present to Future

    This review describes the state of the art of available (Q)SARs/expert systems for skin sensitisation and evaluates their utility for potential regulatory use. There is a strong mechanistic understanding of skin sensitisation, which has facilitated the development of different models. Most existing models fall into one of two main categories: either they are local in nature, usually specific to a chemical class or chemical reaction mechanism, or they are global in form, derived empirically using statistical methods. Some of the published global QSARs have recently been characterised and evaluated elsewhere in accordance with the OECD principles. An overview of expert systems capable of predicting skin sensitisation is also provided. Recently, a new perspective on the development of mechanistic skin sensitisation QSARs, so-called Quantitative Mechanistic Modelling (QMM), has been proposed, in which reactivity and hydrophobicity are used as the key parameters in mathematically modelling skin sensitisation. Whilst hydrophobicity can be conveniently modelled using log P, the octanol-water partition coefficient, reactivity is less readily determined from chemical structure. Initiatives are in progress to generate reactivity data for reactions relevant to skin sensitisation, but more resources are required to realise a comprehensive set of reactivity data. This is a fundamental and necessary requirement for the future assessment of skin sensitisation.
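A QMM-style relationship of the kind sketched in the review can be pictured as a linear combination of a reactivity parameter and log P. The Python snippet below is only an illustrative sketch of that idea; the function names and the simple linear form are assumptions, not any of the published models.

```python
# Illustrative QMM-style sketch: sensitisation potency modelled as a linear
# function of a reactivity parameter (e.g. log k against a model nucleophile)
# and hydrophobicity (log P). Not a published model.
import numpy as np

def fit_qmm(log_k, log_p, potency):
    """Least-squares fit of potency ~ a*log_k + b*log_p + c."""
    X = np.column_stack([log_k, log_p, np.ones(len(log_k))])
    coeffs, *_ = np.linalg.lstsq(X, potency, rcond=None)
    return coeffs  # a, b, c

def predict_qmm(coeffs, log_k, log_p):
    a, b, c = coeffs
    return a * log_k + b * log_p + c
```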

    Investigation into biological and biomimetic transmembrane systems

    Membranes are essential components of living organisms, which serve as effective barriers that separate distinct chemical environments on either side of the membrane. Chemists have designed biological and synthetic systems to functionalise membrane-embedded systems for a variety of applications such as sensing, sequencing, reaction mechanistic studies, and therapeutics. The continuous interest in functionalising membranes, combined with an incomplete understanding of the underlying factors determining their mechanisms, inspired the investigations undertaken in this work. This Thesis employs both experimental and computational methods to explore two distinct applications, for sequencing and therapeutics, respectively. (1) Engineered biological nanopores have found great success in DNA sequencing. The Bayley group previously reported a molecular hopper, which makes sub-nanometer steps by thiol-disulfide interchange along a track with cysteine footholds within a protein nanopore. In Chapter 2, the hopping rate was optimized with a view towards rapid enzymeless biopolymer characterization during translocation within nanopores. I first used a nanopore approach to systematically profile the reactivity of individual cysteine footholds along an engineered protein track at the single-molecule level. Using this approach, I calculated the pKa of cysteine thiols and the pH-independent rate constants for the reaction between thiolates and a disulfide molecule. This reactivity profile guided site-specific mutagenesis. Together with the optimization of experimental conditions, the overall stepping rate of a DNA cargo along a five-cysteine track was accelerated. This work extends the practical application of this enzymeless system as a sequencing method for biopolymers beyond DNA. (2) Synthetic anion transporters have attracted significant attention as promising therapeutics for ion channel diseases. In Chapter 3, I use computational modelling to investigate the chloride binding and transmembrane transport mechanisms of E-/Z-switchable synthetic transporters. Using a model system, I developed a workflow to construct full energy profiles for the transmembrane transport process. These results revealed the importance of pre-organization of the Z-isomer and the balance between the energy barrier of transport and the solubility of the transporter. Additionally, in Chapter 4, I present a predictive machine-learning (ML) approach for estimating the chloride transport activity of a variety of synthetic chloride transporters. The ML models, employing both classification and regression frameworks, exhibited remarkable performance across a diverse range of systems. Moreover, they offered insights crucial for future design efforts, e.g., identifying key structural features and experimental conditions that influence the observed transport activity. Overall, this work bridges biological and molecular design, computational modelling and data-driven approaches to advance the development of two applications to functionalise membranes for sequencing and therapeutics. It provides interpretable molecular models as well as structure-activity relationships that will aid hypothesis generation and contribute to synthetic advances in both fields.
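For the cysteine reactivity profiling described in (1), a common way to extract a thiol pKa and a pH-independent rate constant is to fit rates observed at several pH values to a sigmoidal pH-rate profile in which only the thiolate reacts. The sketch below shows that generic fit with made-up numbers; it is an assumption about the analysis, not the thesis's exact procedure.

```python
# Generic pH-rate profile fit (illustrative data, not experimental results):
# k_obs(pH) = k_max / (1 + 10**(pKa - pH)), i.e. the thiolate fraction times
# the pH-independent rate constant k_max.
import numpy as np
from scipy.optimize import curve_fit

def k_obs(pH, k_max, pKa):
    return k_max / (1.0 + 10.0 ** (pKa - pH))

# Hypothetical observed stepping rates (s^-1) at several pH values.
pH = np.array([6.5, 7.0, 7.5, 8.0, 8.5, 9.0])
rates = np.array([0.4, 1.1, 2.8, 5.0, 6.8, 7.6])

(k_max_fit, pKa_fit), _ = curve_fit(k_obs, pH, rates, p0=[8.0, 8.0])
print(f"k_max ~ {k_max_fit:.2f} s^-1, pKa ~ {pKa_fit:.2f}")
```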

    Review of Data Sources, QSARs and Integrated Testing Strategies for Skin Sensitisation

    This review collects information on sources of skin sensitisation data and computational tools for the estimation of skin sensitisation potential, such as expert systems and (quantitative) structure-activity relationship (QSAR) models. The review also captures current thinking on what constitutes an integrated testing strategy (ITS) for this endpoint. The emphasis of the review is on the usefulness of the models for the regulatory assessment of chemicals, particularly for the purposes of the new European legislation for the Registration, Evaluation, Authorisation and Restriction of CHemicals (REACH), which entered into force on 1 June 2007. Since there are no specific databases for skin sensitisation currently available, a description of experimental data found in various literature sources is provided. General (global) models, models for specific chemical classes and mechanisms of action, and expert systems are summarised. This review was prepared as a contribution to the EU-funded Integrated Project, OSIRIS.

    Data Enrichment for Data Mining Applied to Bioinformatics and Cheminformatics Domains

    Increasingly complex problems are being addressed in the life sciences. Acquiring all the data that may be related to the problem in question is paramount. Equally important is to know how the data are related to each other and to the problem itself. On the other hand, there are large amounts of data and information available on the Web. Researchers are already using Data Mining and Machine Learning as valuable tools in their research, although the usual procedure is to look for the information based on induction models. So far, despite the great successes already achieved using Data Mining and Machine Learning, it is not easy to integrate this vast amount of available information into the inductive process with propositional algorithms. Our main motivation is to address the problem of integrating domain information into the inductive process of propositional Data Mining and Machine Learning techniques by enriching the training data to be used in inductive logic programming systems. Propositional machine learning algorithms are very dependent on data attributes. It is still hard to identify which attributes are most suitable for a particular research task. It is also hard to extract relevant information from the enormous quantity of data available. We consolidate the available data and derive features that ILP algorithms can use to induce descriptions and solve the problems. We are creating a web platform to obtain information relevant to Bioinformatics (particularly Genomics) and Cheminformatics problems. It fetches the data from public repositories of genomic, protein and chemical data. After the data enrichment, Prolog systems use inductive logic programming to induce rules and solve specific Bioinformatics and Cheminformatics case studies. To assess the impact of the data enrichment with ILP, we compare the results with those obtained by solving the same cases using propositional algorithms.
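A minimal sketch of the enrichment step might look like the following Python snippet, which rewrites already-fetched repository records as Prolog background facts that an ILP system can load; the record fields and predicate names (has_go_term/2, has_domain/2) are hypothetical placeholders, not the platform's actual schema.

```python
# Hypothetical data-enrichment sketch: records fetched from public repositories
# (represented here as plain dicts) are rewritten as Prolog background facts.
records = [
    {"protein": "p1", "go_term": "GO:0005524", "domain": "PF00069"},
    {"protein": "p2", "go_term": "GO:0004672", "domain": "PF00069"},
]

def to_prolog_facts(records):
    facts = []
    for r in records:
        facts.append(f"has_go_term({r['protein']}, '{r['go_term']}').")
        facts.append(f"has_domain({r['protein']}, '{r['domain']}').")
    return "\n".join(facts)

# Write the facts so a Prolog/ILP system can consult them as background knowledge.
with open("background.pl", "w") as fh:
    fh.write(to_prolog_facts(records))
```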

    A robust machine learning approach for the prediction of allosteric binding sites

    Allosteric regulatory sites are highly prized targets in drug discovery. They remain difficult to detect by conventional methods, with the vast majority of known examples being found serendipitously. Herein, a rigorous, wholly computational protocol is presented for the prediction of allosteric sites. Previous attempts to predict the location of allosteric sites by computational means drew on only a small amount of data. Moreover, no attempt was made to modify the initial crystal structure beyond the in silico deletion of the allosteric ligand. This practice can leave behind a conformation with a significant structural deformation, often betraying the location of the allosteric binding site. Despite this artificial advantage, modest success rates are observed at best. This work addresses both of these issues. A set of 60 protein crystal structures with known allosteric modulators was collected. To remove the imprint on protein structure caused by the presence of bound modulators, molecular dynamics was performed on each protein prior to analysis. A wide variety of analytical techniques were then employed to extract meaningful data from the trajectories. Upon fusing them into a single, coherent dataset, random forest - a machine learning algorithm - was applied to train a high-performance classification model. After successive rounds of optimisation, the final model presented in this work correctly identified the allosteric site for 72% of the proteins tested. This is not only an improvement over alternative strategies in the literature; crucially, this method is unique among site prediction tools in that it does not exploit crystal structures containing imprints of bound ligands - of key importance when making live predictions, where no allosteric regulatory sites are known.
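The final training step could be sketched roughly as below: per-pocket descriptors derived from the MD trajectories are fused into one table and a random forest is trained and evaluated with protein-grouped cross-validation. The descriptors and data here are random placeholders, and the grouped evaluation is an assumption about the protocol rather than the thesis's exact pipeline.

```python
# Illustrative sketch only: random forest on per-pocket descriptors with
# protein-grouped cross-validation, so pockets from the same protein never
# appear in both training and test folds.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, GroupKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))        # placeholder pocket descriptors (volume, flexibility, ...)
y = rng.integers(0, 2, size=300)      # 1 = known allosteric site, 0 = other pocket (placeholder)
groups = rng.integers(0, 60, size=300)  # protein each pocket belongs to

clf = RandomForestClassifier(n_estimators=500, class_weight="balanced", random_state=0)
scores = cross_val_score(clf, X, y, cv=GroupKFold(n_splits=5), groups=groups)
print("per-protein CV accuracy:", scores.mean())
```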

    Ligand-Protein Binding Affinity Prediction Using Machine Learning Scoring Functions.

    In recent years, artificial intelligence has appeared in very different fields, with promising results able to produce enormous steps forward in some circumstances. In chemoinformatics, the use of machine learning techniques in particular has allowed the scientific community to build apparently accurate scoring functions for computational docking. These scoring functions can outperform classical ones, the type of scoring functions used until now. However, the comparisons between classical and machine learning scoring functions are based on particular tests which can favour the latter, as highlighted by some studies. In particular, machine learning scoring functions, by definition, must be trained on data, passing to the model the instances chosen to describe the complexes and the corresponding ligand-protein affinities. Under these conditions, the scoring power of a machine learning scoring function can be evaluated on different datasets, and the recorded performance can differ depending on the dataset. In particular, datasets very similar to the one used in the training phase of the machine learning scoring function make it easier to reach high scoring power. The objective of the present study is to verify the real efficiency and the effective performance of the newly introduced machine learning scoring functions. Our aim is to give the scientific community an answer to the doubts over whether machine learning scoring functions are or are not the revolutionary road to be followed in the fields of chemoinformatics and drug discovery. To this end, many tests are conducted and a definitive test protocol to be executed to exhaustively validate a new machine learning scoring function is proposed. Here we investigate the circumstances in which a machine learning scoring function produces overestimated performance and why this can happen. As a possible solution, we propose a test protocol to be followed in order to guarantee a realistic description of the performance of machine learning scoring functions. Finally, an effective and innovative solution in the field of machine learning scoring functions is proposed. It consists in the use of per-target scoring functions, which are machine learning scoring functions created using complexes involving a single protein and able to predict the affinity of complexes involving that target. The data used to build the model are synthetic and therefore easy to create. The performance on the chosen targets is better than that obtained with basic scoring function models and with machine learning scoring functions trained on databases composed of more than one protein.
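One way to realise the kind of test the abstract argues for is to evaluate a machine learning scoring function only on complexes whose protein targets were never seen during training, rather than on a random split that can share near-identical complexes between training and test sets. The sketch below uses placeholder descriptors and a grouped split to illustrate that idea; it is not the protocol proposed in the work itself.

```python
# Illustrative evaluation on unseen protein targets (placeholder data).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GroupKFold
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 20))           # complex descriptors (placeholder)
y = rng.normal(size=500)                 # experimental binding affinities (placeholder)
targets = rng.integers(0, 50, size=500)  # protein target of each complex

correls = []
for train, test in GroupKFold(n_splits=5).split(X, y, groups=targets):
    model = RandomForestRegressor(n_estimators=300, random_state=0).fit(X[train], y[train])
    correls.append(pearsonr(model.predict(X[test]), y[test])[0])
print("scoring power on unseen targets (Pearson r):", np.mean(correls))
```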

    IN SILICO APPROACHES IN DRUG DESIGN AND DEVELOPMENT: APPLICATIONS TO RATIONAL LIGAND DESIGN AND METABOLISM PREDICTION

    In recent decades, the applications of computational methods in medicinal chemistry have undergone significant changes, which have greatly expanded their approaches and, more importantly, their objectives. The overall aim of the present research project is to explore the different fields of modelling studies by using well-known computational methods as well as different and innovative techniques. Computational methods traditionally consisted of ligand-based and structure-based approaches substantially aimed at optimizing the ligand structure in terms of affinity, potency and selectivity. The studies concerning the muscarinic receptors in the present thesis applied these approaches for the rational design of novel, improved bioactive molecules interacting both at the orthosteric site (e.g., 1,4-dioxane agonists) and at allosteric sites. The research also includes the application of a novel method for target optimization, which consists in the generation of so-called conformational chimeras to explore the flexibility of the modelled GPCR structures. In parallel, computational methods are finding successful applications in the research phase which precedes ligand design and which is focused on a detailed validation and characterization of the biological target. A good example of this kind of study is the one concerning the purinergic receptors, aimed at the identification and characterization of potential allosteric binding pockets for the already reported inhibitors, also exploiting innovative approaches for binding site prediction (e.g., PELE, SPILLO-PBSS). Over time, computational applications have seen a rich extension of their objectives, one of the clearest examples being the ever-increasing attempts to optimize the ADME/Tox profile of novel compounds, thereby reducing the marked attrition in drug discovery caused by unsuitable pharmacokinetic profiles. Consistently, the first and main project of the present thesis concerns the field of metabolism prediction and is founded on a meta-analysis and the corresponding database, called MetaSar, manually collected from the recent specialized literature. This ongoing extended project includes different studies which are overall aimed at developing a comprehensive method for metabolism prediction. In detail, this thesis reports an interesting application of the database which exploits an innovative predictive technique, Proteochemometric modelling (PCM). This approach is at the forefront of the latest modelling techniques, as it fits the growing demand for new solutions to deal with the huge amount of data recently produced by the "omics" disciplines. In this context, MetaSar represents an alternative and appropriate source of data for PCM studies, which also enables the extension of its fields of application to a new avenue, such as the prediction of metabolic biotransformations. In the present thesis, we present the first example of these applications, which involves the building of a classification model for the prediction of the glucuronidation reaction. The field of glucuronidation reactions is also exhaustively explored through a homology modelling study aimed at defining the complete three-dimensional structure of the enzyme UGT2B7, the main isoform of the glucuronidation enzymes in humans, in complex with the cofactor UDPGA and a typical substrate, such as Naproxen. The paths of the substrate entering the binding site and the egress of the product have been investigated by performing Steered Molecular Dynamics (SMD) simulations, which were also useful for gaining deeper insight into the full mechanism of action and the movements of the cofactor.
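A proteochemometric classification model of the kind described for glucuronidation prediction can be pictured as follows: ligand descriptors and enzyme (isoform) descriptors are concatenated into one feature vector and a classifier is trained on the combined space. The descriptors, data and model choice below are placeholders, not the MetaSar-based model itself.

```python
# Hypothetical PCM sketch: ligand + enzyme descriptors combined into one
# feature vector for a glucuronidation yes/no classifier (placeholder data).
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
ligand_desc = rng.normal(size=(400, 16))   # e.g. fingerprints / physicochemical properties
enzyme_desc = rng.normal(size=(400, 8))    # e.g. sequence-derived isoform descriptors
X = np.hstack([ligand_desc, enzyme_desc])  # PCM: ligand and protein spaces combined
y = rng.integers(0, 2, size=400)           # 1 = glucuronidated, 0 = not (placeholder labels)

model = GradientBoostingClassifier(random_state=0)
print("CV accuracy:", cross_val_score(model, X, y, cv=5).mean())
```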