47 research outputs found

    Dynamic protein classification: Adaptive models based on incremental learning strategies

    One of the major problems in computational biology is the inability of existing classification models to incorporate expanding and new domain knowledge. This thesis addresses the problem of static classification models by introducing incremental learning for problems in bioinformatics. The tools developed are applied to the problem of classifying proteins into a number of primary and putative families. This type of classification is particularly relevant because of its role in drug discovery programs, where it saves cost and time. As a secondary problem, multi-class classification is also addressed. The standard approach to protein family classification is based on the creation of committees of binary classifiers. This one-vs-all approach is not ideal, and the classification systems presented here consist of classifiers that are able to do all-vs-all classification. Two incremental learning techniques are presented. The first is a novel algorithm based on the fuzzy ARTMAP classifier and an evolutionary strategy. The second technique applies the incremental learning algorithm Learn++. The two systems are tested using three datasets: data from the Structural Classification of Proteins (SCOP) database, the G-Protein Coupled Receptors (GPCR) database, and enzymes from the Protein Data Bank. The results show that the two techniques are comparable with each other, giving classification abilities comparable to those of single batch-trained classifiers, with the added ability of incremental learning.
Both techniques are shown to be useful for the problem of protein family classification, but they are also applicable to problems outside this area, with applications in proteomics, including the prediction of functions and of secondary and tertiary structures, and in genomics, such as promoter and splice site prediction and the classification of gene microarrays.

    Towards Solving the Dopamine G Protein Coupled Receptor Modelling Problem

    The overall aim of this work has been to furnish a model of the dopamine (DA) receptor D2. There are currently two sub-groups within the DA family of G protein coupled receptors (GPCRs): the D1 sub-group (includes D1 and D5) and the D2 sub-group (includes D2, D3 and D4). Organon (UK) Ltd. supplied a disk containing the PDB atomic co-ordinates of the integral membrane protein bacteriorhodopsin (bRh; Henderson et al., 1975 and 1990) to use as a template to model D2, the aim being to generate a model of D2 by simply mutating the side-residues of bRh. The assumption was that bRh had homology with members of the supergene class of GPCRs. However, using the GCG Wisconsin GAP algorithm (Devereux et al., 1984), no significant homology was detected between the primary structures of any member of the DA family of GPCRs and bRh. Nevertheless, given the original brief to carry out homology modelling using bRh as a template (see appendix 1), I felt obliged to carry out further alignments using a shuffling technique and a standard statistical test to check for significant structural homology. The results clearly showed that there is no significant structural homology, on the basis of sequence similarity, between bRh and any member of the DA family of GPCRs. Indeed, the statistical analysis clearly demonstrated that while there is significant structural homology between every catecholamine binding GPCR, there is no structural homology whatsoever between any catecholamine binding GPCR and bRh. Hydropathy analysis is frequently used to identify the location of putative transmembrane segments (ptms). However, it is difficult to predict the end positions of each ptms. To this end a novel alignment algorithm (DH Scan) was coded to exploit transparallel supercomputer technology to provide a basis for identifying likely helix end points and to pinpoint areas of local homology between GPCRs. DH Scan clearly demonstrated characteristic transmembrane homology between different subtype DA GPCRs.
Two further homology algorithms were coded (IH Scan and RH Scan) which provided evidence of internal homology. In particular, IH Scan independently revealed a repeat region in the 3rd intracellular loop (iIII) of D4, and RH Scan revealed palindromic-like short stretches of amino acids which were found to be particularly well represented in predicted α-helices in each DA receptor subtype. In addition, the profile network prediction algorithm (PHD; Rost et al., 1994) predicted a short α-helix at greater than 80% probability at each end of the third intracellular loop and between the carboxy terminal end of transmembrane VII and a conserved Cys residue in the fourth intracellular loop. Fourier analysis of catecholamine binding GPCR primary structures in the form of a multiple-sequence file suggested that the consensus view that only those residues facing the protein interior are conserved is not entirely correct. In particular, transmembrane helices II and III do not exhibit the residue conservation characteristic of an amphipathic helix. It is proposed that these two helices undergo a form of helix interface shear to assist agonist binding to an Asp residue on helix II. These data, in combination with information from a number of papers concerning the helix interface shear mechanism and molecular dynamics studies of proline-containing α-helices, suggested a physically plausible binding mechanism for agonists. While it was evident that homology modelling could not be scientifically justified, the combinatorial approach to protein modelling might be successfully applied to the transmembrane region of the D2 receptor. The probable arrangement of helices in the transmembrane region of GPCRs (Baldwin, 1993), which was based on a careful analysis of a low resolution projection map of rhodopsin (Gebhard et al., 1993), was used as a guide to model the transmembrane region of D2.
The backbone torsion angles of a helix with a middle Pro residue (Sankararamakrishnan et al., 1991) were used to model transmembrane helix V. Dopamine was successfully docked to the putative binding pocket of D2. Using this model as a template, models of D3 and D4 were produced. A separate model of D1 was then produced and this in turn was used as a template to model D5.
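The hydropathy analysis mentioned in this abstract can be illustrated with a minimal sliding-window sketch. It is not the thesis's DH Scan algorithm, which is not described in enough detail here to reproduce. It assumes the Kyte-Doolittle hydropathy scale, and the 19-residue window and 1.6 threshold are a commonly quoted rule of thumb used here as illustrative defaults; the function names are hypothetical.

```python
# Kyte-Doolittle hydropathy values per residue (assumed scale for this sketch).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=19):
    """Return the mean hydropathy of every full-length window in seq."""
    scores = [KYTE_DOOLITTLE[aa] for aa in seq.upper()]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


def candidate_segments(profile, window=19, threshold=1.6):
    """Merge runs of above-threshold windows into residue spans.

    Each span is (start, end): 0-based, start inclusive, end exclusive.
    """
    segments = []
    run_start = None
    for i, value in enumerate(profile):
        if value > threshold:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            # The last window of the run started at i - 1 and covers `window` residues.
            segments.append((run_start, i - 1 + window))
            run_start = None
    if run_start is not None:
        segments.append((run_start, len(profile) - 1 + window))
    return segments


# Toy sequence: 10 Asp, 20 Leu (a mock hydrophobic core), 10 Lys.
toy = "D" * 10 + "L" * 20 + "K" * 10
print(candidate_segments(hydropathy_profile(toy)))
```

For this toy sequence the reported span runs from residue 5 up to residue 34, ten residues wider than the 20-residue Leu block. That overshoot is a small instance of the end-point difficulty the abstract describes: window averaging locates a hydrophobic stretch but blurs where it starts and stops.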

    Predicting Transporter Proteins and Their Substrate Specificity

    The publication of numerous genome projects has resulted in an abundance of protein sequences, a significant number of which are still unannotated. Membrane proteins such as transporters, receptors, and enzymes are among the least characterized proteins due to their hydrophobic surfaces and lack of conformational stability. This research aims to build a proteome-wide system to determine transporter substrate specificity, which involves three phases: 1) distinguishing membrane proteins, 2) differentiating transporters from other functional types of membrane proteins, and 3) detecting the substrate specificity of the transporters. To distinguish membrane from non-membrane proteins, we propose a novel tool, TooT-M, that combines the predictions from transmembrane topology prediction tools and a selective set of classifiers where protein samples are represented by pseudo position-specific scoring matrix (Pse-PSSM) vectors. The results suggest that the proposed tool outperforms all state-of-the-art methods in terms of the overall accuracy and Matthews correlation coefficient (MCC). To distinguish transporters from other proteins, we propose an ensemble classifier, TooT-T, that is trained to optimally combine the predictions from homology annotation transfer and machine learning methods. The homology annotation transfer components detect transporters by searching against the transporter classification database (TCDB) using different thresholds. The machine learning methods include three models wherein the protein sequences are encoded using a novel encoding psi-composition. The results show that TooT-T outperforms all state-of-the-art de novo transporter predictors in terms of the overall accuracy and MCC. To detect the substrate specificity of a transporter, we propose a novel tool, TooT-SC, that combines compositional, evolutionary, and positional information to represent protein samples. 
TooT-SC can efficiently classify transport proteins into eleven classes according to their transported substrate, which is the highest number of predicted substrates offered by any de novo prediction tool. Our results indicate that TooT-SC significantly outperforms all of the state-of-the-art methods. Further analysis of the locations of the informative positions reveals that there are more statistically significant informative positions in the transmembrane segments (TMSs) than in the non-TMSs, and that more statistically significant informative positions occur close to the TMSs than in regions far from them.
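The abstract above reports results as overall accuracy and Matthews correlation coefficient (MCC). As a reminder of how the two differ, here is a minimal sketch computed from binary confusion-matrix counts. The function names and the example counts are hypothetical and are not taken from the TooT tools or their results.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all samples that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient for a binary confusion matrix.

    Returns 0.0 when the denominator is zero (a common convention),
    which happens when a whole row or column of the matrix is empty.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical imbalanced case: 5 positives, 95 negatives, and a
# classifier that labels everything negative.
print(accuracy(tp=0, tn=95, fp=0, fn=5))
print(mcc(tp=0, tn=95, fp=0, fn=5))
```

In the imbalanced example, accuracy is high even though no positive sample is found, while MCC is zero. This is one reason both measures tend to be reported together for tasks where one class dominates.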

    Bioinformatic analysis of bacterial and eukaryotic amino-terminal signal peptides

    Ph.D. (Doctor of Philosophy)

    Characterisation and Classification of Protein Sequences by Using Enhanced Amino Acid Indices and Signal Processing-Based Methods

    Due to copyright reasons, the author's published papers have been removed from this copy of the thesis. Protein sequencing has produced an overwhelming number of protein sequences, especially in the last decade. Nevertheless, the functional and structural classes of the majority of proteins are still unknown, and the experimental methods currently used to determine these properties are very expensive, laborious and time consuming. Therefore, automated computational methods are urgently required to accurately and reliably predict the functional and structural classes of proteins. Several bioinformatics methods have been developed to determine such properties directly from sequence information. Methods that involve signal processing have recently become popular in bioinformatics; they have been investigated for the analysis of DNA and protein sequences and shown to be useful, generally helping to better characterise the sequences. However, various technical issues need to be addressed in order to overcome problems associated with signal processing methods for the analysis of protein sequences. Amino acid indices, which are used to transform protein sequences into signals, have various applications and can represent diverse features of protein sequences and amino acids. As the majority of indices have similar features, this project proposes a new set of computationally derived indices that better represent the original group of indices. A study is also carried out that resulted in finding a unique and universal set of best discriminating amino acid indices for the characterisation of allergenic proteins. This analysis extracts features directly from the protein sequences by using the Discrete Fourier Transform (DFT) to build a classification model based on Support Vector Machines (SVM) for the allergenic proteins.
The proposed predictive model yields a higher and more reliable accuracy than existing methods. A new method is proposed for performing multiple sequence alignment. In this method, a DFT-based approach is used to construct a new distance matrix in combination with the multiple amino acid indices that were used to encode protein sequences as numerical sequences. Additionally, a new type of substitution matrix is proposed in which the physicochemical similarities between any given amino acids are calculated. These similarities were calculated from the 25 selected amino acid indices, each of which represents a unique biological protein feature. The proposed multiple sequence alignment method yields a better and more reliable alignment than existing methods. In order to evaluate the complex information generated by the DFT, Complex Informational Spectrum Analysis (CISA) is developed and presented. As the results show, when protein classes present similarities or differences according to the Common Frequency Peak (CFP) in specific amino acid indices, it is probable that these classes are related to the protein feature that the specific amino acid index represents. Using only the absolute spectrum in informational spectrum analysis of protein sequences is shown to be insufficient, as biologically related features can appear individually in either the real or the imaginary spectrum. This is successfully demonstrated in the analysis of influenza neuraminidase protein sequences. Upon identification of a new protein, it is important to single out the amino acids responsible for the structural and functional classification of the protein, as well as the amino acids contributing to the protein's specific biological characterisation. In this work, a novel approach is presented to identify and quantify the relationship between individual amino acids and the protein.
This is successfully demonstrated in the analysis of influenza neuraminidase protein sequences. The characterisation and identification problem for Influenza A virus protein sequences is tackled with a Subgroup Discovery (SD) algorithm, which can provide ancillary knowledge to experts. The main objective of the case study was to derive interpretable knowledge for the Influenza A virus problem and consequently to better describe the relationships between subtypes of this virus. Finally, using DFT-based sequence-driven features, a Support Vector Machine (SVM)-based classification model was built and tested, which yields higher predictive accuracy than SD. The methods developed and presented in this study yield promising results and can be easily applied in proteomic fields.
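The pipeline this abstract describes (encode a sequence with an amino acid index, take the DFT, then inspect the real, imaginary and absolute spectra) can be sketched as follows. This is only an illustration under stated assumptions, not the thesis's CISA implementation: the two-letter `TOY_INDEX` is hypothetical, the mean is subtracted before the transform to suppress the zero-frequency term, and a plain O(n²) DFT from the standard library is used rather than an FFT.

```python
import cmath

# Hypothetical two-residue index used only for this illustration;
# a real analysis would use a published amino acid index.
TOY_INDEX = {"A": 1.0, "G": -1.0}


def encode(seq, index):
    """Map each residue to its numerical index value."""
    return [index[aa] for aa in seq]


def dft(signal):
    """Direct discrete Fourier transform of a real-valued signal."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]


def spectra(seq, index):
    """Return real, imaginary and absolute spectra up to the Nyquist bin."""
    signal = encode(seq, index)
    mean = sum(signal) / len(signal)
    coeffs = dft([x - mean for x in signal])
    half = len(coeffs) // 2 + 1  # the remaining bins mirror these for real input
    return {
        "real": [c.real for c in coeffs[:half]],
        "imag": [c.imag for c in coeffs[:half]],
        "abs": [abs(c) for c in coeffs[:half]],
    }


result = spectra("AGAGAGAG", TOY_INDEX)
print([round(v, 6) for v in result["abs"]])
```

Keeping the real and imaginary parts separate, instead of returning only the magnitude, mirrors the abstract's point that a feature may show up in one of them individually.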

    Subcellular trafficking of proteolipid protein (PLP/DM20) and novel mechanisms of ER retention in Pelizaeus-Merzbacher disease

    Missense mutations that predict the misfolding of membrane proteins have been associated with a number of neurogenetic diseases. However, it is not known how apparently minor changes in the amino acid sequence of an extracellular loop or a transmembrane domain lead to complete ER retention with complex loss- and gain-of-function effects. I have chosen PLP/DM20, a highly conserved and abundant tetraspan myelin protein associated with Pelizaeus-Merzbacher disease (PMD), as a model system. Expressing wildtype and mutant PLP isoforms in glial cells revealed surprising molecular properties, including the ability to self-assemble from two truncated PLP polypeptides and to form a conformation-sensitive epitope that becomes masked as the protein matures in the ER. With respect to human disease, it was possible to identify a novel molecular mechanism by which missense mutations cause ER retention of misfolded PLP. Unexpectedly, pairs of cysteines within an extracellular loop of PLP/DM20 play a critical role. Multiple disease-causing mutations require the presence of cysteines for misfolded PLP/DM20 to be efficiently retained in the ER. Replacing cysteines with serine completely prevents ER retention and restores normal trafficking of mutant PLP/DM20. This demonstrates a novel pathological mechanism by which missense mutations greatly reduce the efficiency of intramolecular disulfide bridging. When exposed by misfolding to the ER lumen, unpaired cysteines engage in alternative oxidations that lead to abnormal intermolecular crosslinks. Since extracellular cysteines are a feature of many membrane proteins, this novel pathomechanism is likely to contribute to a diverse group of genetic diseases. To monitor the expression and subcellular trafficking of PLP in vivo, a transgenic knock-in mouse is being generated that will express a PLP-EGFP fusion protein under control of the endogenous promoter.
In an attempt to develop a cure for Pelizaeus-Merzbacher disease (PMD), we treated a genuine animal model (rumpshaker mice) with turmeric. The active constituent of this herbal drug (curcumin) is a non-toxic Ca2+ adenosine triphosphatase pump inhibitor and is known to release membrane proteins from ER retention. In a pilot experiment, we extended the lifespan of rumpshaker mice from 20 to 60 days. These promising data suggest that a therapeutic strategy should be developed for PMD, using turmeric and our in vitro and in vivo models.

    ABCC2 transporter and α2-adrenoceptors: Identification of novel compounds and their mode of action

    The main goal of this dissertation is to identify novel modulators acting on ATP Binding Cassette subfamily C member 2 (ABCC2) transporters and α2-adrenoceptor subtypes. To identify novel modulators and their mode of action, a combination of experimental and computational approaches has been used. The first protein presented in this dissertation is the ABCC2 transporter, also known as the multidrug resistance associated protein 2 (MRP2), an efflux transporter expressed in polarized cells, where it effluxes a variety of both endogenous and exogenous molecules out of the cell. The most common way to study the interactions between small molecules and the ABCC2 transporter is a vesicle transport assay. Three assays are commercially available, which use different probes to define ABCC2 transport. To characterise the different assays and identify the effect that small molecules have on ABCC2 transport, a small set of eight compounds and subsequently a larger library of compounds were tested with the different assays. A further aim was to identify and characterise novel ABCC2 inhibitors: 16 inhibitors were identified from the larger library, and classification models were built to identify important descriptors able to discriminate inhibitors from inactive molecules. Instant structure-activity relationships (SAR) of four scaffolds of ABCC2 modulators are also presented. In addition, some unpublished results are presented: the homology model of ABCC2 and further insights into the SAR of ABCC2 modulators. The other proteins included in this dissertation are the three subtypes of the α2-adrenoceptors, G-protein coupled receptors involved in the signalling pathway of adrenaline and noradrenaline. A clear subtype characterization/profile of these proteins is not available.
Selective molecules could be used in the treatment of high blood pressure, in the alleviation of withdrawal symptoms, and as anaesthetics with fewer side effects than current drugs. To define the affinity of a small set of antagonists and outline the involvement of the first transmembrane helix in ligand binding, a competition binding assay has been used with chimera receptors in which the first transmembrane helix has been swapped between the three subtypes. Molecular modelling has been used to explain the different binding affinities to the chimera receptors. A further aim was to identify novel α2B-adrenoceptor selective compounds, so a mid-sized library has been screened using a miniaturized binding assay. Hierarchical classification and chemoinformatics analysis have been used to visualize and analyse the screening results.
The dissertation deals with the identification of novel modulators of the ABCC2 transporter protein and inhibitors of α2-adrenoceptor subtypes, using both experimental and computational methods. I begin the dissertation with the ABCC2 transporter protein (ATP Binding Cassette protein family, subfamily C, member 2), also known as MRP2 (multidrug resistance-associated protein 2). It is an efflux transporter expressed in polarized cells that pumps many endogenous and exogenous molecules out of the cells. Interactions of compounds with ABCC2 are commonly studied with a vesicle transport assay. ABCC2 transport assays based on three different probe compounds are commercially available. To evaluate these assays and the effect of small molecules on ABCC2 transport, I tested first eight compounds and then a larger compound library in the different assay setups. I identified 16 novel ABCC2 inhibitors and built a descriptor-based classification model to distinguish inhibitors from inactive compounds.
I also present structure-activity relationships of ABCC2 modulation for four scaffolds and additionally discuss some unpublished results, such as a homology model built for the ABCC2 protein and further study of the structure-activity relationships of ABCC2 modulators. The dissertation also covers the three α2-adrenoceptor subtypes, which belong to the G protein-coupled receptors and take part in adrenaline and noradrenaline signalling. The characterization of these receptor subtypes is so far incomplete. Subtype-selective molecules could be used in the treatment of hypertension, to alleviate withdrawal symptoms, and as anaesthetics causing fewer adverse effects than current drugs. To determine the affinity of a small set of antagonists and to study the role of the first transmembrane helix (TM1) in binding, I used a competition binding assay and chimeric receptors in which TM1 was swapped between the three receptor subtypes. I explain the observed affinity differences using molecular modelling. In addition, I screened a mid-sized compound library with a miniaturized binding assay to identify novel α2B-adrenoceptor-selective compounds. I use hierarchical classification and chemoinformatics in analysing and presenting the screening results.