
    Discovery and Extraction of Protein Sequence Motif Information that Transcends Protein Family Boundaries

    Protein sequence motifs are attracting growing attention in the field of sequence analysis. These recurring patterns have the potential to determine the conformation, function and activities of proteins. In our work, we obtained protein sequence motifs that are universally conserved across protein family boundaries. Therefore, unlike most popular motif discovery algorithms, our input dataset is extremely large, and an efficient technique is essential. We use two granular computing models, Fuzzy Improved K-means (FIK) and Fuzzy Greedy K-means (FGK), to generate protein motif information efficiently. We then develop an efficient Super Granular SVM Feature Elimination model to further extract the motif information. During the motif search, fixing the window size in advance may simplify the computation and increase efficiency. However, because of the fixed size, our model may deliver a number of similar motifs that are simply shifted by some residues or that include mismatches. We develop a new strategy named Positional Association Super-Rule to confront this problem of motifs generated from a fixed window size. It combines super-rule analysis with a novel Positional Association Rule algorithm. We use the super-rule concept to construct a Super-Rule-Tree (SRT) by a modified HHK clustering, which requires no parameter setup to identify the similarities and dissimilarities between the motifs. The positional association rule is created and applied to search for similar motifs that are shifted by some residues. By analyzing the motifs generated by our approaches, we find that they are significant not only at the sequence level, but also in secondary structure similarity and biochemical properties.
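The fixed-window issue described in the abstract above can be illustrated with a minimal sketch. It is not the authors' Positional Association Super-Rule method; the function names `fixed_windows` and `best_shift`, the example sequence, and the mismatch threshold are all hypothetical choices made for illustration. It only shows why a fixed window yields near-duplicate motifs that differ by a positional shift, and one simple way to detect such a shift.

```python
from typing import Iterator, Optional, Tuple


def fixed_windows(sequence: str, size: int) -> Iterator[Tuple[int, str]]:
    """Yield (start, segment) for every window of `size` residues."""
    for start in range(len(sequence) - size + 1):
        yield start, sequence[start:start + size]


def best_shift(motif_a: str, motif_b: str, max_shift: int,
               max_mismatches: int = 0) -> Optional[int]:
    """Return the shift s with the smallest |s| such that motif_b[i]
    aligns with motif_a[i + s] with at most `max_mismatches` mismatches
    over the overlapping positions, or None if no such shift exists.
    (Hypothetical helper, not taken from the paper.)"""
    # Try shifts in the order 0, +1, -1, +2, -2, ...
    candidates = [0]
    for s in range(1, max_shift + 1):
        candidates.extend([s, -s])
    for s in candidates:
        pairs = [
            (motif_a[i + s], motif_b[i])
            for i in range(len(motif_b))
            if 0 <= i + s < len(motif_a)
        ]
        if not pairs:
            continue  # no overlap at this shift
        mismatches = sum(1 for x, y in pairs if x != y)
        if mismatches <= max_mismatches:
            return s
    return None


if __name__ == "__main__":
    # Hypothetical sequence; adjacent windows share most of their residues.
    windows = list(fixed_windows("MKVLAAGIVGL", 5))
    a, b = windows[1][1], windows[2][1]  # "KVLAA" and "VLAAG"
    print(a, b, best_shift(a, b, max_shift=2))
```

Windows taken one residue apart from the same sequence are reported as shifted copies of each other, which is the redundancy a post-processing step has to merge.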

    Structural approaches to protein sequence analysis

    Various protein sequence analysis techniques are described, aimed at improving the prediction of protein structure by means of pattern matching. To investigate the possibility that improvements in amino acid comparison matrices could result in improvements in the sensitivity and accuracy of protein sequence alignments, a method for rapidly calculating amino acid mutation data matrices from large sequence data sets is presented. The method is then applied to the membrane-spanning segments of integral membrane proteins in order to investigate the nature of amino acid mutability in a lipid environment. Whilst purely sequence analytic techniques work well for cases where some residual sequence similarity remains between a newly characterized protein and a protein of known 3-D structure, in the harder cases, there is little or no sequence similarity with which to recognize proteins with similar folding patterns. In the light of these limitations, a new approach to protein fold recognition is described, which uses a statistically derived pairwise potential to evaluate the compatibility between a test sequence and a library of structural templates, derived from solved crystal structures. The method, which is called optimal sequence threading, proves to be highly successful, and is able to detect the common TIM barrel fold between a number of enzyme sequences, which has not been achieved by any previous sequence analysis technique. Finally, a new method for the prediction of the secondary structure and topology of membrane proteins is described. The method employs a set of statistical tables compiled from well-characterized membrane protein data, and a novel dynamic programming algorithm to recognize membrane topology models by expectation maximization. The statistical tables show definite biases towards certain amino acid species on the inside, middle and outside of a cellular membrane

    Computational studies on membrane proteins and membrane-drug interactions

    The cell membrane is a gateway to the cell and the immersion point for membrane proteins, and is therefore of interest for pharmacology and structural biology. This thesis aims to study its interaction with water, small molecules, polymers and proteins through molecular dynamics simulation and statistical analysis. In the first part of the thesis, I performed a statistical analysis of membrane proteins present in the PDB databank and enumerated in a structure database of known membrane proteins. Based on a statistical analysis of 127 proteins, it was shown that extracellular cysteines are not solvent accessible. This rule has not previously been stated and was poorly followed by the participants of the GPCR DOCK competitions in 2008 and 2010. Thus it can provide qualitative guidelines to improve structural modeling. In a second study, based on a statistical analysis of 39 membrane proteins of three or more transmembrane helices, all of different fold, we have shown and clustered the different spatial arrangements that sets of three interacting or consecutive helices can take, in addition to visualizing their abundance. In the second part of the thesis, I performed 200 ns simulations of membranes in both the gel (DSPC) and liquid-crystalline (DLPC) states with solvent and ions; these simulations were repeated with functionalized PEG polymers included (PEGylation). We also performed 200 ns lipid membrane simulations in the liquid-crystalline (POPC) state with hematoporphyrin. Our studies provide a new, more accurate description of the interactions between the lipid membrane and ions, and portray PEG polymers as dynamic molecules that loop around Na+ ions and penetrate the liquid-crystalline membrane, rather than as just a steric barrier outside the membrane. This sheds new light on the mechanism of liposome protection by PEG as well as on the triggering of the release of liposome content through a heat-induced lipid phase transition.
Hematoporphyrin was shown to reside in the lipid headgroup carbonyl region. Ionized hematoporphyrin has a lower affinity for the membrane as well as forming stable dimers in the aqueous phase. The research was in agreement with experimental data and has provided a molecular-level view of the interactions between photosensitizers and the membrane. (Finnish abstract, translated:) The cell membrane serves both as a gateway to the cell and as a platform for membrane proteins, which makes it an interesting subject of study for pharmacology and structural biology. The aim of the thesis is to study the interactions of the cell membrane with water, small molecules, polymers and proteins by means of molecular dynamics simulation and statistical analysis. We performed a statistical analysis of the membrane proteins in the PDB database, based on the database of known membrane proteins maintained by Professor Stephen White. The analysis, covering 127 proteins, showed that extracellular cysteines are not exposed to solvent. Such an observation has not been presented before, and, for example, the groups that took part in the GPCR DOCK competitions in 2008 and 2010 did not make use of it. The observation can thus offer qualitative guidelines for improving structural modeling. We analyzed 39 membrane proteins of different folds that contain three or more transmembrane helices. On the basis of this analysis, we identified and clustered the different spatial arrangements found for three consecutive or interacting helices and visualized their abundance. We simulated the cell membrane for 200 ns with solvent and ions in both the gel and the liquid-crystalline state (DSPC and DLPC membrane lipids). The simulations were repeated with membrane lipids functionalized with PEG polymers (PEGylation). In addition, we simulated a liquid-crystalline POPC lipid membrane with hematoporphyrin for 200 ns.
We found that the phase state of the lipid membrane affects the membrane's interactions with ions and polymers, in particular the ability of ions and polymers to penetrate and bind to the carbonyl and core regions of the membrane. Our study provides a more accurate description than before of the interactions of the lipid membrane with ions, and portrays PEG polymers as dynamic molecules that wrap around Na+ ions and penetrate the liquid-crystalline membrane, rather than as a mere steric barrier outside the membrane. This sheds light on the mechanisms of PEG protection of liposomes and of the release of liposome contents triggered by a heat-induced lipid phase transition. We found that hematoporphyrin settles in the carbonyl region of the hydrophilic lipid headgroups. Ionized hematoporphyrin binds to the membrane more weakly and, in addition, does not form stable dimers in aqueous solution. The results are consistent with experimental results and provide a molecular-level picture of the interactions between photosensitizers and the membrane.

    Exploring dynamics of protein structure determination and homology-based prediction to estimate the number of superfamilies and folds

    BACKGROUND: As tertiary structure is currently available only for a fraction of known protein families, it is important to assess what parts of sequence space have been structurally characterized. We consider protein domains whose structure can be predicted by sequence similarity to proteins with solved structure and address the following questions. Do these domains represent an unbiased random sample of all sequence families? Do targets solved by structural genomic initiatives (SGI) provide such a sample? What are approximate total numbers of structure-based superfamilies and folds among soluble globular domains? RESULTS: To make these assessments, we combine two approaches: (i) sequence analysis and homology-based structure prediction for proteins from complete genomes; and (ii) monitoring dynamics of the assigned structure set in time, with the accumulation of experimentally solved structures. In the Clusters of Orthologous Groups (COG) database, we map the growing population of structurally characterized domain families onto the network of sequence-based connections between domains. This mapping reveals a systematic bias suggesting that target families for structure determination tend to be located in highly populated areas of sequence space. In contrast, the subset of domains whose structure is initially inferred by SGI is similar to a random sample from the whole population. To accommodate for the observed bias, we propose a new non-parametric approach to the estimation of the total numbers of structural superfamilies and folds, which does not rely on a specific model of the sampling process. Based on dynamics of robust distribution-based parameters in the growing set of structure predictions, we estimate the total numbers of superfamilies and folds among soluble globular proteins in the COG database. CONCLUSION: The set of currently solved protein structures allows for structure prediction in approximately a third of sequence-based domain families. 
The choice of targets for structure determination is biased towards domains with many sequence-based homologs. The growing SGI output should further reduce this bias in the future. The total numbers of structural superfamilies and folds in the COG database are estimated as ~4000 and ~1700, respectively. These numbers are, respectively, four and three times higher than the numbers of superfamilies and folds that can currently be assigned to COG proteins.

    A structural classification of protein-protein interactions for detection of convergently evolved motifs and for prediction of protein binding sites on sequence level

    BACKGROUND: A long-standing challenge in the post-genomic era of bioinformatics is the prediction of protein-protein interactions, and ultimately the prediction of protein functions. The problem is intrinsically harder when only amino acid sequences are available, but a solution is more universally applicable. So far, the problem of uncovering protein-protein interactions has been addressed in a variety of ways, both experimentally and computationally. MOTIVATION: The central problem is: How can protein complexes with solved three-dimensional structure be utilized to identify and classify protein binding sites, and how can knowledge be inferred from this classification such that protein interactions can be predicted for proteins without solved structure? The underlying hypothesis is that protein binding sites are often restricted to a small number of residues, which additionally are often well conserved in order to maintain an interaction. Therefore, the signal-to-noise ratio in binding sites is expected to be higher than in other parts of the surface. This enables binding site detection in unknown proteins when homology-based annotation transfer fails. APPROACH: The problem is addressed by first investigating how geometrical aspects of domain-domain associations can lead to a rigorous structural classification of the multitude of protein interface types. The interface types are explored with respect to two aspects: First, how do interface types with one-sided homology reveal convergently evolved motifs? Second, how can sequential descriptors for local structural features be derived from the interface type classification? Then, the use of sequential representations for binding sites in order to predict protein interactions is investigated. The underlying algorithms are based on machine learning techniques, in particular Hidden Markov Models. RESULTS: This work includes a novel approach to a comprehensive geometrical classification of domain interfaces.
Alternative structural domain associations are found for 40% of all family-family interactions. Evaluation of the classification algorithm on a hand-curated set of interfaces yielded a precision of 83% and a recall of 95%. For the first time, a systematic screen of convergently evolved motifs in 102,000 protein-protein interactions with structural information is derived. With respect to this dataset, all cases related to viral mimicry of human interface bindings are identified. Finally, a library of 740 motif descriptors for binding site recognition, encoded as Hidden Markov Models, is generated and cross-validated. Tests for the significance of motifs are provided. The usefulness of descriptors for protein-ligand binding sites is demonstrated for the case of "ATP-binding", where a precision of 89% is achieved, thus outperforming comparable motifs from PROSITE. In particular, a novel descriptor for a P-loop variant has been used to identify ATP-binding sites in 60 protein sequences that have not been annotated before by existing motif databases.
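For readers less familiar with the evaluation metrics quoted in the abstract above, the following minimal sketch shows how precision and recall are computed from confusion counts. The counts used in the example are hypothetical and were chosen only because they give values close to the reported 83% and 95%; the abstract does not state the size of the hand-curated set.

```python
from typing import Tuple


def precision_recall(true_pos: int, false_pos: int,
                     false_neg: int) -> Tuple[float, float]:
    """Return (precision, recall) from confusion counts.

    precision = TP / (TP + FP): fraction of predicted positives that are correct.
    recall    = TP / (TP + FN): fraction of actual positives that are found.
    """
    if true_pos + false_pos == 0 or true_pos + false_neg == 0:
        raise ValueError("precision or recall is undefined for these counts")
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return precision, recall


if __name__ == "__main__":
    # Hypothetical counts, NOT taken from the paper.
    p, r = precision_recall(true_pos=19, false_pos=4, false_neg=1)
    print(f"precision={p:.2%} recall={r:.2%}")
```

A high recall combined with a somewhat lower precision, as reported, means that few true interfaces are missed while a larger share of the predicted ones are false positives.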

    MYRbase: analysis of genome-wide glycine myristoylation enlarges the functional spectrum of eukaryotic myristoylated proteins

    We evaluated the evolutionary conservation of glycine myristoylation within eukaryotic sequences. Our large-scale cross-genome analyses, available as MYRbase, show that the functional spectrum of myristoylated proteins is currently largely underestimated. We give experimental evidence for in vitro myristoylation of selected predictions. Furthermore, we classify five membrane-attachment factors that occur most frequently in combination with, or even replacing, myristoyl anchors, as some protein family examples show

    Towards Constructing a Corpus for Studying the Effects of Treatments and Substances Reported in PubMed Abstracts

    We present the construction of an annotated corpus of PubMed abstracts reporting positive, negative or neutral effects of treatments or substances. Our ultimate goal is to annotate one sentence (rationale) for each abstract and to use this resource as a training set for text classification of effects discussed in PubMed abstracts. Currently, the corpus consists of 750 abstracts. We describe the automatic processing that supports the corpus construction, the manual annotation activities and some features of the medical language in the abstracts selected for the annotated corpus. It turns out that recognizing the terminology and the abbreviations is key to determining the rationale sentence. The corpus will be applied to improve our classifier, which currently has an accuracy of 78.80%, achieved with normalization of the abstract terms based on UMLS concepts from specific semantic groups and an SVM with a linear kernel. Finally, we discuss some other possible applications of this corpus. Comment: medical relation extraction, rationale extraction, effects and treatments, bioNL