28 research outputs found

    Pairwise alignment incorporating dipeptide covariation

    Motivation: Standard algorithms for pairwise protein sequence alignment make the simplifying assumption that amino acid substitutions at neighboring sites are uncorrelated. This assumption allows implementation of fast algorithms for pairwise sequence alignment, but it ignores information that could conceivably increase the power of remote homolog detection. We examine the validity of this assumption by constructing extended substitution matrices that encapsulate the observed correlations between neighboring sites, by developing an efficient and rigorous algorithm for pairwise protein sequence alignment that incorporates these local substitution correlations, and by assessing the ability of this algorithm to detect remote homologies. Results: Our analysis indicates that local correlations between substitutions are not strong on average. Furthermore, incorporating local substitution correlations into pairwise alignment did not lead to a statistically significant improvement in remote homology detection. Therefore, the standard assumption that individual residues within protein sequences evolve independently of neighboring positions appears to be an efficient and appropriate approximation.
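
As an illustration of why the site-independence assumption described in this abstract permits fast alignment, the following is a minimal sketch of standard global alignment by dynamic programming. It is not the authors' extended algorithm: the function names, the linear gap penalty, and the match/mismatch scores of `toy_score` are assumptions chosen for illustration only.

```python
def toy_score(res_a, res_b):
    """Hypothetical substitution score: +1 for a match, -1 for a mismatch."""
    return 1 if res_a == res_b else -1


def global_alignment_score(seq_a, seq_b, substitution_score=toy_score, gap_penalty=-2):
    """Optimal global alignment score under site-independent scoring.

    Each aligned column is scored on its own, so the best score for two
    prefixes depends only on the three neighboring prefix pairs.
    """
    # Row 0: aligning an empty prefix of seq_a against prefixes of seq_b.
    prev = [j * gap_penalty for j in range(len(seq_b) + 1)]
    for i, res_a in enumerate(seq_a, start=1):
        curr = [i * gap_penalty]
        for j, res_b in enumerate(seq_b, start=1):
            curr.append(max(
                prev[j - 1] + substitution_score(res_a, res_b),  # align the two residues
                prev[j] + gap_penalty,                           # gap in seq_b
                curr[j - 1] + gap_penalty,                       # gap in seq_a
            ))
        prev = curr
    return prev[-1]
```

Because every column contributes independently, the table has one cell per prefix pair and the running time is proportional to the product of the sequence lengths. Scoring correlated neighboring columns would require each cell to carry extra state about the preceding column.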

    MTRAP: Pairwise sequence alignment algorithm by a new measure based on transition probability between two consecutive pairs of residues

    BACKGROUND: Sequence alignment is one of the most important techniques for analyzing biological systems. Current alignment methods are nevertheless imperfect, and more accurate ones are needed. In particular, alignments of homologous sequences with low sequence similarity are not yet satisfactory. Recent methods for aligning protein sequences typically use an empirically determined measure, usually defined as a combination of two quantities: (1) the sum of substitution scores between two residue segments, and (2) the sum of gap penalties in insertion/deletion regions. Such a measure assumes that there is no intersite correlation within the sequences. In this paper, we improve the alignment by taking the correlation between consecutive residues into account. RESULTS: We introduce a new alignment method, MTRAP, based on a metric defined on compound systems of two sequences. In benchmark tests on PREFAB 4.0 and HOMSTRAD, our pairwise alignment method gives higher accuracy than other methods such as ClustalW2, TCoffee, and MAFFT. For sequences with less than 15% sequence identity in particular, our method improves alignment accuracy significantly. Moreover, we show that our algorithm works well together with consistency-based progressive multiple alignment by modifying TCoffee to use our measure. CONCLUSIONS: Our method leads to a significant increase in alignment accuracy compared with other methods, and the improvement is especially clear in the low-identity range. The source code is available at our web page, whose address is given in the section "Availability and requirements".
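
The conventional measure that this abstract contrasts itself with, quantities (1) and (2), can be sketched as follows. This is only the standard site-independent measure, not MTRAP's transition-probability measure, whose definition is not given in the abstract. The function names, the affine gap convention (the first gap column costs `gap_open`, each further column costs `gap_extend`), and all numeric values are assumptions for illustration.

```python
def identity_score(res_a, res_b):
    """Hypothetical substitution score: +1 for a match, -1 for a mismatch."""
    return 1 if res_a == res_b else -1


def alignment_measure(row_a, row_b, substitution_score=identity_score,
                      gap_open=-3, gap_extend=-1):
    """Score a gapped pairwise alignment given as two equal-length rows.

    The result is (1) the sum of substitution scores over aligned residue
    pairs plus (2) the sum of affine gap penalties over gap columns.
    """
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    total = 0
    open_gap_row = None  # which row ("a" or "b") holds the gap being extended
    for res_a, res_b in zip(row_a, row_b):
        if res_a == "-" and res_b == "-":
            raise ValueError("a column cannot contain two gaps")
        if res_a == "-" or res_b == "-":
            gap_row = "a" if res_a == "-" else "b"
            total += gap_extend if open_gap_row == gap_row else gap_open
            open_gap_row = gap_row
        else:
            total += substitution_score(res_a, res_b)
            open_gap_row = None
    return total
```

Every column is scored without reference to its neighbors, apart from gap extension, which is the intersite independence that the abstract sets out to relax.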

    deepBlockAlign: a tool for aligning RNA-seq profiles of read block patterns

    Motivation: High-throughput sequencing methods allow whole transcriptomes to be sequenced fast and cost-effectively. Short RNA sequencing provides not only quantitative expression data but also an opportunity to identify novel coding and non-coding RNAs. Many long transcripts undergo post-transcriptional processing that generates short RNA sequence fragments. Mapped back to a reference genome, they form distinctive patterns that convey information on both the structure of the parent transcript and the modalities of its processing. The miR-miR* pattern from microRNA precursors is the best-known, but by no means singular, example

    Statistical approaches to the study of protein folding and energetics

    The determination of protein structure and the exploration of protein folding landscapes are two of the key problems in computational biology. In order to address these challenges, both a protein model that accurately captures the physics of interest and an efficient sampling algorithm are required. The first part of this thesis documents the continued development of CRANKITE, a coarse-grained protein model, and its energy landscape exploration using nested sampling, a Bayesian sampling algorithm. We extend CRANKITE and optimize its parameters using a maximum likelihood approach. The efficiency of our procedure, using the contrastive divergence approximation, allows a large training set to be used, producing a model which is transferable to proteins not included in the training set. We develop an empirical Bayes model for the prediction of protein β-contacts, which are required inputs for CRANKITE. Our approach couples the constraints and prior knowledge associated with β-contacts to a maximum entropy-based statistic which predicts evolutionarily-related contacts. Nested sampling (NS) is a Bayesian algorithm shown to be efficient at sampling systems which exhibit a first-order phase transition. In this work we parallelize the algorithm and, for the first time, apply it to a biophysical system: small globular proteins modelled using CRANKITE. We generate energy landscape charts, which give a large-scale visualization of the protein folding landscape, and we compare the efficiency of NS to an alternative sampling technique, parallel tempering, when calculating the heat capacity of a short peptide. In the final part of the thesis we adapt the NS algorithm for use within a molecular dynamics framework and demonstrate the application of the algorithm by calculating the thermodynamics of all-atom models of a small peptide, comparing results to the standard replica exchange approach. This adaptation will allow NS to be used with more realistic force fields in the future.
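
To make the nested sampling idea mentioned in this abstract concrete, here is a minimal serial sketch on a toy one-dimensional problem. It is not the thesis's parallel implementation and has nothing to do with CRANKITE: the Gaussian likelihood, the uniform prior on [-5, 5], the rejection-sampling replacement step, and all names and parameter values are assumptions made for illustration.

```python
import math
import random


def toy_log_likelihood(x):
    """Hypothetical log-likelihood: an unnormalized unit Gaussian."""
    return -0.5 * x * x


def toy_prior(rng):
    """Hypothetical prior: uniform on [-5, 5]."""
    return rng.uniform(-5.0, 5.0)


def nested_sampling_evidence(log_likelihood=toy_log_likelihood, sample_prior=toy_prior,
                             n_live=100, n_iter=600, seed=0):
    """Estimate the evidence Z, the prior-weighted integral of the likelihood."""
    rng = random.Random(seed)
    live = [sample_prior(rng) for _ in range(n_live)]
    live_logl = [log_likelihood(x) for x in live]
    evidence = 0.0
    prev_volume = 1.0
    for i in range(1, n_iter + 1):
        # Discard the live point with the lowest likelihood.
        worst = min(range(n_live), key=live_logl.__getitem__)
        worst_logl = live_logl[worst]
        # The enclosed prior volume shrinks by about exp(-1/n_live) per step.
        volume = math.exp(-i / n_live)
        evidence += math.exp(worst_logl) * (prev_volume - volume)
        prev_volume = volume
        # Replace it with a prior draw of higher likelihood (rejection sampling).
        while True:
            candidate = sample_prior(rng)
            candidate_logl = log_likelihood(candidate)
            if candidate_logl > worst_logl:
                break
        live[worst] = candidate
        live_logl[worst] = candidate_logl
    # Add the contribution of the remaining live points.
    evidence += prev_volume * sum(math.exp(logl) for logl in live_logl) / n_live
    return evidence
```

For this toy problem the analytic evidence is close to sqrt(2π)/10 ≈ 0.25, and the estimate should land near that value up to sampling noise. Rejection sampling from the prior is only practical for toy problems; realistic applications replace it with constrained Monte Carlo moves.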

    Expanding the repertoire of bacterial (non-)coding RNAs

    The detection of non-protein-coding RNA (ncRNA) genes in bacteria and their diverse regulatory mode of action moved the experimental and bio-computational analysis of ncRNAs into the focus of attention. Regulatory ncRNA transcripts are not translated to proteins but function directly on the RNA level. These typically small RNAs have been found to be involved in diverse processes such as (post-)transcriptional regulation and modification, translation, protein translocation, protein degradation and sequestration. Bacterial ncRNAs either arise from independent primary transcripts or their mature sequence is generated via processing from a precursor. Besides these autonomous transcripts, RNA regulators (e.g. riboswitches and RNA thermometers) also form chimera with protein-coding sequences. These structured regulatory elements are encoded within the messenger RNA and directly regulate the expression of their “host” gene. The quality and completeness of genome annotation is essential for all subsequent analyses. In contrast to protein-coding genes ncRNAs lack clear statistical signals on the sequence level. Thus, sophisticated tools have been developed to automatically identify ncRNA genes. Unfortunately, these tools are not part of generic genome annotation pipelines and therefore computational searches for known ncRNA genes are the starting point of each study. Moreover, prokaryotic genome annotation lacks essential features of protein-coding genes. Many known ncRNAs regulate translation via base-pairing to the 5’ UTR (untranslated region) of mRNA transcripts. Eukaryotic 5’ UTRs have been routinely annotated by sequencing of ESTs (expressed sequence tags) for more than a decade. Only recently, experimental setups have been developed to systematically identify these elements on a genome-wide scale in prokaryotes. 
The first part of this thesis describes three exploratory experimental surveys that analyze transcript organization in pathogenic bacteria. To identify ncRNAs in Pseudomonas aeruginosa we used a combination of an experimental RNomics approach and ncRNA prediction. Besides already known ncRNAs we identified and validated the expression of six novel RNA genes. Global detection of transcripts by next generation RNA sequencing techniques unraveled an unexpectedly complex transcript organization in many bacteria. These ultra high-throughput methods give us the appealing opportunity to analyze the complete RNA output of any species at once. The development of the differential RNA sequencing (dRNA-seq) approach enabled us to analyze the primary transcriptome of Helicobacter pylori and Xanthomonas campestris. For the first time we generated a comprehensive and precise transcription start site (TSS) map for both species and provide a general framework for the analysis of dRNA-seq data. Focusing on computer-aided analysis we developed new tools to annotate TSS, detect small protein-coding genes and to infer homology of newly detected transcripts. We discovered hundreds of TSS in intergenic regions, upstream of protein-coding genes, within operons and antisense to annotated genes. Analysis of 5’ UTRs (spanning from the TSS to the start codon of the adjacent protein-coding gene) revealed an unexpected size diversity ranging from zero to several hundred nucleotides. We identified and validated the expression of about 60 and about 20 ncRNA candidates in Helicobacter and Xanthomonas, respectively. Among these ncRNA candidates we found several small protein-coding genes that have previously evaded annotation in both species. We showed that the combination of dRNA-seq and computational analysis is a powerful method to examine prokaryotic transcriptomes. Experimental setups are time-consuming and often very costly.
Another limitation of experimental approaches is that genes which are expressed in specific developmental stages or stress conditions are likely to be missed. Bioinformatic tools offer an alternative that overcomes such constraints. General approaches usually depend on comparative genomic data, and evolutionary signatures are used to analyze the (non-)coding potential of multiple sequence alignments. In the second part of my thesis we present our major update of the widely used ncRNA gene finder RNAz and introduce RNAcode, an efficient tool to assess local protein-coding potential of genomic regions. RNAz has been successfully used to identify structured RNA elements in all domains of life. However, our own experience and the user feedback not only demonstrated the applicability of the RNAz approach, but also helped us to identify limitations of the current implementation. Using a much larger training set and a new classification model we significantly improved the prediction accuracy of RNAz. During transcriptome analysis we repeatedly identified small protein-coding genes that have not been annotated so far. Only a few of those genes are known to date and standard protein-coding gene finding tools suffer from the lack of training data. To avoid an excess of false positive predictions, gene finding software is usually run with an arbitrary cutoff of 40-50 amino acids and therefore misses the small protein-coding genes. We have implemented RNAcode which is optimized for emerging applications not covered by standard protein-coding gene annotation software. In addition to complementing classical protein gene annotation, a major field of application of RNAcode is the functional classification of transcribed regions. RNA sequencing analyses are likely to falsely report transcript fragments (e.g. mRNA degradation products) as non-coding. Hence, an evaluation of the protein-coding potential of these fragments is an essential task.
RNAcode reports local regions of high coding potential instead of complete protein-coding genes. A training on known protein-coding sequences is not necessary and RNAcode can therefore be applied to any species. We showed this with our analysis of the Escherichia coli genome where the current annotation could be accurately reproduced. We furthermore identified novel small protein-coding genes with RNAcode in this extensively studied genome. Using transcriptome and proteome data we found compelling evidence that several of the identified candidates are bona fide proteins. In summary, this thesis clearly demonstrates that bioinformatic methods are mandatory to analyze the huge amount of transcriptome data and to identify novel (non-)coding RNA genes. With the major update of RNAz and the implementation of RNAcode we helped complete the repertoire of gene finding software, which will help to unearth hidden treasures of the RNA World.
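
The effect of the 40-50 amino acid cutoff mentioned in this abstract can be illustrated with a naive open reading frame scan. This is a sketch under stated assumptions, not RNAcode's method (which evaluates evolutionary signatures in alignments) and not any particular gene finder: the function name, the forward-strand-only scan, and the ATG-to-stop ORF definition are all simplifications chosen for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=40):
    """Return forward-strand ORFs as (start, end, n_codons) tuples.

    An ORF runs from the first ATG in a frame to the next in-frame stop codon.
    `end` is exclusive and includes the stop; `n_codons` excludes the stop.
    ORFs shorter than `min_codons` are discarded, mimicking a length cutoff.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOP_CODONS:
                n_codons = (pos - start) // 3
                if n_codons >= min_codons:
                    orfs.append((start, pos + 3, n_codons))
                start = None
    return orfs
```

With the default cutoff, a hypothetical 20-codon gene such as `"ATG" + "GCT" * 19 + "TAA"` is silently dropped, whereas `find_orfs(seq, min_codons=10)` reports it. Lowering the cutoff in a scan like this also admits many spurious short ORFs, which is the trade-off the abstract describes.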

    Methoden zur Vorhersage von komplexen biomolekularen Strukturen [Methods for predicting complex biomolecular structures]

    The first high-resolution structure of a protein was solved in 1958 by John Kendrew and Max Perutz. Since then, experimental structure determination has been an important part of biological research. Determining the structures of biomolecular complexes, however, is very difficult. Yet these structures are immensely important for understanding many biological phenomena at the molecular level. For this reason, a field of research has developed that uses computer-aided modeling to predict biomolecular structures. The aim of this doctoral thesis was to develop methods for predicting complex biomolecular structures. These methods are based on three different approaches: The first method was developed for proteins that consist of several domains. It uses existing structures of the individual domains together with experimental data that capture the geometric relations between the domains, and it makes it possible to study conformational changes caused by external influences, such as the addition of a substrate. As a case study, the conformation of the flexible two-domain protein peptidylprolyl cis/trans isomerase NIMA-interacting 1 (Pin1) was investigated, along with its change in response to the addition of the substrate polyethylene glycol (PEG). The second method is based on the new technique Direct Coupling Analysis (DCA), which makes it possible to predict geometric contacts between amino acids from a multiple sequence alignment (MSA). DCA uses a correction to avoid sampling bias caused by the selection of sequences for the MSA. The optimization presented here enables a more robust prediction of the geometric contacts. The optimized method was applied to the analysis of the Human Immunodeficiency Virus-1 envelope protein (HIV-1 Env).
The last method was developed to predict binding regions of the negatively charged heparan sulfate on proteins. For this purpose, we developed a model based on electrostatic interaction. The case studies here are various heparan sulfate-binding proteins, such as the chemokine CCL3 and the Hedgehog proteins. Overall, it is shown that for different kinds of biomolecular structures and complexes, modern computational methods provide insights that are consistent with experiments.

    CHARACTERIZATION OF YBR074 (PFF1), A CONSERVED VACUOLAR MEMBRANE METALLOPROTEASE FAMILY MEMBER

    In yeast, the vacuole is the principal intracellular compartment associated with protein degradation. The vacuole acts as a buffering organelle that accumulates nutrients in times of plenty, and releases them into the cytosol during periods of nutrient starvation. This important biological function is mediated by vacuolar proteases, which exhibit a variety of conserved catalytic mechanisms. Metalloproteases represent one of the most diverse classes of proteases, and they are defined by a characteristic dependence on coordinated metal ions for their catalytic activity. In higher eukaryotes, metalloproteases are associated with both intracellular homeostasis and remodeling of the extracellular environment. In humans, remodeling of the extracellular matrix is mediated by secreted matrix metalloproteases regulating cell motility during development and wound healing, and also serving as markers of cancer metastasis. Within the cell, metalloproteases play a major role in the maturation and trafficking of proteins, as well as in the turnover of long-lived, superfluous, or damaged proteins and organelles. This dissertation represents the first characterization of the putative yeast metalloprotease Ybr074, which is named herein, protein in FXNA-related family (Pff1). Pff1 is a member of the conserved family of M28 metalloproteases, which includes the mammalian ortholog, FXNA. FXNA has been reported to be localized in the endoplasmic reticulum (ER), and is expressed in multiple tissues in rats, where it has been implicated in ovarian development. This dissertation shows that, unlike the ER-localized FXNA, Pff1 is a vacuolar protein. This finding is in agreement with extensive data, presented herein, demonstrating that Pff1 is not involved in protein quality control in the ER. However, genetic and chemical-genetic analyses suggest that Pff1 may have a role in yeast cell wall maintenance. 
Finally, this dissertation describes proteomic approaches employed in an attempt to identify endogenous substrates of Pff1, and outlines additional strategies aimed at defining the biological function of this novel vacuolar protease family member