80 research outputs found

    Back-translation for discovering distant protein homologies in the presence of frameshift mutations

    Background: Frameshift mutations in protein-coding DNA sequences drastically change the resulting protein sequence, which prevents classic protein alignment methods from revealing the proteins' common origin. Moreover, when a large number of substitutions are also involved in the divergence, homology detection becomes difficult even at the DNA level. Results: We developed a novel method to infer distant homology relations between two proteins that accounts for frameshift and point mutations that may have affected the coding sequences. We design a dynamic programming alignment algorithm over memory-efficient graph representations of the complete set of putative DNA sequences of each protein, with the goal of determining the two putative DNA sequences that have the best-scoring alignment under a powerful scoring system designed to reflect the most probable evolutionary process. Our implementation is freely available at http://bioinfo.lifl.fr/path/. Conclusions: Our approach uncovers evolutionary information that is not captured by traditional alignment methods, which is confirmed by biologically significant examples.
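To make "the complete set of putative DNA sequences of each protein" concrete, here is a minimal Python sketch of back-translation under the standard genetic code. It is an illustration only: the function names (`back_translate`, `count_putative_dna`, `putative_dna`, `translate`) are hypothetical, and the explicit per-residue codon lists used here are a naive stand-in for the memory-efficient graph representation the abstract mentions, not the authors' data structure.

```python
from itertools import product
from math import prod

# Standard genetic code (NCBI translation table 1), codons ordered over T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: aa
    for (a, b, c), aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Reverse mapping: amino acid -> list of codons that encode it.
CODONS_FOR = {}
for codon, aa in CODON_TABLE.items():
    CODONS_FOR.setdefault(aa, []).append(codon)


def back_translate(protein):
    """Return, for each residue, the list of codons that could encode it."""
    return [CODONS_FOR[aa] for aa in protein]


def count_putative_dna(protein):
    """Number of distinct DNA sequences that translate to `protein`."""
    return prod(len(codons) for codons in back_translate(protein))


def putative_dna(protein):
    """Enumerate every putative DNA sequence (feasible only for short proteins)."""
    for codons in product(*back_translate(protein)):
        yield "".join(codons)


def translate(dna, frame=0):
    """Translate `dna` starting at offset `frame`, ignoring trailing bases."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(frame, len(dna) - 2, 3))


if __name__ == "__main__":
    print(count_putative_dna("MLS"))
    print(translate("ATGTGG"), translate("ATGTGG", frame=1))
```

The number of putative sequences is the product of the per-residue codon counts, so it grows exponentially with protein length. This is why explicit enumeration is impractical for real proteins and a compact representation is needed. Translating the same DNA from a shifted offset also shows how a frameshift changes every downstream residue.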

    Back-translation for discovering distant protein homologies

    Frameshift mutations in protein-coding DNA sequences drastically change the resulting protein sequence, which prevents classic protein alignment methods from revealing the proteins' common origin. Moreover, when a large number of substitutions are also involved in the divergence, homology detection becomes difficult even at the DNA level. To cope with this situation, we propose a novel method to infer distant homology relations between two proteins that accounts for frameshift and point mutations that may have affected the coding sequences. We design a dynamic programming alignment algorithm over memory-efficient graph representations of the complete set of putative DNA sequences of each protein, with the goal of determining the two putative DNA sequences that have the best-scoring alignment under a powerful scoring system designed to reflect the most probable evolutionary process. This allows us to uncover evolutionary information that is not captured by traditional alignment methods, which is confirmed by biologically significant examples. Comment: The 9th International Workshop in Algorithms in Bioinformatics (WABI), Philadelphia, United States of America (2009)

    HMM-FRAME: accurate protein domain classification for metagenomic sequences containing frameshift errors

    Background: Protein domain classification is an important step in metagenomic annotation. The state-of-the-art method for protein domain classification is profile HMM-based alignment. However, the relatively high rates of insertions and deletions in homopolymer regions of pyrosequencing reads create frameshifts, causing conventional profile HMM alignment tools to generate alignments with marginal scores. This makes error-containing gene fragments unclassifiable with conventional tools. Thus, there is a need for an accurate domain classification tool that can detect and correct sequencing errors. Results: We introduce HMM-FRAME, a protein domain classification tool based on an augmented Viterbi algorithm that can incorporate error models from different sequencing platforms. HMM-FRAME corrects sequencing errors and classifies putative gene fragments into domain families. It achieved high error detection sensitivity and specificity in a data set with annotated errors. We applied HMM-FRAME in Targeted Metagenomics and a published metagenomic data set. The results showed that our tool can correct frameshifts in error-containing sequences, generate much longer alignments with significantly smaller E-values, and classify more sequences into their native families. Conclusions: HMM-FRAME provides a complementary protein domain classification tool to conventional profile HMM-based methods for data sets containing frameshifts. Its current implementation is best used for small-scale metagenomic data sets. The source code of HMM-FRAME can be downloaded at http://www.cse.msu.edu/~zhangy72/hmmframe/ and at https://sourceforge.net/projects/hmm-frame/.

    Modeling and comparison of gene structure (Modélisation et comparaison de la structure de gènes)

    Bioinformatics is a multidisciplinary research field at the crossroads of biology, medicine, mathematics, statistics, chemistry, physics and computer science. Its aim is to design and apply statistical and computational models and tools that advance knowledge in biology and related sciences. In this context, understanding how genes function and evolve is the subject of many bioinformatics studies. These studies are mostly based on gene comparison, and in particular on the alignment of genomic sequences. However, when computing genomic sequence alignments, existing methods rely solely on sequence similarity and do not take gene structure into account. Alignment that takes sequence structure into account offers the opportunity to improve alignment accuracy as well as the results of methods built on these alignments. This hypothesis underlies the objective of this doctoral thesis: to propose models that account for gene structure when aligning the sequences of gene families. Through this thesis, we have contributed to scientific knowledge by developing biological sequence alignment models that integrate information on the coding and splicing structure of the sequences. We proposed an algorithm and a new scoring function for aligning coding DNA sequences (CDS) that take into account the length of translation frameshifts. We also proposed an algorithm for aligning pairs of sequences from a gene family while considering their splicing structures. We further developed an algorithm for assembling pairwise spliced alignments into multiple sequence alignments. Finally, we developed a tool for visualizing multiple spliced alignments of gene families. In this thesis, we highlighted the importance and demonstrated the usefulness of taking the structure of the input sequences into account when computing their alignment.

    Applications of Evolutionary Bioinformatics in Basic and Biomedical Research

    With the revolutionary progress in sequencing technologies, computational biology emerged as a game-changing field that is applied to understanding the molecular events of life, for exploratory as well as complementary purposes. Bioinformatics resources and tools significantly help in data generation, organization and analysis. However, there is still a need for new approaches built from a biologist's point of view. In protein bioinformatics, there are several fundamental problems, such as (i) determining protein function; (ii) identifying protein-protein interactions; (iii) predicting the effect of amino acid variants. Here, I present three chapters addressing these problems from an evolutionary perspective. Firstly, I describe a novel search pipeline for protein domain identification. The algorithm chain provides sensitive domain assignments with the highest possible specificity. Secondly, I present a tool enabling large-scale visualization of the presence and absence of proteins in hierarchically clustered genomes. This tool visualizes multi-layer information of any kind of genome-linked data with a special focus on domain architectures, enabling identification of coevolving domains/proteins, which can eventually help in identifying functionally interacting proteins. Finally, I propose an approach for distinguishing between benign and damaging missense mutations in a human disease by establishing the precise evolutionary history of the associated gene. This part introduces new criteria for determining functional orthologs via phylogenetic analysis. All three parts use comparative genomics and/or sequence analyses. Taken together, this study addresses important problems in protein bioinformatics, and as a whole it can be used to describe proteins by their domains, coevolving partners and functionally important residues.

    Algorithms for the description of molecular sequences

    Unambiguous sequence variant descriptions are important in reporting the outcome of clinical diagnostic DNA tests. The standard nomenclature of the Human Genome Variation Society (HGVS) describes the observed variant sequence relative to a given reference sequence. We propose an efficient algorithm for the extraction of HGVS descriptions from two DNA sequences. Our algorithm is able to compute the HGVS descriptions of complete chromosomes or other large DNA strings in a reasonable amount of computation time, and its resulting descriptions are relatively small. Additional applications include updating of gene variant database contents and reference sequence liftovers. Next, we adapted our method for the extraction of descriptions for protein sequences, in particular for describing frame-shifted variants. We propose an addition to the HGVS nomenclature for accommodating the (complex) frame-shifted variants that can be described with our method. Finally, we applied our method to generate descriptions for Short Tandem Repeats (STRs), a form of self-similarity. We propose an alternative repeat variant that can be added to the existing HGVS nomenclature. The final chapter takes an explorative approach to classification in large cohort studies. We provide a "cross-sectional" investigation on this data to see the relative power of the different groups. Algorithms and the Foundations of Software technolog
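As a rough illustration of describing an observed sequence relative to a reference, the following Python sketch trims the common prefix and suffix of two DNA strings and reports the remaining difference in a simplified HGVS-like form. It is a naive stand-in, not the algorithm proposed in the thesis: the function name `describe_variant` is hypothetical, only a single contiguous change is handled, and duplications, inversions, repeats and changes at the sequence edges are deliberately ignored.

```python
def describe_variant(reference: str, observed: str) -> str:
    """Return a simplified HGVS-like description of one contiguous change.

    Assumption: the two sequences differ in at most one contiguous region.
    Insertions at the very start or end of the reference are not handled.
    """
    if reference == observed:
        return "g.="

    limit = min(len(reference), len(observed))

    # Length of the longest common prefix.
    prefix = 0
    while prefix < limit and reference[prefix] == observed[prefix]:
        prefix += 1

    # Length of the longest common suffix that does not overlap the prefix.
    suffix = 0
    while (suffix < limit - prefix
           and reference[len(reference) - 1 - suffix]
           == observed[len(observed) - 1 - suffix]):
        suffix += 1

    deleted = reference[prefix:len(reference) - suffix]
    inserted = observed[prefix:len(observed) - suffix]
    start = prefix + 1           # 1-based first changed reference position
    end = prefix + len(deleted)  # 1-based last changed reference position

    if len(deleted) == 1 and len(inserted) == 1:
        return f"g.{start}{deleted}>{inserted}"          # substitution
    if not deleted:
        return f"g.{prefix}_{prefix + 1}ins{inserted}"   # insertion between two bases
    span = str(start) if start == end else f"{start}_{end}"
    if not inserted:
        return f"g.{span}del"                            # deletion
    return f"g.{span}delins{inserted}"                   # deletion-insertion


if __name__ == "__main__":
    for ref, obs in [("ACGT", "AGGT"), ("AAAT", "AAT"), ("ACGT", "ACTTGT")]:
        print(ref, obs, describe_variant(ref, obs))
```

Trimming the prefix before the suffix places a deletion in a repeated run at its rightmost possible position, which matches the HGVS convention of describing variants at the most 3' position. A full extractor would need to handle multiple changes per sequence pair and the other variant types.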

    Conservation of different mechanisms of Hox cluster regulation within chordates

    In this thesis we have covered the importance of finding underlying conservation events to better understand the regulatory mechanisms of important developmental orchestrators like the Hox cluster. As examples of this non-evident conservation, we present two cases, described below. The first case, studied after developing software able to detect homologous long noncoding RNAs by means of microsynteny analyses, is the conservation of Hotairm1 in Chordata. To assess the homology of this lncRNA, we first had to identify the lncRNA fraction within the B. lanceolatum transcriptome. With a reliable lincRNA dataset, we used our pipeline, LincOFinder, to identify orthologs between human and amphioxus through microsynteny. After identifying Hotairm1 as one of the lincRNAs with conserved microsynteny, we used Xenopus as a proxy to analyse the homologies in expression and function. We had to proceed this way because of the difficulties associated with inhibiting genes in B. lanceolatum and the unavailability of expression patterns for Hotairm1 in the literature. After we successfully characterised Hotairm1 expression in amphioxus and Xenopus, we injected morpholino oligonucleotides to target and inhibit the splicing of Hotairm1 and promote an isoform imbalance. From the phenotype obtained and from qPCRs, we were able to deduce the mechanism of Hotairm1 and relate it to the one described in human cells. With all the data obtained we were able to strongly suggest that the amphioxus Hotairm1 is homologous to the Xenopus and human Hotairm1, and is thus conserved in most of the lineages within chordates. The second case studied was the conservation of the regulation of the Hox cluster mediated by Cdx. When analysing the B. floridae knockouts of Cdx and Pdx obtained using the TALEN technique, we found a severe phenotype of the developing larvae in Cdx-/- and a mild phenotype in Pdx-/-. The Cdx-/- phenotype consisted of the disruption of posterior gut development, as well as an underdevelopment of the postanal tail, coupled with a non-opening anus. When looking at changes in the expression of the Hox cluster in these Cdx-/- embryos, we found collinear misregulation of the expressed Hox genes, with the most anterior Hox cluster genes upregulated and the most posterior ones downregulated. This is very similar to findings in triple morpholino knockdowns of the Cdx genes in Xenopus, indicating that in both Xenopus and amphioxus, Cdx regulates the Hox cluster through a homologous mechanism.

    Hypermutation and adaptation of experimentally evolved marine Vibrio bacteria

    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Biological Engineering, 2013. Cataloged from PDF version of thesis. Includes bibliographical references (p. 73-83). Environmental bacteria display tremendous genetic diversity, but we are still learning how this diversity arises and relates to their wide range of habitats. Investigating how bacteria adapt helps us understand their contributions to environmental processes and informs forward engineering of bacteria for industrial applications. Experimental evolution is a powerful approach, with microbes especially, but it has mostly been applied to model organisms and metabolic functions. In the work here, we investigated the possibility, degree, and variability of adaptation of an environmental Vibrio strain by applying a little-used selection method appropriate to a relevant condition, salinity. We successfully isolated mutants with higher salt tolerance by selecting on salt gradient plates. Resequencing the genomes of the evolved strains revealed unprecedented hypermutation in three of nine parallel lineages. These mutator lines arose independently, and each of them accumulated more than 1500 single-base mutations. By comparison, there are only 302 single-base differences between the ancestor strain and another strain isolated in the wild. Hypermutation was associated with a deletion resulting from improper prophage excision. Members of this family of prophages are found in other proteobacteria, including well-studied human pathogens, from very different environments. Mutators are known to arise spontaneously in wild and clinical bacteria, but the extent of their adaptive contribution is unknown. We have preliminary evidence that this mechanism of evolution could be relevant in the environment, where horizontal gene transfer and mobile elements play known, significant roles in bacterial evolution. By Sean Aidan Clarke. Ph.D.