502 research outputs found

    Back-translation for discovering distant protein homologies

    Frameshift mutations in protein-coding DNA sequences produce a drastic change in the resulting protein sequence, which prevents classic protein alignment methods from revealing the proteins' common origin. Moreover, when a large number of substitutions are additionally involved in the divergence, homology detection becomes difficult even at the DNA level. To cope with this situation, we propose a novel method to infer distant homology relations between two proteins that accounts for frameshift and point mutations that may have affected the coding sequences. We design a dynamic programming alignment algorithm over memory-efficient graph representations of the complete set of putative DNA sequences of each protein, with the goal of determining the two putative DNA sequences which have the best scoring alignment under a powerful scoring system designed to reflect the most probable evolutionary process. This allows us to uncover evolutionary information that is not captured by traditional alignment methods, which is confirmed by biologically significant examples. Comment: The 9th International Workshop on Algorithms in Bioinformatics (WABI), Philadelphia, United States (2009)
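
    To give a sense of what "the complete set of putative DNA sequences of each protein" means, here is a minimal sketch that back-translates a protein with the standard genetic code and counts its candidate coding sequences. It is only an illustration, not the authors' graph-based algorithm; the names `CODONS_FOR`, `count_back_translations`, and `back_translations` are hypothetical.

```python
from itertools import product
from math import prod

# Standard genetic code (NCBI translation table 1), codons in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each amino acid (or '*' for stop) to the list of codons that encode it.
CODONS_FOR = {}
for codon, aa in zip(("".join(c) for c in product(BASES, repeat=3)), AMINO_ACIDS):
    CODONS_FOR.setdefault(aa, []).append(codon)


def count_back_translations(protein):
    """Number of distinct DNA sequences that translate to `protein`."""
    return prod(len(CODONS_FOR[aa]) for aa in protein)


def back_translations(protein):
    """Yield every putative coding sequence (only practical for short inputs)."""
    for combo in product(*(CODONS_FOR[aa] for aa in protein)):
        yield "".join(combo)


if __name__ == "__main__":
    print(count_back_translations("MLS"))
    print(list(back_translations("MW")))
```

    The count grows multiplicatively with protein length, so explicit enumeration quickly becomes infeasible; this is presumably why the abstract relies on a compact graph representation instead.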

    MACSE: Multiple Alignment of Coding SEquences Accounting for Frameshifts and Stop Codons

    Until now the most efficient solution to align nucleotide sequences containing open reading frames was to use indirect procedures that align amino acid translation before reporting the inferred gap positions at the codon level. There are two important pitfalls with this approach. Firstly, any premature stop codon impedes using such a strategy. Secondly, each sequence is translated with the same reading frame from beginning to end, so that the presence of a single additional nucleotide leads to both aberrant translation and alignment
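
    The second pitfall can be seen in a small sketch: translating in a single fixed reading frame, one inserted nucleotide changes every downstream codon. The sequence and the `translate` helper below are made up for illustration and are not part of MACSE.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(
    zip(("".join(c) for c in product(BASES, repeat=3)), AMINO_ACIDS)
)


def translate(dna):
    """Translate in frame 0 from start to end; trailing 1-2 bases are ignored."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


original = "ATGGCCAAGCTGACC"
# Hypothetical sequencing error: one extra G inserted after the start codon.
shifted = original[:3] + "G" + original[3:]

if __name__ == "__main__":
    print(translate(original))  # MAKLT
    print(translate(shifted))   # MGQAD
```

    Only the first residue survives the insertion, so an alignment built from the two translations would carry little signal even though the nucleotide sequences differ by a single base.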

    Sequence Search Algorithms for Single Pass Sequence Identification: Does One Size Fit All?

    Bioinformatic tools have become essential to biologists in their quest to understand the vast quantities of sequence data, and now whole genomes, which are being produced at an ever-increasing rate. Many of these sequence data are single-pass sequences, such as sample sequences from organisms closely related to other organisms of interest which have already been sequenced, or cDNAs or expressed sequence tags (ESTs). These single-pass sequences often contain errors, including frameshifts, which complicate the identification of homologues, especially at the protein level. Therefore, sequence searches with this type of data are often performed at the nucleotide level. The most commonly used sequence search algorithms for the identification of homologues are Washington University’s and the National Center for Biotechnology Information's (NCBI) versions of the BLAST suites of tools, which are to be found on websites all over the world. The work reported here examines the use of these tools for comparing sample sequence datasets to a known genome. It shows that care must be taken when choosing the parameters to use with the BLAST algorithms. NCBI’s version of gapped BLASTn gives much shorter, and sometimes different, top alignments than those found using Washington University’s version of BLASTn (which also allows for gaps), when both are used with their default parameters. Most of the differences in performance were found to be due to the choices of default parameters rather than underlying differences between the two algorithms. Washington University’s version, used with defaults, compares very favourably with the results obtained using the accurate but computationally intensive Smith–Waterman algorithm

    Modélisation et comparaison de la structure de gènes (Modeling and comparison of gene structure)

    Bioinformatics is a multidisciplinary research field at the crossroads of several disciplines: biology, medicine, mathematics, statistics, chemistry, physics, and computer science. Its aim is to design and apply statistical and computational models and tools that advance knowledge in biology and related sciences. In this context, understanding how genes function and evolve is the subject of many bioinformatics studies. These studies rely mostly on gene comparison, and in particular on the alignment of genomic sequences. However, when computing genomic sequence alignments, existing methods rely solely on sequence similarity and do not take gene structure into account. Alignment that takes sequence structure into account offers the opportunity to improve alignment accuracy as well as the results of methods built on these alignments. This hypothesis frames the objective of this doctoral thesis: to propose models that account for gene structure when aligning the sequences of gene families. In this thesis, we have thus contributed to scientific knowledge by developing alignment models for biological sequences that integrate information on the coding and splicing structure of the sequences. We proposed an algorithm and a new scoring function for aligning coding DNA sequences (CDS) that take into account the length of translation frameshifts. We also proposed an algorithm for aligning pairs of sequences from a gene family while considering their splicing structures. We further developed an algorithm for assembling pairwise spliced alignments into multiple sequence alignments. Finally, we developed a tool for visualizing multiple spliced alignments of gene families. In this thesis, we highlighted the importance, and demonstrated the usefulness, of taking the structure of the input sequences into account when computing their alignment

    FATHMM: Frameshift Aware Translated Hidden Markov Models


    A conserved predicted pseudoknot in the NS2A-encoding sequence of West Nile and Japanese encephalitis flaviviruses suggests NS1' may derive from ribosomal frameshifting

    Japanese encephalitis, West Nile, Usutu and Murray Valley encephalitis viruses form a tight subgroup within the larger Flavivirus genus. These viruses utilize a single-polyprotein expression strategy, resulting in ~10 mature proteins. Plotting the conservation at synonymous sites along the polyprotein coding sequence reveals strong conservation peaks at the very 5' end of the coding sequence, and also at the 5' end of the sequence encoding the NS2A protein. Such peaks are generally indicative of functionally important non-coding sequence elements. The second peak corresponds to a predicted stable pseudoknot structure whose biological importance is supported by compensatory mutations that preserve the structure. The pseudoknot is preceded by a conserved slippery heptanucleotide (Y CCU UUU), thus forming a classical stimulatory motif for -1 ribosomal frameshifting. We hypothesize, therefore, that the functional importance of the pseudoknot is to stimulate a portion of ribosomes to shift -1 nt into a short (45 codon), conserved, overlapping open reading frame, termed foo. Since cleavage at the NS1-NS2A boundary is known to require synthesis of NS2A in cis, the resulting transframe fusion protein is predicted to be NS1-NS2A(N-term)-FOO. We hypothesize that this may explain the origin of the previously identified NS1 'extension' protein in JEV-group flaviviruses, known as NS1'
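
    As a rough illustration of how the slippery heptanucleotide could be located in a sequence, the sketch below scans for the motif Y CCU UUU, assuming Y is the IUPAC code for a pyrimidine (C or U). The input sequence and the function name `find_slippery_sites` are hypothetical; this is not the analysis pipeline used in the paper, and it does not check for a downstream pseudoknot.

```python
import re

# Y CCU UUU with Y = C or U (assumed IUPAC pyrimidine code).
# The lookahead lets overlapping occurrences be reported as well.
SLIPPERY_SITE = re.compile(r"(?=([CU]CCUUUU))")


def find_slippery_sites(sequence):
    """Return 0-based start positions of the motif in an RNA or DNA string."""
    rna = sequence.upper().replace("T", "U")
    return [match.start() for match in SLIPPERY_SITE.finditer(rna)]


if __name__ == "__main__":
    # Made-up sequence containing one occurrence of the motif.
    print(find_slippery_sites("AUGGCUCCUUUUGACGG"))
```

    A real search would additionally require a stable downstream structure at a suitable spacer distance, since the heptanucleotide alone is not sufficient evidence of frameshifting.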

    Codon Size Reduction as the Origin of the Triplet Genetic Code

    The genetic code appears to be optimized in its robustness to missense errors and frameshift errors. In addition, the genetic code is near-optimal in terms of its ability to carry information in addition to the sequences of encoded proteins. As evolution has no foresight, optimality of the modern genetic code suggests that it evolved from less optimal code variants. The length of codons in the genetic code is also optimal, as three is the minimal nucleotide combination that can encode the twenty standard amino acids. The apparent impossibility of transitions between codon sizes in a discontinuous manner during evolution has resulted in an unbending view that the genetic code was always triplet. Yet, recent experimental evidence on quadruplet decoding, as well as the discovery of organisms with ambiguous and dual decoding, suggest that the possibility of the evolution of triplet decoding from living systems with non-triplet decoding merits reconsideration and further exploration. To explore this possibility we designed a mathematical model of the evolution of primitive digital coding systems which can decode nucleotide sequences into protein sequences. These coding systems can evolve their nucleotide sequences via genetic events of Darwinian evolution, such as point-mutations. The replication rates of such coding systems depend on the accuracy of the generated protein sequences. Computer simulations based on our model show that decoding systems with codons of length greater than three spontaneously evolve into predominantly triplet decoding systems. Our findings suggest a plausible scenario for the evolution of the triplet genetic code in a continuous manner. 
    This scenario suggests an explanation of how protein synthesis could be accomplished by means of long RNA-RNA interactions prior to the emergence of the complex decoding machinery, such as the ribosome, that is required for stabilization and discrimination of otherwise weak triplet codon-anticodon interactions

    Multiple sequence alignments of partially coding nucleic acid sequences

    BACKGROUND: High quality sequence alignments of RNA and DNA sequences are an important prerequisite for the comparative analysis of genomic sequence data. Nucleic acid sequences, however, exhibit a much larger sequence heterogeneity compared to their encoded protein sequences due to the redundancy of the genetic code. It is desirable, therefore, to make use of the amino acid sequence when aligning coding nucleic acid sequences. In many cases, however, only a part of the sequence of interest is translated. On the other hand, overlapping reading frames may encode multiple alternative proteins, possibly with intermittent non-coding parts. Examples are, in particular, RNA virus genomes. RESULTS: The standard scoring scheme for nucleic acid alignments can be extended to incorporate simultaneously information on translation products in one or more reading frames. Here we present a multiple alignment tool, codaln, that implements a combined nucleic acid plus amino acid scoring model for pairwise and progressive multiple alignments that allows arbitrary weighting for almost all scoring parameters. Resource requirements of codaln are comparable with those of standard tools such as ClustalW. CONCLUSION: We demonstrate the applicability of codaln to various biologically relevant types of sequences (bacteriophage Levivirus and Vertebrate Hox clusters) and show that the combination of nucleic acid and amino acid sequence information leads to improved alignments. These, in turn, increase the performance of analysis tools that depend strictly on good input alignments such as methods for detecting conserved RNA secondary structure elements

    Nutrition, HIV, and Drug Abuse: The Molecular Basis of a Unique Role for Selenium

    HIV-infected injection drug users (IDUs) often suffer from serious nutritional deficiencies. This is a concern because plasma levels of micronutrients such as vitamin B12, zinc, and selenium have been correlated with mortality risk in HIV-positive populations. Injection drug use also increases lipid peroxidation and other indicators of oxidative stress, which, combined with antioxidant deficiencies, can stimulate HIV-1 replication through activation of NF-κB transcription factors, while weakening immune defenses. As detailed herein, these prooxidant stimuli can also increase the pathogenic effects of HIV-1 by another mechanism, involving viral selenoproteins. Overlapping the envelope coding region, HIV-1 encodes a truncated glutathione peroxidase (GPx) gene (see #6 in reference list). Sequence analysis and molecular modeling show that this viral GPx (vGPx) module has highly significant structural similarity to known mammalian GPx, with conservation of the catalytic triad of selenocysteine (Sec), glutamine, and tryptophan. In addition to other functions, HIV-1 vGPx may serve as a negative regulator of proviral transcription, by acting as an NF-κB inhibitor (a known property of cellular GPx). Another potential selenoprotein coding function of HIV-1 is associated with the 3' end of the nef gene, which terminates in a conserved UGA (potential Sec) codon in the context of a sequence (Cys-Sec) identical to the C-terminal redox center of thioredoxin reductase, another cellular regulator of NF-κB. Thus, in combination with known cellular mechanisms involving Se, viral selenoproteins may represent a unique mechanism by which HIV-1 monitors and exploits an essential micronutrient to optimize its replication relative to the host