6,186 research outputs found

    Multiple sequence alignments of partially coding nucleic acid sequences

    BACKGROUND: High-quality sequence alignments of RNA and DNA sequences are an important prerequisite for the comparative analysis of genomic sequence data. Nucleic acid sequences, however, exhibit much greater sequence heterogeneity than their encoded protein sequences because of the redundancy of the genetic code. It is desirable, therefore, to make use of the amino acid sequence when aligning coding nucleic acid sequences. In many cases, however, only a part of the sequence of interest is translated. On the other hand, overlapping reading frames may encode multiple alternative proteins, possibly with intermittent non-coding parts. Examples are, in particular, RNA virus genomes. RESULTS: The standard scoring scheme for nucleic acid alignments can be extended to incorporate simultaneously information on translation products in one or more reading frames. Here we present a multiple alignment tool, codaln, that implements a combined nucleic acid plus amino acid scoring model for pairwise and progressive multiple alignments and allows arbitrary weighting for almost all scoring parameters. Resource requirements of codaln are comparable with those of standard tools such as ClustalW. CONCLUSION: We demonstrate the applicability of codaln to various biologically relevant types of sequences (bacteriophage Levivirus and Vertebrate Hox clusters) and show that the combination of nucleic acid and amino acid sequence information leads to improved alignments. These, in turn, increase the performance of analysis tools that depend strictly on good input alignments, such as methods for detecting conserved RNA secondary structure elements.

    A two-phase approach for detecting recombination in nucleotide sequences

    Genetic recombination can produce heterogeneous phylogenetic histories within a set of homologous genes. Delineating recombination events is important in the study of molecular evolution, as inference of such events provides a clearer picture of the phylogenetic relationships among different gene sequences or genomes. Nevertheless, detecting recombination events can be a daunting task, as the performance of different recombination-detecting approaches can vary, depending on evolutionary events that take place after recombination. We recently evaluated the effects of post-recombination events on the prediction accuracy of recombination-detecting approaches using simulated nucleotide sequence data. The main conclusion, supported by other studies, is that one should not depend on a single method when searching for recombination events. In this paper, we introduce a two-phase strategy, applying three statistical measures to detect the occurrence of recombination events, and a Bayesian phylogenetic approach to delineate the breakpoints of such events in nucleotide sequences. We evaluate the performance of these approaches using simulated data, and demonstrate the applicability of this strategy to empirical data. The two-phase strategy proves to be time-efficient when applied to large datasets, and yields high-confidence results. Comment: 5 pages, 3 figures. Chan CX, Beiko RG and Ragan MA (2007). A two-phase approach for detecting recombination in nucleotide sequences. In Hazelhurst S and Ramsay M (Eds) Proceedings of the First Southern African Bioinformatics Workshop, 28-30 January, Johannesburg, 9-1

    Detecting overlapping coding sequences in virus genomes

    BACKGROUND: Detecting new coding sequences (CDSs) in viral genomes can be difficult for several reasons. The typically compact genomes often contain a number of overlapping coding and non-coding functional elements, which can result in unusual patterns of codon usage; conservation between related sequences can be difficult to interpret – especially within overlapping genes; and viruses often employ non-canonical translational mechanisms – e.g. frameshifting, stop codon read-through, leaky-scanning and internal ribosome entry sites – which can conceal potentially coding open reading frames (ORFs). RESULTS: In a previous paper we introduced a new statistic – MLOGD (Maximum Likelihood Overlapping Gene Detector) – for detecting and analysing overlapping CDSs. Here we present (a) an improved MLOGD statistic, (b) a greatly extended suite of software using MLOGD, (c) a database of results for 640 virus sequence alignments, and (d) a web-interface to the software and database. Tests show that, from an alignment with just 20 mutations, MLOGD can discriminate non-overlapping CDSs from non-coding ORFs with a typical accuracy of up to 98%, and can detect CDSs overlapping known CDSs with a typical accuracy of 90%. In addition, the software produces a variety of statistics and graphics, useful for analysing an input multiple sequence alignment. CONCLUSION: MLOGD is an easy-to-use tool for virus genome annotation, detecting new CDSs – in particular overlapping or short CDSs – and for analysing overlapping CDSs following frameshift sites. The software, web-server, database and supplementary material are available at

    Intragenic homogenization and multiple copies of prey-wrapping silk genes in Argiope garden spiders.

    Background: Spider silks are spectacular examples of phenotypic diversity arising from adaptive molecular evolution. An individual spider can produce an array of specialized silks, with the majority of constituent silk proteins encoded by members of the spidroin gene family. Spidroins are dominated by tandem repeats flanked by short, non-repetitive N- and C-terminal coding regions. The remarkable mechanical properties of spider silks have been largely attributed to the repeat sequences. However, the molecular evolutionary processes acting on spidroin terminal and repetitive regions remain unclear due to a paucity of complete gene sequences and sampling of genetic variation among individuals. To better understand spider silk evolution, we characterize a complete aciniform spidroin gene from an Argiope orb-weaving spider and survey aciniform gene fragments from congeneric individuals. Results: We present the complete aciniform spidroin (AcSp1) gene from the silver garden spider Argiope argentata (Aar_AcSp1), and document multiple AcSp1 loci in individual genomes of A. argentata and the congeneric A. trifasciata and A. aurantia. We find that Aar_AcSp1 repeats have >98% pairwise nucleotide identity. By comparing AcSp1 repeat amino acid sequences between Argiope species and with other genera, we identify regions of conservation over vast amounts of evolutionary time. Through a PCR survey of individual A. argentata, A. trifasciata, and A. aurantia genomes, we ascertain that AcSp1 repeats show limited variation between species whereas terminal regions are more divergent. We also find that average dN/dS across codons in the N-terminal, repetitive, and C-terminal encoding regions indicates purifying selection that is strongest in the N-terminal region. Conclusions: Using the complete A. argentata AcSp1 gene and spidroin genetic variation between individuals, this study clarifies some of the molecular evolutionary processes underlying the spectacular mechanical attributes of aciniform silk. It is likely that intragenic concerted evolution and functional constraints on A. argentata AcSp1 repeats result in extreme repeat homogeneity. The maintenance of multiple AcSp1 encoding loci in Argiope genomes supports the hypothesis that Argiope spiders require rapid and efficient protein production to support their prolific use of aciniform silk for prey-wrapping and web-decorating. In addition, multiple gene copies may represent the early stages of spidroin diversification.

    Nanopore direct RNA sequencing maps the complexity of Arabidopsis mRNA processing and m6A modification

    Understanding genome organization and gene regulation requires insight into RNA transcription, processing and modification. We adapted nanopore direct RNA sequencing to examine RNA from a wild-type accession of the model plant Arabidopsis thaliana and a mutant defective in mRNA methylation (m6A). Here we show that m6A can be mapped in full-length mRNAs transcriptome-wide and reveal the combinatorial diversity of cap-associated transcription start sites, splicing events, poly(A) site choice and poly(A) tail length. Loss of m6A from 3′ untranslated regions is associated with decreased relative transcript abundance and defective RNA 3′ end formation. A functional consequence of disrupted m6A is a lengthening of the circadian period. We conclude that nanopore direct RNA sequencing can reveal the complexity of mRNA processing and modification in full-length single molecule reads. These findings can refine Arabidopsis genome annotation. Further, applying this approach to less well-studied species could transform our understanding of what their genomes encode.

    Bioinformatic analysis suggests that the Orbivirus VP6 cistron encodes an overlapping gene

    Background: The genus Orbivirus includes several species that infect livestock – including Bluetongue virus (BTV) and African horse sickness virus (AHSV). These viruses have linear dsRNA genomes divided into ten segments, all of which have previously been assumed to be monocistronic. Results: Bioinformatic evidence is presented for a short overlapping coding sequence (CDS) in the Orbivirus genome segment 9, overlapping the VP6 cistron in the +1 reading frame. In BTV, a 77–79 codon AUG-initiated open reading frame (hereafter ORFX) is present in all 48 segment 9 sequences analysed. The pattern of base variations across the 48-sequence alignment indicates that ORFX is subject to functional constraints at the amino acid level (even when the constraints due to coding in the overlapping VP6 reading frame are taken into account; MLOGD software). In fact the translated ORFX shows greater amino acid conservation than the overlapping region of VP6. The ORFX AUG codon has a strong Kozak context in all 48 sequences. Each has only one or two upstream AUG codons, always in the VP6 reading frame, and (with a single exception) always with weak or medium Kozak context. Thus, in BTV, ORFX may be translated via leaky scanning. A long (83–169 codon) ORF is present in a corresponding location and reading frame in all other Orbivirus species analysed except Saint Croix River virus (SCRV; the most divergent). Again, the pattern of base variations across sequence alignments indicates multiple coding in the VP6 and ORFX reading frames. Conclusion: At ~9.5 kDa, the putative ORFX product in BTV is too small to appear on most published protein gels. Nonetheless, a review of past literature reveals a number of possible detections. We hope that presentation of this bioinformatic analysis will stimulate an attempt to experimentally verify the expression and functional role of ORFX, and hence lead to a greater understanding of the molecular biology of these important pathogens.