
    QSRA – a quality-value guided de novo short read assembler

    Background: New rapid high-throughput sequencing technologies have sparked the creation of a new class of assembler. Since all high-throughput sequencing platforms incorporate errors in their output, short-read assemblers must be designed to account for this error while utilizing all available data. Results: We have designed and implemented an assembler, Quality-value guided Short Read Assembler, created to take advantage of quality-value scores as a further method of dealing with error. Compared to previous published algorithms, our assembler shows significant improvements not only in speed but also in output quality. Conclusion: QSRA generally produced the highest genomic coverage, while being faster than VCAKE. QSRA is extremely competitive in its longest contig and N50/N80 contig lengths, producing results of similar quality to those of EDENA and VELVET. QSRA provides a step closer to the goal of de novo assembly of complex genomes, improving upon the original VCAKE algorithm by not only drastically reducing runtimes but also increasing the viability of the assembly algorithm through further error handling capabilities.

    Evaluation of Methods for De Novo Genome Assembly from High-Throughput Sequencing Reads Reveals Dependencies That Affect the Quality of the Results

    Recent developments in high-throughput sequencing technology have made low-cost sequencing an attractive approach for many genome analysis tasks. Increasing read lengths, improving quality and the production of increasingly larger numbers of usable sequences per instrument-run continue to make whole-genome assembly an appealing target application. In this paper we evaluate the feasibility of de novo genome assembly from short reads (≤100 nucleotides) through a detailed study involving genomic sequences of various lengths and origin, in conjunction with several of the currently popular assembly programs. Our extensive analysis demonstrates that, in addition to sequencing coverage, attributes such as the architecture of the target genome, the identity of the used assembly program, the average read length and the observed sequencing error rates are powerful variables that affect the best achievable assembly of the target sequence in terms of size and correctness.

    Transcriptome characterization and polymorphism detection between subspecies of big sagebrush (Artemisia tridentata)

    Background: Big sagebrush (Artemisia tridentata) is one of the most widely distributed and ecologically important shrub species in western North America. This species serves as a critical habitat and food resource for many animals and invertebrates. Habitat loss due to a combination of disturbances followed by establishment of invasive plant species is a serious threat to big sagebrush ecosystem sustainability. Lack of genomic data has limited our understanding of the evolutionary history and ecological adaptation in this species. Here, we report on the sequencing of expressed sequence tags (ESTs) and detection of single nucleotide polymorphism (SNP) and simple sequence repeat (SSR) markers in subspecies of big sagebrush. Results: cDNA of A. tridentata sspp. tridentata and vaseyana were normalized and sequenced using the 454 GS FLX Titanium pyrosequencing technology. Assembly of the reads resulted in 20,357 contig consensus sequences in ssp. tridentata and 20,250 contigs in ssp. vaseyana. A BLASTx search against the non-redundant (NR) protein database using 29,541 consensus sequences obtained from a combined assembly resulted in 21,436 sequences with significant blast alignments (≤ 1e-15). A total of 20,952 SNPs and 119 polymorphic SSRs were detected between the two subspecies. SNPs were validated through various methods including sequence capture. Validation of SNPs in different individuals uncovered a high level of nucleotide variation in EST sequences. EST sequences of a third, tetraploid subspecies (ssp. wyomingensis) obtained by Illumina sequencing were mapped to the consensus sequences of the combined 454 EST assembly. Approximately one-third of the SNPs between sspp. tridentata and vaseyana identified in the combined assembly were also polymorphic within the two geographically distant ssp. wyomingensis samples. Conclusion: We have produced a large EST dataset for Artemisia tridentata, which contains a large sample of the big sagebrush leaf transcriptome. SNP mapping among the three subspecies suggests the origin of ssp. wyomingensis via mixed ancestry. A large number of SNP and SSR markers provide the foundation for future research to address questions in big sagebrush evolution, ecological genetics, and conservation using genomic approaches.

    Why barcode? High-throughput multiplex sequencing of mitochondrial genomes for molecular systematics

    Mitochondrial genome sequences are important markers for phylogenetics but taxon sampling remains sporadic because of the great effort and cost required to acquire full-length sequences. Here, we demonstrate a simple, cost-effective way to sequence the full complement of protein coding mitochondrial genes from pooled samples using the 454/Roche platform. Multiplexing was achieved without the need for expensive indexing tags (‘barcodes’). The method was trialled with a set of long-range polymerase chain reaction (PCR) fragments from 30 species of Coleoptera (beetles) sequenced in a 1/16th sector of a sequencing plate. Long contigs were produced from the pooled sequences with sequencing depths ranging from ∼10 to 100× per contig. Species identity of individual contigs was established via three ‘bait’ sequences matching disparate parts of the mitochondrial genome obtained by conventional PCR and Sanger sequencing. This proved that assembly of contigs from the sequencing pool was correct. Our study produced sequences for 21 nearly complete and seven partial sets of protein coding mitochondrial genes. Combined with existing sequences for 25 taxa, an improved estimate of basal relationships in Coleoptera was obtained. The procedure could be employed routinely for mitochondrial genome sequencing at the species level, to provide improved species ‘barcodes’ that currently use the cox1 gene only

    Adventures in the Enormous: A 1.8 Million Clone BAC Library for the 21.7 Gb Genome of Loblolly Pine

    Loblolly pine (LP; Pinus taeda L.) is the most economically important tree in the U.S. and a cornerstone species in southeastern forests. However, genomics research on LP and other conifers has lagged behind studies on flowering plants due, in part, to the large size of conifer genomes. As a means to accelerate conifer genome research, we constructed a BAC library for the LP genotype 7-56. The LP BAC library consists of 1,824,768 individually-archived clones making it the largest single BAC library constructed to date, has a mean insert size of 96 kb, and affords 7.6X coverage of the 21.7 Gb LP genome. To demonstrate the efficacy of the library in gene isolation, we screened macroarrays with overgos designed from a pine EST anchored on LP chromosome 10. A positive BAC was sequenced and found to contain the expected full-length target gene, several gene-like regions, and both known and novel repeats. Macroarray analysis using the retrotransposon IFG-7 (the most abundant repeat in the sequenced BAC) as a probe indicates that IFG-7 is found in roughly 210,557 copies and constitutes about 5.8% or 1.26 Gb of LP nuclear DNA; this DNA quantity is eight times the Arabidopsis genome. In addition to its use in genome characterization and gene isolation as demonstrated herein, the BAC library should hasten whole genome sequencing of LP via next-generation sequencing strategies/technologies and facilitate improvement of trees through molecular breeding and genetic engineering. The library and associated products are distributed by the Clemson University Genomics Institute (www.genome.clemson.edu)

    Progenitor-Derivative Relationships of Hordeum Polyploids (Poaceae, Triticeae) Inferred from Sequences of TOPO6, a Nuclear Low-Copy Gene Region

    Polyploidization is a major mechanism of speciation in plants. Within the barley genus Hordeum, approximately half of the taxa are polyploids. While for diploid species a good hypothesis of phylogenetic relationships exists, there is little information available for the polyploids (4×, 6×) of Hordeum. Relationships among all 33 diploid and polyploid Hordeum species were analyzed with the low-copy nuclear marker region TOPO6 for 341 Hordeum individuals and eight outgroup species. PCR products were either directly sequenced or cloned and on average 12 clones per individual were included in phylogenetic analyses. In most diploid Hordeum species TOPO6 is probably a single-copy locus. Most sequences found in polyploid individuals phylogenetically cluster together with sequences derived from diploid species and thus allow the identification of parental taxa of polyploids. Four groups of sequences occurring only in polyploid taxa are interpreted as footprints of extinct diploid taxa, which contributed to allopolyploid evolution. Our analysis identifies three key species involved in the evolution of the American polyploids of the genus. (i) All but one of the American tetraploids have a TOPO6 copy originating from the Central Asian diploid H. roshevitzii, the second copy clustering with different American diploid species. (ii) All hexaploid species from the New World have a copy of an extinct close relative of H. californicum and (iii) possess the TOPO6 sequence pattern of tetraploid H. jubatum, each with an additional copy derived from different American diploids. Tetraploid H. bulbosum is an autopolyploid, while the assumed autopolyploid H. brevisubulatum (4×, 6×) was identified as allopolyploid throughout most of its distribution area. The use of a proof-reading DNA polymerase in PCR reduced the proportion of chimerical sequences in polyploids in comparison to Taq polymerase

    Geogenic and atmospheric sources for volatile organic compounds in fumarolic emissions from Mt. Etna and Vulcano Island (Sicily, Italy)

    In this paper, fluid source(s) and processes controlling the chemical composition of volatile organic compounds (VOCs) in gas discharges from Mt. Etna and Vulcano Island (Sicily, Italy) were investigated. The main composition of the Etnean and Vulcano gas emissions is produced by mixing, to various degrees, of magmatic and hydrothermal components. VOCs are dominated by alkanes, alkenes and aromatics, with minor, though significant, concentrations of O-, S- and Cl(F)-substituted compounds. The main mechanism for the production of alkanes is likely related to pyrolysis of organic-matter-bearing sediments that interact with the ascending magmatic fluids. Alkanes are then converted to alkene and aromatic compounds via catalytic reactions (dehydrogenation and dehydroaromatization, respectively). Nevertheless, an abiogenic origin for the light hydrocarbons cannot be ruled out. Oxidative processes of hydrocarbons at relatively high temperatures and oxidizing conditions, typical of these volcanic-hydrothermal fluids, may explain the production of alcohols, esters, aldehydes, as well as O- and S-bearing heterocycles. By comparing the concentrations of hydrochlorofluorocarbons (HCFCs) in the fumarolic discharges with respect to those of background air, it is possible to highlight that they have a geogenic origin likely due to halogenation of both methane and alkenes. Finally, chlorofluorocarbon (CFC) abundances appear to be consistent with background air, although the strong air contamination that affects the Mt. Etna fumaroles may mask a possible geogenic contribution for these compounds. On the other hand, no CFCs were detected in the Vulcano gases, which are characterized by low air contribution. Nevertheless, a geogenic source for these compounds cannot be excluded on the basis of the present data.

    Deep Sequencing of the Nicastrin Gene in Pooled DNA, the Identification of Genetic Variants That Affect Risk of Alzheimer's Disease

    Nicastrin is an obligatory component of the γ-secretase, the enzyme complex that leads to the production of Aβ fragments critically central to the pathogenesis of Alzheimer's disease (AD). Analyses of the effects of common variation in this gene on risk for late onset AD have been inconclusive. We investigated the effect of rare variation in the coding regions of the Nicastrin gene in a cohort of AD patients and matched controls using an innovative pooling approach and next generation sequencing. Five SNPs were identified and validated by individual genotyping from 311 cases and 360 controls. Association analysis identified a non-synonymous rare SNP (N417Y) with a statistically higher frequency in cases compared to controls in the Greek population (OR 3.994, CI 1.105–14.439, p = 0.035). This finding warrants further investigation in a larger cohort and adds weight to the hypothesis that rare variation explains some of the genetic heritability still to be identified in Alzheimer's disease.