
    The Sequence Read Archive

    The combination of significantly lower cost and increased speed of sequencing has resulted in an explosive growth of data submitted into the primary next-generation sequence data archive, the Sequence Read Archive (SRA). The preservation of experimental data is an important part of the scientific record, and increasing numbers of journals and funding agencies require that next-generation sequence data are deposited into the SRA. The SRA was established as a public repository for next-generation sequence data and is operated by the International Nucleotide Sequence Database Collaboration (INSDC). INSDC partners include the National Center for Biotechnology Information (NCBI), the European Bioinformatics Institute (EBI) and the DNA Data Bank of Japan (DDBJ). The SRA is accessible at http://www.ncbi.nlm.nih.gov/Traces/sra from NCBI, at http://www.ebi.ac.uk/ena from EBI and at http://trace.ddbj.nig.ac.jp from DDBJ. In this article, we present the content and structure of the SRA, detail our support for sequencing platforms and provide recommended data submission levels and formats. We also briefly outline our response to the challenge of data growth.

    Global analysis of mutations driving microevolution of a heterozygous diploid fungal pathogen

    Data deposition: The sequence reported in this paper has been deposited in the NCBI Sequence Read Archive, https://www.ncbi.nlm.nih.gov/bioproject (BioProject ID PRJNA345600). This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1806002115/-/DCSupplemental. Peer reviewed.

    ENA as an Information Hub

    The European Nucleotide Archive (ENA; http://www.ebi.ac.uk/ena/) is a comprehensive repository for public nucleotide sequence data from nearly four hundred thousand taxonomic nodes. Together with partners in the International Nucleotide Sequence Database Collaboration (INSDC; EBI, NCBI and DDBJ) we provide a broad spectrum of sequences, from raw reads (Sequence Read Archive data class), assembled contigs (Whole Genome Shotgun data class), assemblies of EST transcripts (Transcriptome Shotgun Assembly data set), to partial or complete assembled nucleic acid molecules with functional annotation derived from direct and third party experimental evidence (Standard and TPA data classes, respectively). Resources beyond ENA, such as RNA and protein databases, genome collections and model organism services, use data stored and presented at ENA as both source and underlying supporting evidence for their records. Integration of the growing wealth of molecular information is a great challenge that brings opportunities for ENA to serve as a bioinformatics data information hub, allowing, through its provision of permanent identifiers for sequence and project records, community-recognized identifiers for navigation across databases.

    As a comprehensive repository of directly sequenced nucleic acid molecules we have the unique opportunity to obtain exact provenance information directly from the submitting researchers. Our pre-publication biocuration efforts are focused on obtaining rich and accurate information on the sample that has been sequenced and on the methodology surrounding its preparation for sequencing. We present here an insight into data flow in the archive and a straightforward biologist-orientated submission system with a rule-based validator for smaller sets of sequences.

    The European Nucleotide Archive

    The European Nucleotide Archive (ENA; http://www.ebi.ac.uk/ena) is Europe’s primary nucleotide-sequence repository. The ENA consists of three main databases: the Sequence Read Archive (SRA), the Trace Archive and EMBL-Bank. The objective of ENA is to support and promote the use of nucleotide sequencing as an experimental research platform by providing data submission, archive, search and download services. In this article, we outline these services and describe major changes and improvements introduced during 2010. These include extended EMBL-Bank and SRA-data submission services, extended ENA Browser functionality, support for submitting data to the European Genome-phenome Archive (EGA) through SRA, and the launch of a new sequence similarity search service.

    Investigation into the annotation of protocol sequencing steps in the sequence read archive

    BACKGROUND: The workflow for the production of high-throughput sequencing data from nucleic acid samples is complex. A series of protocol steps must be followed in the preparation of samples for next-generation sequencing. The quantification of bias in a number of protocol steps, namely DNA fractionation, blunting, phosphorylation, adapter ligation and library enrichment, remains to be determined. RESULTS: We examined the experimental metadata of the public repository Sequence Read Archive (SRA) in order to ascertain the level of annotation of important sequencing steps in submissions to the database. Using SQL relational database queries (on the SRAdb SQLite database generated by the Bioconductor consortium) to search for keywords commonly occurring in key preparatory protocol steps partitioned over studies, we found that 7.10%, 5.84% and 7.57% of all records (fragmentation, ligation and enrichment, respectively) had at least one keyword corresponding to one of the three protocol steps. Only 4.06% of all records, partitioned over studies, had keywords for all three steps in the protocol (5.58% of all SRA records). CONCLUSIONS: The current level of annotation in the SRA inhibits systematic studies of bias due to these protocol steps. Downstream from this, meta-analyses and comparative studies based on these data will have a source of bias that cannot be quantified at present.
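
The keyword search described in this abstract can be illustrated with a small sketch. It is not the authors' code: the table layout, the column name `protocol`, the keyword lists and the sample rows are all assumptions made for illustration, and the real SRAdb SQLite schema is larger and may differ. The sketch counts, over all records, the share whose protocol text mentions each step and the share that mentions all three; it does not attempt the per-study partitioning mentioned in the abstract.

```python
import sqlite3

# Hypothetical rows standing in for experiment metadata (invented data).
ROWS = [
    ("SRP000001", "SRX000001",
     "DNA was fragmented by sonication, adapters ligated, PCR enrichment"),
    ("SRP000001", "SRX000002", "standard protocol"),
    ("SRP000002", "SRX000003", "Nebulization followed by adapter ligation"),
    ("SRP000003", "SRX000004", None),  # record with no protocol text
]

# Illustrative keyword stems per protocol step (assumed, not the paper's lists).
KEYWORDS = {
    "fragmentation": ["fragment", "sonicat", "nebuli", "shear"],
    "ligation": ["ligat"],
    "enrichment": ["enrich", "amplif"],
}


def build_demo_db():
    """Create an in-memory database with a simplified, hypothetical table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE experiment ("
        "study_accession TEXT, experiment_accession TEXT, protocol TEXT)"
    )
    conn.executemany("INSERT INTO experiment VALUES (?, ?, ?)", ROWS)
    return conn


def step_condition(words):
    """SQL fragment that is true if any keyword stem occurs in the text."""
    return "(" + " OR ".join("protocol LIKE ?" for _ in words) + ")"


def like_params(words):
    """Wrap each stem in wildcards for a substring LIKE match."""
    return [f"%{w}%" for w in words]


def annotation_rates(conn):
    """Return the percentage of records matching each step and all steps."""
    total = conn.execute("SELECT COUNT(*) FROM experiment").fetchone()[0]
    rates = {}
    for step, words in KEYWORDS.items():
        sql = f"SELECT COUNT(*) FROM experiment WHERE {step_condition(words)}"
        hits = conn.execute(sql, like_params(words)).fetchone()[0]
        rates[step] = 100.0 * hits / total
    # Records that mention every step: AND the per-step conditions together.
    all_sql = "SELECT COUNT(*) FROM experiment WHERE " + " AND ".join(
        step_condition(words) for words in KEYWORDS.values()
    )
    all_params = [p for words in KEYWORDS.values() for p in like_params(words)]
    hits = conn.execute(all_sql, all_params).fetchone()[0]
    rates["all_three"] = 100.0 * hits / total
    return rates


if __name__ == "__main__":
    for step, pct in annotation_rates(build_demo_db()).items():
        print(f"{step}: {pct:.2f}% of records")
```

SQLite's `LIKE` is case-insensitive for ASCII text by default, so "Nebulization" matches the stem `nebuli`. Records with a missing protocol field never match and therefore count as unannotated.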

    Tidying up international nucleotide sequence databases

    Sequence analysis of the ribosomal RNA operon, particularly the internal transcribed spacer (ITS) region, provides a powerful tool for identification of mycorrhizal fungi. The sequence data deposited in the International Nucleotide Sequence Databases (INSD) are, however, unfiltered for quality and are often poorly annotated with metadata. To detect chimeric and low-quality sequences and assign the ectomycorrhizal fungi to phylogenetic lineages, fungal ITS sequences were downloaded from INSD, aligned within family-level groups, and examined through phylogenetic analyses and BLAST searches. By combining the fungal sequence database UNITE and the annotation and search tool PlutoF, we also added metadata from the literature to these accessions. Altogether 35,632 sequences belonged to mycorrhizal fungi or originated from ericoid and orchid mycorrhizal roots. Of these sequences, 677 were considered chimeric and 2,174 of low read quality. Information detailing country of collection, geographical coordinates, interacting taxon and isolation source was supplemented to cover 78.0%, 33.0%, 41.7% and 96.4% of the sequences, respectively. These annotated sequences are publicly available via UNITE (http://unite.ut.ee/) for downstream biogeographic, ecological and taxonomic analyses. In the European Nucleotide Archive (ENA; http://www.ebi.ac.uk/ena/), the annotated sequences have a special link-out to UNITE. We intend to expand the data annotation to additional genes and all taxonomic groups and functional guilds of fungi.

    Dataset of the transcribed 45S ribosomal RNA sequence of the tree crop yerba mate

    This contribution contains data related to the research article entitled "The 18S-25S ribosomal RNA unit of yerba mate (Ilex paraguariensis A. St.-Hil.)" (Aguilera et al., 2016). Through a bioinformatic approach involving NGS data, we provide information on the transcribed 45S ribosomal RNA (rRNA) sequence of yerba mate, the first reference for the Ilex L. genus. This dataset comprises information regarding the assembly and annotation of this rRNA unit. The generated data is applicable for comparative analysis and evolutionary studies among Ilex and related taxa. The raw sequencing data used here is available at DDBJ/EMBL/GenBank (NCBI Resource Coordinators, 2016) Sequence Read Archive (SRA) under the accession SRP043293 and the consensus 45S ribosomal RNA sequence has been deposited there under the accession GFHV00000000.
    Affiliation: Aguilera, Patricia Mabel. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Nordeste. Instituto de Biología Subtropical. Universidad Nacional de Misiones. Instituto de Biología Subtropical; Argentina
    Affiliation: Debat, Humberto Julio. Instituto Nacional de Tecnología Agropecuaria; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas; Argentina
    Affiliation: Grabiele, Mauro. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Nordeste. Instituto de Biología Subtropical. Instituto de Biología Subtropical - Nodo Posadas | Universidad Nacional de Misiones. Instituto de Biología Subtropical. Instituto de Biología Subtropical - Nodo Posadas; Argentin

    A Hybrid Sequencing Approach Completes the Genome Sequence of Thermoanaerobacter ethanolicus JW 200

    This is the final version. Available on open access from American Society for Microbiology via the DOI in this record. Data availability: The complete genome sequence of T. ethanolicus JW 200 is deposited in GenBank under the accession number CP033580. Illumina and Oxford Nanopore DNA sequence reads have been deposited in the NCBI Sequence Read Archive (accession numbers SRR8113455 and SRR8113456). Thermoanaerobacter ethanolicus JW 200 has been identified as a potential sustainable biofuel producer due to its ability to readily ferment carbohydrates to ethanol. A hybrid sequencing approach, combining Oxford Nanopore and Illumina DNA sequence reads, was applied to produce a single contiguous genome sequence of 2,911,280 bp. Shell Research Ltd

    Whole genome resequencing of a laboratory-adapted Drosophila melanogaster population sample

    As part of a study into the molecular genetics of sexually dimorphic complex traits, we used high-throughput sequencing to obtain data on genomic variation in an outbred laboratory-adapted fruit fly (Drosophila melanogaster) population. We successfully resequenced the whole genome of 220 hemiclonal females that were heterozygous for the same Berkeley reference line genome (BDGP6/dm6) and a unique haplotype from the outbred base population (LHM). The use of a static and known genetic background enabled us to obtain sequences from whole-genome phased haplotypes. We used a BWA-Picard-GATK pipeline for mapping sequence reads to the dm6 reference genome assembly, at a median depth of coverage of 31X, and have made the resulting data publicly available in the NCBI Short Read Archive (accession number SRP058502). We used Haplotype Caller to discover and genotype 1,726,931 small genomic variants (SNPs and indels, <200 bp). Additionally, we detected and genotyped 167 large structural variants (1-100 kb in size) using GenomeStrip/2.0. Sequence and genotype data are publicly available at the corresponding NCBI databases: Short Read Archive, dbSNP and dbVar (BioProject PRJNA282591). We have also released the unfiltered genotype data, and the code and logs for data processing and summary statistics.