473,498 research outputs found
Tidying up international nucleotide sequence databases
Sequence analysis of the ribosomal RNA operon, particularly the internal transcribed spacer (ITS) region, provides a powerful tool for identification of mycorrhizal fungi. The sequence data deposited in the International Nucleotide Sequence Databases (INSD) are, however, unfiltered for quality and are often poorly annotated with metadata. To detect chimeric and low-quality sequences and assign the ectomycorrhizal fungi to phylogenetic lineages, fungal ITS sequences were downloaded from INSD, aligned within family-level groups, and examined through phylogenetic analyses and BLAST searches. By combining the fungal sequence database UNITE and the annotation and search tool PlutoF, we also added metadata from the literature to these accessions. Altogether 35,632 sequences belonged to mycorrhizal fungi or originated from ericoid and orchid mycorrhizal roots. Of these sequences, 677 were considered chimeric and 2,174 of low read quality. Information detailing country of collection, geographical coordinates, interacting taxon and isolation source were supplemented to cover 78.0%, 33.0%, 41.7% and 96.4% of the sequences, respectively. These annotated sequences are publicly available via UNITE (http://unite.ut.ee/) for downstream biogeographic, ecological and taxonomic analyses. In European Nucleotide Archive (ENA; http://www.ebi.ac.uk/ena/), the annotated sequences have a special link-out to UNITE. We intend to expand the data annotation to additional genes and all taxonomic groups and functional guilds of fungi
Compressing DNA sequence databases with coil
Background: Publicly available DNA sequence databases such as GenBank are large, and are
growing at an exponential rate. The sheer volume of data being dealt with presents serious storage
and data communications problems. Currently, sequence data is usually kept in large "flat files,"
which are then compressed using standard Lempel-Ziv (gzip) compression – an approach which
rarely achieves good compression ratios. While much research has been done on compressing
individual DNA sequences, surprisingly little has focused on the compression of entire databases
of such sequences. In this study we introduce the sequence database compression software coil.
Results: We have designed and implemented a portable software package, coil, for compressing
and decompressing DNA sequence databases based on the idea of edit-tree coding. coil is geared
towards achieving high compression ratios at the expense of execution time and memory usage
during compression – the compression time represents a "one-off investment" whose cost is
quickly amortised if the resulting compressed file is transmitted many times. Decompression
requires little memory and is extremely fast. We demonstrate a 5% improvement in compression
ratio over state-of-the-art general-purpose compression tools for a large GenBank database file
containing Expressed Sequence Tag (EST) data. Finally, coil can efficiently encode incremental
additions to a sequence database.
Conclusion: coil presents a compelling alternative to conventional compression of flat files for the
storage and distribution of DNA sequence databases having a narrow distribution of sequence
lengths, such as EST data. Increasing compression levels for databases having a wide distribution of
sequence lengths is a direction for future work
Closing the circle : current state and perspectives of circular RNA databases
Circular RNAs (circRNAs) are covalently closed RNA molecules that have been linked to various diseases, including cancer. However, a precise function and working mechanism are lacking for the larger majority. Following many different experimental and computational approaches to identify circRNAs, multiple circRNA databases were developed as well. Unfortunately, there are several major issues with the current circRNA databases, which substantially hamper progression in the field. First, as the overlap in content is limited, a true reference set of circRNAs is lacking. This results from the low abundance and highly specific expression of circRNAs, and varying sequencing methods, data-analysis pipelines, and circRNA detection tools. A second major issue is the use of ambiguous nomenclature. Thus, redundant or even conflicting names for circRNAs across different databases contribute to the reproducibility crisis. Third, circRNA databases, in essence, rely on the position of the circRNA back-splice junction, whereas alternative splicing could result in circRNAs with different length and sequence. To uniquely identify a circRNA molecule, the full circular sequence is required. Fourth, circRNA databases annotate circRNAs' microRNA binding and protein-coding potential, but these annotations are generally based on presumed circRNA sequences. Finally, several databases are not regularly updated, contain incomplete data or suffer from connectivity issues. In this review, we present a comprehensive overview of the current circRNA databases and their content, features, and usability. In addition to discussing the current issues regarding circRNA databases, we come with important suggestions to streamline further research in this growing field
SIG-DB: leveraging homomorphic encryption to Securely Interrogate privately held Genomic DataBases
Genomic data are becoming increasingly valuable as we develop methods to
utilize the information at scale and gain a greater understanding of how
genetic information relates to biological function. Advances in synthetic
biology and the decreased cost of sequencing are increasing the amount of
privately held genomic data. As the quantity and value of private genomic data
grows, so does the incentive to acquire and protect such data, which creates a
need to store and process these data securely. We present an algorithm for the
Secure Interrogation of Genomic DataBases (SIG-DB). The SIG-DB algorithm
enables databases of genomic sequences to be searched with an encrypted query
sequence without revealing the query sequence to the Database Owner or any of
the database sequences to the Querier. SIG-DB is the first application of its
kind to take advantage of locality-sensitive hashing and homomorphic encryption
to allow generalized sequence-to-sequence comparisons of genomic data.Comment: 38 pages, 3 figures, 4 tables, 1 supplemental table, 7 supplemental
figure
More Mouldy Data: Another mycoplasma gene jumps the silicon barrier into the human genome
The human genome sequence database contains DNA sequences very like those of
mycoplasma molds. It appears such moulds infect not only molecular Biology
laboratories but were picked up by experimenters from contaminated samples and
inserted into GenBank as if they were human. At least one mouldy EST (Expressed
Sequence Tag) has transferred from public databases to commercial tools
(Affymetrix HG-U133 plus 2.0 microarrays). We report a second example
(DA466599) and suggest there is a need to clean up genomic databases but fear
current tools will be inadequate to catch genes which have jumped the silicon
barrier.Comment: data directory contains results of AF241217 and DA466599 blast runs
by EBI in Cambridg
Where differences resemble: sequence-feature analysis in curated databases of intrinsically disordered proteins
Searching the World-Wide-Web using nucleotide and peptide sequences
*Background:* No approaches have yet been developed to allow instant searching of the World-Wide-Web by just entering a string of sequence data. Though general search engines can be tuned to accept ‘processed’ queries, the burden of preparing such ‘search strings’ simply defeats the purpose of quickly locating highly relevant information. Unlike ‘sequence similarity’ searches that employ dedicated algorithms (like BLAST) to compare an input sequence from defined databases, a direct ‘sequence based’ search simply locates quick and relevant information about a blunt piece of nucleotide or peptide sequence. This approach is particularly invaluable to all biomedical researchers who would often like to enter a sequence and quickly locate any pertinent information before proceeding to carry out detailed sequence alignment. 

*Results:* Here, we describe the theory and implementation of a web-based front-end for a search engine, like Google, which accepts sequence fragments and interactively retrieves a collection of highly relevant links and documents, in real-time. e.g. flat files like patent records, privately hosted sequence documents and regular databases. 

*Conclusions:* The importance of this simple yet highly relevant tool will be evident when with a little bit of tweaking, the tool can be engineered to carry out searches on all kinds of hosted documents in the World-Wide-Web.

*Availability:* Instaseq is free web based service that can be accessed by visiting the following hyperlink on the WWW
http://instaseq.georgetown.edu 

- …
