
Harvard University - DASH

Recommended from our members

MAGIC-SPP: a database-driven DNA sequence processing package with associated management tools

Author: Cordonnier-Pratt Marie-Michèle
Freeman Robert M
Liang Chun
Pratt Lee H
Qu Junfeng
Sun Feng
Wang Haiming
Publication venue: BioMed Central
Publication date: 01/01/2006

BACKGROUND: Processing raw DNA sequence data is an especially challenging task for relatively small laboratories and core facilities that produce as many as 5000 or more DNA sequences per week from multiple projects in widely differing species. To meet this challenge, we have developed the flexible, scalable, and automated sequence processing package described here. RESULTS: MAGIC-SPP is a DNA sequence processing package consisting of an Oracle 9i relational database, a Perl pipeline, and user interfaces implemented either as JavaServer Pages (JSP) or as a Java graphical user interface (GUI). The database not only serves as a data repository, but also controls processing of trace files. MAGIC-SPP includes an administrative interface, a laboratory information management system, and interfaces for exploring sequences, monitoring quality control, and troubleshooting problems related to sequencing activities. In the sequence trimming algorithm it employs new features designed to improve performance with respect to concerns such as concatenated linkers, identification of the expected start position of a vector insert, and extending the useful length of trimmed sequences by bridging short regions of low quality when the following high quality segment is sufficiently long to justify doing so. CONCLUSION: MAGIC-SPP has been designed to minimize human error, while simultaneously being robust, versatile, flexible and automated. It offers a unique combination of features that permit administration by a biologist with little or no informatics background. It is well suited to both individual research programs and core facilities
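The bridging idea in the trimming step can be illustrated with a minimal sketch. It is not the MAGIC-SPP Perl implementation: the function names and the thresholds `min_q`, `max_gap`, and `min_follow` are hypothetical values chosen for illustration, and the abstract does not state the actual criteria used.

```python
from typing import List, Optional, Tuple


def high_quality_runs(quals: List[int], min_q: int) -> List[Tuple[int, int]]:
    """Return half-open (start, end) intervals where every quality is >= min_q."""
    runs = []
    start = None
    for i, q in enumerate(quals):
        if q >= min_q:
            if start is None:
                start = i
        elif start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(quals)))
    return runs


def trim_with_bridging(
    quals: List[int],
    min_q: int = 20,       # assumed quality cutoff (hypothetical)
    max_gap: int = 3,      # assumed longest low-quality region to bridge (hypothetical)
    min_follow: int = 10,  # assumed minimum length of the following segment (hypothetical)
) -> Optional[Tuple[int, int]]:
    """Return the longest retained interval after bridging short low-quality gaps."""
    runs = high_quality_runs(quals, min_q)
    if not runs:
        return None
    merged = [runs[0]]
    for start, end in runs[1:]:
        prev_start, prev_end = merged[-1]
        gap = start - prev_end
        # Bridge only when the gap is short and the next segment is long enough.
        if gap <= max_gap and (end - start) >= min_follow:
            merged[-1] = (prev_start, end)
        else:
            merged.append((start, end))
    return max(merged, key=lambda r: r[1] - r[0])
```

With these assumed thresholds, a two-base dip in quality followed by fifteen good bases is bridged, whereas the same dip followed by only five good bases ends the trimmed sequence.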

Springer - Publisher Connector

MICA: desktop software for comprehensive searching of DNA databases

Author: A Ning
AL Price
Benjamin S Glick
D Gusfield
DE Knuth
E Hunt
I Crawford
J Bikandi
J Reneker
K Murphy
K Rotmistrovsky
L Noé
M Lexa
M Li
MI Abouelhoda
PC Boutros
RA Lippert
S Kurtz
S Rombauts
SF Altschul
TD Wu
WA Greene
William A Stokes
WJ Kent
WR Pearson
Z Ning
Publication venue: BioMed Central
Publication date: 01/10/2006

BACKGROUND: Molecular biologists work with DNA databases that often include entire genomes. A common requirement is to search a DNA database to find exact matches for a nondegenerate or partially degenerate query. The software programs available for such purposes are normally designed to run on remote servers, but an appealing alternative is to work with DNA databases stored on local computers. We describe a desktop software program termed MICA (K-Mer Indexing with Compact Arrays) that allows large DNA databases to be searched efficiently using very little memory. RESULTS: MICA rapidly indexes a DNA database. On a Macintosh G5 computer, the complete human genome could be indexed in about 5 minutes. The indexing algorithm recognizes all 15 characters of the DNA alphabet and fully captures the information in any DNA sequence, yet for a typical sequence of length L, the index occupies only about 2L bytes. The index can be searched to return a complete list of exact matches for a nondegenerate or partially degenerate query of any length. A typical search of a long DNA sequence involves reading only a small fraction of the index into memory. As a result, searches are fast even when the available RAM is limited. CONCLUSION: MICA is suitable as a search engine for desktop DNA analysis software
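As a rough illustration of k-mer indexing with a partially degenerate query, the sketch below uses a plain Python dictionary rather than MICA's compact arrays, so it says nothing about the memory footprint described above. It assumes, for simplicity, that the database sequence contains only A, C, G, and T and that the query contains at least one window of k nondegenerate characters; the names `build_index` and `search` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List

# The 15 IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def build_index(seq: str, k: int) -> Dict[str, List[int]]:
    """Map every k-mer in seq to the sorted list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def search(seq: str, index: Dict[str, List[int]], k: int, query: str) -> List[int]:
    """Return start positions in seq where the (possibly degenerate) query matches."""
    # Use the first fully nondegenerate window of the query as the lookup seed.
    seed_offset = next(
        (j for j in range(len(query) - k + 1)
         if all(c in "ACGT" for c in query[j:j + k])),
        None,
    )
    if seed_offset is None:
        raise ValueError("query needs at least k consecutive nondegenerate bases")
    hits = []
    for pos in index.get(query[seed_offset:seed_offset + k], []):
        start = pos - seed_offset
        if start < 0 or start + len(query) > len(seq):
            continue
        # Verify the whole query, letting each IUPAC code match any of its bases.
        if all(seq[start + j] in IUPAC[c] for j, c in enumerate(query)):
            hits.append(start)
    return hits
```

Only the positions listed under the seed k-mer are inspected, which loosely mirrors the point that a search reads a small fraction of the index.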

Springer - Publisher Connector

arXiv.org e-Print Archive

Low-Bandwidth and Non-Compute Intensive Remote Identification of Microbes from Raw Sequencing Reads

Author: AJ Cox
B Langmead
B Rost
D Devos
H Li
J Shendure
Laurent Gautier
MS Lindner
N Rusk
Ole Lund
S Hoffimann
SF Altschul
SM Rumble
T Smith
Tim J. Hubbard
WJ Kent
Z Ning
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 01/01/2013

Cheap high-throughput DNA sequencing may soon become routine not only for human genomes but also for practically anything requiring the identification of living organisms from their DNA: tracking of infectious agents, control of food products, bioreactors, or environmental samples. We propose a novel general approach to the analysis of sequencing data in which the reference genome does not have to be specified. Using a distributed architecture, we are able to query a remote server for hints about what the reference might be, transferring a relatively small amount of data, and the hints can be used for more computationally demanding work. Our system consists of a server with known reference DNA indexed, and a client with raw sequencing reads. The client sends a sample of unidentified reads, and in return receives a list of matching references known to the server. Sequences for the references can be retrieved and used for exhaustive computation on the reads, such as alignment. To demonstrate this approach we have implemented a web server, indexing tens of thousands of publicly available genomes and genomic regions from various organisms and returning lists of matching hits from query sequencing reads. We have also implemented two clients, one of them running in a web browser, in order to demonstrate that gigabytes of raw sequencing reads of unknown origin could be identified without the need to transfer a very large volume of data, and on modestly powered computing devices. Web access is available at http://tapir.cbs.dtu.dk. The source code for a Python command-line client, a server, and supplementary data is available at http://bit.ly/1aURxkc
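The client-side step of sending only a sample of reads could be sketched as follows. The abstract does not say how the sample is drawn, so reservoir sampling is an assumption made here for illustration, as are the function names; no server call is shown because the service's request format is not described in this text.

```python
import random
from typing import Iterable, Iterator, List


def fastq_sequences(lines: Iterable[str]) -> Iterator[str]:
    """Yield the sequence line of each four-line FASTQ record."""
    for i, line in enumerate(lines):
        if i % 4 == 1:
            yield line.strip()


def reservoir_sample(items: Iterable[str], n: int, seed: int = 0) -> List[str]:
    """Draw up to n items uniformly at random in one pass over a stream."""
    rng = random.Random(seed)
    sample: List[str] = []
    for i, item in enumerate(items):
        if i < n:
            sample.append(item)
        else:
            # Replace an existing element with probability n / (i + 1).
            j = rng.randint(0, i)
            if j < n:
                sample[j] = item
    return sample
```

Because the stream is read once and only n reads are held in memory, a sketch like this would suit the modestly powered clients the abstract mentions; the sample would then be what is sent to the server in place of the full read set.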

CiteSeerX

Public Library of Science (PLOS)

Online Research Database In Technology

db4DNASeq: An object-oriented DNA database model associated with sequence search method

Author: Amnuaisuk Somnuk Phon
Chin Kuan Ho
Keng Hoong Ng
Kyuk Wei Shoo
Wei Liam Diong
Publication date: 10/06/2008

A DNA database consists of many nucleotide sequences; it must not only support typical database queries but also facilitate sequence search and alignment. In this paper, we present an object-oriented nucleotide database designed not only to execute normal database operations such as insertion, modification, and data querying quickly, but also to support a fast search method on database sequences with a reasonable tradeoff between time and memory usage.

UUM Repository

An Integrated Pipeline of Open Source Software Adapted for Multi-CPU Architectures: Use in the Large-Scale Identification of Single Nucleotide Polymorphisms

Author: Chandra S.
Eshwar K.
Hanspal Manindra S.
Hoisington David A.
Jayashree B.
Ramesh N.
Spurthi N.
Srinivasan Rajgopal
Varshney Rajeev K.
Vigneshwaran R.
Publication venue: Hindawi Publishing Corporation
Publication date: 01/01/2007

The large amounts of EST sequence data available from a single species of an organism as well as for several species within a genus provide an easy source of identification of intra- and interspecies single nucleotide polymorphisms (SNPs). In the case of model organisms, the data available are numerous, given the degree of redundancy in the deposited EST data. There are several available bioinformatics tools that can be used to mine this data; however, using them requires a certain level of expertise: the tools have to be used sequentially with accompanying format conversion and steps like clustering and assembly of sequences become time-intensive jobs even for moderately sized datasets. We report here a pipeline of open source software extended to run on multiple CPU architectures that can be used to mine large EST datasets for SNPs and identify restriction sites for assaying the SNPs so that cost-effective CAPS assays can be developed for SNP genotyping in genetics and breeding applications. At the International Crops Research Institute for the Semi-Arid Tropics (ICRISAT), the pipeline has been implemented to run on a Paracel high-performance system consisting of four dual AMD Opteron processors running Linux with MPICH. The pipeline can be accessed through user-friendly web interfaces at http://hpc.icrisat.cgiar.org/PBSWeb and is available on request for academic use. We have validated the developed pipeline by mining chickpea ESTs for interspecies SNPs, development of CAPS assays for SNP genotyping, and confirmation of restriction digestion pattern at the sequence level
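The step of identifying restriction sites for assaying SNPs can be illustrated with a small sketch. It is not the ICRISAT pipeline: the function names are hypothetical, only three well-known enzymes are listed, and the two alleles are assumed to be equal-length sequences that differ by substitutions so that site positions are directly comparable.

```python
from typing import Dict, List, Tuple

# Recognition sequences of three commonly used restriction enzymes.
ENZYMES = {"EcoRI": "GAATTC", "HindIII": "AAGCTT", "BamHI": "GGATCC"}


def site_positions(seq: str, site: str) -> List[int]:
    """Return every start position at which the recognition site occurs in seq."""
    return [i for i in range(len(seq) - len(site) + 1) if seq[i:i + len(site)] == site]


def caps_candidates(
    allele_a: str, allele_b: str, enzymes: Dict[str, str] = ENZYMES
) -> Dict[str, Tuple[List[int], List[int]]]:
    """Report enzymes whose site positions differ between the two alleles."""
    candidates = {}
    for name, site in enzymes.items():
        a = set(site_positions(allele_a, site))
        b = set(site_positions(allele_b, site))
        # A SNP that creates or destroys a site gives different digestion patterns.
        if a != b:
            candidates[name] = (sorted(a), sorted(b))
    return candidates
```

An enzyme reported here cuts one allele but not the other at some position, which is the property a CAPS assay relies on to distinguish the alleles after digestion.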

ICRISAT Open Access Repository

Research Repository

Harvesting the Mouse Genome

Author: Batzoglou
Griffin
Kleinjan
Marc Botcherby
Miles
Schwartz
Publication venue: Hindawi Publishing Corporation
Publication date: 01/01/2002

The sequencing of the black 6 mouse (strain C57Bl/6) has reached an important juncture. The BAC fingerprint map is almost complete, the BACs have been end-sequenced, and a seven-fold coverage whole-genome shotgun has been assembled. Now the BAC-by-BAC sequencing phase is under way, and in-depth comparative analysis can be carried out on regions that have been the subject of targeted sequencing. This paper reviews the progress so far and looks forward to the promises of finished sequence.

Ensembl variation resources

Author: Birney Ewan
Brent Simon
Chen Yuan
Cunningham Fiona
Flicek Paul
Kulesha Eugene
Marin-Garcia Pablo
McLaren William M
Pritchard Bethan
Rios Daniel
Smedley Damian
Smith James
Spudich Giulietta M
Publication venue: BioMed Central
Publication date: 01/01/2010

Springer - Publisher Connector

The Vertebrate Genome Annotation (Vega) database

Author: Ashurst J. L.
Chen C.-K.
Gilbert J. G. R.
Hubbard T.
Jekosch K.
Keenan S.
Meidl P.
Searle S. M.
Stalker J.
Storey R.
Trevanion S.
Wilming L.
Publication venue: Oxford University Press
Publication date: 17/12/2004

The Vertebrate Genome Annotation (Vega) database (http://vega.sanger.ac.uk) has been designed to be a community resource for browsing manual annotation of finished sequences from a variety of vertebrate genomes. Its core database is based on an Ensembl-style schema, extended to incorporate curation-specific metadata. In collaboration with the genome sequencing centres, Vega attempts to present consistent high-quality annotation of the published human chromosome sequences. In addition, it is also possible to view various finished regions from other vertebrates, including mouse and zebrafish. Vega displays only manually annotated gene structures built using transcriptional evidence, which can be examined in the browser. Attempts have been made to standardize the annotation procedure across each vertebrate genome, which should aid comparative analysis of orthologues across the different finished regions

CiteSeerX