30 research outputs found

    Evaluation of next-generation sequencing software in mapping and assembly

    Next-generation high-throughput DNA sequencing technologies have advanced progressively in sequence-based genomic research and novel biological applications with the promise of sequencing DNA at unprecedented speed. These new non-Sanger-based technologies offer several advantages over traditional sequencing methods in terms of higher sequencing speed, lower per-run cost and higher accuracy. However, reads from next-generation sequencing (NGS) platforms, such as 454/Roche, ABI/SOLiD and Illumina/Solexa, are usually short, thereby restricting the applications of NGS platforms in genome assembly and annotation. We presented an overview of the challenges that these novel technologies face and illustrated various bioinformatics attempts at mapping and assembly to address them. We then compared the performance of several programs in these two fields, and provided advice on selecting suitable tools for specific biological applications.

    Bacterial Contamination in Public ATAC-Seq Data and Alignment-Free Detection Methods

    ATAC-seq is a new high-throughput sequencing technology for measuring chromatin accessibility within genomic samples. It can be used to discover new information about open regions, nucleosome positions, transcription factor binding sites, and DNA methylation. It is especially useful when combined with other next-generation sequencing techniques, such as RNA-seq. Unlike previous technologies, however, ATAC-seq is more sensitive to bacterial contamination, a well-known problem in cell cultures that can lead to incorrect experimental results. Previous studies have measured contamination in public RNA-seq data and found that 5-10% of samples were contaminated. In this report, we investigate the prevalence of contamination in ATAC-seq samples, rather than RNA-seq data, uploaded to the Sequence Read Archive, using two popular alignment-based tools: Bowtie 2 and Kraken 2. We then develop an alignment-free method of detection using machine learning and a novel method of estimating DNA fragment lengths from paired-end ATAC-seq data. Our results show that around 5% of ATAC-seq samples are contaminated, and our machine learning method correctly classifies 97% of samples as contaminated or not while using fewer computational resources than the alignment-based tools. Thus, our method shows promise as a rapid preliminary screening tool for contamination in labs with limited access to large computational resources.
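
    As an illustration of the fragment-length idea mentioned in this abstract (a minimal sketch, not the authors' actual method), the code below assumes that short ATAC-seq fragments leave the two mates of a pair overlapping, so the fragment length can be recovered without a reference alignment by finding the overlap between read 1 and the reverse complement of read 2. All sequences and thresholds are illustrative.

        def revcomp(seq):
            comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
            return "".join(comp[b] for b in reversed(seq))

        def estimate_fragment_length(r1, r2, min_overlap=10, max_mismatch_frac=0.05):
            """Estimate the fragment length of one read pair without a reference,
            assuming the mates overlap (fragment shorter than the combined read
            length). Returns None if no confident overlap is found."""
            r2rc = revcomp(r2)
            # Try every candidate overlap length, longest first.
            for ov in range(min(len(r1), len(r2rc)), min_overlap - 1, -1):
                mismatches = sum(a != b for a, b in zip(r1[-ov:], r2rc[:ov]))
                if mismatches <= max_mismatch_frac * ov:
                    return len(r1) + len(r2) - ov
            return None

        # Example: a 60 bp fragment sequenced with 40 bp reads from both ends.
        frag = "ACGTACGTGGCCTTAACGGATCCTTAGGCATGCATGCCGGAATTCCGGAAGCTTGGATCC"
        r1, r2 = frag[:40], revcomp(frag[-40:])
        print(estimate_fragment_length(r1, r2))  # -> 60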

    MR-CUDASW - GPU accelerated Smith-Waterman algorithm for medium-length (meta)genomic data

    The idea of using a graphics processing unit (GPU) for more than simply graphic output purposes has been around for quite some time in scientific communities. However, it is only recently that its benefits for a range of bioinformatics and life sciences compute-intensive tasks have been recognized. This thesis investigates the possibility of improving the performance of the overlap determination stage of an Overlap Layout Consensus (OLC)-based assembler by using a GPU-based implementation of the Smith-Waterman algorithm. In this thesis an existing GPU-accelerated sequence alignment algorithm is adapted and expanded to reduce its completion time. A number of improvements and changes are made to the original software. Workload distribution, query profile construction, and thread scheduling techniques implemented by the original program are replaced by custom methods specifically designed to handle medium-length reads. Accordingly, this algorithm is the first highly parallel solution that has been specifically optimized to process medium-length nucleotide reads (DNA/RNA) from modern sequencing machines (i.e. Ion Torrent). Results show that the software reaches up to 82 GCUPS (Giga Cell Updates Per Second) on a single-GPU graphics card running on commodity desktop hardware. As a result it is the fastest GPU-based implementation of the Smith-Waterman algorithm tailored for processing medium-length nucleotide reads. Despite being designed for performing the Smith-Waterman algorithm on medium-length nucleotide sequences, this program also presents great potential for improving heterogeneous computing with CUDA-enabled GPUs in general and is expected to contribute to other research problems that require sensitive pairwise alignment to be applied to a large number of reads. Our results show that it is possible to improve the performance of bioinformatics algorithms by taking full advantage of the compute resources of the underlying commodity hardware; these results are especially encouraging since GPU performance grows faster than that of multi-core CPUs.
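
    For reference, the recurrence that such GPU implementations parallelise is the standard Smith-Waterman local-alignment score. The sketch below is a plain single-threaded Python version for illustration only; it is not the thesis software, and the scoring parameters are arbitrary.

        def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
            """Return the best Smith-Waterman local alignment score of a and b."""
            rows, cols = len(a) + 1, len(b) + 1
            H = [[0] * cols for _ in range(rows)]  # dynamic-programming matrix
            best = 0
            for i in range(1, rows):
                for j in range(1, cols):
                    diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
                    H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
                    best = max(best, H[i][j])
            return best

        # Toy example: best local alignment score between two short reads.
        print(smith_waterman("TGTTACGG", "GGTTGACTA"))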

    PASQUAL: Parallel Techniques for Next Generation Genome Sequence Assembly


    Recent advances in inferring viral diversity from high-throughput sequencing data

    Rapidly evolving RNA viruses prevail within a host as a collection of closely related variants, referred to as a viral quasispecies. Advances in high-throughput sequencing (HTS) technologies have facilitated the assessment of the genetic diversity of such virus populations at an unprecedented level of detail. However, analysis of HTS data from virus populations is challenging due to short, error-prone reads. In order to account for uncertainties originating from these limitations, several computational and statistical methods have been developed for studying the genetic heterogeneity of virus populations. Here, we review methods for the analysis of HTS reads, including approaches to local diversity estimation and global haplotype reconstruction. Challenges posed by aligning reads, as well as the impact of reference biases on diversity estimates, are also discussed. In addition, we address some of the experimental approaches designed to improve the biological signal-to-noise ratio. In the future, computational methods for the analysis of heterogeneous virus populations are likely to continue being complemented by technological developments.
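
    As a toy illustration of local (per-site) diversity estimation, one common summary is the Shannon entropy of the base composition observed in one column of a read pileup. This sketch is generic and not tied to any specific tool reviewed in the article.

        import math
        from collections import Counter

        def site_entropy(column):
            """Shannon entropy (bits) of the base composition at one alignment
            column, a simple per-site diversity measure for a read pileup."""
            counts = Counter(b for b in column if b in "ACGT")
            total = sum(counts.values())
            if total == 0:
                return 0.0
            return -sum((n / total) * math.log2(n / total) for n in counts.values())

        # Bases observed at one genomic position across the aligned reads:
        print(site_entropy("AAAAAAAACC"))  # mixed site -> about 0.72 bits
        print(site_entropy("GGGGGGGGGG"))  # monomorphic site -> 0.0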

    Bioinformatics approaches to detect genetic variation in whole genome sequencing data

    Current genetic marker repositories are insufficient, or even completely lacking, for most farm animals. However, genetic markers are essential for developing research tools that facilitate the discovery of genetic factors contributing to disease resistance and to the overall welfare and performance of farm animals. By large-scale identification of single nucleotide polymorphisms (SNPs) and structural variants (SVs), we aimed to contribute to the development of a repository of genetic variants for farm animals. For this purpose, bioinformatics pipelines were designed and validated to address the challenge of cost-effective identification of genetic markers in DNA sequencing data, even in the absence of a fully sequenced reference genome. To find SNPs in pig, we analysed publicly available whole-genome shotgun sequencing datasets by sequence alignment and clustering. Sequence clusters were assigned to genomic locations using publicly available BAC sequencing and BAC mapping data. Within these clusters, thousands of SNPs were detected whose genomic location is roughly known. For turkey and duck, two species that lacked a sufficient sequence data repository for variant discovery, we applied next-generation sequencing (NGS) to a reduced genome representation of a pooled DNA sample. For turkey, a genome reference was reconstructed from our sequencing data and available public sequencing data, whereas for duck the reference genome constructed by an NGS project was used. SNPs obtained by our cost-effective detection procedure nevertheless cover, at intervals, the whole turkey and duck genomes and are of sufficient quality to be used in genotyping studies. Allele frequencies, obtained by genotyping animal panels with a subset of our SNPs, correlated well with those observed during SNP detection. The availability of two external duck SNP datasets allowed for the construction of the subset of SNPs we had in common with these sets; genotyping showed that this subset is of outstanding quality and can be used for benchmarking the other SNPs we identified in duck. Ongoing developments in NGS allowed for paired-end sequencing, an extension of sequencing analysis in which the two reads of a pair come from the outer ends of one sequenced DNA fragment. We applied this technique to a reduced genome representation of four chicken breeds to detect SVs. Paired-end reads were mapped to the chicken reference genome, and SVs were identified as abnormally aligned read pairs whose orientation or span size is discordant with the reference genome. SV detection parameters for distinguishing true structural variants from false positives were designed and optimized by validating a small representative sample of SVs using PCR and traditional capillary sequencing. To conclude, we developed SNP repositories that fulfil the requirement for SNPs to perform linkage analysis, comparative genomics, QTL studies and ultimately GWA studies in a range of farm animals. We also took the first step in developing a repository of SVs in chicken, a relatively new type of genetic marker in animal sciences.
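
    As an illustration of the span/orientation rules described in this abstract, the following sketch classifies a single read pair against a hypothetical expected insert-size range. The thresholds and labels are illustrative only, not those of the thesis pipeline; real pipelines estimate the expected range per sequencing library.

        def classify_pair(chrom, pos, strand, m_chrom, m_pos, m_strand,
                          expected_min=150, expected_max=600):
            """Flag one read pair as concordant or as a candidate SV signal,
            based on mate chromosome, relative orientation and span size."""
            if chrom != m_chrom:
                return "translocation_candidate"
            span = abs(m_pos - pos)
            if strand == m_strand:
                return "inversion_candidate"   # mates should face each other
            if span > expected_max:
                return "deletion_candidate"    # span larger than library insert
            if span < expected_min:
                return "insertion_candidate"   # span smaller than library insert
            return "concordant"

        print(classify_pair("chr1", 10_000, "+", "chr1", 14_000, "-"))  # deletion_candidate
        print(classify_pair("chr1", 10_000, "+", "chr1", 10_300, "-"))  # concordant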

    High Performance Computing for DNA Sequence Alignment and Assembly

    Recent advances in DNA sequencing technology have dramatically increased the scale and scope of DNA sequencing. These data are used for a wide variety of important biological analyses, including genome sequencing, comparative genomics, transcriptome analysis, and personalized medicine, but are complicated by the volume and complexity of the data involved. Given the massive size of these datasets, computational biology must draw on the advances of high performance computing. Two fundamental computations in computational biology are read alignment and genome assembly. Read alignment maps short DNA sequences to a reference genome to discover conserved and polymorphic regions of the genome. Genome assembly computes the sequence of a genome from many short DNA sequences. Both computations benefit from recent advances in high performance computing to efficiently process the huge datasets involved, including using highly parallel graphics processing units (GPUs) as high performance desktop processors, and using the MapReduce framework coupled with cloud computing to parallelize computation across large compute grids. This dissertation demonstrates how these technologies can be used to accelerate these computations by orders of magnitude, and how they have the potential to make otherwise infeasible computations practical.
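
    The MapReduce pattern mentioned above can be illustrated with a toy k-mer counter, where the map step emits (k-mer, 1) pairs from each read and the reduce step sums the counts per key. This is a generic, in-process illustration of the programming model, not the dissertation's distributed software.

        from collections import defaultdict

        def map_phase(read, k=4):
            """Map step: emit (k-mer, 1) pairs from one read."""
            for i in range(len(read) - k + 1):
                yield read[i:i + k], 1

        def reduce_phase(pairs):
            """Reduce step: sum the counts for each k-mer key."""
            counts = defaultdict(int)
            for kmer, n in pairs:
                counts[kmer] += n
            return dict(counts)

        reads = ["ACGTACGTAC", "GTACGTTTAC"]
        pairs = (p for read in reads for p in map_phase(read))
        print(reduce_phase(pairs))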

    Paired is better: local assembly algorithms for NGS paired reads and applications to RNA-Seq

    The analysis of biological sequences is one of the main research areas of Bioinformatics. Sequencing data are the input for almost all types of studies concerning genomic as well as transcriptomic sequences, and sequencing experiments should be conceived specifically for each type of application. The challenges posed by fundamental biological questions are usually addressed by first aligning or assembling the reads produced by new sequencing technologies. Assembly is the first step when a reference sequence is not available. Alignment of genomic reads to a known genome is fundamental, e.g., to find the differences among organisms of related species and to detect mutations characteristic of the so-called "diseases of the genome". Alignment of transcriptomic reads against a reference genome allows us to detect the expressed genes as well as to annotate and quantify alternative transcripts. In this thesis we overview the approaches proposed in the literature for solving the above-mentioned problems. In particular, we analyze in depth the sequence assembly problem, with particular emphasis on genome reconstruction, both from a theoretical point of view and in light of the characteristics of sequencing data produced by state-of-the-art technologies. We also review the main steps in a pipeline for the analysis of the transcriptome, that is, alignment, assembly, and transcript quantification, with particular emphasis on the opportunities offered by RNA-Seq technologies for enhancing precision. The thesis is divided into two parts, the first devoted to the study of local assembly methods for Next Generation Sequencing data, the second concerning the development of tools for the alignment of RNA-Seq reads and transcript quantification. The recurring theme is the use of paired reads in all fields of application discussed in this thesis. In particular, we emphasize the benefits of assembling inserts from paired reads in a wide range of applications, from de novo assembly to the analysis of RNA. The main contribution of this thesis lies in the introduction of innovative tools, based on well-studied heuristics fine-tuned on the data. The software is always tested to specifically assess the correctness of its predictions. The aim is to produce robust methods that, having a low false-positive rate, produce a certified output characterized by high specificity.
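
    As a small illustration of the transcript-quantification step mentioned above, the sketch below computes the widely used TPM (Transcripts Per Million) measure from hypothetical per-transcript read counts. It is not the specific quantification method developed in the thesis; the transcript names, counts and lengths are invented for the example.

        def tpm(counts, lengths):
            """Transcripts Per Million: normalise read counts by transcript length,
            then scale so that the values sum to one million."""
            rates = {t: counts[t] / lengths[t] for t in counts}  # reads per base
            scale = 1e6 / sum(rates.values())
            return {t: r * scale for t, r in rates.items()}

        counts = {"tx_long": 900, "tx_short": 100}    # aligned read counts (hypothetical)
        lengths = {"tx_long": 3000, "tx_short": 500}  # transcript lengths in bases
        print(tpm(counts, lengths))  # the short transcript gets 40% of the TPM mass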

    Discovery and interpretation of genetic variation with next‐generation sequencing technologies

    Thesis advisor: Gabor T. Marth. Improvements in molecular and computational technologies have driven and will continue to drive advances in our understanding of genetic variation and its relationship to phenotypic diversity. Over the last three years, several new DNA sequencing technologies have been developed that greatly improve upon the cost and throughput of the capillary DNA sequencing technologies that were used to sequence the first human genome. The economy of these so-called “next-generation” technologies has enabled researchers to conduct genome-wide studies of genetic variation that were previously intractable or too expensive. However, because the new technologies employ novel molecular techniques, the resulting sequence data are quite different from the capillary sequences to which the genomics field is accustomed. Moreover, the vast amounts of sequence data that these technologies produce present novel statistical and computational challenges in order to make even the simplest observations. The focus of my dissertation has been the development of novel computational and analytical methods that facilitate genome-wide studies of genetic variation with traditional capillary sequencers and with new sequencing technologies. I present a novel method that produces more accurate error estimates for sequence data from one of these next-generation sequencing technologies. I also present two studies that illustrate the utility of two such technologies for genome-wide polymorphism discovery in Drosophila melanogaster and Caenorhabditis elegans; these studies accurately estimate the degree of genetic diversity in the fruitfly and nematode, respectively. I later describe how new sequencing approaches can be used to accelerate the mapping of causal genetic mutations in forward genetic screens. Lastly, I remark on where I believe these technologies will lead future studies in human genetic variation and describe their relevance to several of my future research interests. Thesis (PhD), Boston College, 2008. Submitted to: Boston College, Graduate School of Arts and Sciences. Discipline: Biology.
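
    For context, the error estimates discussed above build on the standard Phred relationship between a base-quality score and its error probability. The sketch below is generic background, not the author's method; it simply converts a FASTQ quality string (Phred+33 encoding assumed) into an expected error count per read.

        def phred_to_error_prob(q):
            """Standard Phred relationship: quality Q means error probability 10^(-Q/10)."""
            return 10 ** (-q / 10)

        def expected_errors(quality_string, offset=33):
            """Expected number of base-calling errors in one read, from its
            FASTQ quality string."""
            return sum(phred_to_error_prob(ord(c) - offset) for c in quality_string)

        print(round(phred_to_error_prob(20), 4))      # Q20 -> 0.01
        print(round(expected_errors("IIIII###"), 3))  # five Q40 bases and three Q2 bases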