579 research outputs found

    Optimization Techniques For Next-Generation Sequencing Data Analysis

    High-throughput RNA sequencing (RNA-Seq) is a popular, cost-efficient technology with many medical and biological applications. This technology, however, presents a number of computational challenges in reconstructing full-length transcripts and accurately estimating their abundances across all cell types. Our contributions include (1) transcript and gene expression level estimation methods, (2) methods for genome-guided and annotation-guided transcriptome reconstruction, and (3) de novo assembly and annotation of real data sets. Transcript expression level estimation, also referred to as transcriptome quantification, tackles the problem of estimating the expression level of each transcript. Transcriptome quantification analysis is crucial for identifying similar transcripts and unraveling gene functions and transcription regulation mechanisms. We propose a novel simulated-regression-based method for transcriptome frequency estimation from RNA-Seq reads. Transcriptome reconstruction refers to the problem of reconstructing the transcript sequences from RNA-Seq data. We present genome-guided and annotation-guided transcriptome reconstruction methods. Empirical results on both synthetic and real RNA-Seq datasets show that the proposed methods improve transcriptome quantification and reconstruction accuracy compared to current state-of-the-art methods. We further present the assembly and annotation of the Bugula neritina transcriptome (a marine colonial animal) and the Tallapoosa darter genome (a freshwater fish from a species-rich radiation).
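    As an illustration of the regression view of transcriptome quantification described above (a minimal sketch, not this thesis's actual estimator; the compatibility matrix and counts are made-up toy data), observed read counts can be modeled as a linear mix of unknown transcript frequencies and solved with non-negative least squares:

```python
# Illustrative sketch of regression-based transcriptome quantification.
# NOT the authors' exact method: read-class counts are modeled as a linear
# combination of unknown transcript abundances and solved with
# non-negative least squares (scipy.optimize.nnls).
import numpy as np
from scipy.optimize import nnls

# A[i, j] = probability that a read from transcript j falls in read class i
# (e.g., derived from alignment compatibility and transcript lengths).
A = np.array([
    [0.9, 0.1, 0.0],
    [0.1, 0.7, 0.2],
    [0.0, 0.2, 0.8],
])
counts = np.array([900.0, 450.0, 650.0])  # observed reads per class

theta, _residual = nnls(A, counts)  # raw transcript abundances
freqs = theta / theta.sum()         # normalize to relative frequencies
print("estimated transcript frequencies:", freqs.round(3))
```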

    Cloud Computing for Next-Generation Sequencing Data Analysis

    High-throughput next-generation sequencing (NGS) technologies have evolved rapidly and are reshaping the scope of genomics research. The substantial decrease in the cost of NGS techniques over the past decade has led to their rapid adoption in biological research and drug development. Genomics studies of large populations are producing huge amounts of data, giving rise to computational issues around the storage, transfer, and analysis of the data. Fortunately, cloud computing has recently emerged as a viable option for quickly and easily acquiring the computational resources for large-scale NGS data analyses. Some cloud-based applications and resources have been developed specifically to address the computational challenges of working with very large volumes of data generated by NGS technology. In this chapter, we review some cloud-based systems and solutions for NGS data analysis, discuss the practical hurdles and limitations in cloud computing, including data transfer and security, and share the lessons we learned from the implementation of Rainbow, a cloud-based tool for large-scale genome sequencing data analysis.
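    As a concrete illustration of the data-transfer step discussed in this chapter (not part of Rainbow itself; the bucket and file names are hypothetical), staging reads in cloud object storage can be as simple as:

```python
# Hypothetical illustration of staging NGS reads in cloud object storage
# before analysis. Bucket and key names are made up; credentials are taken
# from the standard AWS environment. This is not Rainbow's own code.
import boto3

s3 = boto3.client("s3")
# upload_file handles multipart upload automatically for large FASTQ files.
s3.upload_file("sample_R1.fastq.gz", "my-ngs-bucket", "raw/sample_R1.fastq.gz")
s3.upload_file("sample_R2.fastq.gz", "my-ngs-bucket", "raw/sample_R2.fastq.gz")
```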

    Enabling and Performance Benchmarking of a Next-generation Sequencing Data Analysis Pipeline

    The development of Next-Generation Sequencing (NGS) technology has resulted in the rapid accumulation of a large amount of sequencing data that demands data mining, and a variety of variant-calling software tools and pipelines have come into being. The Genome Analysis Toolkit (GATK) and its Best Practices quickly became the industry gold standard for variant calling because of their speed, high accuracy, and throughput. GATK is continually updated; the latest and most capable version, GATK4, enables parallelization and cloud-infrastructure optimization via Apache Spark. The Broad Institute has cooperated with many cloud providers to deploy the GATK Best Practices on cloud platforms. However, no benchmarking data has been released for GATK4, and there has been no cooperation with the CSC (CSC – IT Center for Science Ltd.) cPouta IaaS (Infrastructure as a Service) cloud. We optimized the WDL (Workflow Description Language) script of the germline SNP and indel short-variant discovery workflow from the Best Practices and ran it with the Cromwell execution engine on a cPouta virtual machine featuring a 24-core Intel(R) Xeon(R) E5-2680 v3 CPU with hyper-threading. In addition, we benchmarked execution times for five separate pipelines of this workflow with three 30X WGS (Whole Genome Sequencing) datasets (NA12878, NA12891, and NA12892), explored optimized runtime parameters for GATK4 tools, including PairHMM thread scalability in HaplotypeCaller and PGC thread scalability in MarkDuplicates, and compared execution times for GATK4 SortSam vs. SortSamSpark and MarkDuplicates vs. MarkDuplicatesSpark. We found that execution times for similar WGS datasets of different sizes and characteristics were consistent, and that execution time was roughly positively correlated with dataset size. The optimal thread count for GATK4 HaplotypeCaller in ERC mode is 12, yielding a 12.4% speed-up, and the optimal PGC thread count for GATK4 MarkDuplicates is 2. Multi-threading with the Spark local runner greatly sped up GATK4 tool execution: SortSamSpark with 16 local cores achieved a speed-up of 83.6%, and MarkDuplicatesSpark with 16 local cores achieved speed-ups of 22.2% and 37.3%, with and without writing the metrics file, respectively. With detailed virtual machine setup, optimized parameters, and GATK4 performance benchmarking data, this thesis serves as a guide to implementing the GATK4 Best Practices germline SNP and indel short-variant discovery workflow on the CSC cPouta cloud platform.
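    A sketch of how the PairHMM thread-scalability benchmark above could be driven (file paths are placeholders and the timing harness is our own illustration, not the thesis's scripts; `--native-pair-hmm-threads` is GATK4's real PairHMM threading flag, and GATK4 is assumed to be on PATH):

```python
# Sketch of a thread-scalability benchmark for GATK4 HaplotypeCaller in ERC
# mode, along the lines described above. Paths are placeholders.
import subprocess
import time

REF = "ref/Homo_sapiens_assembly38.fasta"  # placeholder reference
BAM = "bams/NA12878.recal.bam"             # placeholder input BAM

for threads in (1, 2, 4, 8, 12, 16, 24):
    cmd = [
        "gatk", "HaplotypeCaller",
        "-R", REF, "-I", BAM,
        "-O", f"NA12878.pairhmm{threads}.g.vcf.gz",
        "-ERC", "GVCF",
        "--native-pair-hmm-threads", str(threads),
    ]
    start = time.perf_counter()
    subprocess.run(cmd, check=True)  # run one configuration to completion
    print(f"{threads:2d} PairHMM threads: {time.perf_counter() - start:.1f} s")
```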

    QuickNGS elevates Next-Generation Sequencing data analysis to a new level of automation

    BACKGROUND: Next-Generation Sequencing (NGS) has emerged as a widely used tool in molecular biology. While the time and cost of the sequencing itself are decreasing, the analysis of the massive amounts of data remains challenging. Since multiple algorithmic approaches for the basic data analysis have been developed, there is now an increasing need to use these tools efficiently to obtain results in reasonable time. RESULTS: We have developed QuickNGS, a new workflow system for laboratories that need to analyze data from multiple NGS projects at a time. QuickNGS takes advantage of parallel computing resources, a comprehensive back-end database, and a careful selection of previously published algorithmic approaches to build fully automated data analysis workflows. We demonstrate the efficiency of our new software by a comprehensive analysis of 10 RNA-Seq samples, which we can finish in only a few minutes of hands-on time. The approach we have taken is suitable for processing even much larger numbers of samples and multiple projects at a time. CONCLUSION: Our approach considerably reduces the barriers that still limit the usability of the powerful NGS technology and decreases the time to be spent before proceeding to further downstream analysis and interpretation of the data. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12864-015-1695-x) contains supplementary material, which is available to authorized users.
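    QuickNGS's own code is not reproduced here, but the core idea of running many samples' workflows on parallel computing resources can be sketched as follows (a toy stand-in with simulated work, not QuickNGS itself):

```python
# Illustration of the parallelization idea behind batch NGS workflow systems
# such as QuickNGS (not QuickNGS's actual code). Each sample's per-sample
# steps run as an independent job on the available cores.
import time
from concurrent.futures import ProcessPoolExecutor

SAMPLES = [f"sample{i:02d}" for i in range(1, 11)]  # e.g., 10 RNA-Seq samples

def analyze(sample: str) -> str:
    # Stand-in for a real per-sample pipeline step (alignment, quantification,
    # recording results in a back-end database); here we only simulate work.
    time.sleep(0.1)
    return sample

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=4) as pool:
        for done in pool.map(analyze, SAMPLES):
            print(f"{done} finished")
```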

    Computational methods for protein structure prediction and next-generation sequencing data analysis

    With the wide application of next-generation sequencing technologies, the number of protein sequences is increasing exponentially. However, only a tiny portion of these proteins have experimentally verified structures. The huge protein sequence-structure gap can be reduced by computational methods, including template-based modeling and template-free modeling. Chapter 2 describes a stochastic point cloud sampling method for multi-template protein model generation. The stochastic sampling and simulated annealing protocol in the method has the capability to improve the global quality of protein models and to reduce atom clashes in them. Two popular approaches for improving protein structure prediction are enlarging the sampling space of template-based modeling and integrating template-based modeling with template-free modeling when no good templates, or only partial templates, can be found for a target protein. Chapters 3 and 4 introduce a large-scale conformation sampling and evaluation system for protein structure prediction that integrates the two approaches. Next-generation sequencing of RNA (RNA-Seq) generates hundreds of millions of short reads, and analysis of these reads is increasingly used to foster novel discoveries in biomedical research. Chapter 5 describes a bioinformatics pipeline for RNA-Seq data analysis, which converts gigabytes of raw RNA-Seq data into kilobytes of valuable biological knowledge through a five-step data mining and knowledge discovery process.
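    The simulated annealing protocol mentioned for Chapter 2 follows the generic accept/reject scheme sketched below (the energy function and move set are toy stand-ins, not the thesis's point-cloud model):

```python
# Generic simulated-annealing loop, illustrating the Metropolis accept/reject
# protocol referenced above; the objective and moves here are toy examples.
import math
import random

def energy(x: float) -> float:
    return (x - 3.0) ** 2  # toy objective with its minimum at x = 3

x = random.uniform(-10, 10)
temperature = 10.0
while temperature > 1e-3:
    candidate = x + random.gauss(0, 1)  # propose a local stochastic move
    delta = energy(candidate) - energy(x)
    # Metropolis criterion: always accept improvements, sometimes accept
    # worse states so the search can escape local minima.
    if delta < 0 or random.random() < math.exp(-delta / temperature):
        x = candidate
    temperature *= 0.995  # geometric cooling schedule
print(f"annealed solution: x = {x:.3f}, energy = {energy(x):.4f}")
```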

    Next-generation sequencing data analysis for cellulose- and xylan-degrading enzymes from POME metagenome

    A metagenomic DNA library from palm oil mill effluent (POME) was constructed and subjected to high-throughput screening to find genes encoding cellulose- and xylan-degrading enzymes. DNA from 30 positive fosmid clones was sequenced with next-generation sequencing technology, and the raw data (short-insert paired-end reads) were analyzed with bioinformatic tools. First, the quality of the raw data, 64,821,599 forward and reverse reads of 101 bp length, was assessed using FastQC and SOLEXA. Raw-data filtering was then carried out by trimming low-quality bases and short reads and removing vector sequences; the output was re-checked and the trimming repeated until a high-quality read set was obtained. The second step was the de novo assembly of the reads, which reconstructed 2,900 contigs following the de Bruijn graph algorithm. The pre-assembled contigs were then ordered and oriented with SSPACE, which estimated the distances between contigs and reconstructed 2,139 scaffolds. A total of 16,386 genes were identified after gene prediction with Prodigal and putative ID assignment with BLASTp against the NR protein database. We present a workable strategy for handling metagenomic NGS data in order to detect known and potentially unknown genes, and we demonstrate the computational efficiency of the de Bruijn graph de novo assembly algorithm by bioprospecting 21 genes encoding cellulose-degrading enzymes and 6 genes encoding xylan-degrading enzymes, with identity percentages ranging from 30.3% to 100%.
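    The de Bruijn graph construction underlying the assembly step can be sketched in a few lines (toy reads and k; real assemblers such as the one used here add coverage-based error pruning, graph simplification, and contig extraction on top of this graph):

```python
# Toy de Bruijn graph construction, illustrating the assembly principle
# described above.
from collections import defaultdict

def de_bruijn(reads, k):
    """Map each (k-1)-mer to the (k-1)-mers that follow it in the reads."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])  # edge: k-mer prefix -> suffix
    return graph

reads = ["ATGGCGT", "GGCGTGC", "CGTGCAA"]
for node, successors in sorted(de_bruijn(reads, k=4).items()):
    print(node, "->", ", ".join(sorted(successors)))
```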

    A Comparative Study of K-Spectrum-Based Error Correction Methods for Next-Generation Sequencing Data Analysis

    Background: Innumerable opportunities for new genomic research have been stimulated by advancements in high-throughput next-generation sequencing (NGS). However, the pitfall of NGS data abundance is the complication of distinguishing between true biological variants and sequencing-error alterations during downstream analysis. Many error correction methods have been developed to correct erroneous NGS reads before further analysis, but independent evaluation of the impact of such dataset features as read length, genome size, and coverage depth on their performance is lacking. This comparative study aims to investigate the strengths, weaknesses, and limitations of some of the newest k-spectrum-based methods and to provide recommendations for users in selecting suitable methods for specific NGS datasets. Methods: Six k-spectrum-based methods, i.e., Reptile, Musket, Bless, Bloocoo, Lighter, and Trowel, were compared using six simulated sets of paired-end Illumina sequencing data. These NGS datasets varied in coverage depth (10× to 120×), read length (36 to 100 bp), and genome size (4.6 to 143 Mb). The Error Correction Evaluation Toolkit (ECET) was employed to derive a suite of metrics (i.e., true positives, false positives, false negatives, recall, precision, gain, and F-score) for assessing the correction quality of each method. Results: Results from the computational experiments indicate that Musket had the best overall performance across the spectra of examined variants reflected in the six datasets. The lowest accuracy of Musket (F-score = 0.81) occurred for a dataset with a medium read length (56 bp), medium coverage (50×), and a small genome (5.4 Mb). The other five methods underperformed (F-score < 0.80) and/or failed to process one or more datasets. Conclusions: This study demonstrates that factors such as coverage depth, read length, and genome size may influence the performance of individual k-spectrum-based error correction methods. Thus, care must be taken in choosing appropriate methods for error correction of specific NGS datasets. Based on our comparative study, we recommend Musket as the top choice because of its consistently superior performance across all six testing datasets. Further extensive studies are warranted to assess these methods using experimental datasets generated by NGS platforms (e.g., 454, SOLiD, and Ion Torrent) under more diversified parameter settings (k-mer values and edit distances) and to compare them against other, non-k-spectrum-based classes of error correction methods.
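    The metrics listed above follow the standard definitions used in ECET-style evaluations; a minimal sketch, assuming the usual per-base TP/FP/FN conventions for error correction (gain = (TP − FP) / (TP + FN)):

```python
# Standard error-correction metrics from per-base counts: true positives
# (errors fixed), false positives (correct bases wrongly changed), and
# false negatives (errors missed).
def correction_metrics(tp: int, fp: int, fn: int) -> dict:
    recall = tp / (tp + fn)       # fraction of true errors corrected
    precision = tp / (tp + fp)    # fraction of corrections that were right
    gain = (tp - fp) / (tp + fn)  # net improvement over the raw reads
    f_score = 2 * precision * recall / (precision + recall)
    return {"recall": recall, "precision": precision,
            "gain": gain, "f_score": f_score}

print(correction_metrics(tp=9000, fp=500, fn=1500))
```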

    Rare Variant Association Testing by Adaptive Combination of P-values

    With the development of next-generation sequencing technology, there is a great demand for powerful statistical methods to detect rare variants (minor allele frequencies (MAFs) […]). We build on the […]-MidP method (Cheung et al., 2012, Genet Epidemiol 36: 675–685) and propose an approach (named 'adaptive combination of P-values for rare variant association testing', abbreviated as 'ADA') that adaptively combines per-site P-values with weights based on MAFs. Before combining P-values, we first impose a truncation threshold on the per-site P-values to guard against the noise caused by the inclusion of neutral variants. The ADA method is shown to outperform popular burden tests and non-burden tests under many scenarios. ADA is recommended for next-generation sequencing data analysis where many neutral variants may be included in a functional region.
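    A minimal sketch of the combination idea described above (the truncation threshold and the Madsen–Browning-style MAF weights are illustrative assumptions; the paper's exact statistic, threshold selection, and significance procedure may differ):

```python
# Illustrative sketch of combining per-site P-values with MAF-based weights
# after truncation, in the spirit of the ADA approach described above.
import math

def combine_pvalues(pvalues, mafs, truncation=0.10):
    score = 0.0
    for p, maf in zip(pvalues, mafs):
        if p > truncation:
            continue  # drop likely-neutral variants (the noise guard)
        # Madsen-Browning-style weight: rarer variants get larger weights.
        weight = 1.0 / math.sqrt(maf * (1.0 - maf))
        score += weight * (-math.log(p))
    return score  # in practice, significance is assessed by permutation

pvals = [0.001, 0.04, 0.30, 0.008, 0.65]
mafs = [0.002, 0.010, 0.004, 0.001, 0.020]
print(f"ADA-style score: {combine_pvalues(pvals, mafs):.2f}")
```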