
    Long walk to genomics: history and current approaches to genome sequencing and assembly

    Genomes represent the starting point of genetic studies. Since the discovery of the DNA structure, scientists have devoted great efforts to determining genome sequences exactly. In this review we provide a comprehensive historical background of the improvements in DNA sequencing technologies that have accompanied the major milestones in genome sequencing and assembly, ranging from early sequencing methods to Next-Generation Sequencing platforms. We then focus on the advantages and challenges of the current technologies and approaches, collectively known as Third Generation Sequencing. As these technical advancements have been accompanied by progress in analytical methods, we also review the bioinformatic tools currently employed in de novo genome assembly, as well as some applications of Third Generation Sequencing technologies and high-quality reference genomes.

    FANSe: an accurate algorithm for quantitative mapping of large scale sequencing reads

    The most crucial step in data processing for high-throughput sequencing applications is the accurate and sensitive alignment of the sequencing reads to reference genomes or transcriptomes. Accurate detection of insertions and deletions (indels) and of errors introduced by the sequencing platform or by misreading of modified nucleotides is essential for the quantitative processing of RNA-based sequencing (RNA-Seq) datasets and for the identification of genetic variations and modification patterns. We developed FANSe, a new, fast and accurate algorithm for nucleic acid sequence analysis with adjustable mismatch allowance settings and the ability to handle indels, which accurately and quantitatively maps millions of reads to small or large reference genomes. It is a seed-based algorithm that uses the whole read information for mapping; high sensitivity and low ambiguity are achieved by using short, non-overlapping seeds. Furthermore, FANSe uses a hotspot score to prioritize the processing of highly probable matches and implements a modified Smith–Waterman refinement with a reduced scoring matrix to accelerate the calculation without compromising sensitivity. The FANSe algorithm stably processes datasets from various sequencing platforms, masked or unmasked and small or large genomes. It shows remarkable coverage of low-abundance mRNAs, which is important for quantitative processing of RNA-Seq datasets.
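    To make the seed-based strategy concrete, the sketch below indexes a reference by fixed-length seeds, splits each read into non-overlapping seeds, and lets seed hits vote for candidate alignment positions (a crude stand-in for FANSe's hotspot score). Seed length, data structures and names are illustrative assumptions, not FANSe's actual implementation.

```python
# Illustrative sketch of seed-based read mapping with a "hotspot" vote count,
# loosely following the strategy described above (non-overlapping seeds,
# candidate loci prioritized by seed hits). Not FANSe's actual code.
from collections import defaultdict

SEED_LEN = 12  # assumed seed length for illustration

def build_seed_index(reference: str) -> dict:
    """Index every position of every SEED_LEN-mer in the reference."""
    index = defaultdict(list)
    for pos in range(len(reference) - SEED_LEN + 1):
        index[reference[pos:pos + SEED_LEN]].append(pos)
    return index

def candidate_loci(read: str, index: dict) -> list:
    """Split the read into non-overlapping seeds and vote for candidate
    alignment start positions; more votes = stronger 'hotspot'."""
    votes = defaultdict(int)
    for offset in range(0, len(read) - SEED_LEN + 1, SEED_LEN):
        seed = read[offset:offset + SEED_LEN]
        for pos in index.get(seed, []):
            votes[pos - offset] += 1  # candidate start of the read on the reference
    # A full mapper would refine each candidate with a (banded) Smith-Waterman
    # alignment allowing mismatches and indels; here we only rank the loci.
    return sorted(votes, key=votes.get, reverse=True)

reference = "ACGTACGTTAGCACGTACGTTAGCCGTA" * 10
index = build_seed_index(reference)
print(candidate_loci("ACGTACGTTAGCACGTACGTTAGC", index)[:3])
```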

    Characterization of structural chromosomal variants by massively parallel sequencing

    Chromosomal structural variants (SVs) such as translocations, inversions, deletions, and duplications are rearrangements of one or several DNA molecules. SVs are widespread across the human genome, and each individual carries thousands of SVs of different types and sizes. SVs are known to contribute both to phenotypic diversity and to disease traits, and are therefore of interest in multiple fields, including rare disease research and clinical diagnostics. Herein, we present five studies focused on the analysis of SVs using whole genome sequencing (WGS). The project has increased our knowledge regarding the frequency, structure and mechanisms of formation of structural variants in the human genome. In Papers I, II, and IV, we develop and evaluate software for the detection and analysis of SVs using WGS data. In Papers II, III and IV, we utilize WGS data to delineate the structure and determine the mechanism of formation of several complex SVs. In Paper II, we compare multiple sequencing technologies and apply them to solve the structure of three complex chromosomal rearrangements. Lastly, in Paper V, we validate the use of SV calling from WGS as a routine test in rare disease diagnostics. Through these studies, we developed and tested tools suitable for WGS SV analysis in a clinical setting. These tools are now part of the routine clinical pipeline, and many of them are used by researchers and clinics around the world.

    G-CNV: A GPU-based tool for preparing data to detect CNVs with read-depth methods

    Copy number variations (CNVs) are the most prevalent type of structural variation (SV) in the human genome and are involved in a wide range of common human diseases. Different computational methods have been devised to detect this type of SV and to study how CNVs are implicated in human diseases. Recently, computational methods based on high-throughput sequencing (HTS) have been used increasingly. The majority of these methods focus on mapping short-read sequences generated from a donor against a reference genome to detect signatures distinctive of CNVs. In particular, read-depth based methods detect CNVs by analyzing genomic regions whose read depth differs significantly from that of the other regions. The analysis pipeline of these methods consists of four main stages: (i) data preparation, (ii) data normalization, (iii) CNV region identification, and (iv) copy number estimation. However, available tools do not support most of the operations required at the first two stages of this pipeline. Typically, they start the analysis by building the read-depth signal from pre-processed alignments; therefore, third-party tools must be used to perform most of the preliminary operations required to build the read-depth signal. These data-intensive operations can be efficiently parallelized on graphics processing units (GPUs). In this article, we present G-CNV, a GPU-based tool devised to perform the common operations required at the first two stages of the analysis pipeline. G-CNV is able to filter low-quality read sequences, mask low-quality nucleotides, remove adapter sequences, remove duplicated read sequences, map the short reads, resolve multiple-mapping ambiguities, build the read-depth signal, and normalize it. G-CNV can be used as a third-party tool to prepare data for the subsequent read-depth signal generation and analysis. Moreover, it can also be integrated into CNV detection tools to generate read-depth signals.
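    As a rough illustration of the read-depth signal that stages (i)–(ii) feed into, the sketch below bins mapped read start positions into fixed-size windows and normalizes by the mean depth. Bin size, normalization choice, and function names are assumptions for illustration; they do not reproduce G-CNV's GPU implementation.

```python
# Illustrative sketch of building and normalizing a read-depth signal from
# mapped read start positions, the kind of pre-processing G-CNV targets.
# Bin size and mean-normalization are assumptions for illustration only.
import numpy as np

def read_depth_signal(read_starts, chrom_length, bin_size=1000):
    """Count reads per fixed-size window along a chromosome."""
    n_bins = chrom_length // bin_size + 1
    signal = np.zeros(n_bins)
    for start in read_starts:
        signal[start // bin_size] += 1
    return signal

def normalize(signal):
    """Scale the signal so its mean is 1; windows well above or below 1 become
    candidate gains or losses for downstream read-depth CNV callers."""
    mean = signal.mean()
    return signal / mean if mean > 0 else signal

starts = [150, 980, 1020, 1100, 2500, 2600, 2700, 2750]  # toy mapped positions
depth = normalize(read_depth_signal(starts, chrom_length=5000, bin_size=1000))
print(depth)
```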

    Utilization of Probabilistic Models in Short Read Assembly from Second-Generation Sequencing

    With the advent of cheaper and faster DNA sequencing technologies, assembly methods have changed greatly. Instead of outputting reads that are thousands of base pairs long, new sequencers parallelize the task by producing read lengths between 35 and 400 base pairs. Reconstructing an organism's genome from these millions of reads is a computationally expensive task. Our algorithm solves this problem by organizing and indexing the reads using n-grams, which are short, fixed-length DNA sequences of length n. These n-grams are used to efficiently locate putative read joins, thereby eliminating the need to perform an exhaustive search over all possible read pairs. Our goal was to develop a novel n-gram method for the assembly of genomes from next-generation sequencers. Specifically, a probabilistic, iterative approach was used to determine the most likely reads to join, through the development of a new metric that models the probability of any two arbitrary reads being joined together. Tests were run using simulated short-read data based on randomly created genomes ranging in length from 10,000 to 100,000 nucleotides with 16 to 20x coverage. We were able to successfully re-assemble entire genomes up to 100,000 nucleotides in length.
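    The sketch below illustrates the n-gram indexing idea: reads are indexed by their n-grams so that only reads sharing an n-gram are considered as join candidates, here scored by a simple exact suffix–prefix overlap rather than the probabilistic metric developed in the paper. The n-gram length and the toy reads are assumptions for illustration.

```python
# Illustrative sketch of n-gram indexing to find putative read joins. The
# overlap score here is a plain exact suffix->prefix match, not the
# probabilistic joining metric described in the abstract above.
from collections import defaultdict

N = 8  # assumed n-gram length

def index_reads(reads):
    """Map each n-gram to the set of reads containing it."""
    index = defaultdict(set)
    for i, read in enumerate(reads):
        for pos in range(len(read) - N + 1):
            index[read[pos:pos + N]].add(i)
    return index

def best_join(read_id, reads, index):
    """Among reads sharing an n-gram with reads[read_id], pick the one with
    the longest exact suffix->prefix overlap."""
    read = reads[read_id]
    candidates = set()
    for pos in range(len(read) - N + 1):
        candidates |= index[read[pos:pos + N]]
    candidates.discard(read_id)
    best, best_len = None, 0
    for other_id in candidates:
        other = reads[other_id]
        for k in range(min(len(read), len(other)), N - 1, -1):
            if read.endswith(other[:k]):
                if k > best_len:
                    best, best_len = other_id, k
                break
    return best, best_len

reads = ["ACGTACGTTAGCCGTA", "TTAGCCGTAACGGT", "CGTAACGGTTAACC"]
print(best_join(0, reads, index_reads(reads)))  # expect (1, 9): 9-bp overlap
```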

    Development of bioinformatics tools for the rapid and sensitive detection of known and unknown pathogens from next generation sequencing data

    Infectious diseases still remain one of the main causes of death across the globe. Despite huge advances in clinical diagnostics, establishing a clear etiology remains impossible in a proportion of cases. Since the emergence of next generation sequencing (NGS), a multitude of new research fields based on this technology have evolved. Especially its application in metagenomics – the study of genomic material taken directly from its environment – has led to a rapid development of new applications. Metagenomic NGS has proven to be a promising tool in the field of pathogen-related research and diagnostics. In this thesis, I present different approaches for the detection of known and the discovery of unknown pathogens from NGS data. These contributions subdivide into three newly developed methods and one publication on a real-world use case of the methodology we developed and the data analysis based on it. First, I present LiveKraken, a real-time read classification tool based on the core algorithm of Kraken. LiveKraken uses streams of raw data from Illumina sequencers to classify reads taxonomically. This way, we are able to produce results identical to those of Kraken the moment the sequencer finishes. We are furthermore able to provide comparable results in early stages of a sequencing run, allowing up to a week of sequencing time to be saved. While the number of classified reads grows over time, false classifications appear in negligible numbers, and the proportions of identified taxa are affected only to a minor extent. In the second project, we designed and implemented PathoLive, a real-time diagnostics pipeline which allows the detection of pathogens from clinical samples before the sequencing procedure is finished. We adapted the core algorithm of HiLive, a real-time read mapper, and enhanced its accuracy for our use case. Furthermore, probably irrelevant sequences are automatically marked. The results are visualized in an interactive taxonomic tree that provides an intuitive overview and detailed metrics regarding the relevance of each identified pathogen. Testing PathoLive on the sequencing of a real plasma sample spiked with viruses, we could show that we ranked the results more accurately throughout the complete sequencing run than any other tested tool did at the end of the sequencing run. With PathoLive, we shift the focus of NGS-based diagnostics from read quantification towards a more meaningful assessment of results in unprecedented turnaround time. The third project aims at the detection of novel pathogens from NGS data. We developed RAMBO-K, a tool which allows rapid and sensitive removal of unwanted host sequences from NGS datasets. RAMBO-K is faster than any tool we tested, while showing consistently high sensitivity and specificity across different datasets. RAMBO-K rapidly and reliably separates reads from different species. It is suitable as a straightforward standard solution for workflows dealing with mixed datasets. In the fourth project, we used RAMBO-K as well as several other data analyses to discover Berlin squirrelpox virus, a divergent new poxvirus establishing a new genus within the Poxviridae. Near Berlin, Germany, several juvenile red squirrels (Sciurus vulgaris) were found with moist, crusty skin lesions. Histology, electron microscopy, and cell culture isolation revealed an orthopoxvirus-like infection. After standard workflows yielded no significant results, poxviral reads were assigned using RAMBO-K, enabling the assembly of the genome of the novel virus.
    With these projects, we established three new application-related methods, each of which closes a different research gap. Taken together, we enhance the available repertoire of NGS-based pathogen research tools and facilitate and accelerate a variety of research projects.
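    For readers unfamiliar with Kraken-style classification, the toy sketch below shows the underlying idea: each k-mer of a read is looked up in a taxonomically labelled database and the read is assigned by a simple majority vote. The real tools resolve hits to lowest-common-ancestor taxa in a compact prebuilt database and, in LiveKraken's case, stream partial Illumina output; the dictionary, k-mer length, and taxa here are purely illustrative.

```python
# Toy sketch of k-mer-based taxonomic read classification in the spirit of
# Kraken/LiveKraken. A plain dict and a majority vote stand in for the real
# tools' LCA-labelled k-mer database and classification logic.
from collections import Counter

K = 11  # assumed k-mer length

def classify(read: str, kmer_to_taxon: dict) -> str:
    """Assign the read to the taxon hit by most of its k-mers, or
    'unclassified' if no k-mer is found in the database."""
    hits = Counter()
    for pos in range(len(read) - K + 1):
        taxon = kmer_to_taxon.get(read[pos:pos + K])
        if taxon is not None:
            hits[taxon] += 1
    return hits.most_common(1)[0][0] if hits else "unclassified"

# Hypothetical miniature database: k-mers labelled with the taxon they came from.
db = {"ACGTACGTTAG": "virus_A", "CGTACGTTAGC": "virus_A", "TTTTAAAACCC": "host"}
print(classify("ACGTACGTTAGC", db))  # -> virus_A
```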

    Investigating the Validity and Significance of Variant Calls by Next Generation Sequencing (NGS)

    Whole genome or exome sequencing enables the fast generation of large volumes of data and is currently a hot topic in research. This technique has been the subject of extensive research and has vast applications in healthcare and medicine. Next generation sequencing (NGS) has the advantage of producing a far larger volume of reads than the traditional method of Sanger sequencing. NGS enables the identification of genetic disease-causing variants, thus improving the quality of healthcare, diagnostics and biomedical research. One of the major challenges of NGS is the analysis of the large volume of resulting data. The diversity in DNA library preparation methods for the various available platforms may result in data inaccuracies. Furthermore, the disparity in variant calling accuracies resulting from the use of diverse algorithms complicates the process of NGS data analysis. As a result, there is a large possibility of false positive and/or false negative results due to alignment and/or chemistry errors. In this project, we utilized the MiSeq platform, selected for its cost-effectiveness and ability to provide rapid genetic analysis. An autism panel targeting 101 genes specifically linked to autism was used in this study to assist the investigation of genomic features associated with autism. Here, we hypothesized that we could devise NGS analysis criteria to distinguish false positive and/or false negative sequencing calls and thereby improve the quality of the generated sequencing data. A cohort of four autism patients of Arab descent was used as a model for this research. We were able to confirm our hypothesized criteria by validating the detected variants with Sanger sequencing.
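    As an illustration of what such analysis criteria can look like, the sketch below flags calls with low depth, low quality, or skewed allele balance as candidates for Sanger verification. The thresholds and field names are assumptions for illustration and are not the criteria devised in this study.

```python
# Illustrative sketch of simple variant-call filtering criteria of the kind
# used to flag likely false positives before Sanger confirmation. Thresholds
# and field names are assumptions, not the criteria developed in this study.
def flag_variant(depth: int, qual: float, alt_reads: int) -> str:
    """Return a crude confidence label for a called variant."""
    allele_balance = alt_reads / depth if depth else 0.0
    if depth < 20 or qual < 30 or allele_balance < 0.25:
        return "likely_false_positive"  # candidate for Sanger verification
    return "pass"

print(flag_variant(depth=85, qual=250.0, alt_reads=40))  # pass
print(flag_variant(depth=12, qual=45.0, alt_reads=3))    # likely_false_positive
```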

    Feature-by-Feature – Evaluating De Novo Sequence Assembly

    The whole-genome sequence assembly (WGSA) problem is one of the most studied problems in computational biology. Despite the availability of a plethora of tools (i.e., assemblers), all claiming to have solved the WGSA problem, little has been done to systematically compare their accuracy and power. Traditional methods rely on standard metrics and read simulation: on the one hand, metrics like N50 and the number of contigs focus only on size without proportionately emphasizing information about the correctness of the assembly; on the other hand, comparisons performed on simulated datasets can be highly biased by the non-realistic assumptions in the underlying read generator. Recently the Feature Response Curve (FRC) method was proposed to assess overall assembly quality and correctness: FRC transparently captures the trade-off between contig quality and contig size. Nevertheless, the relationships among the different features and their relative importance remain unknown; in particular, FRC cannot account for the correlation among the different features. We analyzed the correlation among different features in order to better describe their relationships and their importance in gauging assembly quality and correctness. In particular, using multivariate techniques such as principal and independent component analysis, we were able to estimate the "excess dimensionality" of the feature space. Moreover, principal component analysis allowed us to show how poorly the acclaimed N50 metric describes assembly quality. Applying independent component analysis, we identified a subset of features that better describes assembler performance. We demonstrated that by focusing on a reduced set of highly informative features we can use the FRC curve to better describe and compare the performance of different assemblers. Moreover, as a by-product of our analysis, we discovered how often evaluation based on simulated data, obtained with state-of-the-art simulators, leads to unrealistic results.
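    The sketch below shows how principal component analysis can be used to estimate the effective dimensionality of an assembly feature space: the cumulative explained variance indicates how many components are needed to capture most of the variation. The feature matrix is random toy data, not the FRC features analyzed in the paper.

```python
# Illustrative sketch of using PCA to gauge the "excess dimensionality" of an
# assembly feature space, as described above. The feature matrix is random
# toy data; column meanings are placeholders (e.g. N50, contig count, ...).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Rows = assemblies, columns = per-assembly features.
features = rng.normal(size=(30, 10))
features[:, 3] = features[:, 0] * 2 + rng.normal(scale=0.1, size=30)  # a correlated feature

pca = PCA().fit(features)
explained = np.cumsum(pca.explained_variance_ratio_)
# The number of components needed to reach ~90% variance suggests how many
# "effective" dimensions the feature space really has.
print(np.argmax(explained >= 0.9) + 1, "components explain >=90% of the variance")
```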

    Methods to improve the accuracy of next-generation sequencing

    Next-generation sequencing (NGS) is present in all fields of life science; it has greatly promoted the development of basic research and is gradually being applied in clinical diagnostics. However, the cost and throughput advantages of next-generation sequencing are offset by large trade-offs with respect to read length and accuracy. Specifically, its high error rate makes it extremely difficult to detect SNPs or low-abundance mutations, limiting its clinical applications, such as pharmacogenomics studies based primarily on SNPs and early clinical diagnosis based primarily on low-abundance mutations. Currently, Sanger sequencing is still considered the gold standard due to its high accuracy, so the results of next-generation sequencing require verification by Sanger sequencing in clinical practice. In order to maintain high-quality next-generation sequencing data, a variety of improvements at the levels of template preparation, sequencing strategy and data processing have been developed. This study summarized the general procedures of next-generation sequencing platforms, highlighting the improvements involved in eliminating errors at each step. Furthermore, the challenges and future development of next-generation sequencing in clinical applications were discussed.
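    As a minimal example of a data-processing improvement of the kind surveyed here, the sketch below trims low-quality 3' ends of reads based on Phred scores before downstream analysis. The threshold and logic are illustrative and are not taken from any specific platform or published method.

```python
# Toy sketch of one data-processing step that reduces NGS error rates:
# trimming low-quality 3' ends of reads based on Phred scores. The threshold
# and logic are illustrative assumptions, not a specific published method.
def trim_low_quality_tail(seq: str, phred: list, min_q: int = 20) -> str:
    """Cut the read where base qualities drop below min_q at the 3' end."""
    end = len(seq)
    while end > 0 and phred[end - 1] < min_q:
        end -= 1
    return seq[:end]

print(trim_low_quality_tail("ACGTACGTTA", [38, 37, 36, 35, 30, 28, 25, 12, 8, 5]))
# -> 'ACGTACG' (the three lowest-quality trailing bases are removed)
```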