
    Fuzzy-based Spectral Alignment for Correcting DNA Sequence from Next Generation Sequencer

    Next-generation sequencing technology generates short reads in large numbers within a relatively short time in a single run. Graph-based DNA sequence assembly is used to handle these big data in the assembly step, but it is very sensitive to DNA sequencing errors. This problem can be mitigated by performing an error correction step before the assembly process. This research proposes a fuzzy inference system (FIS) model based spectral alignment method that detects and corrects DNA sequencing errors. The spectral alignment technique was implemented as a pre-processing step before DNA sequence assembly, and the evaluation was conducted using the Velvet assembler, with the number of nodes it yields serving as the measure of error-correction success. The results show that the FIS model based spectral alignment produced a small number of nodes, and therefore it successfully corrected the DNA reads.
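
    The paper's FIS model is not reproduced in the abstract, but the spectral alignment idea it builds on is straightforward: count every k-mer across all reads, treat frequent k-mers as trusted, and edit bases so that untrusted k-mers become trusted ones. The Python sketch below shows that crisp-threshold baseline; an FIS would replace the fixed MIN_COUNT cutoff with fuzzy membership. K, MIN_COUNT, and all identifiers are illustrative assumptions, not the authors' code.

        from collections import Counter
        from itertools import product

        K = 5          # k-mer length (illustrative; real correctors use k around 15-31)
        MIN_COUNT = 3  # crisp trust threshold; the paper's FIS makes this fuzzy

        def kmer_spectrum(reads):
            """Count every k-mer across all reads to build the spectrum."""
            spectrum = Counter()
            for read in reads:
                for i in range(len(read) - K + 1):
                    spectrum[read[i:i + K]] += 1
            return spectrum

        def correct_read(read, spectrum):
            """Greedily substitute one base so a weak k-mer becomes trusted."""
            read = list(read)
            for i in range(len(read) - K + 1):
                kmer = "".join(read[i:i + K])
                if spectrum[kmer] >= MIN_COUNT:
                    continue  # already trusted
                for j, base in product(range(K), "ACGT"):
                    candidate = kmer[:j] + base + kmer[j + 1:]
                    if candidate != kmer and spectrum[candidate] >= MIN_COUNT:
                        read[i + j] = base  # accept the first trusted repair
                        break
            return "".join(read)

        reads = ["ACGTACGTAC"] * 5 + ["ACGTACCTAC"]  # last read carries one error
        spectrum = kmer_spectrum(reads)
        print(correct_read("ACGTACCTAC", spectrum))  # -> ACGTACGTAC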

    Large Genomes Assembly Using MAPREDUCE Framework

    Knowing the genome sequence of an organism is the essential first step toward understanding its genomic and genetic characteristics. Currently, whole genome shotgun (WGS) sequencing is the most widely used genome sequencing technique to determine the entire DNA sequence of an organism. Recent advances in next-generation sequencing (NGS) techniques have enabled biologists to generate large DNA sequences in a high-throughput and low-cost way. However, the assembly of NGS reads faces significant challenges due to short reads and an enormously high volume of data. Despite recent progress in genome assembly, current NGS assemblers cannot generate high-quality results or efficiently handle large genomes with billions of reads. In this research, we propose a new Genome Assembler based on MapReduce (GAMR), which tackles both limitations. GAMR is based on a bi-directed de Bruijn graph and implemented using the MapReduce framework. We designed a distributed algorithm for each step in GAMR, making it scalable to assembling large-scale genomes. We also propose novel gap-filling algorithms that improve assembly results, achieving higher accuracy and greater continuity. We evaluated the assembly performance of GAMR using benchmark data and compared it against other NGS assemblers. We also demonstrated the scalability of GAMR by using it to assemble the loblolly pine genome (~22 Gbp). The results showed that GAMR finished the assembly much faster and with a much lower requirement of computing resources.
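
    GAMR's distributed algorithms are not detailed in the abstract, so the sketch below only illustrates the general pattern such assemblers build on: a map phase that emits de Bruijn graph edges from reads and a reduce phase that groups them by source node. It is a single-process Python simulation; a real MapReduce job shards both phases across workers, and GAMR's bi-directed graph would additionally canonicalize reverse complements, which this toy version omits.

        from collections import defaultdict

        K = 4  # nodes are k-mers; each (k+1)-mer in a read defines an edge

        def map_phase(read):
            """Map: emit (source k-mer, target k-mer) edge pairs from one read."""
            for i in range(len(read) - K):
                edge = read[i:i + K + 1]
                yield edge[:-1], edge[1:]

        def reduce_phase(pairs):
            """Reduce: group edges by source node into adjacency lists."""
            graph = defaultdict(list)
            for src, dst in pairs:
                graph[src].append(dst)  # duplicates record edge coverage
            return graph

        reads = ["ACGTACGT", "CGTACGTT"]
        graph = reduce_phase(pair for read in reads for pair in map_phase(read))
        for node, targets in sorted(graph.items()):
            print(node, "->", targets)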

    SeedsGraph: an efficient assembler for next-generation sequencing data


    Fast and accurate genome anchoring using fuzzy hash maps

    Although hash-based approaches to sequence alignment and genome assembly are long established, their utility is predicated on the rapid identification of exact k-mers in a hash-map or similar data structure. We describe how a fuzzy hash-map can be applied to quickly and accurately align a prokaryotic genome to the reference genome of a related species. Using this technique, a draft genome of Mycoplasma genitalium, sampled at 1X coverage, was accurately anchored against the genome of Mycoplasma pneumoniae. The fuzzy approach to alignment ordered and orientated more than 65% of the reads from the draft genome in under 10 seconds, with an error rate of <1.5%. Without sacrificing execution speed, fuzzy hash-maps also provide a mechanism for error tolerance and variability in k-mer centric sequence alignment and assembly applications.
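
    The abstract does not spell out the fuzzy hash-map construction; one common way to get single-mismatch tolerance out of plain hash lookups, sketched below, is to index each reference k-mer under K masked variants so that two k-mers differing at one base always share a key. The masking scheme, K, and the identifiers are assumptions for illustration, not necessarily the paper's data structure.

        from collections import defaultdict

        K = 8

        def masked_keys(kmer):
            """Yield K copies of the k-mer, each with one position wildcarded.
            K-mers differing at a single base share at least one masked key."""
            for i in range(len(kmer)):
                yield kmer[:i] + "*" + kmer[i + 1:]

        def build_fuzzy_map(reference):
            index = defaultdict(set)
            for pos in range(len(reference) - K + 1):
                for key in masked_keys(reference[pos:pos + K]):
                    index[key].add(pos)
            return index

        def fuzzy_lookup(kmer, index):
            """Return reference positions matching within one mismatch."""
            hits = set()
            for key in masked_keys(kmer):
                hits |= index[key]
            return hits

        index = build_fuzzy_map("ACGTACGTTGCAACGT")
        print(fuzzy_lookup("ACGTACGA", index))  # last base differs; still anchors at 0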

    Pebble and Rock Band: Heuristic Resolution of Repeats and Scaffolding in the Velvet Short-Read de Novo Assembler

    BACKGROUND: Despite the short length of their reads, micro-read sequencing technologies have shown their usefulness for de novo sequencing. However, especially in eukaryotic genomes, complex repeat patterns are an obstacle to large assemblies. PRINCIPAL FINDINGS: We present a novel heuristic algorithm, Pebble, which uses paired-end read information to resolve repeats and scaffold contigs to produce large-scale assemblies. In simulations, we can achieve weighted median scaffold lengths (N50) of above 1 Mbp in Bacteria and above 100 kbp in more complex organisms. Using real datasets we obtained a 96 kbp N50 in Pseudomonas syringae and a unique 147 kbp scaffold of a ferret BAC clone. We also present an efficient algorithm called Rock Band for the resolution of repeats in the case of mixed-length assemblies, where different sequencing platforms are combined to obtain a cost-effective assembly. CONCLUSIONS: These algorithms extend the utility of short-read-only assemblies to large, complex genomes. They have been implemented and made available within the open-source Velvet short-read de novo assembler.
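
    Because the results above are reported as N50 values, a short reference implementation of that metric may be useful; this follows the standard definition the abstract alludes to (the length L such that scaffolds of length >= L cover at least half the assembly) and is not code from Velvet.

        def n50(lengths):
            """Weighted median scaffold length: the length L such that
            scaffolds of length >= L cover at least half the assembly."""
            total = sum(lengths)
            running = 0
            for length in sorted(lengths, reverse=True):
                running += length
                if 2 * running >= total:
                    return length

        print(n50([100, 200, 300, 400]))  # 400 + 300 covers half of 1000 -> 300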

    Field-based species identification of closely-related plants using real-time nanopore sequencing

    Advances in DNA sequencing and informatics have revolutionised biology over the past four decades, but technological limitations have left many applications unexplored. Recently, portable, real-time nanopore sequencing (RTnS) has become available. This offers opportunities to rapidly collect and analyse genomic data anywhere. However, generation of datasets from large, complex genomes has been constrained to laboratories. The portability and long DNA sequences of RTnS offer great potential for field-based species identification, but the feasibility and accuracy of these technologies for this purpose have not been assessed. Here, we show that field-based RTnS analysis of closely-related plant species (Arabidopsis spp.) has many advantages over laboratory-based high-throughput sequencing (HTS) methods for species-level identification and phylogenomics. Samples were collected and sequenced in a single day by RTnS using a portable, “al fresco” laboratory. Our analyses demonstrate that matching RTnS reads against a reference database correctly identifies unknown reads, enabling rapid and confident species identification. Individually annotated RTnS reads can be used to infer the evolutionary relationships of A. thaliana. Furthermore, hybrid genome assembly with RTnS and HTS reads substantially improved upon a genome assembled from HTS reads alone. Field-based RTnS makes real-time, rapid specimen identification and genome-wide analyses possible.
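
    The study identifies species by matching RTnS reads against a reference database; as a rough illustration of that idea (not the authors' pipeline, which aligned reads to reference genomes), the Python sketch below assigns each read to the reference sharing the most k-mers with it. K, min_hits, and the identifiers are assumptions for the example.

        K = 5  # toy value; long nanopore reads support much larger k

        def kmers(seq):
            return {seq[i:i + K] for i in range(len(seq) - K + 1)}

        def build_db(references):
            """Map each reference species name to its k-mer set."""
            return {name: kmers(seq) for name, seq in references.items()}

        def classify(read, db, min_hits=2):
            """Assign a read to the species sharing the most k-mers with it."""
            best = max(db, key=lambda name: len(kmers(read) & db[name]))
            return best if len(kmers(read) & db[best]) >= min_hits else "unclassified"

        db = build_db({"species_x": "ACGTACGTTGCA", "species_y": "TTGGCCAATTGG"})
        print(classify("ACGTACGTA", db))  # shares four k-mers -> species_x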

    Multiple Data Analyses and Statistical Approaches for Analyzing Data from Metagenomic Studies and Clinical Trials

    Metagenomics, also known as environmental genomics, is the study of the genomic content of a sample of organisms (microbes) obtained from a common habitat. Metagenomics and other “omics” disciplines have captured the attention of researchers for several decades. The effect of microbes on our body is a relevant concern for health studies. Many metagenomic studies examine microorganisms that inhabit niches in the human body, sometimes causing disease, and are often correlated with multiple treatment conditions. Regardless of the environment a sample comes from, the analyses are often aimed at determining either the presence or absence of specific species of interest in a given metagenome or comparing the biological diversity and the functional activity of a wider range of microorganisms within their communities. Comparisons across different environments, such as multiple patients with different conditions, multiple drugs, and multiple time points of the same treatment or the same patient, are especially important. Thus, however many hypotheses we have, genomics, bioinformatics, and statistics must work together to analyze and interpret these datasets in a meaningful way. This chapter provides an overview of different data analyses and statistical approaches (with example scenarios) for analyzing metagenomic samples from different medical projects or clinical trials.
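
    As one concrete example of the statistics such diversity comparisons rest on, the sketch below computes the Shannon diversity index for two hypothetical abundance tables; the sample values are invented for illustration, and the chapter itself covers a far broader range of analyses.

        import math

        def shannon_index(counts):
            """Shannon diversity H' = -sum(p_i * ln p_i) over taxon abundances."""
            total = sum(counts)
            return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

        sample_a = [40, 30, 20, 10]  # relatively even community
        sample_b = [85, 5, 5, 5]     # dominated by a single taxon
        print(shannon_index(sample_a))  # ~1.28, higher diversity
        print(shannon_index(sample_b))  # ~0.59, lower diversity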

    Bioinformatics approaches for hybrid de novo genome assembly

    De novo genome assembly, the computational process of reconstructing a genomic sequence from scratch by stitching together overlapping reads, plays a key role in computational biology and, to date, cannot be considered a solved problem. Many bioinformatics approaches are available to deal with the different types of data generated by diverse technologies. Assemblies relying on short-read data tend to be highly fragmented, reconstructing short contigs interrupted at repetitive regions; on the other hand, long-read based approaches still suffer from high sequencing error rates, which degrade the final consensus quality. This thesis aimed to assess the impact of different assembly approaches on the reconstruction of a highly repetitive genome, identifying the strengths and limiting the weaknesses of such approaches through the integration of orthogonal data types. Moreover, a benchmarking study was undertaken to improve the contiguity of this genome, describing the improvements obtained through the integration of additional data layers. Assemblies performed using short reads confirmed the limitations in reconstructing long sequences for both software tools adopted. The use of long reads improved genome assembly contiguity and reconstructed a greater number of gene models. Despite the gain in contiguity, the base-level accuracy of the long-read assembly still fell short; therefore, short reads were integrated into the assembly process, correcting up to 96% of the base-level errors present in the reconstructed sequences. To order and orient the polished contigs into longer scaffolds, data derived from three different technologies (linked reads, chromosome conformation capture, and optical mapping) were analysed. The best contiguity metrics were obtained using chromosome conformation data, which yielded chromosome-scale scaffolds. To evaluate these results, data derived from linked reads and optical mapping were used to identify putative misassemblies in the scaffolds. Both datasets allowed the identification of misassemblies, highlighting the importance of integrating data derived from orthogonal technologies in the de novo assembly process. This work underlines the importance of adopting bioinformatics approaches able to handle the data types generated by different technologies, so that results can be more accurately validated for the reconstruction of assemblies that could eventually be considered reference genomes.
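
    As an illustration of the short-read polishing step described above, the sketch below applies a minimal majority-vote scheme: each draft base is replaced by the most frequent short-read base observed at that position, given enough coverage. This is not the tooling used in the thesis (real polishers work from full alignments and handle indels); the alignment input format and min_depth are assumptions for the example.

        from collections import Counter

        def polish(draft, alignments, min_depth=3):
            """Replace each draft base with the majority short-read base.
            `alignments` holds (start, read) pairs already placed on the draft,
            a stand-in for output from a real aligner."""
            pileup = [Counter() for _ in draft]
            for start, read in alignments:
                for offset, base in enumerate(read):
                    if 0 <= start + offset < len(draft):
                        pileup[start + offset][base] += 1
            polished = []
            for i, column in enumerate(pileup):
                if sum(column.values()) >= min_depth:
                    polished.append(column.most_common(1)[0][0])
                else:
                    polished.append(draft[i])  # too shallow; keep the draft base
            return "".join(polished)

        draft = "ACGTXCGT"  # 'X' marks a draft consensus error
        reads = [(0, "ACGTA"), (2, "GTACG"), (3, "TACGT")]
        print(polish(draft, reads, min_depth=2))  # -> ACGTACGT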