
    Validating Paired-End Read Alignments in Sequence Graphs

    Graph-based non-linear reference structures such as variation graphs and colored de Bruijn graphs enable incorporation of the full genomic diversity within a population. However, transitioning from a simple string-based reference to graphs requires addressing many computational challenges, one of which concerns accurately mapping sequencing read sets to graphs. Paired-end Illumina sequencing is a commonly used sequencing platform in genomics, where the paired-end distance constraints allow disambiguation of repeats. Many recent works have explored provably good index-based and alignment-based strategies for mapping individual reads to graphs. However, validating distance constraints efficiently over graphs is not trivial, and existing sequence-to-graph mappers rely on heuristics. We introduce a mathematical formulation of the problem and provide a new algorithm to solve it exactly. We take advantage of the high sparsity of reference graphs and use sparse matrix-matrix multiplication (SpGEMM) to build an index that can be queried efficiently by a mapping algorithm to validate the distance constraints. The effectiveness of the algorithm is demonstrated using real reference graphs, including a human MHC variation graph and a pan-genome de Bruijn graph built from the genomes of 20 B. anthracis strains. While the one-time indexing step can take from a few minutes to a few hours with our algorithm, answering a million distance queries takes less than a second.
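    As a rough illustration of the indexing idea, the sketch below (Python with scipy.sparse; not the paper's algorithm) uses repeated sparse matrix-matrix products over a graph's adjacency matrix to precompute which vertex pairs are joined by a path of at most a given number of edges, so that a distance-constraint query becomes a single lookup. It counts hops rather than base pairs, and the function name build_reachability_index is a placeholder.

    import numpy as np
    from scipy.sparse import csr_matrix, identity

    def build_reachability_index(edges, num_vertices, max_hops):
        """Index R where R[u, v] != 0 iff v is reachable from u in <= max_hops edges."""
        rows, cols = zip(*edges)
        adj = csr_matrix((np.ones(len(edges), dtype=np.int64), (rows, cols)),
                         shape=(num_vertices, num_vertices))
        reach = identity(num_vertices, dtype=np.int64, format="csr")
        frontier = identity(num_vertices, dtype=np.int64, format="csr")
        for _ in range(max_hops):
            frontier = (frontier @ adj).sign()   # SpGEMM, then clamp entries to 0/1
            reach = reach.maximum(frontier)      # accumulate reachable pairs
        return reach

    # Toy query: is vertex 3 within 3 hops of vertex 0?
    edges = [(0, 1), (1, 2), (2, 3), (1, 4), (4, 3)]
    index = build_reachability_index(edges, num_vertices=5, max_hops=3)
    print(index[0, 3] != 0)   # True: 0 -> 1 -> 2 -> 3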

    Construction of reference chromosome-scale pseudomolecules for potato: integrating the potato genome with genetic and physical maps

    The genome of potato, a major global food crop, was recently sequenced. The work presented here details the integration of the potato reference genome (DM) with a new STS marker-based linkage map and other physical and genetic maps of potato and the closely related species tomato. Primary anchoring of the DM genome assembly was accomplished using a diploid segregating population, which was genotyped with several types of molecular genetic markers to construct a new ~936 cM linkage map comprising 2,469 marker loci. In silico anchoring approaches employed genetic and physical maps from the diploid potato genotype RH and tomato. This combined approach has allowed 951 superscaffolds to be ordered into pseudomolecules corresponding to the 12 potato chromosomes. These pseudomolecules represent 674 Mb (~93%) of the 723 Mb genome assembly and 37,482 (~96%) of the 39,031 predicted genes. The superscaffold order and orientation within the pseudomolecules are closely collinear with independently constructed high-density linkage maps. Comparisons between marker distribution and physical location reveal regions of greater and lesser recombination, as well as regions exhibiting significant segregation distortion. The work presented here has led to a greatly improved ordering of the potato reference genome superscaffolds into chromosomal 'pseudomolecules'.

    Advantages of distributed and parallel algorithms that leverage Cloud Computing platforms for large-scale genome assembly

    Background: The transition to Next Generation Sequencing (NGS) technologies has had numerous applications in plant, microbial, and human genomics during the past decade. However, NGS sequencing trades high read throughput for shorter read length, increasing the difficulty of genome assembly. This research presents a comparison of traditional versus Cloud computing-based genome assembly software, using as examples the Velvet and Contrail assemblers and reads from the genome sequence of the zebrafish (Danio rerio) model organism. Results: The first phase of the analysis involved a subset of the zebrafish data set (2X coverage); the best results were obtained using a k-mer size of 65, and Velvet took less time than Contrail to complete the assembly. In the next phase, genome assembly was attempted using the full dataset (192X read coverage): Velvet failed to complete on a 256 GB memory compute server, while Contrail completed but required 240 hours of computation. Conclusion: This research concludes that the size of the dataset and the available computing hardware should be taken into consideration when deciding which assembler to use. For a relatively small sequencing dataset, such as a microbial or small eukaryotic genome, the Velvet assembler is a good option. However, for larger datasets Velvet requires large-memory compute servers on the order of 1,000 GB or more. On the other hand, Contrail is implemented using Hadoop, which performs the assembly in parallel across the nodes of a compute cluster. Furthermore, Hadoop clusters can be rented on demand from Cloud computing providers, so Contrail can provide a simple and cost-effective way to assemble genomes from data generated at laboratories that lack the infrastructure or funds to build their own clusters.
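    For readers unfamiliar with the coverage figures quoted above, depth of coverage is simply total sequenced bases divided by genome size. The snippet below shows that arithmetic in Python; the 100 bp read length and the ~1.4 Gb zebrafish genome size are approximate, illustrative values rather than figures from this study.

    def coverage(num_reads, read_length, genome_size):
        """Expected depth of coverage: total sequenced bases / genome size."""
        return num_reads * read_length / genome_size

    zebrafish_genome = 1.4e9                      # ~1.4 Gb, approximate
    reads_for_2x = 2 * zebrafish_genome / 100     # 100 bp reads needed for ~2X depth
    print(f"{reads_for_2x:.2e} reads of 100 bp -> "
          f"{coverage(reads_for_2x, 100, zebrafish_genome):.1f}X coverage")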

    Efficient methods for read mapping

    DNA sequencing is a mainstay of biological and medical research. Modern sequencing machines can read millions of DNA fragments, sampling the underlying genomes at high throughput. Mapping the resulting reads to a reference genome is typically the first step in sequencing data analysis. The problem has many variants, as the reads can be short or long with a low or high error rate depending on the sequencing technology, and the reference can be a single genome or a graph representation of multiple genomes. Therefore, it is crucial to develop efficient computational methods for these different problem classes. Moreover, continually declining sequencing costs and increasing throughput pose challenges to previously developed methods and tools that cannot handle the growing volume of sequencing data. This dissertation seeks to advance the state of the art in the established field of read mapping by proposing more efficient and scalable read mapping methods as well as tackling emerging new problem areas. Specifically, we design ultra-fast methods to map two types of reads: short reads for high-throughput chromatin profiling and nanopore raw reads for targeted sequencing in real time. In tune with the characteristics of these types of reads, our methods can scale to larger sequencing data sets or map more reads correctly compared with state-of-the-art mapping software. Furthermore, we propose two algorithms for aligning sequences to graphs, which is the foundation of mapping reads to graph-based reference genomes. One algorithm improves the time complexity of existing sequence-to-graph alignment algorithms for linear or affine gap penalties. The other algorithm provides good empirical performance under the edit distance metric. Finally, we mathematically formulate the problem of validating paired-end read constraints when mapping sequences to graphs, and propose an exact algorithm that is also fast enough for practical use.
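    To make the sequence-to-graph alignment problem mentioned above concrete, the sketch below gives a textbook-style dynamic program for the restricted case of an acyclic graph whose vertices carry single characters, scored under unit edit costs with free leading and trailing graph characters. It is only an illustration of the problem, not the dissertation's (more general and more efficient) algorithms; the function name and the toy graph are placeholders.

    def align_to_dag(query, labels, preds, topo_order):
        """Minimum edit distance between query and some path in a DAG.
        labels[v]: character at vertex v; preds[v]: predecessor vertices;
        topo_order: vertices in topological order. Graph prefix/suffix are free."""
        m = len(query)
        INF = float("inf")
        start = list(range(m + 1))                      # virtual row before any graph character
        dp = {v: [0] + [INF] * m for v in topo_order}   # dp[v][0] = 0: free graph prefix
        for v in topo_order:
            for j in range(1, m + 1):
                sub = 0 if query[j - 1] == labels[v] else 1
                best = dp[v][j - 1] + 1                  # insertion: consume query char only
                for prev in ([dp[u] for u in preds[v]] or [start]):
                    best = min(best,
                               prev[j - 1] + sub,        # match/mismatch at v
                               prev[j] + 1)              # deletion: consume graph char only
                dp[v][j] = best
        return min(dp[v][m] for v in topo_order)

    # Toy graph spelling "GAT", with an alternative middle vertex "C".
    labels = {0: "G", 1: "A", 2: "C", 3: "T"}
    preds = {0: [], 1: [0], 2: [0], 3: [1, 2]}
    print(align_to_dag("GAT", labels, preds, topo_order=[0, 1, 2, 3]))   # 0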

    Enzyme selection for optical mapping is hard

    The process of assembling a genome without access to a reference genome is prone to a type of error called a misassembly. These errors are difficult to detect and can mimic true biological variation. Optical mapping data, which are generated by applying digestion enzymes to a genome, have been shown to have the potential to reduce misassembly errors in draft genomes. In this paper, we formulate the problem of selecting optimal digestion enzymes to create the most informative optical map. We show that this problem is NP-hard and W[1]-hard. We also propose and evaluate a machine learning method, using a support vector machine with feature reduction, to estimate the optimal enzymes. Using this method, we were able to predict two optimal enzymes exactly and estimate three more within reasonable similarity.
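    As a rough picture of the learning setup described above, the sketch below wires feature reduction and a support vector machine together with scikit-learn on synthetic data. The feature matrix, labels, and parameters are placeholders, not the paper's actual enzyme digestion features or model.

    import numpy as np
    from sklearn.pipeline import Pipeline
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 50))                 # stand-in for per-enzyme digestion statistics
    y = (X[:, 0] + X[:, 3] > 0).astype(int)        # stand-in label: "enzyme is near-optimal"

    model = Pipeline([
        ("select", SelectKBest(f_classif, k=10)),  # feature reduction
        ("svm", SVC(kernel="rbf", C=1.0)),         # support vector classifier
    ])
    print(cross_val_score(model, X, y, cv=5).mean())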

    Structural variant calling: the long and the short of it.

    Recent research into structural variants (SVs) has established their importance to medicine and molecular biology, elucidating their role in various diseases, regulation of gene expression, ethnic diversity, and large-scale chromosome evolution, giving rise to the differences within populations and among species. Nevertheless, characterizing SVs and determining the optimal approach for a given experimental design remains a computational and scientific challenge. Multiple approaches have emerged to target various SV classes, zygosities, and size ranges. Here, we review these approaches with respect to their ability to infer SVs across the full spectrum of large, complex variations and present computational methods for each approach.