
    Validating Paired-End Read Alignments in Sequence Graphs

    Graph-based non-linear reference structures such as variation graphs and colored de Bruijn graphs enable incorporation of the full genomic diversity within a population. However, transitioning from a simple string-based reference to graphs requires addressing many computational challenges, one of which concerns accurately mapping sequencing read sets to graphs. Paired-end Illumina sequencing is a commonly used sequencing platform in genomics, where the paired-end distance constraints allow disambiguation of repeats. Many recent works have explored provably good index-based and alignment-based strategies for mapping individual reads to graphs. However, validating distance constraints efficiently over graphs is not trivial, and existing sequence-to-graph mappers rely on heuristics. We introduce a mathematical formulation of the problem and provide a new algorithm to solve it exactly. We take advantage of the high sparsity of reference graphs and use sparse matrix-matrix multiplication (SpGEMM) to build an index that can be queried efficiently by a mapping algorithm to validate the distance constraints. The effectiveness of the algorithm is demonstrated using real reference graphs, including a human MHC variation graph and a pan-genome de Bruijn graph built from the genomes of 20 B. anthracis strains. While the one-time indexing step can take from a few minutes to a few hours with our algorithm, answering a million distance queries takes less than a second.
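    As a rough illustration of the indexing idea, the sketch below (Python with scipy.sparse; not the paper's algorithm) uses repeated sparse matrix-matrix products over a graph's adjacency matrix to precompute which vertex pairs are joined by a path of at most a given number of edges, so that a distance-constraint query becomes a single lookup. It counts hops rather than base pairs, and the function name build_reachability_index is a placeholder.

    import numpy as np
    from scipy.sparse import csr_matrix, identity

    def build_reachability_index(edges, num_vertices, max_hops):
        """Index R where R[u, v] != 0 iff v is reachable from u in <= max_hops edges."""
        rows, cols = zip(*edges)
        adj = csr_matrix((np.ones(len(edges), dtype=np.int64), (rows, cols)),
                         shape=(num_vertices, num_vertices))
        reach = identity(num_vertices, dtype=np.int64, format="csr")
        frontier = identity(num_vertices, dtype=np.int64, format="csr")
        for _ in range(max_hops):
            frontier = (frontier @ adj).sign()   # SpGEMM, then clamp entries to 0/1
            reach = reach.maximum(frontier)      # accumulate reachable pairs
        return reach

    # Toy query: is vertex 3 within 3 hops of vertex 0?
    edges = [(0, 1), (1, 2), (2, 3), (1, 4), (4, 3)]
    index = build_reachability_index(edges, num_vertices=5, max_hops=3)
    print(index[0, 3] != 0)   # True: 0 -> 1 -> 2 -> 3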

    Construction of reference chromosome-scale pseudomolecules for potato: integrating the potato genome with genetic and physical maps

    The genome of potato, a major global food crop, was recently sequenced. The work presented here details the integration of the potato reference genome (DM) with a new STS marker-based linkage map and other physical and genetic maps of potato and the closely related species tomato. Primary anchoring of the DM genome assembly was accomplished using a diploid segregating population, which was genotyped with several types of molecular genetic markers to construct a new ~936 cM linkage map comprising 2,469 marker loci. In silico anchoring approaches employed genetic and physical maps from the diploid potato genotype RH and tomato. This combined approach has allowed 951 superscaffolds to be ordered into pseudomolecules corresponding to the 12 potato chromosomes. These pseudomolecules represent 674 Mb (~93%) of the 723 Mb genome assembly and 37,482 (~96%) of the 39,031 predicted genes. The superscaffold order and orientation within the pseudomolecules are closely collinear with independently constructed high-density linkage maps. Comparisons between marker distribution and physical location reveal regions of greater and lesser recombination, as well as regions exhibiting significant segregation distortion. The work presented here has led to a greatly improved ordering of the potato reference genome superscaffolds into chromosomal 'pseudomolecules'.

    Advantages of distributed and parallel algorithms that leverage Cloud Computing platforms for large-scale genome assembly

    Background: The transition to Next Generation Sequencing (NGS) technologies has had numerous applications in plant, microbial, and human genomics during the past decade. However, NGS sequencing trades high read throughput for shorter read length, increasing the difficulty of genome assembly. This research presents a comparison of traditional versus Cloud computing-based genome assembly software, using as examples the Velvet and Contrail assemblers and reads from the genome sequence of the zebrafish (Danio rerio) model organism. Results: The first phase of the analysis involved a subset of the zebrafish data set (2X coverage); the best results were obtained using a k-mer size of 65, and Velvet took less time than Contrail to complete the assembly. In the next phase, genome assembly was attempted using the full dataset (192X read coverage): Velvet failed to complete on a 256 GB memory compute server, while Contrail completed but required 240 hours of computation. Conclusion: This research concludes that the size of the dataset and the available computing hardware should be taken into consideration when deciding which assembler to use. For a relatively small sequencing dataset, such as a microbial or small eukaryotic genome, the Velvet assembler is a good option. However, for larger datasets Velvet requires large-memory compute servers on the order of 1,000 GB or more. On the other hand, Contrail is implemented using Hadoop, which performs the assembly in parallel across the nodes of a compute cluster. Furthermore, Hadoop clusters can be rented on demand from Cloud computing providers, so Contrail can provide a simple and cost-effective way to assemble genomes from data generated at laboratories that lack the infrastructure or funds to build their own clusters.
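    For readers unfamiliar with the coverage figures quoted above, depth of coverage is simply total sequenced bases divided by genome size. The snippet below shows that arithmetic in Python; the 100 bp read length and the ~1.4 Gb zebrafish genome size are approximate, illustrative values rather than figures from this study.

    def coverage(num_reads, read_length, genome_size):
        """Expected depth of coverage: total sequenced bases / genome size."""
        return num_reads * read_length / genome_size

    zebrafish_genome = 1.4e9                      # ~1.4 Gb, approximate
    reads_for_2x = 2 * zebrafish_genome / 100     # 100 bp reads needed for ~2X depth
    print(f"{reads_for_2x:.2e} reads of 100 bp -> "
          f"{coverage(reads_for_2x, 100, zebrafish_genome):.1f}X coverage")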

    Efficient methods for read mapping

    DNA sequencing is a mainstay of biological and medical research. Modern sequencing machines can read millions of DNA fragments, sampling the underlying genomes at high throughput. Mapping the resulting reads to a reference genome is typically the first step in sequencing data analysis. The problem has many variants, as the reads can be short or long with a low or high error rate depending on the sequencing technology, and the reference can be a single genome or a graph representation of multiple genomes. Therefore, it is crucial to develop efficient computational methods for these different problem classes. Moreover, continually declining sequencing costs and increasing throughput pose challenges to previously developed methods and tools that cannot handle the growing volume of sequencing data. This dissertation seeks to advance the state of the art in the established field of read mapping by proposing more efficient and scalable read mapping methods as well as tackling emerging new problem areas. Specifically, we design ultra-fast methods to map two types of reads: short reads for high-throughput chromatin profiling and nanopore raw reads for targeted sequencing in real time. In tune with the characteristics of these types of reads, our methods can scale to larger sequencing data sets or map more reads correctly compared with state-of-the-art mapping software. Furthermore, we propose two algorithms for aligning sequences to graphs, which is the foundation of mapping reads to graph-based reference genomes. One algorithm improves the time complexity of existing sequence-to-graph alignment algorithms for linear or affine gap penalties. The other algorithm provides good empirical performance under the edit distance metric. Finally, we mathematically formulate the problem of validating paired-end read constraints when mapping sequences to graphs, and propose an exact algorithm that is also fast enough for practical use.
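    To make the sequence-to-graph alignment problem mentioned above concrete, the sketch below gives a textbook-style dynamic program for the restricted case of an acyclic graph whose vertices carry single characters, scored under unit edit costs with free leading and trailing graph characters. It is only an illustration of the problem, not the dissertation's (more general and more efficient) algorithms; the function name and the toy graph are placeholders.

    def align_to_dag(query, labels, preds, topo_order):
        """Minimum edit distance between query and some path in a DAG.
        labels[v]: character at vertex v; preds[v]: predecessor vertices;
        topo_order: vertices in topological order. Graph prefix/suffix are free."""
        m = len(query)
        INF = float("inf")
        start = list(range(m + 1))                      # virtual row before any graph character
        dp = {v: [0] + [INF] * m for v in topo_order}   # dp[v][0] = 0: free graph prefix
        for v in topo_order:
            for j in range(1, m + 1):
                sub = 0 if query[j - 1] == labels[v] else 1
                best = dp[v][j - 1] + 1                  # insertion: consume query char only
                for prev in ([dp[u] for u in preds[v]] or [start]):
                    best = min(best,
                               prev[j - 1] + sub,        # match/mismatch at v
                               prev[j] + 1)              # deletion: consume graph char only
                dp[v][j] = best
        return min(dp[v][m] for v in topo_order)

    # Toy graph spelling "GAT", with an alternative middle vertex "C".
    labels = {0: "G", 1: "A", 2: "C", 3: "T"}
    preds = {0: [], 1: [0], 2: [0], 3: [1, 2]}
    print(align_to_dag("GAT", labels, preds, topo_order=[0, 1, 2, 3]))   # 0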

    Enzyme selection for optical mapping is hard

    The process of assembling a genome without access to a reference genome is prone to a type of error called a misassembly. These errors are difficult to detect and can mimic true biological variation. Optical mapping data, which are generated by applying digestion enzymes to a genome, have been shown to have the potential to reduce misassembly errors in draft genomes. In this paper, we formulate the problem of selecting optimal digestion enzymes to create the most informative optical map. We show that this problem is NP-hard and W[1]-hard. We also propose and evaluate a machine learning method, using a support vector machine with feature reduction, to estimate the optimal enzymes. Using this method, we were able to predict two optimal enzymes exactly and estimate three more within reasonable similarity.
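    As a rough picture of the learning setup described above, the sketch below wires feature reduction and a support vector machine together with scikit-learn on synthetic data. The feature matrix, labels, and parameters are placeholders, not the paper's actual enzyme digestion features or model.

    import numpy as np
    from sklearn.pipeline import Pipeline
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 50))                 # stand-in for per-enzyme digestion statistics
    y = (X[:, 0] + X[:, 3] > 0).astype(int)        # stand-in label: "enzyme is near-optimal"

    model = Pipeline([
        ("select", SelectKBest(f_classif, k=10)),  # feature reduction
        ("svm", SVC(kernel="rbf", C=1.0)),         # support vector classifier
    ])
    print(cross_val_score(model, X, y, cv=5).mean())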

    Structural variant calling: the long and the short of it.

    Recent research into structural variants (SVs) has established their importance to medicine and molecular biology, elucidating their role in various diseases, regulation of gene expression, ethnic diversity, and large-scale chromosome evolution, giving rise to the differences within populations and among species. Nevertheless, characterizing SVs and determining the optimal approach for a given experimental design remains a computational and scientific challenge. Multiple approaches have emerged to target various SV classes, zygosities, and size ranges. Here, we review these approaches with respect to their ability to infer SVs across the full spectrum of large, complex variations and present computational methods for each approach.