
    Safe and complete contig assembly via omnitigs

    Contig assembly is the first stage that most assemblers solve when reconstructing a genome from a set of reads. Its output consists of contigs: a set of strings that are guaranteed to appear in any genome that could have generated the reads. Since the introduction of contigs 20 years ago, assemblers have tried to obtain longer and longer contigs, but the following question was never answered: given a genome graph G (e.g., a de Bruijn graph or a string graph), what are all the strings that can be safely reported from G as contigs? In this paper we answer this question and give a polynomial-time algorithm to find them. Our experiments show that these strings, which we call omnitigs, are 66% to 82% longer on average than the popular unitigs, and that 29% of dbSNP locations have more neighbors in omnitigs than in unitigs. Comment: Full version of the paper in the proceedings of RECOMB 201
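For contrast with the omnitigs the abstract describes, the simpler unitig baseline (maximal non-branching paths in a node-centric de Bruijn graph) can be sketched as below. The graph construction, function names, and the omission of isolated cycles are illustrative assumptions, not the paper's implementation:

```python
from collections import defaultdict

def build_de_bruijn(reads, k):
    """Node-centric de Bruijn graph: nodes are (k-1)-mers, edges are k-mers."""
    out_edges, in_edges = defaultdict(set), defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            u, v = kmer[:-1], kmer[1:]
            out_edges[u].add(v)
            in_edges[v].add(u)
    return out_edges, in_edges

def unitigs(out_edges, in_edges):
    """Extract maximal non-branching paths; isolated cycles omitted for brevity."""
    nodes = set(out_edges) | set(in_edges)
    result = []
    for node in nodes:
        preds = in_edges[node]
        # Skip interior nodes: a unique predecessor that itself has a unique successor.
        if len(preds) == 1 and len(out_edges[next(iter(preds))]) == 1:
            continue
        path, cur = node, node
        # Extend right while the path stays non-branching.
        while len(out_edges[cur]) == 1:
            nxt = next(iter(out_edges[cur]))
            if len(in_edges[nxt]) != 1:
                break
            path += nxt[-1]
            cur = nxt
        result.append(path)
    return result
```

On two reads that share a prefix, e.g. "AAGACT" and "AAGTCT" with k=3, this yields the unitigs AAG, GAC, GTC, and CT; omnitigs generalize exactly this notion of "safe" string to longer walks.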

    Disk Compression of k-mer Sets

    K-mer-based methods have become prevalent in many areas of bioinformatics. In applications such as database search, they often work with large, multi-terabyte datasets. Storing such large datasets is a burden for tool developers, tool users, and reproducibility efforts. General-purpose compressors like gzip, or those designed for read data, are sub-optimal because they do not take into account the specific redundancy pattern in k-mer sets. In our earlier work (Rahman and Medvedev, RECOMB 2020), we presented an algorithm, UST-Compress, that uses a spectrum-preserving string set representation to compress a set of k-mers to disk. In this paper, we present two improved methods for disk compression of k-mer sets, called ESS-Compress and ESS-Tip-Compress. They use a more relaxed notion of string set representation to further remove redundancy from the representation of UST-Compress. We explore their behavior both theoretically and on real data. We show that they improve the compression sizes achieved by UST-Compress by up to 27 percent, across a breadth of datasets. We also derive lower bounds on how well this type of compression strategy can hope to do.
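The spectrum-preserving string set idea underlying this line of work can be illustrated by greedily stitching k-mers together over (k-1)-length overlaps, so the set is stored as a few long strings instead of many short ones. This is a toy stand-in under stated assumptions, not the ESS-Compress algorithm:

```python
def spss(kmers, k):
    """Greedily stitch a set of k-mers into strings whose combined k-mer
    content equals the input set (a naive spectrum-preserving string set)."""
    unused = set(kmers)
    strings = []
    while unused:
        s = unused.pop()
        grew = True
        while grew:
            grew = False
            for kmer in list(unused):
                if kmer[:-1] == s[-(k - 1):]:    # extend to the right
                    s += kmer[-1]
                    unused.discard(kmer)
                    grew = True
                elif kmer[1:] == s[:k - 1]:      # extend to the left
                    s = kmer[0] + s
                    unused.discard(kmer)
                    grew = True
        strings.append(s)
    return strings
```

Each consumed k-mer appears exactly once across the output strings, so the representation is lossless while paying the (k-1)-character overlap cost only once per join; the real methods additionally optimize which joins to make.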

    Data structures and algorithms for analysis of alternative splicing with RNA-Seq data


    Molecular Mechanisms of Crop Domestication Revealed by Comparative Analysis of the Transcriptomes Between Cultivated and Wild Soybeans

    Soybean is one of the key crops necessary to meet the food requirements of the increasing global population. To meet this need, however, the quality and quantity of soybean yield must be greatly enhanced. Soybean yield advancement depends on the presence of favorable genes in the genome pool, which changed significantly during domestication. To make use of those domesticated genes, this study involved seven cultivated (G. max) and four wild-type (G. soja) soybeans. Their developing pods were studied to decipher the molecular mechanisms underlying crop domestication. Specifically, their transcriptomes were analyzed comparatively against previous related studies, with the intention of contributing further to the literature. To this end, several bioinformatics applications were utilized, including de novo transcriptome assembly, transcript abundance quantification, and discovery of differentially expressed genes (DEGs) together with their functional annotations and network visualizations. The results revealed 1,247 DEGs, 916 of which were upregulated in the cultivated soybean in comparison to the wild type. The findings mostly corresponded with previous reports, especially regarding genes affecting the two focal domestication-related traits, pod-shattering resistance and seed size. Genes for these traits were upregulated in cultivated soybeans and downregulated in the wild type. The opposite trend was observed for disease-related genes, which were downregulated or absent in the cultivated soybean genome. Further, 47 biochemical functions of the identified DEGs at the cellular level were revealed, providing some knowledge about the molecular mechanisms of genes related to the two traits above.
    While our findings provide valuable insight into the molecular mechanisms of soybean domestication through the annotation of differentially expressed genes and transcripts, these results must be dissected further and/or reprocessed with a larger number of samples in order to advance the field.

    Focus: A Graph Approach for Data-Mining and Domain-Specific Assembly of Next Generation Sequencing Data

    Next Generation Sequencing (NGS) has emerged as a key technology leading to revolutionary breakthroughs in numerous biomedical research areas. These technologies produce millions to billions of short DNA reads, each representing a small fraction of the original target DNA sequence. These short reads contain little information individually but are produced at high coverage of the original sequence, so that many reads overlap. Overlap relationships allow the reads to be linearly ordered and merged by computational programs called assemblers into long stretches of contiguous sequence, called contigs, that can be used for research applications. Although the assembly of the reads produced by NGS remains a difficult task, it is the process of extracting useful knowledge from these relatively short sequences that has become one of the most exciting and challenging problems in bioinformatics. The assembly of short reads is an aggregative process in which critical information is lost as reads are merged into contigs. In addition, the assembly process is treated as a black box, with generic assembler tools that do not adapt to input data set characteristics. Finally, as NGS data throughput continues to increase, there is an increasing need for smart parallel assembler implementations. In this dissertation, a new assembly approach called Focus is proposed. Unlike previous assemblers, Focus relies on a novel hybrid graph constructed from multiple graphs at different levels of granularity to represent the assembly problem, facilitating information capture and dynamic adjustment to input data set characteristics. This work is composed of four specific aims: 1) the implementation of a robust assembly and analysis tool built on the hybrid graph platform; 2) the development and application of graph mining to extract biologically relevant features from NGS data sets; 3) the integration of domain-specific knowledge to improve the assembly and analysis process; and 4) the construction of smart parallel computing approaches, including the application of energy-aware computing for NGS assembly and knowledge integration to improve algorithm performance. In conclusion, this dissertation presents a complete parallel assembler called Focus that is capable of extracting biologically relevant features directly from its hybrid assembly graph.
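The overlap-and-merge core of assembly that this abstract describes can be sketched with a naive greedy overlap merger. The function names and the min_olap threshold are hypothetical, the pairwise search is quadratic, and Focus's hybrid-graph approach is far more sophisticated; this only illustrates how overlap relationships let reads be ordered and merged into contigs:

```python
def overlap(a, b, min_olap):
    """Length of the longest suffix of a that is a prefix of b, if >= min_olap."""
    for olap in range(min(len(a), len(b)), min_olap - 1, -1):
        if a[-olap:] == b[:olap]:
            return olap
    return 0

def greedy_assemble(reads, min_olap=3):
    """Repeatedly merge the pair of reads with the largest suffix-prefix overlap."""
    reads = list(reads)
    while True:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i == j:
                    continue
                o = overlap(a, b, min_olap)
                if o > best[0]:
                    best = (o, i, j)
        o, i, j = best
        if o == 0:
            return reads  # no pair overlaps enough; remaining strings are the contigs
        merged = reads[i] + reads[j][o:]
        reads = [r for idx, r in enumerate(reads) if idx not in (i, j)]
        reads.append(merged)
```

For example, the three overlapping reads "ACGTAC", "GTACGG", and "ACGGTT" collapse into the single contig "ACGTACGGTT"; real assemblers must additionally resolve repeats and sequencing errors, which is where graph representations come in.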

    A scalable and memory-efficient algorithm for de novo transcriptome assembly of non-model organisms

    Background: With the increased availability of de novo assembly algorithms, it is feasible to study entire transcriptomes of non-model organisms. While algorithms specifically designed for transcriptome assembly from high-throughput sequencing data are available, they are very memory-intensive, limiting their application to small data sets with few libraries.
    Results: We develop a transcriptome assembly algorithm that recovers alternatively spliced isoforms and expression levels while utilizing as many RNA-Seq libraries as possible, containing hundreds of gigabases of data. New techniques are developed so that computations can be performed on a computing cluster with a moderate amount of physical memory.
    Conclusions: Our strategy minimizes memory consumption while obtaining comparable or improved accuracy over existing algorithms. It supports incremental updates of assemblies when new libraries become available.
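One generic way to keep peak memory moderate, in the spirit of the Results above, is to partition work by hash so that only one partition's state is ever resident at a time. Everything here (the function name, counting k-mers as the workload, and keeping partitions in memory rather than on disk) is a hypothetical sketch, not the paper's algorithm:

```python
from collections import Counter

def partitioned_kmer_counts(reads, k, num_parts=4):
    """Count k-mers one hash partition at a time, so peak memory is bounded
    by the largest partition rather than the whole k-mer set."""
    totals = {}
    for part in range(num_parts):
        counts = Counter()
        for read in reads:
            for i in range(len(read) - k + 1):
                kmer = read[i:i + k]
                # Each k-mer belongs to exactly one partition.
                if hash(kmer) % num_parts == part:
                    counts[kmer] += 1
        totals.update(counts)
    return totals
```

The trade-off is extra passes over the input in exchange for a bounded working set; spilling each partition to disk instead of merging into `totals` would support the incremental-update scenario the abstract mentions.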