111 research outputs found

Oases: robust de novo RNA-seq assembly across the dynamic range of expression levels

Author: Birney Ewan
Schulz Marcel H.
Vingron Martin
Zerbino Daniel R.
Publication venue: Oxford University Press
Publication date: 01/04/2012

Motivation: High-throughput sequencing has made the analysis of new model organisms more affordable. Although assembling a new genome can still be costly and difficult, it is possible to use RNA-seq to sequence mRNA. In the absence of a known genome, it is necessary to assemble these sequences de novo, taking into account possible alternative isoforms and the dynamic range of expression values

PubMed Central

MPG.PuRe

Representing and decomposing genomic structural variants as balanced integer flows on sequence graphs

Author: Ballinger Tracy
Haussler David
Hickey Glenn
Paten Benedict
Zerbino Daniel R.
Publication date: 03/09/2015

The study of genomic variation has provided key insights into the functional role of mutations. Predominantly, studies have focused on single nucleotide variants (SNV), which are relatively easy to detect and can be described with rich mathematical models. However, it has been observed that genomes are highly plastic, and that whole regions can be moved, removed or duplicated in bulk. These structural variants (SV) have been shown to have significant impact on the phenotype, but their study has been held back by the combinatorial complexity of the underlying models. We describe here a general model of structural variation that encompasses both balanced rearrangements and arbitrary copy-numbers variants (CNV). In this model, we show that the space of possible evolutionary histories that explain the structural differences between any two genomes can be sampled ergodically

arXiv.org e-Print Archive

Crossref

Springer - Publisher Connector

PubMed Central

eScholarship - University of California

Pebble and Rock Band: Heuristic Resolution of Repeats and Scaffolding in the Velvet Short-Read de Novo Assembler

Author: Birney Ewan
Margulies Elliott H.
McEwen Gayle K.
Zerbino Daniel R.
Publication venue: Public Library of Science

BACKGROUND: Despite the short length of their reads, micro-read sequencing technologies have shown their usefulness for de novo sequencing. However, especially in eukaryotic genomes, complex repeat patterns are an obstacle to large assemblies. PRINCIPAL FINDINGS: We present a novel heuristic algorithm, Pebble, which uses paired-end read information to resolve repeats and scaffold contigs to produce large-scale assemblies. In simulations, we can achieve weighted median scaffold lengths (N50) of above 1 Mbp in Bacteria and above 100 kbp in more complex organisms. Using real datasets we obtained a 96 kbp N50 in Pseudomonas syringae and a unique 147 kbp scaffold of a ferret BAC clone. We also present an efficient algorithm called Rock Band for the resolution of repeats in the case of mixed length assemblies, where different sequencing platforms are combined to obtain a cost-effective assembly. CONCLUSIONS: These algorithms extend the utility of short read only assemblies into large complex genomes. They have been implemented and made available within the open-source Velvet short-read de novo assembler
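The N50 figures quoted in this abstract are weighted medians of scaffold length: the largest length L such that scaffolds of length L or more hold at least half of the assembled bases. A minimal sketch of that calculation follows. The function name `n50` and the sample lengths are illustrative assumptions; this is not code from Velvet.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    Illustrative helper (not part of Velvet): walk the lengths from
    longest to shortest and stop once the running total covers at
    least half of all assembled bases.
    """
    if not lengths:
        raise ValueError("need at least one length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point issues with total / 2.
        if 2 * running >= total:
            return length


if __name__ == "__main__":
    # Hypothetical scaffold lengths in base pairs.
    print(n50([2, 3, 4, 5, 6]))
```

Because the statistic is weighted by length, a few long scaffolds dominate it, which is why it is reported in place of a plain median.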

Directory of Open Access Journals

PubMed Central

SolexaQA: At-a-glance quality assessment of Illumina second-generation sequencing data

Author: A Martínez-Alcántara
Daniel A Peterson
DR Zerbino
GJ Hannon
J Rougemont
JC Dohm
ML Metzker
Murray P Cox
P Pavlidis
Patrick J Biggs
PC Dolan
PJA Cock
R Development Core Team
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: Illumina's second-generation sequencing platform is playing an increasingly prominent role in modern DNA and RNA sequencing efforts. However, rapid, simple, standardized and independent measures of run quality are currently lacking, as are tools to process sequences for use in downstream applications based on read-level quality data. Results: We present SolexaQA, a user-friendly software package designed to generate detailed statistics and at-a-glance graphics of sequence data quality both quickly and in an automated fashion. This package contains associated software to trim sequences dynamically using the quality scores of bases within individual reads. Conclusion: The SolexaQA package produces standardized outputs within minutes, thus facilitating ready comparison between flow cell lanes and machine runs, as well as providing immediate diagnostic information to guide the manipulation of sequence data for downstream analyses.
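The abstract does not say how reads are trimmed from per-base quality scores. One plausible reading, sketched here purely as an assumption, is to keep the longest contiguous stretch of bases whose quality meets a threshold. The function names, the default threshold of 20 and the sample read are hypothetical and are not taken from the SolexaQA package.

```python
def longest_high_quality_segment(qualities, threshold=20):
    """Return (start, end) of the longest run with quality >= threshold.

    The interval is half-open, so an empty result is (0, 0).
    Ties are resolved in favour of the earliest run.
    """
    best_start, best_len = 0, 0
    run_start = None
    for i, q in enumerate(qualities):
        if q >= threshold:
            if run_start is None:
                run_start = i  # a new high-quality run begins here
            run_len = i - run_start + 1
            if run_len > best_len:
                best_start, best_len = run_start, run_len
        else:
            run_start = None  # a low-quality base ends the current run
    return best_start, best_start + best_len


def trim_read(sequence, qualities, threshold=20):
    """Trim a read and its quality scores to the best segment."""
    if len(sequence) != len(qualities):
        raise ValueError("sequence and qualities must have equal length")
    start, end = longest_high_quality_segment(qualities, threshold)
    return sequence[start:end], qualities[start:end]


if __name__ == "__main__":
    # Hypothetical read with low-quality bases at both ends and in the middle.
    print(trim_read("ACGTACGT", [5, 30, 30, 10, 30, 30, 30, 2]))
```

A read with no base at or above the threshold is trimmed to an empty string, so downstream code would need to drop such reads.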

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

A Unifying Model of Genome Evolution Under Parsimony

Author: A Bergeron
A Caprara
AE Darling
AW Xu
B Paten
B Raphael
Benedict Paten
C Chauve
D Bienstock
Daniel R Zerbino
David Haussler
E Tannier
G Bourque
Glenn Hickey
I Elias
J Edmonds
J Felsenstein
J Kim
J Ma
L Chindelevitch
LL Wang
M Alekseyev
M Bader
M Blanchette
M Shao
MD Braga
N El-Mabrouk
O Westesson
P Medvedev
S Hannenhalli
S Yancopoulos
W Day
W Miller
YS Song
Publication date: 12/05/2014

We present a data structure called a history graph that offers a practical basis for the analysis of genome evolution. It conceptually simplifies the study of parsimonious evolutionary histories by representing both substitutions and double cut and join (DCJ) rearrangements in the presence of duplications. The problem of constructing parsimonious history graphs thus subsumes related maximum parsimony problems in the fields of phylogenetic reconstruction and genome rearrangement. We show that tractable functions can be used to define upper and lower bounds on the minimum number of substitutions and DCJ rearrangements needed to explain any history graph. These bounds become tight for a special type of unambiguous history graph called an ancestral variation graph (AVG), which constrains in its combinatorial structure the number of operations required. We finally demonstrate that for a given history graph G, a finite set of AVGs describe all parsimonious interpretations of G, and this set can be explored with a few sampling moves. Comment: 52 pages, 24 figures

arXiv.org e-Print Archive

Crossref

Springer - Publisher Connector

PubMed Central

eScholarship - University of California

Ensembl regulation resources

Author: Anne Parker
Bethan Pritchard
Damian Keefe
Dan Sheppard
Daniel R. Zerbino
Daniel Sobral
Emily Perry
Ewan Birney
Herrero
Ian Dunham
Ikhlak Ahmed
Ilias Lavidas
Michael Nuhn
Nathan Johnson
Paul Flicek
Quentin Raffaillac-Desfosses
Rhoda Kinsella
Ridwan Amode
Simon Brent
Stefan Gräf
Steven P. Wilder
Steven Trevanion
Thomas Juetteman
Publication venue: Oxford University Press (OUP)
Publication date: 24/11/2015

New experimental techniques in epigenomics allow researchers to assay a diversity of highly dynamic features such as histone marks, DNA modifications or chromatin structure. The study of their fluctuations should provide insights into gene expression regulation, cell differentiation and disease. The Ensembl project collects and maintains the Ensembl regulation data resources on epigenetic marks, transcription factor binding and DNA methylation for human and mouse, as well as microarray probe mappings and annotations for a variety of chordate genomes. From this data, we produce a functional annotation of the regulatory elements along the human and mouse genomes with plans to expand to other species as data becomes available. Starting from well-studied cell lines, we will progressively expand our library of measurements to a greater variety of samples. Ensembl's regulation resources provide a central and easy-to-query repository for reference epigenomes. As with all Ensembl data, it is freely available at http://www.ensembl.org, from the Perl and REST APIs and from the public Ensembl MySQL database server at ensembldb.ensembl.org. Database URL: http://www.ensembl.org. Wellcome Trust grant: (WT098051); National Human Genome Research Institute grants: (U41HG007234, 1U01 HG004695); Biotechnology and Biological Sciences Research Council grant: (BB/L024225/1); European Molecular Biology Laboratory; European Union's Seventh Framework Programme; European Research Council

Access to Research and Communications Annals

Crossref

PubMed Central

Meraculous: De Novo Genome Assembly with Short Paired-End Reads

Author: A Edwards
B Ewing
D Hernandez
DA Wheeler
Daniel S. Rokhsar
DR Bentley
DR Smith
DR Zerbino
ES Lander
EW Myers
Gary P. Schroth
GG Sutton
I Maccallum
Isaac Ho
J Butler
Jarrod A. Chapman
JC Roach
JL Weber
JT Simpson
K Hayashi
M Chaisson
M Margulies
M Pop
MJ Chaisson
ML Metzker
P Flicek
PA Pevzner
R Li
RL Warren
RM Idury
SC Schuster
SF Altschul
Shujun Luo
Sirisha Sunkara
Steven L. Salzberg
TW Jeffries
Publication venue: Public Library of Science
Publication date: 01/08/2011

We describe a new algorithm, meraculous, for whole genome assembly of deep paired-end short reads, and apply it to the assembly of a dataset of paired 75-bp Illumina reads derived from the 15.4 megabase genome of the haploid yeast Pichia stipitis. More than 95% of the genome is recovered, with no errors; half the assembled sequence is in contigs longer than 101 kilobases and in scaffolds longer than 269 kilobases. Incorporating fosmid ends recovers entire chromosomes. Meraculous relies on an efficient and conservative traversal of the subgraph of the k-mer (deBruijn) graph of oligonucleotides with unique high quality extensions in the dataset, avoiding an explicit error correction step as used in other short-read assemblers. A novel memory-efficient hashing scheme is introduced. The resulting contigs are ordered and oriented using paired reads separated by ∼280 bp or ∼3.2 kbp, and many gaps between contigs can be closed using paired-end placements. Practical issues with the dataset are described, and prospects for assembling larger genomes are discussed

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

UNT Digital Library

The Ensembl Regulatory Build

Author: A Barski
A Visel
AL Dixon
AR Quinlan
B Ren
BE Stranger
CM Koch
CY McLean
Daniel R Zerbino
DR Zerbino
DS Johnson
E Lieberman-Aiden
FANTOM The
GA Maston
H Li
H Xu
I Keshet
J Dostie
J Ernst
J Severin
JD Buenrostro
M Esteller
M Kellis
M Levine
M Weber
MJ Fullwood
ML Freedman
MM Hoffman
Nathan Johnson
P Flicek
P Fraser
Paul R Flicek
PG Giresi
R Andersson
R Jaenisch
R Lister
R Margueron
RE Thurman
RJ Klose
RM Kuhn
SIS Grewal
Steven P Wilder
T Jenuwein
The ENCODE project consortium
Thomas Juettemann
TS Mikkelsen
V Curwen
Y Chen
Publication venue: Springer Science and Business Media LLC

Crossref

Genome-wide SNP identification by high-throughput sequencing and selective mapping allows sequence assembly positioning using a framework genetic linkage map

Author: Alan Christoffels
B Sobrino
D Hernandez
D Jasper G Rees
Daniel J Sargent
DJ Sargent
DR Zerbino
GAL Broggini
H Li
J Butler
Jean-Marc Celton
JJ Doyle
KM Folta
R De La Rosa
R Ming
S DiGuistini
S Huang
The Bovine Genome Sequencing and Analysis Consortium
TJ Vision
Vladimir Shulaev
W Howad
WE MacHardy
X Xu
Xiangming Xu
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: Determining the position and order of contigs and scaffolds from a genome assembly within an organism's genome remains a technical challenge in a majority of sequencing projects. In order to exploit contemporary technologies for DNA sequencing, we developed a strategy for whole genome single nucleotide polymorphism sequencing allowing the positioning of sequence contigs onto a linkage map using the bin mapping method. Results: The strategy was tested on a draft genome of the fungal pathogen Venturia inaequalis, the causal agent of apple scab, and further validated using sequence contigs derived from the diploid plant genome Fragaria vesca. Using our novel method we were able to anchor 70% and 92% of sequence assemblies for V. inaequalis and F. vesca, respectively, to genetic linkage maps. Conclusions: We demonstrated the utility of this approach by accurately determining the bin map positions of the majority of the large sequence contigs from each genome sequence and validated our method by mapping single sequence repeat markers derived from sequence contigs on a full mapping population.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

University of the Western Cape Research Repository

LOCAS – A Low Coverage Assembly Tool for Resequencing Projects

Author: A Doring
AR Quinlan
B Langmead
C Nusbaum
D Hernandez
D Weigel
Daniel H. Huson
DC Richter
Detlef Weigel
DR Zerbino
EW Myers
H Li
I Birol
JD Kececioglu
JO Korbel
JT Simpson
Juliane D. Klein
K Schneeberger
Korbinian Schneeberger
LE Palmer
M Pop
MC Wendl
MJ Chaisson
PA Pevzner
R Li
RM Durbin
S Ossowski
SL Salzberg
SM Rumble
SQ Le
Stephan Ossowski
T Rausch
Ying Xu
Publication venue: Public Library of Science
Publication date: 01/01/2011

Motivation: Next Generation Sequencing (NGS) is a frequently applied approach to detect sequence variations between highly related genomes. Recent large-scale re-sequencing studies such as the Human 1000 Genomes Project utilize NGS data of low coverage to afford sequencing of hundreds of individuals. Here, SNPs and micro-indels can be detected by applying an alignment-consensus approach. However, computational methods capable of discovering other variations such as novel insertions or highly diverged sequence from low coverage NGS data are still lacking. Results: We present LOCAS, a new NGS assembler particularly designed for low coverage assembly of eukaryotic genomes using a mismatch sensitive overlap-layout-consensus approach. LOCAS assembles homologous regions in a homology-guided manner while it performs de novo assemblies of insertions and highly polymorphic target regions subsequently to an alignment-consensus approach. LOCAS has been evaluated in homology-guided assembly scenarios with low sequence coverage of Arabidopsis thaliana strains sequenced as part of the Arabidopsis 1001 Genomes Project. While assembling the same amount of long insertions as state-of-the-art NGS assemblers, LOCAS showed best results regarding contig size, error rate and runtime. Conclusion: LOCAS produces excellent results for homology-guided assembly of eukaryotic genomes with short reads and low sequencing depth, and therefore appears to be the assembly tool of choice for the detection of novel sequenc

CiteSeerX

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

MPG.PuRe

ScholarBank@NUS