25 research outputs found

Compression of Structured High-Throughput Sequencing Data

Author: ER Mardis
Fabien Campagne
Frederique Lisacek
H Li
H Li
James T. Robinson
Jill P. Mesirov
JK Pickrell
JR Shearstone
JT Robinson
Kevin C. Dorff
L Skrabanek
M Hsi-Yang Fritz
M Mangone
N Agrawal
N Popitsch
Nyasha Chambwe
SM Kielbasa
TD Wu
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 28/11/2012

Large biological datasets are being produced at a rapid pace and create substantial storage challenges, particularly in the domain of high-throughput sequencing (HTS). Most approaches currently used to store HTS data are either unable to quickly adapt to the requirements of new sequencing or analysis methods (because they do not support schema evolution), or fail to provide state-of-the-art compression of the datasets. We have devised new approaches to store HTS data that support seamless data schema evolution and compress datasets substantially better than existing approaches. Building on these new approaches, we discuss and demonstrate how a multi-tier data organization can dramatically reduce the storage, computational and network burden of collecting, analyzing, and archiving large sequencing datasets. For instance, we show that spliced RNA-Seq alignments can be stored in less than 4% of the size of a BAM file with perfect data fidelity. Compared to the previous compression state of the art, these methods reduce dataset size by more than 40% when storing exome, gene expression or DNA methylation datasets. The approaches have been integrated in a comprehensive suite of software tools (http://goby.campagnelab.org) that support common analyses for a range of high-throughput sequencing assays.

Funding: National Center for Research Resources (U.S.) (Grant UL1 RR024996); Leukemia & Lymphoma Society of America (Translational Research Program Grant LLS 6304-11); National Institute of Mental Health (U.S.) (R01 MH086883

arXiv.org e-Print Archive

CiteSeerX

Public Library of Science (PLOS)

DSpace@MIT

Crossref

Directory of Open Access Journals

PubMed Central

eScholarship - University of California

FigShare

ReCoil - an algorithm for compression of extremely large datasets of DNA data

Author: Adam L Buchsbaum
Alok Aggarwal
Bin Ma
Christos Kozanitis
Daniel D Sommer
David Eppstein
M Waterman
Markus Fritz Hsi-Yang
P Ferragina
Paolo Ferragina
R Dementiev
Roman Dementiev
Scott Christley
Veli Mäkinen
Vladimir Yanovsky
W Timothy White
Wenyu Zhang
Xin Chen
Z Ning
Publication venue: BioMed Central
Publication date: 01/01/2011

The growing volume of generated DNA sequencing data makes the problem of its long-term storage increasingly important. In this work we present ReCoil, an I/O-efficient external-memory algorithm designed for compression of very large collections of short-read DNA data. Typically, each position of a DNA sequence is covered by multiple reads of a short-read dataset, and our algorithm makes use of the resulting redundancy to achieve a high compression rate.

University of Toronto Research Repository

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

CGGBP1 mitigates cytosine methylation at repetitive DNA sequences

Author: B Langmead
Bengt Westermark
BH Ramsahoye
D Biniszkiewicz
D Blankenberg
D Cortazar
DC Hancks
DM Messerschmidt
EL Fritz
F Butter
F Fuks
F Krueger
F Naumann
H Deissler
H Deissler
H Gowher
H Muller-Hartmann
Helena Jernberg Wiklund
KD Robertson
KD Robertson
KD Robertson
KI Tatematsu
LS Chuang
M Fatemi
M Okano
Markus Hsi-Yang Fritz
MR Rountree
P Rice
Paul Collier
Prasoon Agarwal
R Schipper
S Cortellino
S Pradhan
S Tempel
U Singh
U Singh
Umashankar Singh
Vladimir Benes
W Guo
X Zhang
Y Shimooka
ZD Smith
Publication venue: 'Springer Science and Business Media LLC'

Crossref

Metagenomics - a guide from sampling to data analysis

Metagenomics applies a suite of genomic technologies and bioinformatics tools to directly access the genetic content of entire communities of organisms. The field of metagenomics has been responsible for substantial advances in microbial ecology, evolution, and diversity over the past 5 to 10 years, and many research laboratories are actively engaged in it now. With the growing numbers of activities also comes a plethora of methodological knowledge and expertise that should guide future developments in the field. This review summarizes the current opinions in metagenomics, and provides practical guidance and advice on sample processing, sequencing technology, assembly, binning, annotation, experimental design, statistical analysis, data storage, and data sharing. As more metagenomic datasets are generated, the availability of standardized procedures and shared data storage and analysis becomes increasingly important to ensure that the output of individual projects can be assessed and compared.

Crossref

Springer - Publisher Connector

PubMed Central

Computational solutions for omics data

Author: A Butte
A Chatr-aryamontri
A Franceschini
A Joshi
A Lan
A Mortazavi
A Subramanian
A Tanay
AC Jungkamp
AJ Pinho
AK Wong
AR Whitney
B Langmead
B Langmead
B Paten
Bonnie Berger
BP Kelley
C Huttenhower
C Kingsford
C Trapnell
C Trapnell
C Trapnell
C Wang
CH Yeang
CJ Vaske
CS Liao
D Croft
D Earl
D Kim
D Kim
D Park
DB Allison
DB Jaffe
DR Zerbino
E Banks
E Banks
E Cerami
E Nabieva
E Segal
E Yeger-Lotem
EJ Rossin
ER Mardis
ES Lander
ET Wang
F Hach
F Hach
F Markowetz
F Ozsolak
F Vandin
F Vandin
F Vezzi
GE Zinman
H Li
H Li
I Ulitsky
I Ulitsky
IA Adzhubei
J Butler
J Clarke
J Flannick
J Goecks
J Lamb
J Pandey
JC Marioni
JC Venter
Jian Peng
JT Dudley
JT Leek
JT Simpson
JT Simpson
K Rhrissorrakrai
KI Goh
KY Yeung
L Parts
LD Stein
LH Hartwell
LM Heiser
LR Meyer
M Ascano
M Burrows
M Garber
M Gross
M Gstaiger
M Hafner
M Hsi-Yang Fritz
M Kircher
M Koyuturk
M Narayanan
M Reich
M Schatz
M Schmid
M Sirota
M Steffen
M Yandell
MB Gerstein
MB Gerstein
MC Brandon
MC Schatz
MG Grabherr
MH Maathuis
ML Metzker
Mona Singh
N Atias
N de Souza
N Tuncbag
NP Palmer
NT Ingolia
O Hirose
O Litvin
O Ogasawara
O Stegle
O Vanunu
P Ferragina
P Flicek
P Jiang
P Kumar
P Lu
P Shannon
PA Pevzner
PE Compeau
PG Doyle
PO Brown
PR Loh
PR Schmid
R Colak
R Gaujoux
R Li
R Li
R Li
R Singh
RC Gentleman
S Anders
S Batzoglou
S Christley
S Deorowicz
S Erten
S Kohler
S Levy
S Navlakha
S Ng
S Suthram
SA Chowdhury
SD Kahn
SF Altschul
SG Tringe
SL Salzberg
SS Huang
SS Shen-Orr
T Barrett
T Ideker
T Michoel
TS Furey
U Manber
UD Akavia
W Ali
W Li
W Tembe
WJ Kent
X Liu
X Wang
X Zhou
Y Prat
Y Wang
Y Zhang
YA Kim
Z Tu
Z Wang
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/04/2013

High-throughput experimental technologies are generating increasingly massive and complex genomic data sets. The sheer enormity and heterogeneity of these data threaten to make the arising problems computationally infeasible. Fortunately, powerful algorithmic techniques lead to software that can answer important biomedical questions in practice. In this Review, we sample the algorithmic landscape, focusing on state-of-the-art techniques, the understanding of which will aid the bench biologist in analysing omics data. We spotlight specific examples that have facilitated and enriched analyses of sequence, transcriptomic and network data sets.

Funding: National Institutes of Health (U.S.) (Grant GM081871

Princeton University Open Access Repository

DSpace@MIT

Crossref

PubMed Central

The real cost of sequencing: scaling computation to keep pace with data generation

Author: A Auton
A Dobin
A Sboner
A Sood
AC English
AG Levine
AJG Hey
B Langmead
C Walter
CS Chin
D Greenbaum
D Greenbaum
D Kleftogiannis
Daifeng Wang
Daniel J Spakowicz
DC Brock
DG George
DJ Lipman
F Sanger
Farren Isaacs
George M. Weinstock
H Li
H Li
H Li
H Stevens
J Dean
Jing Zhang
JN Weinstein
Joel Rozowsky
KR Bradnam
LD Stein
Leonidas Salichos
M Armbrust
M Hsi-Yang Fritz
M Massie
M Zaharia
Mark Gerstein
MI Kanehisa
MJ Chaisson
P Gouet
Paul Muir
PE Ross
R Cattell
R Larson
R Leinonen
R Leinonen
R Staden
S Koren
SB Needleman
SF Altschul
Shantao Li
Shaoke Lou
TF Smith
V Kuleshov
W Isaacson
W Zhang
WJ Kent
Z Zhu
Publication venue: 'Springer Science and Business Media LLC'

Crossref

Genetic code expansion for multiprotein complex engineering

Author: A Bianco
A Chatterjee
A Goldhirsch
A Maiolica
AC Wolff
Attila Gyenesei
Bence Galik
C Bieniossek
Carsten Schultz
CC Liu
Christine Koehler
DJ Fitzgerald
E Provenzano
E Sisamakis
EA Lemke
Edward A Lemke
G Hernandez Jr.
Gemma Estrada Girona
Giancarlo Pruneri
Hueseyin Besir
I Berger
I Nikić
Imre Berger
J Cox
J Rappsilber
Jan O Korbel
Jan-Erik Hoffmann
Jonathan J M Landry
JT Simpson
Juan Zou
Juri Rappsilber
JW Chin
JY Axup
Kapil Gupta
Ksenija Radic
M Zhang
Markus Hsi-Yang Fritz
Martin Jechlinger
Mirella Wawryszyn
MM Robinson
Moritz Bosse Biskup
MY Polley
Paul F Sauter
Peggy Stolt-Bergner
Piau Siong Tan
PR Chen
R Luo
RJ Tomko Jr.
S Milles
S Milles
S Tyagi
Sini Junttila
SM Hancock
SM Kraemer
SS Thakur
Stefan Braese
T Crépin
T Magoč
T Mukai
T Mukai
T Plass
T Plass
Vladimir Benes
ZA Chen
Zhuo A Chen
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/12/2016

We present a baculovirus-based protein engineering method that enables site-specific introduction of unique functionalities in a eukaryotic protein complex recombinantly produced in insect cells. We demonstrate the versatility of this efficient and robust protein production platform, ‘MultiBacTATAG’, (i) for the fluorescent labeling of target proteins and biologics using click chemistries, (ii) for glycoengineering of antibodies, and (iii) for structure–function studies of novel eukaryotic complexes using single-molecule Förster resonance energy transfer as well as site-specific crosslinking strategies.

Crossref

AIR Universita degli studi di Milano

Edinburgh Research Explorer

Explore Bristol Research

Sequence squeeze: an open contest for sequence compression

Author: D Earl
M Hsi-Yang Fritz
The 1000 Genomes Project Consortium
Y Kodama
Publication venue: 'Springer Science and Business Media LLC'

Crossref

Primate genome architecture influences structural variation mechanisms and functional consequences.

Author: Benes Vladimir
Fritz Markus Hsi-Yang
Gokcumen Omer
Iskow Rebecca C
Korbel Jan O
Langdon Amy
Lee Charles
Lee Eunjung
Mills Ryan E
Park Peter J
Pavlidis Pavlos
Stütz Adrian M
Tica Jelena
Tischler Verena
Zhu Qihui
Publication venue: The Mouseion at the JAXlibrary
Publication date: 06/09/2013

Although nucleotide-resolution maps of genomic structural variants (SVs) have provided insights into the origin and impact of phenotypic diversity in humans, comparable maps in nonhuman primates have thus far been lacking. Using massively parallel DNA sequencing, we constructed fine-resolution genomic structural variation maps in five chimpanzees, five orang-utans, and five rhesus macaques. The SV maps, which comprise thousands of deletions, duplications, and mobile element insertions, revealed a high activity of retrotransposition in macaques compared with great apes. By comparison, nonallelic homologous recombination is specifically active in the great apes, which is correlated with architectural differences between the genomes of great apes and macaque. Transcriptome analyses across nonhuman primates and humans revealed effects of species-specific whole-gene duplication on gene expression. We identified 13 gene duplications coinciding with the species-specific gain of tissue-specific gene expression, in keeping with a role of gene duplication in the promotion of diversification and the acquisition of unique functions. Differences in the present-day activity of SV formation mechanisms that our study revealed may contribute to ongoing diversification and adaptation of great ape and Old World monkey lineages.

Proc Natl Acad Sci U S A 2013 Sep 24; 110(39):15764-15769

The Jackson Laboratory: The Mouseion at the JAXlibrary

PubMed Central