
12 research outputs found

Comments and Suggestions for Improvement of the Archon Genomics X PRIZE Validation Protocol

Author: Alexander Wait Zaranek
Joseph V. Thakuria
Tom Clegg
Ward Vandewege
Publication date: 07/03/2011

This document is a comment on the X PRIZE validation protocol written by Kedes et al. (2011). We propose several modifications which we think will improve the fairness and transparency of the contest while keeping the cost of the validation process under control

Crossref

Nature Precedings

Swift: primary data analysis for the Illumina Solexa sequencing platform

Author: Alexander Wait Zaranek
Andrea Löhr
Bentley
Brown
Castro
Christina Curtis
Clive Brown
Cope
Erlich
Ewing
Frigo
Holloway
Irina Abnizova
Li
Lin
Matt E. Ritchie
Nava Whiteford
Quail
Ritchie
Rougemont
Serra
Tom Skelly
Zaranek
Publication venue: Oxford University Press
Publication date: 01/09/2009

Motivation: Primary data analysis methods are of critical importance in second generation DNA sequencing. Improved methods have the potential to increase yield and reduce the error rates. Openly documented analysis tools enable the user to understand the primary data; this is important for the optimization and validity of their scientific work

Crossref

Harvard University - DASH

PubMed Central

University of Melbourne Institutional Repository

Harvard Personal Genome Project: lessons from participatory public research

Author: Ball Madeleine P
Bobe Jason R
Chou Michael F
Church George M
Clegg Tom
Estep Preston W
Lunshof Jeantine E
Vandewege Ward
Zaranek Alexander Wait
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2014

Background: Since its initiation in 2005, the Harvard Personal Genome Project has enrolled thousands of volunteers interested in publicly sharing their genome, health and trait data. Because these data are highly identifiable, we use an ‘open consent’ framework that purposefully excludes promises about privacy and requires participants to demonstrate comprehension prior to enrollment. Discussion: Our model of non-anonymous, public genomes has led us to a highly participatory model of researcher-participant communication and interaction. The participants, who are highly committed volunteers, self-pursue and donate research-relevant datasets, and are actively engaged in conversations with both our staff and other Personal Genome Project participants. We have quantitatively assessed these communications and donations, and report our experiences with returning research-grade whole genome data to participants. We also observe some of the community growth and discussion that has occurred related to our project. Summary: We find that public non-anonymous data is valuable and leads to a participatory research model, which we encourage others to consider. The implementation of this model is greatly facilitated by web-based tools and methods and participant education. Project results are long-term proactive participant involvement and the growth of a community that benefits both researchers and participants

Crossref

Harvard University - DASH

Springer - Publisher Connector


Accurate Whole-Genome Sequencing and Haplotyping from 10 to 20 Human Cells

Author: Alexeev Andrei
Alferov Oleg
Baccash Jonathan
Ball Madeleine Price
Chen Linsu
Church George McDonald
Dahl Fredrik
Drmanac Radoje
Ebert Jessica C.
Haas Juergen
Halpern Aaron L.
Hong Peter
Jiang Yuan
Kennemer Michael I.
Kermani Bahram G.
Konvicka Karel
Lee Je-Hyuk
Liu Jia
Nilsen Geoffrey B.
Pant Krishna P.
Perazich Helena
Peters Brock A.
Peterson Joseph E.
Pothuraju Kaliprasad
Robasky Kimberly J.
Sparks Andrew B.
Tang Y. Tom
Tsoupko-Sitnikov Mike
Yeung George
Zaranek Alexander Wait
Publication venue: Springer Science and Business Media LLC
Publication date: 07/05/2013

Recent advances in whole genome sequencing have brought the vision of personal genomics and genomic medicine closer to reality. However, current methods lack clinical accuracy and the ability to describe the context (haplotypes) in which genome variants co-occur in a cost-effective manner. Here we describe a low-cost DNA sequencing and haplotyping process, Long Fragment Read (LFR) technology, similar to sequencing long single DNA molecules without cloning or separation of metaphase chromosomes. In this study, ten LFR libraries were made using only ~100 pg of human DNA per sample. Up to 97% of the heterozygous single nucleotide variants (SNVs) were assembled into long haplotype contigs. Removal of false positive SNVs not phased by multiple LFR haplotypes resulted in a final genome error rate of 1 in 10 Mb. Cost-effective and accurate genome sequencing and haplotyping from 10-20 human cells, as demonstrated here, will enable comprehensive genetic studies and diverse clinical applications

Harvard University - DASH

A Survey of Genomic Traces Reveals a Common Sequencing Error, RNA Editing, and DNA Editing

Author: A Athanasiadis
A Jarmuz
A Mehta
A Zaranek
AD Scadden
Alexander Wait Zaranek
AM Sheehy
B Ewing
B Mangeat
B Teng
BL Bass
C Esnault
D Kimelman
DD Kim
Dirk Schübeler
DL Wheeler
DR Bentley
E Eisenberg
E Tuzun
Erez Y. Levanon
EY Levanon
George M. Church
H Lellek
JB Li
JE Wedekind
JP Vartanian
KA Lehmann
KJ McKernan
L Saccomanno
LD Hillier
LP Keegan
M Blow
M Muramatsu
MA O'Connell
N Navaratnam
P Revy
Q Yu
R Mariani
RS Harris
S Maas
SG Conticello
SK Wong
SR Hurst
T Melcher
Tom Clegg
Tomer Zecharia
U Kim
WJ Kent
Y Neeman
YL Chiu
YN Lee
Z Zhang
Publication venue: Public Library of Science
Publication date: 01/01/2010

While it is widely held that an organism's genomic information should remain constant, several protein families are known to modify it. Members of the AID/APOBEC protein family can deaminate DNA. Similarly, members of the ADAR family can deaminate RNA. Characterizing the scope of these events is challenging. Here we use large genomic data sets, such as the two billion sequences in the NCBI Trace Archive, to look for clusters of mismatches of the same type, which are a hallmark of editing events caused by APOBEC3 and ADAR. We align 603,249,815 traces from the NCBI trace archive to their reference genomes. In clusters of mismatches of increasing size, at least one systematic sequencing error dominates the results (G-to-A). It is still present in mismatches with 99% accuracy and only vanishes in mismatches at 99.99% accuracy or higher. The error appears to have entered into about 1% of the HapMap, possibly affecting other users that rely on this resource. Further investigation, using stringent quality thresholds, uncovers thousands of mismatch clusters with no apparent defects in their chromatograms. These traces provide the first reported candidates of endogenous DNA editing in human, further elucidating RNA editing in human and mouse and also revealing, for the first time, extensive RNA editing in Xenopus tropicalis. We show that the NCBI Trace Archive provides a valuable resource for the investigation of the phenomena of DNA and RNA editing, as well as setting the stage for a comprehensive mapping of editing events in large-scale genomic datasets

CiteSeerX

Public Library of Science (PLOS)

Crossref

Harvard University - DASH

Directory of Open Access Journals

PubMed Central

Lightning: the first component of the Arvados project to be (re)written in "Go"

Author: Alexander Wait Zaranek
Publication date: 02/10/2013

Arvados is an open-source platform for managing and analyzing genomic and biomedical big data. We are focusing on the Go language for back-end systems within Arvados. We hope Arvados will become a valuable new member of the growing Go language community

ZENODO

FigShare

An Analysis of Public Phenotype/Genotype Data with Arvados

Author: Abram Connelly
Alexander Wait Zaranek
Kevin Fang
Sarah Wait Zaranek

It can be difficult to gain credentials to perform analysis on sensitive data as a researcher, especially as a student. Furthermore, with specific regard to genomic data, it is potentially identifiable, therefore individuals often do not wish to make these data available to bioinformaticians. The Harvard Personal Genome Project and the 1000 Genomes Project curate the genomes of volunteers who are willing to share them with biomedical researchers to aid the future of biology and genetics. Curoverse develops an open-source data analysis tool, Arvados; Arvados allows complex analysis on large datasets using a cluster of computers through “pipelines,” written in Common Workflow Language. With regard to this project, a team at the Università Degli Studi Di Padova in Italy developed a tool titled “BOOGIE” [BOOGIE: Predicting Blood Groups from High Throughput Sequencing Data, Giollo, M. et al.], used to analyze genomes and predict a blood type, and BOOGIE claims to be 94% accurate. The goal of this project was to use Arvados to run BOOGIE on genomes available from the Personal Genome Project and the 1000 Genomes Project and compare the results to ethnicity data provided in genomic surveys, ultimately determining if these data match readily-available ethnicity and blood type information. A pipeline was written in Arvados incorporating BOOGIE through a Docker image to analyze the datasets. In under 10 hours, the tool was able to run BOOGIE on all 606 genomes available. This included 173 genomes from the Personal Genome Project and 433 genomes from the 1000 Genomes Project. After downloading all the data from Arvados and comparing it to the survey data provided from the Personal Genome Project using a Python script, BOOGIE was rated at 86.67% accuracy, having correctly guessed 39/45 blood types from the Personal Genome Project.
Through survey data, each genome analyzed had a blood type and ethnicity, and these data were used to compare the people who had each blood type to their ethnicity. The Personal Genome Project and the 1000 Genomes Project allow genomic data to be accessible and easily available for everyone to use. The Arvados Project records work and simplifies the process of doing so by using Docker images and pipelines. In addition, the Arvados Project allows analysis of massive data sets containing gigabytes to petabytes of information, aiming to create an efficient, common solution for data management across many platforms

ZENODO

FigShare

Multiplex padlock targeted sequencing reveals human hypermutable CpG variations

Author: Aach John
Ahlford Annika
Church George M.
Gao Yuan
Kryukov Gregory V.
LeProust Emily
Li Jin Billy
Rosenbaum Abraham M.
Sunyaev Shamil R.
Xie Bin
Yoon Jung-Ki
Zaranek Alexander Wait
Zhang Kun
Publication venue: Cold Spring Harbor Laboratory Press

Utilizing the full power of next-generation sequencing often requires the ability to perform large-scale multiplex enrichment of many specific genomic loci in multiple samples. Several technologies have been recently developed but await substantial improvements. We report the 10,000-fold improvement of a previously developed padlock-based approach, and apply the assay to identifying genetic variations in hypermutable CpG regions across human chromosome 21. From ∼3 million reads derived from a single Illumina Genome Analyzer lane, ∼94% (∼50,500) target sites can be observed with at least one read. The uniformity of coverage was also greatly improved; up to 93% and 57% of all targets fell within a 100- and 10-fold coverage range, respectively. Alleles at >400,000 target base positions were determined across six subjects and examined for single nucleotide polymorphisms (SNPs), and the concordance with independently obtained genotypes was 98.4%–100%. We detected >500 SNPs not currently in dbSNP, 362 of which were in targeted CpG locations. Transitions in CpG sites were at least 13.7 times more abundant than non-CpG transitions. Fractions of polymorphic CpG sites are lower in CpG-rich regions and show higher correlation with human–chimpanzee divergence within CpG versus non-CpG sites. This is consistent with the hypothesis that methylation rate heterogeneity along chromosomes contributes to mutation rate variation in humans. Our success suggests that targeted CpG resequencing is an efficient way to identify common and rare genetic variations. In addition, the significantly improved padlock capture technology can be readily applied to other projects that require multiplex sample preparation

Crossref

PubMed Central


The whole genome sequences and experimentally phased haplotypes of over 100 personal genomes

Author: Agarwal Misha R.
Ball Madeleine P.
Barua Nina
Carnevali Paolo
Chin Robert
Church George M.
Ciotlos Serban
Clegg Tom
Connelly Abram
Drmanac Radoje
Estep Preston W.
Mao Qing
Nguyen Staci
Peters Brock A.
Vandewege Ward
Zaranek Alexander Wait
Zhang Rebecca Yu
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2016

Background: Since the completion of the Human Genome Project in 2003, it is estimated that more than 200,000 individual whole human genomes have been sequenced, a stunning accomplishment in such a short period of time. However, most of these were sequenced without experimental haplotype data and are therefore missing an important aspect of genome biology. In addition, much of the genomic data is not available to the public and lacks phenotypic information. Findings: As part of the Personal Genome Project, blood samples from 184 participants were collected and processed using Complete Genomics’ Long Fragment Read technology. Here, we present the experimental whole genome haplotyping and sequencing of these samples to an average read coverage depth of 100X. This is approximately three-fold higher than the read coverage applied to most whole human genome assemblies and ensures the highest quality results. Currently, 114 genomes from this dataset are freely available in the GigaDB repository and are associated with rich phenotypic data; the remaining 70 should be added in the near future as they are approved through the PGP data release process. For reproducibility analyses, 20 genomes were sequenced at least twice using independent LFR barcoded libraries. Seven genomes were also sequenced using Complete Genomics’ standard non-barcoded library process. In addition, we report 2.6 million high-quality, rare variants not previously identified in the Single Nucleotide Polymorphisms database or the 1000 Genomes Project Phase 3 data. Conclusions: These genomes represent a unique source of haplotype and phenotype data for the scientific community and should help to expand our understanding of human genome evolution and function. Electronic supplementary material: The online version of this article (doi:10.1186/s13742-016-0148-z) contains supplementary material, which is available to authorized users

Harvard University - DASH

Springer - Publisher Connector

A public resource facilitating clinical use of genomes

Author: Abraham M. Rosenbaum
Alberto Labarga
Alexander Wait Zaranek
Anugraha M. Raman
Athurva Gore
Brock A. Peters
Byung Chul Kim
Carlos Cano
Christine E. Seidman
Church
Daniel B. Vorhaus
Euan A. Ashley
Fitting
Geoffrey B. Nilsen
George M. Church
Heidi L. Rehm
Hougs
Hugh Y. Rienhoff
Jason Bobe
Je-Hyuk Lee
Jeantine E. Lunshof
Jeong-Sun Seo
Jin Billy Li
John Aach
Jong Bhak
Jong-Il Kim
Joseph V. Thakuria
Joyce L. Yang
Kim
Kimberly Robasky
Klein
Kun Zhang
Leonid Peshkin
Luhan Yang
Madeleine P. Ball
Matthew J. Callow
Matthew T. Wheeler
Michael F. Chou
Michael F. Murray
Misha Angrist
Pagon
Peter Hulick
Preston W. Estep
Radoje Drmanac
Seong-Jin Kim
Shawn M. Douglas
Sullivan
Tom Clegg
Ward Vandewege
Wendy K. Chung
Xiaodi Wu
Zaranek
Zhe Li
Publication venue: Proceedings of the National Academy of Sciences
Publication date: 01/01/2012

Rapid advances in DNA sequencing promise to enable new diagnostics and individualized therapies. Achieving personalized medicine, however, will require extensive research on highly reidentifiable, integrated datasets of genomic and health information. To assist with this, participants in the Personal Genome Project choose to forgo privacy via our institutional review board-approved "open consent" process. The contribution of public data and samples facilitates both scientific discovery and standardization of methods. We present our findings after enrollment of more than 1,800 participants, including whole-genome sequencing of 10 pilot participant genomes (the PGP-10). We introduce the Genome-Environment-Trait Evidence (GET-Evidence) system. This tool automatically processes genomes and prioritizes both published and novel variants for interpretation. In the process of reviewing the presumed healthy PGP-10 genomes, we find numerous literature references implying serious disease. Although it is sometimes impossible to rule out a late-onset effect, stringent evidence requirements can address the high rate of incidental findings. To that end we develop a peer production system for recording and organizing variant evaluations according to standard evidence guidelines, creating a public forum for reaching consensus on interpretation of clinically relevant variants. Genome analysis becomes a two-step process: using a prioritized list to record variant evaluations, then automatically sorting reviewed variants using these annotations. Genome data, health and trait information, participant samples, and variant interpretations are all shared in the public domain - we invite others to review our results using our participant samples and contribute to our interpretations. We offer our public resource and methods to further personalized medical research.

CiteSeerX

Maastricht University Research Portal

eScholarship - University of California