
149 research outputs found

The variant call format and VCFtools

Author: A. Auton
C. A. Albers
Durbin
E. Banks
G. Abecasis
G. Lunter
G. McVean
G. T. Marth
M. A. DePristo
P. Danecek
R. Durbin
R. E. Handsaker
S. T. Sherry
Publication venue: Oxford University Press
Publication date: 01/01/2011

Summary: The variant call format (VCF) is a generic format for storing DNA polymorphism data such as SNPs, insertions, deletions and structural variants, together with rich annotations. VCF is usually stored in a compressed manner and can be indexed for fast data retrieval of variants from a range of positions on the reference genome. The format was developed for the 1000 Genomes Project, and has also been adopted by other projects such as UK10K, dbSNP and the NHLBI Exome Project. VCFtools is a software suite that implements various utilities for processing VCF files, including validation, merging and comparing, and also provides a general Perl API.
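To make the format described in this summary concrete, here is a minimal Python sketch that parses the eight fixed tab-separated VCF columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) and selects variants in a position range. It is not VCFtools and does not use its Perl API. The names `Variant`, `parse_vcf`, `in_region` and the sample records are illustrative assumptions, and the linear scan only stands in for the indexed retrieval that real compressed files allow.

```python
from typing import Dict, Iterable, Iterator, List, NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int              # 1-based position on the reference
    ref: str
    alts: List[str]       # one entry per alternate allele
    info: Dict[str, str]  # INFO key -> raw value ("" for flag entries)


def parse_vcf(lines: Iterable[str]) -> Iterator[Variant]:
    """Yield one Variant per data line, skipping blank and '#' header lines."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, _id, ref, alt, _qual, _filter, info = fields[:8]
        info_map: Dict[str, str] = {}
        if info != ".":  # "." marks an empty INFO column
            for item in info.split(";"):
                key, _, value = item.partition("=")
                info_map[key] = value
        yield Variant(chrom, int(pos), ref, alt.split(","), info_map)


def in_region(variants: Iterable[Variant], chrom: str,
              start: int, end: int) -> List[Variant]:
    """Linear-scan range query; an index would avoid reading every record."""
    return [v for v in variants if v.chrom == chrom and start <= v.pos <= end]


# Illustrative records only (hypothetical data for this sketch).
EXAMPLE = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5;DB",
    "20\t17330\t.\tT\tA\t3\tq10\tNS=3;DP=11;AF=0.017",
    "20\t1110696\trs6040355\tA\tG,T\t67\tPASS\tNS=2;DP=10;AF=0.333,0.667",
]

if __name__ == "__main__":
    for variant in in_region(parse_vcf(EXAMPLE), "20", 14000, 18000):
        print(variant.chrom, variant.pos, variant.ref, variant.alts)
```

Multi-allelic sites keep all alternate alleles in `alts`, and INFO values are left as raw strings because their types are declared in the file header, which this sketch does not interpret.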

Oxford University Research Archive

GenomeVIP: A cloud platform for genomic variant discovery and interpretation

Author: Chen Ken
DeNardo Erin
Ding Li
Fenyö David
Handsaker Robert E
Huang Kuan-lin
Koboldt Daniel C
Mashl R. Jay
Niu Beifang
Raphael Benjamin J
Scott Adam D
Wendl Michael C
Wyczalkowski Matthew A
Ye Kai
Yellapantula Venkata D
Yoon Christopher J
Publication venue: Digital Commons@Becker
Publication date: 01/01/2017

Identifying genomic variants is a fundamental first step toward the understanding of the role of inherited and acquired variation in disease. The accelerating growth in the corpus of sequencing data that underpins such analysis is making the data-download bottleneck more evident, placing substantial burdens on the research community to keep pace. As a result, the search for alternative approaches to the traditional “download and analyze” paradigm on local computing resources has led to a rapidly growing demand for cloud-computing solutions for genomics analysis. Here, we introduce the Genome Variant Investigation Platform (GenomeVIP), an open-source framework for performing genomics variant discovery and annotation using cloud- or local high-performance computing infrastructure. GenomeVIP orchestrates the analysis of whole-genome and exome sequence data using a set of robust and popular task-specific tools, including VarScan, GATK, Pindel, BreakDancer, Strelka, and Genome STRiP, through a web interface. GenomeVIP has been used for genomic analysis in large-data projects such as the TCGA PanCanAtlas and in other projects, such as the ICGC Pilots, CPTAC, ICGC-TCGA DREAM Challenges, and the 1000 Genomes SV Project. Here, we demonstrate GenomeVIP's ability to provide high-confidence annotated somatic, germline, and de novo variants of potential biological significance using publicly available data sets.

Crossref

Digital Commons@Becker

Contribution of retrotransposition to developmental disorders.

Author: Chandler Kate E
Clement Emma
Danecek Petr
Firth Helen V
FitzPatrick David R
Gallone Giuseppe
Gardner Eugene J
Gerety Sebastian S
Handsaker Juliet
Hurles Matthew E
Ironfield Holly
Lachlan Katherine L
Prescott Katrina
Prigmore Elena
Rosser Elisabeth
Samocha Kaitlin E
Short Patrick J
Sifrim Alejandro
Singh Tarjinder
Publication venue: Nat Commun
Publication date: 01/12/2019

Mobile genetic elements (MEs) are segments of DNA which can copy themselves and other transcribed sequences through the process of retrotransposition (RT). In humans, several disorders have been attributed to RT, but the role of RT in severe developmental disorders (DD) has not yet been explored. Here we identify RT-derived events in 9738 exome sequenced trios with DD-affected probands. We ascertain 9 de novo MEs, 4 of which are likely causative of the patient's symptoms (0.04%), as well as 2 de novo gene retroduplications. Beyond identifying likely diagnostic RT events, we estimate genome-wide germline ME mutation rate and selective constraint and demonstrate that coding RT events have signatures of purifying selection equivalent to those of truncating mutations. Overall, our analysis represents a comprehensive interrogation of the impact of retrotransposition on protein coding genes and a framework for future evolutionary and disease studies.

Southampton (e-Prints Soton)

Apollo (Cambridge)


Mapping Copy Number Variation by Population Scale Genome Sequencing

Genomic structural variants (SVs) are abundant in humans, differing from other forms of variation in extent, origin and functional impact. Despite progress in SV characterization, the nucleotide resolution architecture of most SVs remains unknown. We constructed a map of unbalanced SVs (that is, copy number variants) based on whole genome DNA sequencing data from 185 human genomes, integrating evidence from complementary SV discovery approaches with extensive experimental validations. Our map encompassed 22,025 deletions and 6,000 additional SVs, including insertions and tandem duplications. Most SVs (53%) were mapped to nucleotide resolution, which facilitated analysing their origin and functional impact. We examined numerous whole and partial gene deletions with a genotyping approach and observed a depletion of gene disruptions amongst high frequency deletions. Furthermore, we observed differences in the size spectra of SVs originating from distinct formation mechanisms, and constructed a map of SV hotspots formed by common mechanisms. Our analytical framework and SV map serve as a resource for sequencing-based association studies.

Harvard University - DASH

Processing and analyzing multiple genomes alignments with MafFilter

Author: A Scally
Aaron E. Darling
CC Chang
Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, Handsaker RE, Lunter G, Marth GT, Sherry ST, McVean G, Durbin R, 1000 Genomes Project Analysis Group
DG Higgins
EH Stukenbrock
J Casper
J Felsenstein
JB Lack
K Katoh
K Prüfer
L Duret
M Blanchette
M Hasegawa
M Slatkin
O Gascuel
S Guindon
S Kurtz
S Myers
S Schiffels
S Schwartz
SM Kiełbasa
SV Angiuoli
Publication venue: Springer Science and Business Media LLC
Publication date: 24/01/2020

As the number of available genome sequences from both closely related species and individuals within species increased, theoretical and methodological convergences between the fields of phylogenomics and population genomics emerged. Population genomics typically focuses on the analysis of variants, while phylogenomics heavily relies on genome alignments. However, these are playing an increasingly important role in studies at the population level. Multiple genome alignments of individuals are used when structural variation is of primary interest and when genome architecture permits to assemble de novo genome sequences. Here I describe MafFilter, a command-line-driven program allowing to process genome alignments in the Multiple Alignment Format (MAF). Using concrete examples based on publicly available datasets, I demonstrate how MafFilter can be used to develop efficient and reproducible pipelines with quality assurance for downstream analyses. I further show how MafFilter can be used to perform both basic and advanced population genomic analyses in order to infer the patterns of nucleotide diversity along genomes.

Crossref

MPG.PuRe

Mutations causing medullary cystic kidney disease type 1 lie in a large VNTR in MUC1 missed by massively parallel sequencing

Author: A Fuchshuber
AF Castro
AJ Bleyer
Andreas Gnirke
Andrew Kirby
Anthony J Bleyer
AP Spicer
Aviv Regev
AW Horne
Brendan Blumenstiel
Carrie Sougnez
Chad Nusbaum
Christine Stevens
Chun Ye
Corinne Antignac
Daniel Aird
Danielle Perrin
David B Jaffe
E Lander
Edward Kelliher
Elizabeth Rossin
Eric S Lander
F Levitin
GR Abecasis
Helena Hůlková
Irit Gat-Viks
James T Robinson
Jana Sovová
JC Fowler
JM Korn
K Christodoulou
Kerstin Lindblad-Toh
KI Al-Romaih
Kristian Cibulskis
M Auranen
M Brayman
M Choi
M Legendre
Mark J Daly
Martin R Pollak
Matthew DeFelice
Melissa Parkin
Michael C Zody
Mitchell Guttman
Moran N Cabili
MT Wolf
MTF Wolf
Nathalie Pochet
P Suzanne Hart
Petr Vylet'al
R Gemayel
Ramnik J Xavier
RE Handsaker
Riza Daza
RL Kiser
Robert E Handsaker
S Purcell
Scott Steelman
Seth L Alper
Snaevar Sigurdsson
Stacey Gabriel
Stanislav Kmoch
Steven J Scheinman
Todd Green
Veronika Barešová
Publication venue: Springer Science and Business Media LLC
Publication date: 01/05/2012

Although genetic lesions responsible for some mendelian disorders can be rapidly discovered through massively parallel sequencing of whole genomes or exomes, not all diseases readily yield to such efforts. We describe the illustrative case of the simple mendelian disorder medullary cystic kidney disease type 1 (MCKD1), mapped more than a decade ago to a 2-Mb region on chromosome 1. Ultimately, only by cloning, capillary sequencing and de novo assembly did we find that each of six families with MCKD1 harbors an equivalent but apparently independently arising mutation in sequence markedly under-represented in massively parallel sequencing data: the insertion of a single cytosine in one copy (but a different copy in each family) of the repeat unit comprising the extremely long (~1.5–5 kb), GC-rich (>80%) coding variable-number tandem repeat (VNTR) sequence in the MUC1 gene encoding mucin 1. These results provide a cautionary tale about the challenges in identifying the genes responsible for mendelian, let alone more complex, disorders through massively parallel sequencing.

Funding: National Institutes of Health (U.S.) (Intramural Research Program); National Human Genome Research Institute (U.S.); Charles University (program UNCE 204011); Charles University (program PRVOUK-P24/LF1/3); Czech Republic. Ministry of Education, Youth, and Sports (grant NT13116-4/2012); Czech Republic. Ministry of Health (grant NT13116-4/2012); Czech Republic. Ministry of Health (grant LH12015); National Institutes of Health (U.S.) (Harvard Digestive Diseases Center, grant DK34854)

DSpace@MIT

Crossref

Ghent University Academic Bibliography

PubMed Central

eScholarship - University of California

Using population admixture to help complete maps of the human genome

Author: A Kong
A Sırmacı
AG Hinch
AL Price
Alkes L Price
Amelia M Lindgren
AP Reiner
Bogdan Pasaniuc
C Alkan
CA Winkler
Cynthia C Morton
D Botstein
D Reich
D Wegmann
DA Benson
David Reich
DM Church
DP Ryan
EE Eichler
ES Lander
G Golfier
Giulio Genovese
H Donis-Keller
H Lango Allen
H Li
H Stefansson
HA Taylor Jr.
HC Mefford
Heng Li
J Christiansen
J Martin
J Weissenbach
J Zhang
JA Bailey
James G Wilson
JC Venter
JI Kim
JK Pickrell
JM Kidd
JM Korn
JT Robinson
K Musunuru
Kimberly Chambert
M Guipponi
M Ruault
MA DePristo
Martin R Pollak
MF Seldin
MM Mahtani
MY Dennis
N Brunetti-Pierri
NA Doggett
Nicolas Altemose
PH Sudmant
R Li
R Lyle
RE Handsaker
Robert E Handsaker
RV Samonte
S Gnerre
S Kirsch
S Levy
Steven A McCarroll
X She
YS Ju
Publication venue: Springer Science and Business Media LLC
Publication date: 01/04/2013

Tens of millions of base pairs of euchromatic human genome sequence, including many protein-coding genes, have no known location in the human genome. We describe an approach for localizing the human genome's missing pieces by utilizing the patterns of genome sequence variation created by population admixture. We mapped the locations of 70 scaffolds spanning four million base pairs of the human genome's unplaced euchromatic sequence, including more than a dozen protein-coding genes, and identified eight large novel inter-chromosomal segmental duplications. We find that most of these sequences are hidden in the genome's heterochromatin, particularly its pericentromeric regions. Many cryptic, pericentromeric genes are expressed in RNA and have been maintained intact for millions of years while their expression patterns diverged from those of paralogous genes elsewhere in the genome. We describe how knowledge of the locations of these sequences can inform disease association and genome biology studies.

Crossref

Harvard University - DASH

PubMed Central

eScholarship - University of California

The University of Manchester - Institutional Repository

Population genetic analysis of bi-allelic structural variants from low-coverage sequence data with an expectation-maximization algorithm

Author: A Abyzov
A Martínez-Fundichely
AR Quinlan
BS Weir
C Stewart
CA Buerkle
Cristina Aguado
CW Whelan
David Vicente-Salvador
E Gazave
E Karakoc
ES Lander
F Hormozdiari
G Bhatia
GR Abecasis
H Li
H Shao
HYK Lam
J Berglund
J Wang
JC Venter
JJ Michaelson
JM Kidd
José Ignacio Lucas-Lledó
K Chen
KJ McKernan
M Cáceres
M Muñoz Amatriaín
M Nei
Mario Cáceres
PD Keightley
PH Sudmant
R Li
R Nielsen
R Xi
RB Corbett-Detig
RE Handsaker
RE Mills
S Girirajan
S Levy
SM Ahn
SS Sindi
SY Kim
T Zichner
V Guryev
W Huang
X Li
Y Wang
Z Yang
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2014

Background: Population genetics and association studies usually rely on a set of known variable sites that are then genotyped in subsequent samples, because it is easier to genotype than to discover the variation. This is also true for structural variation detected from sequence data. However, the genotypes at known variable sites can only be inferred with uncertainty from low coverage data. Thus, statistical approaches that infer genotype likelihoods, test hypotheses, and estimate population parameters without requiring accurate genotypes are becoming popular. Unfortunately, the current implementations of these methods are intended to analyse only single nucleotide and short indel variation, and they usually assume that the two alleles in a heterozygous individual are sampled with equal probability. This is generally false for structural variants detected with paired ends or split reads. Therefore, the population genetics of structural variants cannot be studied, unless a painstaking and potentially biased genotyping is performed first. Results: We present svgem, an expectation-maximization implementation to estimate allele and genotype frequencies, calculate genotype posterior probabilities, and test for Hardy-Weinberg equilibrium and for population differences, from the numbers of times the alleles are observed in each individual. Although applicable to single nucleotide variation, it aims at bi-allelic structural variation of any type, observed by either split reads or paired ends, with arbitrarily high allele sampling bias. We test svgem with simulated and real data from the 1000 Genomes Project. Conclusions: svgem makes it possible to use low-coverage sequencing data to study the population distribution of structural variants without having to know their genotypes. Furthermore, this advance allows the combined analysis of structural and nucleotide variation within the same genotype-free statistical framework, thus preventing biases introduced by genotype imputation.
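The idea of estimating genotype frequencies from per-individual allele counts without calling genotypes can be illustrated with a generic expectation-maximization loop. The Python sketch below is not svgem's implementation. It assumes a simple binomial read model in which `het_alt_prob` (the chance that an observation from a heterozygote shows the alternate allele, 0.5 when sampling is unbiased) and `error` are hypothetical, user-supplied constants, and all function names are illustrative.

```python
from typing import List, Sequence, Tuple


def genotype_likelihoods(ref_count: int, alt_count: int,
                         het_alt_prob: float = 0.5,
                         error: float = 0.01) -> List[float]:
    """Likelihood of the counts given 0, 1 or 2 copies of the alternate allele.

    The binomial coefficient is omitted because it is the same for every
    genotype and cancels in the posterior.
    """
    alt_probs = (error, het_alt_prob, 1.0 - error)
    return [p ** alt_count * (1.0 - p) ** ref_count for p in alt_probs]


def em_genotype_frequencies(counts: Sequence[Tuple[int, int]],
                            het_alt_prob: float = 0.5,
                            error: float = 0.01,
                            iterations: int = 200,
                            tol: float = 1e-10) -> List[float]:
    """Estimate population genotype frequencies from (ref, alt) counts."""
    liks = [genotype_likelihoods(r, a, het_alt_prob, error) for r, a in counts]
    freqs = [1.0 / 3.0] * 3  # uninformative starting point
    for _ in range(iterations):
        totals = [0.0, 0.0, 0.0]
        for lik in liks:
            # E-step: posterior genotype probabilities for this individual.
            weighted = [f * l for f, l in zip(freqs, lik)]
            norm = sum(weighted)
            for g in range(3):
                totals[g] += weighted[g] / norm
        # M-step: new frequencies are the mean posterior probabilities.
        new_freqs = [t / len(liks) for t in totals]
        converged = max(abs(n - o) for n, o in zip(new_freqs, freqs)) < tol
        freqs = new_freqs
        if converged:
            break
    return freqs


def alt_allele_frequency(freqs: Sequence[float]) -> float:
    """Alternate allele frequency implied by the genotype frequencies."""
    return 0.5 * freqs[1] + freqs[2]


if __name__ == "__main__":
    # Hypothetical low-coverage data: (ref observations, alt observations).
    data = [(4, 0)] * 5 + [(2, 2)] * 3 + [(0, 4)] * 2
    estimate = em_genotype_frequencies(data)
    print(estimate, alt_allele_frequency(estimate))
```

Setting `het_alt_prob` away from 0.5 is one simple way to represent the unequal allele sampling that the abstract describes for paired-end or split-read evidence. How svgem itself models that bias is not stated in the text above.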

Crossref

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Springer - Publisher Connector

Repositori d'Objectes Digitals per a l'Ensenyament la Recerca i la Cultura

PubMed Central

Diposit Digital de Documents de la UAB