
655 research outputs found

Genotype imputation using the Positional Burrows Wheeler Transform.

Author: Delaneau O.
Marchini J.
Rubinacci S.
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 01/11/2020

Genotype imputation is the process of predicting unobserved genotypes in a sample of individuals using a reference panel of haplotypes. In the last 10 years reference panels have increased in size more than 100-fold. Increasing reference panel size improves accuracy at markers with low minor allele frequencies but poses ever-increasing computational challenges for imputation methods. Here we present IMPUTE5, a genotype imputation method that can scale to reference panels with millions of samples. This method continues to refine the observation made in the IMPUTE2 method, that accuracy is optimized via use of a custom subset of haplotypes when imputing each individual. It achieves fast, accurate, and memory-efficient imputation by selecting haplotypes using the Positional Burrows Wheeler Transform (PBWT). By using the PBWT data structure at genotyped markers, IMPUTE5 identifies locally best-matching haplotypes and long identical-by-state segments. The method then uses the selected haplotypes as conditioning states within the IMPUTE model. Using the HRC reference panel, which has ∼65,000 haplotypes, we show that IMPUTE5 is up to 30x faster than MINIMAC4 and up to 3x faster than BEAGLE5.1, and uses less memory than both these methods. Using simulated reference panels we show that IMPUTE5 scales sub-linearly with reference panel size. For example, keeping the number of imputed markers constant, increasing the reference panel size from 10,000 to 1 million haplotypes requires less than twice the computation time. As the reference panel increases in size, IMPUTE5 is able to utilize a smaller number of reference haplotypes, thus reducing computational cost.
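The PBWT-based selection described in this abstract can be illustrated with a minimal sketch. It is not the IMPUTE5 implementation: the function names `pbwt_orders` and `neighbours` are hypothetical, haplotypes are assumed to be lists of 0/1 alleles, and the target haplotype is assumed to sit inside the panel for simplicity. The sketch builds the positional prefix order site by site, then reads off the haplotypes adjacent to the target in that order, which are the ones sharing the longest matches ending at the current site.

```python
def pbwt_orders(haps):
    """Return the PBWT haplotype order after each site.

    After site k, haplotypes are sorted by their reversed prefixes ending
    at k, so neighbours in the order share long matches ending at k.
    """
    n_sites = len(haps[0]) if haps else 0
    order = list(range(len(haps)))
    orders = []
    for k in range(n_sites):
        # Stable partition: carriers of allele 0 first, then allele 1,
        # each group keeping its previous relative order.
        zeros = [h for h in order if haps[h][k] == 0]
        ones = [h for h in order if haps[h][k] == 1]
        order = zeros + ones
        orders.append(order)
    return orders


def neighbours(order, hap, width):
    """Pick up to `width` haplotypes on each side of `hap` in a PBWT order."""
    i = order.index(hap)
    lo = max(0, i - width)
    hi = min(len(order), i + width + 1)
    return [h for h in order[lo:hi] if h != hap]


# Toy panel: haplotype 3 is identical to haplotype 0.
haps = [[0, 1, 0], [1, 1, 0], [0, 1, 1], [0, 1, 0]]
orders = pbwt_orders(haps)
print(neighbours(orders[-1], 0, 1))
```

Each site costs one linear pass over the panel, and the candidate set for a target is read from a small window of the order rather than from a scan of every haplotype.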

Serveur académique lausannois

Directory of Open Access Journals

Phasing for medical sequencing using rare variants and large haplotype reference panels.

Author: Delaneau O
Kretzschmar W
Marchini J
Sharp K
Publication venue: 'Oxford University Press (OUP)'
Publication date: 01/01/2016

Motivation: There is growing recognition that estimating haplotypes from high-coverage sequencing of single samples in clinical settings is an important problem. At the same time, very large datasets consisting of tens and hundreds of thousands of high-coverage sequenced samples will soon be available. We describe a method that takes advantage of these huge human genetic variation resources and rare variant sharing patterns to estimate haplotypes on single sequenced samples. Sharing rare variants between two individuals is more likely to arise from a recent common ancestor and, hence, also more likely to indicate similar shared haplotypes over a substantial flanking region of sequence. Results: Our method exploits this idea to select a small set of highly informative copying states within a Hidden Markov Model (HMM) phasing algorithm. Using rare variants in this way allows us to avoid iterative MCMC methods to infer haplotypes. Compared to other approaches that do not explicitly use rare variants, we obtain significant gains in phasing accuracy, less variation over phasing runs and improvements in speed. For example, using a reference panel of 7420 haplotypes from the UK10K project, we are able to reduce switch error rates by up to 50% when phasing samples sequenced at high coverage. In addition, a single-step rephasing of the UK10K panel, using rare variant information, has a downstream impact on phasing performance. These results represent a proof of concept that rare variant sharing patterns can be utilized to phase large high-coverage sequencing studies such as the 100,000 Genomes Project dataset.
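The idea of choosing copying states from rare variant sharing can be sketched as follows. This is an illustration under stated assumptions, not the authors' algorithm: `select_copying_states`, `max_carriers` and `k` are hypothetical names, the target is reduced to a vector marking which alternate alleles it carries, and "rare" simply means that at most `max_carriers` panel haplotypes carry the allele.

```python
def select_copying_states(target_carries, panel, max_carriers=2, k=2):
    """Rank panel haplotypes by the number of rare alleles shared with the target.

    target_carries[s] is 1 if the target carries the alternate allele at site s.
    panel is a list of haplotypes, each a list of 0/1 alleles.
    """
    scores = {}
    for s, carried in enumerate(target_carries):
        if carried != 1:
            continue
        carriers = [i for i, hap in enumerate(panel) if hap[s] == 1]
        # Only alleles seen on a handful of panel haplotypes count as rare.
        if 0 < len(carriers) <= max_carriers:
            for i in carriers:
                scores[i] = scores.get(i, 0) + 1
    # Highest score first; ties broken by haplotype index for determinism.
    ranked = sorted(scores, key=lambda i: (-scores[i], i))
    return ranked[:k]


panel = [
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 1],
    [0, 1, 0, 0],
]
print(select_copying_states([1, 0, 0, 1], panel))
```

The selected indices would then serve as the restricted state space of an HMM, which is where the speed gain described in the abstract would come from.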

Crossref

Serveur académique lausannois

PubMed Central

Oxford University Research Archive

Expression estimation and eQTL mapping for HLA genes with a personalized pipeline.

Author: Aguiar VRC
César J.
Delaneau O.
Dermitzakis E.T.
Meyer D.
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 01/01/2019

The HLA (Human Leukocyte Antigens) genes are well-documented targets of balancing selection, and variation at these loci is associated with many disease phenotypes. Variation in expression levels also influences disease susceptibility and resistance, but little information exists about the regulation and population-level patterns of expression. This results from the difficulty in mapping short reads originating from these highly polymorphic loci, and in accounting for the existence of several paralogues. We developed a computational pipeline to accurately estimate expression for HLA genes based on RNA-seq, improving both locus-level and allele-level estimates. First, reads are aligned to all known HLA sequences in order to infer HLA genotypes, then quantification of expression is carried out using a personalized index. We use simulations to show that expression estimates obtained in this way are not biased due to divergence from the reference genome. We applied our pipeline to the GEUVADIS dataset, and compared the quantifications to those obtained with a reference transcriptome. Although the personalized pipeline recovers more reads, we found that using the reference transcriptome produces estimates similar to the personalized pipeline (r ≥ 0.87) with the exception of HLA-DQA1. We describe the impact of the HLA-personalized approach on downstream analyses for nine classical HLA loci (HLA-A, HLA-C, HLA-B, HLA-DRA, HLA-DRB1, HLA-DQA1, HLA-DQB1, HLA-DPA1, HLA-DPB1). Although the influence of the HLA-personalized approach is modest for eQTL mapping, the p-values and the causality of the eQTLs obtained are better than when the reference transcriptome is used. We investigate how the eQTLs we identified explain variation in expression among lineages of HLA alleles. Finally, we discuss possible causes underlying differences between expression estimates obtained using RNA-seq, antibody-based approaches and qPCR.

Serveur académique lausannois

Directory of Open Access Journals

The Francis Crick Institute

Archive ouverte UNIGE

Accuracy of haplotype estimation and whole genome imputation affects complex trait analyses in complex biobanks.

Author: Appadurai V.
Buil A.
Bybjerg-Grauholm J.
Børglum A.D.
Delaneau O.
Hougaard D.M.
Ingason A.
Krebs M.D.
Mors O.
Mortensen P.B.
Nordentoft M.
Rosengren A.
Schork A.J.
Werge T.
Publication date: 01/01/2023

Sample recruitment for research consortia, biobanks, and personal genomics companies spans years, necessitating genotyping in batches, using different technologies. As marker content on genotyping arrays varies, integrating such datasets is non-trivial and its impact on haplotype estimation (phasing) and whole genome imputation, necessary steps for complex trait analysis, remains under-evaluated. Using the iPSYCH dataset, comprising 130,438 individuals, genotyped in two stages, on different arrays, we evaluated phasing and imputation performance across multiple phasing methods and data integration protocols. While phasing accuracy varied by choice of method and data integration protocol, imputation accuracy varied mostly between data integration protocols. We demonstrate an attenuation in imputation accuracy within samples of non-European origin, highlighting challenges to studying complex traits in diverse populations. Finally, imputation errors can bias association tests and reduce the predictive utility of polygenic scores. Carefully optimized data integration strategies enhance accuracy and replicability of complex trait analyses in complex biobanks.

Serveur académique lausannois

PubMed Central

Copenhagen University Research Information System

Scanning and filling : ultra-dense SNP genotyping combining genotyping-by-sequencing, SNP array and whole-genome resequencing data

Author: AE Lipka
B Howie
BN Howie
D Ellinghaus
D Jarquín
Davoud Torkamaneh
Francois Belzile
H Li
H Sonah
HD Daetwyler
J Crossa
J Poland
J Schmutz
J Zheng
JE Rutkoski
K Hao
KG Ardlie
LR Porto-Neto
M Wang
MA Gore
MD Donato
MH Santana
Nicholas A. Tinker
NT Ha
O Delaneau
P Scheet
Q Song
Q Zhu
RJ Elshire
S Browning
S He
S Kim
S Purcell
S Shifman
X Huang
X Xu
Y Li
YB Fu
YF Pei
Z Yang
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 10/07/2015

Genotyping-by-sequencing (GBS) represents a highly cost-effective high-throughput genotyping approach. By nature, however, GBS is subject to generating sizeable amounts of missing data and these will need to be imputed for many downstream analyses. The extent to which such missing data can be tolerated in calling SNPs has not been explored widely. In this work, we first explore the use of imputation to fill in missing genotypes in GBS datasets. Importantly, we use whole genome resequencing data to assess the accuracy of the imputed data. Using a panel of 301 soybean accessions, we show that over 62,000 SNPs could be called when tolerating up to 80% missing data, a five-fold increase over the number called when tolerating up to 20% missing data. At all levels of missing data examined (between 20% and 80%), the resulting SNP datasets were of uniformly high accuracy (96–98%). We then used imputation to combine complementary SNP datasets derived from GBS and a SNP array (SoySNP50K). We thus produced an enhanced dataset of >100,000 SNPs and the genotypes at the previously untyped loci were again imputed with a high level of accuracy (95%). Of the >4,000,000 SNPs identified through resequencing 23 accessions (among the 301 used in the GBS analysis), 1.4 million tag SNPs were used as a reference to impute this large set of SNPs on the entire panel of 301 accessions. These previously untyped loci could be imputed with around 90% accuracy. Finally, we used the 100K SNP dataset (GBS + SoySNP50K) to perform a GWAS on seed oil content within this collection of soybean accessions. Both the number of significant marker-trait associations and the peak significance levels were improved considerably using this enhanced catalog of SNPs relative to a smaller catalog resulting from GBS alone at 20% missing data.
Our results demonstrate that imputation can be used to fill in both missing genotypes and untyped loci with very high accuracy and that this leads to more powerful genetic analyses.
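The two quantities this abstract reports, the number of SNPs retained at a given missing-data tolerance and the accuracy of imputed calls against resequencing data, can be sketched as below. The function names `filter_by_missing` and `concordance` are hypothetical, genotypes are assumed to be coded as allele counts with `None` for a missing call, and accuracy is assumed to mean simple genotype concordance, which may differ from the measure used in the paper.

```python
def filter_by_missing(genotypes, max_missing):
    """Keep SNPs whose fraction of missing calls is at most `max_missing`.

    genotypes maps a SNP identifier to a list of calls (None = missing).
    """
    kept = []
    for snp, calls in genotypes.items():
        missing = sum(c is None for c in calls) / len(calls)
        if missing <= max_missing:
            kept.append(snp)
    return kept


def concordance(imputed, truth):
    """Fraction of imputed calls that match the truth, where truth is known."""
    pairs = [(i, t) for i, t in zip(imputed, truth) if t is not None]
    return sum(i == t for i, t in pairs) / len(pairs)


genotypes = {
    "snp1": [0, 1, None, 2, 0],
    "snp2": [None, None, None, 1, None],
    "snp3": [None, None, None, None, 2],
}
print(filter_by_missing(genotypes, 0.2))
print(filter_by_missing(genotypes, 0.8))
print(concordance([0, 1, 1, 2], [0, 1, 2, 2]))
```

Raising the tolerance from 0.2 to 0.8 admits more SNPs, which is the trade-off the abstract evaluates by checking that accuracy stays high after imputation.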

Crossref

Directory of Open Access Journals

PubMed Central

CorpusUL

Genome-wide association study identifies loci associated with liability to alcohol and drug dependence that is associated with variability in reward-related ventral striatum activity in African- and European-Americans.

Author: Agrawal A
Almasy L
Andrews MM
Association AP
Barbeira AN
Barrett JC
Battle A
Biernacka JM
Bierut LJ
Blum K
Boyle AP
Bucholz KK
Bulik‐Sullivan BK
Büchel C
Canela‐Xandri O
Carey CE
Chen MH
Das S
Degenhardt L
Delaneau O
Delgado MR
Dick DM
Duncan LE
Edenberg HJ
Falk D
Gamazon ER
Gelernter J
Grant BF
Grucza RA
Han S
Hancock DB
Hariri AR
Hasin DS
Heitzeg MM
Hesselbrock M
Kanai M
Kendler KS
Koob GF
Lopez‐Quintero C
Luczak SE
Marquez‐Luna C
Martin AR
Martin‐Soelch C
Merikangas KR
Meyers JL
Need AC
Nelson EC
Nikolova YS
O'Connell JR
Pasman JA
Peacock A
Pizzagalli DA
Reich T
Rentzsch P
Sartor CE
Schuckit MA
Sherva R
Sobota RS
Spear LP
Stanaway JD
Tsuang MT
Verhulst B
Volkow ND
Vrieze SI
Walters RK
Watanabe K
Wetherill L
Willer CJ
Wu L‐T
Yang J
Zhou H
Publication venue: eScholarship, University of California
Publication date: 19/05/2019

Genetic influences on alcohol and drug dependence partially overlap; however, the specific loci underlying this overlap remain unclear. We conducted a genome-wide association study (GWAS) of a phenotype representing alcohol or illicit drug dependence (ANYDEP) among 7291 European-Americans (EA; 2927 cases) and 3132 African-Americans (AA; 1315 cases) participating in the family-based Collaborative Study on the Genetics of Alcoholism. ANYDEP was heritable (h² in EA = 0.60, AA = 0.37). The AA GWAS identified three regions with genome-wide significant (GWS; P < 5E-08) single nucleotide polymorphisms (SNPs) on chromosomes 3 (rs34066662, rs58801820) and 13 (rs75168521, rs78886294), and an insertion-deletion on chromosome 5 (chr5:141988181). No polymorphisms reached GWS in the EA. One GWS region (chromosome 1: rs1890881) emerged from a trans-ancestral meta-analysis (EA + AA) of ANYDEP, and was attributable to alcohol dependence in both samples. Four genes (AA: CRKL, DZIP3, SBK3; EA: P2RX6) and four sets of genes were significantly enriched within biological pathways for hemostasis and signal transduction. GWS signals did not replicate in two independent samples but there was weak evidence for association between rs1890881 and alcohol intake in the UK Biobank. Among 118 AA and 481 EA individuals from the Duke Neurogenetics Study, rs75168521 and rs1890881 genotypes were associated with variability in reward-related ventral striatum activation. This study identified novel loci for substance dependence and provides preliminary evidence that these variants are also associated with individual differences in neural reward reactivity. Gene discovery efforts in non-European samples with distinct patterns of substance use may lead to the identification of novel ancestry-specific genetic markers of risk.

Crossref

IUPUIScholarWorks

eScholarship - University of California

Scans for signatures of selection in Russian cattle breed genomes reveal new candidate genes for environmental adaptation and acclimation

Author: A Talenti
A Yurchenko
A Zrhidri
AGT Pereira
AK Lindholm-Perry
AR Boyko
AS Wilkins
B Cannon
B Dorshorst
B Grisart
B Haase
B Loureiro
BG Oliver
BS Weir
CB Kaelin
D Boruszewska
D Wright
D Yang
DR Schrider
EA Ostrander
EM Ibeagha-Awemu
F Li
F Schlamp
F Tajima
FB Axelrod
G Valverde
H Li
H Mannen
H Pausch
H Yamada
H Zhang
HD Daetwyler
HP Jedema
I Kurth
I Mathieson
I Naka
I Urbinati
J Kim
J Martin-Tereso
J Queiros
JD Jensen
JD Storey
JE Decker
JJ Simoni Gouveia de
JK Pickrell
K Kim
K Konczol
K Soini
K Wimmers
KC Wollenberg Valero
KE Lotterhos
L Ma
LA Raven
M Cohen-Zinder
M Knoll
M Nei
M Nizon
M Saatchi
MI Fariello
MJ Emmett
MN Weedon
MR Upadhyay
MRS Fortes
NA Mandal
O Delaneau
O Tange
P Danecek
P Scheet
Q Qiu
QL Meng
R Verity
R Weikard
R Xiang
RL Minster
RR Mota
S Boitard
S Bolormaa
S Bongiorni
S Fan
S Makvandi-Nejad
S Moon
S Purcell
S Roth
S Roy
S Sasaki
S Wu
SD Berry
SH Carroll
SJ Yue
SR Grossman
T Iso-Touru
T Nishimaki
TY Yeh
W Barendse
X Zheng
XL Wang
Y Gao
Y Liu
Y Ma
Y Qin
Y Wang
YT Utsunomiya
Z Gu
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/08/2018

Domestication and selective breeding have resulted in over 1000 extant cattle breeds. Many of these breeds do not excel in important traits but are adapted to local environments. These adaptations are a valuable source of genetic material for efforts to improve commercial breeds. As a step toward this goal we identified candidate regions under selection in the genomes of nine Russian native cattle breeds adapted to survive in harsh climates. After comparing our data to other breeds of European and Asian origins we found known and novel candidate genes that could potentially be related to domestication, economically important traits and environmental adaptations in cattle. The Russian cattle breed genomes contained regions under putative selection with genes that may be related to adaptations to harsh environments (e.g., AQP5, RAD50, and RETREG1). We found genomic signatures of selective sweeps near key genes related to economically important traits, such as milk production (e.g., DGAT1, ABCG2), growth (e.g., XKR4), and reproduction (e.g., CSF2). Our data point to candidate genes which should be included in future studies attempting to identify genes to improve the extant breeds and facilitate generation of commercial breeds that fit better into the environments of Russia and other countries with similar climates.

Crossref

ZENODO

Directory of Open Access Journals

Dryad Digital Repository (Duke University)

Electronic Archiving System

NEUROSURGERY ENTHUSIASTIC WOMEN SOCIETY

Enlighten

Improved Statistics for Genome-Wide Interaction Analysis

Author: A Brown
B Mukherjee
CR Weinberg
DG Clayton
DJ Balding
DV Zaykin
E Zeggini
GF Mells
Heather J. Cordell
HJ Cordell
J Chapman
J Ciampa
J Siemiatycki
J Todd
J Yang
JC Barrett
JL McClay
Masao Ueki
N Chatterjee
Nicholas J. Schork
O Delaneau
P Kraft
P Sasieni
PC Phillips
Q Yang
R Lewontin
S Bhattacharjee
S Purcell
S Wellek
T Kam-Thong
TA Manolio
TM Frayling
WD Thompson
WJ Gauderman
WW Piegorsch
X Wang
X Wu
Publication venue: Public Library of Science
Publication date: 05/04/2012

Recently, Wu and colleagues [1] proposed two novel statistics for genome-wide interaction analysis using case/control or case-only data. In computer simulations, their proposed case/control statistic outperformed competing approaches, including the fast-epistasis option in PLINK and logistic regression analysis under the correct model; however, reasons for its superior performance were not fully explored. Here we investigate the theoretical properties and performance of Wu et al.'s proposed statistics and explain why, in some circumstances, they outperform competing approaches. Unfortunately, we find minor errors in the formulae for their statistics, resulting in tests that have higher than nominal type 1 error. We also find minor errors in PLINK's fast-epistasis and case-only statistics, although theory and simulations suggest that these errors have only negligible effect on type 1 error. We propose adjusted versions of all four statistics that, both theoretically and in computer simulations, maintain correct type 1 error rates under the null hypothesis. We also investigate statistics based on correlation coefficients that maintain similar control of type 1 error. Although designed to test specifically for interaction, we show that some of these previously-proposed statistics can, in fact, be sensitive to main effects at one or both loci, particularly in the presence of linkage disequilibrium. We propose two new “joint effects” statistics that, provided the disease is rare, are sensitive only to genuine interaction effects. In computer simulations we find, in most situations considered, that highest power is achieved by analysis under the correct genetic model. Such an analysis is unachievable in practice, as we do not know this model. However, generally high power over a wide range of scenarios is exhibited by our joint effects and adjusted Wu statistics. 
We recommend use of these alternative or adjusted statistics and urge caution when using Wu et al.'s originally-proposed statistics, on account of the inflated error rate that can result.
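As a rough illustration of a correlation-based interaction test of the kind this abstract mentions, the sketch below compares the between-locus genotype correlation in cases with that in controls using the standard Fisher z-transformation. This is a generic textbook comparison of two correlations, not the adjusted statistics proposed in the paper. The names `pearson` and `correlation_difference_z` are hypothetical, and genotypes are assumed to be coded as allele counts 0, 1 or 2.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_difference_z(case_a, case_b, ctrl_a, ctrl_b):
    """Z-score for the difference in between-locus correlation, cases vs controls.

    Uses Fisher's z-transformation (atanh) of each correlation; the result is
    approximately standard normal when the two correlations are equal.
    """
    z_case = math.atanh(pearson(case_a, case_b))
    z_ctrl = math.atanh(pearson(ctrl_a, ctrl_b))
    se = math.sqrt(1 / (len(case_a) - 3) + 1 / (len(ctrl_a) - 3))
    return (z_case - z_ctrl) / se


# Toy data: the two loci are positively correlated in cases only.
case_a = [0, 1, 2, 1, 0, 2, 1, 2]
case_b = [0, 1, 2, 2, 0, 1, 1, 2]
ctrl_a = [0, 1, 2, 1, 0, 2, 1, 2]
ctrl_b = [2, 0, 1, 1, 2, 1, 0, 1]
print(correlation_difference_z(case_a, case_b, ctrl_a, ctrl_b))
```

A statistic of this form is undefined when either correlation is exactly ±1 or when a locus is monomorphic in one group, so a real analysis would need to guard against those cases.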

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

The Francis Crick Institute

Methylation QTLs in the developing brain and their enrichment in schizophrenia risk loci

Author: A Ramasamy
AA Shabalin
AE Locke
AF McRae
AH Olsson
AK Maunakea
AL Teh
AP Morris
AW Drong
B Howie
C Giambartolomei
CC Chang
CG Spilianakis
Claire Troakes
CT Ong
DL Nicolae
DR Weinberger
Eilis Hannon
EL Meaburn
ER Gamazon
G Elliott
Gustavo Turecki
H Spiers
H Wang
Helen Spiers
HJ Kang
J Dekker
J Ernst
JC Lambert
Joana Viana
Joe Burrage
Jonathan Mill
JR Gibbs
JR Wagner
KR van Eijk
Leonard C Schalkwyk
M Gutierrez-Arcelus
M Lemire
ME Price
Michael C O'Donovan
MJ Aryee
MJ Hill
MJ Ziller
MT Maurano
Nicholas J Bray
O Delaneau
P Danecek
PA Jones
R Pidsley
R Tao
RC Slieker
Ruth Pidsley
S Purcell
SH Fatemi
Therese M Murphy
YA Chen
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 30/11/2015

We characterized DNA methylation quantitative trait loci (mQTLs) in a large collection (n = 166) of human fetal brain samples spanning 56–166 days post-conception, identifying >16,000 fetal brain mQTLs. Fetal brain mQTLs were primarily cis-acting, enriched in regulatory chromatin domains and transcription factor binding sites, and showed substantial overlap with genetic variants that were also associated with gene expression in the brain. Using tissue from three distinct regions of the adult brain (prefrontal cortex, striatum and cerebellum), we found that most fetal brain mQTLs were developmentally stable, although a subset was characterized by fetal-specific effects. Fetal brain mQTLs were enriched amongst risk loci identified in a recent large-scale genome-wide association study (GWAS) of schizophrenia, a severe psychiatric disorder with a hypothesized neurodevelopmental component. Finally, we found that mQTLs can be used to refine GWAS loci through the identification of discrete sites of variable fetal brain methylation associated with schizophrenia risk variants.

University of Essex Research Repository

Crossref

RD&E Research Repository

Online Research @ Cardiff

PubMed Central

Open Research Exeter

King's Research Portal

Improved imputation accuracy of rare and low-frequency variants using population-specific high-coverage WGS-based imputation reference panel

Genetic imputation is a cost-efficient way to improve the power and resolution of genome-wide association (GWA) studies. Current publicly accessible imputation reference panels accurately predict genotypes for common variants with minor allele frequency (MAF) >= 5% and low-frequency variants (0.5Peer reviewe

University of Liverpool Repository

DSpace@MIT

Crossref

Harvard University - DASH

Helsingin yliopiston digitaalinen arkisto