Genotype imputation using the Positional Burrows Wheeler Transform.

Author: Delaneau O.
Marchini J.
Rubinacci S.
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 01/11/2020

Genotype imputation is the process of predicting unobserved genotypes in a sample of individuals using a reference panel of haplotypes. In the last 10 years reference panels have increased in size by more than 100-fold. Increasing reference panel size improves accuracy of markers with low minor allele frequencies but poses ever-increasing computational challenges for imputation methods. Here we present IMPUTE5, a genotype imputation method that can scale to reference panels with millions of samples. This method continues to refine the observation made in the IMPUTE2 method, that accuracy is optimized via use of a custom subset of haplotypes when imputing each individual. It achieves fast, accurate, and memory-efficient imputation by selecting haplotypes using the Positional Burrows Wheeler Transform (PBWT). By using the PBWT data structure at genotyped markers, IMPUTE5 identifies locally best-matching haplotypes and long identical-by-state segments. The method then uses the selected haplotypes as conditioning states within the IMPUTE model. Using the HRC reference panel, which has ∼65,000 haplotypes, we show that IMPUTE5 is up to 30x faster than MINIMAC4 and up to 3x faster than BEAGLE5.1, and uses less memory than both these methods. Using simulated reference panels we show that IMPUTE5 scales sub-linearly with reference panel size. For example, keeping the number of imputed markers constant, increasing the reference panel size from 10,000 to 1 million haplotypes requires less than twice the computation time. As the reference panel increases in size IMPUTE5 is able to utilize a smaller number of reference haplotypes, thus reducing computational cost.
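
As a rough illustration of the PBWT idea mentioned in this abstract, the sketch below builds the positional prefix ordering for a small binary haplotype panel and reads off the haplotypes adjacent to a target in that ordering, which are the ones sharing the longest match ending at that site. It is a minimal sketch and not IMPUTE5's implementation: the function names `pbwt_orders` and `neighbours`, the `width` parameter, and the choice to store the target as a row of the panel are all assumptions made for illustration.

```python
from typing import List, Sequence


def pbwt_orders(haps: Sequence[Sequence[int]]) -> List[List[int]]:
    """For each site k, return haplotype indices sorted by their reversed
    prefix ending at k (the positional prefix array of the PBWT)."""
    order = list(range(len(haps)))
    orders = []
    n_sites = len(haps[0]) if haps else 0
    for k in range(n_sites):
        # Stable partition: haplotypes carrying allele 0 at site k come first,
        # and the previous relative order is preserved inside each group.
        zeros = [i for i in order if haps[i][k] == 0]
        ones = [i for i in order if haps[i][k] == 1]
        order = zeros + ones
        orders.append(order)
    return orders


def neighbours(order: Sequence[int], target: int, width: int = 1) -> List[int]:
    """Hypothetical helper: haplotypes within `width` positions of `target`
    in a prefix ordering, i.e. candidate locally best-matching haplotypes."""
    pos = list(order).index(target)
    lo = max(0, pos - width)
    hi = min(len(order), pos + width + 1)
    return [h for h in order[lo:hi] if h != target]


# Toy panel: 4 haplotypes (rows) by 3 biallelic sites (columns).
panel = [
    [0, 1, 0],
    [1, 1, 0],
    [0, 1, 1],
    [1, 0, 0],
]
orders = pbwt_orders(panel)
candidates = neighbours(orders[-1], target=0)
```

Because each site only needs one stable partition of the previous ordering, the whole structure is built in time proportional to the number of haplotypes times the number of sites.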

Serveur académique lausannois

Directory of Open Access Journals

Phasing for medical sequencing using rare variants and large haplotype reference panels.

Author: Delaneau O
Kretzschmar W
Marchini J
Sharp K
Publication venue: 'Oxford University Press (OUP)'
Publication date: 01/01/2016

Motivation: There is growing recognition that estimating haplotypes from high coverage sequencing of single samples in clinical settings is an important problem. At the same time very large datasets consisting of tens and hundreds of thousands of high-coverage sequenced samples will soon be available. We describe a method that takes advantage of these huge human genetic variation resources and rare variant sharing patterns to estimate haplotypes on single sequenced samples. Sharing rare variants between two individuals is more likely to arise from a recent common ancestor and, hence, also more likely to indicate similar shared haplotypes over a substantial flanking region of sequence. Results: Our method exploits this idea to select a small set of highly informative copying states within a Hidden Markov Model (HMM) phasing algorithm. Using rare variants in this way allows us to avoid iterative MCMC methods to infer haplotypes. Compared to other approaches that do not explicitly use rare variants we obtain significant gains in phasing accuracy, less variation over phasing runs and improvements in speed. For example, using a reference panel of 7420 haplotypes from the UK10K project, we are able to reduce switch error rates by up to 50% when phasing samples sequenced at high-coverage. In addition, a single step rephasing of the UK10K panel, using rare variant information, has a downstream impact on phasing performance. These results represent a proof of concept that rare variant sharing patterns can be utilized to phase large high-coverage sequencing studies such as the 100,000 Genomes Project dataset.
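
The switch error rate quoted in this abstract can be illustrated with a short sketch. It assumes that both the true and the inferred phase are given as the allele (0 or 1) carried by the first haplotype at each heterozygous site, and it counts how often the phase relation flips between consecutive sites. The function name `switch_error_rate` and this input encoding are assumptions for illustration, not the authors' evaluation code.

```python
from typing import Sequence


def switch_error_rate(truth: Sequence[int], inferred: Sequence[int]) -> float:
    """Fraction of consecutive heterozygous-site pairs whose relative phase
    differs between the true and the inferred haplotype."""
    if len(truth) != len(inferred):
        raise ValueError("truth and inferred must cover the same sites")
    if len(truth) < 2:
        return 0.0  # no consecutive pair, so no switch can occur
    # True where the inferred haplotype carries the opposite allele.
    mismatch = [t != p for t, p in zip(truth, inferred)]
    # A switch is a change in mismatch status between neighbouring sites.
    switches = sum(1 for a, b in zip(mismatch, mismatch[1:]) if a != b)
    return switches / (len(truth) - 1)


rate = switch_error_rate([0, 1, 0, 1, 1], [0, 1, 1, 0, 0])
```

Note that an inferred haplotype that is the exact complement of the truth has no switches under this definition, since the labelling of the two haplotypes is arbitrary.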

Crossref

Serveur académique lausannois

PubMed Central

Oxford University Research Archive

The molecular basis, genetic control and pleiotropic effects of local gene co-expression.

Author: Delaneau O.
Dermitzakis E.T.
Hofmeister R.J.
Ramisch A.
Ribeiro D.M.
Rubinacci S.
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/08/2021

Nearby genes are often expressed as a group. Yet, the prevalence, molecular mechanisms and genetic control of local gene co-expression are far from being understood. Here, by leveraging gene expression measurements across 49 human tissues and hundreds of individuals, we find that local gene co-expression occurs in 13% to 53% of genes per tissue. By integrating various molecular assays (e.g. ChIP-seq and Hi-C), we estimate the ability of several mechanisms, such as enhancer-gene interactions, to distinguish gene pairs that are co-expressed from those that are not. Notably, we identify 32,636 expression quantitative trait loci (eQTLs) which associate with co-expressed gene pairs and often overlap enhancer regions. Because they affect several genes, these eQTLs are more often associated with multiple human traits than other eQTLs. Our study paves the way to comprehend trait pleiotropy and functional interpretation of QTL and GWAS findings. All local gene co-expression identified here is available through a public database (https://glcoex.unil.ch/).

Serveur académique lausannois

Directory of Open Access Journals

PubMed Central

The effect of genetic variation on promoter usage and enhancer activity.

Author: Antonarakis S.E.
Carninci P.
Delaneau O.
Dermitzakis E.T.
Fish R.J.
Fort A.
Garieri M.
Mull D.
Santoni F.
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2017

The identification of genetic variants affecting gene expression, namely expression quantitative trait loci (eQTLs), has contributed to the understanding of mechanisms underlying human traits and diseases. The majority of these variants map to non-coding regulatory regions of the genome and their identification remains challenging. Here, we use natural genetic variation and CAGE transcriptomes from 154 EBV-transformed lymphoblastoid cell lines, derived from unrelated individuals, to map 5376 and 110 regulatory variants associated with promoter usage (puQTLs) and enhancer activity (eaQTLs), respectively. We characterize five categories of genes associated with puQTLs, distinguishing single from multi-promoter genes. Among multi-promoter genes, we find puQTL effects either specific to a single promoter or to multiple promoters with variable effect orientations. Regulatory variants associated with opposite effects on different mRNA isoforms suggest compensatory mechanisms occurring between alternative promoters. Our analyses identify differential promoter usage and modulation of enhancer activity as molecular mechanisms underlying eQTLs related to regulatory elements.

Crossref

Serveur académique lausannois

Directory of Open Access Journals

Archive ouverte UNIGE

Expression estimation and eQTL mapping for HLA genes with a personalized pipeline.

Author: Aguiar VRC
César J.
Delaneau O.
Dermitzakis E.T.
Meyer D.
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 01/01/2019

The HLA (Human Leukocyte Antigen) genes are well-documented targets of balancing selection, and variation at these loci is associated with many disease phenotypes. Variation in expression levels also influences disease susceptibility and resistance, but little information exists about the regulation and population-level patterns of expression. This results from the difficulty in mapping short reads originating from these highly polymorphic loci, and in accounting for the existence of several paralogues. We developed a computational pipeline to accurately estimate expression for HLA genes based on RNA-seq, improving both locus-level and allele-level estimates. First, reads are aligned to all known HLA sequences in order to infer HLA genotypes, then quantification of expression is carried out using a personalized index. We use simulations to show that expression estimates obtained in this way are not biased due to divergence from the reference genome. We applied our pipeline to the GEUVADIS dataset, and compared the quantifications to those obtained with a reference transcriptome. Although the personalized pipeline recovers more reads, we found that using the reference transcriptome produces estimates similar to the personalized pipeline (r ≥ 0.87) with the exception of HLA-DQA1. We describe the impact of the HLA-personalized approach on downstream analyses for nine classical HLA loci (HLA-A, HLA-C, HLA-B, HLA-DRA, HLA-DRB1, HLA-DQA1, HLA-DQB1, HLA-DPA1, HLA-DPB1). Although the influence of the HLA-personalized approach is modest for eQTL mapping, the p-values and the causality of the eQTLs obtained are better than when the reference transcriptome is used. We investigate how the eQTLs we identified explain variation in expression among lineages of HLA alleles. Finally, we discuss possible causes underlying differences between expression estimates obtained using RNA-seq, antibody-based approaches and qPCR.

Serveur académique lausannois

Directory of Open Access Journals

The Francis Crick Institute

Archive ouverte UNIGE

Accuracy of haplotype estimation and whole genome imputation affects complex trait analyses in complex biobanks.

Author: Appadurai V.
Buil A.
Bybjerg-Grauholm J.
Børglum A.D.
Delaneau O.
Hougaard D.M.
Ingason A.
Krebs M.D.
Mors O.
Mortensen P.B.
Nordentoft M.
Rosengren A.
Schork A.J.
Werge T.
Publication date: 01/01/2023

Sample recruitment for research consortia, biobanks, and personal genomics companies spans years, necessitating genotyping in batches, using different technologies. As marker content on genotyping arrays varies, integrating such datasets is non-trivial and its impact on haplotype estimation (phasing) and whole genome imputation, necessary steps for complex trait analysis, remains under-evaluated. Using the iPSYCH dataset, comprising 130,438 individuals, genotyped in two stages, on different arrays, we evaluated phasing and imputation performance across multiple phasing methods and data integration protocols. While phasing accuracy varied by choice of method and data integration protocol, imputation accuracy varied mostly between data integration protocols. We demonstrate an attenuation in imputation accuracy within samples of non-European origin, highlighting challenges to studying complex traits in diverse populations. Finally, imputation errors can bias association tests and reduce the predictive utility of polygenic scores. Carefully optimized data integration strategies enhance accuracy and replicability of complex trait analyses in complex biobanks.

Serveur académique lausannois

PubMed Central

Copenhagen University Research Information System

Genome-wide association scan in HIV-1-infected individuals identifying variants influencing disease course.

Author: Boeser-Nunnink BD
Bol SM
Burger JA
Delaneau O
Kootstra NA
Limou S
Moerland PD
Schuitemaker H
van 't Slot R
van Manen D
Zagury JF
Zwinderman AH
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 01/07/2011

Serveur académique lausannois

Differentially expressed genes reflect disease-induced rather than disease-causing changes in the transcriptome.

Author: Auwerx C.
Bandinelli S.
Delaneau O.
Frayling T.
Kutalik Z.
Lepik K.
Metspalu A.
Nauck M.
Porcu E.
Reymond A.
Ribeiro D.M.
Sadler M.C.
Santoni F.A.
Sleiman MSB
Tanaka T.
Teumer A.
Völker U.
Weihs A.
Wood A.R.
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 24/09/2021

Comparing transcript levels between healthy and diseased individuals allows the identification of differentially expressed genes, which may be causes, consequences or mere correlates of the disease under scrutiny. We propose a method to decompose the observational correlation between gene expression and phenotypes into parts driven by confounders, forward causal effects and reverse causal effects. The bi-directional causal effects between gene expression and complex traits are obtained by Mendelian Randomization integrating summary-level data from GWAS and whole-blood eQTLs. Applying this approach to complex traits reveals that forward effects have negligible contribution. For example, BMI- and triglycerides-gene expression correlation coefficients robustly correlate with trait-to-expression causal effects ($r_{\mathrm{BMI}} = 0.11$, $P_{\mathrm{BMI}} = 2.0 \times 10^{-51}$ and $r_{\mathrm{TG}} = 0.13$, $P_{\mathrm{TG}} = 1.1 \times 10^{-68}$), but not detectably with expression-to-trait effects. Our results demonstrate that studies comparing the transcriptome of diseased and healthy subjects are more prone to reveal disease-induced gene expression changes rather than disease-causing ones.
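
To make the phrase "Mendelian Randomization integrating summary-level data" more concrete, the sketch below computes the standard inverse-variance weighted (IVW) causal estimate from per-variant effect sizes. This is a generic textbook estimator shown under simplifying assumptions (independent instruments, negligible error in the exposure effects); the abstract does not state which estimator the authors used, and the function name `ivw_estimate` and the toy numbers are hypothetical.

```python
from typing import Sequence


def ivw_estimate(
    beta_exposure: Sequence[float],
    beta_outcome: Sequence[float],
    se_outcome: Sequence[float],
) -> float:
    """Inverse-variance weighted causal effect of exposure on outcome,
    computed from per-variant summary statistics."""
    if not (len(beta_exposure) == len(beta_outcome) == len(se_outcome)):
        raise ValueError("all inputs must have one entry per variant")
    # Weighted regression of outcome effects on exposure effects through
    # the origin, with weights 1 / se_outcome^2.
    numerator = sum(
        bx * by / se**2
        for bx, by, se in zip(beta_exposure, beta_outcome, se_outcome)
    )
    denominator = sum(
        bx**2 / se**2 for bx, se in zip(beta_exposure, se_outcome)
    )
    if denominator == 0:
        raise ValueError("no informative instruments")
    return numerator / denominator


# Toy summary statistics: every variant implies a ratio of 0.5.
effect = ivw_estimate([0.1, 0.2], [0.05, 0.1], [0.01, 0.01])
```

Swapping the roles of exposure and outcome (with instruments chosen for the other trait) gives the reverse-direction estimate, which is the sense in which such an analysis is bi-directional.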

Infoscience - École polytechnique fédérale de Lausanne

Serveur académique lausannois

Scanning and filling : ultra-dense SNP genotyping combining genotyping-by-sequencing, SNP array and whole-genome resequencing data

Author: AE Lipka
B Howie
BN Howie
D Ellinghaus
D Jarquín
Davoud Torkamaneh
Francois Belzile
H Li
H Sonah
HD Daetwyler
J Crossa
J Poland
J Schmutz
J Zheng
JE Rutkoski
K Hao
KG Ardlie
LR Porto-Neto
M Wang
MA Gore
MD Donato
MH Santana
Nicholas A. Tinker
NT Ha
O Delaneau
P Scheet
Q Song
Q Zhu
RJ Elshire
S Browning
S He
S Kim
S Purcell
S Shifman
X Huang
X Xu
Y Li
YB Fu
YF Pei
Z Yang
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 10/07/2015

Genotyping-by-sequencing (GBS) represents a highly cost-effective high-throughput genotyping approach. By nature, however, GBS is subject to generating sizeable amounts of missing data and these will need to be imputed for many downstream analyses. The extent to which such missing data can be tolerated in calling SNPs has not been explored widely. In this work, we first explore the use of imputation to fill in missing genotypes in GBS datasets. Importantly, we use whole genome resequencing data to assess the accuracy of the imputed data. Using a panel of 301 soybean accessions, we show that over 62,000 SNPs could be called when tolerating up to 80% missing data, a five-fold increase over the number called when tolerating up to 20% missing data. At all levels of missing data examined (between 20% and 80%), the resulting SNP datasets were of uniformly high accuracy (96–98%). We then used imputation to combine complementary SNP datasets derived from GBS and a SNP array (SoySNP50K). We thus produced an enhanced dataset of >100,000 SNPs and the genotypes at the previously untyped loci were again imputed with a high level of accuracy (95%). Of the >4,000,000 SNPs identified through resequencing 23 accessions (among the 301 used in the GBS analysis), 1.4 million tag SNPs were used as a reference to impute this large set of SNPs on the entire panel of 301 accessions. These previously untyped loci could be imputed with around 90% accuracy. Finally, we used the 100K SNP dataset (GBS + SoySNP50K) to perform a GWAS on seed oil content within this collection of soybean accessions. Both the number of significant marker-trait associations and the peak significance levels were improved considerably using this enhanced catalog of SNPs relative to a smaller catalog resulting from GBS alone at 20% missing data. Our results demonstrate that imputation can be used to fill in both missing genotypes and untyped loci with very high accuracy and that this leads to more powerful genetic analyses.

Crossref

Directory of Open Access Journals

PubMed Central

CorpusUL

HapTree: A Novel Bayesian Framework for Single Individual Polyplotyping Using NGS Data

Author: A Efros
A Williams
BL Browning
Bonnie Berger
D Aguiar
D He
Deniz Yorukoglu
E Berger
Emily Berger
F Geraci
G Abecasis
Isidore Rigoutsos
Jian Peng
K Zhang
M Stephens
O Delaneau
P Scheet
R Lippert
SR Browning
V Bansal
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 01/10/2013

As the more recent next-generation sequencing (NGS) technologies provide longer read sequences, the use of sequencing datasets for complete haplotype phasing is fast becoming a reality, allowing haplotype reconstruction of a single sequenced genome. Nearly all previous haplotype reconstruction studies have focused on diploid genomes and are rarely scalable to genomes with higher ploidy. Yet computational investigations into polyploid genomes carry great importance, impacting plant, yeast and fish genomics, as well as the studies of the evolution of modern-day eukaryotes and (epi)genetic interactions between copies of genes. In this paper, we describe a novel maximum-likelihood estimation framework, HapTree, for polyploid haplotype assembly of an individual genome using NGS read datasets. We evaluate the performance of HapTree on simulated polyploid sequencing read data modeled after Illumina sequencing technologies. For triploid and higher ploidy genomes, we demonstrate that HapTree substantially improves haplotype assembly accuracy and efficiency over the state-of-the-art; moreover, HapTree is the first scalable polyplotyping method for higher ploidy. As a proof of concept, we also test our method on real sequencing data from NA12878 (1000 Genomes Project) and evaluate the quality of assembled haplotypes with respect to trio-based diplotype annotation as the ground truth. The results indicate that HapTree significantly improves the switch accuracy within phased haplotype blocks as compared to existing haplotype assembly methods, while producing comparable minimum error correction (MEC) values. A summary of this paper appears in the proceedings of the RECOMB 2014 conference, April 2–5. Funding: National Science Foundation (U.S.) (NSF/NIH BIGDATA Grant R01GM108348-01); National Science Foundation (U.S.) (Graduate Research Fellowship); Simons Foundatio

Public Library of Science (PLOS)

DSpace@MIT

Crossref

Directory of Open Access Journals

PubMed Central