35 research outputs found

    Efficient Haplotype Block Matching in Bi-Directional PBWT

    Efficient haplotype matching search is of great interest now that large genotyped cohorts are becoming available. Positional Burrows-Wheeler Transform (PBWT) enables efficient searching for blocks of haplotype matches. However, existing efficient PBWT algorithms sweep across the haplotype panel from left to right, capturing all exact matches. As a result, PBWT does not account for mismatches. It is also not easy to investigate the patterns of changes between the matching blocks. Here, we present an extension to PBWT, called bi-directional PBWT, that allows the information about the blocks of matches to be present at both sides of each site. We also present a set of algorithms to efficiently merge the matching blocks or examine the patterns of changes on both sides of each site. The time complexity of the algorithms to find and merge matching blocks using bi-directional PBWT is linear in the input size. Using real data from the UK Biobank, we demonstrate the run time and memory efficiency of our algorithms. More importantly, our algorithms can identify more blocks by enabling tolerance of mismatches. Moreover, by using mutual information (MI) between the forward and the reverse PBWT matching block sets as a measure of haplotype consistency, we found the MI derived from European samples in the 1000 Genomes Project is highly correlated (Spearman correlation r=0.87) with the deCODE recombination map.

    Rapid detection of identity-by-descent tracts for mega-scale datasets

    The ability to identify segments of genomes identical-by-descent (IBD) is a part of standard workflows in both statistical and population genetics. However, traditional methods for finding local IBD across all pairs of individuals scale poorly, leading to a lack of adoption in very large-scale datasets. Here, we present iLASH, an algorithm based on similarity detection techniques that shows equal or improved accuracy in simulations compared to current leading methods and speeds up analysis by several orders of magnitude on genomic datasets, making IBD estimation tractable for millions of individuals. We apply iLASH to the PAGE dataset of ~52,000 multi-ethnic participants, including several founder populations with elevated IBD sharing, identifying IBD segments in ~3 minutes per chromosome compared to over 6 days for a state-of-the-art algorithm. iLASH enables efficient analysis of very large-scale datasets, as we demonstrate by computing IBD across the UK Biobank (~500,000 individuals), detecting 12.9 billion pairwise connections.

    GWAS meta-analysis of over 29,000 people with epilepsy identifies 26 risk loci and subtype-specific genetic architecture

    Epilepsy is a highly heritable disorder affecting over 50 million people worldwide, of whom about one-third are resistant to current treatments. Here we report a multi-ancestry genome-wide association study including 29,944 cases, stratified into three broad categories and seven subtypes of epilepsy, and 52,538 controls. We identify 26 genome-wide significant loci, 19 of which are specific to genetic generalized epilepsy (GGE). We implicate 29 likely causal genes underlying these 26 loci. SNP-based heritability analyses show that common variants explain between 39.6% and 90% of genetic risk for GGE and its subtypes. Subtype analysis revealed markedly different genetic architectures between focal and generalized epilepsies. Gene-set analyses of GGE signals implicate synaptic processes in both excitatory and inhibitory neurons in the brain. Prioritized candidate genes overlap with monogenic epilepsy genes and with targets of current antiseizure medications. Finally, we leverage our results to identify alternate drugs with predicted efficacy if repurposed for epilepsy treatment.

    A broad overview of genotype imputation: Standard guidelines, approaches, and future investigations in genomic association studies

    The advent of genomic big data and the statistical need for reaching significant results have made genome-wide association studies hungry for huge numbers of genetic markers spread across the whole genome. From its very beginning, genotype imputation has served this purpose: this statistical inference procedure, based on a known reference panel, made it possible in principle to extend association analyses to a greater number of polymorphic sites that were not assayed by the genotyping technology used. In this review, we present a broad overview of the genotype imputation process, covering the best-known methods and the main areas of interest, with a closer look at the most up-to-date approaches and a deeper understanding of its usage in the present-day genomic landscape, shedding light on its future developments and investigation areas.

    Functional Analysis of Genomic Variation and Impact on Molecular and Higher Order Phenotypes

    Reverse genetics methods, particularly the production of gene knockouts and knockins, have revolutionized the understanding of gene function. High-throughput sequencing now makes it practical to exploit reverse genetics to simultaneously study functions of thousands of normal sequence variants and spontaneous mutations that segregate in intercross and backcross progeny generated by mating completely sequenced parental lines. To evaluate this new reverse genetic method we resequenced the genome of one of the oldest inbred strains of mice, DBA/2J, the father of the large family of BXD recombinant inbred strains. We analyzed ~100X whole-genome sequence data for the DBA/2J strain, relative to C57BL/6J, the reference strain for all mouse genomics and the mother of the BXD family. We generated the most detailed picture of molecular variation between the two mouse strains to date and identified 5.4 million sequence polymorphisms, including 4.46 million single nucleotide polymorphisms (SNPs), 0.94 million insertions/deletions (indels), and 20,000 structural variants. We systematically scanned massive databases of molecular phenotypes and ~4,000 classical phenotypes to detect linked functional consequences of sequence variants. In the majority of cases we successfully recovered known genotype-to-phenotype associations and in several cases we linked sequence variants to novel phenotypes (Ahr, Fh1, Entpd2, and Col6a5). However, our most striking and consistent finding is that apparently deleterious homozygous SNPs, indels, and structural variants have undetectable or very modest additive effects on phenotypes.

    Genetic architecture of glycomic and lipidomic phenotypes in isolated populations

    Understanding how genetics contributes to the variation of complex traits and diseases is one of the key objectives of current medical studies. To date, a large portion of this genetic variation still needs to be identified, especially considering the contribution of low-frequency and rare variants. Omics data, such as proteomics and metabolomics, are extensively employed in genetic association studies as ‘proxies’ for traits or diseases of interest. They are regarded as “intermediate” traits: measurable manifestations of more complex phenotypes (e.g., cholesterol levels for cardiovascular diseases), often more strongly associated with genetic variation and having a clearer functional link than the endpoint or disease of interest. Accordingly, the genetics of omics have the potential to offer insights into relevant biological mechanisms and pathways and point to new drug targets or diagnostic biomarkers. The main goal of this thesis is to expand the current knowledge about the genetic architecture of protein glycomics and bile acid lipidomics, two under-studied omic traits, but which are involved in several common diseases. First, in Chapter 2 I compared genetic regulation of glycosylation of two different proteins, transferrin and immunoglobulin G (IgG). By performing a genome-wide association study (GWAS) of ~2000 European samples, I identified 10 loci significantly associated with transferrin glycosylation, 9 of which were previously not reported as being related with the glycosylation of this protein. Comparing these with IgG glycosylation-associated genes, I noted both protein-specific and shared associations. These shared associations are likely regulated by different causal variants, suggesting that glycosylation of transferrin and IgG is genetically regulated by both shared and protein-specific mechanisms. 
    Next, in Chapter 3 I investigated the effect of rare (MAF<5%) predicted loss-of-function (pLOF) and missense variants on the glycome of transferrin and IgG in ~3000 samples of European ancestry. Using multiple gene-based aggregation tests, I identified 16 significant gene-based associations for transferrin and 32 for IgG glycan traits, located in 6 genes already known to have a biological link to protein glycosylation but also in 2 genes which have not been previously reported. Finally, in Chapter 4 I applied a similar approach to bile acid lipidomics, exploring the genetic contribution of both common and rare variants. Despite more than double the sample size (N = ~5000) compared to protein glycomics analysis, I identified only 2 loci, near the SLCO1B1 and PRKG1 genes, significantly associated with bile acid traits, for which I noted a sex-specific effect. Further, I found 3 rare variant gene-based associations, in genes not previously reported as associated with bile acid levels. While the biological mechanisms linking these genes to levels of bile acid are not immediately clear, there is evidence in the literature of their involvement in bile acid synthesis and secretion and in liver diseases. In summary, in my thesis I describe the genetic architecture of the protein glycome and the bile acid lipidome: the former has a higher genetic component, while the latter is largely influenced by environmental factors (e.g., sex, diet, gut flora). Despite the limited sample size, we were able to describe rare variant associations, demonstrating that isolated populations represent a useful strategy to increase statistical power. However, additional statistical power is needed to identify the possible effect of protein glycome and bile acid lipidome on complex disease.
    A clearer understanding of the genetic architecture of omics traits is crucial to develop informed disease screening tests, to improve disease diagnosis and prognosis, and finally to design innovative and more customised treatment strategies to enhance human health.