Enhanced localization of genetic samples through linkage-disequilibrium correction
Characterizing the spatial patterns of genetic diversity in human populations has a wide range of applications, from detecting genetic mutations associated with disease to inferring human history. Current approaches, including the widely used principal-component analysis, are not suited for the analysis of linked markers, and local and long-range linkage disequilibrium (LD) can dramatically reduce the accuracy of spatial localization when unaccounted for. To overcome this, we have introduced an approach that performs spatial localization of individuals on the basis of their genetic data and explicitly models LD among markers by using a multivariate normal distribution. By leveraging external reference panels, we derive closed-form solutions to the optimization procedure to achieve a computationally efficient method that can handle large data sets. We validate the method on empirical data from a large sample of European individuals from the POPRES data set, as well as on a large sample of individuals of Spanish ancestry. First, we show that by modeling LD, we achieve accuracy superior to that of existing methods. Importantly, whereas other methods show decreased performance when dense marker panels are used in the inference, our approach improves in accuracy as more markers become available. Second, we show that accurate localization of genetic data can be achieved with only a part of the genome, and this could potentially enable the spatial localization of admixed samples that have a fraction of their genome originating from a given continent. Finally, we demonstrate that our approach is resistant to distortions resulting from long-range LD regions; such distortions can dramatically bias the results when unaccounted for
A phylogenetic method to perform genome-wide association studies in microbes
Genome-Wide Association Studies (GWAS) are designed to perform an unbiased search of genetic sequence data with the intent of identifying statistically significant associations with a phenotype or trait of interest. The application of GWAS methods to microbial organisms promises to improve the way we understand, manage, and treat infectious diseases. Yet, while microbial pathogens continue to undermine human health, wealth, and longevity, microbial GWAS methods remain unable to fully capitalise on the growing wealth of bacterial and viral genetic sequence data. Clonal population structure and homologous recombination in microbial organisms make it difficult for existing GWAS methods to achieve both the precision needed to reject false positive findings and the statistical power required to detect genuine associations between microbial genotypic and phenotypic variants. In this thesis, we investigate potential solutions to the most substantial methodological challenges in microbial GWAS, and we introduce a new phylogenetic GWAS approach that has been specifically designed for use in bacterial samples. In presenting our approach, we describe the features that render it robust to the confounding effects of both population structure and recombination, while maintaining high statistical power to detect associations. Our approach is applicable to organisms ranging from purely clonal to frequently recombining, to sequence data from both the core and accessory genome, and to binary, categorical, and continuous phenotypes. We also describe the efforts taken to make our method efficient, scalable, and accessible in its implementation within the open-source R package we have created, called treeWAS. Next, we apply our GWAS method to simulated datasets. We develop multiple frameworks for simulating genotypic and phenotypic data with control over relevant parameters. 
We then present the results of our simulation study, and we use thorough performance testing to demonstrate the power and specificity of our approach, as compared to the performance of alternative cluster-based and dimension-reduction methods. Our approach is then applied to three empirical datasets, from Neisseria gonorrhoeae and Neisseria meningitidis, where we identify core SNPs associated with binary drug resistance and continuous antibiotic minimum inhibitory concentration phenotypes, as well as both core SNP and accessory genome associations with invasive and commensal phenotypes. These applications illustrate the versatility and potential of our method, demonstrating in each case that our approach is capable of confirming known resistance- or virulence-associated loci and discovering novel associations. Our thesis concludes with a review of the previous chapters and an evaluation of the strengths and limitations displayed by the current implementation of our phylogenetic approach to association testing. We discuss key areas for further development, and we propose potential solutions to advance the development of microbial GWAS in future work.
Inference of Biogeographical Ancestry Under Resource Constraints
We study the problem of predicting human biogeographical ancestry using genomic data. While continental-level ancestry prediction is relatively simple using genomic information, distinguishing between individuals from closely associated sub-populations (e.g., from the same continent) is still a difficult challenge. In particular, we focus on the case where the analysis is constrained to using single nucleotide polymorphisms (SNPs) from just one chromosome. We thus propose methods to construct ancestry informative SNP panels by analyzing variants from a single chromosome, and evaluate the performance of such panels for both continental-level and sub-continental-level ancestry prediction. Efficient selection of ancestry informative SNPs is the key to successful ancestry prediction. The removal of redundant and noisy SNP features is essential prior to applying a learning algorithm. Here we propose two distinct methods of SNP selection: one is correlation-based SNP selection, which uses a correlation metric to evaluate the usefulness of SNP features, while the other is random subspace projection based SNP selection, which uses the learning algorithm itself to evaluate the worth of the SNP features. The correlation-based SNP selection approach can construct a small panel of useful SNPs for both continental-level classification and binary classification of sub-populations. Unlike the correlation-based selection, random subspace projection based selection can construct an efficient panel of SNP markers to address the difficult task of multinomial classification with multiple closely related sub-populations. We include results that demonstrate the performance of both methods, including comparison with other recently published related methods
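To make the idea of correlation-based SNP selection concrete, the following is a minimal sketch and not the authors' method. It assumes genotypes are coded as 0/1/2 allele counts and population labels are binary. The function names (`pearson`, `select_snps`), the `max_redundancy` threshold, and the toy data are all hypothetical. SNPs are ranked by absolute Pearson correlation with the label, and a SNP is skipped when it is nearly collinear with one already in the panel.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def select_snps(genotypes, labels, panel_size, max_redundancy=0.8):
    """Greedy panel construction (illustrative assumption, not a published algorithm).

    genotypes: dict mapping SNP id -> list of 0/1/2 allele counts per individual
    labels:    list of 0/1 population labels, one per individual
    """
    # Relevance: how strongly each SNP correlates with the population label.
    relevance = {snp: abs(pearson(g, labels)) for snp, g in genotypes.items()}
    # Most relevant first; ties broken by SNP id for reproducibility.
    ranked = sorted(relevance, key=lambda s: (-relevance[s], s))
    panel = []
    for snp in ranked:
        if len(panel) == panel_size:
            break
        if relevance[snp] == 0.0:
            continue  # uninformative (e.g. monomorphic) SNP
        # Redundancy: skip SNPs nearly collinear with one already chosen.
        if any(abs(pearson(genotypes[snp], genotypes[p])) > max_redundancy
               for p in panel):
            continue
        panel.append(snp)
    return panel


# Toy data: six individuals, two populations (all values are made up).
labels = [0, 0, 0, 1, 1, 1]
genotypes = {
    "rs_a": [0, 0, 1, 2, 2, 2],  # informative
    "rs_b": [0, 0, 1, 2, 2, 2],  # exact duplicate of rs_a (redundant)
    "rs_c": [1, 1, 1, 1, 1, 1],  # monomorphic (noise-free but useless)
    "rs_d": [0, 1, 0, 1, 2, 1],  # moderately informative
}
panel = select_snps(genotypes, labels, panel_size=2)
print(panel)
```

On this toy input the duplicate marker and the monomorphic marker are both left out of the panel, which mirrors the abstract's point about removing redundant and noisy SNP features before learning.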
Candidate Sequence Variants for Polyautoimmunity and Multiple Autoimmune Syndrome from a Colombian Genetic Isolate: Implications for Population Genetics
Autoimmunity is an immunological disorder whereby patients have
lost immunological tolerance to self-antigens. It imposes an extreme
financial and socioeconomic burden, with costs of over 100 billion
dollars in the USA alone and an estimated prevalence of 9.4%;
evidence indicates that this estimate has increased at a rate
of 5% per year for the past 3 years. These phenotypes can be
manifested in more severe forms through polyautoimmunity, whereby
patients carry 2 or more autoimmune conditions. In
addition to that, there is also the most extreme phenotype of
autoimmunity known as the Multiple Autoimmune Syndrome (MAS),
consisting of cases where patients have 3 or more autoimmune
diseases. These extreme phenotypes are particularly valuable for
genetic research, as will be elaborated upon in this thesis. For
more than 20 years, pedigrees from the world's largest known
genetic isolate, from the Paisa region of Colombia, have been
ascertained and thoroughly followed by Dr. Juan-Manuel Anaya and
Dr. Mauricio Arcos-Burgos. This population has maintained its
status as a genetic isolate since the 16th century, during the
early colonization by the Spanish Conquistadors.
In this thesis, our attempt to identify candidate
variants potentially underpinning the genetic etiology of
autoimmune conditions in this population is facilitated by the
fact that families are derived from individuals carrying extreme
phenotypes, from familial cohorts where genetic homogeneity is
maximized. Candidates are identified in both sporadic and
familial cases. This is primarily achieved through a combination of
linkage analysis and association tests for both rare and common
variants, which were derived from variant-calling pipelines and had
undergone quality control, filtering and functional annotation
via bioinformatic analyses. Genes harbouring variants with
significant evidence of linkage and association were primarily
involved in negative regulation of apoptosis, phagocytosis,
regulation of endopeptidase activity, response to
lipopolysaccharides and plasminogen urokinase receptor activity.
These findings, which were obtained by combining
statistical and network-based analyses, have potentially
relevant implications for autoimmunity and can be further
supported with additional studies
Pacific Symposium on Biocomputing 2023
The Pacific Symposium on Biocomputing (PSB) 2023 is an international, multidisciplinary conference for the presentation and discussion of current research in the theory and application of computational methods in problems of biological significance. Presentations are rigorously peer reviewed and are published in an archival proceedings volume. PSB 2023 will be held on January 3-7, 2023 in Kohala Coast, Hawaii. Tutorials and workshops will be offered prior to the start of the conference. PSB 2023 will bring together top researchers from the US, the Asian Pacific nations, and around the world to exchange research results and address open issues in all aspects of computational biology. It is a forum for the presentation of work in databases, algorithms, interfaces, visualization, modeling, and other computational methods, as applied to biological problems, with emphasis on applications in data-rich areas of molecular biology. The PSB has been designed to be responsive to the need for critical mass in sub-disciplines within biocomputing. For that reason, it is the only meeting whose sessions are defined dynamically each year in response to specific proposals. PSB sessions are organized by leaders of research in biocomputing's 'hot topics.' In this way, the meeting provides an early forum for serious examination of emerging methods and approaches in this rapidly changing field
The functional impact of copy number variation in the human genome
Copy number variation (CNV) is a class of genetic variation where large segments of the genome vary in copy number among different individuals. It has become clear in the past decade that CNV affects a significant proportion of the human genome and can play an important role in human disease. With array-based copy number detection and the current generation of sequencing technologies, our ability to discover genetic variants is running far ahead of our ability to interpret their functional impact. One approach to close this gap is to explore statistical association between genetic variants and phenotypes. In contrast to the successes of genome-wide association studies for common disease using common single nucleotide polymorphism (SNP) as markers, the majority of disease CNVs discovered so far have low population frequencies and are mainly involved in rare developmental disorders. Another strategy to improve interpretation of genomic variants is to establish a predictive understanding of their functional impact. Large heterozygous deletions are of particular interest, since (i) loss-of-function (LOF) of coding sequences encompassed by large deletions can be relatively unambiguously ascribed and (ii) haploinsufficiency (HI), wherein only one functional copy of a gene is not sufficient to maintain normal phenotype, is a major cause of dominant diseases.
This thesis explored both approaches. Initially, I developed an informatics pipeline for robust discovery of CNVs from large numbers of samples genotyped using the Affymetrix whole-genome SNP array 6.0, to support both the association-based and prediction-based study. For the disease association strategy, I studied the role of both common and rare CNVs in severe early-onset obesity using a case-control design, from which a rare 220kb heterozygous deletion at 16p11.2 that encompasses SH2B1 was found causal for the phenotype and an 8kb common deletion upstream of NEGR1 was found to be significantly associated with the disease, particularly in females. Using the prediction-based approach, I characterized the properties of HI genes by comparing with genes observed to be deleted in apparently healthy individuals and I developed a prediction model to distinguish HI and haplosufficient (HS) genes using the most informative properties identified from these comparisons. An HI-based pathogenicity score was devised to distinguish pathogenic genic CNVs from benign genic CNVs. Finally, I proposed a probabilistic diagnostic framework to incorporate population variation, and integrate other sources of evidence, to enable an improved, and quantitative, identification of causal variants
Methods for large-scale genome-wide association studies
Genome-wide association studies (GWAS) have led to the identification of thousands of associations between genetic polymorphisms and complex traits or diseases, facilitating several downstream applications such as genetic risk prediction and drug target prioritisation. Biobanks containing extensive genetic and phenotypic data continue to grow, creating new opportunities for the study of complex traits, such as the analysis of rare genomic variation across multiple populations. These opportunities are coupled with computational challenges, creating the need for the development of novel methodology.
This thesis develops computational tools to facilitate large-scale association studies of rare and common variation. First, we develop methods to improve the analysis of ultra-rare variants, leveraging the sharing of identical-by-descent (IBD) genomic regions within large biobanks. We compare ~400k genotyped UK Biobank (UKBB) samples with 50k exome-sequenced samples and devise a score that quantifies the extent to which a genotyped individual shares IBD segments with carriers of rare loss-of-function mutations. Our approach detects several associations and replicates 11/14 loci of a pilot exome sequencing study. Second, we develop a linear mixed model framework, FMA, that builds on previous techniques and is suitable for scalable and robust association testing. We benchmark FMA and several state-of-the-art approaches using synthetic and UKBB data, evaluating computational performance, statistical power, and robustness to known confounders, such as cryptic relatedness and population stratification. Finally, we integrate FMA with recently developed methods for genealogical analysis of complex traits, enabling it to perform scalable genealogy-based estimation of narrow-sense heritability and association
The road ahead in genetics and genomics
In celebration of the 20th anniversary of Nature Reviews Genetics, we asked 12 leading researchers to reflect on the key challenges and opportunities faced by the field of genetics and genomics. Keeping their particular research area in mind, they take stock of the current state of play and emphasize the work that remains to be done over the next few years so that, ultimately, the benefits of genetic and genomic research can be felt by everyone