47 research outputs found

    High performance computing enabling exhaustive analysis of higher order single nucleotide polymorphism interaction in Genome Wide Association Studies.

    Genome-wide association studies (GWAS) are a common approach for the systematic discovery of single nucleotide polymorphisms (SNPs) that are associated with a given disease. The univariate analysis approaches commonly employed may miss important SNP associations that only appear through multivariate analysis in complex diseases. However, multivariate SNP analysis is currently limited by its inherent computational complexity. In this work, we present a computational framework that harnesses supercomputers. Based on our results, we estimate that a three-way interaction analysis on 1.1 million SNP GWAS data would require over 5.8 years on the full "Avoca" IBM Blue Gene/Q installation at the Victorian Life Sciences Computation Initiative. This is hundreds of times faster than estimates for other CPU-based methods and four times faster than runtimes estimated for GPU methods, indicating how improvements in the hardware applied to interaction analysis may alter the types of analysis that can be performed. Furthermore, the same analysis would take under 3 months on the currently largest IBM Blue Gene/Q supercomputer, "Sequoia" at the Lawrence Livermore National Laboratory, assuming linear scaling is maintained, as our results suggest. Given that the implementation used in this study can be further optimised, this runtime means it is becoming feasible to carry out exhaustive analysis of higher order interaction studies on large modern GWAS. This research was partially funded by NHMRC grant 1033452 and was supported by a Victorian Life Sciences Computation Initiative (VLSCI) grant number 0126 on its Peak Computing Facility at the University of Melbourne, an initiative of the Victorian Government, Australia.
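The scale behind these estimates can be checked with a little arithmetic. The sketch below counts the SNP combinations an exhaustive analysis must visit and applies the linear-scaling assumption from the abstract. The core counts for Avoca and Sequoia are assumptions taken from publicly reported machine specifications, not from the abstract, and the function names are illustrative only.

```python
from math import comb

N_SNPS = 1_100_000          # GWAS size quoted in the abstract
AVOCA_YEARS = 5.8           # estimated three-way runtime on Avoca (from the abstract)
AVOCA_CORES = 65_536        # assumption: not stated in the abstract
SEQUOIA_CORES = 1_572_864   # assumption: not stated in the abstract


def n_interactions(n_snps: int, order: int) -> int:
    """Number of distinct SNP combinations of the given order."""
    return comb(n_snps, order)


def scaled_runtime_years(base_years: float, base_cores: int, target_cores: int) -> float:
    """Runtime on a larger machine, assuming perfectly linear scaling."""
    return base_years * base_cores / target_cores


pairs = n_interactions(N_SNPS, 2)
triples = n_interactions(N_SNPS, 3)
sequoia_months = 12 * scaled_runtime_years(AVOCA_YEARS, AVOCA_CORES, SEQUOIA_CORES)

print(f"pairs:   {pairs:.3e}")
print(f"triples: {triples:.3e}")
print(f"Sequoia estimate: {sequoia_months:.1f} months")
```

Under these assumed core counts the larger machine has 24 times as many cores, which is consistent with the "under 3 months" figure quoted above.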

    A MultiCenter analysis of factors associated with hearing outcome for 2,735 adults with cochlear implants

    While the majority of cochlear implant recipients benefit from the device, it remains difficult to estimate the degree of benefit for a specific patient prior to implantation. Using data from 2,735 cochlear-implant recipients from across three clinics, the largest retrospective study of cochlear-implant outcomes to date, we investigate the association between 21 preoperative factors and speech recognition approximately one year after implantation and explore the consistency of their effects across the three constituent datasets. We provide evidence of 17 statistically significant associations, in either univariate or multivariate analysis, including confirmation of associations for several predictive factors that have only been examined in prior smaller studies. Despite the large sample size, a multivariate analysis shows that the variance explained by our models remains modest across the datasets (R² = 0.12–0.21). Finally, we report a novel statistical interaction indicating that the duration of deafness in the implanted ear has a stronger impact on hearing outcome when considered relative to a candidate’s age. Our multicenter study highlights several real-world complexities that impact the clinical translation of predictive factors for cochlear implantation outcome. We suggest several directions to overcome these challenges and further improve our ability to model patient outcomes with increased accuracy. The collection of the VUMC dataset was supported by a research project grant no. NIH NIDCD R01 DC13117 (principal investigator: Gifford). http://journals.sagepub.com/home/tiadm2022Speech-Language Pathology and Audiolog

    Predictive models for cochlear implant outcomes: performance, generalizability, and the impact of cohort size

    While cochlear implants have helped hundreds of thousands of individuals, it remains difficult to predict the extent to which an individual’s hearing will benefit from implantation. Several publications indicate that machine learning may improve predictive accuracy of cochlear implant outcomes compared to classical statistical methods. However, existing studies are limited in terms of model validation and in evaluating the effect of factors like sample size on predictive performance. We conduct a thorough examination of machine learning approaches to predict word recognition scores (WRS) measured approximately 12 months after implantation in adults with post-lingual hearing loss. This is the largest retrospective study of cochlear implant outcomes to date, evaluating 2,489 cochlear implant recipients from three clinics. We demonstrate that while machine learning models significantly outperform linear models in prediction of WRS, their overall accuracy remains limited (mean absolute error: 17.9-21.8). The models are robust across clinical cohorts, with predictive error increasing by at most 16% when evaluated on a clinic excluded from the training set. We show that predictive performance is unlikely to be improved by increasing sample size alone, with a doubling of sample size estimated to increase performance by only 3% on the combined dataset. Finally, we demonstrate how the current models could support clinical decision making, highlighting that subsets of individuals can be identified that have a 94% chance of improving WRS by at least 10 percentage points after implantation, which is likely to be clinically meaningful. We discuss several implications of this analysis, focusing on the need to improve and standardize data collection. http://journals.sagepub.com/home/tiadm2022Speech-Language Pathology and Audiolog

    Use of a Novel Nonparametric Version of DEPTH to Identify Genomic Regions Associated with Prostate Cancer Risk.

    BACKGROUND: We have developed a genome-wide association study analysis method called DEPTH (DEPendency of association on the number of Top Hits) to identify genomic regions potentially associated with disease by considering overlapping groups of contiguous markers (e.g., SNPs) across the genome. DEPTH is a machine learning algorithm for feature ranking of ultra-high dimensional datasets, built from well-established statistical tools such as bootstrapping, penalized regression, and decision trees. Unlike marginal regression, which considers each SNP individually, the key idea behind DEPTH is to rank groups of SNPs in terms of their joint strength of association with the outcome. Our aim was to compare the performance of DEPTH with that of standard logistic regression analysis. METHODS: We selected 1,854 prostate cancer cases and 1,894 controls from the UK for whom 541,129 SNPs were measured using the Illumina Infinium HumanHap550 array. Confirmation was sought using 4,152 cases and 2,874 controls, ascertained from the UK and Australia, for whom 211,155 SNPs were measured using the iCOGS Illumina Infinium array. RESULTS: From the DEPTH analysis, we identified 14 regions associated with prostate cancer risk that had been reported previously, five of which would not have been identified by conventional logistic regression. We also identified 112 novel putative susceptibility regions. CONCLUSIONS: DEPTH can reveal new risk-associated regions that would not have been identified using a conventional logistic regression analysis of individual SNPs. IMPACT: This study demonstrates that the DEPTH algorithm could identify additional genetic susceptibility regions that merit further investigation. Cancer Epidemiol Biomarkers Prev; 25(12); 1619-24. 
©2016 AACR. National Health and Medical Research Council Australia (Grant ID: 1033452, Senior Principal Research Fellowship, Senior Research Fellowship), Cancer Research UK (Grant IDs: C5047/A7357, C1287/A10118, C1287/A5260, C5047/A3354, C5047/A10692, C16913/A6135 and C16913/A6835), Prostate Research Campaign UK (now Prostate Cancer UK), The Institute of Cancer Research and The Everyman Campaign, The National Cancer Research Network UK, The National Cancer Research Institute (NCRI) UK, National Institute for Health Research funding to the NIHR Biomedical Research Centre at The Institute of Cancer Research and The Royal Marsden NHS Foundation Trust, Prostate Cancer Research Program of Cancer Council Victoria from The National Health and Medical Research Council, Australia (Grant IDs: 126402, 209057, 251533, 396414, 450104, 504700, 504702, 504715, 623204, 940394, 614296), VicHealth, Cancer Council Victoria, The Prostate Cancer Foundation of Australia, The Whitten Foundation, PricewaterhouseCoopers, Tattersall’s. This is the author accepted manuscript. The final version is available from the American Association for Cancer Research via http://dx.doi.org/10.1158/1055-9965.EPI-16-030

    Developing systems for gene normalisation

    © 2007 Benjamin Goudey. The rapid growth of biomedical literature has attracted interest from the text mining community to develop methods to help manage the ever-increasing amounts of data. Initiatives such as the BioCreative challenge (Hirschman et al. 2005b) have created standard corpora and tasks in which to evaluate a variety of systems in a common framework. One such task is gene normalisation, in which the problems of synonymy and polysemy in gene name identification are overcome by mapping each mention back to a unique identifier, unambiguously identifying that gene. This task is one of the foundations required for any kind of text mining system working with biomedical literature, where we must be very certain of which genes are being discussed in the text. (For complete abstract open document)

    Detection of epistasis in genome-wide association studies

    © 2016 Dr. Benjamin William Goudey. In the last decade, single nucleotide polymorphisms (SNPs) have been used as the basis for genome-wide association studies (GWAS): large-scale studies examining hundreds of thousands of SNPs across a large number of individuals for a given condition. Analysis of GWAS typically involves examining each of these markers individually to determine whether they are associated with the condition of interest. While such studies have been able to detect novel associations between genetic variations and certain conditions, few GWAS have been able to fully determine all genetic factors that influence complex traits, that is, traits caused by a combination of multiple genetic and environmental factors. This problem of “missing heritability” illustrates the limited predictive power and biological understanding obtained from GWAS using current analysis techniques. While many hypotheses exist as to why missing heritability is observed, a commonly held belief is that the univariate analysis typically conducted in GWAS - whereby SNPs are examined one by one - is unlikely to match the complexity of most biological systems. Biologists have long recognised the importance of genetic interactions, commonly referred to as epistasis. From studies of model organisms, epistasis has been shown to play a strong role in influencing many traits with a genetic basis. Furthermore, the existence of physical interactions between molecules involved in gene regulation and in biochemical and metabolic systems has been well documented. It has been suggested that such biological interactions are likely to be reflected by interaction of genetic loci. Thus, further study of genetic interactions may help to detect variants that cannot be detected from a purely univariate approach.
This thesis contributes novel methods that overcome several statistical and computational difficulties that arise from exhaustively analysing all pairwise interactions between SNPs in large-scale GWAS and validates these approaches across several GWAS. After exploring the range of definitions for interaction, in both a biological and statistical sense, and describing the statistical approaches that are currently used to detect these in association studies, we introduce a novel family of exact, unconditional statistics designed specifically for the analysis of GWAS. Within this family of statistics, we focus on developing a test for interaction with several advantageous properties compared to current statistical methods. We then explore computational approaches for fast tabulation of genotype frequencies, required for all commonly used interaction statistics, and develop a highly parallelized framework for conducting interaction analysis on the IBM Blue-Gene supercomputer. A comparison with other current methods indicates that the developed interaction framework is faster than the current state-of-the-art. The novel statistics and computational framework developed in this work are then applied to an interaction analysis of five independent celiac studies. We detect numerous interaction effects between SNPs that are statistically significant and replicate across multiple studies under a wide range of criteria. We show that the detected interactions contribute novel signal which is not captured by known risk factors. Finally, we conduct a comparison and characterisation of all statistical tests examined in this work by exploring their performance across twelve GWAS using empirically driven signals that are typically not represented in simulation studies.
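To make the "tabulation of genotype frequencies" step concrete, here is a deliberately naive sketch of what an exhaustive pairwise analysis has to compute for every SNP pair: a 2 x 3 x 3 table of counts by phenotype and by the two genotypes. It is not the thesis's implementation, which is described as a fast, highly parallelized framework; the genotype coding (0, 1, 2), the binary phenotype coding (0 = control, 1 = case), the function names and the toy data are all assumptions made for illustration.

```python
from itertools import combinations


def tabulate_pair(snp_a, snp_b, phenotype):
    """Return counts indexed as table[phenotype][genotype_a][genotype_b] (2 x 3 x 3)."""
    table = [[[0] * 3 for _ in range(3)] for _ in range(2)]
    for a, b, y in zip(snp_a, snp_b, phenotype):
        table[y][a][b] += 1
    return table


def tabulate_all_pairs(genotypes, phenotype):
    """Yield ((i, j), table) for every SNP pair with i < j."""
    for i, j in combinations(range(len(genotypes)), 2):
        yield (i, j), tabulate_pair(genotypes[i], genotypes[j], phenotype)


# Toy data (hypothetical): 3 SNPs measured on 6 individuals.
genotypes = [
    [0, 1, 2, 1, 0, 2],
    [1, 1, 0, 2, 0, 2],
    [2, 0, 0, 1, 1, 2],
]
phenotype = [0, 1, 1, 0, 1, 0]

tables = dict(tabulate_all_pairs(genotypes, phenotype))
for pair, table in tables.items():
    print(pair, table)
```

Each table is the input that an interaction statistic would consume. The number of tables grows quadratically with the number of SNPs, which is why the speed of this step matters at GWAS scale.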

    Information theoretic alignment free variant calling

    While traditional methods for calling variants across whole genome sequence data rely on alignment to an appropriate reference sequence, alternative techniques are needed when a suitable reference does not exist. We present a novel alignment and assembly free variant calling method based on information theoretic principles, designed to detect variants that have strong statistical evidence for their ability to segregate samples in a given dataset. Our method uses the context surrounding a particular nucleotide to define variants. Given a set of reads, we model the probability of observing a given nucleotide conditioned on the surrounding prefix and suffix of length k as a multinomial distribution. We then estimate which of these contexts are stable intra-sample and varying inter-sample using a statistic based on the Kullback–Leibler divergence. The utility of the variant calling method was evaluated through analysis of a pair of bacterial datasets and a mouse dataset. We found that our variants are highly informative for supervised learning tasks, with performance similar to standard reference based calls and another reference free method (DiscoSNP++). Comparisons against reference based calls showed our method was able to capture very similar population structure on the bacterial dataset. The algorithm’s focus on discriminatory variants makes it suitable for many common analysis tasks for organisms that are too diverse to be mapped back to a single reference sequence.
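The context model described above can be sketched roughly as follows. The sketch counts the nucleotide observed between each length-k prefix and suffix in a set of reads, turns the counts into a smoothed multinomial estimate, and compares two samples with the plain Kullback–Leibler divergence. The abstract does not give the exact statistic, so the additive smoothing, the pseudocount value, the function names and the toy reads are all assumptions, and this should not be read as the authors' method.

```python
from collections import Counter, defaultdict
from math import log

ALPHABET = "ACGT"


def context_counts(reads, k):
    """Count the middle nucleotide for every (prefix, suffix) context of length k."""
    counts = defaultdict(Counter)
    for read in reads:
        for i in range(k, len(read) - k):
            context = (read[i - k:i], read[i + 1:i + 1 + k])
            counts[context][read[i]] += 1
    return counts


def smoothed_distribution(counter, pseudocount=1.0):
    """Multinomial estimate over ACGT with additive smoothing (an assumption)."""
    total = sum(counter[n] for n in ALPHABET) + pseudocount * len(ALPHABET)
    return {n: (counter[n] + pseudocount) / total for n in ALPHABET}


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in nats."""
    return sum(p[n] * log(p[n] / q[n]) for n in ALPHABET)


# Toy reads (hypothetical): the two samples differ at one position.
sample_a = ["ACGTA", "ACGTA"]
sample_b = ["ACTTA", "ACTTA"]
k = 2

context = ("AC", "TA")
p = smoothed_distribution(context_counts(sample_a, k)[context])
q = smoothed_distribution(context_counts(sample_b, k)[context])
divergence = kl_divergence(p, q)
print(f"D(p || q) for context {context}: {divergence:.4f}")
```

A context whose distribution is similar within a sample but diverges strongly between samples is the kind of candidate variant the abstract describes.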