180 research outputs found
Cloud Computing for Detecting High-Order Genome-Wide Epistatic Interaction via Dynamic Clustering
Background: Taking advantage of high-throughput single nucleotide polymorphism (SNP) genotyping technology, large genome-wide association studies (GWASs) have been considered to hold promise for unravelling complex relationships between genotype and phenotype. At present, traditional single-locus-based methods are insufficient to detect multi-locus interactions, which are widespread in complex traits. In addition, statistical tests for high-order epistatic interactions involving more than 2 SNPs pose computational and analytical challenges because the computation increases exponentially as the cardinality of SNP combinations grows. Results: In this paper, we provide a simple, fast and powerful method using dynamic clustering and cloud computing to detect genome-wide multi-locus epistatic interactions. We have conducted systematic experiments to compare its detection power against some recently proposed algorithms, including TEAM, SNPRuler, EDCF and BOOST. Furthermore, we have applied our method to two real GWAS datasets, the age-related macular degeneration (AMD) and rheumatoid arthritis (RA) datasets, where we find some novel potential disease-related genetic factors that do not show up in detections of two-locus epistatic interactions. Conclusions: Experimental results on simulated data demonstrate that our method is more powerful than some recently proposed methods on both two- and three-locus disease models. Our method has discovered many novel high-order associations that are significantly enriched in cases from two real GWAS datasets. Moreover, the running times of the cloud implementation of our method on the AMD and RA datasets are roughly 2 hours and 50 hours, respectively, on a cluster of forty small virtual machines for detecting two-locus interactions. Therefore, we believe that our method is suitable and effective for the full-scale analysis of multiple-locus epistatic interactions in GWAS.
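To make the growth in search space concrete, the following minimal sketch counts the SNP combinations an exhaustive search would have to test at each interaction order. The dataset size of 500,000 SNPs is an illustrative assumption, not a figure from the abstract above, and `combination_counts` is a hypothetical helper name.

```python
import math

def combination_counts(n_snps, max_order):
    """Return {order: number of distinct SNP combinations of that order}."""
    return {k: math.comb(n_snps, k) for k in range(1, max_order + 1)}

# Assumed, illustrative dataset size (not taken from the abstract).
counts = combination_counts(500_000, 3)
for order, count in counts.items():
    print(f"order {order}: {count:.3e} combinations")
```

Each additional SNP in the combination multiplies the number of candidate tests by roughly the number of SNPs in the dataset, which is why methods that prune or cluster the search space are attractive.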
Searching Genome-wide Disease Association Through SNP Data
Taking advantage of high-throughput Single Nucleotide Polymorphism (SNP) genotyping technology, Genome-Wide Association Studies (GWASs) are regarded as holding promise for unravelling complex relationships between genotype and phenotype. GWASs aim to identify genetic variants associated with disease by assaying and analyzing hundreds of thousands of SNPs. Traditional single-locus-based and two-locus-based methods have been standardized and have led to many interesting findings. Recently, a substantial number of GWASs have indicated that, for most disorders, joint genetic effects (epistatic interactions) across the whole genome are widespread in complex traits. At present, identifying high-order epistatic interactions from GWASs is computationally and methodologically challenging.
My dissertation research focuses on the problem of searching for genome-wide associations under three frequently encountered scenarios: one case and one control, multiple cases and multiple controls, and Linkage Disequilibrium (LD) block structure. For the first scenario, we present a simple and fast method, named DCHE, that uses dynamic clustering. For the second, we design two methods, a Bayesian inference based method and a heuristic method, to detect genome-wide multi-locus epistatic interactions across multiple diseases. For the last scenario, we propose a block-based Bayesian approach that models LD and conditional disease association simultaneously. Experimental results on both synthetic and real GWAS datasets show that the proposed methods improve the detection accuracy of disease-specific associations and reduce the computational cost compared with currently popular methods.
Evaluation of Existing Methods for High-Order Epistasis Detection
[Abstract]
Finding epistatic interactions among loci when expressing a phenotype is a widely employed strategy to understand the genetic architecture of complex traits in GWAS. The abundance of methods dedicated to the same purpose, however, makes it increasingly difficult for scientists to decide which method is more suitable for their studies. This work compares the different epistasis detection methods published during the last decade in terms of runtime, detection power and type I error rate, with a special emphasis on high-order interactions. Results show that in terms of detection power, the only methods that perform well across all experiments are the exhaustive methods, although their computational cost may be prohibitive in large-scale studies. Regarding non-exhaustive methods, not one could consistently find epistasis interactions when marginal effects are absent. If marginal effects are present, there are methods that perform well for high-order interactions, such as BADTrees, FDHE-IW, SingleMI or SNPHarvester. As for false-positive control, only SNPHarvester, FDHE-IW and DCHE show good results. The study concludes that there is no single epistasis detection method to recommend in all scenarios. Authors should prioritize exhaustive methods when sufficient computational resources are available considering the data set size, and resort to non-exhaustive methods when the analysis time is prohibitive.
Funding: Xunta de Galicia (Grant Number: ED431C2016-037, ED431C2017/04 and ED431G2019/01); Ministerio de Educacion Cultura y Deporte (Grant Number: FPU16/01333); Ministerio de Economia y Competitividad (Grant Number: CGL2016-75482-P, PID2019-104184RB-I00, AEI/FEDER/EU and TIN2016-75845-P)
Strategies For Improving Epistasis Detection And Replication
Genome-wide association studies (GWAS) have been extensively critiqued for their perceived inability to adequately elucidate the genetic underpinnings of complex disease. Of particular concern is “missing heritability,” or the difference between the total estimated heritability of a phenotype and that explained by GWAS-identified loci. There are numerous proposed explanations for this missing heritability, but a frequently ignored and potentially vastly informative alternative explanation is the ubiquity of epistasis underlying complex phenotypes.
Given our understanding of how biomolecules interact in networks and pathways, it is not unreasonable to conclude that the effect of variation at individual genetic loci may non-additively depend on and should be analyzed in the context of their interacting partners. It has been recognized for over a century that deviation from expected Mendelian proportions can be explained by the interaction of multiple loci, and the epistatic underpinnings of phenotypes in model organisms have been extensively experimentally quantified. Therefore, the dearth of inspiring single locus GWAS hits for complex human phenotypes (and the inconsistent replication of these between populations) should not be surprising, as one might expect the joint effect of multiple perturbations to interacting partners within a functional biological module to be more important than individual main effects.
Current methods for analyzing data from GWAS are not well-equipped to detect epistasis or replicate significant interactions. The multiple testing burden associated with testing each pairwise interaction quickly becomes nearly insurmountable with increasing numbers of loci. Statistical and machine learning approaches that have worked well for other types of high-dimensional data are appealing and may be useful for detecting epistasis, but potentially require tweaks to function appropriately. Biological knowledge may also be leveraged to guide the search for epistasis candidates, but requires context-appropriate application (as, for example, two loci with significant main effects may not have a significant interaction, and vice versa).
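As a rough illustration of the multiple testing burden described above, the sketch below computes the per-test significance threshold for an exhaustive pairwise scan. It assumes a Bonferroni correction, which the text does not name, and the locus count is an illustrative assumption; `bonferroni_pairwise_threshold` is a hypothetical helper.

```python
import math

def bonferroni_pairwise_threshold(n_loci, alpha=0.05):
    """Per-test threshold when every pair of loci is tested once.

    Assumes a Bonferroni correction (an assumption made for this sketch).
    """
    n_tests = math.comb(n_loci, 2)  # number of distinct locus pairs
    return alpha / n_tests

# Assumed, illustrative number of loci.
threshold = bonferroni_pairwise_threshold(500_000)
print(f"per-test threshold: {threshold:.2e}")
```

Because the number of pairs grows quadratically with the number of loci, the threshold shrinks quadratically as well, which makes it hard for any single interaction to reach significance.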
Rather than dismissing GWAS and the wealth of data accumulated from it as a failure, I propose the development of new techniques and the incorporation of diverse data sources to analyze GWAS data in an epistasis-centric framework.
Combinations of Genetic Variants Occurring Exclusively in Patients
In studies of polygenic disorders, scanning the genetic variants can be used to identify variant combinations. Combinations that are found exclusively in patients can be separated from those that also occur in control persons. Statistical analyses can then be performed to determine whether the combinations that occur exclusively among patients are significantly associated with the investigated disorder. This research strategy has been applied to material from various polygenic disorders, identifying clusters of patient-specific genetic variant combinations that are significantly associated with the investigated disorders. Combinations from these clusters are found in the genomes of up to 55% of investigated patients, and are not present in the genomes of any control persons.
Keywords: Genetic variants, Polygenic disorder, Combinations of genetic variants, Patient-specific combination
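The separation step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: each person is represented as a set of variant identifiers, the toy data are invented, `exclusive_combinations` is a hypothetical helper, and the significance testing mentioned in the abstract is not included.

```python
from itertools import combinations

def exclusive_combinations(patients, controls, size):
    """Map each variant combination seen only in patients to its carriers.

    patients, controls: lists of sets of variant identifiers (assumed format).
    Returns {combination tuple: set of patient indices carrying it}.
    """
    def combos(person):
        # Sort so the same combination always yields the same tuple.
        return set(combinations(sorted(person), size))

    seen_in_controls = set()
    for person in controls:
        seen_in_controls |= combos(person)

    carriers = {}
    for idx, person in enumerate(patients):
        for combo in combos(person) - seen_in_controls:
            carriers.setdefault(combo, set()).add(idx)
    return carriers

# Invented toy data for illustration only.
patients = [{"A", "B", "C"}, {"A", "B"}, {"C"}]
controls = [{"A", "C"}, {"B"}]
result = exclusive_combinations(patients, controls, size=2)
covered = set().union(*result.values())
print(result, f"{len(covered)}/{len(patients)} patients covered")
```

Exhaustive enumeration like this grows quickly with the combination size, so a real analysis would need pruning and a statistical test on top of it.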
Parallelizing Epistasis Detection in GWAS on FPGA and GPU-Accelerated Computing Systems
This is a post-peer-review, pre-copyedit version of an article published in IEEE/ACM Transactions on Computational Biology and Bioinformatics. The final authenticated version is available online at: http://dx.doi.org/10.1109/TCBB.2015.2389958
[Abstract] High-throughput genotyping technologies (such as SNP-arrays) allow the rapid collection of up to a few million genetic markers of an individual. Detecting epistasis (based on 2-SNP interactions) in Genome-Wide Association Studies is an important but time-consuming operation since statistical computations have to be performed for each pair of measured markers. Computational methods to detect epistasis therefore suffer from prohibitively long runtimes; e.g., processing a moderately-sized dataset consisting of about 500,000 SNPs and 5,000 samples requires several days using state-of-the-art tools on a standard 3 GHz CPU. In this paper, we demonstrate how this task can be accelerated using a combination of fine-grained and coarse-grained parallelism on two different computing systems. The first architecture is based on reconfigurable hardware (FPGAs) while the second architecture uses multiple GPUs connected to the same host. We show that both systems can achieve speedups of around four orders of magnitude compared to the sequential implementation. This significantly reduces the runtimes for detecting epistasis to only a few minutes for moderately-sized datasets and to a few hours for large-scale datasets.
Funding: Wellcome Trust, London; 076113. Wellcome Trust, London; 08547
High-Order Epistasis Detection in High Performance Computing Systems
Official Doctoral Programme in Information Technology Research. 524V01
[Abstract]
In recent years, Genome-Wide Association Studies (GWAS) have become more and more popular with the intent of finding a genetic explanation for the presence or absence of particular diseases in human studies. There is consensus about the presence of genetic interactions during the expression of complex diseases, a phenomenon called epistasis. This thesis focuses on the study of this phenomenon, employing High-Performance Computing (HPC) for this purpose and from a statistical definition of the problem: the deviation of the expression of a phenotype from the addition of the individual contributions of genetic variants. For this purpose, we first developed MPI3SNP, a program that identifies interactions of three variants from an input dataset. MPI3SNP implements an exhaustive search of epistasis using an association test based on the Mutual Information and exploits the resources of clusters of CPUs or GPUs to speed up the search. Then, we evaluated the state-of-the-art methods with the help of MPI3SNP in a study that compares the performance of twenty-seven tools. The most important conclusion of this study is the inability of non-exhaustive approaches to locate epistasis in the absence of marginal effects (small association effects of individual variants that partake in an epistasis interaction). For this reason, this thesis continued focusing on the optimization of the exhaustive search. First, we improved the efficiency of the association test through a vector implementation of this procedure. Then, we developed a distributed algorithm capable of locating epistasis interactions of any order. These two milestones were achieved in Fiuncho, a program that incorporates all the research carried out, obtaining the best performance in CPU clusters out of all the alternatives of the state-of-the-art. In addition, we also developed a library to simulate particular scenarios with epistasis called Toxo. This library allows for the simulation of epistasis that follows existing interaction models for high-order interactions.
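To illustrate what a mutual-information association test measures, and why interactions without marginal effects escape single-variant screening, here is a minimal sketch. It is not the MPI3SNP or Fiuncho implementation: the function name, the binary genotype coding and the parity-based toy model are all assumptions made for this example.

```python
import math
from collections import Counter
from itertools import product

def mutual_information(genotypes, phenotypes):
    """Mutual information (in bits) between genotype values and phenotypes.

    genotypes: list of hashable values, e.g. tuples of per-SNP genotypes.
    phenotypes: list of class labels of the same length.
    """
    n = len(phenotypes)
    joint = Counter(zip(genotypes, phenotypes))
    geno_marginal = Counter(genotypes)
    pheno_marginal = Counter(phenotypes)
    mi = 0.0
    for (g, p), count in joint.items():
        # p(g,p) * log2( p(g,p) / (p(g) * p(p)) ), written with raw counts.
        mi += (count / n) * math.log2(count * n / (geno_marginal[g] * pheno_marginal[p]))
    return mi

# Assumed toy model: phenotype is the parity of three binary genotypes,
# so no single SNP carries any marginal signal.
samples = list(product([0, 1], repeat=3))
labels = [a ^ b ^ c for a, b, c in samples]

mi_triple = mutual_information(samples, labels)
mi_single = mutual_information([s[0] for s in samples], labels)
print(mi_triple, mi_single)
```

In this toy model only the joint test over all three variants reveals the association, which mirrors the motivation for exhaustive search given in the abstract.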
Modern Computing Techniques for Solving Genomic Problems
With the advent of high-throughput genomics, biological big data brings challenges to scientists in handling, analyzing, processing and mining this massive data. In this new interdisciplinary field, diverse theories, methods, tools and knowledge are utilized to solve a wide variety of problems. As an exploration, this dissertation project is designed to combine concepts and principles from multiple areas, including signal processing, information-coding theory, artificial intelligence and cloud computing, in order to solve the following problems in computational biology: (1) comparative gene structure detection, (2) DNA sequence annotation, (3) investigation of CpG islands (CGIs) for epigenetic studies. Briefly, in problem #1, sequences are transformed into signal series or binary codes. As in speech/voice recognition, similarity is calculated between two signal series and the signals are subsequently stitched/matched into a temporal sequence. Because the operations are binary, all calculations/steps can be performed efficiently and accurately. Improving performance in terms of accuracy and specificity is the key for a comparative method. In problem #2, DNA sequences are encoded and transformed into numeric representations for deep learning methods. Encoding schemes greatly influence the performance of deep learning algorithms, so finding the best encoding scheme for a particular application of deep learning is important. Three applications (detection of protein-coding splicing sites, detection of lincRNA splicing sites and improvement of comparative gene structure identification) are used to show the computing power of deep neural networks. In problem #3, CpG sites are assigned a certain energy and a Gaussian filter is applied to detect CpG islands. By using the CpG box and a Markov model, we investigate the properties of CGIs and redefine the CGIs using the emerging epigenetic data.
In summary, these three problems and their solutions are not isolated; they are linked to modern techniques in such diverse areas as signal processing, information-coding theory, artificial intelligence and cloud computing. These novel methods are expected to improve the efficiency and accuracy of computational tools and bridge the gap between biology and scientific computing.
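The idea in problem #3 of assigning energy to CpG sites and applying a Gaussian filter can be sketched as follows. This is a minimal illustration under stated assumptions rather than the dissertation's method: a unit energy per CpG dinucleotide, the kernel width and radius, and all function names are choices made for this example.

```python
import math

def cpg_signal(seq):
    """Assign an assumed unit energy to every position that starts a CpG dinucleotide."""
    seq = seq.upper()
    return [1.0 if seq[i:i + 2] == "CG" else 0.0 for i in range(len(seq))]

def gaussian_kernel(sigma, radius):
    """Discrete Gaussian weights on [-radius, radius], normalized to sum to 1."""
    weights = [math.exp(-(x * x) / (2 * sigma * sigma)) for x in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def smooth(signal, kernel):
    """Convolve the signal with the kernel, treating out-of-range positions as zero."""
    radius = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, w in enumerate(kernel):
            k = i + j - radius
            if 0 <= k < len(signal):
                acc += w * signal[k]
        out.append(acc)
    return out

# Invented toy sequence: a CpG-rich stretch flanked by CpG-free regions.
seq = "AT" * 20 + "CG" * 20 + "AT" * 20
density = smooth(cpg_signal(seq), gaussian_kernel(sigma=5.0, radius=15))
print(f"flank: {density[5]:.3f}, centre: {density[60]:.3f}")
```

Thresholding the smoothed density would then mark candidate island regions; the choice of threshold is left open here.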
Discovering Higher-order SNP Interactions in High-dimensional Genomic Data
In this thesis, a method based on multifactor dimensionality reduction and associative classification is employed to identify higher-order SNP interactions, with the aim of improving the understanding of the genetic architecture of complex diseases. The thesis further explores the application of deep learning techniques to provide new clues for interaction analysis. The performance of the deep learning method is maximized by combining deep neural networks with a random forest to obtain reliable interactions in the presence of noise.