19 research outputs found
High-performance epistasis detection in quantitative trait GWAS
epiSNP is a program for identifying pairwise single nucleotide polymorphism (SNP) interactions (epistasis) in quantitative-trait genome-wide association studies (GWAS). A parallel MPI version (EPISNPmpi) was created in 2008 to address this computationally expensive analysis on large data sets with many quantitative traits and SNP markers. However, the falling cost of genotyping has led to an explosion of large-scale GWAS data sets that challenge EPISNPmpi’s ability to compute results in a reasonable amount of time. Therefore, we optimized epiSNP for modern multi-core and highly parallel many-core processors to efficiently handle these large data sets. This paper describes the serial optimizations, dynamic load balancing using MPI-3 RMA operations, and shared-memory parallelization with OpenMP to further enhance load balancing and allow execution on the Intel Xeon Phi coprocessor (MIC). For a large GWAS data set, our optimizations provided a 38.43× speedup over EPISNPmpi on 126 nodes using 2 MICs on TACC’s Stampede Supercomputer. We also describe a Coarray Fortran (CAF) version that demonstrates the suitability of PGAS languages for problems with this computational pattern. We show that the Coarray version performs competitively with the MPI version on the NERSC Edison Cray XC30 supercomputer. Finally, the performance benefits of hyper-threading for this application on Edison (average 1.35× speedup) are demonstrated.
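The dynamic load balancing mentioned in this abstract can be pictured as workers repeatedly claiming the next unprocessed chunk from a shared counter. The sketch below is only a shared-memory analog written with Python threads; it is not the MPI-3 RMA code of epiSNP, and `WorkCounter`, `run_dynamic`, and `process_chunk` are hypothetical names introduced for illustration.

```python
import threading


class WorkCounter:
    """Shared counter that hands out the next unclaimed chunk index."""

    def __init__(self, total_chunks):
        self._next = 0
        self._total = total_chunks
        self._lock = threading.Lock()

    def fetch_next(self):
        # Atomically claim one chunk; None signals that no work is left.
        with self._lock:
            if self._next >= self._total:
                return None
            chunk = self._next
            self._next += 1
            return chunk


def run_dynamic(total_chunks, num_workers, process_chunk):
    """Let each worker pull chunks until the counter is exhausted."""
    counter = WorkCounter(total_chunks)
    done = [[] for _ in range(num_workers)]

    def worker(rank):
        while (chunk := counter.fetch_next()) is not None:
            process_chunk(chunk)
            done[rank].append(chunk)

    threads = [threading.Thread(target=worker, args=(r,))
               for r in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done  # per-worker lists of the chunks each one processed


assignment = run_dynamic(20, 4, lambda chunk: None)
```

Because a worker only asks for more work once it has finished its current chunk, slow chunks do not stall the others, which is the property a static partition lacks.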
A Survey of Processing Systems for Phylogenetics and Population Genetics
The COVID-19 pandemic brought Bioinformatics into the spotlight, revealing that several existing methods, algorithms, and tools were not well prepared to handle large amounts of genomic data efficiently. This led to prohibitively long execution times and the need to reduce the extent of analyses to obtain results in a reasonable amount of time. In this survey, we review available high-performance computing and hardware-accelerated systems based on FPGA and GPU technology. Optimized and hardware-accelerated systems can conduct more thorough analyses considerably faster than pure software implementations, allowing researchers to reach important conclusions in a timely manner and drive scientific discoveries. We discuss the reasons that currently hinder high-performance solutions from being widely deployed in real-world biological analyses and describe a research direction that can pave the way to enable this.
High performance computing enabling exhaustive analysis of higher order single nucleotide polymorphism interaction in Genome Wide Association Studies.
Genome-wide association studies (GWAS) are a common approach for systematic discovery of single nucleotide polymorphisms (SNPs) which are associated with a given disease. Univariate analysis approaches commonly employed may miss important SNP associations that only appear through multivariate analysis in complex diseases. However, multivariate SNP analysis is currently limited by its inherent computational complexity. In this work, we present a computational framework that harnesses supercomputers. Based on our results, we estimate that a three-way interaction analysis on 1.1 million SNP GWAS data would require over 5.8 years on the full "Avoca" IBM Blue Gene/Q installation at the Victorian Life Sciences Computation Initiative. This is hundreds of times faster than estimates for other CPU based methods and four times faster than runtimes estimated for GPU methods, indicating how the improvement in the level of hardware applied to interaction analysis may alter the types of analysis that can be performed. Furthermore, the same analysis would take under 3 months on the currently largest IBM Blue Gene/Q supercomputer "Sequoia" at the Lawrence Livermore National Laboratory, assuming linear scaling is maintained as our results suggest. Given that the implementation used in this study can be further optimised, this runtime means it is becoming feasible to carry out exhaustive analysis of higher order interaction studies on large modern GWAS. This research was partially funded by NHMRC grant 1033452 and was supported by a Victorian Life Sciences Computation Initiative (VLSCI) grant number 0126 on its Peak Computing Facility at the University of Melbourne, an initiative of the Victorian Government, Australia.
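The arithmetic behind these runtime figures can be reproduced with a short sketch. It counts the three-way SNP combinations for 1.1 million SNPs and rescales the 5.8-year figure under perfectly linear scaling. The rack counts (4 for Avoca, 96 for Sequoia) are assumptions not stated in the abstract, and the function names are hypothetical.

```python
from math import comb


def interaction_count(num_snps: int, order: int) -> int:
    """Number of distinct SNP combinations of the given order."""
    return comb(num_snps, order)


def scaled_runtime(runtime: float, base_racks: int, target_racks: int) -> float:
    """Runtime on target_racks, assuming perfectly linear scaling."""
    return runtime * base_racks / target_racks


# Exhaustive three-way analysis of 1.1 million SNPs: about 2.2e17 tests.
triples = interaction_count(1_100_000, 3)

# 5.8 years on Avoca, expressed in months (the abstract gives a lower bound).
months_on_avoca = 5.8 * 12

# Assumed machine sizes: Avoca 4 racks, Sequoia 96 racks (not from the abstract).
months_on_sequoia = scaled_runtime(months_on_avoca, base_racks=4, target_racks=96)

print(f"{triples:.3e} triples, {months_on_sequoia:.1f} months on Sequoia")
```

Under these assumed sizes the rescaled estimate lands just below three months, in line with the "under 3 months" figure quoted for Sequoia.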
Fiuncho: a program for any-order epistasis detection in CPU clusters
Funded for open access publication: CRUE/CISUG. [Abstract]: Epistasis can be defined as the statistical interaction of genes during the expression of a phenotype. It is believed that it plays a fundamental role in gene expression, as individual genetic variants have reported a very small increase in disease risk in previous Genome-Wide Association Studies. The most successful approach to epistasis detection is the exhaustive method, although its exponential time complexity requires a highly parallel implementation in order to be used. This work presents Fiuncho, a program that exploits all levels of parallelism present in x86_64 CPU clusters in order to mitigate the complexity of this approach. It supports epistasis interactions of any order, and when compared with other exhaustive methods, it is on average 358, 7 and 3 times faster than MDR, MPI3SNP and BitEpi, respectively. Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature. This work was supported by the Ministry of Science and Innovation of Spain (PID2019-104184RB-I00 / AEI / 10.13039/501100011033), the Xunta de Galicia and FEDER funds of the EU (CITIC-Centro de Investigación de Galicia accreditation 2019–2022, Grant no. ED431G 2019/01), Consolidation Program of Competitive Research (Grant no. ED431C 2021/30), and the FPU Program of the Ministry of Education of Spain (Grant no. FPU16/01333). Xunta de Galicia; ED431G 2019/01. Xunta de Galicia; ED431C 2021/3
High-Order Epistasis Detection in High Performance Computing Systems
Official Doctoral Program in Information Technology Research. 524V01
[Abstract]
In recent years, Genome-Wide Association Studies (GWAS) have become more and more popular with the intent of finding a genetic explanation for the presence or absence of particular diseases in human studies. There is consensus about the presence of genetic interactions during the expression of complex diseases, a phenomenon called epistasis. This thesis focuses on the study of this phenomenon, employing High-Performance Computing (HPC) for this purpose and from a statistical definition of the problem: the deviation of the expression of a phenotype from the addition of the individual contributions of genetic variants. For this purpose, we first developed MPI3SNP, a program that identifies interactions of three variants from an input dataset. MPI3SNP implements an exhaustive search of epistasis using an association test based on the Mutual Information and exploits the resources of clusters of CPUs or GPUs to speed up the search. Then, we evaluated the state-of-the-art methods with the help of MPI3SNP in a study that compares the performance of twenty-seven tools. The most important conclusion of this study is the inability of non-exhaustive approaches to locate epistasis in the absence of marginal effects (small association effects of individual variants that partake in an epistasis interaction). For this reason, this thesis continued focusing on the optimization of the exhaustive search. First, we improved the efficiency of the association test through a vector implementation of this procedure. Then, we developed a distributed algorithm capable of locating epistasis interactions of any order. These two milestones were achieved in Fiuncho, a program that incorporates all the research carried out, obtaining the best performance in CPU clusters out of all the alternatives of the state-of-the-art. In addition, we also developed a library to simulate particular scenarios with epistasis called Toxo. This library allows for the simulation of epistasis that follows existing interaction models for high-order interactions.
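An exhaustive search scored with a mutual-information test, as described in this abstract, can be sketched in a few lines. This is a minimal illustration under stated assumptions and not the MPI3SNP or Fiuncho implementation: the function names and the toy data set are invented here, and the case/control phenotype is constructed as the XOR of the first two SNPs so that neither SNP has a marginal effect.

```python
from collections import Counter
from itertools import combinations
from math import log2


def entropy(counts, total):
    """Shannon entropy (in bits) of an empirical distribution."""
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)


def mutual_information(genotypes, phenotype, snps):
    """I(X;Y) between the joint genotype of `snps` and the phenotype."""
    n = len(phenotype)
    combo = [tuple(genotypes[s][i] for s in snps) for i in range(n)]
    h_x = entropy(Counter(combo).values(), n)
    h_y = entropy(Counter(phenotype).values(), n)
    h_xy = entropy(Counter(zip(combo, phenotype)).values(), n)
    return h_x + h_y - h_xy


def exhaustive_search(genotypes, phenotype, order):
    """Score every SNP combination of the given order; return the best."""
    scored = ((mutual_information(genotypes, phenotype, snps), snps)
              for snps in combinations(range(len(genotypes)), order))
    return max(scored)


# Toy data (hypothetical): one row per SNP, one column per individual.
genotypes = [
    [0, 0, 1, 1, 0, 0, 1, 1],
    [0, 1, 0, 1, 0, 1, 0, 1],
    [0, 0, 0, 0, 1, 1, 1, 1],
]
# Phenotype is the XOR of SNP 0 and SNP 1: a pure interaction effect.
phenotype = [a ^ b for a, b in zip(genotypes[0], genotypes[1])]

best_score, best_snps = exhaustive_search(genotypes, phenotype, order=2)
marginal = mutual_information(genotypes, phenotype, (0,))
```

On this toy data each SNP alone carries no information about the phenotype, while the pair (0, 1) determines it completely. A method that filters on single-SNP association would discard both SNPs, which is the failure mode the abstract attributes to non-exhaustive approaches.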
A Hybrid-parallel Architecture for Applications in Bioinformatics
Since the advent of Next Generation Sequencing (NGS) technology, the amount of data from whole genome sequencing has been rising fast. In turn, the availability of these resources led to the tapping of whole new research fields in molecular and cellular biology, producing even more data. On the other hand, the available computational power is only increasing linearly. In recent years though, special-purpose high-performance devices started to become prevalent in today’s scientific data centers, namely graphics processing units (GPUs) and, to a lesser extent, field-programmable gate arrays (FPGAs). Driven by the need for performance, developers started porting regular applications to GPU frameworks and FPGA configurations to exploit the special operations only these devices may perform in a timely manner. However, applications using both accelerator technologies are still rare. Major challenges in joint GPU/FPGA application development include the required deep knowledge of the associated programming paradigms and efficient communication between both types of devices. In this work, two algorithms from bioinformatics are implemented on a custom hybrid-parallel hardware architecture and a highly concurrent software platform. It is shown that such a solution is not only possible to develop but can also outperform implementations on similar-sized GPU or FPGA clusters in terms of both performance and energy consumption. Both algorithms analyze case/control data from genome-wide association studies to find interactions between two or three genes with different methods. Especially in the latter case, the newly available calculation power and method enable analyses of large data sets for the first time without occupying whole data centers for weeks. The success of the hybrid-parallel architecture proposal led to the development of a high-end array of FPGA/GPU accelerator pairs to provide even better runtimes and more possibilities.
Laboratory Directed Research and Development Program FY 2007
Report on Ernest Orlando Lawrence Berkeley National Laboratory Laboratory Directed Research and Development Program FY 200
Evolutionary algorithms and other metaheuristics in water resources: Current status, research challenges and future directions
Copyright © 2014 Elsevier. NOTICE: this is the author’s version of a work that was accepted for publication in Environmental Modelling and Software. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published in Environmental Modelling and Software Vol. 62 (2014), DOI: 10.1016/j.envsoft.2014.09.013. The development and application of evolutionary algorithms (EAs) and other metaheuristics for the optimisation of water resources systems has been an active research field for over two decades. Research to date has emphasized algorithmic improvements and individual applications in specific areas (e.g. model calibration, water distribution systems, groundwater management, river-basin planning and management, etc.). However, there has been limited synthesis between shared problem traits, common EA challenges, and needed advances across major applications. This paper clarifies the current status and future research directions for better solving key water resources problems using EAs. Advances in understanding fitness landscape properties and their effects on algorithm performance are critical. Future EA-based applications to real-world problems require a fundamental shift of focus towards improving problem formulations, understanding general theoretic frameworks for problem decompositions, major advances in EA computational efficiency, and most importantly aiding real decision-making in complex, uncertain application contexts.