Genotype-by-genotype interactions (epistasis) are believed to be a significant source of unexplained genetic variation causing complex chronic diseases but have been ignored in genomewide association studies (GWAS) due to the computational burden of analysis. In this work we show how to benefit from FPGA technology for highly parallel creation of contingency tables in a systolic chain with a subsequent statistical test. We present the implementation for the FPGA-based hardware platform RIVYERA S6-LX150 containing 128 Xilinx Spartan6-LX150 FPGAs. For performance evaluation we compare against the method iLOCi[9]. iLOCi claims to outperform other available tools in terms of accuracy. However, analysis of a dataset from the Wellcome Trust Case Control Consortium (WTCCC) with about 500,000 SNPs and 5,000 samples still takes about 19 hours on a MacPro workstation with two Intel Xeon quad-core CPUs, while our FPGA-based implementation requires only 4 minutes.
Introduction
High-throughput genotyping technologies allow the collection of hundreds of thousands to a few million genetic markers, such as single nucleotide polymorphisms (SNPs), from individual DNA samples. In genome-wide association studies (GWAS) these genotypes are typically measured for several thousand individuals and then linked to a given phenotype of each individual, such as the presence (case) or absence (control) of an associated disease. In classical GWAS each genetic marker is analyzed separately in order to identify markers showing differences in genotype frequencies between cases and controls. Unfortunately, this approach is generally not powerful enough to model complex traits for which the detection of joint genetic effects (epistasis) needs to be considered [5, 7] . In (2-way) statistical epistasis each pair of measured markers is therefore tested in order to discover significant interactions that explain the given phenotype. Consequently, a number of algorithms have been developed to address the problem of detecting epistasis in recent years [1, 13, 17] . The main goal of these approaches is to find pairs of SNPs whose joint values show a statistically significant difference between cases and controls and thus they provide a list with these pairs that could explain a substantial proportion of genetic variation leading to disease.
Computing epistasis is highly time-consuming due to the large number of pairwise tests to be calculated; e.g. already for a moderately-sized dataset consisting of 500,000 SNPs there are about 125 billion pairwise interaction tests to be performed. Thus, many existing tools for calculating epistasis [6, 11, 14] require several days for processing moderately-sized datasets and several weeks to months for processing large-scale datasets on a standard CPU. Since both the availability and size of GWAS datasets are increasing rapidly, finding faster solutions is of high importance to research in this area. In this paper we address this problem by taking advantage of both fine-grained (by using of reconfigurable hardware (FPGAs)) and coarse-grained parallelism (using a number of FPGAs in parallel). Our parallel architecture is based on a systolic chain of processing elements for pairwise contingency table creation and a subsequent statistical test. Since contingency table creation is a common operation, our solution is easily adaptable to accelerate a large variety of epistasis tools by interchanging the statistical test implementation. In this paper we have chosen the test method of iLOCi [9] as a proof-of-concept. We show that this approach leads to an acceleration of between two and three orders of magnitude compared to the CPU-based approach.
Statistical Epistasis

Background
To perform large-scale epistasis studies, a lot of methods for pairwise interaction tests exist [13, 17] , balancing between reducing runtimes and keeping the error rate low. CPU-based approaches, such as BOOST [14, 16] , MDR [11] , MB-MDR [4] , iLOCi [9] etc., often result in long runtimes for exhaustive searches. Therefore prefiltering techniques may be applied to reduce the amount of SNP pairs, as in SIXPAC [10] , SNPRuler [15] , SNPHarvester [22] , TEAM [24] and Screen and Clean [21] . Other methods take advantage from special architectures, such as GPUs, to perform an exhaustive analysis, e.g. GBOOST [23] , SHEsisEpi [3] , EpiGPU [2] and others.
However, most of these methods have one thing in common, they compute contingency tables (see Sect. 2.2) for each SNP pair before calculating a significance value. Our approach presented in this paper focuses on exhaustive analysis of SNP pairs and shows how FPGA technology can be applied to create and prepare contingency tables for significance tests concurrently on a large scale. We target the RIVYERA architecture with 128 FPGAs (see Sect. 3 for details) to significantly speedup pairwise interaction tests. RIVYERA is already successful in accelerating other bioinformatics applications, such as Smith-Waterman alignments or BLAST database searches [18, 19, 20] and hence, shows promise for this target as well.
Furthermore, the calculation of the significance value in our method is exchangeable. Thus, our solution may be applied to many existing methods performing an exhaustive analysis. Other approaches may not directly be adapted to our method, but prefiltering techniques, such as filtering of single SNPs, can trivially be implemented in the preprocessing phase. In our presentation we chose the iLOCi method [9] (see Sect. 2.3) as an example for an exhaustive search.
Contingency Tables
A typical GWAS dataset consists of two groups of samples (cases and controls) which are genotyped at a set of marker positions (SNPs). In this paper we consider biallelic markers for diploid organisms which is the common use case, i.e. genotypes may appear as homozygous wild (w), heterozygous (h) or homozygous variant (v) type. For pairwise interaction analysis a contingency table is created for each pair of SNPs separately for case and control group. Therefore, with n denoting the total number of SNPs, n(n − 1)/2 tables have to be created (instead of n 2 due to the symmetry of SNP pairs). For moderate-size datasets, such as the Wellcome Trust Case Control Consortium (WTCCC) datasets [12] with about 500,000 SNPs, this implies about 125 billion tables. This huge amount of calculations is challenging for any computing system.
Contingency tables in pairwise interaction tests with biallelic markers have dimension 3 × 3, one entry for each possible combination of genotypes. The entries reflect the number of occurrences each combination of genotypes appears in the dataset for the corresponding SNP pair in either case and control group (see Fig. 1 ).
iLOCi
We have chosen iLOCi as an example application to test and compare against our implementation. iLOCi claims to outperform other tools, such as MDR [11] or BOOST [14] , in terms of accuracy. However, according to the authors, analysis of a WTCCC dataset [12] with about 500,000 SNPs and 5,000 samples still takes about 19 hours on a MacPro workstation.
iLOCi computes a significance value p diff for every possible pair of SNPs. This value is based on the previously created contingency tables as in Fig. 1 and is calculated as follows.
Let n case ij denote the genotype counts for the combination i and j of the current SNP pair in all case samples (i, j ∈ {0, 1, 2} corresponding to the three possible genotypes). p 
The same applies analogously to the control samples for calculation of p ctrl . The higher the p diff value the greater is the probability for an interaction [9] . Thus, unlike other algorithms, such as BOOST [14] , which filter the results by a threshold, iLOCi saves the n best results (e.g. n = 1000) and presents the corresponding SNP pairs in a sorted list.
RIVYERA S6-LX150 Architecture
In 2008 the computing platform RIVYERA [8] , originally developed for cryptanalysis, was introduced for problems related to bioinformatics. Here, the specific model RIVYERA S6-LX150 is presented.
The basic structure consists of two elements, a multiple-FPGA system and a server grade mainboard with standard PC components. The FPGA system consists of up to 16 FPGA modules with 8 Xilinx Spartan6-LX150 FPGAs each (upgrades allow up to 16 FPGAs on one module). Furthermore, each FPGA is connected to 256 MB DDR3-SDRAM. The mainboard is equipped with two Intel Xeon E5-2620 CPUs (6 cores @ 2GHz each) with 128 GB of RAM running a Linux OS. This system is later referred to as host.
The bus system implemented on the RIVYERA FPGA computer is organized as a systolic chain, i.e. every FPGA on an FPGA module is connected by fast point-to-point connections to each neighbor forming a ring. An additional member of this ring forms the communication controller. It provides the interconnection of each module to its neighboring modules and, on the first module, the uplink to the host via PCIe. An API hides control of the complete bus system and ensures transparency of the communication to the developer. Besides normal point-to-point transmissions, the API provides broadcast facilities and methods for configuring the FPGAs. A picture and the design structure of the RIVYERA S6-LX150 system is shown in Fig. 2 . 
Parallel Creation of Contingency Tables
Systolic Chain of Processing Elements
We have designed a systolic chain of processing elements (PEs) on each FPGA to concurrently generate as many contingency tables as possible. The genotype data is distributed among all FPGAs such that each FPGA processes two intervals of SNPs with all corresponding genotypes of all samples. The data has to be organized in genotypes grouped by cases and controls for each SNP. The contingency tables are created while the data from both intervals is streamed SNP-wise through the chain.
Each PE contains a local memory to store the complete genotype data for one SNP and a number of counters for the entries of the contingency table. The genotype data stream from the previous PE in the chain is directly provided to the next PE after one clock cycle. However, if the local SNP memory has not been initialized yet, the data of the first SNP is streamed into this memory and not provided to the next PE. After initialization of the local SNP, the next SNP in the stream is directly compared to the stored SNP genotype by genotype. For each pair of genotypes the corresponding counter is incremented. Since the genotypes are ordered by case and control group, the contingency table for each group is ready after processing the last genotype of a SNP for the corresponding group in the stream. The tables are provided to a transport bus afterwards and carried to a postprocessing unit which could be an entity calculating a statistic. After each SNP in the stream, the counters of the PE are reset for the next contingency table.
If the streaming process has finished all SNPs from both initial intervals, it starts all over but leaving out the first k SNPs, if k is the number of PEs in the chain. In the next iteration the first 2k SNPs are left out, and so on. The process stops until all SNPs from the first interval are to be left out. This way all possible SNP pairings in the first interval, i.e. each pair contains only SNPs from the first interval, and between both intervals, i.e. each pair contains a SNP from each interval, are processed. This makes an efficient distribution of the whole set of SNPs among all available FPGAs possible (see Sect. 4.3). Figures 3 and 4 show the design overview of the systolic chain of PEs and the processing sequence of an example dataset of six SNPs and three PEs.
After calculating the statistic, the results are filtered before being provided to the host. The filter could be threshold-based or, as for iLOCi, storage of the n best results. 
Specifics Regarding iLOCi
For the implementation of the iLOCi significance test, we have optimized the overall structure regarding the requirements in Eqs. (1) or n ctrl ij respective to the current group on turn. a = n 00 − n 02 − n 20 + n 22 (5)
These values are calculated on-the-fly by each PE while streaming the genotype data. The calculation of p for p case and p ctrl respectively directly follows from Eq. (4).
The implementation of Eq. (8) requires FPGA resources for one multiplier, one square root extractor and one divider. The multiplication is processed in integer arithmetics while square root extraction and division require double precision floating point arithmetics. Each component is implemented as a pipeline such that in every clock cycle new input data can be processed. This is necessary since all PEs provide their contingency tables at once with only one clock cycle delay. Furthermore, the FPGA resources are optimally utilized. p case is calculated before p ctrl for each concurrently processed SNP pairs. Thus, these values have to be stored in a FIFO buffer. After calculating p ctrl the corresponding p case is extracted from the buffer and the difference according to Eq. (1) is computed. Finally, the n-best results of this FPGA are stored in a buffer and provided to the host which collects the final n-best results from all FPGAs afterwards.
Distribution of Data
To achieve good load balancing for a large number of FPGAs the distribution of genotype data has to follow a certain scheme. Due to the symmetry of the contingency tables only n(n − 1)/2 tables have to be created for cases and controls (with n the number of SNPs). Therefore, we employ a scheme that enables us to create almost redundant-free SNP pairings while keeping the workload balanced. Figure 5 shows an example how SNPs would be distributed if seven FPGAs were available. Each FPGA receives two SNP intervals, one interval on the horizontal and one on the vertical axis. According to Sect. 4.1, a rectangular tile of the table space is calculated with this data. Exploiting the symmetry again, the triangle below the symmetry axis (dashed line) is always processed as well for no extra cost. Unfortunately, it is inevitable that a few of these triangles are calculated multiple times. Hence, duplicate results may be produced by different FPGAs which are filtered by the host software. One FPGA (no. 7 in the figure) is reserved for the remaining triangle which is not covered by the calculation of a rectangle.
This scheme obviously works for a number of available FPGAs in the sequence of triangular numbers (1, 3, 6, 10, 15, 21, 28 , . . . ) plus one for the last triangle. As our RIVYERA system contains 128 FPGAs, the nearest number is 120. Hence, to avoid idle FPGAs at all, we split all available FPGAs in two groups with slightly different interval sizes. With this configuration applied on a WTCCC dataset, each FPGA gets two intervals of about 30, 000 SNPs with all corresponding 5, 000 genotypes for each SNP. Since genotypes are coded in a two bit representation, only 75 MB of memory are required, easily fitting in the local DRAM. Table 1 : iLOCi performance for analysis of a dataset with 500,000 SNPs and 5,000 samples. GPU results are extrapolated.
Performance Evaluation and Results
Our implementation of the iLOCi test on RIVYERA S6-LX150 contains 100 PEs on each FPGA for creation of contingency tables and allows to store up to 1000 best results. The FPGA resources are utilized by about 80% regarding the FPGA's slices, leaving enough buffer for more complex tests. The chip frequency is 100 MHz for IO and 150 MHz for genotype streaming, counting and calculation of the significance. The FPGA specification has been developed using the Xilinx ISE Design Suite and the VHDL programming language. iLOCi [9] takes about 19 hours on a MacPro workstation with two Intel Xeon quad-core CPUs @ 2.4 GHz for a WTCCC dataset [12] with 500,000 SNPs and 5,000 samples. In contrast, our RIVYERA S6-LX150 implementation requires less than 4 minutes, leading to a speedup of more than 285. As iLOCi uses the OpenCL interface, we are also able to measure its performance on a GPU system, specifically an nVidia GeForce GTX Titan. Due to a surprisingly poor performance on this system, we have used the iLOCi example dataset with only 8,000 SNPs and 4,901 samples to extrapolate the runtimes for the WTCCC dataset. We have set the number of stored best results to 1,000 but applied no extra parameters. The runtime is 94 s to calculate the ∼ 8, 000
2 /2 = 32 × 10 6 tests. Extrapolated to the ∼ 500, 000 2 /2 tests of a WTCCC dataset, the runtime would be more than four days. Table 1 lists runtimes of the implementations including their respective power consumption during the operation and their total energy consumption for this task. For the CPU and GPU versions, we have chosen the corresponding thermal design power specification without any peripheral, while we have actually measured the energy consumption of the RIVYERA S6-LX150 system with the on-board IPMI interface.
Furthermore, we have compared the runtime of our iLOCi implementation against other methods that perform an exhaustive creation of contingency tables on the same dataset size (see Table 2 ). To accomplish this, we have determined the speed in tests per second and interpolated from the published results of the corresponding authors. We made an exception for GBOOST [23] . Since it is freely available and performs best in our compared GPU solutions, we measured the performance on two of our own systems as well. All results have to be treated carefully since the comparison is done over different architectures. However, it shows that our implementation on the RIVYERA architecture is able to outperform other architectures by far.
Conclusion
Recent advances in high-throughput genotyping technologies establish the need for fast implementations of statistical epistasis in GWAS. Recent work has shown how GPUs can be used to accelerate such methods. [23] GPU (GeForce GTX285) 2 h 43 m 12.81 EpiGPU [2] GPU (GeForce GTX580) 2 h 55 m 11.90 SHEsisEpi [3] GPU (2x GeForce GTX285) 27 h 1.29 iLOCi [9] CPU (2x Xeon quad-core @ 2.4 GHz) 19 h 1.83 BOOST [14] CPU (@ 3 GHz) 121 h 0.29 Table 2 : Performance comparison of different GWAS methods based on exhaustive creation of contingency tables for analysis of a dataset with 500,000 SNPs and 5,000 samples. Results are interpolated from the publications of the corresponding authors, except those marked with (*) have actually been measured.
and flexible parallel architecture for contingency table creation with a subsequent statistical test. Since contingency table creation is common to most epistasis tools our architecture can thus be used to accelerate a large variety of tools by simply interchanging the statistical test core. As a proof-of-concept we have implemented the iLOCi algorithm. This leads to a speedup of between two and three orders of magnitude on the RIVYERA S6-LX150 system compared to the CPU-based iLOCi implementation. Furthermore, we have shown speedups of one to two orders of magnitude compared to the GPU-based implementations GBOOST and EpiGPU.
The utilized Spartan6-LX150 FPGA is still based on 45 nm technology. Newer technology on the market, e.g. a cutting-edge Xilinx Kintex7 based on 28 nm technology provides around three times more logic resources, 10 times more DSPs for faster floating-point computations and 7 times more BRAM for local storage. Furthermore, the smaller transistors in the 28 nm structure allow faster reaction times and thus, higher frequencies. Extrapolating our results to a Kintex7, we expect a further performance gain of around one order of magnitude. A RIVYERA system equipped with 128 FPGAs of this type would make it possible to compute statistical epistasis of large-scale datasets consisting of around 5 million SNPs and ten thousand individuals in less than one hour.
Our future work includes evaluating our existing architecture to other epistasis algorithms and extending it to calculate 3-way interactions.
